veilor-os/docs/research/2026-05-05-agent-wave/09-realhw-failure-modes.md
veilor-org 4e9782a18a docs: 9-agent research wave findings — v0.5.32 blocker map
Logs the full output of the 9-agent deep-dive run on 2026-05-05 to
docs/research/2026-05-05-agent-wave/. Pulls every actionable finding
into one indexed location so v0.5.32 planning has a paper trail.

Files:
  docs/research/2026-05-05-agent-wave/README.md             — index
  docs/research/2026-05-05-agent-wave/01-...real-hardware.md — Plymouth + LUKS edge cases
  docs/research/2026-05-05-agent-wave/02-...firstboot-ux.md  — SDDM + first-boot UX
  docs/research/2026-05-05-agent-wave/03-...spike-plan.md    — bootc-image-builder 1-week spike
  docs/research/2026-05-05-agent-wave/04-...tier-2.md         — AppArmor + nftables + audit + homed
  docs/research/2026-05-05-agent-wave/05-...launch.md         — threat model + v0.7 launch checklist
  docs/research/2026-05-05-agent-wave/06-...log-capture.md    — virtio-9p host-share for anaconda logs
  docs/research/2026-05-05-agent-wave/07-...skel-branding.md  — /etc/skel gap audit
  docs/research/2026-05-05-agent-wave/08-...ci-hardening.md   — SHA-pin actions + SBOM + SLSA L3
  docs/research/2026-05-05-agent-wave/09-...failure-modes.md  — real-hardware pessimistic audit

Plus the prior linter-applied:
  docs/ROADMAP.md      — Lessons learned section, v0.5.32 active block,
                          v0.6 promotion of veilor-postinstall + veilor-doctor,
                          v0.7 bootc spike scheduled
  docs/THREAT-MODEL.md  — drafted by Agent 5; in/out scope, comparison
                          matrix, v0.7 launch checklist

Top blockers identified for v0.5.32 (cross-cited in README):
  1. Suspend/resume wifi death (kernel.modules_disabled=1)
  2. veilor-firstboot.service WantedBy=graphical.target
  3. kernel-upgrade grub drift
  4. USBGuard hash-rules problem (already learned on onyx)
  5. firewalld blocks tailscale0
  6. /etc/skel/ empty
  7. virtio-9p log capture replaces broken virtio-serial path

Wave + verifier pattern (per ROADMAP lessons learned #4) validated:
9 parallel agents on distinct topics produced converging blocker
list. The same pattern landed v0.5.31 four-bug fix from the prior
4-agent verification wave on v0.5.30 outcome.
2026-05-05 14:52:53 +01:00

167 lines
7.2 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Real-hardware failure mode audit (post-v0.5.31)
**Agent 9 of 9-agent wave, 2026-05-05.** Pessimistic enumeration.
## A. Boot path
### A1. Secure Boot + GRUB_DISTRIBUTOR rebrand
- shim chain itself untouched (uses `/EFI/fedora/`), but `grub2-mkconfig`
regenerates entries naming `veilor-os` while shim only trusts paths
under `/EFI/fedora/grubx64.efi`. Strict UEFI: menu boots, kernel
signatures verify via Fedora's MOK chain. Risk: `os-prober` writing
dual-boot Windows entries breaks MBR/MOK.
- **Symptom:** dual-boot with Windows shows
`Verification failed: (0x1A) Security Violation`.
- **Prob:** MED. **Fix:** S. **Target:** v0.5.32.
### A2. KMS handoff — `fbcon=nodefer` necessary but not sufficient
- On Intel Arc/iGPU late-gen + NVIDIA proprietary chains, 5-15s blank
between vt switch and SDDM start because `simpledrm` releases before
`i915`/`nvidia-drm` claim.
- **Symptom:** ~10s blank pre-SDDM; user thinks crashed.
- **Prob:** HIGH. **Fix:** S — add `i915.modeset=1
nvidia-drm.modeset=1 amdgpu.modeset=1`. SDDM `Type=simple` startup.
**Target:** v0.5.32.
### A3. USBGuard hash-based rules
- `scripts/20-harden-kernel.sh:127-131` ships **empty** rules.conf
with `ImplicitPolicyTarget=block`. First boot, admin runs
`usbguard generate-policy`. Per `feedback_usbguard_dock.md`, this
writes hash+parent-hash rules that break on dock replug.
- **Symptom:** keyboard/mouse dies on first dock unplug-replug.
- **Prob:** HIGH. **Fix:** M — patch invocation to
`--with-hash=false`, or ship `veilor-usbguard-enroll` wrapper.
**Target:** v0.5.32 (same bug we already learned).
### A4. Wifi/Bluetooth firmware
- `@hardware-support` pulls `linux-firmware` etc.
- Realtek RTL8852/MT7921 firmware ships in `linux-firmware-whence` only.
- **Prob:** LOW. **Fix:** S (add explicit `linux-firmware-whence`).
**Target:** v0.5.32.
### A5. Bluetooth disabled at boot
- `scripts/20-harden-kernel.sh:111` disables `bluetooth` service.
BT keyboards/mice don't pair until user enables service.
- **Prob:** MED (laptop users). **Fix:** S — leave bluetooth.service
enabled, mask `obex` only. **Target:** v0.6.
## B. First-boot KDE session
### B1. Plasma 6 Wayland fallback on hybrid graphics
- SDDM config doesn't pin session. NVIDIA Optimus + intel-iris
triggers Wayland → silent fallback to X11 on some HW.
- **Symptom:** screen tearing, no fractional scaling.
- **Prob:** MED. **Fix:** S — add `[Autologin] Session=plasma`
+ `[General] DefaultSession=plasma.desktop`. **Target:** v0.6.
### B2. SUSPEND/RESUME KILLS WIFI — THE BIG ONE 🚨
- `veilor-modules-lock.service` sets `kernel.modules_disabled=1` 30s
after graphical.target. `iwlwifi`, `iwlmvm`, `cfg80211` reload on
resume from S3/S0ix. With modules locked: **resume → permanent
wifi death until reboot**. Same for `nvidia` autoload, `xhci_pci`
re-init on dock attach.
- **Symptom:** close laptop lid → reopen → no wifi, no dock USB,
until reboot.
- **Prob:** VERY HIGH (every laptop user, day 1).
- **Fix:** M — gate lock on `ConditionACPower=true` + reset on
suspend, OR move from `modules_disabled` to `module.sig_enforce=1`
kernel cmdline (no runtime lock needed).
- **Target:** v0.5.32 — **BLOCKER**.
### B3. Lid-close handling
- `logind.conf` not modified. Defaults `HandleLidSwitch=suspend`.
Combined with B2, every lid close = wifi loss.
- **Prob:** HIGH. **Fix:** S. **Target:** v0.5.32 (paired with B2).
## C. Day-2 ops
### C1. `/etc/default/grub` + `/etc/kernel/cmdline` drift
- Kickstart writes `GRUB_CMDLINE_LINUX_DEFAULT=""`. Real installer
writes `/etc/kernel/cmdline` with LUKS rd.luks args. `kernel-install`
reads the latter; `grub2-mkconfig` re-reads `/etc/default/grub`.
- **Symptom:** `dnf upgrade kernel` regenerates grub.cfg from
default/grub, drops LUKS unlock args from new entry → unbootable.
- **Prob:** HIGH. **Fix:** M — sync both files in `veilor-update`,
or migrate fully to BLS without grub-mkconfig.
- **Target:** v0.5.32.
### C2. SELinux relabel on first boot
- `firstboot.sh` flips to `enforcing` and `touch /.autorelabel`. On
large /home (encrypted btrfs), relabel takes 2-5min — user sees
frozen screen with cursor.
- **Symptom:** stuck "first boot" appears hung.
- **Prob:** MED. **Fix:** S (add plymouth message). **Target:** v0.6.
### C3. F44 upgrade
- Hardcoded `python3.14` path (kickstart:334) for transaction_progress.py
patch. Survives no upgrade.
- **Prob:** certainty by Nov 2026. **Fix:** M. **Target:** v0.6.
### C4. chrony NTS unreachable from corp networks
- Cloudflare NTS over UDP 4460 blocked by many corp firewalls.
chronyd will fail-stop sync.
- **Symptom:** clock skew → TLS failures → broken everything.
- **Prob:** MED. **Fix:** S (add fallback `pool` line — already
present, verify ordering). **Target:** v0.5.32.
## D. Networking
### D1. firewalld drop zone vs Tailscale 🚨
- `tailscale up` requires UDP 41641 + tailscale0 trusted. Default
`drop` zone blocks tailscale0.
- **Prob:** HIGH (this user uses Tailscale daily).
- **Fix:** S — ship `/etc/firewalld/zones/trusted.xml` with
`tailscale0` interface.
- **Target:** v0.5.32.
### D2. systemd-resolved DoT vs corp split-DNS
- No /etc/resolved.conf.d entries shipped (overlay dir empty).
- Corp internal hostnames fail.
- **Prob:** LOW. **Fix:** M. **Target:** v0.7.
## E. Hardware diversity
### E1. NVMe vs SATA LUKS perf
- Argon2id KDF tuned to memory, not IO.
- **Prob:** cosmetic. Skip.
### E2. ARM aarch64
- Out of scope for v0.5/0.6.
### E3. TPM2 unlock
- Already on roadmap. **Target:** v0.7.
## Top 10 ranked (prob × severity)
| # | Issue | Prob | Sev | Target |
|---|-------|------|-----|--------|
| 1 | **B2 Suspend/resume wifi death** (modules_disabled) | VHIGH | CRITICAL | v0.5.32 |
| 2 | **C1 kernel-upgrade grub drift** (LUKS args lost) | HIGH | CRITICAL | v0.5.32 |
| 3 | **A3 USBGuard hash rules** (dock replug) | HIGH | HIGH | v0.5.32 |
| 4 | **D1 firewalld blocks tailscale0** | HIGH | HIGH | v0.5.32 |
| 5 | **A2 KMS blank-screen 10s** | HIGH | MED | v0.5.32 |
| 6 | **B3 Lid-close suspend** (compounds B2) | HIGH | MED | v0.5.32 |
| 7 | **A1 Secure Boot + os-prober dual-boot** | MED | HIGH | v0.6 |
| 8 | **C4 NTS blocked corp** | MED | MED | v0.5.32 |
| 9 | **B1 Plasma Wayland fallback** | MED | MED | v0.6 |
| 10 | **C3 F44 path-pinned patch** | CERTAIN | LOW (Nov) | v0.6 |
## Top 5 to preempt in v0.5.32
1. **B2 modules-lock vs resume** — gate on no-pending-suspend, OR swap
to `module.sig_enforce=1` kernel cmdline.
2. **C1 cmdline drift** — make `veilor-update` fail-loud if
`/etc/kernel/cmdline` and `/etc/default/grub` diverge; regen BLS
on every kernel install.
3. **A3 USBGuard id-based rules**`veilor-usb-enroll` wrapper that
calls `usbguard generate-policy --with-hash=false`. Same fix that
already burned us on onyx.
4. **D1 Tailscale zone** — ship `/etc/firewalld/zones/trusted.xml`
listing `tailscale0`, plus NetworkManager dispatcher to assign it.
5. **A2 KMS handoff** — append `i915.modeset=1 amdgpu.modeset=1
nvidia-drm.modeset=1` to bootloader cmdline.
**Critical insight:** B2 alone bricks the laptop for any user who
closes their lid. Without that fix, v0.5.32 is shippable on desktops
only. Same architectural class as the LUKS bug — security feature
breaks legitimate kernel state transitions.