veilor-os/docs/research/2026-05-05-agent-wave/09-realhw-failure-modes.md

168 lines
7.2 KiB
Markdown
Raw Normal View History

docs: 9-agent research wave findings — v0.5.32 blocker map Logs the full output of the 9-agent deep-dive run on 2026-05-05 to docs/research/2026-05-05-agent-wave/. Pulls every actionable finding into one indexed location so v0.5.32 planning has a paper trail. Files: docs/research/2026-05-05-agent-wave/README.md — index docs/research/2026-05-05-agent-wave/01-...real-hardware.md — Plymouth + LUKS edge cases docs/research/2026-05-05-agent-wave/02-...firstboot-ux.md — SDDM + first-boot UX docs/research/2026-05-05-agent-wave/03-...spike-plan.md — bootc-image-builder 1-week spike docs/research/2026-05-05-agent-wave/04-...tier-2.md — AppArmor + nftables + audit + homed docs/research/2026-05-05-agent-wave/05-...launch.md — threat model + v0.7 launch checklist docs/research/2026-05-05-agent-wave/06-...log-capture.md — virtio-9p host-share for anaconda logs docs/research/2026-05-05-agent-wave/07-...skel-branding.md — /etc/skel gap audit docs/research/2026-05-05-agent-wave/08-...ci-hardening.md — SHA-pin actions + SBOM + SLSA L3 docs/research/2026-05-05-agent-wave/09-...failure-modes.md — real-hardware pessimistic audit Plus the prior linter-applied: docs/ROADMAP.md — Lessons learned section, v0.5.32 active block, v0.6 promotion of veilor-postinstall + veilor-doctor, v0.7 bootc spike scheduled docs/THREAT-MODEL.md — drafted by Agent 5; in/out scope, comparison matrix, v0.7 launch checklist Top blockers identified for v0.5.32 (cross-cited in README): 1. Suspend/resume wifi death (kernel.modules_disabled=1) 2. veilor-firstboot.service WantedBy=graphical.target 3. kernel-upgrade grub drift 4. USBGuard hash-rules problem (already learned on onyx) 5. firewalld blocks tailscale0 6. /etc/skel/ empty 7. virtio-9p log capture replaces broken virtio-serial path Wave + verifier pattern (per ROADMAP lessons learned #4) validated: 9 parallel agents on distinct topics produced converging blocker list. The same pattern landed v0.5.31 four-bug fix from the prior 4-agent verification wave on v0.5.30 outcome.
2026-05-05 14:52:53 +01:00
# Real-hardware failure mode audit (post-v0.5.31)
**Agent 9 of 9-agent wave, 2026-05-05.** Pessimistic enumeration.
## A. Boot path
### A1. Secure Boot + GRUB_DISTRIBUTOR rebrand
- shim chain itself untouched (uses `/EFI/fedora/`), but `grub2-mkconfig`
regenerates entries naming `veilor-os` while shim only trusts paths
under `/EFI/fedora/grubx64.efi`. Strict UEFI: menu boots, kernel
signatures verify via Fedora's MOK chain. Risk: `os-prober` writing
dual-boot Windows entries breaks MBR/MOK.
- **Symptom:** dual-boot with Windows shows
`Verification failed: (0x1A) Security Violation`.
- **Prob:** MED. **Fix:** S. **Target:** v0.5.32.
### A2. KMS handoff — `fbcon=nodefer` necessary but not sufficient
- On Intel Arc/iGPU late-gen + NVIDIA proprietary chains, 5-15s blank
between vt switch and SDDM start because `simpledrm` releases before
`i915`/`nvidia-drm` claim.
- **Symptom:** ~10s blank pre-SDDM; user thinks crashed.
- **Prob:** HIGH. **Fix:** S — add `i915.modeset=1
nvidia-drm.modeset=1 amdgpu.modeset=1`. SDDM `Type=simple` startup.
**Target:** v0.5.32.
### A3. USBGuard hash-based rules
- `scripts/20-harden-kernel.sh:127-131` ships **empty** rules.conf
with `ImplicitPolicyTarget=block`. First boot, admin runs
`usbguard generate-policy`. Per `feedback_usbguard_dock.md`, this
writes hash+parent-hash rules that break on dock replug.
- **Symptom:** keyboard/mouse dies on first dock unplug-replug.
- **Prob:** HIGH. **Fix:** M — patch invocation to
`--with-hash=false`, or ship `veilor-usbguard-enroll` wrapper.
**Target:** v0.5.32 (same bug we already learned).
### A4. Wifi/Bluetooth firmware
- `@hardware-support` pulls `linux-firmware` etc.
- Realtek RTL8852/MT7921 firmware ships in `linux-firmware-whence` only.
- **Prob:** LOW. **Fix:** S (add explicit `linux-firmware-whence`).
**Target:** v0.5.32.
### A5. Bluetooth disabled at boot
- `scripts/20-harden-kernel.sh:111` disables `bluetooth` service.
BT keyboards/mice don't pair until user enables service.
- **Prob:** MED (laptop users). **Fix:** S — leave bluetooth.service
enabled, mask `obex` only. **Target:** v0.6.
## B. First-boot KDE session
### B1. Plasma 6 Wayland fallback on hybrid graphics
- SDDM config doesn't pin session. NVIDIA Optimus + intel-iris
triggers Wayland → silent fallback to X11 on some HW.
- **Symptom:** screen tearing, no fractional scaling.
- **Prob:** MED. **Fix:** S — add `[Autologin] Session=plasma`
+ `[General] DefaultSession=plasma.desktop`. **Target:** v0.6.
### B2. SUSPEND/RESUME KILLS WIFI — THE BIG ONE 🚨
- `veilor-modules-lock.service` sets `kernel.modules_disabled=1` 30s
after graphical.target. `iwlwifi`, `iwlmvm`, `cfg80211` reload on
resume from S3/S0ix. With modules locked: **resume → permanent
wifi death until reboot**. Same for `nvidia` autoload, `xhci_pci`
re-init on dock attach.
- **Symptom:** close laptop lid → reopen → no wifi, no dock USB,
until reboot.
- **Prob:** VERY HIGH (every laptop user, day 1).
- **Fix:** M — gate lock on `ConditionACPower=true` + reset on
suspend, OR move from `modules_disabled` to `module.sig_enforce=1`
kernel cmdline (no runtime lock needed).
- **Target:** v0.5.32 — **BLOCKER**.
### B3. Lid-close handling
- `logind.conf` not modified. Defaults `HandleLidSwitch=suspend`.
Combined with B2, every lid close = wifi loss.
- **Prob:** HIGH. **Fix:** S. **Target:** v0.5.32 (paired with B2).
## C. Day-2 ops
### C1. `/etc/default/grub` + `/etc/kernel/cmdline` drift
- Kickstart writes `GRUB_CMDLINE_LINUX_DEFAULT=""`. Real installer
writes `/etc/kernel/cmdline` with LUKS rd.luks args. `kernel-install`
reads the latter; `grub2-mkconfig` re-reads `/etc/default/grub`.
- **Symptom:** `dnf upgrade kernel` regenerates grub.cfg from
default/grub, drops LUKS unlock args from new entry → unbootable.
- **Prob:** HIGH. **Fix:** M — sync both files in `veilor-update`,
or migrate fully to BLS without grub-mkconfig.
- **Target:** v0.5.32.
### C2. SELinux relabel on first boot
- `firstboot.sh` flips to `enforcing` and `touch /.autorelabel`. On
large /home (encrypted btrfs), relabel takes 2-5min — user sees
frozen screen with cursor.
- **Symptom:** stuck "first boot" appears hung.
- **Prob:** MED. **Fix:** S (add plymouth message). **Target:** v0.6.
### C3. F44 upgrade
- Hardcoded `python3.14` path (kickstart:334) for transaction_progress.py
patch. Survives no upgrade.
- **Prob:** certainty by Nov 2026. **Fix:** M. **Target:** v0.6.
### C4. chrony NTS unreachable from corp networks
- Cloudflare NTS over UDP 4460 blocked by many corp firewalls.
chronyd will fail-stop sync.
- **Symptom:** clock skew → TLS failures → broken everything.
- **Prob:** MED. **Fix:** S (add fallback `pool` line — already
present, verify ordering). **Target:** v0.5.32.
## D. Networking
### D1. firewalld drop zone vs Tailscale 🚨
- `tailscale up` requires UDP 41641 + tailscale0 trusted. Default
`drop` zone blocks tailscale0.
- **Prob:** HIGH (this user uses Tailscale daily).
- **Fix:** S — ship `/etc/firewalld/zones/trusted.xml` with
`tailscale0` interface.
- **Target:** v0.5.32.
### D2. systemd-resolved DoT vs corp split-DNS
- No /etc/resolved.conf.d entries shipped (overlay dir empty).
- Corp internal hostnames fail.
- **Prob:** LOW. **Fix:** M. **Target:** v0.7.
## E. Hardware diversity
### E1. NVMe vs SATA LUKS perf
- Argon2id KDF tuned to memory, not IO.
- **Prob:** cosmetic. Skip.
### E2. ARM aarch64
- Out of scope for v0.5/0.6.
### E3. TPM2 unlock
- Already on roadmap. **Target:** v0.7.
## Top 10 ranked (prob × severity)
| # | Issue | Prob | Sev | Target |
|---|-------|------|-----|--------|
| 1 | **B2 Suspend/resume wifi death** (modules_disabled) | VHIGH | CRITICAL | v0.5.32 |
| 2 | **C1 kernel-upgrade grub drift** (LUKS args lost) | HIGH | CRITICAL | v0.5.32 |
| 3 | **A3 USBGuard hash rules** (dock replug) | HIGH | HIGH | v0.5.32 |
| 4 | **D1 firewalld blocks tailscale0** | HIGH | HIGH | v0.5.32 |
| 5 | **A2 KMS blank-screen 10s** | HIGH | MED | v0.5.32 |
| 6 | **B3 Lid-close suspend** (compounds B2) | HIGH | MED | v0.5.32 |
| 7 | **A1 Secure Boot + os-prober dual-boot** | MED | HIGH | v0.6 |
| 8 | **C4 NTS blocked corp** | MED | MED | v0.5.32 |
| 9 | **B1 Plasma Wayland fallback** | MED | MED | v0.6 |
| 10 | **C3 F44 path-pinned patch** | CERTAIN | LOW (Nov) | v0.6 |
## Top 5 to preempt in v0.5.32
1. **B2 modules-lock vs resume** — gate on no-pending-suspend, OR swap
to `module.sig_enforce=1` kernel cmdline.
2. **C1 cmdline drift** — make `veilor-update` fail-loud if
`/etc/kernel/cmdline` and `/etc/default/grub` diverge; regen BLS
on every kernel install.
3. **A3 USBGuard id-based rules**`veilor-usb-enroll` wrapper that
calls `usbguard generate-policy --with-hash=false`. Same fix that
already burned us on onyx.
4. **D1 Tailscale zone** — ship `/etc/firewalld/zones/trusted.xml`
listing `tailscale0`, plus NetworkManager dispatcher to assign it.
5. **A2 KMS handoff** — append `i915.modeset=1 amdgpu.modeset=1
nvidia-drm.modeset=1` to bootloader cmdline.
**Critical insight:** B2 alone bricks the laptop for any user who
closes their lid. Without that fix, v0.5.32 is shippable on desktops
only. Same architectural class as the LUKS bug — security feature
breaks legitimate kernel state transitions.