veilor-os/docs/research/2026-05-05-agent-wave/09-realhw-failure-modes.md
veilor-org 4e9782a18a docs: 9-agent research wave findings — v0.5.32 blocker map
Logs the full output of the 9-agent deep-dive run on 2026-05-05 to
docs/research/2026-05-05-agent-wave/. Pulls every actionable finding
into one indexed location so v0.5.32 planning has a paper trail.

Files:
  docs/research/2026-05-05-agent-wave/README.md             — index
  docs/research/2026-05-05-agent-wave/01-...real-hardware.md — Plymouth + LUKS edge cases
  docs/research/2026-05-05-agent-wave/02-...firstboot-ux.md  — SDDM + first-boot UX
  docs/research/2026-05-05-agent-wave/03-...spike-plan.md    — bootc-image-builder 1-week spike
  docs/research/2026-05-05-agent-wave/04-...tier-2.md         — AppArmor + nftables + audit + homed
  docs/research/2026-05-05-agent-wave/05-...launch.md         — threat model + v0.7 launch checklist
  docs/research/2026-05-05-agent-wave/06-...log-capture.md    — virtio-9p host-share for anaconda logs
  docs/research/2026-05-05-agent-wave/07-...skel-branding.md  — /etc/skel gap audit
  docs/research/2026-05-05-agent-wave/08-...ci-hardening.md   — SHA-pin actions + SBOM + SLSA L3
  docs/research/2026-05-05-agent-wave/09-...failure-modes.md  — real-hardware pessimistic audit

Plus the prior linter-applied:
  docs/ROADMAP.md      — Lessons learned section, v0.5.32 active block,
                          v0.6 promotion of veilor-postinstall + veilor-doctor,
                          v0.7 bootc spike scheduled
  docs/THREAT-MODEL.md  — drafted by Agent 5; in/out scope, comparison
                          matrix, v0.7 launch checklist

Top blockers identified for v0.5.32 (cross-cited in README):
  1. Suspend/resume wifi death (kernel.modules_disabled=1)
  2. veilor-firstboot.service WantedBy=graphical.target
  3. kernel-upgrade grub drift
  4. USBGuard hash-rules problem (already learned on onyx)
  5. firewalld blocks tailscale0
  6. /etc/skel/ empty
  7. virtio-9p log capture replaces broken virtio-serial path

Wave + verifier pattern (per ROADMAP lessons learned #4) validated:
9 parallel agents on distinct topics produced converging blocker
list. The same pattern landed v0.5.31 four-bug fix from the prior
4-agent verification wave on v0.5.30 outcome.
2026-05-05 14:52:53 +01:00

7.2 KiB
Raw Permalink Blame History

Real-hardware failure mode audit (post-v0.5.31)

Agent 9 of 9-agent wave, 2026-05-05. Pessimistic enumeration.

A. Boot path

A1. Secure Boot + GRUB_DISTRIBUTOR rebrand

  • shim chain itself untouched (uses /EFI/fedora/), but grub2-mkconfig regenerates entries naming veilor-os while shim only trusts paths under /EFI/fedora/grubx64.efi. Strict UEFI: menu boots, kernel signatures verify via Fedora's MOK chain. Risk: os-prober writing dual-boot Windows entries breaks MBR/MOK.
  • Symptom: dual-boot with Windows shows Verification failed: (0x1A) Security Violation.
  • Prob: MED. Fix: S. Target: v0.5.32.

A2. KMS handoff — fbcon=nodefer necessary but not sufficient

  • On Intel Arc/iGPU late-gen + NVIDIA proprietary chains, 5-15s blank between vt switch and SDDM start because simpledrm releases before i915/nvidia-drm claim.
  • Symptom: ~10s blank pre-SDDM; user thinks crashed.
  • Prob: HIGH. Fix: S — add i915.modeset=1 nvidia-drm.modeset=1 amdgpu.modeset=1. SDDM Type=simple startup. Target: v0.5.32.

A3. USBGuard hash-based rules

  • scripts/20-harden-kernel.sh:127-131 ships empty rules.conf with ImplicitPolicyTarget=block. First boot, admin runs usbguard generate-policy. Per feedback_usbguard_dock.md, this writes hash+parent-hash rules that break on dock replug.
  • Symptom: keyboard/mouse dies on first dock unplug-replug.
  • Prob: HIGH. Fix: M — patch invocation to --with-hash=false, or ship veilor-usbguard-enroll wrapper. Target: v0.5.32 (same bug we already learned).

A4. Wifi/Bluetooth firmware

  • @hardware-support pulls linux-firmware etc.
  • Realtek RTL8852/MT7921 firmware ships in linux-firmware-whence only.
  • Prob: LOW. Fix: S (add explicit linux-firmware-whence). Target: v0.5.32.

A5. Bluetooth disabled at boot

  • scripts/20-harden-kernel.sh:111 disables bluetooth service. BT keyboards/mice don't pair until user enables service.
  • Prob: MED (laptop users). Fix: S — leave bluetooth.service enabled, mask obex only. Target: v0.6.

B. First-boot KDE session

B1. Plasma 6 Wayland fallback on hybrid graphics

  • SDDM config doesn't pin session. NVIDIA Optimus + intel-iris triggers Wayland → silent fallback to X11 on some HW.
  • Symptom: screen tearing, no fractional scaling.
  • Prob: MED. Fix: S — add [Autologin] Session=plasma
    • [General] DefaultSession=plasma.desktop. Target: v0.6.

B2. SUSPEND/RESUME KILLS WIFI — THE BIG ONE 🚨

  • veilor-modules-lock.service sets kernel.modules_disabled=1 30s after graphical.target. iwlwifi, iwlmvm, cfg80211 reload on resume from S3/S0ix. With modules locked: resume → permanent wifi death until reboot. Same for nvidia autoload, xhci_pci re-init on dock attach.
  • Symptom: close laptop lid → reopen → no wifi, no dock USB, until reboot.
  • Prob: VERY HIGH (every laptop user, day 1).
  • Fix: M — gate lock on ConditionACPower=true + reset on suspend, OR move from modules_disabled to module.sig_enforce=1 kernel cmdline (no runtime lock needed).
  • Target: v0.5.32 — BLOCKER.

B3. Lid-close handling

  • logind.conf not modified. Defaults HandleLidSwitch=suspend. Combined with B2, every lid close = wifi loss.
  • Prob: HIGH. Fix: S. Target: v0.5.32 (paired with B2).

C. Day-2 ops

C1. /etc/default/grub + /etc/kernel/cmdline drift

  • Kickstart writes GRUB_CMDLINE_LINUX_DEFAULT="". Real installer writes /etc/kernel/cmdline with LUKS rd.luks args. kernel-install reads the latter; grub2-mkconfig re-reads /etc/default/grub.
  • Symptom: dnf upgrade kernel regenerates grub.cfg from default/grub, drops LUKS unlock args from new entry → unbootable.
  • Prob: HIGH. Fix: M — sync both files in veilor-update, or migrate fully to BLS without grub-mkconfig.
  • Target: v0.5.32.

C2. SELinux relabel on first boot

  • firstboot.sh flips to enforcing and touch /.autorelabel. On large /home (encrypted btrfs), relabel takes 2-5min — user sees frozen screen with cursor.
  • Symptom: stuck "first boot" appears hung.
  • Prob: MED. Fix: S (add plymouth message). Target: v0.6.

C3. F44 upgrade

  • Hardcoded python3.14 path (kickstart:334) for transaction_progress.py patch. Survives no upgrade.
  • Prob: certainty by Nov 2026. Fix: M. Target: v0.6.

C4. chrony NTS unreachable from corp networks

  • Cloudflare NTS over UDP 4460 blocked by many corp firewalls. chronyd will fail-stop sync.
  • Symptom: clock skew → TLS failures → broken everything.
  • Prob: MED. Fix: S (add fallback pool line — already present, verify ordering). Target: v0.5.32.

D. Networking

D1. firewalld drop zone vs Tailscale 🚨

  • tailscale up requires UDP 41641 + tailscale0 trusted. Default drop zone blocks tailscale0.
  • Prob: HIGH (this user uses Tailscale daily).
  • Fix: S — ship /etc/firewalld/zones/trusted.xml with tailscale0 interface.
  • Target: v0.5.32.

D2. systemd-resolved DoT vs corp split-DNS

  • No /etc/resolved.conf.d entries shipped (overlay dir empty).
  • Corp internal hostnames fail.
  • Prob: LOW. Fix: M. Target: v0.7.

E. Hardware diversity

E1. NVMe vs SATA LUKS perf

  • Argon2id KDF tuned to memory, not IO.
  • Prob: cosmetic. Skip.

E2. ARM aarch64

  • Out of scope for v0.5/0.6.

E3. TPM2 unlock

  • Already on roadmap. Target: v0.7.

Top 10 ranked (prob × severity)

# Issue Prob Sev Target
1 B2 Suspend/resume wifi death (modules_disabled) VHIGH CRITICAL v0.5.32
2 C1 kernel-upgrade grub drift (LUKS args lost) HIGH CRITICAL v0.5.32
3 A3 USBGuard hash rules (dock replug) HIGH HIGH v0.5.32
4 D1 firewalld blocks tailscale0 HIGH HIGH v0.5.32
5 A2 KMS blank-screen 10s HIGH MED v0.5.32
6 B3 Lid-close suspend (compounds B2) HIGH MED v0.5.32
7 A1 Secure Boot + os-prober dual-boot MED HIGH v0.6
8 C4 NTS blocked corp MED MED v0.5.32
9 B1 Plasma Wayland fallback MED MED v0.6
10 C3 F44 path-pinned patch CERTAIN LOW (Nov) v0.6

Top 5 to preempt in v0.5.32

  1. B2 modules-lock vs resume — gate on no-pending-suspend, OR swap to module.sig_enforce=1 kernel cmdline.
  2. C1 cmdline drift — make veilor-update fail-loud if /etc/kernel/cmdline and /etc/default/grub diverge; regen BLS on every kernel install.
  3. A3 USBGuard id-based rulesveilor-usb-enroll wrapper that calls usbguard generate-policy --with-hash=false. Same fix that already burned us on onyx.
  4. D1 Tailscale zone — ship /etc/firewalld/zones/trusted.xml listing tailscale0, plus NetworkManager dispatcher to assign it.
  5. A2 KMS handoff — append i915.modeset=1 amdgpu.modeset=1 nvidia-drm.modeset=1 to bootloader cmdline.

Critical insight: B2 alone bricks the laptop for any user who closes their lid. Without that fix, v0.5.32 is shippable on desktops only. Same architectural class as the LUKS bug — security feature breaks legitimate kernel state transitions.