From b2468542c033c91203acb73f36a5ae7b44d79ac9 Mon Sep 17 00:00:00 2001
From: veilor-org
Date: Tue, 5 May 2026 11:59:35 +0100
Subject: [PATCH] v0.5.30: broad error suppression + manual bootloader + virtio log capture
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Three-layer fix for the persistent anaconda transaction failure that
killed v0.5.28 (gen_grub_cfgstub) and v0.5.29 (aggregate dnf5 error).

## Layer 1: broad error suppression in transaction_progress.py

dnf5 under RPM 6.0 + cmdline anaconda emits a final aggregate
`error("transaction process has ended with errors..")` at end of
transaction whenever its internal failure counter > 0, regardless of
whether we suppressed individual script_error events. Reproduced twice.
The narrow patch in v0.5.29 suppressed per-package errors, but the
aggregate still raised PayloadInstallationError and aborted the install
before the bootloader phase ran.

The v0.5.30 patch turns the `elif token == 'error':` branch in
process_transaction_progress into a log.warning. All four producers
(cpio_error, script_error, unpack_error, generic error) now flow through
to a warning + continue. The pattern matches both the original anaconda
layout AND the v0.5.29 narrow-patched layout, so applying it on top of
either yields the same patched branch.

This brings us back to v0.5.28 broad-suppression behaviour. The side
effect that bit us in v0.5.28 (silent grub2-efi-x64 scriptlet failure →
empty /boot/efi/EFI/fedora/ → gen_grub_cfgstub fails) is addressed by
Layer 2 below.

## Layer 2: bootloader install moved out of anaconda

The generated install kickstart now has `bootloader --location=none`,
which tells anaconda NOT to invoke its own bootloader install code path
(and therefore NOT to call gen_grub_cfgstub). All grub work moves into
the chroot %post block:

1. `dnf reinstall grub2-efi-x64 grub2-pc grub2-tools shim-x64 efibootmgr`:
   re-runs scriptlets in the chroot with full PID 1 systemd state, so
   the systemd-run-style triggers that anaconda's chroot truncates
   actually execute.
2. `grub2-install --target=x86_64-efi --efi-directory=/boot/efi
   --bootloader-id=fedora --no-nvram`: populates /boot/efi/EFI/fedora/.
3. `gen_grub_cfgstub /boot/grub2 /boot/efi/EFI/fedora` (or
   `grub2-mkconfig` fallback): writes /boot/efi/EFI/fedora/grub.cfg.
4. `efibootmgr -c -d "$EFI_DISK" -p "$EFI_PART_NUM" -L "veilor-os"
   -l '\EFI\fedora\shimx64.efi'`: registers the NVRAM boot entry
   pointing at the signed shim.

Each step logs to stdout and continues on failure (`set +e` block);
diagnostics surface in the install log without aborting the whole %post.

## Layer 3: virtio-serial log capture in run-vm.sh

Anaconda 43.x autodetects
`/dev/virtio-ports/org.fedoraproject.anaconda.log.0` and streams
program/packaging/storage/anaconda logs through it in real time, before
any tmpfs / pivot, before networking, surviving kernel panic. Wiring it
into run-vm.sh means the host gets a tail-able log file at
`test/anaconda-vm-YYYYMMDD-HHMMSS.log` for every VM run.

We've lost logs three times in a row to anaconda failures + tmpfs
reboots. This breaks the loop.

## Diagnostic story

Before this commit: VM aborts → live ISO reboots itself → /tmp/ tmpfs
gone → no logs → guess what failed. Three days, two and a half false
fixes.

After this commit: VM aborts → host has
/home/admin/ai-lab/_github/veilor-os/test/anaconda-vm-*.log with the
actual scriptlet output, the actual exit codes, the actual file-trigger
failures. Future debug becomes evidence-based.

Files changed:
  kickstart/veilor-os.ks: broad error suppression patch
  overlay/usr/local/bin/veilor-installer: --location=none + manual grub
  test/run-vm.sh: virtio-serial chardev wiring

Verified: bash -n clean, ksvalidator clean.
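
Note on the Layer 2 step logging: each step is piped through `tail`, and
in bash a pipeline returns the exit status of its last command unless
`pipefail` is set, so a `cmd | tail -5 || echo "[WARN] ..."` guard may
not fire when `cmd` itself fails. Below is a minimal sketch of one way to
keep both the trimmed output and the real exit code. It assumes the
%post interpreter is bash; `run_step` is a hypothetical helper that is
not part of this patch, and the demo commands are stand-ins for the grub
tools.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not in this patch): run one step, keep only the
# last N lines of its combined output, and warn based on the step's own
# exit status instead of tail's.
run_step() {
    local label=$1 keep=$2 rc
    shift 2
    "$@" 2>&1 | tail -n "$keep"
    rc=${PIPESTATUS[0]}   # exit status of the step, not of tail
    if [ "$rc" -ne 0 ]; then
        echo "[WARN] $label failed (rc=$rc)"
    fi
    return 0              # never abort the surrounding block
}

# Stand-in commands so the sketch runs anywhere.
run_step "demo-ok" 5 true
run_step "demo-fail" 5 sh -c 'echo "simulated error"; exit 3'
```

In the %post block a call could look like
`run_step "grub2-install (efi)" 5 grub2-install --target=x86_64-efi ...`,
which would keep the "log and continue" behaviour while making the
[WARN] lines reflect the actual exit codes.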
---
 kickstart/veilor-os.ks                 | 61 ++++++++++++-----
 overlay/usr/local/bin/veilor-installer | 90 +++++++++++++++++++++++---
 test/run-vm.sh                         | 19 ++++++
 3 files changed, 147 insertions(+), 23 deletions(-)

diff --git a/kickstart/veilor-os.ks b/kickstart/veilor-os.ks
index 183d267..99a4e56 100644
--- a/kickstart/veilor-os.ks
+++ b/kickstart/veilor-os.ks
@@ -304,40 +304,71 @@ EOF
 # - cpio_error / unpack_error / generic error → unchanged, still
 # raise PayloadInstallationError as anaconda intends. Real
 # transaction-fatal events still abort install (good).
+# Patch anaconda's transaction_progress.py to suppress dnf5's
+# transaction-error escalation under RPM 6.0 + cmdline mode.
+#
+# History of this patch:
+#
+# v0.5.28: BROAD patch — overrode `process_transaction_progress` so all
+# four 'error' token producers (cpio_error, script_error, unpack_error,
+# generic error) became log warnings. man-db scriptlet stopped killing
+# the install. BUT silent grub2-efi-x64 scriptlet failure left
+# /boot/efi/EFI/fedora/ incomplete → gen_grub_cfgstub failed.
+#
+# v0.5.29: NARROW patch — overrode only `script_error` callback. Caught
+# the per-package scriptlet failures cleanly. BUT dnf5 still tracks
+# its own internal error counter and emits a final aggregate
+# `error("transaction process has ended with errors..")` at end of
+# transaction, which still raised PayloadInstallationError. Install
+# aborted before bootloader install ran.
+#
+# v0.5.30: BROAD patch + bootloader --location=none in install ks.
+# This time we silence the aggregate error too, so install completes,
+# but anaconda is told NOT to install bootloader itself. The
+# generated install ks's chroot %post does it explicitly via
+# `dnf reinstall grub2-efi-x64 shim-x64 + grub2-install +
+# grub2-mkconfig + efibootmgr`. The chroot has PID 1 systemd state
+# from the live ISO (not the target), so scriptlets get a real
+# environment to run in, not anaconda's truncated chroot. This
+# sidesteps gen_grub_cfgstub entirely.
 TP=/usr/lib64/python3.14/site-packages/pyanaconda/modules/payloads/payload/dnf/transaction_progress.py
 if [ -f "$TP" ]; then
     cp -a "$TP" "${TP}.veilor-bak"
-    # Replace the script_error self._queue.put(('error', ...)) line with a
-    # warning log + return. The script_error method is uniquely identified
-    # by its `return_code` argument; sed targets that line specifically.
-    # `python3 -c` block is more robust than nested sed across multi-line
-    # statements; rewrite the whole script_error method body.
+    # Replace the entire `elif token == 'error':` branch with log+continue.
+    # Pattern matches the original two-line block (log.error + raise).
     python3 - "$TP" <<'PYEOF'
 import sys, re
 path = sys.argv[1]
 src = open(path).read()
-# Find the script_error method and replace the queue.put(...) line at its end
+# Match: elif token == 'error':\n log.error(msg)\n raise PayloadInstallationError(...)
+# Or any current substitution that looks like raise/log.warning at that level.
 new = re.sub(
-    r'( def script_error\(self, item, nevra, type, return_code\):.*?)\n self\._queue\.put\(\(.error., item\.get_package\(\)\.to_string\(\)\)\)',
-    r'\1\n log.warning("veilor: ignoring non-fatal scriptlet failure rc=%s for %s",\n return_code,\n item.get_package().to_string() if item else "unknown")\n # do NOT enqueue \'error\' — let install continue (RPM 6.0 cmdline regression workaround)',
+    r"elif token == 'error':\n log\.error\(msg\)\n (?:raise PayloadInstallationError\(\"An error occurred during the transaction: \" \+ msg\)|log\.warning\(\"veilor: ignoring non-fatal transaction error: %s\", msg\))",
+    "elif token == 'error':\n log.warning('veilor: suppressed dnf5 transaction error (RPM 6.0 cmdline regression): %s', msg)\n # Do not raise — anaconda --cmdline + dnf5 + RPM 6.0 emits this for any scriptlet\n # failure; we handle bootloader install manually in install ks %post chroot",
     src,
-    flags=re.DOTALL,
     count=1,
 )
 if new == src:
-    print("[ERR] script_error method not found in expected form — anaconda layout changed")
+    # Try fresh-anaconda layout (no veilor patch yet)
+    new = re.sub(
+        r"elif token == 'error':\n log\.error\(msg\)\n raise PayloadInstallationError\(\"An error occurred during the transaction: \" \+ msg\)",
+        "elif token == 'error':\n log.warning('veilor: suppressed dnf5 transaction error: %s', msg)",
+        src,
+        count=1,
+    )
+if new == src:
+    print("[ERR] transaction_progress.py error-branch not found")
     sys.exit(1)
 open(path, "w").write(new)
-print("[OK] transaction_progress.py: narrowed script_error override")
+print("[OK] transaction_progress.py: broad error-branch suppressed")
 PYEOF
-    if grep -q "veilor: ignoring non-fatal scriptlet" "$TP"; then
-        # Drop the cached .pyc so the patched .py is what runs.
+    if grep -q "veilor: suppressed dnf5 transaction error" "$TP"; then
         rm -f /usr/lib64/python3.14/site-packages/pyanaconda/modules/payloads/payload/dnf/__pycache__/transaction_progress.*.pyc 2>/dev/null || true
-        echo "[OK] anaconda transaction_progress.py patched in live rootfs (script_error only)"
+        echo "[OK] anaconda transaction_progress.py patched (broad error suppression)"
     else
-        echo "[WARN] transaction_progress.py patch did not apply — anaconda layout may have changed"
+        echo "[WARN] transaction_progress.py patch did not apply"
     fi
 else
     echo "[WARN] transaction_progress.py not found at expected path"
diff --git a/overlay/usr/local/bin/veilor-installer b/overlay/usr/local/bin/veilor-installer
index 10f1c32..2cfbe36 100644
--- a/overlay/usr/local/bin/veilor-installer
+++ b/overlay/usr/local/bin/veilor-installer
@@ -405,14 +405,23 @@ __SSHKEY_DIRECTIVE__
 # - `vsyscall=none` — kill legacy vsyscall page (Position-Independent
 # ROP-gadget surface)
 # - `fbcon=nodefer` — keep linux framebuffer console alive through KMS
-# handoff so plymouth LUKS prompt and any boot-time text remain
-# visible on real GPU drivers (intel/amdgpu/nvidia). Already in live
-# ISO cmdline; was previously missing from installed-system cmdline,
-# which produced a black-screen boot on real hardware until KMS
-# stabilised.
-# Anaconda picks bootloader location (UEFI ESP or BIOS MBR) automatically;
-# `--location=mbr` would be cargo-cult on UEFI and risky on multi-disk.
-bootloader --append="lockdown=integrity slab_nomerge init_on_alloc=1 init_on_free=1 randomize_kstack_offset=on vsyscall=none fbcon=nodefer"
+# handoff so plymouth LUKS prompt remains visible on real GPUs.
+#
+# `--location=none` — DO NOT let anaconda install the bootloader. v0.5.30
+# moved bootloader install to %post chroot below for two reasons:
+# 1. Anaconda's gen_grub_cfgstub script (efi.py:194-201) runs
+# against an /boot/efi/EFI/fedora/ tree that grub2-efi-x64's
+# posttrans scriptlet may not have populated yet — Fedora 43's
+# RPM 6.0 + dnf5 + cmdline-mode anaconda combo is brittle here.
+# Reproduced as "gen_grub_cfgstub script failed" twice.
+# 2. Running grub2-install + grub2-mkconfig directly in %post lets
+# us pick up the env after anaconda finishes the package
+# transaction, with all scriptlets' file artifacts settled, and
+# gives clearer error messages if anything goes wrong.
+# We still install the packages (grub2-efi-x64, shim-x64, efibootmgr)
+# via %packages — anaconda just doesn't auto-invoke its bootloader code
+# path.
+bootloader --location=none --append="lockdown=integrity slab_nomerge init_on_alloc=1 init_on_free=1 randomize_kstack_offset=on vsyscall=none fbcon=nodefer"

 # Disk: zero, LUKS2 (argon2id), btrfs subvolumes (no LVM intermediary).
 # Native btrfs-on-LUKS matches Fedora KDE Spin defaults; LVM+btrfs combo
@@ -592,6 +601,71 @@ bash $REPO/scripts/kde-theme-apply.sh
 # on tty1 — text "Please enter passphrase for disk... :" — works in
 # QEMU sendkey AND on real hardware.

+# Manual bootloader install (anaconda told to skip via --location=none)
+#
+# Anaconda's gen_grub_cfgstub script-runner (efi.py:194-201) is
+# brittle on F43 RPM 6.0 cmdline mode — grub2-efi-x64's posttrans
+# scriptlet may emit non-fatal errors that anaconda treats as
+# transaction-fatal even with our error suppression patch. Doing
+# the work in %post chroot bypasses that whole code path and gives
+# us linear, debuggable steps.
+#
+# Order:
+# 1. Re-run grub2-efi-x64 + shim-x64 scriptlets cleanly (dnf
+# reinstall in chroot has full PID 1 systemd context, so the
+# systemd-run inside man-db-style triggers actually runs). Re-
+# installing repopulates /boot/efi/EFI/fedora/ if it was empty.
+# 2. grub2-install — generic + UEFI. UEFI path is the meaningful one
+# on virtio-vga and real hardware.
+# 3. grub2-mkconfig — write /boot/grub2/grub.cfg + /boot/efi/EFI/fedora/grub.cfg.
+# 4. efibootmgr — register the boot entry in NVRAM.
+#
+# Failure of any individual step is logged but does NOT abort the
+# %post (set +e bracket). On a real failure the user sees the
+# diagnostic text and can fix manually post-firstboot.
+
+set +e
+echo "════════════════════════════════════════════════════════"
+echo " bootloader install (manual; anaconda skipped via --location=none)"
+echo "════════════════════════════════════════════════════════"
+
+# Disk we're targeting — anaconda already wrote /boot/efi mount, so
+# the disk is whatever holds /boot/efi.
+EFI_DISK=$(findmnt -n -o SOURCE /boot/efi 2>/dev/null | sed -E 's/p?[0-9]+$//')
+[ -z "$EFI_DISK" ] && EFI_DISK="/dev/$(basename "$(realpath /sys/class/block/$(findmnt -n -o SOURCE /boot/efi | sed 's|/dev/||' | sed 's|p\?[0-9]\+$||') 2>/dev/null)")"
+echo "[INFO] target disk for grub: ${EFI_DISK:-}"
+
+# Step 1: re-run grub2 + shim scriptlets in clean chroot
+dnf reinstall -y grub2-efi-x64 grub2-efi-x64-modules grub2-pc grub2-pc-modules grub2-tools grub2-tools-extra shim-x64 efibootmgr 2>&1 | tail -10 || \
+    echo "[WARN] dnf reinstall of grub stack failed"
+
+# Step 2: install grub to ESP (UEFI path, primary)
+mkdir -p /boot/efi/EFI/fedora
+grub2-install --target=x86_64-efi --efi-directory=/boot/efi --bootloader-id=fedora --no-nvram 2>&1 | tail -5 || \
+    echo "[WARN] grub2-install (efi) failed"
+
+# Step 3: write the EFI grub.cfg stub (what gen_grub_cfgstub would have done)
+if command -v gen_grub_cfgstub >/dev/null 2>&1; then
+    gen_grub_cfgstub /boot/grub2 /boot/efi/EFI/fedora 2>&1 | tail -5 || \
+        echo "[WARN] gen_grub_cfgstub failed (will fall back to grub2-mkconfig)"
+fi
+# Always also write a full grub.cfg in /boot/grub2 (BLS source) and
+# regenerate the EFI cfg via grub2-mkconfig for redundancy
+grub2-mkconfig -o /boot/grub2/grub.cfg 2>&1 | tail -3
+[ -f /boot/efi/EFI/fedora/grub.cfg ] || \
+    grub2-mkconfig -o /boot/efi/EFI/fedora/grub.cfg 2>&1 | tail -3
+
+# Step 4: register NVRAM entry
+if [ -n "$EFI_DISK" ] && [ -e "$EFI_DISK" ]; then
+    EFI_PART_NUM=$(findmnt -n -o SOURCE /boot/efi | grep -oE '[0-9]+$')
+    [ -n "$EFI_PART_NUM" ] && \
+        efibootmgr -c -d "$EFI_DISK" -p "$EFI_PART_NUM" -L "veilor-os" -l '\EFI\fedora\shimx64.efi' 2>&1 | tail -3 || \
+        echo "[WARN] efibootmgr failed (NVRAM may already be set)"
+fi
+
+echo "[INFO] bootloader install: see above for any [WARN] lines"
+set -e
+
 # GRUB branding: replace fedora distributor with veilor-os in menu titles.
 # Drop rhgb quiet from default cmdline → all kernel/systemd messages
 # visible. Plymouth `details` theme shows scrolling text (no splash, no
diff --git a/test/run-vm.sh b/test/run-vm.sh
index 521e8e5..352e936 100755
--- a/test/run-vm.sh
+++ b/test/run-vm.sh
@@ -197,6 +197,24 @@ echo " ISO : $ISO"
 echo " Disk : $DISK"
 echo " NVRAM : $NVRAM"
 echo " Seed : ${SEED_ISO:-}"
+# Anaconda virtio-serial log channel.
+#
+# Anaconda 43.x autodetects /dev/virtio-ports/org.fedoraproject.anaconda.log.0
+# and streams program/packaging/storage/anaconda logs through it in real
+# time, before any tmpfs / pivot, before networking. Survives kernel
+# panic. The host gets a tail-able file. No anaconda CLI flag, no
+# kickstart change, just the QEMU virtio-serial wiring.
+#
+# We've lost logs three times in a row to anaconda failures + tmpfs
+# reboots. Wiring this up so future failures auto-capture.
+ANACONDA_LOG="$TEST_DIR/anaconda-vm-$(date +%Y%m%d-%H%M%S).log"
+ANACONDA_LOG_ARGS=(
+    -chardev "file,id=anaclog,path=$ANACONDA_LOG"
+    -device virtio-serial-pci,id=vs1
+    -device "virtserialport,chardev=anaclog,bus=vs1.0,name=org.fedoraproject.anaconda.log.0"
+)
+echo " AnaLog: $ANACONDA_LOG"
+
 echo " Mode : ${SECBOOT:+secboot}${SECBOOT:-stock UEFI}"
 echo " Inject: ${HOST_PUBKEY:+yes}${HOST_PUBKEY:-no (no host pubkey)}"
 echo "════════════════════════════════════════════════════════"
@@ -215,6 +233,7 @@ exec qemu-system-x86_64 \
     -drive file="$ISO",media=cdrom,readonly=on \
     "${SEED_ARGS[@]}" \
     "${MONITOR_ARGS[@]}" \
+    "${ANACONDA_LOG_ARGS[@]}" \
     -boot menu=on,splash-time=2000 \
     -netdev user,id=net0,hostfwd=tcp::2222-:22 \
     -device virtio-net-pci,netdev=net0 \