v0.5.30: broad error suppression + manual bootloader + virtio log capture

Three-layer fix for the persistent anaconda transaction failure that
killed v0.5.28 (gen_grub_cfgstub) and v0.5.29 (aggregate dnf5 error).

## Layer 1: broad error suppression in transaction_progress.py

dnf5 under RPM 6.0 + cmdline anaconda emits a final aggregate
`error("transaction process has ended with errors..")` at end of
transaction whenever its internal failure counter > 0, regardless of
whether we suppressed individual script_error events. Reproduced
twice. The narrow patch in v0.5.29 suppressed per-package errors but
the aggregate still raised PayloadInstallationError and aborted the
install before the bootloader phase ran.

The v0.5.30 patch turns the `elif token == 'error':` branch in
process_transaction_progress into a log.warning. All four producers
(cpio_error, script_error, unpack_error, generic error) now flow
through to a warning + continue. The pattern matches both the original
anaconda layout AND the v0.5.29 narrow-patched layout (which left this
branch untouched), so the patch applies cleanly on top of either.
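
The branch rewrite can be illustrated on a small stand-in string. This is a minimal sketch, not the shipped patch: `ORIGINAL` is a hand-written sample rather than the real pyanaconda source, and `patch_error_branch` and `MARKER` are hypothetical names. Unlike the heredoc in the kickstart, the sketch captures the branch's indentation and treats an already-patched file as unchanged.

```python
import re

# Hand-written stand-in for the error branch; NOT the real pyanaconda file.
ORIGINAL = (
    "    elif token == 'error':\n"
    "        log.error(msg)\n"
    "        raise PayloadInstallationError(\"An error occurred during the transaction: \" + msg)\n"
)

# Marker string used to detect a file that has already been patched.
MARKER = "veilor: suppressed dnf5 transaction error"

# Capture the branch indentation so the replacement lines up with it.
PATTERN = re.compile(
    r"(?P<indent>[ \t]*)elif token == 'error':\n"
    r"[ \t]*log\.error\(msg\)\n"
    r"[ \t]*raise PayloadInstallationError\([^\n]*\)"
)


def patch_error_branch(src):
    """Turn the log.error + raise branch into a single log.warning."""
    if MARKER in src:
        return src  # already patched: leave the text alone

    def repl(match):
        indent = match.group("indent")
        return (
            f"{indent}elif token == 'error':\n"
            f"{indent}    log.warning('{MARKER}: %s', msg)"
        )

    new, count = PATTERN.subn(repl, src, count=1)
    if count == 0:
        raise ValueError("error branch not found in expected form")
    return new


patched = patch_error_branch(ORIGINAL)
print(patched)
```

Running `patch_error_branch` a second time on `patched` returns it unchanged, which is the property a repeatable build step wants.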

This brings us back to v0.5.28 broad-suppression behaviour. The
side effect that bit us in v0.5.28 (silent grub2-efi-x64 scriptlet
failure → empty /boot/efi/EFI/fedora/ → gen_grub_cfgstub fails)
is addressed by Layer 2 below.

## Layer 2: bootloader install moved out of anaconda

The generated install kickstart now has `bootloader --location=none`,
which tells anaconda NOT to invoke its own bootloader install code
path (and therefore NOT to call gen_grub_cfgstub). All grub work
moves into the chroot %post block:

  1. `dnf reinstall grub2-efi-x64 grub2-pc grub2-tools shim-x64
     efibootmgr` — re-runs scriptlets in the chroot with full
     PID 1 systemd state, so the systemd-run-style triggers that
     anaconda's chroot truncates actually execute.
  2. `grub2-install --target=x86_64-efi --efi-directory=/boot/efi
     --bootloader-id=fedora --no-nvram` — populates /boot/efi/EFI/fedora/
  3. `gen_grub_cfgstub /boot/grub2 /boot/efi/EFI/fedora` (or
     `grub2-mkconfig` fallback) — writes /boot/efi/EFI/fedora/grub.cfg.
  4. `efibootmgr -c -d <disk> -p <part> -L "veilor-os" -l \EFI\fedora\shimx64.efi`
     — registers the NVRAM boot entry pointing at the signed shim.
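
Step 4 needs the ESP's parent disk and partition number as separate values for `efibootmgr -d` and `-p`. The sketch below shows one way to split a partition device path, assuming standard Linux naming (`sda1`, `vda15`, `nvme0n1p1`, `mmcblk0p2`); `split_partition` is a hypothetical helper, not something the installer ships.

```python
import re


def split_partition(dev):
    """Split a partition device path into (disk, partition_number)."""
    # Disks whose name ends in a digit (nvme0n1, mmcblk0) use a "p" separator.
    match = re.fullmatch(r"(/dev/.*\d)p(\d+)", dev)
    if match is None:
        # Disks whose name ends in a letter (sda, vda) append the number directly.
        match = re.fullmatch(r"(/dev/.*\D)(\d+)", dev)
    if match is None:
        raise ValueError(f"not a partition device path: {dev}")
    return match.group(1), match.group(2)


for sample in ("/dev/sda1", "/dev/vda15", "/dev/nvme0n1p1", "/dev/mmcblk0p2"):
    print(sample, "->", split_partition(sample))
```

Raising on an unrecognised path keeps a bad guess from reaching `efibootmgr`; the caller can log the error and skip the NVRAM step.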

Each step logs to stdout and continues on failure (`set +e` block);
diagnostics surface in the install log without aborting the whole
%post.
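
The log-and-continue pattern depends on seeing each command's real exit status. In bash without `set -o pipefail`, a pipeline such as `cmd 2>&1 | tail -5 || echo "[WARN] ..."` takes its status from `tail`, not from `cmd`. The sketch below shows the same idea with the status taken from the command itself; `run_step` is a hypothetical helper and the `[WARN]` format only mirrors the %post messages.

```python
import subprocess
import sys


def run_step(name, argv, tail=5):
    """Run one step, print the last `tail` output lines, return its exit code."""
    try:
        proc = subprocess.run(
            argv,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # fold stderr into stdout, like 2>&1
            text=True,
        )
    except OSError as exc:
        # Missing or non-executable binary: report it and keep going.
        print(f"[WARN] {name} could not be started: {exc}")
        return 127
    for line in proc.stdout.splitlines()[-tail:]:
        print(line)
    if proc.returncode != 0:
        print(f"[WARN] {name} failed (exit {proc.returncode})")
    return proc.returncode


# Demo with the current Python interpreter standing in for a real tool.
demo_rc = run_step(
    "demo step",
    [sys.executable, "-c", "import sys; print('working'); sys.exit(3)"],
)
print("continuing after exit code", demo_rc)
```

Because the helper returns the code instead of raising, a caller can run every step in order and still report which ones failed.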

## Layer 3: virtio-serial log capture in run-vm.sh

Anaconda 43.x autodetects `/dev/virtio-ports/org.fedoraproject.anaconda.log.0`
and streams program/packaging/storage/anaconda logs through it in
real time, before any tmpfs / pivot, before networking, surviving
kernel panic. Wiring it into run-vm.sh means the host gets a
tail-able log file at `test/anaconda-vm-YYYYMMDD-HHMMSS.log` for
every VM run.
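
The wiring amounts to three QEMU arguments plus a timestamped file name. run-vm.sh is bash; the sketch below restates the same arguments in Python purely for illustration, with `anaconda_log_args` as a hypothetical helper name. The chardev id, bus name, and port name are the ones used in this commit.

```python
from datetime import datetime
from pathlib import Path

# Port name that anaconda looks for under /dev/virtio-ports/.
PORT_NAME = "org.fedoraproject.anaconda.log.0"


def anaconda_log_args(test_dir, now=None):
    """Return (log_path, qemu_args) for the anaconda virtio-serial log channel."""
    now = now or datetime.now()
    log_path = Path(test_dir) / f"anaconda-vm-{now:%Y%m%d-%H%M%S}.log"
    qemu_args = [
        # Host side: a chardev that writes everything to a file.
        "-chardev", f"file,id=anaclog,path={log_path}",
        # Guest side: a virtio-serial controller plus one named port on it.
        "-device", "virtio-serial-pci,id=vs1",
        "-device", f"virtserialport,chardev=anaclog,bus=vs1.0,name={PORT_NAME}",
    ]
    return log_path, qemu_args


log_path, qemu_args = anaconda_log_args("test", datetime(2026, 5, 5, 11, 59, 35))
print(log_path)
print(" ".join(qemu_args))
```

The timestamp in the file name keeps each VM run's log separate, so an earlier failure is never overwritten by the next attempt.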

We've lost logs three times in a row to anaconda failures + tmpfs
reboots. This breaks the loop.

## Diagnostic story

Before this commit: VM aborts → live ISO reboots itself → /tmp/
tmpfs gone → no logs → guess what failed. Three days, two and a
half false fixes.

After this commit: VM aborts → host has /home/admin/ai-lab/_github/veilor-os/test/anaconda-vm-*.log
with the actual scriptlet output, the actual exit codes, the
actual file-trigger failures. Future debug becomes evidence-based.

Files changed:
  kickstart/veilor-os.ks        — broad error suppression patch
  overlay/usr/local/bin/veilor-installer — --location=none + manual grub
  test/run-vm.sh                — virtio-serial chardev wiring

Verified: bash -n clean, ksvalidator clean.
veilor-org 2026-05-05 11:59:35 +01:00
parent 613d35402e
commit e83483a077
3 changed files with 147 additions and 23 deletions

kickstart/veilor-os.ks

@@ -304,40 +304,71 @@ EOF
 # - cpio_error / unpack_error / generic error → unchanged, still
 # raise PayloadInstallationError as anaconda intends. Real
 # transaction-fatal events still abort install (good).
+# Patch anaconda's transaction_progress.py to suppress dnf5's
+# transaction-error escalation under RPM 6.0 + cmdline mode.
+#
+# History of this patch:
+#
+# v0.5.28: BROAD patch — overrode `process_transaction_progress` so all
+# four 'error' token producers (cpio_error, script_error, unpack_error,
+# generic error) became log warnings. man-db scriptlet stopped killing
+# the install. BUT silent grub2-efi-x64 scriptlet failure left
+# /boot/efi/EFI/fedora/ incomplete → gen_grub_cfgstub failed.
+#
+# v0.5.29: NARROW patch — overrode only `script_error` callback. Caught
+# the per-package scriptlet failures cleanly. BUT dnf5 still tracks
+# its own internal error counter and emits a final aggregate
+# `error("transaction process has ended with errors..")` at end of
+# transaction, which still raised PayloadInstallationError. Install
+# aborted before bootloader install ran.
+#
+# v0.5.30: BROAD patch + bootloader --location=none in install ks.
+# This time we silence the aggregate error too, so install completes,
+# but anaconda is told NOT to install bootloader itself. The
+# generated install ks's chroot %post does it explicitly via
+# `dnf reinstall grub2-efi-x64 shim-x64 + grub2-install +
+# grub2-mkconfig + efibootmgr`. The chroot has PID 1 systemd state
+# from the live ISO (not the target), so scriptlets get a real
+# environment to run in, not anaconda's truncated chroot. This
+# sidesteps gen_grub_cfgstub entirely.
 TP=/usr/lib64/python3.14/site-packages/pyanaconda/modules/payloads/payload/dnf/transaction_progress.py
 if [ -f "$TP" ]; then
     cp -a "$TP" "${TP}.veilor-bak"
-    # Replace the script_error self._queue.put(('error', ...)) line with a
-    # warning log + return. The script_error method is uniquely identified
-    # by its `return_code` argument; sed targets that line specifically.
-    # `python3 -c` block is more robust than nested sed across multi-line
-    # statements; rewrite the whole script_error method body.
+    # Replace the entire `elif token == 'error':` branch with log+continue.
+    # Pattern matches the original two-line block (log.error + raise).
     python3 - "$TP" <<'PYEOF'
 import sys, re
 path = sys.argv[1]
 src = open(path).read()
-# Find the script_error method and replace the queue.put(...) line at its end
+# Match: elif token == 'error':\n log.error(msg)\n raise PayloadInstallationError(...)
+# Or any current substitution that looks like raise/log.warning at that level.
 new = re.sub(
-    r'( def script_error\(self, item, nevra, type, return_code\):.*?)\n self\._queue\.put\(\(.error., item\.get_package\(\)\.to_string\(\)\)\)',
-    r'\1\n log.warning("veilor: ignoring non-fatal scriptlet failure rc=%s for %s",\n return_code,\n item.get_package().to_string() if item else "unknown")\n # do NOT enqueue \'error\' — let install continue (RPM 6.0 cmdline regression workaround)',
+    r"elif token == 'error':\n log\.error\(msg\)\n (?:raise PayloadInstallationError\(\"An error occurred during the transaction: \" \+ msg\)|log\.warning\(\"veilor: ignoring non-fatal transaction error: %s\", msg\))",
+    "elif token == 'error':\n log.warning('veilor: suppressed dnf5 transaction error (RPM 6.0 cmdline regression): %s', msg)\n # Do not raise — anaconda --cmdline + dnf5 + RPM 6.0 emits this for any scriptlet\n # failure; we handle bootloader install manually in install ks %post chroot",
     src,
-    flags=re.DOTALL,
     count=1,
 )
 if new == src:
-    print("[ERR] script_error method not found in expected form — anaconda layout changed")
+    # Try fresh-anaconda layout (no veilor patch yet)
+    new = re.sub(
+        r"elif token == 'error':\n log\.error\(msg\)\n raise PayloadInstallationError\(\"An error occurred during the transaction: \" \+ msg\)",
+        "elif token == 'error':\n log.warning('veilor: suppressed dnf5 transaction error: %s', msg)",
+        src,
+        count=1,
+    )
+if new == src:
+    print("[ERR] transaction_progress.py error-branch not found")
     sys.exit(1)
 open(path, "w").write(new)
-print("[OK] transaction_progress.py: narrowed script_error override")
+print("[OK] transaction_progress.py: broad error-branch suppressed")
 PYEOF
-    if grep -q "veilor: ignoring non-fatal scriptlet" "$TP"; then
-        # Drop the cached .pyc so the patched .py is what runs.
+    if grep -q "veilor: suppressed dnf5 transaction error" "$TP"; then
         rm -f /usr/lib64/python3.14/site-packages/pyanaconda/modules/payloads/payload/dnf/__pycache__/transaction_progress.*.pyc 2>/dev/null || true
-        echo "[OK] anaconda transaction_progress.py patched in live rootfs (script_error only)"
+        echo "[OK] anaconda transaction_progress.py patched (broad error suppression)"
     else
-        echo "[WARN] transaction_progress.py patch did not apply — anaconda layout may have changed"
+        echo "[WARN] transaction_progress.py patch did not apply"
     fi
 else
     echo "[WARN] transaction_progress.py not found at expected path"

overlay/usr/local/bin/veilor-installer

@@ -405,14 +405,23 @@ __SSHKEY_DIRECTIVE__
 # - `vsyscall=none` — kill legacy vsyscall page (Position-Independent
 # ROP-gadget surface)
 # - `fbcon=nodefer` — keep linux framebuffer console alive through KMS
-# handoff so plymouth LUKS prompt and any boot-time text remain
-# visible on real GPU drivers (intel/amdgpu/nvidia). Already in live
-# ISO cmdline; was previously missing from installed-system cmdline,
-# which produced a black-screen boot on real hardware until KMS
-# stabilised.
-# Anaconda picks bootloader location (UEFI ESP or BIOS MBR) automatically;
-# `--location=mbr` would be cargo-cult on UEFI and risky on multi-disk.
-bootloader --append="lockdown=integrity slab_nomerge init_on_alloc=1 init_on_free=1 randomize_kstack_offset=on vsyscall=none fbcon=nodefer"
+# handoff so plymouth LUKS prompt remains visible on real GPUs.
+#
+# `--location=none` — DO NOT let anaconda install the bootloader. v0.5.30
+# moved bootloader install to %post chroot below for two reasons:
+# 1. Anaconda's gen_grub_cfgstub script (efi.py:194-201) runs
+# against an /boot/efi/EFI/fedora/ tree that grub2-efi-x64's
+# posttrans scriptlet may not have populated yet — Fedora 43's
+# RPM 6.0 + dnf5 + cmdline-mode anaconda combo is brittle here.
+# Reproduced as "gen_grub_cfgstub script failed" twice.
+# 2. Running grub2-install + grub2-mkconfig directly in %post lets
+# us pick up the env after anaconda finishes the package
+# transaction, with all scriptlets' file artifacts settled, and
+# gives clearer error messages if anything goes wrong.
+# We still install the packages (grub2-efi-x64, shim-x64, efibootmgr)
+# via %packages — anaconda just doesn't auto-invoke its bootloader code
+# path.
+bootloader --location=none --append="lockdown=integrity slab_nomerge init_on_alloc=1 init_on_free=1 randomize_kstack_offset=on vsyscall=none fbcon=nodefer"
 # Disk: zero, LUKS2 (argon2id), btrfs subvolumes (no LVM intermediary).
 # Native btrfs-on-LUKS matches Fedora KDE Spin defaults; LVM+btrfs combo
@@ -592,6 +601,71 @@ bash $REPO/scripts/kde-theme-apply.sh
 # on tty1 — text "Please enter passphrase for disk... :" — works in
 # QEMU sendkey AND on real hardware.
+# Manual bootloader install (anaconda told to skip via --location=none)
+#
+# Anaconda's gen_grub_cfgstub script-runner (efi.py:194-201) is
+# brittle on F43 RPM 6.0 cmdline mode — grub2-efi-x64's posttrans
+# scriptlet may emit non-fatal errors that anaconda treats as
+# transaction-fatal even with our error suppression patch. Doing
+# the work in %post chroot bypasses that whole code path and gives
+# us linear, debuggable steps.
+#
+# Order:
+# 1. Re-run grub2-efi-x64 + shim-x64 scriptlets cleanly (dnf
+# reinstall in chroot has full PID 1 systemd context, so the
+# systemd-run inside man-db-style triggers actually runs). Re-
+# installing repopulates /boot/efi/EFI/fedora/ if it was empty.
+# 2. grub2-install — generic + UEFI. UEFI path is the meaningful one
+# on virtio-vga and real hardware.
+# 3. grub2-mkconfig — write /boot/grub2/grub.cfg + /boot/efi/EFI/fedora/grub.cfg.
+# 4. efibootmgr — register the boot entry in NVRAM.
+#
+# Failure of any individual step is logged but does NOT abort the
+# %post (set +e bracket). On a real failure the user sees the
+# diagnostic text and can fix manually post-firstboot.
+set +e
+echo "════════════════════════════════════════════════════════"
+echo " bootloader install (manual; anaconda skipped via --location=none)"
+echo "════════════════════════════════════════════════════════"
+# Disk we're targeting — anaconda already wrote /boot/efi mount, so
+# the disk is whatever holds /boot/efi.
+EFI_DISK=$(findmnt -n -o SOURCE /boot/efi 2>/dev/null | sed -E 's/p?[0-9]+$//')
+[ -z "$EFI_DISK" ] && EFI_DISK="/dev/$(basename "$(realpath /sys/class/block/$(findmnt -n -o SOURCE /boot/efi | sed 's|/dev/||' | sed 's|p\?[0-9]\+$||') 2>/dev/null)")"
+echo "[INFO] target disk for grub: ${EFI_DISK:-<unknown>}"
+# Step 1: re-run grub2 + shim scriptlets in clean chroot
+dnf reinstall -y grub2-efi-x64 grub2-efi-x64-modules grub2-pc grub2-pc-modules grub2-tools grub2-tools-extra shim-x64 efibootmgr 2>&1 | tail -10 || \
+    echo "[WARN] dnf reinstall of grub stack failed"
+# Step 2: install grub to ESP (UEFI path, primary)
+mkdir -p /boot/efi/EFI/fedora
+grub2-install --target=x86_64-efi --efi-directory=/boot/efi --bootloader-id=fedora --no-nvram 2>&1 | tail -5 || \
+    echo "[WARN] grub2-install (efi) failed"
+# Step 3: write the EFI grub.cfg stub (what gen_grub_cfgstub would have done)
+if command -v gen_grub_cfgstub >/dev/null 2>&1; then
+    gen_grub_cfgstub /boot/grub2 /boot/efi/EFI/fedora 2>&1 | tail -5 || \
+        echo "[WARN] gen_grub_cfgstub failed (will fall back to grub2-mkconfig)"
+fi
+# Always also write a full grub.cfg in /boot/grub2 (BLS source) and
+# regenerate the EFI cfg via grub2-mkconfig for redundancy
+grub2-mkconfig -o /boot/grub2/grub.cfg 2>&1 | tail -3
+[ -f /boot/efi/EFI/fedora/grub.cfg ] || \
+    grub2-mkconfig -o /boot/efi/EFI/fedora/grub.cfg 2>&1 | tail -3
+# Step 4: register NVRAM entry
+if [ -n "$EFI_DISK" ] && [ -e "$EFI_DISK" ]; then
+    EFI_PART_NUM=$(findmnt -n -o SOURCE /boot/efi | grep -oE '[0-9]+$')
+    [ -n "$EFI_PART_NUM" ] && \
+        efibootmgr -c -d "$EFI_DISK" -p "$EFI_PART_NUM" -L "veilor-os" -l '\EFI\fedora\shimx64.efi' 2>&1 | tail -3 || \
+        echo "[WARN] efibootmgr failed (NVRAM may already be set)"
+fi
+echo "[INFO] bootloader install: see above for any [WARN] lines"
+set -e
 # GRUB branding: replace fedora distributor with veilor-os in menu titles.
 # Drop rhgb quiet from default cmdline → all kernel/systemd messages
 # visible. Plymouth `details` theme shows scrolling text (no splash, no

test/run-vm.sh

@@ -197,6 +197,24 @@ echo " ISO : $ISO"
 echo " Disk : $DISK"
 echo " NVRAM : $NVRAM"
 echo " Seed : ${SEED_ISO:-<none>}"
+# Anaconda virtio-serial log channel.
+#
+# Anaconda 43.x autodetects /dev/virtio-ports/org.fedoraproject.anaconda.log.0
+# and streams program/packaging/storage/anaconda logs through it in real
+# time, before any tmpfs / pivot, before networking. Survives kernel
+# panic. The host gets a tail-able file. No anaconda CLI flag, no
+# kickstart change, just the QEMU virtio-serial wiring.
+#
+# We've lost logs three times in a row to anaconda failures + tmpfs
+# reboots. Wiring this up so future failures auto-capture.
+ANACONDA_LOG="$TEST_DIR/anaconda-vm-$(date +%Y%m%d-%H%M%S).log"
+ANACONDA_LOG_ARGS=(
+    -chardev "file,id=anaclog,path=$ANACONDA_LOG"
+    -device virtio-serial-pci,id=vs1
+    -device "virtserialport,chardev=anaclog,bus=vs1.0,name=org.fedoraproject.anaconda.log.0"
+)
+echo " AnaLog: $ANACONDA_LOG"
 echo " Mode : ${SECBOOT:+secboot}${SECBOOT:-stock UEFI}"
 echo " Inject: ${HOST_PUBKEY:+yes}${HOST_PUBKEY:-no (no host pubkey)}"
 echo "════════════════════════════════════════════════════════"
@@ -215,6 +233,7 @@ exec qemu-system-x86_64 \
     -drive file="$ISO",media=cdrom,readonly=on \
     "${SEED_ARGS[@]}" \
     "${MONITOR_ARGS[@]}" \
+    "${ANACONDA_LOG_ARGS[@]}" \
     -boot menu=on,splash-time=2000 \
     -netdev user,id=net0,hostfwd=tcp::2222-:22 \
     -device virtio-net-pci,netdev=net0 \