v0.5.30: broad error suppression + manual bootloader + virtio log capture
Three-layer fix for the persistent anaconda transaction failure that
killed v0.5.28 (gen_grub_cfgstub) and v0.5.29 (aggregate dnf5 error).
## Layer 1: broad error suppression in transaction_progress.py
dnf5 under RPM 6.0 + cmdline anaconda emits a final aggregate
`error("transaction process has ended with errors..")` at end of
transaction whenever its internal failure counter > 0, regardless of
whether we suppressed individual script_error events. Reproduced
twice. The narrow patch in v0.5.29 suppressed per-package errors but
the aggregate still raised PayloadInstallationError and aborted the
install before the bootloader phase ran.
The v0.5.30 patch turns the `elif token == 'error':` branch in
process_transaction_progress into a log.warning. All four producers
(cpio_error, script_error, unpack_error, generic error) now end in a
warning and the install continues. The pattern matches both the original
anaconda layout and the v0.5.29 narrow-patched layout, so the patch
applies cleanly on top of either.
This brings us back to v0.5.28 broad-suppression behaviour. The
side effect that bit us in v0.5.28 (silent grub2-efi-x64 scriptlet
failure → empty /boot/efi/EFI/fedora/ → gen_grub_cfgstub fails)
is addressed by Layer 2 below.
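To make the branch rewrite concrete, here is a minimal Python sketch of the same idea. It is not the committed patch script: the `SAMPLE` source text, the whitespace-tolerant pattern, the function name `suppress_error_branch`, and the explicit already-patched check via `MARKER` are assumptions added for illustration.

```python
import re

# Marker string taken from the warning text used by the v0.5.30 patch.
MARKER = "veilor: suppressed dnf5 transaction error"

# Assumed shape of the unpatched branch; the real indentation may differ,
# so the pattern accepts any whitespace before the two statements.
ERROR_BRANCH = re.compile(
    r"(?P<indent>[ \t]*)elif token == 'error':\n"
    r"\s+log\.error\(msg\)\n"
    r"\s+raise PayloadInstallationError\([^\n]*\)"
)


def suppress_error_branch(src: str) -> str:
    """Return src with the 'error' branch downgraded to a warning."""
    if MARKER in src:
        # Already patched: return the text unchanged so re-runs are harmless.
        return src

    def replacement(match: re.Match) -> str:
        indent = match.group("indent")
        return (
            f"{indent}elif token == 'error':\n"
            f"{indent}    log.warning('{MARKER}: %s', msg)"
        )

    patched, count = ERROR_BRANCH.subn(replacement, src, count=1)
    if count == 0:
        raise ValueError("error branch not found; layout may have changed")
    return patched


# Hypothetical excerpt standing in for transaction_progress.py.
SAMPLE = (
    "    if token == 'install':\n"
    "        pass\n"
    "    elif token == 'error':\n"
    "        log.error(msg)\n"
    "        raise PayloadInstallationError(\"An error occurred during the transaction: \" + msg)\n"
)

once = suppress_error_branch(SAMPLE)
twice = suppress_error_branch(once)  # second run leaves the text as-is
```

Checking for the marker first is one way to make a second run a true no-op; without it, a pattern that only knows the unpatched layout would report "not found" on an already-patched file.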
## Layer 2: bootloader install moved out of anaconda
The generated install kickstart now has `bootloader --location=none`,
which tells anaconda NOT to invoke its own bootloader install code
path (and therefore NOT to call gen_grub_cfgstub). All grub work
moves into the chroot %post block:
1. `dnf reinstall grub2-efi-x64 grub2-pc grub2-tools shim-x64
   efibootmgr`: re-runs scriptlets in the chroot with full
   PID 1 systemd state, so the systemd-run-style triggers that
   anaconda's chroot truncates actually execute.
2. `grub2-install --target=x86_64-efi --efi-directory=/boot/efi
   --bootloader-id=fedora --no-nvram`: populates /boot/efi/EFI/fedora/.
3. `gen_grub_cfgstub /boot/grub2 /boot/efi/EFI/fedora` when the tool is
   available, then `grub2-mkconfig -o /boot/grub2/grub.cfg`; if
   /boot/efi/EFI/fedora/grub.cfg is still missing, `grub2-mkconfig`
   writes it as a fallback.
4. `efibootmgr -c -d <disk> -p <part> -L "veilor-os" -l \EFI\fedora\shimx64.efi`:
   registers the NVRAM boot entry pointing at the signed shim.
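Step 4 needs the disk and the partition number of the ESP. The following Python sketch shows one way to derive both from the partition node (as `findmnt -n -o SOURCE /boot/efi` would report it) and to assemble the `efibootmgr` argument list. The helper names `split_partition` and `efibootmgr_args` are hypothetical, and device-mapper or other non-standard node names are deliberately not handled.

```python
import re


def split_partition(device: str) -> tuple[str, int]:
    """Split a partition node such as /dev/vda1 into (disk, partition number)."""
    # Disks whose name ends in a digit (nvme0n1, mmcblk0) put a "p" before
    # the partition number; other disks (vda, sda) append the number directly.
    match = re.fullmatch(r"(.*\d)p(\d+)", device) or re.fullmatch(r"(.*\D)(\d+)", device)
    if match is None:
        raise ValueError(f"cannot derive disk and partition from {device!r}")
    return match.group(1), int(match.group(2))


def efibootmgr_args(esp_device: str, label: str = "veilor-os") -> list[str]:
    """Build the efibootmgr argv for the NVRAM entry from the ESP partition node."""
    disk, part = split_partition(esp_device)
    return [
        "efibootmgr", "-c",
        "-d", disk,
        "-p", str(part),
        "-L", label,
        "-l", r"\EFI\fedora\shimx64.efi",
    ]


virtio = split_partition("/dev/vda1")
nvme = split_partition("/dev/nvme0n1p2")
argv = efibootmgr_args("/dev/nvme0n1p1")
```

Trying the "digit, then p" form first keeps a disk such as `/dev/sdp` from losing its trailing letter when the partition suffix is stripped.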
Each step logs to stdout and continues on failure (`set +e` block);
diagnostics surface in the install log without aborting the whole
%post.
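The log-and-continue behaviour can be sketched in Python as follows. This is an illustration under assumptions, not the %post code: `run_step` is a hypothetical helper, and the stand-in commands only exist so the example runs anywhere. The sketch checks each command's own exit status; in a shell pipeline of the form `cmd | tail -N || echo ...`, the `||` sees the status of `tail` unless `pipefail` is set.

```python
import subprocess
import sys


def run_step(name: str, argv: list[str], tail: int = 10) -> bool:
    """Run one step, print its last `tail` output lines, never raise on failure."""
    try:
        proc = subprocess.run(
            argv, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
        )
    except OSError as exc:  # for example, the binary is missing
        print(f"[WARN] {name} could not start: {exc}")
        return False
    for line in proc.stdout.splitlines()[-tail:]:
        print(line)
    if proc.returncode != 0:
        print(f"[WARN] {name} failed (exit {proc.returncode})")
        return False
    return True


# Stand-in commands; a real run would pass the dnf, grub2-install,
# grub2-mkconfig and efibootmgr argument lists instead.
results = [
    run_step("ok-step", [sys.executable, "-c", "print('done')"]),
    run_step("bad-step", [sys.executable, "-c", "import sys; sys.exit(3)"]),
    run_step("after-bad", [sys.executable, "-c", "print('still runs')"]),
]
```

The third step still runs after the second one fails, which is the property the `set +e` block is meant to provide.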
## Layer 3: virtio-serial log capture in run-vm.sh
Anaconda 43.x autodetects `/dev/virtio-ports/org.fedoraproject.anaconda.log.0`
and streams the program, packaging, storage, and anaconda logs through it
in real time: before any tmpfs / pivot, before networking, and even
through a kernel panic. Wiring it into run-vm.sh means the host gets a
tail-able log file at `test/anaconda-vm-YYYYMMDD-HHMMSS.log` for
every VM run.
We've lost logs three times in a row to anaconda failures + tmpfs
reboots. This breaks the loop.
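As a rough illustration of the wiring, this Python sketch builds the timestamped host log path and the three QEMU arguments named in this commit. The function name `anaconda_log_args` and the fixed example timestamp are assumptions; the actual change lives in the bash script run-vm.sh.

```python
from datetime import datetime
from pathlib import Path

# Port name anaconda looks for inside the guest (from the commit message).
ANACONDA_PORT = "org.fedoraproject.anaconda.log.0"


def anaconda_log_args(test_dir: str, now: datetime) -> tuple[Path, list[str]]:
    """Return the host log path and the QEMU arguments that feed it."""
    log_path = Path(test_dir) / f"anaconda-vm-{now:%Y%m%d-%H%M%S}.log"
    args = [
        # Host side: a file-backed character device.
        "-chardev", f"file,id=anaclog,path={log_path}",
        # Guest side: a virtio-serial controller plus one named port on it.
        "-device", "virtio-serial-pci,id=vs1",
        "-device", f"virtserialport,chardev=anaclog,bus=vs1.0,name={ANACONDA_PORT}",
    ]
    return log_path, args


log_path, qemu_args = anaconda_log_args("test", datetime(2025, 1, 2, 3, 4, 5))
```

The returned list can be appended to an existing QEMU command line; the host then follows the file with `tail -f` while the VM runs.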
## Diagnostic story
Before this commit: VM aborts → live ISO reboots itself → /tmp/
tmpfs gone → no logs → guess what failed. Three days, two and a
half false fixes.
After this commit: VM aborts → host has /home/admin/ai-lab/_github/veilor-os/test/anaconda-vm-*.log
with the actual scriptlet output, the actual exit codes, the
actual file-trigger failures. Future debug becomes evidence-based.
Files changed:
kickstart/veilor-os.ks — broad error suppression patch
overlay/usr/local/bin/veilor-installer — --location=none + manual grub
test/run-vm.sh — virtio-serial chardev wiring
Verified: bash -n clean, ksvalidator clean.
parent 613d35402e
commit e83483a077
3 changed files with 147 additions and 23 deletions
--- a/kickstart/veilor-os.ks
+++ b/kickstart/veilor-os.ks
@@ -304,40 +304,71 @@ EOF
 # - cpio_error / unpack_error / generic error → unchanged, still
 # raise PayloadInstallationError as anaconda intends. Real
 # transaction-fatal events still abort install (good).
+# Patch anaconda's transaction_progress.py to suppress dnf5's
+# transaction-error escalation under RPM 6.0 + cmdline mode.
+#
+# History of this patch:
+#
+# v0.5.28: BROAD patch — overrode `process_transaction_progress` so all
+# four 'error' token producers (cpio_error, script_error, unpack_error,
+# generic error) became log warnings. man-db scriptlet stopped killing
+# the install. BUT silent grub2-efi-x64 scriptlet failure left
+# /boot/efi/EFI/fedora/ incomplete → gen_grub_cfgstub failed.
+#
+# v0.5.29: NARROW patch — overrode only `script_error` callback. Caught
+# the per-package scriptlet failures cleanly. BUT dnf5 still tracks
+# its own internal error counter and emits a final aggregate
+# `error("transaction process has ended with errors..")` at end of
+# transaction, which still raised PayloadInstallationError. Install
+# aborted before bootloader install ran.
+#
+# v0.5.30: BROAD patch + bootloader --location=none in install ks.
+# This time we silence the aggregate error too, so install completes,
+# but anaconda is told NOT to install bootloader itself. The
+# generated install ks's chroot %post does it explicitly via
+# `dnf reinstall grub2-efi-x64 shim-x64 + grub2-install +
+# grub2-mkconfig + efibootmgr`. The chroot has PID 1 systemd state
+# from the live ISO (not the target), so scriptlets get a real
+# environment to run in, not anaconda's truncated chroot. This
+# sidesteps gen_grub_cfgstub entirely.
 TP=/usr/lib64/python3.14/site-packages/pyanaconda/modules/payloads/payload/dnf/transaction_progress.py
 if [ -f "$TP" ]; then
 cp -a "$TP" "${TP}.veilor-bak"
 
-# Replace the script_error self._queue.put(('error', ...)) line with a
-# warning log + return. The script_error method is uniquely identified
-# by its `return_code` argument; sed targets that line specifically.
-# `python3 -c` block is more robust than nested sed across multi-line
-# statements; rewrite the whole script_error method body.
+# Replace the entire `elif token == 'error':` branch with log+continue.
+# Pattern matches the original two-line block (log.error + raise).
 python3 - "$TP" <<'PYEOF'
 import sys, re
 path = sys.argv[1]
 src = open(path).read()
-# Find the script_error method and replace the queue.put(...) line at its end
+# Match: elif token == 'error':\n log.error(msg)\n raise PayloadInstallationError(...)
+# Or any current substitution that looks like raise/log.warning at that level.
 new = re.sub(
-r'( def script_error\(self, item, nevra, type, return_code\):.*?)\n self\._queue\.put\(\(.error., item\.get_package\(\)\.to_string\(\)\)\)',
-r'\1\n log.warning("veilor: ignoring non-fatal scriptlet failure rc=%s for %s",\n return_code,\n item.get_package().to_string() if item else "unknown")\n # do NOT enqueue \'error\' — let install continue (RPM 6.0 cmdline regression workaround)',
+r"elif token == 'error':\n log\.error\(msg\)\n (?:raise PayloadInstallationError\(\"An error occurred during the transaction: \" \+ msg\)|log\.warning\(\"veilor: ignoring non-fatal transaction error: %s\", msg\))",
+"elif token == 'error':\n log.warning('veilor: suppressed dnf5 transaction error (RPM 6.0 cmdline regression): %s', msg)\n # Do not raise — anaconda --cmdline + dnf5 + RPM 6.0 emits this for any scriptlet\n # failure; we handle bootloader install manually in install ks %post chroot",
 src,
-flags=re.DOTALL,
 count=1,
 )
 if new == src:
-print("[ERR] script_error method not found in expected form — anaconda layout changed")
+# Try fresh-anaconda layout (no veilor patch yet)
+new = re.sub(
+r"elif token == 'error':\n log\.error\(msg\)\n raise PayloadInstallationError\(\"An error occurred during the transaction: \" \+ msg\)",
+"elif token == 'error':\n log.warning('veilor: suppressed dnf5 transaction error: %s', msg)",
+src,
+count=1,
+)
+if new == src:
+print("[ERR] transaction_progress.py error-branch not found")
 sys.exit(1)
 open(path, "w").write(new)
-print("[OK] transaction_progress.py: narrowed script_error override")
+print("[OK] transaction_progress.py: broad error-branch suppressed")
 PYEOF
 
-if grep -q "veilor: ignoring non-fatal scriptlet" "$TP"; then
-# Drop the cached .pyc so the patched .py is what runs.
+if grep -q "veilor: suppressed dnf5 transaction error" "$TP"; then
 rm -f /usr/lib64/python3.14/site-packages/pyanaconda/modules/payloads/payload/dnf/__pycache__/transaction_progress.*.pyc 2>/dev/null || true
-echo "[OK] anaconda transaction_progress.py patched in live rootfs (script_error only)"
+echo "[OK] anaconda transaction_progress.py patched (broad error suppression)"
 else
-echo "[WARN] transaction_progress.py patch did not apply — anaconda layout may have changed"
+echo "[WARN] transaction_progress.py patch did not apply"
 fi
 else
 echo "[WARN] transaction_progress.py not found at expected path"
--- a/overlay/usr/local/bin/veilor-installer
+++ b/overlay/usr/local/bin/veilor-installer
@@ -405,14 +405,23 @@ __SSHKEY_DIRECTIVE__
 # - `vsyscall=none` — kill legacy vsyscall page (Position-Independent
 # ROP-gadget surface)
 # - `fbcon=nodefer` — keep linux framebuffer console alive through KMS
-# handoff so plymouth LUKS prompt and any boot-time text remain
-# visible on real GPU drivers (intel/amdgpu/nvidia). Already in live
-# ISO cmdline; was previously missing from installed-system cmdline,
-# which produced a black-screen boot on real hardware until KMS
-# stabilised.
-# Anaconda picks bootloader location (UEFI ESP or BIOS MBR) automatically;
-# `--location=mbr` would be cargo-cult on UEFI and risky on multi-disk.
-bootloader --append="lockdown=integrity slab_nomerge init_on_alloc=1 init_on_free=1 randomize_kstack_offset=on vsyscall=none fbcon=nodefer"
+# handoff so plymouth LUKS prompt remains visible on real GPUs.
+#
+# `--location=none` — DO NOT let anaconda install the bootloader. v0.5.30
+# moved bootloader install to %post chroot below for two reasons:
+# 1. Anaconda's gen_grub_cfgstub script (efi.py:194-201) runs
+# against an /boot/efi/EFI/fedora/ tree that grub2-efi-x64's
+# posttrans scriptlet may not have populated yet — Fedora 43's
+# RPM 6.0 + dnf5 + cmdline-mode anaconda combo is brittle here.
+# Reproduced as "gen_grub_cfgstub script failed" twice.
+# 2. Running grub2-install + grub2-mkconfig directly in %post lets
+# us pick up the env after anaconda finishes the package
+# transaction, with all scriptlets' file artifacts settled, and
+# gives clearer error messages if anything goes wrong.
+# We still install the packages (grub2-efi-x64, shim-x64, efibootmgr)
+# via %packages — anaconda just doesn't auto-invoke its bootloader code
+# path.
+bootloader --location=none --append="lockdown=integrity slab_nomerge init_on_alloc=1 init_on_free=1 randomize_kstack_offset=on vsyscall=none fbcon=nodefer"
 
 # Disk: zero, LUKS2 (argon2id), btrfs subvolumes (no LVM intermediary).
 # Native btrfs-on-LUKS matches Fedora KDE Spin defaults; LVM+btrfs combo
@@ -592,6 +601,71 @@ bash $REPO/scripts/kde-theme-apply.sh
 # on tty1 — text "Please enter passphrase for disk... :" — works in
 # QEMU sendkey AND on real hardware.
 
+# Manual bootloader install (anaconda told to skip via --location=none)
+#
+# Anaconda's gen_grub_cfgstub script-runner (efi.py:194-201) is
+# brittle on F43 RPM 6.0 cmdline mode — grub2-efi-x64's posttrans
+# scriptlet may emit non-fatal errors that anaconda treats as
+# transaction-fatal even with our error suppression patch. Doing
+# the work in %post chroot bypasses that whole code path and gives
+# us linear, debuggable steps.
+#
+# Order:
+# 1. Re-run grub2-efi-x64 + shim-x64 scriptlets cleanly (dnf
+# reinstall in chroot has full PID 1 systemd context, so the
+# systemd-run inside man-db-style triggers actually runs). Re-
+# installing repopulates /boot/efi/EFI/fedora/ if it was empty.
+# 2. grub2-install — generic + UEFI. UEFI path is the meaningful one
+# on virtio-vga and real hardware.
+# 3. grub2-mkconfig — write /boot/grub2/grub.cfg + /boot/efi/EFI/fedora/grub.cfg.
+# 4. efibootmgr — register the boot entry in NVRAM.
+#
+# Failure of any individual step is logged but does NOT abort the
+# %post (set +e bracket). On a real failure the user sees the
+# diagnostic text and can fix manually post-firstboot.
+
+set +e
+echo "════════════════════════════════════════════════════════"
+echo " bootloader install (manual; anaconda skipped via --location=none)"
+echo "════════════════════════════════════════════════════════"
+
+# Disk we're targeting — anaconda already wrote /boot/efi mount, so
+# the disk is whatever holds /boot/efi.
+EFI_DISK=$(findmnt -n -o SOURCE /boot/efi 2>/dev/null | sed -E 's/p?[0-9]+$//')
+[ -z "$EFI_DISK" ] && EFI_DISK="/dev/$(basename "$(realpath /sys/class/block/$(findmnt -n -o SOURCE /boot/efi | sed 's|/dev/||' | sed 's|p\?[0-9]\+$||') 2>/dev/null)")"
+echo "[INFO] target disk for grub: ${EFI_DISK:-<unknown>}"
+
+# Step 1: re-run grub2 + shim scriptlets in clean chroot
+dnf reinstall -y grub2-efi-x64 grub2-efi-x64-modules grub2-pc grub2-pc-modules grub2-tools grub2-tools-extra shim-x64 efibootmgr 2>&1 | tail -10 || \
+echo "[WARN] dnf reinstall of grub stack failed"
+
+# Step 2: install grub to ESP (UEFI path, primary)
+mkdir -p /boot/efi/EFI/fedora
+grub2-install --target=x86_64-efi --efi-directory=/boot/efi --bootloader-id=fedora --no-nvram 2>&1 | tail -5 || \
+echo "[WARN] grub2-install (efi) failed"
+
+# Step 3: write the EFI grub.cfg stub (what gen_grub_cfgstub would have done)
+if command -v gen_grub_cfgstub >/dev/null 2>&1; then
+gen_grub_cfgstub /boot/grub2 /boot/efi/EFI/fedora 2>&1 | tail -5 || \
+echo "[WARN] gen_grub_cfgstub failed (will fall back to grub2-mkconfig)"
+fi
+# Always also write a full grub.cfg in /boot/grub2 (BLS source) and
+# regenerate the EFI cfg via grub2-mkconfig for redundancy
+grub2-mkconfig -o /boot/grub2/grub.cfg 2>&1 | tail -3
+[ -f /boot/efi/EFI/fedora/grub.cfg ] || \
+grub2-mkconfig -o /boot/efi/EFI/fedora/grub.cfg 2>&1 | tail -3
+
+# Step 4: register NVRAM entry
+if [ -n "$EFI_DISK" ] && [ -e "$EFI_DISK" ]; then
+EFI_PART_NUM=$(findmnt -n -o SOURCE /boot/efi | grep -oE '[0-9]+$')
+[ -n "$EFI_PART_NUM" ] && \
+efibootmgr -c -d "$EFI_DISK" -p "$EFI_PART_NUM" -L "veilor-os" -l '\EFI\fedora\shimx64.efi' 2>&1 | tail -3 || \
+echo "[WARN] efibootmgr failed (NVRAM may already be set)"
+fi
+
+echo "[INFO] bootloader install: see above for any [WARN] lines"
+set -e
+
 # GRUB branding: replace fedora distributor with veilor-os in menu titles.
 # Drop rhgb quiet from default cmdline → all kernel/systemd messages
 # visible. Plymouth `details` theme shows scrolling text (no splash, no
--- a/test/run-vm.sh
+++ b/test/run-vm.sh
@@ -197,6 +197,24 @@ echo " ISO : $ISO"
 echo " Disk : $DISK"
 echo " NVRAM : $NVRAM"
 echo " Seed : ${SEED_ISO:-<none>}"
+# Anaconda virtio-serial log channel.
+#
+# Anaconda 43.x autodetects /dev/virtio-ports/org.fedoraproject.anaconda.log.0
+# and streams program/packaging/storage/anaconda logs through it in real
+# time, before any tmpfs / pivot, before networking. Survives kernel
+# panic. The host gets a tail-able file. No anaconda CLI flag, no
+# kickstart change, just the QEMU virtio-serial wiring.
+#
+# We've lost logs three times in a row to anaconda failures + tmpfs
+# reboots. Wiring this up so future failures auto-capture.
+ANACONDA_LOG="$TEST_DIR/anaconda-vm-$(date +%Y%m%d-%H%M%S).log"
+ANACONDA_LOG_ARGS=(
+-chardev "file,id=anaclog,path=$ANACONDA_LOG"
+-device virtio-serial-pci,id=vs1
+-device "virtserialport,chardev=anaclog,bus=vs1.0,name=org.fedoraproject.anaconda.log.0"
+)
+echo " AnaLog: $ANACONDA_LOG"
+
 echo " Mode : ${SECBOOT:+secboot}${SECBOOT:-stock UEFI}"
 echo " Inject: ${HOST_PUBKEY:+yes}${HOST_PUBKEY:-no (no host pubkey)}"
 echo "════════════════════════════════════════════════════════"
@@ -215,6 +233,7 @@ exec qemu-system-x86_64 \
 -drive file="$ISO",media=cdrom,readonly=on \
 "${SEED_ARGS[@]}" \
 "${MONITOR_ARGS[@]}" \
+"${ANACONDA_LOG_ARGS[@]}" \
 -boot menu=on,splash-time=2000 \
 -netdev user,id=net0,hostfwd=tcp::2222-:22 \
 -device virtio-net-pci,netdev=net0 \