Three-layer fix for the persistent anaconda transaction failure that
killed v0.5.28 (gen_grub_cfgstub) and v0.5.29 (aggregate dnf5 error).
## Layer 1: broad error suppression in transaction_progress.py
dnf5 under RPM 6.0 + cmdline anaconda emits a final aggregate
`error("transaction process has ended with errors..")` at end of
transaction whenever its internal failure counter > 0, regardless of
whether we suppressed individual script_error events. Reproduced
twice. The narrow patch in v0.5.29 suppressed per-package errors but
the aggregate still raised PayloadInstallationError and aborted the
install before the bootloader phase ran.
v0.5.30 patch turns the `elif token == 'error':` branch in
process_transaction_progress into a log.warning. All four producers
(cpio_error, script_error, unpack_error, generic error) now flow
through to a warning + continue. Pattern matches both the original
anaconda layout AND the v0.5.29 narrow-patched layout, so re-applying
on top of either is a no-op.
This brings us back to v0.5.28 broad-suppression behaviour. The
side effect that bit us in v0.5.28 (silent grub2-efi-x64 scriptlet
failure → empty /boot/efi/EFI/fedora/ → gen_grub_cfgstub fails)
is addressed by Layer 2 below.
## Layer 2: bootloader install moved out of anaconda
The generated install kickstart now has `bootloader --location=none`,
which tells anaconda NOT to invoke its own bootloader install code
path (and therefore NOT to call gen_grub_cfgstub). All grub work
moves into the chroot %post block:
1. `dnf reinstall grub2-efi-x64 grub2-pc grub2-tools shim-x64
efibootmgr` — re-runs scriptlets in the chroot with full
PID 1 systemd state, so the systemd-run-style triggers that
anaconda's chroot truncates actually execute.
2. `grub2-install --target=x86_64-efi --efi-directory=/boot/efi
--bootloader-id=fedora --no-nvram` — populates /boot/efi/EFI/fedora/
3. `gen_grub_cfgstub /boot/grub2 /boot/efi/EFI/fedora` (or
`grub2-mkconfig` fallback) — writes /boot/efi/EFI/fedora/grub.cfg.
4. `efibootmgr -c -d <disk> -p <part> -L "veilor-os" -l \EFI\fedora\shimx64.efi`
— registers the NVRAM boot entry pointing at the signed shim.
Each step logs to stdout and continues on failure (`set +e` block);
diagnostics surface in the install log without aborting the whole
%post.
## Layer 3: virtio-serial log capture in run-vm.sh
Anaconda 43.x autodetects `/dev/virtio-ports/org.fedoraproject.anaconda.log.0`
and streams program/packaging/storage/anaconda logs through it in
real time, before any tmpfs / pivot, before networking, surviving
kernel panic. Wiring it into run-vm.sh means the host gets a
tail-able log file at `test/anaconda-vm-YYYYMMDD-HHMMSS.log` for
every VM run.
We've lost logs three times in a row to anaconda failures + tmpfs
reboots. This breaks the loop.
## Diagnostic story
Before this commit: VM aborts → live ISO reboots itself → /tmp/
tmpfs gone → no logs → guess what failed. Three days, two and a
half false fixes.
After this commit: VM aborts → host has /home/admin/ai-lab/_github/veilor-os/test/anaconda-vm-*.log
with the actual scriptlet output, the actual exit codes, the
actual file-trigger failures. Future debug becomes evidence-based.
Files changed:
kickstart/veilor-os.ks — broad error suppression patch
overlay/usr/local/bin/veilor-installer — --location=none + manual grub
test/run-vm.sh — virtio-serial chardev wiring
Verified: bash -n clean, ksvalidator clean.
|
||
|---|---|---|
| .. | ||
| test-runs | ||
| auto-install-keymap.sh | ||
| auto-install.sh | ||
| boot-checklist.md | ||
| METHOD-CHANGELOG.md | ||
| README.md | ||
| run-vm.sh | ||
| TESTING.md | ||
test/
Test harnesses for veilor-os ISO builds.
Files
| File | Purpose |
|---|---|
run-vm.sh |
Manual smoke test — boot the latest ISO interactively in QEMU/KVM. SSH key injection via cloud-init seed + monitor sendkey fallback for live-image login. |
auto-install.sh |
Autonomous end-to-end install test. Boots ISO, drives the gum installer via QEMU monitor sendkey, waits for anaconda to finish + reboot, SSHs into the installed system, runs validation checklist. Prints PASS/FAIL summary. |
auto-install-keymap.sh |
Sourced helper. Provides km_send_str, km_send_chord, km_send_key, km_screendump, km_wait_socket, etc. Reusable by other automation. |
boot-checklist.md |
Manual post-install checklist (run on a real spare laptop). |
Running the autonomous installer test
./test/auto-install.sh build/out/veilor-os-*.iso
Hardcoded inputs (deterministic — do not edit during a test run):
- Disk: first
/dev/vda(the only disk in QEMU) - Hostname:
veilor(installer hardcoded since v0.5.4) - LUKS passphrase:
testpass1234 - Admin password:
adminpass1234 - Locale:
en_GB.UTF-8
Expected runtime: 20–30 minutes wall clock (anaconda dominates).
Outputs
/tmp/veilor-auto-install.log— full driver log/tmp/veilor-auto-install-NN-<step>.png— milestone screenshots/tmp/veilor-auto-install-final-ssh.txt— final SSH session capture (uname/lsblk/cmdline/failed units)
Exit codes
0— all validation checks passed1— any failure (anaconda crashed, SSH never came up, validation check failed)2— preflight failure (missing tool, bad ISO arg, missing OVMF)
Prerequisites
qemu-system-x86_64,qemu-img,socat,ssh,ssh-keygenedk2-ovmf(OVMF UEFI firmware at/usr/share/edk2/ovmf/OVMF_{CODE,VARS}.fd)mkisofsorxorriso(for cloud-init seed ISO; harness falls back to TTY1 driving if seed cannot be built or cloud-init does not run on the installed system)convertfrom ImageMagick (optional — converts PPM screendumps to PNG; harness keeps PPM if absent)- KVM access (
/dev/kvmreadable by the user)
What it validates
Post-install on the booted system:
/etc/os-release→NAME=veilor-oshostnamectl --static→veilorsystemctl is-active→activeforsshd fail2ban usbguard tuned auditd firewalld chronyd sddmgetenforce→Enforcing(preferred) orPermissive(acceptable for v0.5.x)lsblk -fshowscrypto_LUKS+btrfs/etc/crypttabhas a LUKS entrygetent passwd adminreturns the user/usr/local/bin/{veilor-power,veilor-doctor,veilor-update}are present and executable/proc/cmdlinecontainsinit_on_alloc=1
Troubleshooting
- Stuck at boot banner: ISO didn't autostart
veilor-installeron tty1. Checkserial.logandauto-install-vm-NN-*.pngscreenshots. The harness aborts after 5 minutes of identical screen frames. - SSH never up: cloud-init may not have run on the installed system (no
cidatamount). The harness falls back to TTY1 driving — typing the LUKS passphrase, logging in as admin, and hand-injecting the SSH key. If both paths fail, validation cannot proceed. screendumpproduces unreadable PPM: install ImageMagick (dnf install ImageMagick) so the harness converts to PNG.