Five-fix bundle from 7-agent research wave on the v0.5.28-final
gen_grub_cfgstub failure.
## 1. Narrow the anaconda transaction_progress patch (CRITICAL)
The v0.5.28 patch was too broad. It rewrote
`process_transaction_progress` so every 'error' token in the
transaction queue became a `log.warning`. That queue carries four
distinct error classes:
- cpio_error — payload extraction (genuinely fatal)
- script_error — RPM 6.0 cmdline-mode scriptlet warning-as-error
(the ONE we want to ignore)
- unpack_error — payload corruption (genuinely fatal)
- error — generic transaction error (genuinely fatal)
By swallowing all four we silently masked grub2-efi-x64's posttrans
failure mid-install. /boot/efi/EFI/fedora/ ended up incomplete →
gen_grub_cfgstub then failed at the bootloader install phase with
"gen_grub_cfgstub script failed" because its `set -eu` script
couldn't read the missing files.
v0.5.29 narrows the patch: override only the `script_error` callback
inside transaction_progress.py to log a warning and NOT enqueue
'error'. The consumer (`process_transaction_progress`) reverts to
upstream behaviour where cpio_error / unpack_error / error still
raise PayloadInstallationError. Real install-fatal events keep
aborting; only the F43-RPM-6.0 scriptlet regression is silenced.
The patch is applied via `python3 -c` regex rewrite (more robust
than nested sed across multi-line method bodies).
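
A minimal sketch of such a regex rewrite, assuming the callback is a method named `script_error` whose body is indented deeper than its `def` line, and that the module already defines a logger called `log`. The function name `narrow_script_error`, the warning text, and the method shape are illustrative assumptions, not the actual upstream source or the shipped patch:

```shell
# narrow_script_error FILE
# Replace the body of the (assumed) script_error method with a log.warning,
# leaving every other producer of 'error' untouched.
narrow_script_error() {
    python3 - "$1" <<'PY'
import re
import sys

path = sys.argv[1]
with open(path) as fh:
    src = fh.read()

# Header line plus every following line indented deeper than the header,
# including blank lines that trail the body.
pattern = re.compile(
    r'^(?P<indent>[ \t]*)def script_error\((?P<args>[^)]*)\):\n'
    r'(?:(?P=indent)[ \t]+.*\n|[ \t]*\n)*',
    re.MULTILINE,
)

def replacement(m):
    ind = m.group('indent')
    return (
        f"{ind}def script_error({m.group('args')}):\n"
        f"{ind}    log.warning('scriptlet error ignored')\n\n"
    )

new_src, count = pattern.subn(replacement, src)
if count != 1:
    # Refuse to write anything if the file does not look as assumed.
    sys.exit(f"expected exactly one script_error method, found {count}")

with open(path, 'w') as fh:
    fh.write(new_src)
PY
}
```

Failing when the match count is not exactly one keeps the build from silently shipping an unpatched or doubly patched file if the upstream layout changes.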
## 2. LUKS UX — `tries=5,timeout=0` (FIX)
Default cryptsetup-generator unit allows ONE passphrase try with a
1m30s wait. One typo on a long passphrase = wait 1m30s, then the
device-wait timer trips, then dracut emergency shell after 3min total.
Brutal. Adding `rd.luks.options=luks-XXX=tries=5,timeout=0` gives
five typo-friendly retries with no auto-timeout.
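
As a small illustration, the argument can be built from the mapping identifier. The helper name `luks_retry_arg` is hypothetical, and the identifier is passed through exactly as given (the `luks-XXX` placeholder above is not expanded here):

```shell
# luks_retry_arg ID
# Print the kernel argument that grants five tries with no timeout for ID.
luks_retry_arg() {
    printf 'rd.luks.options=%s=tries=5,timeout=0\n' "$1"
}
```

The output is meant to be appended to the bootloader arguments of the generated install ks.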
## 3. fbcon=nodefer on installed-system cmdline (FIX)
Live ISO cmdline already has `fbcon=nodefer` (added in v0.5.27 to fix
the real-laptop black-screen-after-dracut). The installed-system
bootloader directive in the generated install ks did NOT carry it.
Same KMS handoff happens on the installed system on the same hardware.
Now both have the flag.
## 4. /etc/crypttab fallback assertion (BELT-BRACES)
Anaconda's custom-partitioning code path normally writes /etc/crypttab
for `--encrypted` part directives. We observed edge cases on F43+ where
it doesn't. Without crypttab, systemd-cryptsetup-generator can still
work from kernel cmdline alone, but cleanup paths and second-stage
unlock both fall over. Adding a fallback `echo` that writes the
canonical line if it's missing post-anaconda.
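
A sketch of the fallback, written as an idempotent helper so it can be tested against a scratch file. The function name, its parameters, and the exact field values (`none luks`) are assumptions for illustration; the line the installer actually writes may differ:

```shell
# ensure_crypttab FILE NAME UUID
# Append a crypttab entry for NAME only when FILE has no entry for it yet.
ensure_crypttab() {
    file=$1 name=$2 uuid=$3
    # An existing entry starts with the mapping name followed by whitespace.
    if [ -f "$file" ] && grep -Eq "^${name}[[:space:]]" "$file"; then
        return 0
    fi
    # Fields: mapping name, source device, key file, options.
    printf '%s UUID=%s none luks\n' "$name" "$uuid" >> "$file"
}
```

Calling it twice with the same arguments leaves a single line, so it is safe to run whether or not anaconda already wrote the file.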
## 5. Initramfs LUKS module assertion (DEFENSIVE)
Force-include `crypt + systemd-cryptsetup + plymouth` modules in
initramfs via /etc/dracut.conf.d/10-veilor-luks.conf. dracut autodetects
these when it sees an active LUKS mapping, but %post runs before the
LUKS state is fully observable from the chroot. Plus we wipe stale
initramfs (`rm -f /boot/initramfs-*.img`) before `--regenerate-all`
so the regen actually rewrites bytes. Final assertion runs
`lsinitrd | grep -q cryptsetup` and surfaces a [ERR] line in build
output if the module didn't make it.
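
The two halves of this step can be sketched as follows. The helper names and the `[OK]` message are illustrative assumptions; the module list and the file name come from the description above:

```shell
# write_luks_dracut_conf DIR
# Drop a dracut config fragment that force-includes the LUKS-related modules.
write_luks_dracut_conf() {
    conf_dir=$1
    mkdir -p "$conf_dir"
    printf 'add_dracutmodules+=" crypt systemd-cryptsetup plymouth "\n' \
        > "$conf_dir/10-veilor-luks.conf"
}

# assert_cryptsetup
# Read an lsinitrd listing on stdin and report whether cryptsetup is in it.
assert_cryptsetup() {
    if grep -q cryptsetup; then
        echo "[OK] cryptsetup present in initramfs"
    else
        echo "[ERR] cryptsetup missing from initramfs"
        return 1
    fi
}
```

In %post this might be wired as `write_luks_dracut_conf /etc/dracut.conf.d`, followed by the initramfs wipe and regeneration, then `lsinitrd | assert_cryptsetup`.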
## What this should fix
After the man-db fix in v0.5.28-final, install proceeded past
"Configuring xxx" cleanly but died at "Installing boot loader" with
gen_grub_cfgstub. The root cause was the over-broad patch described in #1 above.
After v0.5.29:
- Install transaction completes (man-db excluded; non-man-db
scriptlet warnings still suppressed; real errors still raise)
- gen_grub_cfgstub runs against complete /boot/efi/EFI/fedora/
- Bootloader install completes
- Reboot to disk lands at GRUB veilor-os entry
- Kernel + initramfs load (cryptsetup confirmed present)
- Plymouth LUKS prompt appears with text fallback
- User has 5 tries, no timeout
- Unlock → btrfs subvol mount → systemd → SDDM
Files: kickstart/veilor-os.ks (+45 lines), overlay/usr/local/bin/veilor-installer (+50 lines).
Verified: bash -n clean, ksvalidator clean.
References:
pyanaconda transaction_progress.py:110-136 (4 producers of 'error')
pyanaconda bootloader/efi.py:194-201 (gen_grub_cfgstub call site)
/usr/bin/gen_grub_cfgstub (set -eu wrapper for grub2-mkconfig stub)
Fedora wiki Changes/RPM-6.0
dnf5 issue #2507 (RPM 6.0 scriptlet propagation regression)
THE actual root cause of the man-db transaction failure that killed
three consecutive VM installs (v0.5.26 / v0.5.27 / v0.5.28).
Confirmed via 7-agent research wave:
- Fedora 43 ships RPM 6.0, which changed scriptlet failure
propagation. Scriptlets that previously emitted "Non-critical
error" warnings now bubble up as transaction-level errors. dnf5
issue #2507 documents the change. Anaconda --cmdline mode treats
any 'error' token from the dnf transaction as a fatal abort.
- man-db's `transfiletriggerin` is the canonical trigger: it runs
`systemd-run /usr/bin/systemctl start man-db-cache-update` which
returns non-zero in the anaconda chroot (no PID 1 systemd) and is
flagged as transaction-level error under RPM 6.0.
- We previously patched anaconda's transaction_progress.py on the
BUILD HOST so livecd-creator could finish its own transaction.
That patch lives only on the host running the build — never landed
in the live rootfs the user installs from. Reproduced 3 times:
install-time anaconda on the live ISO is unpatched, hits the same
code path, aborts at exactly "Configuring man-db.x86_64".
Two-layer fix:
1. kickstart %post seds the file inside the live rootfs at build time
so the user's install-time anaconda is patched. Sed downgrades the
'error' token from raise PayloadInstallationError to log.warning.
2. Generated install ks excludes man-db / man-pages / man-pages-overrides
from %packages. Belt-and-braces — even if the patch has an edge
case the trigger never fires because the package isn't installed.
Users install man pages post-firstboot.
Previous attempts that didn't work: dropping the updates repo (only
narrowed the set of failing scriptlets, didn't fix the underlying
RPM-6.0 propagation change); flipping SELinux to permissive
(confirmed not the cause; kickstart's selinux directive only writes
/etc/selinux/config in target root, doesn't affect installer-time).
Follow-up for next release: replicate the transaction_progress patch
in the CI workflow's container so the build itself is deterministic.
So far the workflow has only been passing by luck.
Files: kickstart/veilor-os.ks (+25 lines), overlay/usr/local/bin/veilor-installer (+10 lines).
Verified: bash -n clean, ksvalidator clean.
Critical install bug fix + cosmetic round-up + first formal test
procedure document.
## Critical: LUKS unlock on first boot
Generated installer kickstart's %post was injecting `rd.luks.uuid=…`
into `/etc/default/grub` only. Fedora 43 uses BLS (Boot Loader
Specification) entries in `/boot/loader/entries/*.conf`; those are
NOT regenerated by `grub2-mkconfig`. Result: the kernel boots without
`rd.luks.uuid=`, dracut's cryptsetup-generator never spawns the
unlock unit, plymouth has no password to ask for, and dracut-initqueue
loops on dev-disk-by-uuid for ~3min before dropping to emergency
shell.
The fix layers both write paths:
- `/etc/default/grub` — keeps the args around for future kernels
(kernel-install reads this when adding new entries).
- `grubby --update-kernel=ALL --args=...` — rewrites the `options`
line of every existing BLS entry so the kernel that boots NEXT
actually has the args.
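
A sketch of the two write paths, assuming `/etc/default/grub` carries a double-quoted `GRUB_CMDLINE_LINUX=` line. The function names are hypothetical, and the real %post may edit the file differently:

```shell
# append_grub_default FILE ARGS
# Add ARGS to GRUB_CMDLINE_LINUX in FILE unless they are already present.
append_grub_default() {
    file=$1 args=$2
    if grep -qF -- "$args" "$file"; then
        return 0
    fi
    sed -i "s|^GRUB_CMDLINE_LINUX=\"\(.*\)\"|GRUB_CMDLINE_LINUX=\"\1 $args\"|" "$file"
}

# update_bls_entries ARGS
# Rewrite the options line of every existing BLS entry. This is a no-op
# where grubby is not installed, e.g. outside the target chroot.
update_bls_entries() {
    command -v grubby >/dev/null 2>&1 || return 0
    grubby --update-kernel=ALL --args="$1"
}
```

Both helpers take the same argument string, so the defaults file and the existing entries cannot drift apart.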
Verified by reading `/proc/cmdline` from the dracut emergency shell
on a v0.5.26 install; old cmdline had only `root=UUID=… ro
rootflags=subvol=root` and was missing the LUKS arg entirely.
## GRUB / branding
- `/etc/default/grub` is sed'd to `GRUB_DISTRIBUTOR="veilor-os"` (was
already there, kept).
- BLS entries' `title` line is rewritten in-place to "veilor-os
(<kver>)" for every kernel — `grub2-mkconfig` does not touch BLS
titles, so this is the only path.
- `/boot/loader/entries/*-0-rescue-*.conf` is removed: the auto-built
rescue entry was leaking "Fedora Linux" into the GRUB menu and
showing a second boot option that nobody asked for. The rescue
kernel image itself is left in /boot.
- Hostname defaults to `veilor` (was inheriting the `localhost-live`
name anaconda writes when the kickstart's network directive is
ignored under cmdline mode).
- `/etc/machine-info` adds `PRETTY_HOSTNAME="veilor-os"` so
`hostnamectl status` and any consumer reading machine-info see the
brand.
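
The in-place BLS title rewrite from the list above could look roughly like this. It assumes each entry carries a `version` line from which the kernel version is taken; how the real %post derives `<kver>` is not stated, and the function name is hypothetical:

```shell
# rebrand_bls_titles DIR
# Set the title of every BLS entry in DIR to "veilor-os (<kver>)".
rebrand_bls_titles() {
    entries_dir=$1
    for entry in "$entries_dir"/*.conf; do
        [ -f "$entry" ] || continue
        # Take the kernel version from the entry's own "version" field.
        kver=$(sed -n 's/^version[[:space:]]\{1,\}//p' "$entry" | head -n 1)
        [ -n "$kver" ] || continue
        sed -i "s|^title[[:space:]].*|title veilor-os ($kver)|" "$entry"
    done
}
```

Entries without a `version` line are skipped rather than given an empty title.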
## Boot UX
- `fbcon=nodefer` added to live-ISO bootloader cmdline. On real
laptops with a hardware GPU, the kernel modeset blanks the
framebuffer console mid-boot; without `nodefer` the installer
banner draws into a frozen framebuffer and the user sees a black
screen with a blinking cursor for ~30s. virtio-vga in QEMU doesn't
trigger this so it never reproduced in VM. Symptom report on
v0.5.26 was the trigger to investigate.
## Installer cosmetics
- `GUM_CHOOSE_CURSOR` and `GUM_INPUT_PROMPT` switched from `❯ ` to
`> `. The unicode arrow falls back to a fixed-width block on the
linux fbcon font and lipgloss then duplicates that block at col +23,
producing the "Install Install" double-render and the stray-T
artifact in password fields. Plain ASCII renders identically across
fbcon, virtio-vga, and X/Wayland gum runs.
- `VERSION_ID` bumped 0.5.8 → 0.5.27 in the os-release drop-in. The
installer banner reads this at runtime, so the live ISO + installed
system both now show "veilor-os 0.5.27".
## Test procedure
- `test/TESTING.md` — first canonical test procedure document. Splits
VM (cheap iteration, hybrid sendkey + human passwords) from real
hardware (mandatory for tag). Documents the standard test passwords
(`veilortest1` for both LUKS and admin), the kill-and-relaunch step
to skip CD on second boot, and the per-step pass/fail contract.
- `test/METHOD-CHANGELOG.md` — append-only audit trail for changes to
the procedure. Future releases that alter the test method must add
an entry here explaining why.
- `test/test-runs/_TEMPLATE.md` — per-run report template. Each
tagged release should land a filled report alongside it.
## test/run-vm.sh
Decoupled QEMU monitor sock setup from auto-inject. Previously
`NO_INJECT=1` (used to suppress autotype noise into prompts) also
killed the monitor sock, leaving the VM undriveable. Monitor sock is
now always exposed; only the inject helper is gated on the pubkey
detection.
Live ISO boot chain showing extra step:
boot → text scroll → veilor-firstboot prompts admin pw → installer
veilor-firstboot.service was enabled in live ks but it's an INSTALLED
system feature (forces admin pw set on first real boot). Made no
sense to ask on live (no persistent admin user, throwaway VM, etc).
Live ks now: doesn't enable veilor-firstboot, masks the unit so
overlay-copied unit file can't auto-activate. Install ks chroot %post
already enables it (correct path).
After fix:
boot → text scroll → installer banner directly
User wants full chained pipeline:
GRUB veilor-os → plymouth text → branded gum installer →
install progress → reboot → installed system text-clean.
Live ISO was missing pieces from the install ks polish. v0.5.24
brings live ks into parity:
- bootloader --append: add plymouth.enable=0 (kills fedora splash,
exposes tty1 with gum installer banner immediately)
- chroot %post: GRUB_DISTRIBUTOR="veilor-os" (menu title)
- chroot %post: GRUB_CMDLINE_LINUX_DEFAULT="" (drop rhgb quiet)
- chroot %post: plymouth-set-default-theme details (text scroll
fallback if plymouth.enable=0 ignored)
- grub2-mkconfig regen with new branding
Result on next ISO build:
- Boot from ISO → GRUB shows "veilor-os" entry
- Pick veilor-os → text scroll (no fedora splash)
- TTY1 lands on gum installer banner + menu (no plymouth swallow)
- Install completes → reboot → installed system already has the
same text-mode boot + LUKS prompt config from v0.5.22-23
v0.5.21 set plymouth.enable=0 — plymouth-start.service still ran +
ate LUKS keystrokes. Boot fell to dracut emergency shell.
Better path: plymouth IS running but in TEXT mode via built-in
`details` theme (scrolling boot log, no graphics, no fedora logo).
LUKS prompt renders as text "Please enter passphrase for...:".
Plymouth still owns the prompt → keystrokes go through.
Changes:
- Drop plymouth.enable=0 from cmdline (let plymouth run)
- chroot %post: plymouth-set-default-theme details
- Drop rhgb quiet from GRUB_CMDLINE_LINUX_DEFAULT (all kernel msgs visible)
- dracut --force --regenerate-all (new theme baked into initramfs)
Result: text scroll boot → text LUKS prompt → text scroll → SDDM.
Onyx aesthetic. Branded plymouth theme deferred to v0.6.
User wants onyx-style boot: pure text scroll → LUKS prompt → text scroll
→ SDDM. No fedora splash, no plymouth UI.
Solution: keep plymouth PACKAGE installed (Fedora's dracut module
ships LUKS-prompt machinery via plymouth), but disable plymouthd at
runtime via kernel cmdline `plymouth.enable=0`.
Effect:
- plymouthd starts → reads cmdline → exits
- systemd-ask-password sees no plymouth daemon → falls back to
systemd-tty-ask-password-agent on /dev/console
- LUKS prompt rendered as text "Please enter passphrase for /dev/dm-0: "
- All kernel/systemd messages visible
- SDDM still launches at graphical.target (real install)
Applied to both:
- LIVE ks bootloader --append (so live boot text-mode + installer
visible on tty1, no splash hiding it)
- Generated install ks bootloader --append (so installed system
text-boots with LUKS prompt)
v0.6 will rebrand plymouth theme + re-enable for branded splash. For
v0.5.0 ship: minimal/text aesthetic matches user's onyx daily driver.
QEMU boot test of v0.5.1 (commit 3cbffaf) revealed both scripts
missing from /usr/local/sbin/ on running system, despite being in
overlay/usr/local/sbin/ in the source tree.
Root cause: Fedora's filesystem package (or post-install scriptlet)
rewrites /usr/local/sbin → /usr/local/bin symlink AFTER kickstart
%post --nochroot's overlay copy runs. The cp -a placed files in
/usr/local/sbin/ as a real directory; the symlink replacement
deleted them.
Confirmed via tty diagnostic: `ls -la /usr/local` shows
`lrwxrwxrwx ... sbin -> bin` with bin mtime predating sbin symlink
ctime by ~5min — overlay copy ran first, scriptlet rewrote sbin
second.
Fix: move both binaries to overlay/usr/local/bin/ where they're
safe from the symlink rewrite. Update all references:
- kickstart/veilor-os.ks chmod path + chown + diagnostic ls
- overlay/etc/systemd/system/getty@tty1.service.d/veilor-installer.conf ExecStart
- overlay/etc/systemd/system/veilor-firstboot.service ExecStart
- scripts/selinux/build-policy.sh fcontext + restorecon paths
- generated install ks template inside veilor-installer
Service drop-in stays at /etc/systemd/system/getty@tty1.service.d/
unchanged. The veilor-installer binary in /usr/local/bin/ is
discoverable via $PATH same as before.
Two user-facing commands shipped in overlay/usr/local/bin/.
They wrap the dnf+flatpak update flow and a read-only health diagnostic.
Both use gum if available, plain output otherwise. No kickstart wiring
yet beyond chmod; full integration lands in the v0.6.0 release.
Co-authored-by: veilor-org <admin@veilor.org>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
- display-manager.service symlink: livecd-creator skips alias creation
vs Anaconda installer; without it sddm stays inactive at graphical.target
- admin user: replace `passwd -d` with throwaway pw `veilor` + chage -d 0
(SDDM rejects blank pw by default, breaks first-login flow)
Tested in QEMU v0.2.5: confirmed sddm enabled but inactive after boot,
and blank-pw login at SDDM returns "Login Failed".
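
A sketch of the alias creation, assuming sddm.service is installed under /usr/lib/systemd/system. The function name and the optional root-prefix parameter are illustrative, added so the helper can be exercised against a scratch directory:

```shell
# link_display_manager [ROOT]
# Create the display-manager.service alias that livecd-creator leaves out.
link_display_manager() {
    root=${1:-}
    mkdir -p "$root/etc/systemd/system"
    ln -sfn /usr/lib/systemd/system/sddm.service \
        "$root/etc/systemd/system/display-manager.service"
}
```

Inside the chroot %post it would be called with no argument so the link lands under /etc/systemd/system.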
Live ISO stalled at dracut for 5+min on KVM with init_on_alloc=1
init_on_free=1 — kernel zeroes every page on alloc/free, brutal in
virtualized memory. Keep slab_nomerge + lockdown=integrity +
randomize_kstack + vsyscall=none for live (cheap). Re-add memory
init flags on installed system via veilor-firstboot post-install
GRUB edit (planned v0.3).
- kde-theme-apply.sh: search /etc/os-release.d/veilor (where overlay
put it) before falling back to $REPO/overlay path. Rewire symlinks
cleanly: /etc/os-release → ../usr/lib/os-release.
- Kickstart: useradd admin in chroot %post since livecd-creator skips
the `user` directive (no installer phase). Blank pw + expired = forced
reset at first login same as before.
Found via debugfs: overlay copy succeeds (veilor-power, tuned profiles,
sshd-hardening, sudoers, systemd units all present in v0.2.1 rootfs) but
`mkdir + cp assets/scripts` aborted under set -eu, leaving
/usr/share/veilor-os missing, so all chroot %post scripts fail. Switch to
set +e on cp and persist a trace log to /var/log/veilor-nochroot.log for
the next debug.
Agent A: missing livesys-scripts + anaconda-live = lorax can't build EFI/BOOT.
Agent B: livecd-creator ignores url=, only reads repo.repoList — added
explicit repo --name=fedora to feed it the base.
Both Fedora's own pipeline + livecd-creator now have what they need.
Live image plumbing in %post: enable livesys.service livesys-late.service
tmp.mount, reset machine-id.
Past grub2-mkimage. Failed at xorrisofs final ISO assembly because EFI/BOOT
dir not built — needs grub2-efi-x64-modules to compile standalone grubx64.efi.
CI made it through full install, configure, %post, squashfs build,
initrd rebuild — failed at final boot.iso wrap because grub2-mkimage
needed /usr/lib/grub/i386-pc/moddep.lst (BIOS legacy boot modules).
Hybrid BIOS+UEFI ISO requires both grub variants.