Walk every action in kickstart/veilor-os.ks %post and map each one to
its v0.7 atomic equivalent:

Build-time script additions:
- chmod +x /usr/share/veilor-os/scripts/* + /usr/local/bin/veilor-*
  (BlueBuild type:files sometimes drops perms)
- fc-cache -f after Fira Code stamping
- os-release brand override (NAME=veilor-os, ID=veilor, ID_LIKE)
- brand-leak guard: fail the image build if any onyx/personal data
  slipped through into shipped state

Layered packages:
- zram-generator (memory hygiene; replaces dnf install in kickstart)
- jq (used by veilor-doctor for `bootc status --json`)
- vim-enhanced + tmux + htop (admin essentials, parity with v0.5.x)

Systemd unit enables added:
- veilor-postinstall.service (first-login TUI; new in A3)
- veilor-doctor.timer (weekly drift check; new in A3)

Dropped:
- anaconda transaction_progress.py patch (build-time CI work, not
  image content)
- SDDM display-manager symlink (kinoite ships sddm.service already)
- SELinux module build (secureblue has its own)
- systemctl set-default multi-user.target (kinoite is
  graphical.target by design)
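
A minimal sketch of what the brand-leak guard could look like, assuming it is a recursive text search over the staged tree. The function name `brand_leak_guard`, the pattern list, and the scanned path are hypothetical; the real build script may check different strings or locations.

```shell
# Hypothetical brand-leak guard: return non-zero when any forbidden
# string appears under the scanned tree. Patterns are illustrative.
brand_leak_guard() {
    root=$1; shift
    status=0
    for pattern in "$@"; do
        # -r recurse, -I skip binary files, -i ignore case,
        # -l list matching files only, -F treat pattern as a fixed string
        hits=$(grep -rIilF -- "$pattern" "$root" 2>/dev/null || true)
        if [ -n "$hits" ]; then
            printf 'brand leak: "%s" found in:\n%s\n' "$pattern" "$hits" >&2
            status=1
        fi
    done
    return "$status"
}
```

In a build script this would be called as `brand_leak_guard /usr/share/veilor-os onyx || exit 1`, so a single hit fails the image build.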
A3 inline (agent failed on API). Three CLIs ported / written for the
v0.7+ atomic system:
veilor-update — rewritten on bootc upgrade (was dnf upgrade --refresh).
Pre-checks bootc status, pauses auditd while staging, prints summary
and offers reboot. Returns 0/1/2/3 per legacy contract.
veilor-postinstall (NEW) — first-login TUI run via
veilor-postinstall.service oneshot. Asks once for keyboard, locale,
hostname, GPU drivers, package presets (dev/media/homelab),
bluetooth, USBGuard snapshot, then invokes veilor-doctor. Writes
/var/lib/veilor/postinstall-complete and self-disables on success.
veilor-doctor — Updates section rewritten to parse `bootc status
--json` (with jq) when available, falls back to dnf history /
check-update for legacy v0.5.x kickstart-installed systems.
Plus systemd units:
- veilor-postinstall.service (oneshot on graphical.target, gated on
absence of done-marker, runs on tty1)
- veilor-doctor.service + .timer (weekly drift check)
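
The done-marker gating for veilor-postinstall can be sketched as a run-once wrapper. This is an illustration under assumptions: `run_once` is a hypothetical name, and the real tool may stamp the marker differently.

```shell
# Hypothetical run-once helper. The marker is only written when the
# wrapped command succeeds, so a failed or cancelled run is retried.
run_once() {
    marker=$1; shift
    if [ -e "$marker" ]; then
        echo "already complete: $marker"
        return 0
    fi
    if "$@"; then
        mkdir -p "$(dirname "$marker")"
        : > "$marker"          # create the empty done-marker
    else
        return 1
    fi
}
```

Usage would look like `run_once /var/lib/veilor/postinstall-complete veilor-postinstall`. On the systemd side, one way to express the same gate is `ConditionPathExists=!/var/lib/veilor/postinstall-complete` in the unit; whether the shipped unit uses that directive is not stated here.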
A1 inline (agent failed on worktree base mismatch). Adapt
build-bluebuild.yml to run on the Forgejo self-hosted runner using
the same lessons from build-iso.yml debug:
- runs-on: nullstone (resolves to veilor-build:43, fedora43+nodejs)
- BlueBuild CLI installed in-job from upstream release tarball v0.9.10
- podman/buildah/skopeo/cosign installed via dnf
- bluebuild build with podman driver + skopeo inspect + cosign signing
- Push primary to Forgejo registry git.s8n.ru/veilor-org/veilor-os
- GHCR push gated to github.server_url == 'https://github.com' only
- SBOM + attest-build-provenance gated GH-only (Forgejo has no Fulcio)
- All third-party actions remain pinned to node20-shipping versions
Secrets needed in Forgejo repo settings:
- FORGEJO_REGISTRY_TOKEN: PAT with package:write on veilor-org
- FORGEJO_REGISTRY_USER: 's8n-ru' (or org member with write scope)
Strategy pivot 2026-05-06: v0.5.32 produced a green ISO on Forgejo
runner. That's the kickstart-path proof point. Continuing v0.6
kickstart polish is sunk-cost work on tooling retired at v1.0.
Pivot:
- v0.5.0 is the FINAL kickstart-path release. Tag, freeze, ship.
- v0.6 cancelled as a milestone. Original plan kept inline as
HISTORICAL reference.
- v0.7 promoted to primary active milestone. Absorbs the v0.6
ergonomic CLI tools (veilor-postinstall / veilor-doctor /
veilor-update) with bootc upgrade replacing dnf upgrade.
- Active branch: v0.7-bluebuild-spike. All future feature work lands
there, not on main.
Single document that surfaces the depth of work behind veilor-os:
metrics, distros studied, every tool traversed in the build chain,
all 35+ failure classes hit and beaten, key engineering decisions and
why, what's in the repo beyond the kickstart, and the self-hosted
nullstone CI infrastructure built to support it.
Receipts not narrative — every claim links back to a file path,
commit, error, or config. Useful as portfolio anchor and as a single
read-this-first for anyone returning to the project after a gap.
cosign keyless sign uses Sigstore Fulcio which requires a
Fulcio-trusted OIDC issuer. Forgejo runs don't have one, so cosign
falls back to the interactive device flow and times out
(error obtaining token: expired_token). Same applies to
attest-build-provenance and the SBOM action's signed attestation.
Skip all three on Forgejo for now; ISO + sha256 are sufficient for
v0.5.x test releases. Re-add when we self-host a Sigstore stack or
sign with a key-pair instead of keyless.
We layer on their OCI image as v0.7 base; we don't redistribute their
source. Drop the AGPLv3-attribution prose — that becomes relevant only
if/when we ship a verbatim chunk of their config/policy in our repo.
secureblue (AGPLv3) is the upstream hardened atomic Fedora that the
v0.7 BlueBuild spike layers on top of. Comparison table now includes
secureblue alongside Kicksecure + stock Fedora KDE. New "Credit &
relationship to secureblue" section spells out where their work
already solves problems we don't need to reinvent (Trivalent,
SELinux policy, kernel cmdline, signed OCI), how veilor-os differs
(kickstart install path + branding + Forgejo CI), and the AGPLv3
attribution rule for any code we lift verbatim.
Build error: 'Failed to find package apparmor-parser : No match for
argument'. Fedora 43 base and updates do not ship AppArmor packages;
the prior comment was incorrect. Defer AppArmor to v0.7 secureblue OCI
hybrid (which has its own LSM stack), or land via COPR overlay later.
The forgejo-runner label nullstone maps to the fedora:43 image. Switching
runs-on: ubuntu-24.04 -> nullstone makes the job container itself
the build environment, eliminating the docker-in-docker workspace
bind-mount problem (host path != act-container path).
Build now runs as root in fedora:43, installs livecd-tools directly
via dnf, and writes outputs to $GITHUB_WORKSPACE which is the natural
runner workdir on host. No nested docker, no userns juggling, no
explicit -v workspace bind needed.
The prior pin was the arm64 manifest digest (linux/arm64/v8); on an
x86_64 host it failed with `exec /usr/bin/sh: exec format error`. Now
pinned to the amd64 manifest entry from the same fat manifest.
Forgejo runner on nullstone runs against a daemon with
userns-remap=default. addnab/docker-run-action launches the Fedora 43
build container with --privileged, which is incompatible with
userns-remap unless --userns=host is also set.
forgejo-runner v6.4.0 ships node20; floating tags @v0/@v3/@v2 now
resolve to actions whose runs.using=node24, which the runner cannot
exec. Pin to last node20-shipping release of each:
- anchore/sbom-action@v0.17.2
- sigstore/cosign-installer@v3.7.0
- actions/attest-build-provenance@v2.2.3
The build-iso workflow used softprops/action-gh-release@v2 unconditionally,
which only speaks the GitHub Releases REST API. When the workflow runs on
the Forgejo runner registered on nullstone, those steps would fail.
Add a server_url check so the GH-only path runs only on github.com, and
mirror it with a curl-based step that hits the Forgejo /api/v1/releases
endpoints. Behaviour:
- github.com: identical to before (action-gh-release@v2).
- git.s8n.ru: drop+recreate ci-latest release, upload chunked assets
via the Forgejo attachments API.
Tag-driven "Attach to release" path mirrored the same way.
Refs: A1 build-eng task — Forgejo runner adaptation.
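
A rough sketch of the server check and the curl path, not the workflow's actual step. Function names, argument order, and the request body are assumptions; the endpoint path follows the Forgejo `/api/v1` releases API named above, and the token header uses the `Authorization: token ...` form.

```shell
# Decide which release path to take from the server URL the runner
# exposes (github.server_url in the workflow).
release_backend() {
    if [ "$1" = "https://github.com" ]; then
        echo github
    else
        echo forgejo
    fi
}

# Hypothetical helper: create a prerelease on a Forgejo instance.
# $1 server URL, $2 owner/repo, $3 tag, $4 API token
forgejo_create_release() {
    curl --fail --silent --show-error \
        -X POST "$1/api/v1/repos/$2/releases" \
        -H "Authorization: token $4" \
        -H "Content-Type: application/json" \
        -d "{\"tag_name\": \"$3\", \"name\": \"$3\", \"prerelease\": true}"
}
```

The drop-and-recreate and chunked asset upload described above would be further curl calls against the same API; they are left out because their exact parameters are not shown in this log.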
In v0.5 the "Remove the install media" reminder was a single line
inside the green success box, and operators on both onyx and the
friend's RTX 4080 rig missed it — rebooted into the live ISO and
re-ran the installer thinking the install had silently failed.
Promote the reminder to its own loud yellow thick-bordered gum-style
box stacked directly below the success/countdown box, with three
lines of explanation. Renders for the full 10s of the countdown so
it stays in the operator's face the entire window.
v0.5 used `sleep 5` after a static "System will reboot in 5 seconds."
box, which left the operator guessing how much time was left to grab
the USB stick. The new loop runs 10 → 1, clearing + redrawing the
gum-style success box each tick with the remaining-seconds figure,
giving the operator a visible window to act.
10 seconds (vs 5) because real hardware operators were missing the
window — laptops with the USB on the far side of the dock take
4-5 seconds to physically reach. 10 is comfortable, not annoying.
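
The countdown loop can be sketched as follows. This is a simplified stand-in: the installer clears and redraws a gum-style box on each tick, whereas this hypothetical `countdown` helper prints one plain line per second, and the message text is illustrative.

```shell
# Hypothetical countdown helper: count from N down to 1.
# $1 = seconds (default 10), $2 = delay per tick in seconds (default 1)
countdown() {
    i=${1:-10}
    delay=${2:-1}
    while [ "$i" -ge 1 ]; do
        printf 'Rebooting in %d s. Remove the install media now.\n' "$i"
        sleep "$delay"
        i=$((i - 1))
    done
}
```

Passing the delay as a parameter keeps the loop testable (`countdown 3 0`) without changing the 10-second default.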
A typo in the LUKS passphrase is unrecoverable — the disk is
unmountable without it and we don't escrow the key. Re-prompting
until the two reads match catches keyboard-layout surprises (the
US/UK quote-key position is the most common one) before they brick
the install.
Admin password gets the same treatment for consistency. Less
catastrophic (resettable from a recovery shell) but a mismatch
still locks the user out of their fresh install on first boot.
Loop bails on cancel/ESC and re-prompts on validate_pw failure.
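
A minimal bash sketch of the confirm loop, under stated assumptions: `prompt_confirmed_secret` is a hypothetical name, the `validate_pw` body is a stand-in (the real rules are not shown here, so a minimum length of 8 is assumed), and EOF on stdin stands in for cancel/ESC.

```shell
# Stand-in for the installer's validate_pw; the real checks are unknown.
validate_pw() {
    [ "${#1}" -ge 8 ]
}

# Read a secret twice (bash `read -s`, no echo) and loop until both
# reads match and pass validate_pw. Returns 1 when input ends.
prompt_confirmed_secret() {
    label=$1
    while true; do
        IFS= read -rsp "$label: " first || return 1
        IFS= read -rsp "Confirm $label: " second || return 1
        if [ "$first" != "$second" ]; then
            echo "Entries do not match, try again." >&2
            continue
        fi
        if ! validate_pw "$first"; then
            echo "Rejected by validate_pw, try again." >&2
            continue
        fi
        printf '%s\n' "$first"
        return 0
    done
}
```

Messages go to stderr and the accepted secret to stdout, so a caller can capture it with `pw=$(prompt_confirmed_secret "LUKS passphrase")`.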
Read banner.txt line by line with a 40ms sleep between each, then
clear and redraw the bordered gum-style version. 5-line banner ×
40ms = 200ms total reveal — slow enough to land an aesthetic on the
first frame, fast enough that the operator never feels it as lag.
Pure cosmetic; no functional change to the install flow.
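
The reveal step could be sketched like this. `reveal_banner` is a hypothetical name; the follow-up clear and bordered gum-style redraw are omitted, and fractional `sleep` is assumed to be supported (as in GNU coreutils).

```shell
# Print a file line by line with a short pause between lines.
# $1 = banner file, $2 = delay per line in seconds (default 0.04)
reveal_banner() {
    file=$1
    delay=${2:-0.04}
    # `|| [ -n "$line" ]` keeps a final line that lacks a trailing newline.
    while IFS= read -r line || [ -n "$line" ]; do
        printf '%s\n' "$line"
        sleep "$delay"
    done < "$file"
}
```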
`gum input --password` corrupts the linux fbcon since v0.5.27 — the
bubbletea screen-restore writes back the previous menu buffer because
the framebuffer terminfo entry lacks `civis/cnorm` cursor-hide
sequences, leaving a duplicate "Install" plus a stray "T" rendered on
top of the password field. The fix is a single termios echo-off via
`read -srp`: no redraw, no glitch, no dependency on gum's TUI layer
for the one screen where it broke.
Header still rendered through `gum style` so visual parity with the
disk picker / confirm box is preserved. Whiptail fallback path
unchanged (passwordbox there has always rendered cleanly).
Note that all `uses:` directives still resolve to mutable
major-version tags. SHA-pinning is the Agent 8 audit recommendation but
requires per-action web lookups that stalled the previous SRE
attempt; tracked separately so this PR can land first.
Pin registry.fedoraproject.org/fedora:43 to its current manifest
digest so a malicious or accidental tag-rewrite upstream cannot
silently change the base layer of every CI build. Digest was
captured via `skopeo inspect --raw` on 2026-05-06. Refresh
procedure documented inline.
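
As an illustration of where such a digest comes from: a manifest digest is the SHA-256 of the raw manifest bytes, so it can be derived from `skopeo inspect --raw` output. The helper name `manifest_digest` is hypothetical and `sha256sum` is assumed to be installed; the refresh procedure documented in the workflow may differ.

```shell
# Read raw manifest bytes on stdin, print the digest in the
# "sha256:<hex>" form used for image pins.
manifest_digest() {
    printf 'sha256:%s\n' "$(sha256sum | cut -d' ' -f1)"
}
```

Usage would be `skopeo inspect --raw docker://registry.fedoraproject.org/fedora:43 | manifest_digest`. For a multi-arch tag the raw document is the manifest list, so this yields the list digest; selecting one architecture's entry is a separate step.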
Sign each ISO chunk with cosign keyless OIDC, generate an SPDX SBOM
of the build output, and attach an in-toto build-provenance
attestation. Sigs/certs/SBOM are uploaded alongside the ISO parts in
the ci-latest rolling prerelease so the test/auto-install.sh path
can verify before reassembling.
Action versions are major-version tags (@v3, @v0, @v2). SHA-pinning
is tracked separately to keep this PR small and avoid the long web
lookups that stalled the previous attempt.
forgejo-runner v6.4.0 ships a node20 javascript engine. v4.2+ of
actions/checkout and v2.0.5+ of softprops/action-gh-release moved to
node24, which the runner refuses to exec. Pin both to last node20
release.
Pairs with a runner-side config change (separately deployed on
nullstone /home/docker/forgejo-runner/conf/config.yaml) that adds
`-v /var/run/docker.sock:/var/run/docker.sock` to per-job container
options + whitelists the socket via valid_volumes — without that
addnab/docker-run-action@v3 inside the catthehacker/ubuntu job
container can't reach the docker engine.
- actions/checkout v4 -> v4.1.7
- softprops/action-gh-release v2 -> v2.0.4
- addnab/docker-run-action v3 unchanged (composite/docker, no node)
- ludeeus/action-shellcheck@master unchanged (docker-based)
First test-runs/ report off the new template. Records the build host
(forgejo-runner on nullstone, ubuntu-24.04 / catthehacker:act-24.04),
notes that v0.5.32 is the first ISO produced after the GH Actions
mirror was disabled, and pre-populates the Findings section with the
7 v0.5.32 blocker fixes from the 2026-05-05 9-agent wave as expected
behaviours the tester must verify.
Result is left as "pending A1 build" — the operator + A5 fill in
per-step pass/fail and hardening output once the actual VM walkthrough
runs against the produced ISO. This is intentional: the report is the
scaffold; the test is a separate step.
Per docs/research/2026-05-05-agent-wave/README.md priority list.
All 7 land together to keep iteration cycles useful — partial fixes
bury the lookahead findings agents already mapped.
## 1. CRITICAL — suspend/resume wifi death (Agent 9, B2)
`veilor-modules-lock.service` sets `kernel.modules_disabled=1` 30s
after graphical.target. iwlwifi/iwlmvm/cfg80211 reload on resume
from S3/S0ix, so with modules locked, resume breaks wifi until
reboot. Same architectural class as the LUKS bug: a security feature
breaks legitimate kernel state transitions.
The unit already has `ConditionKernelCommandLine=!module.sig_enforce=1`
(self-skip when signed-modules enforcement is on cmdline). Adding
`module.sig_enforce=1` to the kernel cmdline retains the security
property (no unsigned modules) without runtime lock-down → resume
works.
Files: kickstart/veilor-os.ks line 61 + overlay/usr/local/bin/veilor-installer
generated bootloader directive both gain `module.sig_enforce=1`.
## 2. veilor-firstboot.service WantedBy=graphical.target (Agent 2)
Was `WantedBy=multi-user.target` only. Real installs default to
graphical.target, so the unit never ran on installed systems: the admin
password stayed at its install-time value and, expired by `chage -d 0`,
made SDDM PAM bounce to the chauthtok screen (recoverable but ugly UX).
Now `WantedBy=graphical.target multi-user.target`. Live ISO +
multi-user installs both resolve via this list.
## 3. USBGuard hash → id-based baseline (Agent 9, A3)
Mirrors memory feedback_usbguard_dock.md — onyx had hash+parent-hash
rules that broke on dock replug; we shipped no rules.conf so first
boot blocks the USB keyboard.
Adds overlay/etc/usbguard/rules.conf with HID-class allow rule
(`allow with-interface match-all { 03:*:* }`) — covers every USB
keyboard, mouse, gamepad, fingerprint reader, NFC. Survives dock
replug + kernel-bump vendor re-enumeration. Mass-storage stays
implicit-block; user explicitly allows post-firstboot via
`ujust veilor-usbguard-enroll` (planned v0.6).
## 4. firewalld trusted zone with tailscale0 pre-bound (Agent 9, D1)
User uses Tailscale daily (memory: project_tailscale_mesh.md).
Default firewalld zone = drop, blocks tailnet traffic on tailscale0.
Adds overlay/etc/firewalld/zones/trusted.xml with
`<interface name="tailscale0"/>`. After `tailscale up` brings the
interface up, NetworkManager dispatcher associates it with the
trusted zone automatically — no user intervention.
Default zone stays drop. Only the tailscale0 interface gets ACCEPT.
## 5. /etc/skel branding (Agent 7)
Was completely empty. Result: per-user KDE config (~/.config/kdeglobals
etc.) started out empty, so the moment the user opened System Settings,
KDE wrote fresh ~/.config/* and silently shadowed our
/etc/xdg/kdedefaults/*. Visual brand evaporated on first click.
Seeds:
/etc/skel/.config/kdeglobals (copy of assets/kde/veilor-default.kdeglobals)
/etc/skel/.config/breezerc (copy of assets/kde/breezerc)
/etc/skel/.config/kwinrc (Plasma 6 wayland defaults: opengl, animspeed=0,
blur off, click-to-focus)
/etc/skel/.config/konsolerc (default profile = Veilor)
/etc/skel/.local/share/konsole/Veilor.profile + .colorscheme
User who opens System Settings now writes against branded baseline,
not against vanilla Breeze.
## 6. KMS modeset args + initramfs keymap (Agents 1 + 9)
Real laptop boot has a 5-15s blank between vt switch and SDDM start
because simpledrm releases before i915/nvidia-drm/amdgpu claim. Plus
non-US users get locked out at LUKS prompt because initramfs ships
en-US keymap by default (RHBZ 1405539, RHBZ 1890085).
Adds to bootloader cmdline (live + installed):
  i915.modeset=1 amdgpu.modeset=1 nvidia-drm.modeset=1
  rd.vconsole.keymap=us
`rd.vconsole.keymap=us` is a placeholder; the v0.6 firstboot keymap
picker will rewrite it from /etc/vconsole.conf. Until then, en-US
users get correct LUKS keyboard; non-US users still need the v0.6
fix (per Agent 1).
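
A sketch of adding such arguments idempotently, so repeated runs do not stack duplicates. The helper name `cmdline_add_args` is hypothetical and it operates on whatever one-line cmdline file is passed in; how the installer actually injects the arguments (via the kickstart bootloader directive) is not reproduced here.

```shell
# Append each argument to a one-line kernel cmdline file only when it
# is not already present as a whole word.
# $1 = file, remaining args = kernel arguments
cmdline_add_args() {
    file=$1; shift
    current=$(cat "$file" 2>/dev/null || true)
    for arg in "$@"; do
        case " $current " in
            *" $arg "*) ;;                              # already present
            *) current="${current:+$current }$arg" ;;   # append
        esac
    done
    printf '%s\n' "$current" > "$file"
}
```

For example, `cmdline_add_args /etc/kernel/cmdline i915.modeset=1 rd.vconsole.keymap=us` leaves the file unchanged on a second run.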
## 7. virtio-9p log capture (Agent 6)
The v0.5.30 virtio-serial wiring depends on rsyslog inside the live
ISO (anaconda's setupVirtio writes a rsyslog forward rule), which
the live ks doesn't install — files were 0-byte across three
install runs.
test/run-vm.sh now adds a `-virtfs local,...,mount_tag=hostlogs`
share pointing at `test/test-runs/<timestamp>/`. veilor-installer
runs `_dump_logs_to_host` via EXIT trap that mounts the share at
/mnt/hostlogs and rsyncs /tmp/{anaconda,program,storage,packaging,dnf}.log
+ /var/log/veilor-installer.log + dmesg + journalctl + the generated
ks. Runs on success AND failure AND ^C.
No-op on real hardware (9p tag absent) — VM-only debug.
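
The EXIT-trap idea can be reduced to the following sketch. It is not the installer's `_dump_logs_to_host`: the 9p mount step is left out, `cp` is used in place of rsync, and the helper name `dump_logs` plus all paths are placeholders.

```shell
# Copy whichever of the listed log files exist into a destination
# directory. Silently does nothing when the destination is absent,
# which mirrors the no-op behaviour on real hardware.
# $1 = destination dir, remaining args = log files
dump_logs() {
    dest=$1; shift
    [ -d "$dest" ] || return 0
    for f in "$@"; do
        [ -e "$f" ] && cp -- "$f" "$dest/" 2>/dev/null
    done
    return 0
}

# Usage inside a script (placeholder paths):
#   trap 'dump_logs /mnt/hostlogs /tmp/anaconda.log /tmp/program.log' EXIT
```

Because the handler is registered on EXIT, it runs whether the script exits with success or with an error code, and the script's own exit status is preserved.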
## Validate
bash -n overlay/usr/local/bin/veilor-installer # OK
ksvalidator kickstart/veilor-os.ks # clean
## Out-of-scope for v0.5.32 (deferred to v0.6)
Per Agent 1 follow-ups: argon2id retune for slow CPUs, recovery key
generation in firstboot, TPM2/FIDO2 unlock helpers. Per Agent 9
follow-ups: Plasma Wayland fallback X11 install, lid-close handling,
SELinux relabel progress UX. Per Agent 4: AppArmor stack +
nftables preset + audit log shipping CLI.
Per Agent 8 (CI hardening): SHA-pin actions + dependabot + SBOM +
SLSA L3 attestation — separate workflow-only commit.
forgejo-runner v6.4.0 javascript runtime is node20. Pin every
javascript action used in the spike branch's workflows to the last
release that ships node20.
- actions/checkout v4 -> v4.1.7 (3 files)
- softprops/action-gh-release v2 -> v2.0.4 (build-iso)
- anchore/sbom-action v0 -> v0.17.2
- actions/attest-build-provenance v2 -> v2.2.3
- blue-build/github-action@v1 unchanged (TODO: SHA pin)
This is the spike-branch counterpart of the main-branch fix in
feat/runner-fix-docker-sock-and-node20.
Replace @v1 with @24d146df25adc2cf579e918efe2d9bff6adea408 (the commit
v1 currently resolves to). Tag pins on third-party actions are mutable
— a maintainer or attacker can re-point v1 at a malicious commit and
silently change what runs on every push.
Trailing comment '# v1' preserves human readability for future bumps.
Refs: 9-agent CI hardening wave (agent 8), 2026-05-05.
Refines docs/STRATEGY.md per parent-operator handoff (2026-05-05).
Locks in five things the original draft didn't cover, and corrects
one mistake.
## Refinement: ostreecontainer install path
The original draft proposed a two-step install: Anaconda partitions
+ kickstart, then on first boot a `veilor-firstboot-rebase.service`
runs `bootc rebase ghcr.io/veilor/veilor-os:43`. This commit drops
that step.
Anaconda's `ostreecontainer --url=... --transport=registry`
directive populates the root filesystem directly from the OCI image
during the install pass. No first-boot rebase, no transition
window, no second reboot. Same end state, simpler path.
Stay on `ostreecontainer` through v0.8. Do NOT migrate to the new
`bootc` kickstart command until v1.0 — it blocks multi-disk and
authenticated registries. Do NOT use `bootc-image-builder
anaconda-iso` output — deprecated in image-builder v44+. Produce
the OCI image and the bootstrap ISO as separate artifacts.
This compresses the v0.7 BlueBuild spike from 2 days → 1 day.
## Correction: keep Trivalent as default
The original STRATEGY.md draft treated Trivalent (secureblue's hardened
Chromium) as an override-and-remove. That was wrong: Trivalent's
COPR tracks upstream M147+ within hours and ships hardened_malloc +
JIT-less + Drumbrake WASM. It is the default browser pick.
Mullvad Browser layered alongside for anti-fingerprint. Thorium
remains opt-in via `ujust install-thorium` only — its CVE lag is
months and contradicts the threat model. Never default.
## Mesh stack baked in
Three-layer warm-stack documented in STRATEGY.md:
- L3a Tailscale + Headscale (Day 1, daily driver)
- L3b Yggdrasil-go (Day 1, idle warm-fallback, AllowedPublicKeys mode)
- L3c Reticulum/RetiNet AGPL fork (opt-in via ujust install-reticulum)
Threat floor table: ISP-DNS-block (i, Day 1), ISP-Tailscale-block
(ii, Phase 2 promote Yggdrasil), internet-down (iii, opt-in RetiNet
+ RNode).
Tier model: tag:admin / tag:infra / tag:guest with failsafe pre-auth
key on yubikey + paper + Authentik OIDC group.
## Onboarding
Token paste / QR (user picks). Misskey signup mints reusable
24h-TTL pre-auth key. NOT auto-OIDC at first boot.
## Iroh seeding daemon stub (v0.8 / Phase 2)
`veilor-seed.service` documented but NOT implemented until Iroh hits
1.0 (current 0.96–0.98 RC, Q1 2026 target slipped). BLAKE3 +
iroh-gossip per-service topic. Static media only — DEFER DB
replication forever.
## External dependency tracked
nullstone Traefik `no-guest@file` ACL is currently 0.0.0.0/0
allow-all (XFF chain breakage 2026-05-03). Must be fixed before
veilor-os first-public-ISO ships, otherwise tag:guest provisioning
leaks the full vhost surface to every veilor user. Parent operator
owns the fix; explicitly out of veilor-os scope.
## Files
- docs/STRATEGY.md — full refinement
- docs/ROADMAP.md — v0.7 spike entry now reflects ostreecontainer
+ mesh stack + 1-day spike target
- README.md — drops the "v0.2.5 pre-release" badge + status box
(out of date), adds bootc/atomic trajectory paragraph
## What did NOT change
- v0.5.x main branch is untouched. The ostreecontainer swap belongs
in the v0.7 spike branch, NOT v0.5.32.
- nullstone Traefik config is untouched. Out of scope.
- The kickstart and overlay code is untouched.
Locks in the strategic decision from 2026-05-05 secureblue research
agent: pivot the technical base toward bootc/OCI, but as a layer over
secureblue's `securecore-kinoite-hardened-userns` rather than a
Containerfile-from-scratch.
## What changed
- New: `docs/STRATEGY.md` — full hybrid plan (kickstart bootstrap →
first-boot bootc rebase → bootc-only at v1.0). Documents secureblue
rationale, our overrides (drop Trivalent, restore sudo + Xwayland),
next concrete steps for v0.7 spike (BlueBuild recipe + GH Actions
workflow + `veilor-firstboot-rebase` one-shot).
- Updated: `docs/ROADMAP.md` v0.7 bootc-spike subsection — supersedes
the Agent 3 Containerfile-from-scratch plan with the BlueBuild
layering plan. Spike compresses 1 week → 2 days; hardening review
inherited from 30 secureblue contributors.
## Why hybrid, not pure pivot
- Anaconda's LUKS UX (single passphrase prompt + custom
partitioning) is mature; bootc-image-builder's installer is not yet
on par. Keep the kickstart as the bootstrap.
- bootc upgrade gets us atomic A/B + signed image chain + instant
rollback that we can't realistically build alone with our
contributor count.
- The kickstart work is not lost — it becomes the day-zero installer
through v0.7. v1.0 deprecates it entirely once bootc-image-builder
installer ISO matures.
## Why secureblue, not Athena (Arch)
| Axis | secureblue | Athena OS |
|---|---|---|
| Maintainers | 30 | 8 |
| MAC enforcing OOB | SELinux + custom policy | AppArmor active, profiles mostly unconfined |
| Atomic / immutable updates | Yes (bootc/rpm-ostree) | No (rolling) |
| Threat model published | No | Yes |
| MS-signed Secure Boot shim | Yes (Fedora shim) | Yes (with auto-MOK) |
Athena's only structural advantage is the published threat model.
We're already drafting one (Agent 5 of 2026-05-05 wave) — we get
that win regardless. secureblue's contributor count + atomic update
infrastructure is the leverage.
## Strategic credibility win
Publishing `docs/THREAT-MODEL.md` BEFORE the v0.7 launch positions
veilor-os ahead of secureblue (no threat model) and Athena (has
threat model but smaller contributor base) on the one axis that
matters most.
## Open questions documented in STRATEGY.md
- secureblue contribution acceptance for upstream patches (USBGuard
id-based-rules fix, threat model framework)
- Brave vs Mullvad-Browser pick for default browser
- bootc rebase first-boot fallback if rebase fails
- Fedora 44 transition timing follows secureblue's release tags
Four-bug fix from 4-agent verification wave on v0.5.30 outcome.
Bug 1 CRITICAL: --location=none made anaconda skip CollectKernelArgumentsTask
(installation.py:149-151). --append= args were never collected, so BLS
entries were written with an empty cmdline. Drop --location=none and let
anaconda run its own bootloader path; the broad transaction_progress
patch already silences the gen_grub_cfgstub class failure.
Bug 2 CRITICAL: kernel-install reads /etc/kernel/cmdline as source of
truth (per 90-loaderentry.install:84-95). Veilor never wrote that file
so kernel-install fell through to /proc/cmdline (live ISO's). Add
3-path write: /etc/kernel/cmdline (Path A canonical), /etc/default/grub
(Path B legacy), grubby --update-kernel=ALL (Path C last-writer guard).
Plus explicit kernel-install add per kernel after Path A write.
Bug 3: rescue BLS glob *-0-rescue-*.conf required trailing hyphen;
F43 uses *-0-rescue.conf. Fix: *-0-rescue*.conf (matches both).
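
The difference between the two globs can be checked with a small helper. `count_matches` is a hypothetical name and the file names used with it are made-up examples of the two naming forms.

```shell
# Count how many entries in a directory match a glob pattern.
# $1 = directory, $2 = glob pattern
count_matches() {
    dir=$1
    pattern=$2
    n=0
    for f in "$dir"/$pattern; do    # $pattern is unquoted so it expands
        [ -e "$f" ] && n=$((n + 1))
    done
    echo "$n"
}
```

Given a directory holding `abc-0-rescue.conf` and `abc-0-rescue-def.conf`, the pattern `*-0-rescue-*.conf` counts only the second file, while `*-0-rescue*.conf` counts both.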
Bug 4: set +e/set -e scope leak in %post. v0.5.30 closed manual
bootloader block with set -e which re-enabled errexit for the rest of
%post that was authored with set +e semantics. Result: any
non-guarded command failure aborted the LUKS args injection block.
Fix: remove the closing set -e.
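
The commit simply removes the closing `set -e`. An alternative pattern, shown here only as a sketch and not what the commit does, is to remember the caller's errexit state and restore it rather than forcing it on. The helper names and the `_saved_errexit` variable are hypothetical.

```shell
# Record whether errexit is currently enabled ($- contains "e").
errexit_save() {
    case $- in
        *e*) _saved_errexit=on ;;
        *)   _saved_errexit=off ;;
    esac
}

# Put errexit back to the recorded state.
errexit_restore() {
    if [ "$_saved_errexit" = on ]; then set -e; else set +e; fi
}

# Usage around a block that must tolerate failures:
#   errexit_save
#   set +e
#   ...commands that may fail...
#   errexit_restore
```

With this pattern a block that was entered under `set +e` semantics leaves them intact, which is the property the scope leak violated.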
Files: overlay/usr/local/bin/veilor-installer.
Verified: bash -n clean, ksvalidator clean.
Three-layer fix for the persistent anaconda transaction failure that
killed v0.5.28 (gen_grub_cfgstub) and v0.5.29 (aggregate dnf5 error).
## Layer 1: broad error suppression in transaction_progress.py
dnf5 under RPM 6.0 + cmdline anaconda emits a final aggregate
`error("transaction process has ended with errors..")` at end of
transaction whenever its internal failure counter > 0, regardless of
whether we suppressed individual script_error events. Reproduced
twice. The narrow patch in v0.5.29 suppressed per-package errors but
the aggregate still raised PayloadInstallationError and aborted the
install before the bootloader phase ran.
v0.5.30 patch turns the `elif token == 'error':` branch in
process_transaction_progress into a log.warning. All four producers
(cpio_error, script_error, unpack_error, generic error) now flow
through to a warning + continue. Pattern matches both the original
anaconda layout AND the v0.5.29 narrow-patched layout, so re-applying
on top of either is a no-op.
This brings us back to v0.5.28 broad-suppression behaviour. The
side effect that bit us in v0.5.28 (silent grub2-efi-x64 scriptlet
failure → empty /boot/efi/EFI/fedora/ → gen_grub_cfgstub fails)
is addressed by Layer 2 below.
## Layer 2: bootloader install moved out of anaconda
The generated install kickstart now has `bootloader --location=none`,
which tells anaconda NOT to invoke its own bootloader install code
path (and therefore NOT to call gen_grub_cfgstub). All grub work
moves into the chroot %post block:
1. `dnf reinstall grub2-efi-x64 grub2-pc grub2-tools shim-x64
efibootmgr` — re-runs scriptlets in the chroot with full
PID 1 systemd state, so the systemd-run-style triggers that
anaconda's chroot truncates actually execute.
2. `grub2-install --target=x86_64-efi --efi-directory=/boot/efi
--bootloader-id=fedora --no-nvram` — populates /boot/efi/EFI/fedora/
3. `gen_grub_cfgstub /boot/grub2 /boot/efi/EFI/fedora` (or
`grub2-mkconfig` fallback) — writes /boot/efi/EFI/fedora/grub.cfg.
4. `efibootmgr -c -d <disk> -p <part> -L "veilor-os" -l \EFI\fedora\shimx64.efi`
— registers the NVRAM boot entry pointing at the signed shim.
Each step logs to stdout and continues on failure (`set +e` block);
diagnostics surface in the install log without aborting the whole
%post.
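
The log-and-continue behaviour could be sketched with a small step runner. `run_step` and `failed_steps` are hypothetical names, and the log format is illustrative rather than the installer's actual output.

```shell
# Run one labelled command, report the outcome, and never abort.
# A counter records how many steps failed.
failed_steps=0
run_step() {
    label=$1; shift
    if "$@"; then
        echo "[ OK ] $label"
    else
        echo "[ERR] $label (exit $?)"
        failed_steps=$((failed_steps + 1))
    fi
    return 0
}
```

Each grub step would then be wrapped, for example `run_step "grub2-install" grub2-install --target=x86_64-efi --efi-directory=/boot/efi --bootloader-id=fedora --no-nvram`, so a failure shows up in the install log without stopping the remaining steps.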
## Layer 3: virtio-serial log capture in run-vm.sh
Anaconda 43.x autodetects `/dev/virtio-ports/org.fedoraproject.anaconda.log.0`
and streams program/packaging/storage/anaconda logs through it in
real time, before any tmpfs / pivot, before networking, surviving
kernel panic. Wiring it into run-vm.sh means the host gets a
tail-able log file at `test/anaconda-vm-YYYYMMDD-HHMMSS.log` for
every VM run.
We've lost logs three times in a row to anaconda failures + tmpfs
reboots. This breaks the loop.
## Diagnostic story
Before this commit: VM aborts → live ISO reboots itself → /tmp/
tmpfs gone → no logs → guess what failed. Three days, two and a
half false fixes.
After this commit: VM aborts → host has /home/admin/ai-lab/_github/veilor-os/test/anaconda-vm-*.log
with the actual scriptlet output, the actual exit codes, the
actual file-trigger failures. Future debug becomes evidence-based.
Files changed:
kickstart/veilor-os.ks — broad error suppression patch
overlay/usr/local/bin/veilor-installer — --location=none + manual grub
test/run-vm.sh — virtio-serial chardev wiring
Verified: bash -n clean, ksvalidator clean.
Five-fix bundle from 7-agent research wave on the v0.5.28-final
gen_grub_cfgstub failure.
## 1. Narrow the anaconda transaction_progress patch (CRITICAL)
The v0.5.28 patch was too broad. It rewrote
`process_transaction_progress` so every 'error' token in the
transaction queue became a `log.warning`. That queue carries four
distinct error classes:
- cpio_error — payload extraction (genuinely fatal)
- script_error — RPM 6.0 cmdline-mode scriptlet warning-as-error
(the ONE we want to ignore)
- unpack_error — payload corruption (genuinely fatal)
- error — generic transaction error (genuinely fatal)
By swallowing all four we silently masked grub2-efi-x64's posttrans
failure mid-install. /boot/efi/EFI/fedora/ ended up incomplete →
gen_grub_cfgstub then failed at the bootloader install phase with
"gen_grub_cfgstub script failed" because its `set -eu` script
couldn't read the missing files.
v0.5.29 narrows the patch: override only the `script_error` callback
inside transaction_progress.py to log a warning and NOT enqueue
'error'. The consumer (`process_transaction_progress`) reverts to
upstream behaviour where cpio_error / unpack_error / error still
raise PayloadInstallationError. Real install-fatal events keep
aborting; only the F43-RPM-6.0 scriptlet regression is silenced.
The patch is applied via `python3 -c` regex rewrite (more robust
than nested sed across multi-line method bodies).
## 2. LUKS UX — `tries=5,timeout=0` (FIX)
Default cryptsetup-generator unit allows ONE passphrase try with a
1m30s wait. One typo on a long passphrase = wait 1m30s, then the
device-wait timer trips, then dracut emergency shell after 3min total.
Brutal. Adding `rd.luks.options=luks-XXX=tries=5,timeout=0` gives
five typo-friendly retries with no auto-timeout.
## 3. fbcon=nodefer on installed-system cmdline (FIX)
Live ISO cmdline already has `fbcon=nodefer` (added in v0.5.27 to fix
the real-laptop black-screen-after-dracut). The installed-system
bootloader directive in the generated install ks did NOT carry it.
Same KMS handoff happens on the installed system on the same hardware.
Now both have the flag.
## 4. /etc/crypttab fallback assertion (BELT-BRACES)
Anaconda's custom-partitioning code path normally writes /etc/crypttab
for `--encrypted` part directives, but edge cases have been observed in
F43+ where it doesn't. Without crypttab, systemd-cryptsetup-generator
can still work from kernel cmdline alone, but cleanup paths and
second-stage unlock both fall over. Add a fallback `echo` that writes
the canonical line if it's missing post-anaconda.
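
A sketch of such a fallback, under assumptions: the helper name is hypothetical, and the `luks-<uuid>` mapping name and `discard` option are illustrative choices. Only the four-field layout (name, device, key file, options) comes from crypttab(5); the line the installer actually writes is not shown here.

```shell
# Append a crypttab entry for a LUKS UUID unless one already exists.
# $1 = crypttab path, $2 = LUKS UUID
ensure_crypttab_entry() {
    file=$1
    uuid=$2
    if [ -f "$file" ] && grep -qF -- "UUID=$uuid" "$file"; then
        return 0                      # entry already present, leave as-is
    fi
    printf 'luks-%s UUID=%s none discard\n' "$uuid" "$uuid" >> "$file"
}
```

Checking before appending keeps the step safe to re-run and avoids duplicating an entry that Anaconda did write.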
## 5. Initramfs LUKS module assertion (DEFENSIVE)
Force-include `crypt + systemd-cryptsetup + plymouth` modules in
initramfs via /etc/dracut.conf.d/10-veilor-luks.conf. dracut autodetects
these when it sees an active LUKS mapping, but %post runs before the
LUKS state is fully observable from the chroot. Plus we wipe stale
initramfs (`rm -f /boot/initramfs-*.img`) before `--regenerate-all`
so the regen actually rewrites bytes. Final assertion runs
`lsinitrd | grep -q cryptsetup` and surfaces a [ERR] line in build
output if the module didn't make it.
## What this should fix
After the man-db fix in v0.5.28-final, install proceeded past
"Configuring xxx" cleanly but died at "Installing boot loader" with
gen_grub_cfgstub. Root-cause was the over-broad patch from #1 above.
After v0.5.29:
- Install transaction completes (man-db excluded; non-man-db
scriptlet warnings still suppressed; real errors still raise)
- gen_grub_cfgstub runs against complete /boot/efi/EFI/fedora/
- Bootloader install completes
- Reboot to disk lands at GRUB veilor-os entry
- Kernel + initramfs load (cryptsetup confirmed present)
- Plymouth LUKS prompt appears with text fallback
- User has 5 tries, no timeout
- Unlock → btrfs subvol mount → systemd → SDDM
Files: kickstart/veilor-os.ks (+45 lines), overlay/usr/local/bin/veilor-installer (+50 lines).
Verified: bash -n clean, ksvalidator clean.
References:
pyanaconda transaction_progress.py:110-136 (4 producers of 'error')
pyanaconda bootloader/efi.py:194-201 (gen_grub_cfgstub call site)
/usr/bin/gen_grub_cfgstub (set -eu wrapper for grub2-mkconfig stub)
Fedora wiki Changes/RPM-6.0
dnf5 issue #2507 (RPM 6.0 scriptlet propagation regression)
THE actual root cause of the man-db transaction failure that killed
three consecutive VM installs (v0.5.26 / v0.5.27 / v0.5.28).
Confirmed via 7-agent research wave:
- Fedora 43 ships RPM 6.0, which changed scriptlet failure
propagation. Scriptlets that previously emitted "Non-critical
error" warnings now bubble up as transaction-level errors. dnf5
issue #2507 documents the change. Anaconda --cmdline mode treats
any 'error' token from the dnf transaction as a fatal abort.
- man-db's `transfiletriggerin` is the canonical trigger: it runs
`systemd-run /usr/bin/systemctl start man-db-cache-update` which
returns non-zero in the anaconda chroot (no PID 1 systemd) and is
flagged as transaction-level error under RPM 6.0.
- We previously patched anaconda's transaction_progress.py on the
BUILD HOST so livecd-creator could finish its own transaction.
That patch lives only on the host running the build — never landed
in the live rootfs the user installs from. Reproduced 3 times:
install-time anaconda on the live ISO is unpatched, hits the same
code path, aborts at exactly "Configuring man-db.x86_64".
Two-layer fix:
1. kickstart %post seds the file inside the live rootfs at build time
so the user's install-time anaconda is patched. Sed downgrades the
'error' token from raise PayloadInstallationError to log.warning.
2. Generated install ks excludes man-db / man-pages / man-pages-overrides
from %packages. Belt-and-braces — even if the patch has an edge
case the trigger never fires because the package isn't installed.
Users install man pages post-firstboot.
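For the second layer, the exclusion lines in the generated install ks might be produced by something like the sketch below. The helper name `emit_man_excludes` is hypothetical; the package names are the three listed above, and the leading `-` is the standard kickstart `%packages` syntax for leaving a package out.

```shell
# Print the %packages exclusion lines for the generated install kickstart.
# A leading '-' inside %packages tells anaconda not to install the package.
emit_man_excludes() {
    for pkg in man-db man-pages man-pages-overrides; do
        printf '%s\n' "-$pkg"
    done
}
```

The output is meant to be spliced between the `%packages` and `%end` lines of the generated kickstart.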
Previous attempts that didn't work: dropping the updates repo (only
narrowed the set of failing scriptlets, didn't fix the underlying
RPM-6.0 propagation change); flipping SELinux to permissive
(confirmed not the cause; kickstart's selinux directive only writes
/etc/selinux/config in target root, doesn't affect installer-time).
Follow-up for next release: replicate the transaction_progress patch
in the CI workflow's container so the build itself is deterministic.
Until then the workflow only goes green by luck.
Files: kickstart/veilor-os.ks (+25 lines), overlay/usr/local/bin/veilor-installer (+10 lines).
Verified: bash -n clean, ksvalidator clean.
Anaconda's transaction died at "Configuring man-db.x86_64" in both
v0.5.26 and v0.5.27 VM tests, reliably, days apart, against a freshly
populated package cache. Same failure pattern, same package, with
nothing in the visible error other than "The transaction process has
ended with errors..". Pattern matches the same Fedora `updates` repo
issue that the CI build kickstart already worked around by stripping
the `updates` line entirely (`.github/workflows/build-iso.yml` line
~109).
The installer-generated kickstart was adding the line back and
re-introducing the bug for every user install. This commit aligns
the install-time ks with the build-time ks: only the base `releases`
repo is consumed by anaconda. Users who want updates run `dnf
upgrade` post-install (or the v0.6 `veilor-update` wrapper).
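One way to strip the line, sketched as a stdin filter. The function name is hypothetical and the exact spelling of the `repo` directive in the generated ks is an assumption; the pattern only matches a repo named exactly `updates`, so a repo such as `updates-testing` would be left alone.

```shell
# Drop any `repo` directive named "updates" from a kickstart read on stdin.
# Assumes the directive looks like: repo --name=updates ...
strip_updates_repo() {
    sed -E '/^repo[[:space:]].*--name="?updates"?([[:space:]]|$)/d'
}
```

For example, `strip_updates_repo < in.ks > out.ks` keeps every other line of the kickstart unchanged.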
Trade-off: first-boot package versions are frozen to the Fedora 43
release date instead of including post-release updates. Acceptable —
the alternative is "install reliably fails" which makes any
freshness conversation moot.
Verified locally: `bash -n` passes, ks template still well-formed.
End-to-end re-validation goes through the next CI ISO + VM test run.
Install-flow change + roadmap update. The roadmap entry is the
durable record; the code change is the immediate effect.
## Locale picker removed
The "[4/4] Locale" prompt is gone. Locale is hardcoded to en_US.UTF-8
for the install. Two reasons:
1. The picker only offered en_GB and en_US, both of which install
identically apart from the langtag string and a couple of date /
currency conventions that nobody who's mid-install is thinking
about. It's a fake choice that adds a screen.
2. `localectl set-locale` post-install handles every locale on earth
in one command. The v0.7 `veilor-postinstall` first-login menu (see
roadmap below) will offer a locale + keyboard layout switch with
live preview, which is the right place for that decision.
Step counters updated [1/4]→[1/3], [2/4]→[2/3], [3/4]→[3/3]. The Locale
row stays in the confirm-summary box because users still want to see
what they're getting installed.
## Roadmap
- New section v0.5.27–v0.5.28 — documents the install-path
stabilisation work explicitly so the bridge between "first green
ISO" and "looks polished" is not invisible. Calls out the LUKS BLS
fix that landed in v0.5.27 + the gum-input replacement scheduled
for v0.5.28.
- v0.6 — `veilor-doctor` description expanded: this is the
post-install audit tool. Every user runs it weekly to see drift
from baseline.
- v0.6 — new entry `veilor-postinstall`: EndeavourOS-style first-login
welcome menu, single TUI screen, asks once. Covers the "I just
installed, what do I configure" gap in one explicit step instead of
scattered docs.
Critical install bug fix + cosmetic round-up + first formal test
procedure document.
## Critical: LUKS unlock on first boot
Generated installer kickstart's %post was injecting `rd.luks.uuid=…`
into `/etc/default/grub` only. Fedora 43 uses BLS (Boot Loader
Specification) entries in `/boot/loader/entries/*.conf`; those are
NOT regenerated by `grub2-mkconfig`. Result: the kernel boots without
`rd.luks.uuid=`, dracut's cryptsetup-generator never spawns the
unlock unit, plymouth has no password to ask for, and dracut-initqueue
loops on dev-disk-by-uuid for ~3min before dropping to emergency
shell.
The fix layers both write paths:
- `/etc/default/grub` — keeps the args around for future kernels
(kernel-install reads this when adding new entries).
- `grubby --update-kernel=ALL --args=...` — rewrites the `options`
line of every existing BLS entry so the kernel that boots NEXT
actually has the args.
Verified by reading `/proc/cmdline` from the dracut emergency shell
on a v0.5.26 install; old cmdline had only `root=UUID=… ro
rootflags=subvol=root` and was missing the LUKS arg entirely.
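The two write paths could be sketched as follows. Both function names are hypothetical, and the argument value in the usage line is a placeholder, not the real UUID. `grubby --update-kernel=ALL --args=...` is the call named above; it needs root and is kept in its own function so the file edit can be exercised separately.

```shell
# Append a kernel argument to GRUB_CMDLINE_LINUX in a grub defaults file,
# unless it is already there. $1 = file, $2 = argument.
add_default_grub_arg() {
    if grep -qF -- "$2" "$1"; then
        return 0
    fi
    sed -i "s|^GRUB_CMDLINE_LINUX=\"\(.*\)\"|GRUB_CMDLINE_LINUX=\"\1 $2\"|" "$1"
}

# Apply the same argument to the options line of every existing BLS entry.
# Requires root and grubby; not run automatically here.
update_bls_entries() {
    grubby --update-kernel=ALL --args="$1"
}
```

In %post both would be called with the same argument, for example `add_default_grub_arg /etc/default/grub "$arg" && update_bls_entries "$arg"`.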
## GRUB / branding
- `/etc/default/grub` is sed'd to `GRUB_DISTRIBUTOR="veilor-os"` (was
already there, kept).
- BLS entries' `title` line is rewritten in-place to "veilor-os
(<kver>)" for every kernel — `grub2-mkconfig` does not touch BLS
titles, so this is the only path.
- `/boot/loader/entries/*-0-rescue-*.conf` is removed: the auto-built
rescue entry was leaking "Fedora Linux" into the GRUB menu and
showing a second boot option that nobody asked for. The rescue
kernel image itself is left in /boot.
- Hostname defaults to `veilor` (was inheriting the `localhost-live`
name anaconda writes when the kickstart's network directive is
ignored under cmdline mode).
- `/etc/machine-info` adds `PRETTY_HOSTNAME="veilor-os"` so
`hostnamectl status` and any consumer reading machine-info see the
brand.
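The BLS title rewrite and rescue-entry removal from the list above might look roughly like this. The function names are hypothetical, and taking `<kver>` from each entry's own `version` line is an assumption about how the real script derives it.

```shell
# Rewrite the title of one BLS entry to "veilor-os (<kver>)".
# <kver> is read from the entry's own `version` line (assumed source).
brand_bls_entry() {
    kver=$(sed -n 's/^version[[:space:]]\{1,\}//p' "$1" | head -n 1)
    sed -i "s|^title[[:space:]].*|title veilor-os ($kver)|" "$1"
}

# Brand every entry under a directory, removing rescue entries first.
# $1 = entries directory (on the installed system: /boot/loader/entries).
brand_bls_dir() {
    rm -f "$1"/*-0-rescue-*.conf
    for entry in "$1"/*.conf; do
        [ -f "$entry" ] || continue
        brand_bls_entry "$entry"
    done
}
```

Only the `.conf` entry is deleted for the rescue kernel; the image in /boot is not touched, matching the behaviour described above.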
## Boot UX
- `fbcon=nodefer` added to live-ISO bootloader cmdline. On real
laptops with a hardware GPU, the kernel modeset blanks the
framebuffer console mid-boot; without `nodefer` the installer
banner draws into a frozen framebuffer and the user sees a black
screen with a blinking cursor for ~30s. virtio-vga in QEMU doesn't
trigger this so it never reproduced in VM. Symptom report on
v0.5.26 was the trigger to investigate.
## Installer cosmetics
- `GUM_CHOOSE_CURSOR` and `GUM_INPUT_PROMPT` switched from `❯ ` to
`> `. The unicode arrow falls back to a fixed-width block on the
linux fbcon font and lipgloss then duplicates that block at col +23,
producing the "Install Install" double-render and the stray-T
artifact in password fields. Plain ASCII renders identically across
fbcon, virtio-vga, and X/Wayland gum runs.
- `VERSION_ID` bumped 0.5.8 → 0.5.27 in the os-release drop-in. The
installer banner reads this at runtime, so the live ISO + installed
system both now show "veilor-os 0.5.27".
## Test procedure
- `test/TESTING.md` — first canonical test procedure document. Splits
VM (cheap iteration, hybrid sendkey + human passwords) from real
hardware (mandatory for tag). Documents the standard test passwords
(`veilortest1` for both LUKS and admin), the kill-and-relaunch step
to skip CD on second boot, and the per-step pass/fail contract.
- `test/METHOD-CHANGELOG.md` — append-only audit trail for changes to
the procedure. Future releases that alter the test method must add
an entry here with the why.
- `test/test-runs/_TEMPLATE.md` — per-run report template. Each
tagged release should land a filled report alongside it.
## test/run-vm.sh
Decoupled QEMU monitor sock setup from auto-inject. Previously
`NO_INJECT=1` (used to suppress autotype noise into prompts) also
killed the monitor sock, leaving the VM undriveable. Monitor sock is
now always exposed; only the inject helper is gated on the pubkey
detection.
User hit `/usr/local/bin/veilor-installer: line 33: /usr/bin/tee:
input/output error` on real-hardware install. Cause: LOG was
`/var/log/veilor-installer.log`, which on the live ISO is backed by an
overlay over squashfs. A bad sector / flaky USB → tee write fails →
process substitution dies → installer aborts before the menu renders.
Two changes:
1. Move LOG to /run/veilor-installer.log, which is pure tmpfs and never
touches the live medium. The path is also unaffected by a full /var
or overlay faults.
2. Wrap the `exec > >(tee -a $LOG) 2>&1` redirect in a writability
probe. If the log can't be appended to (tmpfs OOM, fd exhaustion,
anything), skip the tee and run the installer without on-disk
persistence rather than crashing.
Persistence is a nice-to-have for post-mortem debugging; the installer
running is the must-have. This puts the two priorities in the right order.
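A minimal sketch of the writability probe, assuming a helper named `log_writable` (hypothetical). The append test runs in a subshell so that a failed redirection cannot abort the installer itself.

```shell
# Return 0 if the log path can be appended to, non-zero otherwise.
# The subshell contains any redirection failure; stderr is discarded.
log_writable() {
    ( : >> "$1" ) 2>/dev/null
}

# Intended use in the installer (bash, because of process substitution):
#   LOG=/run/veilor-installer.log
#   if log_writable "$LOG"; then
#       exec > >(tee -a "$LOG") 2>&1
#   fi
```

When the probe fails, the `exec` line is skipped and the installer keeps writing to the console only.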