veilor-os — Proof of Work

What this file is: a single document that summarises the depth of work, tooling traversed, and engineering decisions behind veilor-os. Receipts, not narrative: every claim links back to a commit, an error, or a config.

Author: P M (s8n-ru on Forgejo) · Last updated: 2026-05-06


At a glance

| Metric | Number |
| --- | --- |
| Git commits on main | 134+ |
| Distinct release versions iterated | 32 (v0.1 → v0.5.32) |
| Pull requests reviewed and merged | 11 |
| Documented build failure classes hit and fixed | 35+ (live ISO build, Forgejo CI, OCI signing) |
| Lines of operator-authored kickstart | 400+ (kickstart/veilor-os.ks) |
| Lines of overlay shell hardening scripts | ~1500 across scripts/*.sh |
| Lines of TUI installer (overlay/usr/local/bin/veilor-installer) | ~950 bash, gum + whiptail fallback |
| Self-hosted infra services touched | 28 Docker containers on nullstone |
| Concurrent dev agents orchestrated in single waves | up to 9 |

Distros / projects studied or layered on

| Project | Role in veilor-os |
| --- | --- |
| Fedora 43 KDE | Base OS for v0.5.x kickstart-installed flat builds |
| secureblue | Upstream hardened atomic Fedora; v0.7 BlueBuild spike layers our overlay on top of securecore-kinoite-hardened-userns |
| Kicksecure / Whonix | Reference for AppArmor + apt-transport-tor model (we don't ship Tor; we did read their docs) |
| Bluefin / Bazzite (uBlue) | Reference for BlueBuild recipe shape and OCI publishing pattern |
| Tails | Reference for live-only install model; explicitly not veilor's path |
| Qubes OS | Reference for hardware partitioning model; explicitly out of scope |
| Trivalent (secureblue) | Hardened Chromium, adopted at v0.6+ |
| Mullvad Browser | Tor-Browser-fork without Tor, adopted at v0.6+ |

veilor-os is not a fork of any of the above. It's a composition: Fedora kickstart for v0.5.x, secureblue OCI for v0.7+, with our own brand, installer (gum TUI), 3-mode power CLI, and Forgejo CI/release.


Tooling traversed

| Tool / system | Where it lives in the build | Notable issues hit |
| --- | --- | --- |
| Anaconda (Fedora installer) | drives kickstart install in chroot | RPM-6.0 cmdline-mode scriptlet error propagation regression; patched transaction_progress.py in CI |
| livecd-creator (livecd-tools) | builds the live ISO image | EFI dracut stanza bug: LABEL= instead of CDLABEL= → patched imgcreate/live.py in CI run |
| livemedia-creator (lorax) | dropped after 17 attempts (EFI/BOOT not built) | Switched to livecd-creator entirely |
| dracut | builds initramfs in chroot | LUKS module not pulled in by default → --regenerate-all in chroot %post |
| GRUB2 | bootloader install + cmdline | gen_grub_cfgstub failures, manual reinstall grub2-install + grub2-mkconfig in install %post |
| Plymouth | boot splash | Disabled (plymouth.enable=0) so LUKS prompt is visible; theme details for v0.7+ |
| SDDM | KDE display manager | livecd-creator skips the display-manager.service symlink; stub fixfiles + setenforce in firstboot |
| PAM | login auth | nullok on SDDM, blank-pw + chage -d 0 to force password set on first boot |
| gum (charm.sh) | TTY1 TUI installer | bubbletea cursor render glitch on linux fbcon; replaced password input with bash read -srp |
| whiptail | TUI fallback when gum missing | one-line fallback path |
| systemd | unit ordering, presets | system-systemd\x2dcryptsetup.slice doesn't exist; non-fatal preset warning, suppressed |
| firewalld | default-drop zone, ssh allow | kept (PackageKit/avahi/cups runtime-disabled, not depsolve-removed) |
| USBGuard | default-block USB | id-based rules.conf, hash-based broke on dock replug |
| fail2ban + auditd | runtime IDS + audit log | full ruleset on passwd/shadow/sudoers/ssh/cron/sysctl/kernel modules |
| chrony | NTS-authenticated NTP | Cloudflare + NETNOD pool |
| systemd-resolved | DNS-over-TLS | Cloudflare + Quad9 fallback, LLMNR off |
| SELinux | targeted policy + custom veilor-systemd module | PCRE2 10.46 vs 10.47 host-vs-chroot regex mismatch; solved with selinux --permissive at build, enforcing on first-boot |
| AppArmor | deferred; not in Fedora 43 base | v0.7 secureblue OCI ships its own LSM stack |
| zram-generator | zram swap (no disk swap) | works |
| btrfs | / + /home subvols inside LUKS2 | works |
| LUKS2 | aes-xts-plain64 + argon2id | mem=1GB, time=9, threads=4; manually tuned |
| xorriso | ISO wrap + graft | extract original boot stanza via -report_el_torito as_mkisofs, replay flags via eval to handle word-splitting |
| Sigstore / cosign | keyless OIDC signing | doesn't work on Forgejo (no Fulcio-trusted issuer); gated to GitHub-only, key-pair signing planned |
| anchore/sbom-action | SBOM SPDX | pinned to v0.17.2 (last node20-shipping release) |
| actions/attest-build-provenance | SLSA L3 build provenance | pinned to v2.2.3 |
| BlueBuild | OCI image build for v0.7 spike | recipe ready, ostreecontainer kickstart directive validated |
| bootc | atomic upgrades for v1.0 | target tooling, bootc upgrade instead of dnf upgrade |
| Forgejo + act_runner | self-hosted git + CI | runner inside container with userns-remap host caused 13-step debug chain |
| Tailscale + Headscale | private mesh | for friend-PC GPU offload + admin SSH |
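
The xorriso row mentions replaying the original boot stanza through eval so that quoted values survive word-splitting. A minimal sketch of that idea follows, assuming the report consists of shell-quoted mkisofs options; the function names, the BOOT_FLAGS variable, and the sample stanza are hypothetical and not taken from the veilor-os build scripts.

```shell
#!/usr/bin/env bash
# Sketch only: function names, BOOT_FLAGS and the sample stanza are
# illustrative assumptions, not the project's actual build code.

# Print the boot-related mkisofs options recorded in an existing ISO.
# xorriso emits them shell-quoted, one or more options per line.
report_boot_flags() {
    xorriso -indev "$1" -report_el_torito as_mkisofs 2>/dev/null
}

# Turn a shell-quoted report into the global array BOOT_FLAGS.
# eval is used so a quoted value such as 'VEILOR OS' stays one argument
# instead of being split on the space. Only feed it output you trust.
flags_to_array() {
    local joined
    joined=$(printf '%s' "$1" | tr '\n' ' ')
    eval "BOOT_FLAGS=( $joined )"
}

# Invented sample in the same shape as a real report.
sample="-V 'VEILOR OS'
-e '/images/efiboot.img'
-no-emul-boot"

flags_to_array "$sample"
printf '%s\n' "${#BOOT_FLAGS[@]}"   # prints 5
```

With the array populated from a real ISO, the flags could be handed back to xorriso as "${BOOT_FLAGS[@]}" when wrapping the rebuilt image, which keeps a volume label containing spaces intact.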

Build failure classes encountered (and beaten)

Numbered ledger of every distinct failure mode, in approximate order of discovery. Each row is one bug class — many were hit dozens of times in permutation before the underlying root cause was understood.

Phase A — local + livemedia-creator (v0.1 → v0.2.0)

| # | Symptom | Root cause | Fix |
| --- | --- | --- | --- |
| 1 | rootless podman btrfs / loop / sudo cache fights | rootless can't losetup; host CAP_SYS_ADMIN gate | Switched to host-native lorax + NOPASSWD wheel |
| 2 | Kickstart parse: --title, text, multiline part, --hash | livemedia-creator + recent pykickstart deprecations | Rewrote ks |
| 3 | dnf depsolve: KDE hard-deps cups / geoclue2 / ModemManager / PackageKit | KDE Plasma 6 transitively pulls them in | Kept packages, mask daemons at runtime |
| 4 | Anaconda merges all repos, cost/includepkgs ignored | upstream Anaconda repo-merge logic | Local fix-repo at cost=1 to force selection |
| 5 | scriptlet warning RC=5 (selinux/pcre2 regex skew) | host libselinux 10.46 vs chroot's selinux-policy file_contexts.bin built against 10.47 | fix-repo provides matched 10.47 pair |
| 6 | dnf transaction RC=5 on non-critical scriptlet | RPM-6.0 cmdline-mode regression | Patched anaconda transaction_progress.py in CI |
| 7 | services config: services --enabled=veilor-firstboot before unit installed | Anaconda services runs before %post overlay copy | Move systemctl enable into %post |
| 8 | overlay copy: %post --nochroot SRC path wrong | livecd-creator vs livemedia-creator differ on INSTALL_ROOT vs /mnt/sysimage | Multi-path detection in %post |
| 9 | ISO wrap: grub2-mkimage missing i386-pc | missing grub2-pc-modules | Added |
| 10 | ISO wrap: xorrisofs missing EFI/BOOT | livemedia-creator --make-iso --no-virt template gap | Pivoted to livecd-creator |
| 11 | livecd-creator: Failed to find package 'fontconfig' | livecd-creator repo-discovery differs | Repaired via direct baseurl not mirrorlist |
| 12 | dracut hangs on parse-livenet | livecd-creator EFI stanza writes live:LABEL= instead of live:CDLABEL= | sed-patch imgcreate/live.py in CI |
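
Row 12 is fixed by a sed substitution, which can be sketched as a small filter. This is an illustration only: the function name is hypothetical and the sample line is an invented stand-in, not the real contents of imgcreate/live.py.

```shell
#!/usr/bin/env bash
# Sketch only: the sample line is invented; the real text inside
# imgcreate/live.py may look different.

# Rewrite live:LABEL= to live:CDLABEL= on stdin, leaving other text alone.
# Running it twice is harmless because the replacement no longer contains
# the search string.
patch_live_label() {
    sed 's/live:LABEL=/live:CDLABEL=/g'
}

sample='linux /images/pxeboot/vmlinuz root=live:LABEL=veilor-os rd.live.image'
printf '%s\n' "$sample" | patch_live_label
```

In a CI step the same expression would be applied in place with sed -i against the installed live.py, whose location depends on the runner's Python site-packages path.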

Phase B — boot UX + LUKS + theming (v0.2.4 → v0.5.27)

| # | Symptom | Root cause | Fix |
| --- | --- | --- | --- |
| 13 | init_on_alloc/free 5x KVM live-boot time | every page zeroed on alloc/free, brutal in vCPU | Drop from live cmdline; firstboot patches GRUB to re-enable for installed system |
| 14 | LUKS prompt invisible | Plymouth swallows TTY | plymouth.enable=0 for live; details theme for installed |
| 15 | Plymouth services not maskable in chroot | systemctl mask N/A under chroot | /dev/null symlinks |
| 16 | LUKS dracut module missing | Default dracut config doesn't pull crypt | --regenerate-all in chroot post |
| 17 | rd.luks.uuid not in cmdline | Anaconda doesn't write it for our partition layout | grubby --update-kernel ALL --args=rd.luks.uuid=... in chroot post |
| 18 | Kernel-install on chroot overwrites cmdline | systemd kernel-install writes its own /etc/kernel/cmdline | Switch to --config /etc/kernel/cmdline flow |
| 19 | rescue glob in firstboot: set -e killed loop | unmatched glob | shopt -s nullglob |
| 20 | fbcon blanks during KMS modeset on real hardware | i915/amdgpu/nvidia driver loads, blanks fb | fbcon=nodefer i915.modeset=1 amdgpu.modeset=1 nvidia-drm.modeset=1 |
| 21 | gum cursor render glitch (duplicate-Install + stray-T) | bubbletea cursor-hide vs linux fbcon terminfo | Replace gum input --password with read -srp |
| 22 | Generated install ks updates repo 404 | zchunk Fedora mid-push window | Strip repo --name=updates from generated ks |
| 23 | Anaconda payload module crash on LANG env | unset env in TTY1 service | export LANG=en_US.UTF-8 before exec |
| 24 | Anaconda --cmdline + XDG_RUNTIME_DIR missing | TTY1 has no XDG runtime dir | Create + export pre-exec |
| 25 | LVM pulled into installer ks unintentionally | default partitioning | Drop LVM, native btrfs-on-LUKS |
| 26 | sshd UseDNS yes 30s banner timeout in NAT/slirp | reverse DNS unreachable in QEMU user-net | UseDNS no in sshd_config.d |
| 27 | os-release branding overrides not visible to login banner | motd not regenerated | update-motd in firstboot |
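
Row 19 (an unmatched glob killing a loop under set -e) is a common bash trap. The sketch below shows the nullglob fix in isolation; the function name and the vmlinuz-0-rescue-* pattern are assumptions chosen for illustration, since the actual firstboot loop is not reproduced in this document.

```shell
#!/usr/bin/env bash
set -e
# Sketch only: the function name and file pattern are illustrative
# assumptions, not the real firstboot code.

# Count rescue kernel images in a directory. Without nullglob, an unmatched
# pattern stays as a literal string, the loop body runs once on a path that
# does not exist, and any failing command on it aborts the script under
# set -e. With nullglob the pattern expands to nothing and the loop is
# skipped.
count_rescue_images() {
    local dir="$1" count=0 f
    shopt -s nullglob
    for f in "$dir"/vmlinuz-0-rescue-*; do
        count=$((count + 1))
    done
    shopt -u nullglob
    printf '%s\n' "$count"
}

tmp=$(mktemp -d)
count_rescue_images "$tmp"            # prints 0 (empty directory)
touch "$tmp/vmlinuz-0-rescue-abc"
count_rescue_images "$tmp"            # prints 1
rm -rf "$tmp"
```

The function switches nullglob back off before returning so that the option does not leak into unrelated code that relies on the default behaviour.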

Phase C — Forgejo CI + ISO publishing (v0.5.32, current)

The 13-step debug chain is documented separately in docs/CI-PIPELINE-FAILURES.md (live in conversation log).

Highlights:

  • userns-remap=default on host docker daemon collides with privileged + image perms
  • Forgejo runner inside container creates docker-in-docker workspace bind path mismatch
  • Sigstore Fulcio keyless signing assumes GH OIDC issuer; gated to GH-only
  • cosign / sbom / attest actions floating tags now node24, runner is node20 → all pinned
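
The last highlight (floating action tags moving to node24 while the runner stays on node20) suggests a lint that flags unpinned references. The following is a rough sketch of such a check, not something the repo is known to contain; the function name is hypothetical, and treating a bare major tag or a main/master ref as "floating" is an assumption.

```shell
#!/usr/bin/env bash
# Sketch only: the function name and the definition of "floating"
# (bare major tag, or main/master) are assumptions.

# Print every `uses:` line whose ref is a bare major tag or a branch name.
# Exit status 0 means at least one floating reference was found, 1 means none.
# Lines with a trailing comment after the ref are not detected.
find_floating_actions() {
    grep -rnE 'uses:[[:space:]]*[^[:space:]]+@(v[0-9]+|main|master)[[:space:]]*$' "$@"
}

tmp=$(mktemp -d)
cat > "$tmp/build.yml" <<'EOF'
steps:
  - uses: actions/attest-build-provenance@v2.2.3
  - uses: anchore/sbom-action@v0
EOF
find_floating_actions "$tmp" || true   # reports only the @v0 line
rm -rf "$tmp"
```

A fully pinned tag such as @v2.2.3 or a commit SHA does not match, so a workflow directory that passes this check cannot silently pick up a new runtime through a moved major tag.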

Key engineering decisions (and why)

1. Hybrid kickstart-bootstrap + bootc OCI strategy

Locked at v0.7 spike. Reasons:

  • Kickstart (v0.5.x) gives a familiar Anaconda LUKS install flow, single-prompt UX, drop-in replacement for stock Fedora KDE installer.
  • OCI image (v0.7+) lets us layer on top of secureblue's already-signed hardened base. We don't re-derive AppArmor / Trivalent / custom SELinux; we inherit. Fedora bumps become image-version: 44 one-line edits, not multi-day debug sprints.
  • bootc-only (v1.0) retires kickstart entirely; atomic A/B upgrades, instant rollback, immutable system root.

2. Brand-clean from day one

grep -ri 'onyx\|192\.168\.0\.\|admin@\|fedora\.local\|xynki\.dev' kickstart/ overlay/ scripts/ assets/ returns zero hits. Enforced via .github/workflows/lint.yml brand-leak job. Every audit run, every CI run, every commit.
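
A minimal sketch of how the grep above could be wrapped as a pass/fail gate. The pattern is the one quoted in the text; the function name and the exit-code convention are assumptions, and the real brand-leak job in lint.yml may be structured differently.

```shell
#!/usr/bin/env bash
# Sketch only: the function name and exit codes are assumptions; the
# pattern is copied from the grep command quoted above.

LEAK_PATTERN='onyx\|192\.168\.0\.\|admin@\|fedora\.local\|xynki\.dev'

# Return 0 when the given paths are clean, 1 when any pattern matches,
# and 2 when grep itself failed (for example, a path does not exist).
# Matching lines are printed so the CI log shows what to fix.
check_brand_leaks() {
    local status=0
    grep -rin -- "$LEAK_PATTERN" "$@" || status=$?
    case "$status" in
        0) return 1 ;;   # matches found: leak
        1) return 0 ;;   # no matches: clean
        *) return 2 ;;   # grep error
    esac
}
```

Called as check_brand_leaks kickstart/ overlay/ scripts/ assets/ from the repository root, a non-zero return would fail the job. Separating the grep-error case avoids reporting "clean" when a directory was simply missing.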

3. Forgejo over GitHub for primary

Decision date: 2026-05-06. Drivers:

  • GitHub free tier compute caps were hitting on every ISO build
  • Operator wants to work privately by default; GH = always-public
  • Self-hosted Forgejo on nullstone gives unlimited build minutes, no third-party dep on the build path
  • Push-mirror to GH disabled — operator opts in per-repo when wanting public visibility

4. ssh tightening

AllowUsers user, password auth off, root login locked, X11 forwarding off, MaxAuthTries 3. Operator authenticates with ed25519 key only. Documented in feedback_nullstone_ssh_user.md memory.
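
The policy above maps onto a handful of sshd_config directives. The sketch below writes them to a drop-in file; the function name and target path are hypothetical, and the choice of PermitRootLogin no for "root login locked" and PubkeyAuthentication yes for key-only login is an assumption about how the policy is expressed.

```shell
#!/usr/bin/env bash
# Sketch only: the function name, the target path and the exact directive
# mapping are assumptions, not the project's shipped configuration.

# Write an sshd_config.d drop-in expressing the policy described above.
# "user" is the literal account name given in the text.
write_sshd_dropin() {
    local target="$1"
    cat > "$target" <<'EOF'
AllowUsers user
PubkeyAuthentication yes
PasswordAuthentication no
PermitRootLogin no
X11Forwarding no
MaxAuthTries 3
EOF
}

tmp=$(mktemp)
write_sshd_dropin "$tmp"
cat "$tmp"
rm -f "$tmp"
```

On a real host the file would live under /etc/ssh/sshd_config.d/, and running sshd -t before reloading the service catches syntax errors without locking anyone out.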

5. Defense-in-depth mesh

Tailscale + Headscale (hs.s8n.ru) is the SSH on-ramp. Every device joins the tailnet; public SSH is firewalled at the router. Friend GPU node (RTX 4080 in WSL2) reachable via tailnet IP — immune to ISP IP rotation.


What's been built that isn't in the kickstart

The repo carries more than just an ISO recipe:

| Path | What it is |
| --- | --- |
| kickstart/veilor-os.ks (400+ lines) | Live ISO ks, hand-authored, fully branded |
| overlay/etc/systemd/system/veilor-firstboot.service | TTY1 oneshot, prompts admin password on first boot |
| overlay/usr/local/bin/veilor-installer (~950 lines) | TTY1 TUI installer wrapping Anaconda + gum + whiptail fallback |
| overlay/usr/local/bin/veilor-power | 3-mode power CLI: save \| mid \| perf. Wires tuned profiles + EPP + governor + battery threshold + screen-dim policy in one cmd |
| overlay/etc/tuned/profiles/veilor-{powersave,balanced,performance}/ | Custom tuned profiles, not Fedora defaults |
| overlay/etc/udev/rules.d/{90-veilor-ac-switch,91-veilor-battery-threshold}.rules | Auto-switch power profile on AC/battery events |
| overlay/etc/usbguard/rules.conf | id-based default-block USB rules |
| overlay/etc/firewalld/zones/trusted.xml | tailscale0 trust override |
| overlay/etc/skel/.config/{kdeglobals,breezerc,kwinrc,konsolerc} | Pre-applied KDE black theme + Fira Code system font |
| scripts/10-harden-base.sh (~250 lines) | KDE Connect off, DNS-over-TLS, fail2ban + auditd setup |
| scripts/20-harden-kernel.sh (~300 lines) | sysctl, password-quality, NTS chrony, USBGuard, service prune |
| scripts/selinux/veilor-systemd.te | Custom SELinux module (targeted policy gap fixes) |
| scripts/30-apply-v03-theme.sh | Plymouth + SDDM + Konsole + wallpaper apply |
| scripts/40-apparmor.sh (deferred) | AppArmor profile load (complain-mode skeleton, sealed pending Fedora packaging or v0.7 secureblue) |
| bluebuild/recipe.yml | v0.7 OCI recipe (base = secureblue securecore-kinoite-hardened-userns) |
| kickstart/install-ostreecontainer.ks | v0.7 install ks: 10 lines, just ostreecontainer --url=ghcr.io/veilor-org/veilor-os:43 --transport=registry |
| assets/installer/{banner.txt,colors.gum} | Pure-block VEILOR OS wordmark + branded gum colour palette |
| assets/branding/ | Logo, wallpapers, plymouth theme assets |
| docs/STRATEGY.md (336 lines) | Full hybrid strategy + mesh + browser stack + Forgejo decision |
| docs/THREAT-MODEL.md (157 lines) | Threat model, in-scope, out-of-scope, mitigations table |
| docs/HARDENING.md (194 lines) | Full hardening reference |
| docs/ROADMAP.md (332 lines) | v0.5.x → v0.7 → v1.0 phased plan |
| docs/research/2026-05-05-agent-wave/ | 9-agent research wave findings on v0.5.32 blockers |
| test/TESTING.md + test/run-vm.sh + test/test-runs/ | Standardised hybrid VM test method, codified after v0.5.27 surfaced 4 regressions in one session |
| .github/workflows/{build-iso.yml,lint.yml,build-bluebuild.yml} | CI for v0.5.x flat ISO + v0.7 OCI image + brand-leak / shellcheck / kickstart syntax lint |
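
The veilor-power entry describes three modes backed by three custom tuned profiles. A minimal sketch of the dispatch step follows; the mapping of save, mid and perf onto veilor-powersave, veilor-balanced and veilor-performance is inferred from the names in the table, and the function is hypothetical rather than lifted from the real script.

```shell
#!/usr/bin/env bash
# Sketch only: the mode-to-profile mapping is inferred from the names in
# the table above, not copied from the real veilor-power script.

# Map a veilor-power mode to the matching custom tuned profile name.
# Unknown modes print a usage line on stderr and return 2.
mode_to_profile() {
    case "$1" in
        save) echo "veilor-powersave" ;;
        mid)  echo "veilor-balanced" ;;
        perf) echo "veilor-performance" ;;
        *)    echo "usage: veilor-power save|mid|perf" >&2; return 2 ;;
    esac
}

mode_to_profile mid   # prints veilor-balanced
```

A full implementation would presumably pass the result to tuned-adm profile and then apply the EPP, governor, battery-threshold and screen-dim settings the table mentions; those steps are left out here because their details are not given.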

CI infrastructure built on nullstone

Self-hosted from scratch on a single Debian 13 server. All running, all behind Traefik with LE certs via Gandi LiveDNS DNS-01.

| Service | Role | Notes |
| --- | --- | --- |
| Forgejo (git.s8n.ru) | git host + container registry | code 9.0.3 + gitea 1.22 underneath; INSTALL_LOCK=true; admin user s8n-ru (NOT admin, which is reserved) |
| forgejo-runner | act_runner v6.4.0, registered as nullstone label | privileged, userns_mode=host, custom Fedora-with-node image (veilor-build:43) |
| Custom build image | veilor-build:43 = fedora:43 + nodejs + git + sudo + curl | Built locally; act_runner needs node in job container |
| socket-proxy | Tecnativa docker-socket-proxy | Read-only docker API for monitoring |
| Traefik 3.x | Reverse proxy + ACME | Gandi DNS-01 cert; no-guest@file middleware blocks LAN-only services from public |
| Authentik | SSO + LDAP (auth.s8n.ru) | postgres + redis + worker stack |
| step-ca | Internal PKI | Used by all-internal mTLS where it lands |
| Tuwunel (Matrix) | matrix.veilor.uk Rust homeserver | Federation off, telemetry off, registration token-gated |
| Cinny | Matrix web client cinny.txt.s8n.ru | Second isolated instance |
| Misskey | Private Twitter rebrand at x.veilor | Custom theme via DB pg_read_file |
| n8n | Automation runner | Used for CI watchdogs and personal automations |
| Pi-hole | Local DNS sinkhole | DNS-over-TLS upstream |
| Headscale | Tailscale control plane | 4 nodes joined incl friend PC |
| AnythingLLM | Local LLM UI | Layer on Ollama + remote vLLM (friend PC RTX 4080) |
| filebrowser-mc | Static asset server | racked.ru launcher hosting |

Runtime UID layout: userns-remap=default shifted +100000. Backup script + ACL on docker.sock + group-add patterns documented in memory/feedback_docker_sudo_bypass.md.
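
The +100000 shift means a file owned by a given UID inside a container shows up on the host under that UID plus the remap base. A small sketch of the arithmetic, assuming the base is exactly 100000 as stated; on a real host the base comes from the dockremap entry in /etc/subuid, and the function name here is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: assumes the remap base is exactly 100000, as stated above.
REMAP_BASE=100000

# Translate a UID as seen inside a container to the UID that owns the
# corresponding files on the host. Non-numeric input returns 2.
host_uid() {
    case "$1" in
        ''|*[!0-9]*) echo "expected a numeric uid" >&2; return 2 ;;
    esac
    # 10# forces base-10 so a value with a leading zero is not read as octal.
    echo $(( REMAP_BASE + 10#$1 ))
}

host_uid 0      # container root: prints 100000
host_uid 1000   # prints 101000
```

This is why backup scripts and ACLs on bind-mounted paths have to target the shifted IDs rather than the ones visible inside the container.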


Receipts


What this took

This is a single-operator + AI-accelerated project. No team, no funding, no upstream maintainer hat. Most of the work happened across ~6 weeks of evenings and weekends. AI agents (Claude Opus 4.7, mainly) handle the parallel research, log diving, kickstart debug, and multi-file refactors; the operator drives strategy, makes the calls, runs the VM/hardware tests, owns the brand decisions, and pushes every commit.

The result is a hardened Linux distro that boots, installs cleanly, hardens itself, and ships through self-hosted CI — with a forward strategy that retires the legacy Fedora kickstart path in favour of a modern atomic OCI image stack, while crediting and building on top of the upstream secureblue work rather than forking it.

For comparison, a Fedora spin maintainer working part-time normally ships this much in 12 weeks of work. We did it once across a longer arc with deeper documentation, more strategy reversals, and zero personal/onyx leaks in the final ship state.