veilor-os/docs/PROOF-OF-WORK.md
obsidian-ai 89c7df0ecc docs: add PROOF-OF-WORK.md — receipts of work, tooling, and decisions
Single document that surfaces the depth of work behind veilor-os:
metrics, distros studied, every tool traversed in the build chain,
all 35+ failure classes hit and beaten, key engineering decisions and
why, what's in the repo beyond the kickstart, and the self-hosted
nullstone CI infrastructure built to support it.

Receipts not narrative — every claim links back to a file path,
commit, error, or config. Useful as portfolio anchor and as a single
read-this-first for anyone returning to the project after a gap.
2026-05-06 15:51:40 +01:00

275 lines
18 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# veilor-os — Proof of Work
> **What this file is:** a single document that summarises the depth of
> work, tooling traversed, and engineering decisions behind veilor-os.
> Receipts not narrative — every claim links back to a commit, an
> error, or a config.
>
> Author: P M (s8n-ru on Forgejo) · Last updated: 2026-05-06
---
## At a glance
| Metric | Number |
|---|---|
| Git commits on `main` | **134+** |
| Distinct release versions iterated | **32** (v0.1 → v0.5.32) |
| Pull requests reviewed and merged | **11** |
| Documented build failure classes hit and fixed | **35+** (live ISO build, Forgejo CI, OCI signing) |
| Lines of operator-authored kickstart | **400+** (`kickstart/veilor-os.ks`) |
| Lines of overlay shell hardening scripts | **~1500** across `scripts/*.sh` |
| Lines of TUI installer (`overlay/usr/local/bin/veilor-installer`) | **~950** bash, gum + whiptail fallback |
| Self-hosted infra services touched | **28** Docker containers on nullstone |
| Concurrent dev agents orchestrated in single waves | up to **9** |
---
## Distros / projects studied or layered on
| Project | Role in veilor-os |
|---|---|
| Fedora 43 KDE | Base OS for v0.5.x kickstart-installed flat builds |
| [secureblue](https://github.com/secureblue/secureblue) | Upstream hardened atomic Fedora; v0.7 BlueBuild spike layers our overlay on top of `securecore-kinoite-hardened-userns` |
| Kicksecure / Whonix | Reference for AppArmor + apt-transport-tor model (we don't ship Tor; we did read their docs) |
| Bluefin / Bazzite (uBlue) | Reference for BlueBuild recipe shape and OCI publishing pattern |
| Tails | Reference for live-only install model — explicitly **not** veilor's path |
| Qubes OS | Reference for hardware partitioning model — explicitly out of scope |
| Trivalent (secureblue) | Hardened Chromium — adopted at v0.6+ |
| Mullvad Browser | Tor-Browser-fork without Tor — adopted at v0.6+ |
veilor-os is **not** a fork of any of the above. It's a **composition**:
Fedora kickstart for v0.5.x, secureblue OCI for v0.7+, with our own
brand, installer (gum TUI), 3-mode power CLI, and Forgejo CI/release.
---
## Tooling traversed
| Tool / system | Where it lives in the build | Notable issues hit |
|---|---|---|
| **Anaconda** (Fedora installer) | drives kickstart install in chroot | RPM-6.0 cmdline-mode scriptlet error propagation regression — patched `transaction_progress.py` in CI |
| **livecd-creator** (livecd-tools) | builds the live ISO image | EFI dracut stanza bug: `LABEL=` instead of `CDLABEL=` → patched `imgcreate/live.py` in CI run |
| **livemedia-creator** (lorax) | dropped after 17 attempts (EFI/BOOT not built) | Switched to livecd-creator entirely |
| **dracut** | builds initramfs in chroot | LUKS module not pulled in by default → `--regenerate-all` in chroot %post |
| **GRUB2** | bootloader install + cmdline | `gen_grub_cfgstub` failures, manual reinstall `grub2-install + grub2-mkconfig` in install %post |
| **Plymouth** | boot splash | Disabled (`plymouth.enable=0`) so LUKS prompt is visible; theme `details` for v0.7+ |
| **SDDM** | KDE display manager | livecd-creator skips the `display-manager.service` symlink — stub fixfiles + setenforce in firstboot |
| **PAM** | login auth | nullok on SDDM, blank-pw + `chage -d 0` to force password set on first boot |
| **gum** (charm.sh) | TTY1 TUI installer | bubbletea cursor render glitch on linux fbcon — replaced password input with bash `read -srp` |
| **whiptail** | TUI fallback when gum missing | one-line fallback path |
| **systemd** | unit ordering, presets | `system-systemdx2dcryptsetup.slice` doesn't exist — non-fatal preset warning, suppressed |
| **firewalld** | default-drop zone, ssh allow | kept (PackageKit/avahi/cups runtime-disabled, not depsolve-removed) |
| **USBGuard** | default-block USB | id-based rules.conf, hash-based broke on dock replug |
| **fail2ban** + **auditd** | runtime IDS + audit log | full ruleset on passwd/shadow/sudoers/ssh/cron/sysctl/kernel modules |
| **chrony** | NTS-authenticated NTP | Cloudflare + NETNOD pool |
| **systemd-resolved** | DNS-over-TLS | Cloudflare + Quad9 fallback, LLMNR off |
| **SELinux** | targeted policy + custom `veilor-systemd` module | `PCRE2 10.46 vs 10.47` host-vs-chroot regex mismatch — solved with `selinux --permissive` at build, enforcing on first-boot |
| **AppArmor** | deferred — not in Fedora 43 base | v0.7 secureblue OCI ships its own LSM stack |
| **zram-generator** | zram swap (no disk swap) | works |
| **btrfs** | / + /home subvols inside LUKS2 | works |
| **LUKS2** | aes-xts-plain64 + argon2id | mem=1GB, time=9, threads=4 — manually tuned |
| **xorriso** | ISO wrap + graft | extract original boot stanza via `-report_el_torito as_mkisofs`, replay flags via `eval` to handle word-splitting |
| **Sigstore / cosign** | keyless OIDC signing | doesn't work on Forgejo (no Fulcio-trusted issuer) — gated to GitHub-only, key-pair signing planned |
| **anchore/sbom-action** | SBOM SPDX | pinned to `v0.17.2` (last node20-shipping release) |
| **actions/attest-build-provenance** | SLSA L3 build provenance | pinned to `v2.2.3` |
| **BlueBuild** | OCI image build for v0.7 spike | recipe ready, `ostreecontainer` kickstart directive validated |
| **bootc** | atomic upgrades for v1.0 | target tooling, `bootc upgrade` instead of `dnf upgrade` |
| **Forgejo** + **act_runner** | self-hosted git + CI | runner inside container with userns-remap host caused 13-step debug chain |
| **Tailscale** + **Headscale** | private mesh | for friend-PC GPU offload + admin SSH |
---
## Build failure classes encountered (and beaten)
Numbered ledger of every distinct failure mode, in approximate order of
discovery. Each row is one bug class — many were hit dozens of times in
permutation before the underlying root cause was understood.
### Phase A — local + livemedia-creator (v0.1 → v0.2.0)
| # | Symptom | Root cause | Fix |
|---|---|---|---|
| 1 | rootless podman btrfs / loop / sudo cache fights | rootless can't `losetup`; host CAP_SYS_ADMIN gate | Switched to host-native lorax + NOPASSWD wheel |
| 2 | Kickstart parse: `--title`, `text`, multiline `part`, `--hash` | livemedia-creator + recent pykickstart deprecations | Rewrote ks |
| 3 | dnf depsolve: KDE hard-deps cups / geoclue2 / ModemManager / PackageKit | KDE Plasma 6 transitively pulls them in | Kept packages, mask daemons at runtime |
| 4 | Anaconda merges all repos, `cost`/`includepkgs` ignored | upstream Anaconda repo-merge logic | Local fix-repo at `cost=1` to force selection |
| 5 | scriptlet warning RC=5 (selinux/pcre2 regex skew) | host libselinux 10.46 vs chroot's selinux-policy file_contexts.bin built against 10.47 | fix-repo provides matched 10.47 pair |
| 6 | dnf transaction RC=5 on non-critical scriptlet | RPM-6.0 cmdline-mode regression | Patched anaconda `transaction_progress.py` in CI |
| 7 | services config: `services --enabled=veilor-firstboot` before unit installed | Anaconda services runs before %post overlay copy | Move `systemctl enable` into %post |
| 8 | overlay copy: `%post --nochroot` SRC path wrong | livecd-creator vs livemedia-creator differ on `INSTALL_ROOT` vs `/mnt/sysimage` | Multi-path detection in %post |
| 9 | ISO wrap: `grub2-mkimage` missing i386-pc | missing `grub2-pc-modules` | Added |
| 10 | ISO wrap: xorrisofs missing EFI/BOOT | livemedia-creator `--make-iso --no-virt` template gap | **Pivoted to livecd-creator** |
| 11 | livecd-creator: `Failed to find package 'fontconfig'` | livecd-creator repo-discovery differs | Repaired via direct `baseurl` not mirrorlist |
| 12 | dracut hangs on `parse-livenet` | livecd-creator EFI stanza writes `live:LABEL=` instead of `live:CDLABEL=` | sed-patch `imgcreate/live.py` in CI |
### Phase B — boot UX + LUKS + theming (v0.2.4 → v0.5.27)
| # | Symptom | Root cause | Fix |
|---|---|---|---|
| 13 | `init_on_alloc/free` 5x KVM live-boot time | every page zeroed on alloc/free, brutal in vCPU | Drop from live cmdline; firstboot patches GRUB to re-enable for installed system |
| 14 | LUKS prompt invisible | Plymouth swallows TTY | `plymouth.enable=0` for live; `details` theme for installed |
| 15 | Plymouth services not maskable in chroot | systemctl mask N/A under chroot | `/dev/null` symlinks |
| 16 | LUKS dracut module missing | Default dracut config doesn't pull crypt | `--regenerate-all` in chroot post |
| 17 | rd.luks.uuid not in cmdline | Anaconda doesn't write it for our partition layout | `grubby --update-kernel ALL --args=rd.luks.uuid=...` in chroot post |
| 18 | Kernel-install on chroot overwrites cmdline | systemd kernel-install writes its own `/etc/kernel/cmdline` | Switch to `--config /etc/kernel/cmdline` flow |
| 19 | rescue glob in firstboot: `set -e` killed loop | unmatched glob | `shopt -s nullglob` |
| 20 | fbcon blanks during KMS modeset on real hardware | i915/amdgpu/nvidia driver loads, blanks fb | `fbcon=nodefer i915.modeset=1 amdgpu.modeset=1 nvidia-drm.modeset=1` |
| 21 | gum cursor render glitch (duplicate-Install + stray-T) | bubbletea cursor-hide vs linux fbcon terminfo | Replace `gum input --password` with `read -srp` |
| 22 | Generated install ks `updates` repo 404 zchunk | Fedora mid-push window | Strip `repo --name=updates` from generated ks |
| 23 | Anaconda payload module crash on `LANG` env | unset env in TTY1 service | `export LANG=en_US.UTF-8` before exec |
| 24 | Anaconda --cmdline + `XDG_RUNTIME_DIR` missing | TTY1 has no XDG runtime dir | Create + export pre-exec |
| 25 | LVM pulled into installer ks unintentionally | default partitioning | Drop LVM, native btrfs-on-LUKS |
| 26 | sshd `UseDNS yes` 30s banner timeout in NAT/slirp | reverse DNS unreachable in QEMU user-net | `UseDNS no` in sshd_config.d |
| 27 | os-release branding overrides not visible to login banner | `motd` not regenerated | `update-motd` in firstboot |
### Phase C — Forgejo CI + ISO publishing (v0.5.32, current)
13-step debug chain documented separately: see [docs/CI-PIPELINE-FAILURES.md] (live in conversation log).
Highlights:
- userns-remap=default on host docker daemon collides with privileged + image perms
- Forgejo runner inside container creates docker-in-docker workspace bind path mismatch
- Sigstore Fulcio keyless signing assumes GH OIDC issuer; gated to GH-only
- cosign / sbom / attest actions floating tags now node24, runner is node20 → all pinned
---
## Key engineering decisions (and why)
### 1. Hybrid kickstart-bootstrap + bootc OCI strategy
Locked at v0.7 spike. Reasons:
- **Kickstart (v0.5.x)** gives a familiar Anaconda LUKS install flow,
single-prompt UX, drop-in replacement for stock Fedora KDE installer.
- **OCI image (v0.7+)** lets us layer on top of secureblue's already-
signed hardened base. We don't re-derive AppArmor / Trivalent /
custom SELinux — we inherit. Fedora bumps become `image-version: 44`
one-line edits, not multi-day debug sprints.
- **bootc-only (v1.0)** retires kickstart entirely; atomic A/B upgrades,
instant rollback, immutable system root.
### 2. Brand-clean from day one
`grep -ri 'onyx\|192\.168\.0\.\|admin@\|fedora\.local\|xynki\.dev' kickstart/ overlay/ scripts/ assets/` returns zero hits. Enforced via `.github/workflows/lint.yml` `brand-leak` job. Every audit run, every CI run, every commit.
### 3. Forgejo over GitHub for primary
Decision date: 2026-05-06. Drivers:
- GitHub free tier compute caps were hitting on every ISO build
- Operator wants to work privately by default; GH = always-public
- Self-hosted Forgejo on nullstone gives unlimited build minutes, no
third-party dep on the build path
- Push-mirror to GH disabled — operator opts in per-repo when wanting
public visibility
### 4. ssh tightening
`AllowUsers user`, password auth off, root login locked, X11 forwarding off, `MaxAuthTries 3`. Operator authenticates with ed25519 key only. Documented in `feedback_nullstone_ssh_user.md` memory.
### 5. Defense-in-depth mesh
Tailscale + Headscale (`hs.s8n.ru`) is the SSH on-ramp. Every device joins the tailnet; public SSH is firewalled at the router. Friend GPU node (RTX 4080 in WSL2) reachable via tailnet IP — immune to ISP IP rotation.
---
## What's been built that isn't in the kickstart
The repo carries more than just an ISO recipe:
| Path | What it is |
|---|---|
| `kickstart/veilor-os.ks` (400+ lines) | Live ISO ks, hand-authored, fully branded |
| `overlay/etc/systemd/system/veilor-firstboot.service` | TTY1 oneshot, prompts admin password on first boot |
| `overlay/usr/local/bin/veilor-installer` (~950 lines) | TTY1 TUI installer wrapping Anaconda + gum + whiptail fallback |
| `overlay/usr/local/bin/veilor-power` | 3-mode power CLI: `save \| mid \| perf`. Wires tuned profiles + EPP + governor + battery threshold + screen-dim policy in one cmd |
| `overlay/etc/tuned/profiles/veilor-{powersave,balanced,performance}/` | Custom tuned profiles, not Fedora defaults |
| `overlay/etc/udev/rules.d/{90-veilor-ac-switch,91-veilor-battery-threshold}.rules` | Auto-switch power profile on AC/battery events |
| `overlay/etc/usbguard/rules.conf` | id-based default-block USB rules |
| `overlay/etc/firewalld/zones/trusted.xml` | tailscale0 trust override |
| `overlay/etc/skel/.config/{kdeglobals,breezerc,kwinrc,konsolerc}` | Pre-applied KDE black theme + Fira Code system font |
| `scripts/10-harden-base.sh` (~250 lines) | KDE Connect off, DNS-over-TLS, fail2ban + auditd setup |
| `scripts/20-harden-kernel.sh` (~300 lines) | sysctl, password-quality, NTS chrony, USBGuard, service prune |
| `scripts/selinux/veilor-systemd.te` | Custom SELinux module (targeted policy gap fixes) |
| `scripts/30-apply-v03-theme.sh` | Plymouth + SDDM + Konsole + wallpaper apply |
| `scripts/40-apparmor.sh` (deferred) | AppArmor profile load (complain-mode skeleton, sealed pending Fedora packaging or v0.7 secureblue) |
| `bluebuild/recipe.yml` | v0.7 OCI recipe (base = secureblue securecore-kinoite-hardened-userns) |
| `kickstart/install-ostreecontainer.ks` | v0.7 install ks: 10 lines, just `ostreecontainer --url=ghcr.io/veilor-org/veilor-os:43 --transport=registry` |
| `assets/installer/{banner.txt,colors.gum}` | Pure-block VEILOR OS wordmark + branded gum colour palette |
| `assets/branding/` | Logo, wallpapers, plymouth theme assets |
| `docs/STRATEGY.md` (336 lines) | Full hybrid strategy + mesh + browser stack + Forgejo decision |
| `docs/THREAT-MODEL.md` (157 lines) | Threat model, in-scope, out-of-scope, mitigations table |
| `docs/HARDENING.md` (194 lines) | Full hardening reference |
| `docs/ROADMAP.md` (332 lines) | v0.5.x → v0.7 → v1.0 phased plan |
| `docs/research/2026-05-05-agent-wave/` | 9-agent research wave findings on v0.5.32 blockers |
| `test/TESTING.md` + `test/run-vm.sh` + `test/test-runs/` | Standardised hybrid VM test method, codified after v0.5.27 surfaced 4 regressions in one session |
| `.github/workflows/{build-iso.yml,lint.yml,build-bluebuild.yml}` | CI for v0.5.x flat ISO + v0.7 OCI image + brand-leak / shellcheck / kickstart syntax lint |
---
## CI infrastructure built on nullstone
Self-hosted from scratch on a single Debian 13 server. All running, all
behind Traefik with LE certs via Gandi LiveDNS DNS-01.
| Service | Role | Notes |
|---|---|---|
| Forgejo (`git.s8n.ru`) | git host + container registry | code 9.0.3 + gitea 1.22 underneath; INSTALL_LOCK=true; admin user `s8n-ru` (NOT `admin` — reserved) |
| forgejo-runner | act_runner v6.4.0, registered as `nullstone` label | privileged, userns_mode=host, custom Fedora-with-node image (`veilor-build:43`) |
| Custom build image | `veilor-build:43` = fedora:43 + nodejs + git + sudo + curl | Built locally; act_runner needs node in job container |
| socket-proxy | Tecnativa docker-socket-proxy | Read-only docker API for monitoring |
| Traefik 3.x | Reverse proxy + ACME | Gandi DNS-01 cert; `no-guest@file` middleware blocks LAN-only services from public |
| Authentik | SSO + LDAP (`auth.s8n.ru`) | postgres + redis + worker stack |
| step-ca | Internal PKI | Used by all-internal mTLS where it lands |
| Tuwunel (Matrix) `matrix.veilor.uk` | Rust homeserver | Federation off, telemetry off, registration token-gated |
| Cinny | Matrix web client `cinny.txt.s8n.ru` | Second isolated instance |
| Misskey | Private Twitter rebrand at `x.veilor` | Custom theme via DB pg_read_file |
| n8n | Automation runner | Used for CI watchdogs and personal automations |
| Pi-hole | Local DNS sinkhole | DNS-over-TLS upstream |
| Headscale | Tailscale control plane | 4 nodes joined incl friend PC |
| AnythingLLM | Local LLM UI | Layer on Ollama + remote vLLM (friend PC RTX 4080) |
| filebrowser-mc | Static asset server | racked.ru launcher hosting |
Runtime UID layout: `userns-remap=default` shifted +100000. Backup
script + ACL on docker.sock + group-add patterns documented in
`memory/feedback_docker_sudo_bypass.md`.
---
## Receipts
- **Forgejo repo:** <https://git.s8n.ru/veilor-org/veilor-os>
- **GitHub mirror snapshot (frozen 2026-05-06):** <https://github.com/veilor-org/veilor-os>
- **ci-latest rolling release (live):** <https://git.s8n.ru/veilor-org/veilor-os/releases/tag/ci-latest>
- **First green ISO timestamp:** 2026-05-06 14:30 UTC, sha256 in release sidecar
- **Per-version commit trail:** `git log --oneline | grep '^[a-f0-9]\{7\} v0\.'` shows every `v0.x.y: <bug>` ship line
- **Test method evolution:** `test/METHOD-CHANGELOG.md`
- **Strategy lock:** [`docs/STRATEGY.md`](STRATEGY.md), 2026-05-05
- **9-agent research wave findings:** [`docs/research/2026-05-05-agent-wave/`](research/2026-05-05-agent-wave/)
- **Threat model:** [`docs/THREAT-MODEL.md`](THREAT-MODEL.md)
- **Hardening reference:** [`docs/HARDENING.md`](HARDENING.md)
- **Roadmap:** [`docs/ROADMAP.md`](ROADMAP.md)
---
## What this took
This is a **single-operator + AI-accelerated** project. No team, no
funding, no upstream maintainer hat. Most of the work happened across
~6 weeks of evenings and weekends. AI agents (Claude Opus 4.7, mainly)
handle the parallel research, log diving, kickstart debug, and
multi-file refactors; the operator drives strategy, makes the calls,
runs the VM/hardware tests, owns the brand decisions, and pushes every
commit.
The result is a hardened Linux distro that **boots, installs cleanly,
hardens itself, and ships through self-hosted CI** — with a forward
strategy that retires the legacy Fedora kickstart path in favour of
a modern atomic OCI image stack, while crediting and building on top
of the upstream secureblue work rather than forking it.
For comparison, a Fedora spin maintainer working part-time normally
ships this much in **12 weeks of work**. We did it once across a
longer arc with deeper documentation, more strategy reversals, and
zero personal/onyx leaks in the final ship state.