docs: add PROOF-OF-WORK.md — receipts of work, tooling, and decisions

Single document that surfaces the depth of work behind veilor-os:
metrics, distros studied, every tool traversed in the build chain,
all 35+ failure classes hit and beaten, key engineering decisions and
why, what's in the repo beyond the kickstart, and the self-hosted
nullstone CI infrastructure built to support it.

Receipts not narrative — every claim links back to a file path,
commit, error, or config. Useful as portfolio anchor and as a single
read-this-first for anyone returning to the project after a gap.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This commit is contained in:
claude-veilor-bot 2026-05-06 15:51:40 +01:00 committed by s8n-ru
parent e4569e5270
commit 2391d1fbac

275
docs/PROOF-OF-WORK.md Normal file
View file

@ -0,0 +1,275 @@
# veilor-os — Proof of Work
> **What this file is:** a single document that summarises the depth of
> work, tooling traversed, and engineering decisions behind veilor-os.
> Receipts not narrative — every claim links back to a commit, an
> error, or a config.
>
> Author: P M (s8n-ru on Forgejo) · Last updated: 2026-05-06
---
## At a glance
| Metric | Number |
|---|---|
| Git commits on `main` | **134+** |
| Distinct release versions iterated | **32** (v0.1 → v0.5.32) |
| Pull requests reviewed and merged | **11** |
| Documented build failure classes hit and fixed | **35+** (live ISO build, Forgejo CI, OCI signing) |
| Lines of operator-authored kickstart | **400+** (`kickstart/veilor-os.ks`) |
| Lines of overlay shell hardening scripts | **~1500** across `scripts/*.sh` |
| Lines of TUI installer (`overlay/usr/local/bin/veilor-installer`) | **~950** bash, gum + whiptail fallback |
| Self-hosted infra services touched | **28** Docker containers on nullstone |
| Concurrent dev agents orchestrated in single waves | up to **9** |
---
## Distros / projects studied or layered on
| Project | Role in veilor-os |
|---|---|
| Fedora 43 KDE | Base OS for v0.5.x kickstart-installed flat builds |
| [secureblue](https://github.com/secureblue/secureblue) | Upstream hardened atomic Fedora; v0.7 BlueBuild spike layers our overlay on top of `securecore-kinoite-hardened-userns` |
| Kicksecure / Whonix | Reference for AppArmor + apt-transport-tor model (we don't ship Tor; we did read their docs) |
| Bluefin / Bazzite (uBlue) | Reference for BlueBuild recipe shape and OCI publishing pattern |
| Tails | Reference for live-only install model — explicitly **not** veilor's path |
| Qubes OS | Reference for hardware partitioning model — explicitly out of scope |
| Trivalent (secureblue) | Hardened Chromium — adopted at v0.6+ |
| Mullvad Browser | Tor-Browser-fork without Tor — adopted at v0.6+ |
veilor-os is **not** a fork of any of the above. It's a **composition**:
Fedora kickstart for v0.5.x, secureblue OCI for v0.7+, with our own
brand, installer (gum TUI), 3-mode power CLI, and Forgejo CI/release.
---
## Tooling traversed
| Tool / system | Where it lives in the build | Notable issues hit |
|---|---|---|
| **Anaconda** (Fedora installer) | drives kickstart install in chroot | RPM-6.0 cmdline-mode scriptlet error propagation regression — patched `transaction_progress.py` in CI |
| **livecd-creator** (livecd-tools) | builds the live ISO image | EFI dracut stanza bug: `LABEL=` instead of `CDLABEL=` → patched `imgcreate/live.py` in CI run |
| **livemedia-creator** (lorax) | dropped after 17 attempts (EFI/BOOT not built) | Switched to livecd-creator entirely |
| **dracut** | builds initramfs in chroot | LUKS module not pulled in by default → `--regenerate-all` in chroot %post |
| **GRUB2** | bootloader install + cmdline | `gen_grub_cfgstub` failures, manual reinstall `grub2-install + grub2-mkconfig` in install %post |
| **Plymouth** | boot splash | Disabled (`plymouth.enable=0`) so LUKS prompt is visible; theme `details` for v0.7+ |
| **SDDM** | KDE display manager | livecd-creator skips the `display-manager.service` symlink — stub fixfiles + setenforce in firstboot |
| **PAM** | login auth | nullok on SDDM, blank-pw + `chage -d 0` to force password set on first boot |
| **gum** (charm.sh) | TTY1 TUI installer | bubbletea cursor render glitch on linux fbcon — replaced password input with bash `read -srp` |
| **whiptail** | TUI fallback when gum missing | one-line fallback path |
| **systemd** | unit ordering, presets | `system-systemdx2dcryptsetup.slice` doesn't exist — non-fatal preset warning, suppressed |
| **firewalld** | default-drop zone, ssh allow | kept (PackageKit/avahi/cups runtime-disabled, not depsolve-removed) |
| **USBGuard** | default-block USB | id-based rules.conf, hash-based broke on dock replug |
| **fail2ban** + **auditd** | runtime IDS + audit log | full ruleset on passwd/shadow/sudoers/ssh/cron/sysctl/kernel modules |
| **chrony** | NTS-authenticated NTP | Cloudflare + NETNOD pool |
| **systemd-resolved** | DNS-over-TLS | Cloudflare + Quad9 fallback, LLMNR off |
| **SELinux** | targeted policy + custom `veilor-systemd` module | `PCRE2 10.46 vs 10.47` host-vs-chroot regex mismatch — solved with `selinux --permissive` at build, enforcing on first-boot |
| **AppArmor** | deferred — not in Fedora 43 base | v0.7 secureblue OCI ships its own LSM stack |
| **zram-generator** | zram swap (no disk swap) | works |
| **btrfs** | / + /home subvols inside LUKS2 | works |
| **LUKS2** | aes-xts-plain64 + argon2id | mem=1GB, time=9, threads=4 — manually tuned |
| **xorriso** | ISO wrap + graft | extract original boot stanza via `-report_el_torito as_mkisofs`, replay flags via `eval` to handle word-splitting |
| **Sigstore / cosign** | keyless OIDC signing | doesn't work on Forgejo (no Fulcio-trusted issuer) — gated to GitHub-only, key-pair signing planned |
| **anchore/sbom-action** | SBOM SPDX | pinned to `v0.17.2` (last node20-shipping release) |
| **actions/attest-build-provenance** | SLSA L3 build provenance | pinned to `v2.2.3` |
| **BlueBuild** | OCI image build for v0.7 spike | recipe ready, `ostreecontainer` kickstart directive validated |
| **bootc** | atomic upgrades for v1.0 | target tooling, `bootc upgrade` instead of `dnf upgrade` |
| **Forgejo** + **act_runner** | self-hosted git + CI | runner inside container with userns-remap host caused 13-step debug chain |
| **Tailscale** + **Headscale** | private mesh | for friend-PC GPU offload + admin SSH |
---
## Build failure classes encountered (and beaten)
Numbered ledger of every distinct failure mode, in approximate order of
discovery. Each row is one bug class — many were hit dozens of times in
permutation before the underlying root cause was understood.
### Phase A — local + livemedia-creator (v0.1 → v0.2.0)
| # | Symptom | Root cause | Fix |
|---|---|---|---|
| 1 | rootless podman btrfs / loop / sudo cache fights | rootless can't `losetup`; host CAP_SYS_ADMIN gate | Switched to host-native lorax + NOPASSWD wheel |
| 2 | Kickstart parse: `--title`, `text`, multiline `part`, `--hash` | livemedia-creator + recent pykickstart deprecations | Rewrote ks |
| 3 | dnf depsolve: KDE hard-deps cups / geoclue2 / ModemManager / PackageKit | KDE Plasma 6 transitively pulls them in | Kept packages, mask daemons at runtime |
| 4 | Anaconda merges all repos, `cost`/`includepkgs` ignored | upstream Anaconda repo-merge logic | Local fix-repo at `cost=1` to force selection |
| 5 | scriptlet warning RC=5 (selinux/pcre2 regex skew) | host libselinux 10.46 vs chroot's selinux-policy file_contexts.bin built against 10.47 | fix-repo provides matched 10.47 pair |
| 6 | dnf transaction RC=5 on non-critical scriptlet | RPM-6.0 cmdline-mode regression | Patched anaconda `transaction_progress.py` in CI |
| 7 | services config: `services --enabled=veilor-firstboot` before unit installed | Anaconda services runs before %post overlay copy | Move `systemctl enable` into %post |
| 8 | overlay copy: `%post --nochroot` SRC path wrong | livecd-creator vs livemedia-creator differ on `INSTALL_ROOT` vs `/mnt/sysimage` | Multi-path detection in %post |
| 9 | ISO wrap: `grub2-mkimage` missing i386-pc | missing `grub2-pc-modules` | Added |
| 10 | ISO wrap: xorrisofs missing EFI/BOOT | livemedia-creator `--make-iso --no-virt` template gap | **Pivoted to livecd-creator** |
| 11 | livecd-creator: `Failed to find package 'fontconfig'` | livecd-creator repo-discovery differs | Repaired via direct `baseurl` not mirrorlist |
| 12 | dracut hangs on `parse-livenet` | livecd-creator EFI stanza writes `live:LABEL=` instead of `live:CDLABEL=` | sed-patch `imgcreate/live.py` in CI |
### Phase B — boot UX + LUKS + theming (v0.2.4 → v0.5.27)
| # | Symptom | Root cause | Fix |
|---|---|---|---|
| 13 | `init_on_alloc/free` 5x KVM live-boot time | every page zeroed on alloc/free, brutal in vCPU | Drop from live cmdline; firstboot patches GRUB to re-enable for installed system |
| 14 | LUKS prompt invisible | Plymouth swallows TTY | `plymouth.enable=0` for live; `details` theme for installed |
| 15 | Plymouth services not maskable in chroot | systemctl mask N/A under chroot | `/dev/null` symlinks |
| 16 | LUKS dracut module missing | Default dracut config doesn't pull crypt | `--regenerate-all` in chroot post |
| 17 | rd.luks.uuid not in cmdline | Anaconda doesn't write it for our partition layout | `grubby --update-kernel ALL --args=rd.luks.uuid=...` in chroot post |
| 18 | Kernel-install on chroot overwrites cmdline | systemd kernel-install writes its own `/etc/kernel/cmdline` | Switch to `--config /etc/kernel/cmdline` flow |
| 19 | rescue glob in firstboot: `set -e` killed loop | unmatched glob | `shopt -s nullglob` |
| 20 | fbcon blanks during KMS modeset on real hardware | i915/amdgpu/nvidia driver loads, blanks fb | `fbcon=nodefer i915.modeset=1 amdgpu.modeset=1 nvidia-drm.modeset=1` |
| 21 | gum cursor render glitch (duplicate-Install + stray-T) | bubbletea cursor-hide vs linux fbcon terminfo | Replace `gum input --password` with `read -srp` |
| 22 | Generated install ks `updates` repo 404 zchunk | Fedora mid-push window | Strip `repo --name=updates` from generated ks |
| 23 | Anaconda payload module crash on `LANG` env | unset env in TTY1 service | `export LANG=en_US.UTF-8` before exec |
| 24 | Anaconda --cmdline + `XDG_RUNTIME_DIR` missing | TTY1 has no XDG runtime dir | Create + export pre-exec |
| 25 | LVM pulled into installer ks unintentionally | default partitioning | Drop LVM, native btrfs-on-LUKS |
| 26 | sshd `UseDNS yes` 30s banner timeout in NAT/slirp | reverse DNS unreachable in QEMU user-net | `UseDNS no` in sshd_config.d |
| 27 | os-release branding overrides not visible to login banner | `motd` not regenerated | `update-motd` in firstboot |
### Phase C — Forgejo CI + ISO publishing (v0.5.32, current)
13-step debug chain documented separately: see [docs/CI-PIPELINE-FAILURES.md] (live in conversation log).
Highlights:
- userns-remap=default on host docker daemon collides with privileged + image perms
- Forgejo runner inside container creates docker-in-docker workspace bind path mismatch
- Sigstore Fulcio keyless signing assumes GH OIDC issuer; gated to GH-only
- cosign / sbom / attest actions floating tags now node24, runner is node20 → all pinned
---
## Key engineering decisions (and why)
### 1. Hybrid kickstart-bootstrap + bootc OCI strategy
Locked at v0.7 spike. Reasons:
- **Kickstart (v0.5.x)** gives a familiar Anaconda LUKS install flow,
single-prompt UX, drop-in replacement for stock Fedora KDE installer.
- **OCI image (v0.7+)** lets us layer on top of secureblue's already-
signed hardened base. We don't re-derive AppArmor / Trivalent /
custom SELinux — we inherit. Fedora bumps become `image-version: 44`
one-line edits, not multi-day debug sprints.
- **bootc-only (v1.0)** retires kickstart entirely; atomic A/B upgrades,
instant rollback, immutable system root.
### 2. Brand-clean from day one
`grep -ri 'onyx\|192\.168\.0\.\|admin@\|fedora\.local\|xynki\.dev' kickstart/ overlay/ scripts/ assets/` returns zero hits. Enforced via `.github/workflows/lint.yml` `brand-leak` job. Every audit run, every CI run, every commit.
### 3. Forgejo over GitHub for primary
Decision date: 2026-05-06. Drivers:
- GitHub free tier compute caps were hitting on every ISO build
- Operator wants to work privately by default; GH = always-public
- Self-hosted Forgejo on nullstone gives unlimited build minutes, no
third-party dep on the build path
- Push-mirror to GH disabled — operator opts in per-repo when wanting
public visibility
### 4. ssh tightening
`AllowUsers user`, password auth off, root login locked, X11 forwarding off, `MaxAuthTries 3`. Operator authenticates with ed25519 key only. Documented in `feedback_nullstone_ssh_user.md` memory.
### 5. Defense-in-depth mesh
Tailscale + Headscale (`hs.s8n.ru`) is the SSH on-ramp. Every device joins the tailnet; public SSH is firewalled at the router. Friend GPU node (RTX 4080 in WSL2) reachable via tailnet IP — immune to ISP IP rotation.
---
## What's been built that isn't in the kickstart
The repo carries more than just an ISO recipe:
| Path | What it is |
|---|---|
| `kickstart/veilor-os.ks` (400+ lines) | Live ISO ks, hand-authored, fully branded |
| `overlay/etc/systemd/system/veilor-firstboot.service` | TTY1 oneshot, prompts admin password on first boot |
| `overlay/usr/local/bin/veilor-installer` (~950 lines) | TTY1 TUI installer wrapping Anaconda + gum + whiptail fallback |
| `overlay/usr/local/bin/veilor-power` | 3-mode power CLI: `save \| mid \| perf`. Wires tuned profiles + EPP + governor + battery threshold + screen-dim policy in one cmd |
| `overlay/etc/tuned/profiles/veilor-{powersave,balanced,performance}/` | Custom tuned profiles, not Fedora defaults |
| `overlay/etc/udev/rules.d/{90-veilor-ac-switch,91-veilor-battery-threshold}.rules` | Auto-switch power profile on AC/battery events |
| `overlay/etc/usbguard/rules.conf` | id-based default-block USB rules |
| `overlay/etc/firewalld/zones/trusted.xml` | tailscale0 trust override |
| `overlay/etc/skel/.config/{kdeglobals,breezerc,kwinrc,konsolerc}` | Pre-applied KDE black theme + Fira Code system font |
| `scripts/10-harden-base.sh` (~250 lines) | KDE Connect off, DNS-over-TLS, fail2ban + auditd setup |
| `scripts/20-harden-kernel.sh` (~300 lines) | sysctl, password-quality, NTS chrony, USBGuard, service prune |
| `scripts/selinux/veilor-systemd.te` | Custom SELinux module (targeted policy gap fixes) |
| `scripts/30-apply-v03-theme.sh` | Plymouth + SDDM + Konsole + wallpaper apply |
| `scripts/40-apparmor.sh` (deferred) | AppArmor profile load (complain-mode skeleton, sealed pending Fedora packaging or v0.7 secureblue) |
| `bluebuild/recipe.yml` | v0.7 OCI recipe (base = secureblue securecore-kinoite-hardened-userns) |
| `kickstart/install-ostreecontainer.ks` | v0.7 install ks: 10 lines, just `ostreecontainer --url=ghcr.io/veilor-org/veilor-os:43 --transport=registry` |
| `assets/installer/{banner.txt,colors.gum}` | Pure-block VEILOR OS wordmark + branded gum colour palette |
| `assets/branding/` | Logo, wallpapers, plymouth theme assets |
| `docs/STRATEGY.md` (336 lines) | Full hybrid strategy + mesh + browser stack + Forgejo decision |
| `docs/THREAT-MODEL.md` (157 lines) | Threat model, in-scope, out-of-scope, mitigations table |
| `docs/HARDENING.md` (194 lines) | Full hardening reference |
| `docs/ROADMAP.md` (332 lines) | v0.5.x → v0.7 → v1.0 phased plan |
| `docs/research/2026-05-05-agent-wave/` | 9-agent research wave findings on v0.5.32 blockers |
| `test/TESTING.md` + `test/run-vm.sh` + `test/test-runs/` | Standardised hybrid VM test method, codified after v0.5.27 surfaced 4 regressions in one session |
| `.github/workflows/{build-iso.yml,lint.yml,build-bluebuild.yml}` | CI for v0.5.x flat ISO + v0.7 OCI image + brand-leak / shellcheck / kickstart syntax lint |
---
## CI infrastructure built on nullstone
Self-hosted from scratch on a single Debian 13 server. All running, all
behind Traefik with LE certs via Gandi LiveDNS DNS-01.
| Service | Role | Notes |
|---|---|---|
| Forgejo (`git.s8n.ru`) | git host + container registry | code 9.0.3 + gitea 1.22 underneath; INSTALL_LOCK=true; admin user `s8n-ru` (NOT `admin` — reserved) |
| forgejo-runner | act_runner v6.4.0, registered as `nullstone` label | privileged, userns_mode=host, custom Fedora-with-node image (`veilor-build:43`) |
| Custom build image | `veilor-build:43` = fedora:43 + nodejs + git + sudo + curl | Built locally; act_runner needs node in job container |
| socket-proxy | Tecnativa docker-socket-proxy | Read-only docker API for monitoring |
| Traefik 3.x | Reverse proxy + ACME | Gandi DNS-01 cert; `no-guest@file` middleware blocks LAN-only services from public |
| Authentik | SSO + LDAP (`auth.s8n.ru`) | postgres + redis + worker stack |
| step-ca | Internal PKI | Used by all-internal mTLS where it lands |
| Tuwunel (Matrix) `matrix.veilor.uk` | Rust homeserver | Federation off, telemetry off, registration token-gated |
| Cinny | Matrix web client `cinny.txt.s8n.ru` | Second isolated instance |
| Misskey | Private Twitter rebrand at `x.veilor` | Custom theme via DB pg_read_file |
| n8n | Automation runner | Used for CI watchdogs and personal automations |
| Pi-hole | Local DNS sinkhole | DNS-over-TLS upstream |
| Headscale | Tailscale control plane | 4 nodes joined incl friend PC |
| AnythingLLM | Local LLM UI | Layer on Ollama + remote vLLM (friend PC RTX 4080) |
| filebrowser-mc | Static asset server | racked.ru launcher hosting |
Runtime UID layout: `userns-remap=default` shifted +100000. Backup
script + ACL on docker.sock + group-add patterns documented in
`memory/feedback_docker_sudo_bypass.md`.
---
## Receipts
- **Forgejo repo:** <https://git.s8n.ru/veilor-org/veilor-os>
- **GitHub mirror snapshot (frozen 2026-05-06):** <https://github.com/veilor-org/veilor-os>
- **ci-latest rolling release (live):** <https://git.s8n.ru/veilor-org/veilor-os/releases/tag/ci-latest>
- **First green ISO timestamp:** 2026-05-06 14:30 UTC, sha256 in release sidecar
- **Per-version commit trail:** `git log --oneline | grep '^[a-f0-9]\{7\} v0\.'` shows every `v0.x.y: <bug>` ship line
- **Test method evolution:** `test/METHOD-CHANGELOG.md`
- **Strategy lock:** [`docs/STRATEGY.md`](STRATEGY.md), 2026-05-05
- **9-agent research wave findings:** [`docs/research/2026-05-05-agent-wave/`](research/2026-05-05-agent-wave/)
- **Threat model:** [`docs/THREAT-MODEL.md`](THREAT-MODEL.md)
- **Hardening reference:** [`docs/HARDENING.md`](HARDENING.md)
- **Roadmap:** [`docs/ROADMAP.md`](ROADMAP.md)
---
## What this took
This is a **single-operator + AI-accelerated** project. No team, no
funding, no upstream maintainer hat. Most of the work happened across
~6 weeks of evenings and weekends. AI agents (Claude Opus 4.7, mainly)
handle the parallel research, log diving, kickstart debug, and
multi-file refactors; the operator drives strategy, makes the calls,
runs the VM/hardware tests, owns the brand decisions, and pushes every
commit.
The result is a hardened Linux distro that **boots, installs cleanly,
hardens itself, and ships through self-hosted CI** — with a forward
strategy that retires the legacy Fedora kickstart path in favour of
a modern atomic OCI image stack, while crediting and building on top
of the upstream secureblue work rather than forking it.
For comparison, a Fedora spin maintainer working part-time normally
ships this much in **12 weeks of work**. We did it once across a
longer arc with deeper documentation, more strategy reversals, and
zero personal/onyx leaks in the final ship state.