veilor-os/docs/STRATEGY.md
veilor-org abb67841f1 docs: STRATEGY.md — primary git host moved to git.s8n.ru (Forgejo)
Self-hosted Forgejo + forgejo-runner on nullstone now primary.
GitHub becomes public mirror (Forgejo push-mirrors every commit
+ every 8h). 0 GH Actions minutes consumed.

Runner labels:
  ubuntu-24.04 — drop-in for existing build-iso.yml workflow
  nullstone    — privileged Fedora 43 (opt-in via runs-on: nullstone)

Deploy artifacts: ~/ai-lab/nullstone-server/forgejo/.

External TODO (parent operator owns):
  - router port-forward 222 → nullstone:222 for public SSH push
  - no-guest@file allowlist update for external web UI access
2026-05-06 02:01:06 +01:00


# veilor-os Strategy — Hybrid kickstart bootstrap + bootc OCI
Decision date: **2026-05-05** (refined same day from parent-operator
handoff, locks the `ostreecontainer` install path, mesh stack
bake-in, browser stack, Iroh seeding roadmap, and threat floor table).
Locked at: **v0.5.31 → v0.7 spike → v1.0**
## TL;DR
- Keep the Anaconda-driven kickstart ISO as the **bootstrap installer**
(LUKS UX is mature, single passphrase prompt, custom partitioning
works).
- Anaconda's `ostreecontainer` directive populates the root filesystem
directly from a **veilor-os OCI image** (built via BlueBuild on top
of secureblue's `securecore-kinoite-hardened-userns`) **during the
install pass — no first-boot rebase, no mutable→atomic transition**.
- All future updates flow through `bootc upgrade` — atomic A/B,
instant rollback, cosign-signed.
- The kickstart-driven mutable-root path is deprecated at v1.0; kept
alive as fallback through v0.7.
## Why hybrid, not pure pivot
Pure pivot to bootc-from-scratch (Agent 3's spike plan) was **1 week
to first ISO**. Pure pivot to layering on secureblue is **2 days to
first ISO** because the hardening work is already done. The
`ostreecontainer` refinement compresses that to **1 day** by
eliminating the first-boot rebase choreography (no
`veilor-firstboot-rebase.service`, no second reboot, no transition
window where the system is half-mutable, half-atomic).
Both pure-pivot paths require throwing away the partitioning UX we
already have working in Anaconda. Hybrid keeps it.
Hybrid:
- **Day-zero install:** Anaconda kickstart + custom partitioning +
LUKS prompt (what we have today). User experience = unchanged.
- **End of install pass:** `ostreecontainer
--url=ghcr.io/veilor/veilor-os:43 --transport=registry` populates
`/` from the OCI image. Transition is invisible.
- **First boot:** veilor OCI tree, no rebase, no special service.
- **Day-2:** `bootc upgrade` cadence for everything from then on.
We keep what works, pivot the part that doesn't.
## ostreecontainer directive (refinement, locked)
Replace the `%packages` block in the install kickstart with:
```
ostreecontainer --url=ghcr.io/veilor/veilor-os:43 --transport=registry
```
Keep the existing `part`/LUKS encryption block verbatim — Anaconda
partitions before `ostreecontainer` populates root.
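
The swap described above can be sketched as a small text transform. This is a minimal illustration, assuming the kickstart has a single `%packages` … `%end` section; the `SAMPLE` kickstart and the function name are hypothetical and do not reflect the real `kickstart/install.ks`.

```python
import re

OSTREE_LINE = (
    "ostreecontainer --url=ghcr.io/veilor/veilor-os:43 --transport=registry"
)

# Matches from a line starting with %packages up to the first line that is %end.
_PACKAGES_SECTION = re.compile(
    r"^%packages\b.*?^%end[ \t]*$", re.DOTALL | re.MULTILINE
)


def swap_packages_for_ostreecontainer(kickstart: str) -> str:
    """Replace the first %packages section with the ostreecontainer line.

    Partitioning, LUKS options and other sections are left untouched.
    """
    new_text, count = _PACKAGES_SECTION.subn(OSTREE_LINE, kickstart, count=1)
    if count == 0:
        raise ValueError("no %packages section found")
    return new_text


# Hypothetical input for illustration only.
SAMPLE = """\
part /boot --fstype=ext4 --size=1024
part / --fstype=btrfs --grow --encrypted --luks-version=luks2
%packages
@kde-desktop
tailscale
%end
%post
echo done
%end
"""

if __name__ == "__main__":
    print(swap_packages_for_ostreecontainer(SAMPLE))
```

Because only the `%packages` section is matched, the `part` lines and any `%post` sections pass through unchanged, which is the property the hybrid path relies on.
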
**Stay on `ostreecontainer` through v0.8.** Do NOT migrate to the new
`bootc` kickstart command until v1.0: `bootc` does not yet support
multi-disk installs or authenticated registries, both of which we'll
likely need.
**Do NOT use** `bootc-image-builder anaconda-iso` output —
deprecated in image-builder v44+. Produce the OCI image and the
bootstrap ISO as **separate artifacts**:
- OCI image: BlueBuild recipe → cosign-signed image at
`ghcr.io/veilor/veilor-os:43`
- Bootstrap ISO: Anaconda kickstart with `ostreecontainer` directive
pointing at the OCI image
Reference: <https://docs.fedoraproject.org/en-US/bootc/>; pykickstart
docs for `ostreecontainer`.
## Why secureblue underneath
| Question | Answer |
|---|---|
| Maintainers | secureblue: 30 contributors, 56 commits in 5 weeks. veilor-os: solo. |
| Hardening surface | secureblue ships sysctl + kargs + SELinux + USBGuard + hardened-malloc + DoT — far more than we'd build alone. |
| Build pipeline | BlueBuild → cosign-signed OCI in GH Actions (`build-all.yml`, `trivy.yml`). |
| Update model | bootc upgrade with A/B + instant rollback + signed image chain. |
| Variants | `kinoite-hardened-userns` is the KDE+Wayland+SELinux variant we'd want. |
| License | Apache-2.0 (compatible with our MIT). |
What we override in our recipe:
- **`run0` instead of sudo**: revert. Breaks too many workflows.
- **Xwayland disabled**: revert. Some apps still need it.
- **Veilor branding**: theme, KDE color scheme, Plymouth, SDDM, font,
os-release. All `overlay/*` ports verbatim from current repo.
(Browser stack is its own section below — Trivalent is now a *kept*
default, not an override.)
## Browser stack
| Role | Pick | Source |
|---|---|---|
| **Default browser** | **Trivalent** (secureblue's hardened Chromium) | Fedora COPR `secureblue/trivalent` — tracks upstream M147+ within hours, ships hardened_malloc + JIT-less + Drumbrake WASM |
| **Anti-fingerprint companion** | **Mullvad Browser** | Clearnet, no Tor, layered alongside Trivalent for pseudonymous browsing |
| **Optional opt-in** | **Thorium** | `ujust install-thorium` only — WARN users of months-long CVE lag (LTS Chromium base, ~9 milestones behind upstream stable as of 2026-05) |
**DO NOT default to Thorium under any circumstances**: it contradicts
the threat model. Trivalent's COPR keeps us within hours of upstream
patch releases; Thorium lags by months and is a perf/media
profile choice, not a security choice.
The earlier draft of this doc treated Trivalent as an override to
remove. That was wrong: Trivalent is exactly the level of hardening
we want for a default browser. Keep it, add Mullvad alongside, and
move Thorium behind an explicit opt-in.
## Mesh stack — three-layer warm-stack
Day 1 ships layers 1 (Tailscale) and 2 (Yggdrasil idle). Layer 3
(Reticulum) is opt-in via `ujust`.
### Layer 1 — Tailscale + Headscale (daily driver)
- Already running on `nullstone`, `hs.s8n.ru`. OIDC via Authentik.
- Veilor OS ships `tailscale-1.94.2+` from official Fedora repo.
- Service unit **pre-disabled** at install time.
- First-boot prompt: "join Veilor mesh? [paste / QR]". On accept:
`tailscale up --login-server=https://hs.s8n.ru` with the user's
pre-auth key.
### Layer 2 — Yggdrasil-go (warm fallback, idle by default)
- `yggdrasil-go` 0.5.13+ from COPR / dnf.
- Decentralized IPv6 in `200::/7`.
- systemd unit **enabled**, but the config has an empty `Listen` list,
one public peer in `Peers` (e.g. `vpn.itrus.su` or another EU peer), and
`AllowedPublicKeys` in allowlist mode (no allow-all).
- WSS:443 transport for ISP DPI evasion.
- Generates ECC keypair on first boot via systemd-tmpfiles or
firstboot script.
- Survives ISP-level Tailscale block (threat floor (ii)).
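
A minimal sketch of the idle Layer 2 config, built as a plain dictionary and rendered as JSON (yggdrasil-go reads HJSON or JSON config files). The peer URI `wss://vpn.itrus.su:443` and the placeholder public key are assumptions for illustration; only the key names `Listen`, `Peers` and `AllowedPublicKeys` come from the Yggdrasil config format.

```python
import json


def idle_yggdrasil_config(peer_uri: str, allowed_keys: list) -> dict:
    """Build the outbound-only, allowlisted config described above."""
    if not allowed_keys:
        # An empty AllowedPublicKeys list means "allow everyone" in
        # Yggdrasil, which is the opposite of allowlist mode.
        raise ValueError("allowlist mode needs at least one public key")
    return {
        "Listen": [],                      # no inbound listeners
        "Peers": [peer_uri],               # single outbound public peer
        "AllowedPublicKeys": sorted(allowed_keys),
    }


def render(config: dict) -> str:
    """Serialize the config as indented JSON."""
    return json.dumps(config, indent=2)


# Hypothetical values: the URI scheme/port and the key are placeholders.
EXAMPLE = idle_yggdrasil_config("wss://vpn.itrus.su:443", ["ab" * 32])
```

Rejecting an empty allowlist up front guards against accidentally shipping the allow-all default.
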
### Layer 3 — Reticulum (opt-in)
- **RetiNet AGPL fork** (NOT upstream RNS — upstream has an anti-AI
license clause incompatible with our governance). Sourced from the
Codeberg AGPL fork.
- Sideband (Android/desktop messenger built on RNS).
- Install via `ujust install-reticulum`. NOT auto-started until
RetiNet stabilizes.
- Default config when enabled: `AutoInterface` (LAN multicast) +
12 TCP backbone peers.
- RNode hardware (LoRa transceiver) bundle as separate
`ujust install-reticulum-rnode`.
- Survives total internet outage (threat floor (iii)) when paired
with RNode.
## Onboarding model
Token-based (paste OR QR, user picks). Misskey signup page mints a
**pre-auth key** (TTL=24h, single-use, regenerated per
signup). First boot of Veilor ISO accepts hex paste OR QR scan of
the same key.
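
The paste path can be sketched as follows: strip whitespace from the pasted token, check that it is hex, and build the `tailscale up` command line. The accepted length range (32 to 128 hex characters) and the function names are assumptions; adjust them to whatever key format the Headscale version in use actually emits.

```python
import re

LOGIN_SERVER = "https://hs.s8n.ru"

# Assumed bounds: the real key length depends on the Headscale version.
_HEX_KEY = re.compile(r"[0-9a-fA-F]{32,128}")


def normalize_key(raw: str) -> str:
    """Remove whitespace from a pasted or scanned key and validate it."""
    key = "".join(raw.split())
    if not _HEX_KEY.fullmatch(key):
        raise ValueError("expected a hex pre-auth key")
    return key


def tailscale_up_argv(raw_key: str) -> list:
    """Return the argv for joining the mesh with the given pre-auth key."""
    key = normalize_key(raw_key)
    return [
        "tailscale",
        "up",
        f"--login-server={LOGIN_SERVER}",
        f"--auth-key={key}",
    ]
```

Building an argv list instead of a shell string keeps the pasted token out of shell parsing entirely.
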
**NOT auto-OIDC at first boot** — too much Authentik exposure for
day-zero users.
## Tier model — three-tier
- `tag:admin` — onyx + failsafe. Full mesh, `*:*`.
- `tag:infra` — nullstone, office. Mesh among themselves; admin
inbound only.
- `tag:guest` — Veilor OS users + friend. ONLY `x.veilor:443`
reachable + future seeded service hostnames whitelisted.
- **Failsafe** — pre-baked admin pre-auth key on yubikey + printed
paper + Authentik OIDC group `tailnet-admin` as second auth path.
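
The three tiers map onto a Tailscale-style ACL policy, which Headscale also accepts. The sketch below is an assumption-laden illustration: the host alias `x-veilor`, the address `100.64.0.10`, and the empty `tagOwners` lists are hypothetical, and field semantics should be checked against the deployed Headscale version.

```python
import json


def build_policy(guest_host_ip: str) -> dict:
    """Return a default-deny policy with one accept rule per tier."""
    return {
        # Hypothetical: tag ownership is left empty here.
        "tagOwners": {"tag:admin": [], "tag:infra": [], "tag:guest": []},
        # Hypothetical alias for the one service guests may reach.
        "hosts": {"x-veilor": guest_host_ip},
        "acls": [
            # Admins reach everything.
            {"action": "accept", "src": ["tag:admin"], "dst": ["*:*"]},
            # Infra nodes talk among themselves only.
            {"action": "accept", "src": ["tag:infra"], "dst": ["tag:infra:*"]},
            # Guests reach a single host on port 443.
            {"action": "accept", "src": ["tag:guest"], "dst": ["x-veilor:443"]},
        ],
    }


def destinations_for(policy: dict, tag: str) -> list:
    """List every destination a tag is explicitly allowed to reach."""
    dsts = []
    for rule in policy["acls"]:
        if rule["action"] == "accept" and tag in rule["src"]:
            dsts.extend(rule["dst"])
    return dsts


POLICY = build_policy("100.64.0.10")
POLICY_JSON = json.dumps(POLICY, indent=2)
```

Anything not matched by an accept rule is denied, so future seeded service hostnames would be added as extra entries in the guest rule's `dst` list.
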
## Threat floor table
| Floor | Attack | Day 1 (v0.7 ship) | Phase 2 (v0.8) |
|-------|--------|---|---|
| (i) | ISP blocks `s8n.ru` DNS | Tailscale dies, Yggdrasil survives | YES (documented failover) |
| (ii) | ISP blocks Tailscale protocol | Yggdrasil-WSS:443 survives | YES |
| (iii) | Internet unreachable | RNS over LoRa survives | OPT-IN (RetiNet + RNode) |
Day 1 must hold floor (i). Floors (ii) and (iii) become P2 once
Yggdrasil is promoted from idle to documented failover.
## Iroh seeding daemon (Phase 2 / v0.8)
- `veilor-seed.service` systemd unit, runs as `_veilor-seed` user.
- Watches `/var/lib/<service>/files/` blob store directories.
- BLAKE3-hashes new blobs, registers with local iroh node.
- Publishes tickets on per-service `iroh-gossip` topic.
- LRU local cache, default 10 GB.
- Sidecar mirrors service blob stores: Misskey `/files/`, Matrix
media, `dl.veilor` downloads.
- Other Veilor nodes pull lazily on cache miss.
- **DEFER DB replication forever.** Static media only.
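
The hashing and LRU bullets can be illustrated with a size-bounded, content-addressed cache. This is a sketch only: it keeps blobs in memory, where a real daemon would track files on disk, and it uses BLAKE2b from the standard library as a stand-in because BLAKE3 is not available there. The class and method names are hypothetical.

```python
import hashlib
from collections import OrderedDict
from typing import Optional


class BlobCache:
    """Size-bounded LRU cache keyed by content hash."""

    def __init__(self, max_bytes: int = 10 * 1024**3) -> None:
        self.max_bytes = max_bytes  # default mirrors the 10 GB budget
        self.used = 0
        self._blobs = OrderedDict()  # oldest entry first

    @staticmethod
    def digest(data: bytes) -> str:
        # Stand-in hash: the plan calls for BLAKE3, not BLAKE2b.
        return hashlib.blake2b(data, digest_size=32).hexdigest()

    def put(self, data: bytes) -> str:
        """Store a blob, evicting least-recently-used blobs if over budget."""
        key = self.digest(data)
        if key in self._blobs:
            self._blobs.move_to_end(key)
            return key
        if len(data) > self.max_bytes:
            raise ValueError("blob larger than cache budget")
        self._blobs[key] = data
        self.used += len(data)
        while self.used > self.max_bytes:
            _, evicted = self._blobs.popitem(last=False)
            self.used -= len(evicted)
        return key

    def get(self, key: str) -> Optional[bytes]:
        """Return the blob and mark it as recently used, or None on a miss."""
        data = self._blobs.get(key)
        if data is not None:
            self._blobs.move_to_end(key)
        return data
```

A `None` result from `get` corresponds to the cache-miss case in which a node would pull the blob lazily from a peer.
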
DOCUMENT but DO NOT IMPLEMENT until **Iroh hits 1.0** (currently in the
0.96–0.98 RC season; the Q1 2026 target for 1.0 slipped, watching).
Reference: <https://github.com/n0-computer/iroh-blobs/blob/main/DESIGN.md>.
## External dependency — Phase 0 (NOT veilor-os scope)
Real ACL gap on nullstone Traefik right now: friend on `tag:guest`
can reach `nullstone:443` → SNI-routes to ALL Traefik vhosts
(`sys.s8n.ru`, `pihole.s8n.ru`, `hs.s8n.ru`, `auth.s8n.ru`, n8n, rc,
mx, …). Only per-vhost auth blocks them. The `no-guest@file` Traefik
middleware that should fix this is currently a `0.0.0.0/0`
allow-all stub (neutralized on 2026-05-03 because of XFF chain breakage).
**veilor-os does NOT fix this.** Tracked here as an external
dependency: ACL fix on nullstone Traefik **required before veilor-os
first-public-ISO ships**, otherwise `tag:guest` provisioning leaks
the full vhost surface to every veilor user. Parent operator owns it.
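
A pre-release check for this gap could be as simple as linting the middleware's source ranges. The sketch below assumes the ranges have already been extracted from the Traefik file-provider config as a list of CIDR strings; the example allowlist `192.168.0.0/24` and guest address `100.64.0.7` are hypothetical.

```python
import ipaddress


def is_allow_all(source_ranges: list) -> bool:
    """True if any range covers the whole IPv4 or IPv6 address space."""
    for entry in source_ranges:
        net = ipaddress.ip_network(entry, strict=False)
        if net.prefixlen == 0:
            return True
    return False


def admits(source_ranges: list, client_ip: str) -> bool:
    """True if the client address falls inside any listed range."""
    addr = ipaddress.ip_address(client_ip)
    return any(
        addr in ipaddress.ip_network(entry, strict=False)
        for entry in source_ranges
    )


# Current stub versus a hypothetical LAN-only allowlist.
STUB = ["0.0.0.0/0"]
LAN_ONLY = ["192.168.0.0/24"]
```

With the stub, an address in the tailnet's `100.64.0.0/10` range is admitted; with a LAN-only list it is not, which is the behavior the fix needs to restore for `tag:guest` clients.
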
## Strategic credibility win
secureblue does NOT publish a threat model. Athena OS does, and it's
their main differentiator. We've already drafted
`docs/THREAT-MODEL.md` (Agent 5 of 2026-05-05 wave). Publishing that
*before* the v0.7 launch positions veilor-os ahead of secureblue and
Athena on the one axis that matters most for a security-branded
distro: **honest, scoped, public threat model**.
## Roadmap implications
| Version | Status | Path |
|---|---|---|
| v0.5.31 | shipped | Anaconda kickstart, mutable root |
| v0.5.32 | active — top blockers from 9-agent wave | Anaconda kickstart |
| v0.5.x → v0.6 | maintenance | Anaconda kickstart, ergonomics + UX polish |
| **v0.7 spike** | **1-day BlueBuild prototype** (was 2 days; `ostreecontainer` removes first-boot-rebase work) | First veilor OCI image extending secureblue's `securecore-kinoite-hardened-userns` |
| v0.7 ship | ISO bootstraps install, `ostreecontainer` populates from OCI in-pass | Hybrid path live |
| v0.8 | Iroh seeding (P2P static media), Yggdrasil promoted from idle to documented failover, RetiNet stabilization watch | bootc-only direction |
| **v1.0** | **bootc-only**, kickstart deprecated, possibly migrate `ostreecontainer` → new `bootc` kickstart command if multi-disk + auth-registry blockers resolved upstream | `bootc upgrade` for all updates |
The Containerfile-from-scratch spike plan (Agent 3 of 2026-05-05
wave) is **superseded** by this hybrid: don't build a Containerfile
from scratch on `fedora-bootc:43`. Instead, write a BlueBuild recipe
on `securecore-kinoite-hardened-userns`. With the `ostreecontainer`
swap, the spike compresses from 1 week to 1 day.
## Next concrete steps
### v0.5.32 — current (no strategy change)
Ship the 7 blockers from `docs/research/2026-05-05-agent-wave/`:
suspend/resume wifi fix, firstboot WantedBy, USBGuard id-rules,
firewalld tailscale0 zone, KMS modeset, /etc/skel branding, virtio-9p
log capture.
`ostreecontainer` swap **does NOT land in v0.5.32 main.** It belongs
in the v0.7 spike branch only.
### v0.7-spike (1 day, separate branch)
1. New repo dir: `bluebuild/recipe.yml`.
2. `base-image`: `ghcr.io/secureblue/securecore-kinoite-hardened-userns`, with `image-version: latest`.
3. Override modules:
- `type: files` — stamp our `overlay/*` tree (branding, themes,
veilor scripts, sddm theme, plymouth theme).
- `type: rpm-ostree` — install Mullvad Browser + restore Xwayland +
re-enable sudo (revert run0).
- **Keep Trivalent** as default (was wrongly marked for removal in
the first draft of this doc).
- `type: brand` — PRETTY_NAME, GRUB_DISTRIBUTOR, distributor URL.
- `type: files` — pre-disabled `tailscale.service`, idle
`yggdrasil.service`, `ujust install-reticulum` and
`ujust install-thorium` recipes.
4. `.github/workflows/build-bluebuild.yml` — pull BlueBuild action,
build + cosign sign + push to GHCR.
5. `kickstart/install.ks` — replace `%packages` block with
`ostreecontainer --url=ghcr.io/veilor/veilor-os:43
--transport=registry`. Keep existing partitioning + LUKS block
verbatim. **Drop** all planned `veilor-firstboot-rebase.service`
work — no longer needed.
### v1.0 — bootc-only
- Drop `kickstart/veilor-os.ks`, drop `livecd-creator` workflow.
- Bootstrap ISO is built as a **separate artifact** (NOT via
`bootc-image-builder anaconda-iso`, which was deprecated in
image-builder v44).
- The OCI image is the source of truth.
- `veilor-update` becomes a thin wrapper around `bootc upgrade --apply`.
- Migrate `ostreecontainer` directive → new `bootc` kickstart
command IF multi-disk + authenticated-registry support has landed
upstream by then.
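
A thin `veilor-update` wrapper might look like the sketch below. It is an illustration under assumptions: the `check_only` switch and the injectable `runner` parameter are hypothetical conveniences, while `bootc upgrade --apply` and `bootc upgrade --check` are the underlying commands.

```python
import subprocess


def bootc_argv(check_only: bool = False) -> list:
    """Return the bootc command line for an update or a dry check."""
    return ["bootc", "upgrade", "--check" if check_only else "--apply"]


def run_update(check_only: bool = False, runner=subprocess.run) -> int:
    """Run bootc and return its exit status.

    `runner` is injectable so the wrapper can be exercised without bootc
    installed; by default it is subprocess.run.
    """
    return runner(bootc_argv(check_only)).returncode
```

Keeping the wrapper this small means rollback and signature verification stay entirely inside bootc rather than being reimplemented.
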
## Open questions
- Does secureblue accept upstream contributions? If yes, send our
USBGuard id-based-rules fix and our threat-model framework.
- Recovery flow when `ostreecontainer` install pass fails — Anaconda
should abort cleanly; verify in spike that no half-installed
state is bootable.
- Iroh 1.0 timing: currently 0.96–0.98 RC; Q1 2026 target slipped.
Re-evaluate Phase 2 schedule when 1.0 lands.
- RetiNet upstream stabilization — track Codeberg fork for releases.
If it stalls > 6 months we re-evaluate Layer 3.
- Fedora 44 transition: secureblue tracks Fedora releases (current
`v4.9` on F44). If we follow, we get F44 for free at the same time
upstream does.
## Self-hosted git + CI (locked 2026-05-05)
Primary git host moved off github.com. **Forgejo** runs on nullstone
at `git.s8n.ru`, with **forgejo-runner** doing the build work. The GitHub
free-tier Actions minute quota was throttling veilor-os iteration, so
we self-host now.
- Primary remote: `ssh://git@192.168.0.100:222/veilor-org/veilor-os.git`
(Forgejo; LAN-only until the router port-forward 222 → nullstone:222 is
added, which is still TODO; alternatively, use the tailnet hostname once
Tailscale is logged in).
- Public mirror: `https://github.com/veilor-org/veilor-os.git`. Forgejo
push-mirrors every commit + every 8h, so GH stays in sync without
consuming GH minutes.
- Runner labels: `ubuntu-24.04` (catthehacker image — works for our
current build-iso.yml unmodified) and `nullstone` (privileged Fedora
43 container — opt-in via `runs-on: nullstone`).
- Build cost: 0 GH minutes. Disk: ~80 GB workspace on /home/docker.
Deploy artifacts: `~/ai-lab/nullstone-server/forgejo/`. Runbook in same
dir.
## See also
- `docs/THREAT-MODEL.md` — drafted, needs publish for v0.7
- `docs/ROADMAP.md` — updated to reflect this strategy
- `docs/research/2026-05-05-agent-wave/03-bootc-spike-plan.md`
superseded by this hybrid (kept as reference for the
Containerfile-from-scratch alternative)
- secureblue: <https://github.com/secureblue/secureblue>
- BlueBuild: <https://blue-build.org>
- bootc / ostreecontainer docs: <https://docs.fedoraproject.org/en-US/bootc/>
- Yggdrasil: <https://github.com/yggdrasil-network/yggdrasil-go>
- Reticulum manual: <https://reticulum.network/manual/>
- Iroh blobs design: <https://github.com/n0-computer/iroh-blobs/blob/main/DESIGN.md>