veilor-os/docs/STRATEGY.md
veilor-org 7060d9aa6b docs: refine strategy — ostreecontainer install + mesh stack + browser stack
Refines docs/STRATEGY.md per parent-operator handoff (2026-05-05).
Locks in five things the original draft didn't cover, and corrects
one mistake.

## Refinement: ostreecontainer install path

The original draft proposed a two-step install: Anaconda partitions
+ kickstart, then on first boot a `veilor-firstboot-rebase.service`
runs `bootc rebase ghcr.io/veilor/veilor-os:43`. This commit drops
that step.

Anaconda's `ostreecontainer --url=... --transport=registry`
directive populates the root filesystem directly from the OCI image
during the install pass. No first-boot rebase, no transition
window, no second reboot. Same end state, simpler path.

Stay on `ostreecontainer` through v0.8. Do NOT migrate to the new
`bootc` kickstart command until v1.0 — it blocks multi-disk and
authenticated registries. Do NOT use `bootc-image-builder
anaconda-iso` output — deprecated in image-builder v44+. Produce
the OCI image and the bootstrap ISO as separate artifacts.

This compresses the v0.7 BlueBuild spike from 2 days → 1 day.

## Correction: keep Trivalent as default

The original strategy.md treated Trivalent (secureblue's hardened
Chromium) as an override-and-remove. That was wrong: Trivalent's
COPR tracks upstream M147+ within hours, ships hardened_malloc +
JIT-less + Drumbrake WASM. Default browser pick.

Mullvad Browser layered alongside for anti-fingerprint. Thorium
remains opt-in via `ujust install-thorium` only — its CVE lag is
months and contradicts the threat model. Never default.

## Mesh stack baked in

Three-layer warm-stack documented in STRATEGY.md:
- L3a Tailscale + Headscale (Day 1, daily driver)
- L3b Yggdrasil-go (Day 1, idle warm-fallback, AllowedPublicKeys mode)
- L3c Reticulum/RetiNet AGPL fork (opt-in via ujust install-reticulum)

Threat floor table: ISP-DNS-block (i, Day 1), ISP-Tailscale-block
(ii, Phase 2 promote Yggdrasil), internet-down (iii, opt-in RetiNet
+ RNode).

Tier model: tag:admin / tag:infra / tag:guest with failsafe pre-auth
key on yubikey + paper + Authentik OIDC group.

## Onboarding

Token paste / QR (user picks). Misskey signup mints reusable
24h-TTL pre-auth key. NOT auto-OIDC at first boot.

## Iroh seeding daemon stub (v0.8 / Phase 2)

`veilor-seed.service` documented but NOT implemented until Iroh hits
1.0 (current 0.96–0.98 RC, Q1 2026 target slipped). BLAKE3 +
iroh-gossip per-service topic. Static media only — DEFER DB
replication forever.

## External dependency tracked

nullstone Traefik `no-guest@file` ACL is currently 0.0.0.0/0
allow-all (XFF chain breakage 2026-05-03). Must be fixed before
veilor-os first-public-ISO ships, otherwise tag:guest provisioning
leaks the full vhost surface to every veilor user. Parent operator
owns the fix; explicitly out of veilor-os scope.

## Files

- docs/STRATEGY.md — full refinement
- docs/ROADMAP.md — v0.7 spike entry now reflects ostreecontainer
  + mesh stack + 1-day spike target
- README.md — drops the "v0.2.5 pre-release" badge + status box
  (out of date), adds bootc/atomic trajectory paragraph

## What did NOT change

- v0.5.x main branch is untouched. The ostreecontainer swap belongs
  in the v0.7 spike branch, NOT v0.5.32.
- nullstone Traefik config is untouched. Out of scope.
- The kickstart and overlay code is untouched.
2026-05-05 15:15:52 +01:00

316 lines
14 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# veilor-os Strategy — Hybrid kickstart bootstrap + bootc OCI
Decision date: **2026-05-05** (refined same day from parent-operator
handoff, locks the `ostreecontainer` install path, mesh stack
bake-in, browser stack, Iroh seeding roadmap, and threat floor table).
Locked at: **v0.5.31 → v0.7 spike → v1.0**
## TL;DR
- Keep the Anaconda-driven kickstart ISO as the **bootstrap installer**
(LUKS UX is mature, single passphrase prompt, custom partitioning
works).
- Anaconda's `ostreecontainer` directive populates the root filesystem
directly from a **veilor-os OCI image** (built via BlueBuild on top
of secureblue's `securecore-kinoite-hardened-userns`) **during the
install pass — no first-boot rebase, no mutable→atomic transition**.
- All future updates flow through `bootc upgrade` — atomic A/B,
instant rollback, cosign-signed.
- The kickstart-driven mutable-root path is deprecated at v1.0; kept
alive as fallback through v0.7.
## Why hybrid, not pure pivot
Pure pivot to bootc-from-scratch (Agent 3's spike plan) was **1 week
to first ISO**. Pure pivot to layering on secureblue is **2 days to
first ISO** because the hardening work is already done. The
`ostreecontainer` refinement compresses that to **1 day** by
eliminating the first-boot rebase choreography (no
`veilor-firstboot-rebase.service`, no second reboot, no transition
window where the system is half-mutable, half-atomic).
Both pure-pivot paths require throwing away the partitioning UX we
already have working in Anaconda. Hybrid keeps it.
Hybrid:
- **Day-zero install:** Anaconda kickstart + custom partitioning +
LUKS prompt (what we have today). User experience = unchanged.
- **End of install pass:** `ostreecontainer
--url=ghcr.io/veilor/veilor-os:43 --transport=registry` populates
`/` from the OCI image. Transition is invisible.
- **First boot:** veilor OCI tree, no rebase, no special service.
- **Day-2:** `bootc upgrade` cadence for everything from then on.
We keep what works, pivot the part that doesn't.
## ostreecontainer directive (refinement, locked)
Replace the `%packages` block in the install kickstart with:
```
ostreecontainer --url=ghcr.io/veilor/veilor-os:43 --transport=registry
```
Keep the existing `part`/LUKS encryption block verbatim — Anaconda
partitions before `ostreecontainer` populates root.
**Stay on `ostreecontainer` through v0.8.** Do NOT migrate to the new
`bootc` kickstart command until v1.0 — `bootc` blocks multi-disk and
authenticated registries, both of which we'll likely need.
**Do NOT use** `bootc-image-builder anaconda-iso` output —
deprecated in image-builder v44+. Produce the OCI image and the
bootstrap ISO as **separate artifacts**:
- OCI image: BlueBuild recipe → cosign-signed image at
`ghcr.io/veilor/veilor-os:43`
- Bootstrap ISO: Anaconda kickstart with `ostreecontainer` directive
pointing at the OCI image
Reference: <https://docs.fedoraproject.org/en-US/bootc/>; pykickstart
docs for `ostreecontainer`.
## Why secureblue underneath
| Question | Answer |
|---|---|
| Maintainers | secureblue: 30 contributors, 56 commits/5wks. veilor-os: solo. |
| Hardening surface | secureblue ships sysctl + kargs + SELinux + USBGuard + hardened-malloc + DoT — far more than we'd build alone. |
| Build pipeline | BlueBuild → cosign-signed OCI in GH Actions (`build-all.yml`, `trivy.yml`). |
| Update model | bootc upgrade with A/B + instant rollback + signed image chain. |
| Variants | `kinoite-hardened-userns` is the KDE+Wayland+SELinux variant we'd want. |
| License | Apache-2.0 (compatible with our MIT). |
What we override in our recipe:
- **`run0` instead of sudo**: revert. Breaks too many workflows.
- **Xwayland disabled**: revert. Some apps still need it.
- **Veilor branding**: theme, KDE color scheme, Plymouth, SDDM, font,
os-release. All `overlay/*` ports verbatim from current repo.
(Browser stack is its own section below — Trivalent is now a *kept*
default, not an override.)
## Browser stack
| Role | Pick | Source |
|---|---|---|
| **Default browser** | **Trivalent** (secureblue's hardened Chromium) | Fedora COPR `secureblue/trivalent` — tracks upstream M147+ within hours, ships hardened_malloc + JIT-less + Drumbrake WASM |
| **Anti-fingerprint companion** | **Mullvad Browser** | Clearnet, no Tor, layered alongside Trivalent for pseudonymous browsing |
| **Optional opt-in** | **Thorium** | `ujust install-thorium` only — WARN users of months-long CVE lag (LTS Chromium base, ~9 milestones behind upstream stable as of 2026-05) |
**DO NOT default to Thorium under any circumstances** — contradicts
the threat model. Trivalent's COPR keeps us inside one-hour-of-upstream
patch latency; Thorium is multi-month-stale and is a perf/media
profile choice, not a security choice.
The earlier draft of this doc treated Trivalent as an override-and-
remove. That was wrong: Trivalent is exactly the level of hardening
we want for a default browser. Keep it. Add Mullvad alongside.
Move Thorium behind an explicit opt-in.
## Mesh stack — three-layer warm-stack
Day 1 ships layers 1 (Tailscale) and 2 (Yggdrasil idle). Layer 3
(Reticulum) is opt-in via `ujust`.
### Layer 1 — Tailscale + Headscale (daily driver)
- Already running on `nullstone`, `hs.s8n.ru`. OIDC via Authentik.
- Veilor OS ships `tailscale-1.94.2+` from official Fedora repo.
- Service unit **pre-disabled** at install time.
- First-boot prompt: "join Veilor mesh? [paste / QR]". On accept:
`tailscale up --login-server=https://hs.s8n.ru` with the user's
pre-auth key.
### Layer 2 — Yggdrasil-go (warm fallback, idle by default)
- `yggdrasil-go` 0.5.13+ from COPR / dnf.
- Decentralized IPv6 in `200::/7`.
- systemd unit **enabled** but config = empty `Listen[]`, one
`Public peer` (e.g. `vpn.itrus.su` or another EU peer),
`AllowedPublicKeys` allowlist mode (no allow-all).
- WSS:443 transport for ISP DPI evasion.
- Generates ECC keypair on first boot via systemd-tmpfiles or
firstboot script.
- Survives ISP-level Tailscale block (threat floor (ii)).
### Layer 3 — Reticulum (opt-in)
- **RetiNet AGPL fork** (NOT upstream RNS — upstream has an anti-AI
license clause incompatible with our governance). Sourced from the
Codeberg AGPL fork.
- Sideband (Android/desktop messenger built on RNS).
- Install via `ujust install-reticulum`. NOT auto-started until
RetiNet stabilizes.
- Default config when enabled: `AutoInterface` (LAN multicast) +
12 TCP backbone peers.
- RNode hardware (LoRa transceiver) bundle as separate
`ujust install-reticulum-rnode`.
- Survives total internet outage (threat floor (iii)) when paired
with RNode.
## Onboarding model
Token-based (paste OR QR, user picks). Misskey signup page mints a
**reusable pre-auth key** (TTL=24h, single-use, regenerated per
signup). First boot of Veilor ISO accepts hex paste OR QR scan of
the same key.
**NOT auto-OIDC at first boot** — too much Authentik exposure for
day-zero users.
## Tier model — three-tier
- `tag:admin` — onyx + failsafe. Full mesh, `*:*`.
- `tag:infra` — nullstone, office. Mesh among themselves; admin
inbound only.
- `tag:guest` — Veilor OS users + friend. ONLY `x.veilor:443`
reachable + future seeded service hostnames whitelisted.
- **Failsafe** — pre-baked admin pre-auth key on yubikey + printed
paper + Authentik OIDC group `tailnet-admin` as second auth path.
## Threat floor table
| Floor | Attack | Day 1 (v0.7 ship) | Phase 2 (v0.8) |
|-------|--------|---|---|
| (i) | ISP blocks `s8n.ru` DNS | Tailscale dies, Yggdrasil survives | YES (documented failover) |
| (ii) | ISP blocks Tailscale protocol | Yggdrasil-WSS:443 survives | YES |
| (iii) | Internet unreachable | RNS over LoRa survives | OPT-IN (RetiNet + RNode) |
Day 1 must hold floor (i). Floors (ii) and (iii) become P2 once
Yggdrasil is promoted from idle to documented failover.
## Iroh seeding daemon (Phase 2 / v0.8)
- `veilor-seed.service` systemd unit, runs as `_veilor-seed` user.
- Watches `/var/lib/<service>/files/` blob store directories.
- BLAKE3-hashes new blobs, registers with local iroh node.
- Publishes tickets on per-service `iroh-gossip` topic.
- LRU local cache, default 10 GB.
- Sidecar mirrors service blob stores: Misskey `/files/`, Matrix
media, `dl.veilor` downloads.
- Other Veilor nodes pull lazily on cache miss.
- **DEFER DB replication forever.** Static media only.
DOCUMENT but DO NOT IMPLEMENT until **Iroh hits 1.0** (currently
0.960.98 RC season; 1.0 target Q1 2026 slipped, watching).
Reference: <https://github.com/n0-computer/iroh-blobs/blob/main/DESIGN.md>.
## External dependency — Phase 0 (NOT veilor-os scope)
Real ACL gap on nullstone Traefik right now: friend on `tag:guest`
can reach `nullstone:443` → SNI-routes to ALL Traefik vhosts
(`sys.s8n.ru`, `pihole.s8n.ru`, `hs.s8n.ru`, `auth.s8n.ru`, n8n, rc,
mx, …). Only per-vhost auth blocks them. The `no-guest@file` Traefik
middleware that should fix this is currently an `0.0.0.0/0`
allow-all stub (neutralized 2026-05-03 from XFF chain breakage).
**veilor-os does NOT fix this.** Tracked here as an external
dependency: ACL fix on nullstone Traefik **required before veilor-os
first-public-ISO ships**, otherwise `tag:guest` provisioning leaks
the full vhost surface to every veilor user. Parent operator owns it.
## Strategic credibility win
secureblue does NOT publish a threat model. Athena OS does, and it's
their main differentiator. We've already drafted
`docs/THREAT-MODEL.md` (Agent 5 of 2026-05-05 wave). Publishing that
*before* the v0.7 launch positions veilor-os ahead of secureblue and
Athena on the one axis that matters most for a security-branded
distro: **honest, scoped, public threat model**.
## Roadmap implications
| Version | Status | Path |
|---|---|---|
| v0.5.31 | shipped | Anaconda kickstart, mutable root |
| v0.5.32 | active — top blockers from 9-agent wave | Anaconda kickstart |
| v0.5.x → v0.6 | maintenance | Anaconda kickstart, ergonomics + UX polish |
| **v0.7 spike** | **1-day BlueBuild prototype** (was 2 days; `ostreecontainer` removes first-boot-rebase work) | First veilor OCI image extending secureblue-kinoite-hardened |
| v0.7 ship | ISO bootstraps install, `ostreecontainer` populates from OCI in-pass | Hybrid path live |
| v0.8 | Iroh seeding (P2P static media), Yggdrasil promoted from idle to documented failover, RetiNet stabilization watch | bootc-only direction |
| **v1.0** | **bootc-only**, kickstart deprecated, possibly migrate `ostreecontainer` → new `bootc` kickstart command if multi-disk + auth-registry blockers resolved upstream | `bootc upgrade` for all updates |
The Containerfile-from-scratch spike plan (Agent 3 of 2026-05-05
wave) is **superseded** by this hybrid: don't build a Containerfile
from scratch on `fedora-bootc:43`. Instead, write a BlueBuild recipe
on `securecore-kinoite-hardened-userns`. With `ostreecontainer`
swap, spike compresses 1 week → 1 day.
## Next concrete steps
### v0.5.32 — current (no strategy change)
Ship the 7 blockers from `docs/research/2026-05-05-agent-wave/`:
suspend/resume wifi fix, firstboot WantedBy, USBGuard id-rules,
firewalld tailscale0 zone, KMS modeset, /etc/skel branding, virtio-9p
log capture.
`ostreecontainer` swap **does NOT land in v0.5.32 main.** It belongs
in the v0.7 spike branch only.
### v0.7-spike (1 day, separate branch)
1. New repo dir: `bluebuild/recipe.yml`.
2. `from`: `ghcr.io/secureblue/securecore-kinoite-hardened-userns:latest`.
3. Override modules:
- `type: files` — stamp our `overlay/*` tree (branding, themes,
veilor scripts, sddm theme, plymouth theme).
- `type: rpm-ostree` — install Mullvad Browser + restore Xwayland +
re-enable sudo (revert run0).
- **Keep Trivalent** as default (was wrongly marked for removal in
the first draft of this doc).
- `type: brand` — PRETTY_NAME, GRUB_DISTRIBUTOR, distributor URL.
- `type: files` — pre-disabled `tailscale.service`, idle
`yggdrasil.service`, `ujust install-reticulum` and
`ujust install-thorium` recipes.
4. `.github/workflows/build-bluebuild.yml` — pull BlueBuild action,
build + cosign sign + push to GHCR.
5. `kickstart/install.ks` — replace `%packages` block with
`ostreecontainer --url=ghcr.io/veilor/veilor-os:43
--transport=registry`. Keep existing partitioning + LUKS block
verbatim. **Drop** all planned `veilor-firstboot-rebase.service`
work — no longer needed.
### v1.0 — bootc-only
- Drop `kickstart/veilor-os.ks`, drop `livecd-creator` workflow.
- Bootstrap ISO is built as a **separate artifact** (NOT via
`bootc-image-builder anaconda-iso`, which was deprecated in
image-builder v44).
- The OCI image is the source of truth.
- `veilor-update` becomes thin `bootc upgrade --apply` wrapper.
- Migrate `ostreecontainer` directive → new `bootc` kickstart
command IF multi-disk + authenticated-registry support has landed
upstream by then.
## Open questions
- Does secureblue accept upstream contributions? If yes, send our
USBGuard id-based-rules fix and our threat-model framework.
- Recovery flow when `ostreecontainer` install pass fails — Anaconda
should abort cleanly; verify in spike that no half-installed
state is bootable.
- Iroh 1.0 timing — currently 0.960.98 RC; Q1 2026 target slipped.
Re-evaluate Phase 2 schedule when 1.0 lands.
- RetiNet upstream stabilization — track Codeberg fork for releases.
If it stalls > 6 months we re-evaluate Layer 3.
- Fedora 44 transition: secureblue tracks Fedora releases (current
`v4.9` on F44). If we follow, we get F44 for free at the same time
upstream does.
## See also
- `docs/THREAT-MODEL.md` — drafted, needs publish for v0.7
- `docs/ROADMAP.md` — updated to reflect this strategy
- `docs/research/2026-05-05-agent-wave/03-bootc-spike-plan.md`
superseded by this hybrid (kept as reference for the
Containerfile-from-scratch alternative)
- secureblue: <https://github.com/secureblue/secureblue>
- BlueBuild: <https://blue-build.org>
- bootc / ostreecontainer docs: <https://docs.fedoraproject.org/en-US/bootc/>
- Yggdrasil: <https://github.com/yggdrasil-network/yggdrasil-go>
- Reticulum manual: <https://reticulum.network/manual/>
- Iroh blobs design: <https://github.com/n0-computer/iroh-blobs/blob/main/DESIGN.md>