diff --git a/runbooks/HA-CLUSTER-distribute-and-sync.md b/runbooks/HA-CLUSTER-distribute-and-sync.md
new file mode 100644
index 0000000..d4691b4
--- /dev/null
+++ b/runbooks/HA-CLUSTER-distribute-and-sync.md
@@ -0,0 +1,291 @@
+# Runbook — distribute load + sync data across nodes
+
+> **Goal:** spread compute across 2+ nodes (nullstone + cobblestone +
+> more later) on the Tailscale mesh, with stateful data replicated so any
+> single-host loss = ≤15min RPO, ≤5min RTO.
+>
+> **Operator vision (2026-05-07):** "down to distribute the load but
+> sync the data". Not a real K8s cluster — a deliberate
+> primary/secondary pair with per-service replication that grows
+> into a 3-node scheduler later.
+>
+> **Status:** PLANNING. Build cobblestone first, then migrate in phases.
+
+---
+
+## North-star architecture
+
+```
+┌─────────────────────┐         ┌─────────────────────┐
+│ nullstone           │ ◄─────► │ cobblestone         │
+│ tailscale 100.64.0.2│  mesh   │ tailscale 100.64.0.5│
+│ Debian 13           │         │ Debian 13           │
+│ ZFS root            │         │ ZFS root            │
+└─────────────────────┘         └─────────────────────┘
+   ▲                               ▲
+   │ ZFS send/recv every 15min
+   │ Postgres streaming replication
+   │ Forgejo runner per node
+   │ Tailscale-mediated mTLS
+   └────────────► both serve from synced state
+                  (one primary, one warm)
+```
+
+Phase 4 (3+ nodes) → swap manual placement for K3s or Nomad.
+Until then, node placement is manual: a systemd unit on each host.
+
+---
+
+## Per-layer sync mechanism
+
+| Layer | Tool | RPO | Notes |
+|---|---|---|---|
+| Network | Tailscale + Headscale (already up) | 0 | hs.s8n.ru on nullstone now; moves to cobblestone in Phase 3 |
+| Filesystem volumes | **ZFS snapshots + zfs send/recv** | 15min | `zrepl` daemon handles the schedule + retention policy (sketch below) |
+| Postgres DBs (matrix, misskey, authentik, n8n) | streaming replication | <1s lag | one hot standby per primary |
+| Redis (misskey-redis, authentik-redis) | Redis Sentinel | <1s | auto-failover; needs 3 sentinel processes |
+| Static files (assets, wallpapers, configs) | Syncthing | seconds | already a known pattern in this stack |
+| Object/blob (uploads, attachments) | rclone bisync nightly + ZFS | 15min | for large media that doesn't fit Syncthing's index well |
+| Secrets / .env | sops + age + git | per-commit | encrypted at rest in Forgejo, decrypted on-host via systemd-creds |
+| DNS | Gandi LiveDNS, TTL 60s | n/a | swing the record on failover |
+| Backups (offsite) | Restic to B2/Wasabi | ≤24h (nightly) | both nodes back up to the same bucket; dedupe |
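+
+The ZFS row leans on `zrepl` for scheduling and pruning. Below is a minimal
+sketch of the two `/etc/zrepl/zrepl.yml` files, assuming the `tank/home/docker`
+dataset from Phase 1 and zrepl's plain-TCP transport (tolerable only because
+the tailnet already encrypts node-to-node traffic). Verify the schema against
+the zrepl docs before deploying:
+
+```bash
+# On nullstone (primary): hypothetical push job. Snapshots every 15min;
+# the retention grid mirrors the Phase 2 policy (24 hourlies, 30 dailies, 12 weeklies).
+cat > /etc/zrepl/zrepl.yml <<'EOF'
+jobs:
+  - name: docker_push
+    type: push
+    connect:
+      type: tcp
+      address: "100.64.0.5:8888"   # cobblestone, over the tailnet
+    filesystems:
+      "tank/home/docker<": true    # the dataset and all children
+    snapshotting:
+      type: periodic
+      prefix: zrepl_
+      interval: 15m
+    pruning:
+      keep_sender:
+        - type: not_replicated     # never prune what hasn't reached cobblestone
+        - type: grid
+          grid: 1x1h(keep=all) | 24x1h | 30x1d | 12x7d
+          regex: "^zrepl_"
+      keep_receiver:
+        - type: grid
+          grid: 1x1h(keep=all) | 24x1h | 30x1d | 12x7d
+          regex: "^zrepl_"
+EOF
+
+# On cobblestone (secondary): hypothetical sink job. root_fs is where received
+# datasets land; align it with the Phase 1 seed layout.
+cat > /etc/zrepl/zrepl.yml <<'EOF'
+jobs:
+  - name: docker_sink
+    type: sink
+    serve:
+      type: tcp
+      listen: ":8888"
+      clients:
+        "100.64.0.2": "nullstone"
+    root_fs: "tank/sink"
+EOF
+```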
+
+---
+
+## Service placement plan
+
+Pinned services run on one node only; the **secondary** runs warm-stopped.
+
+| Service | Primary | Secondary | Notes |
+|---|---|---|---|
+| **Forgejo (git.s8n.ru)** | cobblestone | nullstone (warm) | Move off nullstone so buildah on the runner doesn't impact MC |
+| **Forgejo runner #1** | nullstone | n/a | Runs alongside MC; capped by a cgroup CPU quota |
+| **Forgejo runner #2** | cobblestone | n/a | Bigger jobs land here |
+| **Tuwunel (matrix.veilor.uk)** | nullstone | cobblestone (warm) | DB streamed; warm Tuwunel on cobblestone w/ tuwunel-state-pgsync.timer |
+| **Tuwunel-txt + Cinny (mx.s8n.ru)** | cobblestone | nullstone (warm) | Symmetric — operator-redundant |
+| **Misskey (x.veilor)** | nullstone | cobblestone (warm) | Heavy. Move once cobblestone has >32GB |
+| **Authentik (auth.s8n.ru)** | cobblestone | nullstone (warm) | SSO is critical — keep both up |
+| **Headscale (hs.s8n.ru)** | cobblestone (Phase 3) | nullstone | Control plane = single instance. Use cobblestone (the more reliable hardware target) |
+| **Pi-hole DNS** | nullstone | cobblestone | DNS = critical — both must answer; clients get the Pi-hole pair via DHCP |
+| **Traefik** | both, separate certs | — | active/active; Cloudflare or DNS round-robin in front |
+| **Step-CA** | nullstone | cold standby | Internal PKI; rare writes; back up the root key offline |
+| **Minecraft (mc.racked.ru)** | nullstone | none | Pinned. World data ZFS-replicated for DR only |
+| **anythingllm + dl-veilor** | cobblestone | nullstone (warm) | Light load; place where ZFS has more capacity |
+| **n8n** | cobblestone | nullstone (warm) | Cron-driven; can run on either |
+| **Filebrowser-mc** | nullstone | n/a | Pinned to MC for chunkfix volume access |
+
+---
+
+## Phase plan
+
+### Phase 1 — bring cobblestone alive (1–2 evenings)
+
+- [ ] Install Debian 13 on cobblestone with **ZFS root + ECC RAM**
+- [ ] LUKS2 + argon2id (matches our nullstone target post-migration)
+- [ ] Install: openssh, tailscale, docker, zfsutils-linux, postgresql-client (for replication setup)
+- [ ] Join the Tailscale mesh as `100.64.0.5` via Headscale on nullstone
+- [ ] SSH in via the tailnet; firewall = drop everything except inbound 22 from the tailnet
+- [ ] Mirror nullstone's `/opt/docker/` layout
+- [ ] Initial seed via `zfs send tank/home/docker@seed | ssh cobblestone zfs recv tank/home/docker`
+- [ ] Bring up read-only services (Forgejo runner #2, optional warm Misskey)
+- [ ] Verify the nodes can reach each other (use `100.64.0.x` directly, not DNS, for replication)
+
+### Phase 2 — replicate stateful services (1 weekend)
+
+- [ ] Set up a Postgres primary on nullstone and a replica on cobblestone for each DB:
+      `matrix-postgres`, `misskey-postgres`, `authentik-postgres`, `n8n-postgres`
+      (bootstrap sketch after this list)
+- [ ] Test replication: `pg_stat_replication` shows <1s lag
+- [ ] Set up Redis Sentinel for `misskey-redis` and `authentik-redis`
+      (3 sentinel processes minimum: nullstone, cobblestone, friend-PC laptop; sketch after this list)
+- [ ] Install the `zrepl` daemon on both sides; policy: snapshot every 15min, keep hourlies 24h, dailies 30d, weeklies 12w
+- [ ] Syncthing for static assets / configs
+- [ ] DR drill: stop nullstone's postgres, verify the cobblestone replica serves reads and can be promoted
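+
+The full Postgres procedure is deferred to its own runbook (noted under
+Concrete first-step commands below), but here is a hedged preview of the
+per-DB bootstrap. The container names, data path, `replicator` role, and
+slot name are all assumptions:
+
+```bash
+# On nullstone (primary), once per database container (hypothetical names):
+docker exec matrix-postgres psql -U postgres -c \
+  "CREATE ROLE replicator WITH REPLICATION LOGIN PASSWORD 'changeme';"
+# pg_hba.conf: allow the standby over the tailnet:
+#   host replication replicator 100.64.0.5/32 scram-sha-256
+# postgresql.conf: wal_level = replica, max_wal_senders >= 3
+
+# On cobblestone (standby): seed the data dir. -R writes standby.signal and
+# primary_conninfo; -C -S creates a named replication slot on the primary.
+pg_basebackup -h 100.64.0.2 -U replicator \
+  -D /opt/docker/matrix-postgres/data \
+  -R -X stream -C -S matrix_standby
+
+# Verify from the primary; lag should sit well under 1s:
+docker exec matrix-postgres psql -U postgres -c \
+  "SELECT client_addr, state, replay_lag FROM pg_stat_replication;"
+
+# DR drill: promote the standby (it becomes writable):
+docker exec matrix-postgres-standby psql -U postgres -c "SELECT pg_promote();"
+```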
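+
+Sentinel needs a quorum, hence the three processes. A minimal monitor config
+for one primary follows; the master name matches the compose service, while
+the port and timeouts are assumptions:
+
+```bash
+# One sentinel per host (nullstone, cobblestone, friend-PC): hypothetical sketch.
+cat > /etc/redis/sentinel-misskey.conf <<'EOF'
+port 26379
+# sentinel monitor <master-name> <ip> <port> <quorum>: quorum 2 of 3
+sentinel monitor misskey-redis 100.64.0.2 6379 2
+sentinel down-after-milliseconds misskey-redis 5000
+sentinel failover-timeout misskey-redis 60000
+EOF
+redis-sentinel /etc/redis/sentinel-misskey.conf
+```
+
+Add a second `sentinel monitor` block for `authentik-redis`; a single
+sentinel process can watch both masters.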
+
+### Phase 3 — selective primary moves
+
+- [ ] Move Forgejo to cobblestone (operator pain point: builds no longer compete with MC)
+- [ ] The Forgejo runner ON nullstone keeps running — registers with a `nullstone` label
+- [ ] The Forgejo runner ON cobblestone registers with a `cobblestone` label
+- [ ] DNS swing: `git.s8n.ru` A record → cobblestone tailnet IP via the Gandi LiveDNS API (sketch in the appendix below)
+- [ ] Move Headscale to cobblestone (the control plane needs the reliable host)
+- [ ] Move Authentik to cobblestone
+
+### Phase 4 — when 3+ nodes (deferred)
+
+- [ ] Stand up K3s on the three nodes with embedded etcd
+- [ ] Longhorn for distributed PVs (or NFS w/ Heartbeat if we want simpler)
+- [ ] Migrate stateless services to Deployment + Service
+- [ ] Stateful services stay on labeled nodes (StatefulSet w/ nodeSelector)
+- [ ] Drop the bespoke ZFS replication for services Longhorn manages
+
+---
+
+## Concrete first-step commands
+
+When cobblestone is racked and SSH-able from nullstone:
+
+```bash
+# On nullstone — initial ZFS seed. -r snapshots the child datasets too,
+# which the recursive (-R) send requires.
+zfs snapshot -r tank/home/docker@cobblestone-seed
+zfs send -R tank/home/docker@cobblestone-seed | \
+  ssh root@100.64.0.5 zfs recv -F tank/home/docker
+
+# On both — install zrepl (on Debian, from zrepl's apt repo; see the zrepl docs)
+apt install zrepl
+# Configure /etc/zrepl/zrepl.yml — primary push job + secondary sink job
+# (sketch above, under "Per-layer sync mechanism")
+
+# No cron needed: zrepl runs as its own daemon
+systemctl enable --now zrepl
+```
+
+Postgres replication setup is per-DB; document it in
+`runbooks/POSTGRES-streaming-replication.md` when Phase 2 lands.
+
+---
+
+## Failure scenarios + RTO
+
+| Scenario | RTO | RPO | Action |
+|---|---|---|---|
+| nullstone reboot (planned) | 5min | 0 | DNS-swing services to cobblestone before the reboot, swing back after |
+| nullstone hardware failure | ~30min manual | 15min | Promote the postgres replicas, swing all DNS, start the warm services on cobblestone |
+| cobblestone hardware failure | ~5min (automatic) | 15min | nullstone still runs the primaries; we just lose runner #2 + Authentik until rebuild |
+| Both nodes fail | hours | ≤24h (nightly restic) | Restic restore from B2/Wasabi onto a temp box |
+| Network partition | seconds | n/a | Tailscale heals; postgres replication catches up; brief degraded UX |
+
+---
+
+## Open decisions
+
+- **ECC**: for verified ECC, cobblestone needs the right AM5 platform (e.g. a B650E board). Cheap-out option = unverified ECC on a consumer board (still works on AMD).
+- **Distros**: stay on Debian 13 for both? OR put veilor-os on cobblestone once it's stable?
+  - **Recommendation**: cobblestone runs **Debian 13** (server OS, server kernel) with ZFS root. veilor-os is an end-user desktop spin — the wrong fit for a production server role. Don't conflate the two.
+- **Headscale move**: Phase 3, OR sooner if we don't trust nullstone? Defer until we can prove cobblestone uptime.
+- **Minecraft on cobblestone instead?** No — moving MC = downtime + map-mismatch risk. Stays pinned.
+- **K3s vs Nomad**: revisit at Phase 4. Nomad is simpler for a small op; K3s is mainstream with the bigger ecosystem.
+
+---
+
+## Related
+
+- `runbooks/MIGRATION-nullstone-to-cobblestone.md` — original migration draft
+- `runbooks/COBBLESTONE-INTAKE.md` — hardware spec template
+- `STATE.md` — current node + service state
+- Memory: `project_tailscale_mesh.md` — the mesh's shape today
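+
+---
+
+## Appendix — DNS swing sketch
+
+Phase 3 and the failure table both lean on a Gandi LiveDNS record swing.
+Here is a minimal sketch, assuming an API key in `$GANDI_API_KEY`; check
+Gandi's LiveDNS v5 docs for the current auth scheme (API key vs. personal
+access token) before relying on it:
+
+```bash
+# Hypothetical sketch: repoint git.s8n.ru at cobblestone's tailnet IP.
+# TTL 60 matches the per-layer table, so the swing takes effect in ~1min.
+curl -s -X PUT \
+  "https://api.gandi.net/v5/livedns/domains/s8n.ru/records/git/A" \
+  -H "Authorization: Apikey $GANDI_API_KEY" \
+  -H "Content-Type: application/json" \
+  -d '{"rrset_ttl": 60, "rrset_values": ["100.64.0.5"]}'
+```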