runbook: distribute load + sync data (operator's HA vision)

Two-node primary/secondary architecture with per-service replication:
ZFS send/recv 15min for volumes, postgres streaming replication for
DBs, Redis Sentinel, Tailscale mesh. Phased plan from cobblestone
intake to eventual K3s/Nomad cluster at 3+ nodes. Service placement
table, failure-scenario RTO/RPO matrix, open decisions documented.
This commit is contained in:
obsidian-ai 2026-05-07 03:51:14 +01:00
parent ec3d250340
commit b109933529

# Runbook — distribute load + sync data across nodes
> **Goal:** spread compute across 2+ nodes (nullstone + cobblestone +
> later) on Tailscale mesh, with stateful data replicated so any
> single-host loss = ≤15min RPO, ≤5min RTO.
>
> **Operator vision (2026-05-07):** "down to distribute the load but
> sync the data". Not a real K8s cluster — a deliberate
> primary/secondary pair with per-service replication that grows
> into a 3-node scheduler later.
>
> **Status:** PLANNING. Build cobblestone first, then phase migration.
---
## North-star architecture
```
┌─────────────────────┐ ┌─────────────────────┐
│ nullstone │ ◄─────► │ cobblestone │
│ tailscale 100.64.0.2│ mesh │ tailscale 100.64.0.5│
│ Debian 13 │ │ Debian 13 │
│ ZFS root │ │ ZFS root │
└─────────────────────┘ └─────────────────────┘
▲ ▲
│ ZFS send/recv every 15min
│ Postgres streaming replication
│ Forgejo runner per node
│ Tailscale-mediated mTLS
└────────────► both serve from synced state
(one primary, one warm)
```
Phase 4 (3+ nodes) → swap manual placement for K3s or Nomad.
Until then, manual node selection via systemd unit on each host.
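One way to express that manual pinning is a `ConditionHost=` guard: ship identical unit files to both nodes and let systemd start each stack only on its designated host. A sketch, with a hypothetical unit name and compose path (written to `/tmp` here; `/etc/systemd/system/` on a real host):

```bash
cat > /tmp/forgejo-stack.service <<'EOF'
[Unit]
Description=Forgejo compose stack (pinned)
# systemd skips this unit unless the hostname matches:
ConditionHost=cobblestone
Requires=docker.service
After=docker.service

[Service]
Type=oneshot
RemainAfterExit=yes
WorkingDirectory=/opt/docker/forgejo
ExecStart=/usr/bin/docker compose up -d
ExecStop=/usr/bin/docker compose down

[Install]
WantedBy=multi-user.target
EOF
# Failover = override ConditionHost via a drop-in, then daemon-reload + start.
```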

---
## Per-layer sync mechanism
| Layer | Tool | RPO | Notes |
|---|---|---|---|
| Network | Tailscale + Headscale (already up) | 0 | hs.s8n.ru on nullstone now; move to cobblestone phase 3 |
| Filesystem volumes | **ZFS snapshots + zfs send/recv** | 15min | `zrepl` daemon for the cron + retention policy |
| Postgres DBs (matrix, misskey, authentik, n8n) | streaming replication | <1s lag | hot standby per primary |
| Redis (misskey-redis, authentik-redis) | Redis Sentinel | <1s | auto-failover; needs 3 sentinel processes |
| Static files (assets, wallpapers, configs) | Syncthing | seconds | already a known pattern in this stack |
| Object/blob (uploads, attachments) | rclone bisync nightly + ZFS | 15min | for large media that doesn't fit Syncthing's index well |
| Secrets / .env | sops + age + git | per-commit | encrypted at rest in Forgejo, decrypted on host via systemd-creds |
| DNS | Gandi LiveDNS, TTL 60s | n/a | swing record in failover |
| Backups (offsite) | Restic to B2/Wasabi | nightly | both nodes back up to same bucket; dedupe |
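
The Restic row could look like the following nightly job — the bucket name, password path, and retention numbers are placeholders, not agreed values:

```bash
cat > /tmp/restic-nightly.sh <<'EOF'
#!/bin/bash
set -euo pipefail
export RESTIC_REPOSITORY="b2:homelab-backups"      # hypothetical bucket name
export RESTIC_PASSWORD_FILE=/root/.restic-pass
# Both nodes push to the same repo; restic dedupes shared blocks.
restic backup /opt/docker --tag "$(hostname)"
# Retention roughly mirrors the zrepl grid: 30 daily, 12 weekly.
restic forget --tag "$(hostname)" --keep-daily 30 --keep-weekly 12 --prune
EOF
chmod +x /tmp/restic-nightly.sh
```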
---
## Service placement plan
Pinned services run on one node only; a **warm** secondary keeps data synced with the service stopped, ready to start on failover.
| Service | Primary | Secondary | Notes |
|---|---|---|---|
| **Forgejo (git.s8n.ru)** | cobblestone | nullstone (warm) | Move off nullstone so buildah on runner doesn't impact MC |
| **Forgejo runner #1** | nullstone | n/a | Runs alongside MC; cgroup CPU quota limit |
| **Forgejo runner #2** | cobblestone | n/a | Bigger jobs land here |
| **Tuwunel (matrix.veilor.uk)** | nullstone | cobblestone (warm) | DB streamed; warm Tuwunel on cobblestone w/ tuwunel-state-pgsync.timer |
| **Tuwunel-txt + Cinny (mx.s8n.ru)** | cobblestone | nullstone (warm) | Symmetric — operator-redundant |
| **Misskey (x.veilor)** | nullstone | cobblestone (warm) | Heavy. Move when cobblestone has >32GB RAM |
| **Authentik (auth.s8n.ru)** | cobblestone | nullstone (warm) | SSO is critical — keeps both up |
| **Headscale (hs.s8n.ru)** | cobblestone (Phase 3) | nullstone | Control plane = single. Use cobblestone (more reliable hw target) |
| **Pi-hole DNS** | nullstone | cobblestone | DNS = critical — both must answer; client uses Pi-hole pair via DHCP |
| **Traefik** | both, separate certs | — | active/active; Cloudflare or DNS round-robin in front |
| **Step-CA** | nullstone | cold standby | Internal PKI; rare write; backup root key offline |
| **Minecraft (mc.racked.ru)** | nullstone | none | Pinned. World data ZFS-replicated for DR only |
| **anythingllm + dl-veilor** | cobblestone | nullstone (warm) | Light load; place where ZFS has more capacity |
| **n8n** | cobblestone | nullstone (warm) | Cron-driven; can run on either |
| **Filebrowser-mc** | nullstone | n/a | Pinned to MC for chunkfix volume access |
---
## Phase plan
### Phase 1 — bring cobblestone alive (12 evenings)
- [ ] Install Debian 13 on cobblestone with **ZFS root + ECC RAM**
- [ ] LUKS2 + argon2id (matches our nullstone target post-migration)
- [ ] Install: openssh, tailscale, docker, zfsutils-linux, postgresql-client (for replication setup)
- [ ] Join Tailscale mesh under `100.64.0.5` via Headscale on nullstone
- [ ] SSH in via tailnet; firewall = drop except inbound 22 from tailnet
- [ ] Mirror nullstone's `/opt/docker/` layout
- [ ] Initial seed via `zfs send tank/home/docker@seed | ssh cobblestone zfs recv tank/home/docker`
- [ ] Bring up read-only services (Forgejo runner #2, optional warm Misskey)
- [ ] Verify each can resolve neighbour hostnames (use `100.64.0.x` direct, not DNS, for replication)
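
The firewall item above could start from a ruleset like this nftables sketch — the interface name `tailscale0` and the file path are assumptions (on the real host this is `/etc/nftables.conf`):

```bash
cat > /tmp/tailnet.nft <<'EOF'
flush ruleset
table inet filter {
  chain input {
    type filter hook input priority 0; policy drop;
    ct state established,related accept
    iif "lo" accept
    # only tailnet peers reach sshd
    iifname "tailscale0" tcp dport 22 accept
  }
}
EOF
# nft -c -f /tmp/tailnet.nft   # syntax-check before loading for real
```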
### Phase 2 — replicate stateful (1 weekend)
- [ ] Set up Postgres primary on nullstone, replica on cobblestone, for each DB: `matrix-postgres`, `misskey-postgres`, `authentik-postgres`, `n8n-postgres`
- [ ] Test replication: `pg_stat_replication` shows < 1s lag
- [ ] Set up Redis Sentinel for `misskey-redis` and `authentik-redis` (3 sentinel processes minimum: nullstone, cobblestone, friend-PC laptop)
- [ ] `zrepl` daemon installed both sides; replication policy 15min snapshot, hourly retention 24h, daily 30d, weekly 12w
- [ ] Syncthing for static assets / configs
- [ ] Test DR drill: stop nullstone postgres, verify the cobblestone replica can be promoted to read-write
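
The replica bootstrap in the first item could be sketched per-DB roughly as below — the replication user, slot name, and the PostgreSQL 17 Debian data path are assumptions to adjust:

```bash
cat > /tmp/bootstrap-replica.sh <<'EOF'
#!/bin/bash
set -euo pipefail
# Clone the primary's data dir over the tailnet and enter standby mode.
# -R writes standby.signal + primary_conninfo; -C -S creates a named replication slot.
pg_basebackup -h 100.64.0.2 -U replicator \
  -D /var/lib/postgresql/17/main -R -X stream -C -S cobblestone_slot
EOF
chmod +x /tmp/bootstrap-replica.sh
# Afterwards, on the primary, lag should be sub-second:
#   psql -x -c "SELECT client_addr, state, replay_lag FROM pg_stat_replication;"
```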
### Phase 3 — selective primary moves
- [ ] Move Forgejo to cobblestone (operator pain: builds no longer compete with MC)
- [ ] Forgejo runner ON nullstone keeps running — registers as `nullstone` label
- [ ] Forgejo runner ON cobblestone registers as `cobblestone` label
- [ ] DNS swing: `git.s8n.ru` A record → cobblestone tailnet IP via Gandi LiveDNS API
- [ ] Move Headscale to cobblestone (control plane needs reliable host)
- [ ] Move Authentik to cobblestone
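
The DNS swing step might look like this sketch against Gandi's LiveDNS v5 API — the endpoint shape and auth header should be verified against Gandi's current docs before scripting for real:

```bash
NEW_IP=100.64.0.5
# Build the rrset payload (TTL 60s per the sync table above):
PAYLOAD=$(printf '{"rrset_ttl": 60, "rrset_values": ["%s"]}' "$NEW_IP")
echo "$PAYLOAD" > /tmp/swing-payload.json
# Network call left commented — token and endpoint are assumptions:
# curl -X PUT "https://api.gandi.net/v5/livedns/domains/s8n.ru/records/git/A" \
#   -H "Authorization: Bearer $GANDI_PAT" -H "Content-Type: application/json" \
#   --data "$PAYLOAD"
```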
### Phase 4 — when 3+ nodes (deferred)
- [ ] Stand up K3s on the three nodes with embedded etcd
- [ ] Longhorn for distributed PV (or NFS with Heartbeat for a simpler setup)
- [ ] Migrate stateless services to Deployment+Service
- [ ] Stateful services stay on labeled nodes (StatefulSet w/ nodeSelector)
- [ ] Drop bespoke ZFS replication for services managed by Longhorn
---
## Concrete first-step commands
When cobblestone is racked and SSH-able from nullstone:
```bash
# On nullstone — initial ZFS seed
zfs snapshot tank/home/docker@cobblestone-seed
zfs send -R tank/home/docker@cobblestone-seed | \
  ssh root@100.64.0.5 zfs recv -F tank/home/docker
# On both — install zrepl
# zrepl ships Debian packages via its own apt repo — add it per the zrepl docs, then:
apt install zrepl
# Configure /etc/zrepl/zrepl.yml — primary push job + secondary sink job
# Cron-equivalent: zrepl is its own daemon
systemctl enable --now zrepl
```
Postgres replication setup is per-DB; document in
`runbooks/POSTGRES-streaming-replication.md` when Phase 2 lands.
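
The `/etc/zrepl/zrepl.yml` mentioned above could start from a push-job sketch like this — field names should be checked against the zrepl docs, and the sink job on cobblestone mirrors it:

```bash
cat > /tmp/zrepl.yml <<'EOF'
jobs:
  - name: docker_push
    type: push
    connect:
      type: tcp
      address: "100.64.0.5:8888"   # cobblestone sink, tailnet-only
    filesystems: {
      "tank/home/docker<": true,
    }
    snapshotting:
      type: periodic
      prefix: zrepl_
      interval: 15m
    pruning:
      keep_sender:
        - type: grid
          grid: 1x1h(keep=all) | 24x1h | 30x1d | 12x7d
          regex: "^zrepl_"
      keep_receiver:
        - type: grid
          grid: 1x1h(keep=all) | 24x1h | 30x1d | 12x7d
          regex: "^zrepl_"
EOF
```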

---
## Failure scenarios + RTO
| Scenario | RTO | RPO | Action |
|---|---|---|---|
| nullstone reboot (planned) | 5min | 0 | DNS swing services to cobblestone before reboot, swing back after |
| nullstone hardware fail | ~30min manual | 15min | Promote postgres replica, swing all DNS, restart warm services on cobblestone |
| cobblestone hardware fail | ~5min manual | 15min | nullstone keeps its primaries; swing DNS for cobblestone-hosted services (Forgejo, Authentik) to the warm nullstone copies; runner #2 down until rebuild |
| Both nodes fail | hours | hours | Restic restore from B2/Wasabi to a temp box |
| Network partition | seconds | n/a | Tailscale heals; postgres replication catches up; brief degraded UX |
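
The "nullstone hardware fail" row implies an ordered manual sequence: promote the replica, start the warm stacks, then swing DNS. A dry sketch — service unit names and the PG data path are placeholders:

```bash
cat > /tmp/promote-cobblestone.sh <<'EOF'
#!/bin/bash
set -euo pipefail
# 1. Promote the streaming replica to read-write primary:
pg_ctl promote -D /var/lib/postgresql/17/main
# 2. Start the warm stacks that were stopped on this node:
for svc in tuwunel misskey authentik n8n; do
  systemctl start "${svc}-stack.service"
done
# 3. Swing DNS records to this node's tailnet IP (see Phase 3 swing step).
EOF
chmod +x /tmp/promote-cobblestone.sh
```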
---
## Open decisions
- **ECC**: cobblestone needs a board/chipset with verified ECC support (B650E for AM5). Cheap-out option = unvalidated ECC on a consumer board (ECC often still functions on AMD, just without vendor validation).
- **Distros**: stay on Debian 13 for both? OR put veilor-os on cobblestone once stable?
- **Recommendation**: cobblestone runs **Debian 13** (server OS, server kernel) with ZFS root. veilor-os is an end-user desktop spin — wrong fit for a production server role. Don't conflate.
- **Headscale move**: Phase 3, OR sooner if we don't trust nullstone? Defer until we can prove cobblestone uptime.
- **Minecraft on cobblestone instead?** No — moving MC = downtime + map mismatch risk. Stay pinned.
- **K3s vs Nomad**: revisit at Phase 4. Nomad simpler for small ops; K3s mainstream + ecosystem bigger.
---
## Related
- `runbooks/MIGRATION-nullstone-to-cobblestone.md` — original migration draft
- `runbooks/COBBLESTONE-INTAKE.md` — hardware spec template
- `STATE.md` — current node + service state
- Memory: `project_tailscale_mesh.md` — mesh shape today