infra/runbooks/HA-CLUSTER-distribute-and-sync.md

Runbook — distribute load + sync data across nodes

Goal: spread compute across 2+ nodes (nullstone + cobblestone + later) on Tailscale mesh, with stateful data replicated so any single-host loss = ≤15min RPO, ≤5min RTO.

Operator vision (2026-05-07): "down to distribute the load but sync the data". Not a real K8s cluster — a deliberate primary/secondary pair with per-service replication that grows into a 3-node scheduler later.

Status: PLANNING. Build cobblestone first, then phase migration.


North-star architecture

┌─────────────────────┐         ┌─────────────────────┐
│ nullstone           │ ◄─────► │ cobblestone         │
│ tailscale 100.64.0.2│   mesh  │ tailscale 100.64.0.5│
│ Debian 13           │         │ Debian 13           │
│ ZFS root            │         │ ZFS root            │
└─────────────────────┘         └─────────────────────┘
        ▲                                ▲
        │           ZFS send/recv every 15min
        │           Postgres streaming replication
        │           Forgejo runner per node
        │           Tailscale-mediated mTLS
        └────────────► both serve from synced state
                      (one primary, one warm)

Phase 4 (3+ nodes) → swap manual placement for K3s or Nomad. Until then, manual node selection via systemd unit on each host.
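
Minimal sketch of that manual placement, assuming each stack is wrapped in a systemd unit (the misskey.service unit name below is a placeholder): the same unit file can be deployed to both hosts, and a ConditionHost= drop-in decides which node actually starts it.

# Hypothetical drop-in: pin the misskey stack to nullstone (unit name is an assumption)
mkdir -p /etc/systemd/system/misskey.service.d
cat > /etc/systemd/system/misskey.service.d/pin-node.conf <<'EOF'
[Unit]
# On any other hostname the unit is skipped at start, not failed
ConditionHost=nullstone
EOF
systemctl daemon-reload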


Per-layer sync mechanism

| Layer | Tool | RPO | Notes |
|---|---|---|---|
| Network | Tailscale + Headscale (already up) | 0 | hs.s8n.ru on nullstone now; move to cobblestone in Phase 3 |
| Filesystem volumes | ZFS snapshots + zfs send/recv | 15min | zrepl daemon for the cron + retention policy |
| Postgres DBs (matrix, misskey, authentik, n8n) | Streaming replication | <1s lag | Hot standby per primary |
| Redis (misskey-redis, authentik-redis) | Redis Sentinel | <1s | Auto-failover; needs 3 sentinel processes |
| Static files (assets, wallpapers, configs) | Syncthing | seconds | Already a known pattern in this stack |
| Object/blob (uploads, attachments) | rclone bisync (nightly) + ZFS | 15min | For large media that doesn't fit Syncthing's index well |
| Secrets / .env | sops + age + git | per-commit | Encrypted at rest in Forgejo, decrypted on host via systemd-creds (sketch below) |
| DNS | Gandi LiveDNS, TTL 60s | n/a | Swing record in failover |
| Backups (offsite) | Restic to B2/Wasabi | nightly | Both nodes back up to the same bucket; dedupe |
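
Rough sketch of the sops + age flow from the secrets row (paths, file names, and the recipient-key variable are placeholders; the systemd-creds hand-off on the host is not shown):

# Encrypt an env file against a host's age public key before committing it to Forgejo
sops --encrypt --input-type dotenv --output-type dotenv \
  --age "$COBBLESTONE_AGE_PUBKEY" /opt/docker/n8n/.env > /opt/docker/n8n/.env.sops

# On the host: decrypt with the node's private age key before (re)starting the stack
SOPS_AGE_KEY_FILE=/etc/age/node.key sops --decrypt \
  --input-type dotenv --output-type dotenv /opt/docker/n8n/.env.sops > /opt/docker/n8n/.env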

Service placement plan

Pinned services run on one node only; secondary runs warm-stopped.

| Service | Primary | Secondary | Notes |
|---|---|---|---|
| Forgejo (git.s8n.ru) | cobblestone | nullstone (warm) | Move off nullstone so buildah on the runner doesn't impact MC |
| Forgejo runner #1 | nullstone | n/a | Runs alongside MC; cgroup CPU quota limit (sketch after this table) |
| Forgejo runner #2 | cobblestone | n/a | Bigger jobs land here |
| Tuwunel (matrix.veilor.uk) | nullstone | cobblestone (warm) | DB streamed; warm Tuwunel on cobblestone w/ tuwunel-state-pgsync.timer |
| Tuwunel-txt + Cinny (mx.s8n.ru) | cobblestone | nullstone (warm) | Symmetric; operator-redundant |
| Misskey (x.veilor) | nullstone | cobblestone (warm) | Heavy; move when cobblestone has >32GB |
| Authentik (auth.s8n.ru) | cobblestone | nullstone (warm) | SSO is critical; keeps both up |
| Headscale (hs.s8n.ru) | cobblestone (Phase 3) | nullstone | Control plane = single; use cobblestone (more reliable hw target) |
| Pi-hole DNS | nullstone | cobblestone | DNS = critical; both must answer; clients use Pi-hole pair via DHCP |
| Traefik | both, separate certs | — | Active/active; Cloudflare or DNS round-robin in front |
| Step-CA | nullstone | cold standby | Internal PKI; rare write; backup root key offline |
| Minecraft (mc.racked.ru) | nullstone | none | Pinned; world data ZFS-replicated for DR only |
| anythingllm + dl-veilor | cobblestone | nullstone (warm) | Light load; place where ZFS has more capacity |
| n8n | cobblestone | nullstone (warm) | Cron-driven; can run on either |
| Filebrowser-mc | nullstone | n/a | Pinned to MC for chunkfix volume access |
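
Sketch of the cgroup cap on runner #1 (forgejo-runner.service is an assumed unit name; the 200% quota is a guess to tune against MC tick timings):

# Cap the runner on nullstone so CI builds can't starve Minecraft (200% = two cores' worth)
systemctl set-property forgejo-runner.service CPUQuota=200% CPUWeight=50
# Confirm the limits landed on the unit's cgroup
systemctl show forgejo-runner.service -p CPUQuotaPerSecUSec -p CPUWeight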

Phase plan

Phase 1 — bring cobblestone alive (1–2 evenings)

  • Install Debian 13 on cobblestone with ZFS root + ECC RAM
  • LUKS2 + argon2id (matches our nullstone target post-migration)
  • Install: openssh, tailscale, docker, zfsutils-linux, postgresql-client (for replication setup)
  • Join Tailscale mesh under 100.64.0.5 via Headscale on nullstone
  • SSH in via tailnet; firewall = drop everything except inbound 22 from the tailnet (see sketch after this list)
  • Mirror nullstone's /opt/docker/ layout
  • Initial seed via zfs send tank/home/docker@seed | ssh cobblestone zfs recv tank/home/docker
  • Bring up read-only services (Forgejo runner #2, optional warm Misskey)
  • Verify each can resolve neighbour hostnames (use 100.64.0.x direct, not DNS, for replication)
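
Sketch of the join + firewall bullets above (the pre-auth key, and doing the firewall with nftables, are assumptions):

# On cobblestone: join the mesh against Headscale on nullstone
tailscale up --login-server https://hs.s8n.ru --authkey <headscale-preauth-key> --hostname cobblestone

# Default-drop inbound; allow SSH only via the tailnet interface
nft add table inet filter
nft add chain inet filter input '{ type filter hook input priority 0; policy drop; }'
nft add rule inet filter input ct state established,related accept
nft add rule inet filter input iif lo accept
nft add rule inet filter input iifname "tailscale0" tcp dport 22 accept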

Phase 2 — replicate stateful (1 weekend)

  • Set up Postgres primary on nullstone, replica on cobblestone, for each DB: matrix-postgres, misskey-postgres, authentik-postgres, n8n-postgres
  • Test replication: pg_stat_replication shows < 1s lag
  • Set up Redis Sentinel for misskey-redis and authentik-redis (3 sentinel processes minimum: nullstone, cobblestone, friend-PC laptop)
  • zrepl daemon installed both sides; replication policy 15min snapshot, hourly retention 24h, daily 30d, weekly 12w
  • Syncthing for static assets / configs
  • Test DR drill: stop nullstone postgres, verify the read-only replica on cobblestone can be promoted (see sketch below)
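
Verification sketch for the replication and drill bullets (assumes PostgreSQL 12+; roles and connection details are whatever each container uses):

# On nullstone (primary): every replica should show state=streaming and sub-second replay_lag
psql -U postgres -c "SELECT application_name, client_addr, state, replay_lag FROM pg_stat_replication;"

# On cobblestone (replica): should return t while it is a read-only standby
psql -U postgres -c "SELECT pg_is_in_recovery();"

# DR drill: promote the standby (one-way; it stops following the old primary)
psql -U postgres -c "SELECT pg_promote();"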

Phase 3 — selective primary moves

  • Move Forgejo to cobblestone (operator pain: builds no longer compete with MC)
  • Forgejo runner ON nullstone keeps running — registers as nullstone label
  • Forgejo runner ON cobblestone registers as cobblestone label
  • DNS swing: git.s8n.ru A record → cobblestone tailnet IP via Gandi LiveDNS API (see sketch after this list)
  • Move Headscale to cobblestone (control plane needs reliable host)
  • Move Authentik to cobblestone
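
DNS-swing sketch for the git.s8n.ru move (the token variable and record layout are assumptions; double-check the endpoint against current Gandi LiveDNS docs before scripting failover around it):

# Point git.s8n.ru at cobblestone's tailnet IP (TTL already 60s per the sync table)
curl -X PUT "https://api.gandi.net/v5/livedns/domains/s8n.ru/records/git/A" \
  -H "Authorization: Bearer $GANDI_PAT" \
  -H "Content-Type: application/json" \
  -d '{"rrset_ttl": 60, "rrset_values": ["100.64.0.5"]}'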

Phase 4 — when 3+ nodes (deferred)

  • Stand up K3s on the three nodes with embedded etcd
  • Longhorn for distributed PVs (or NFS with Heartbeat, if we want something simpler)
  • Migrate stateless services to Deployment+Service
  • Stateful services stay on labeled nodes (StatefulSet w/ nodeSelector; see sketch after this list)
  • Drop bespoke ZFS replication for services managed by Longhorn
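
Deferred with the rest of Phase 4, but the labeled-node idea in shorthand (label key and manifest fragment are placeholders):

# Label the node that owns the stateful data, then pin the workload to it
kubectl label node nullstone workload/minecraft=true
# StatefulSet fragment (placeholder):
#   spec:
#     template:
#       spec:
#         nodeSelector:
#           workload/minecraft: "true"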

Concrete first-step commands

When cobblestone is racked and SSH-able from nullstone:

# On nullstone — initial ZFS seed
zfs snapshot tank/home/docker@cobblestone-seed
zfs send -R tank/home/docker@cobblestone-seed | \
  ssh root@100.64.0.5 zfs recv -F tank/home/docker

# On both — install zrepl (Debian packages via zrepl's apt repo, or a release binary; see the zrepl docs)
apt install zrepl
# Configure /etc/zrepl/zrepl.yml — primary push job + secondary sink job

# Cron-equivalent: zrepl is its own daemon
systemctl enable --now zrepl
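
A minimal zrepl.yml sketch for that push + sink pair (job names, the TCP port, and the sink root_fs are assumptions; transport here is plain TCP over the tailnet rather than zrepl's TLS options):

# On nullstone: push job, 15min snapshots, grid pruning roughly matching the Phase 2 policy
cat > /etc/zrepl/zrepl.yml <<'EOF'
jobs:
  - name: docker-push
    type: push
    connect:
      type: tcp
      address: "100.64.0.5:8888"   # cobblestone over the tailnet
    filesystems:
      "tank/home/docker<": true
    snapshotting:
      type: periodic
      prefix: zrepl_
      interval: 15m
    pruning:
      keep_sender:
        - type: not_replicated
        - type: grid
          grid: 24x1h | 30x1d | 12x7d
          regex: "^zrepl_"
      keep_receiver:
        - type: grid
          grid: 24x1h | 30x1d | 12x7d
          regex: "^zrepl_"
EOF

# On cobblestone: sink job; received datasets land under root_fs/<client identity>
cat > /etc/zrepl/zrepl.yml <<'EOF'
jobs:
  - name: docker-sink
    type: sink
    serve:
      type: tcp
      listen: ":8888"
      clients:
        "100.64.0.2": "nullstone"
    root_fs: "tank/sink"
EOF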

Postgres replication setup is per-DB; document in runbooks/POSTGRES-streaming-replication.md when Phase 2 lands.
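
Until that runbook exists, a bare-metal-flavoured outline of one standby bootstrap (role name, password handling, and the Debian 17/main paths are assumptions; the dockerized DBs need the same steps against their own volumes; -R needs PostgreSQL 12+):

# On nullstone (primary): replication role + pg_hba entry for cobblestone
sudo -u postgres psql -c "CREATE ROLE replicator WITH REPLICATION LOGIN PASSWORD 'change-me';"
echo "host replication replicator 100.64.0.5/32 scram-sha-256" >> /etc/postgresql/17/main/pg_hba.conf
systemctl reload postgresql

# On cobblestone: clone the primary as a streaming standby (-R writes standby.signal + primary_conninfo)
systemctl stop postgresql
rm -rf /var/lib/postgresql/17/main
sudo -u postgres pg_basebackup -h 100.64.0.2 -U replicator -D /var/lib/postgresql/17/main -R -X stream -P
systemctl start postgresql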


Failure scenarios + RTO

| Scenario | RTO | RPO | Action |
|---|---|---|---|
| nullstone reboot (planned) | 5min | 0 | DNS-swing services to cobblestone before reboot, swing back after |
| nullstone hardware fail | ~30min (manual) | 15min | Promote postgres replica, swing all DNS, restart warm services on cobblestone |
| cobblestone hardware fail | ~5min (auto) | 15min | nullstone still running primaries; just lose runner #2 + Authentik until rebuild |
| Both nodes fail | hours | hours | Restic restore from B2/Wasabi to a temp box |
| Network partition | seconds | n/a | Tailscale heals; postgres replication catches up; brief degraded UX |

Open decisions

  • ECC: cobblestone needs Pro chipset (B650E for AM5) for verified ECC. Cheap-out option = ECC unverified on consumer board (still works on AMD).
  • Distros: stay on Debian 13 for both? OR put veilor-os on cobblestone once stable?
    • Recommendation: cobblestone runs Debian 13 (server OS, server kernel) with ZFS root. veilor-os is an end-user desktop spin — wrong fit for a production server role. Don't conflate.
  • Headscale move: Phase 3, OR sooner if we don't trust nullstone? Defer until we can prove cobblestone uptime.
  • Minecraft on cobblestone instead? No — moving MC = downtime + map mismatch risk. Stay pinned.
  • K3s vs Nomad: revisit at Phase 4. Nomad simpler for small ops; K3s mainstream + ecosystem bigger.

Related docs

  • runbooks/MIGRATION-nullstone-to-cobblestone.md — original migration draft
  • runbooks/COBBLESTONE-INTAKE.md — hardware spec template
  • STATE.md — current node + service state
  • Memory: project_tailscale_mesh.md — mesh shape today