# Runbook — distribute load + sync data across nodes
> **Goal:** spread compute across 2+ nodes (nullstone + cobblestone +
> later) on Tailscale mesh, with stateful data replicated so any
> single-host loss = ≤15min RPO, ≤5min RTO.
>
> **Operator vision (2026-05-07):** "down to distribute the load but
> sync the data". Not a real K8s cluster — a deliberate
> primary/secondary pair with per-service replication that grows
> into a 3-node scheduler later.
>
> **Status:** PLANNING. Build cobblestone first, then run the phased migration.
---
## North-star architecture
```
┌───────────────────────┐                          ┌───────────────────────┐
│ nullstone             │ ◄──────────────────────► │ cobblestone           │
│ tailscale 100.64.0.2  │           mesh           │ tailscale 100.64.0.5  │
│ Debian 13             │                          │ Debian 13             │
│ ZFS root              │                          │ ZFS root              │
└───────────────────────┘                          └───────────────────────┘
            ▲                                                  ▲
            │ ZFS send/recv every 15min                        │
            │ Postgres streaming replication                   │
            │ Forgejo runner per node                          │
            │ Tailscale-mediated mTLS                          │
            └─────────► both serve from synced state ◄─────────┘
                          (one primary, one warm)
```
Phase 4 (3+ nodes) → swap manual placement for K3s or Nomad.
Until then, manual node selection via systemd unit on each host.
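A sketch of what that per-host pinning could look like, assuming each stack lives under `/opt/docker/<service>` and is wrapped in a oneshot compose unit (unit name and paths are illustrative, not existing units):
```bash
# Hypothetical wrapper unit: placement is just "enabled on the primary,
# disabled (warm) on the secondary". Unit name and paths are assumptions.
cat > /etc/systemd/system/forgejo-stack.service <<'EOF'
[Unit]
Description=Forgejo compose stack
After=docker.service network-online.target
Requires=docker.service

[Service]
Type=oneshot
RemainAfterExit=yes
WorkingDirectory=/opt/docker/forgejo
ExecStart=/usr/bin/docker compose up -d
ExecStop=/usr/bin/docker compose down

[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload
systemctl enable --now forgejo-stack.service      # on the primary
# systemctl disable --now forgejo-stack.service   # on the warm secondary
```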
---
## Per-layer sync mechanism
| Layer | Tool | RPO | Notes |
|---|---|---|---|
| Network | Tailscale + Headscale (already up) | 0 | hs.s8n.ru on nullstone now; moves to cobblestone in Phase 3 |
| Filesystem volumes | **ZFS snapshots + zfs send/recv** | 15min | `zrepl` daemon for the cron + retention policy |
| Postgres DBs (matrix, misskey, authentik, n8n) | streaming replication | <1s lag | hot standby per primary |
| Redis (misskey-redis, authentik-redis) | Redis Sentinel | <1s | auto-failover; needs 3 sentinel processes |
| Static files (assets, wallpapers, configs) | Syncthing | seconds | already a known pattern in this stack |
| Object/blob (uploads, attachments) | rclone bisync nightly + ZFS | 15min | for large media that doesn't fit Syncthing's index well |
| Secrets / .env | sops + age + git | per-commit | encrypted at rest in Forgejo, decrypted on host via systemd-creds |
| DNS | Gandi LiveDNS, TTL 60s | n/a | swing the record during failover |
| Backups (offsite) | Restic to B2/Wasabi | nightly | both nodes back up to same bucket; dedupe |
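For the offsite-backup row, a minimal sketch assuming a B2 bucket called `veilor-backups` and a local password file (both are placeholders, not existing resources):
```bash
# Hypothetical nightly Restic job; both nodes point at the same repository so
# chunks dedupe across them. B2_ACCOUNT_ID / B2_ACCOUNT_KEY come from the
# sops-managed env.
export RESTIC_REPOSITORY="b2:veilor-backups:/"
export RESTIC_PASSWORD_FILE=/root/.restic-pass
restic init                                   # once per repository
restic backup /opt/docker /etc                # run nightly from a systemd timer
restic forget --keep-daily 30 --keep-weekly 12 --prune
```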
---
## Service placement plan
Pinned services run on one node only; the **Secondary** column is where a warm (deployed but stopped) copy lives.
| Service | Primary | Secondary | Notes |
|---|---|---|---|
| **Forgejo (git.s8n.ru)** | cobblestone | nullstone (warm) | Move off nullstone so buildah on the runner doesn't impact MC |
| **Forgejo runner #1** | nullstone | n/a | Runs alongside MC; cgroup CPU quota limit (sketch below the table) |
| **Forgejo runner #2** | cobblestone | n/a | Bigger jobs land here |
| **Tuwunel (matrix.veilor.uk)** | nullstone | cobblestone (warm) | DB streamed; warm Tuwunel on cobblestone w/ tuwunel-state-pgsync.timer |
| **Tuwunel-txt + Cinny (mx.s8n.ru)** | cobblestone | nullstone (warm) | Symmetric operator-redundant |
| **Misskey (x.veilor)** | nullstone | cobblestone (warm) | Heavy. Move when cobblestone has >32GB |
| **Authentik (auth.s8n.ru)** | cobblestone | nullstone (warm) | SSO is critical — keeps both up |
| **Headscale (hs.s8n.ru)** | cobblestone (Phase 3) | nullstone | Control plane = single. Use cobblestone (more reliable hw target) |
| **Pi-hole DNS** | nullstone | cobblestone | DNS = critical — both must answer; clients get the Pi-hole pair via DHCP |
| **Traefik** | both, separate certs | — | active/active; Cloudflare or DNS round-robin in front |
| **Step-CA** | nullstone | cold standby | Internal PKI; rare write; backup root key offline |
| **Minecraft (mc.racked.ru)** | nullstone | none | Pinned. World data ZFS-replicated for DR only |
| **anythingllm + dl-veilor** | cobblestone | nullstone (warm) | Light load; place where ZFS has more capacity |
| **n8n** | cobblestone | nullstone (warm) | Cron-driven; can run on either |
| **Filebrowser-mc** | nullstone | n/a | Pinned to MC for chunkfix volume access |
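The cgroup CPU quota called out for runner #1 can be a plain systemd property; a sketch assuming the runner runs as a service named `forgejo-runner.service` (name and limits are guesses):
```bash
# Hypothetical resource cap so CI never starves the MC server on nullstone.
systemctl set-property forgejo-runner.service CPUQuota=200% MemoryMax=8G
systemctl show -p CPUQuotaPerSecUSec -p MemoryMax forgejo-runner.service   # verify
```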
---
## Phase plan
### Phase 1 — bring cobblestone alive (12 evenings)
- [ ] Install Debian 13 on cobblestone with **ZFS root + ECC RAM**
- [ ] LUKS2 + argon2id (matches our nullstone target post-migration)
- [ ] Install: openssh, tailscale, docker, zfsutils-linux, postgresql-client (for replication setup)
- [ ] Join Tailscale mesh under `100.64.0.5` via Headscale on nullstone
- [ ] SSH in via tailnet; firewall = drop everything except inbound 22 from the tailnet (sketch after this list)
- [ ] Mirror nullstone's `/opt/docker/` layout
- [ ] Initial seed via `zfs send tank/home/docker@seed | ssh cobblestone zfs recv tank/home/docker`
- [ ] Bring up read-only services (Forgejo runner #2, optional warm Misskey)
- [ ] Verify each can resolve neighbour hostnames (use `100.64.0.x` direct, not DNS, for replication)
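A sketch of the tailnet-only firewall item above, assuming ufw and the default `tailscale0` interface name (plain nftables works just as well):
```bash
# Hypothetical default-drop policy: only SSH, and only over the tailnet.
ufw default deny incoming
ufw default allow outgoing
ufw allow in on tailscale0 to any port 22 proto tcp
ufw enable
```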
### Phase 2 — replicate stateful (1 weekend)
- [ ] Set up Postgres primary on nullstone, replica on cobblestone, for each DB:
`matrix-postgres`, `misskey-postgres`, `authentik-postgres`, `n8n-postgres`
- [ ] Test replication: `pg_stat_replication` shows < 1s lag (check sketch after this list)
- [ ] Set up Redis Sentinel for `misskey-redis` and `authentik-redis`
(3 sentinel processes minimum: nullstone, cobblestone, friend-PC laptop)
- [ ] `zrepl` daemon installed both sides; replication policy 15min snapshot, hourly retention 24h, daily 30d, weekly 12w
- [ ] Syncthing for static assets / configs
- [ ] Test DR drill: stop nullstone postgres, verify cobblestone replica is read-only-promotable
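A sketch of the replication check and DR drill above, assuming the DBs keep their compose container names and the default `postgres` superuser (both assumptions):
```bash
# On nullstone (primary): replica attached, lag well under 1s.
docker exec misskey-postgres psql -U postgres -c \
  "SELECT client_addr, state, replay_lag FROM pg_stat_replication;"

# On cobblestone (replica): confirm it is a read-only standby.
docker exec misskey-postgres psql -U postgres -c "SELECT pg_is_in_recovery();"

# DR drill: promote the standby to read-write (PostgreSQL 12+).
docker exec misskey-postgres psql -U postgres -c "SELECT pg_promote();"
```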
### Phase 3 — selective primary moves
- [ ] Move Forgejo to cobblestone (fixes an operator pain point: builds no longer compete with MC)
- [ ] Forgejo runner ON nullstone keeps running and registers with the `nullstone` label
- [ ] Forgejo runner ON cobblestone registers with the `cobblestone` label
- [ ] DNS swing: point the `git.s8n.ru` A record at cobblestone's tailnet IP via the Gandi LiveDNS API (sketch after this list)
- [ ] Move Headscale to cobblestone (control plane needs reliable host)
- [ ] Move Authentik to cobblestone
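The DNS swing is a single API call; a sketch assuming Gandi's LiveDNS v5 REST endpoint and a personal access token in `GANDI_TOKEN` (verify endpoint and auth header against current Gandi docs before scripting this):
```bash
# Hypothetical swing of git.s8n.ru to cobblestone's tailnet IP.
curl -X PUT "https://api.gandi.net/v5/livedns/domains/s8n.ru/records/git/A" \
  -H "Authorization: Bearer ${GANDI_TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{"rrset_ttl": 60, "rrset_values": ["100.64.0.5"]}'
```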
### Phase 4 — when 3+ nodes (deferred)
- [ ] Stand up K3s on the three nodes with embedded etcd
- [ ] Longhorn for distributed PV (or NFS w/ Heartbeat for a simpler setup)
- [ ] Migrate stateless services to Deployment+Service
- [ ] Stateful services stay on labeled nodes (StatefulSet w/ nodeSelector)
- [ ] Drop bespoke ZFS replication for services managed by Longhorn
---
## Concrete first-step commands
When cobblestone is racked and SSH-able from nullstone:
```bash
# On nullstone — initial ZFS seed
zfs snapshot tank/home/docker@cobblestone-seed
zfs send -R tank/home/docker@cobblestone-seed | \
  ssh root@100.64.0.5 zfs recv -F tank/home/docker

# On both — install zrepl
curl -L https://zrepl.github.io/install.sh | bash
# Configure /etc/zrepl/zrepl.yml — primary push job + secondary sink job
# Cron-equivalent: zrepl is its own daemon
systemctl enable --now zrepl
```
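What the nullstone push side of that zrepl config might look like, mapped to the Phase 2 policy (15min snapshots; keep hourly 24h, daily 30d, weekly 12w). Key names are from memory and should be checked against the installed zrepl version's docs; this is a sketch, not a tested config:
```bash
# Hypothetical /etc/zrepl/zrepl.yml push job on nullstone. The cobblestone
# side is a matching sink job (serve + root_fs) instead of push. Unverified.
cat > /etc/zrepl/zrepl.yml <<'EOF'
jobs:
  - name: push_docker
    type: push
    connect:
      type: tcp
      address: "100.64.0.5:8888"
    filesystems:
      "tank/home/docker<": true
    snapshotting:
      type: periodic
      interval: 15m
      prefix: zrepl_
    pruning:
      keep_sender:
        - type: last_n
          count: 96            # ~24h of 15-min snapshots
      keep_receiver:
        - type: grid
          grid: 24x1h | 30x1d | 12x7d
          regex: "^zrepl_"
EOF
```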
Postgres replication setup is per-DB; document in
`runbooks/POSTGRES-streaming-replication.md` when Phase 2 lands.
---
## Failure scenarios + RTO
| Scenario | RTO | RPO | Action |
|---|---|---|---|
| nullstone reboot (planned) | 5min | 0 | DNS swing services to cobblestone before reboot, swing back after |
| nullstone hardware fail | ~30min manual | 15min | Promote postgres replica, swing all DNS, restart warm services on cobblestone |
| cobblestone hardware fail | ~5min auto | 15min | nullstone still running primaries; just lose runner #2 + Authentik until rebuild |
| Both nodes fail | hours | hours | Restic restore from B2/Wasabi to a temp box |
| Network partition | seconds | n/a | Tailscale heals; postgres replication catches up; brief degraded UX |
---
## Open decisions
- **ECC**: cobblestone needs Pro chipset (B650E for AM5) for verified ECC. Cheap-out option = ECC unverified on consumer board (still works on AMD).
- **Distros**: stay on Debian 13 for both? OR put veilor-os on cobblestone once stable?
- **Recommendation**: cobblestone runs **Debian 13** (server OS, server kernel) with ZFS root. veilor-os is an end-user desktop spin and the wrong fit for a production server role. Don't conflate the two.
- **Headscale move**: Phase 3, OR sooner if we don't trust nullstone? Defer until we can prove cobblestone uptime.
- **Minecraft on cobblestone instead?** No: moving MC = downtime + map mismatch risk. Stay pinned.
- **K3s vs Nomad**: revisit at Phase 4. Nomad simpler for small ops; K3s mainstream + ecosystem bigger.
---
## Related
- `runbooks/MIGRATION-nullstone-to-cobblestone.md` original migration draft
- `runbooks/COBBLESTONE-INTAKE.md` hardware spec template
- `STATE.md` current node + service state
- Memory: `project_tailscale_mesh.md` mesh shape today