
# Minecraft Backup Strategy — racked.ru on nullstone
**Status:** PROPOSAL (2026-05-07) — not yet implemented.
**Author trigger:** Player lost full inventory to void death today; rollback impossible because the existing 02:00 daily backup had **silently failed for 5 of the last 7 days** and there is **zero off-host copy**.
**Owner:** `s8n` (operator).
**Target host:** `nullstone` (192.168.0.100, Debian 13 trixie).
---
## 0. Current state (audited 2026-05-07)
Existing system in `/opt/docker/backup.sh` + `cron.d/docker-backup` (02:00 daily, 7-day retention in `/opt/backups/`).
Findings from `/opt/backups/backup.log`:
| Date | MC world result | Backup dir total |
|------|-----------------|------------------|
| 2026-04-26 | FAILED | — |
| 2026-04-27 | FAILED | — |
| 2026-04-28 | FAILED | — |
| 2026-04-29 | OK (3.6 G) | — |
| 2026-04-30 | FAILED | — |
| 2026-05-01 | FAILED | — |
| 2026-05-02 | OK (3.6 G) | — |
| 2026-05-03 | (no MC log line) | 8 K |
| 2026-05-04 | (no MC log line) | 8 K |
| 2026-05-05 | (no MC log line) | 8 K |
| 2026-05-06 | (no MC log line) | 12 K |
| 2026-05-07 | (no MC log line) | 12 K |
After 2026-05-02 the entire MC block stopped emitting log lines. The script appears to be exiting before reaching it (the duplicated stray `chmod 600 ... synapse-signing-key` lines at L119-122 are orphaned from a botched edit and may now abort the script under `set -e`). Effective state: **two MC backups in the last 12 days**, both already pruned by 7-day retention. **No usable backup exists right now.**
Cross-references:
- `_github/infra/STATE.md` Top-5 weakness #2 ("backup.sh broken silently") and #5 ("No off-host backup").
- `_github/infra/runbooks/MIGRATION-nullstone-to-cobblestone.md` §5 already names this `F-backup-1` and proposes "Restic + autorestic to B2/Wasabi or to nullstone-as-spare". This strategy refines that to use on-hand resources rather than paid storage.
### Available resources (no purchasing required)
| Asset | Location | Free | Reachability | Role |
|---|---|---|---|---|
| nullstone `/home` | local NVMe (ext4 LVM) | 142 G of 399 G | local | Primary repo + restic cache |
| onyx `/home` | LUKS NVMe | 1.6 T of 1.9 T | Tailscale 100.64.0.1 (LAN ~5 ms) | **Off-host primary** |
| friend RTX 4080 PC | DESKTOP-LR0RILA | unknown (Windows, large) | Tailscale 100.64.0.3 (WAN, IP-stable via tailnet) | **Off-host secondary** (defer) |
| nullstone `/opt/backups` | same disk as `/opt/docker` | 142 G | local | *Not* a real backup target — same-disk SPOF |
**No purchased B2 / Wasabi / S3 in this proposal.** Tailscale + onyx covers off-host today. B2 stays in the future-options annex.
---
## 1. Threat model
| # | Threat | Concrete example | Frequency | Mitigation in this plan |
|---|---|---|---|---|
| T1 | Player accidental loss (void death, lava, fall) | YOU500, 2026-05-07 | weekly | 5-min playerdata snapshots (RPO ≤ 5 min) |
| T2 | Griefing / theft / chest emptied by ban-evader | possible | monthly | 5-min playerdata + 1-h world snapshots |
| T3 | World corruption (chunk error, region-file truncate) | rare | — | 6-h pre-flight validated full world snapshot |
| T4 | Plugin / config bad change (LuckPerms wipe, server.properties) | edits during ops | weekly | daily configs + DB dump + git history (`live-server/` repo) |
| T5 | Host disk failure (single NVMe) | low/year | — | nightly off-host copy to onyx (Tailscale) |
| T6 | Ransomware / host compromise | low | — | append-only Restic repo on onyx; nullstone holds **no** delete key |
| T7 | Operator `rm -rf` or wrong `docker compose down -v` | low | — | retention floor (4 weekly + 12 monthly) survives a recent rm |
| T8 | Backup script silently failing (current state) | OBSERVED | — | heartbeat alert + monthly restore drill (§7) |
T8 is the one that just bit us. The single most important addition is **alerting on missed runs**, not the storage tech.
---
## 2. RPO / RTO
| Class | Data | RPO | RTO | Backup mechanism |
|---|---|---|---|---|
| A | playerdata (`world/playerdata/*.dat`, `stats/`, `advancements/`) | **5 min** | < 2 min per player | rcon `save-all flush`, rsync to local snapshot, then restic-add |
| B | full world (region files, end + nether) | **1 h** during play, **6 h** otherwise | 15 min | restic of `world*/` |
| C | plugin configs + LuckPerms YAML | 24 h | 30 min | tar of `plugins/*/config*.yml` + LP file dump |
| D | LuckPerms / Homestead SQLite DBs (`*.db`, `homestead_data.db`) | 1 h | 5 min | sqlite `.backup` then restic-add |
| E | host-level configs (`docker-compose.yml`, `server.properties`, `purpur.yml`, `bukkit.yml`, `paper-*.yml`, `whitelist.json`, `ops.json`, `banned-*.json`, `config/`) | 24 h | 5 min | already in git repo `_github/minecraft-server/`; backup just covers drift |
**Justification for RPO=5 min on Class A:** the void-death case is rebuilt in seconds; recovering one `<uuid>.dat` is a ~30 s operation if a 5-min-old snapshot exists. Snapshotting just the 1.3 MB `playerdata/` dir is cheap (single-digit MB/day after dedup).
---
## 3. Tool choice — Restic
Compared:
| Tool | Dedup | Encryption | Snapshots | Network destinations | Verdict |
|---|---|---|---|---|---|
| **restic** | content-addressed, very effective on MC region files | AES-256, repo-key | yes | sftp (Tailscale), local, B2, S3, Azure, rclone | **WINNER** |
| borgbackup | similar | yes | yes | ssh only, lock-on-write | Equally good; restic chosen because the operator already plans `restic + autorestic` per `infra/STATE.md` line 112, and an sftp dest is simpler than borg's required server-side binary |
| rsnapshot | hardlinks, no dedup | none | rotated dirs | local + rsync | No encryption; an off-host copy over Tailscale (already encrypted in transit) is fine, but no dedup means 18 G × N snapshots is painful. Reject. |
| zfs send | block-level | (zfs native) | snapshots | yes | nullstone is **ext4/LVM**, no ZFS, no btrfs. Reject. |
| LVM snapshot | COW | none | yes | local only | Same-disk only, doesn't survive disk failure. Useful as a *staging* primitive only. |
| custom rsync + cp -al | hardlinks | none | yes | yes | Reinventing rsnapshot. Reject. |
| itzg `BACKUP_*` env | tar to volume | none | rotation | local | Already tried in spirit by current `backup.sh`; same-disk; not granular. Reject as primary. |
**Decision:** `restic` for Classes A, B, C, D. Continue using a thin tar wrapper for Class E (configs are already in the git repo, this is just safety).
Restic strengths for our case:
- Region files dedup *very* well (chunks unchanged across snapshots).
- A 5-min Class-A snapshot adds on the order of a megabyte to the repo, not the full 1.3 MB × N.
- One repo on local disk plus one mirror to onyx via `rclone serve restic` or direct `sftp:`; no agent needed on onyx beyond ssh.
- `restic check --read-data-subset=5%` is the canonical scrub.
Apt: `apt install restic` on trixie ships 0.16.x, which is sufficient.
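A minimal one-time setup sketch on nullstone, assuming the repo paths from §4 and the password file proposed in §8.2 (both still proposals):
```bash
# One-time setup on nullstone (sketch; repo paths per §4, password file per §8.2).
sudo apt install restic
restic version                                                       # expect 0.16.x on trixie
restic init --repo /home/user/restic/mc-frequent --password-file /etc/mc-backup.pw
restic init --repo /home/user/restic/mc-world    --password-file /etc/mc-backup.pw
```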
---
## 4. Schedule
All times Europe/London (matches `TZ` in compose file).
| Job | Cadence | Source | Destination | Mechanism |
|---|---|---|---|---|
| **A — playerdata** | every **5 min** | `world/playerdata/`, `world/stats/`, `world/advancements/`, `world*/level.dat`, `*.db` (LP+homestead) | restic repo `/home/user/restic/mc-frequent/` | systemd timer `mc-backup-frequent.timer` |
| **B — full world** | every **1 h** during play (07:00-01:00), **6 h** otherwise | `world/`, `world_nether/`, `world_the_end/` | restic repo `/home/user/restic/mc-world/` | systemd timer `mc-backup-world.timer` |
| **C — configs + plugins** | **daily 02:00** | `/opt/docker/minecraft/*.yml`, `*.json`, `plugins/*/config*.yml`, `plugins/LuckPerms/`, `docker-compose.yml` | restic repo `mc-world` (path-tagged) | reuse same timer with second backup target |
| **D — DB dumps** | every **1 h** | `homestead_data.db`, `plugins/CoreProtect/database.db`, `plugins/LuckPerms/luckperms-h2-*` | restic repo `mc-world` | timer hooks `sqlite3 .backup` first |
| **E — off-host mirror** | **nightly 03:30** | nullstone `/home/user/restic/` | onyx `100.64.0.1:/home/admin/backups/nullstone-mc-restic/` | `restic copy` over sftp (Tailscale); append-only key on onyx side |
| **F — verify** | **weekly Sun 04:00** | both repos | — | `restic check --read-data-subset=5%`, then alert on non-zero rc |
| **G — drill** | **monthly 1st Sat 11:00** | random snapshot | scratch dir | §7 procedure |
### Why this works for the void-death case
T1 hits at 18:42. By 18:45 a Class-A snapshot exists containing the player's `<uuid>.dat` from 18:40. Restore: `restic -r ... restore --target /tmp/r --include 'world/playerdata/<uuid>.dat' latest`, stop the server (or `/save-off` plus minimal manual file manipulation), copy the file into place, `/save-on`. Total RTO < 2 min.
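A slightly fuller sketch of that restore, assuming the §8.2 paths and that `RCON_PASS` is exported from `/etc/mc-backup.env`; the UUID is a placeholder and the authoritative procedure lives in the runbook:
```bash
#!/usr/bin/env bash
# Illustrative T1 restore (sketch); the real procedure is docs/RUNBOOK-BACKUP-RESTORE.md §3.
UUID="$1"                                   # affected player's UUID; player must be offline
REPO=/home/user/restic/mc-frequent
restic -r "$REPO" restore latest --target /tmp/restore \
  --include "world/playerdata/$UUID.dat"
mcrcon -H 127.0.0.1 -P 25575 -p "$RCON_PASS" save-off    # pause autosave while swapping the file
find /tmp/restore -name "$UUID.dat" -exec \
  cp -v {} /opt/docker/minecraft/world/playerdata/ \;
mcrcon -H 127.0.0.1 -P 25575 -p "$RCON_PASS" save-on
```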
---
## 5. Retention
Restic policy (passed to `restic forget --keep-*`):
```
--keep-last 24 # 24 most recent (covers 2h of 5-min snapshots)
--keep-hourly 24 # 24h of hourly
--keep-daily 7 # 7 days
--keep-weekly 4 # 4 weeks
--keep-monthly 12 # 12 months
```
Applied per tag: Class A snapshots are tagged `playerdata`, B/C/D are tagged `world`. Forget is run **only on the local repo**; the onyx mirror inherits the same policy via `restic copy`, which runs after the local forget+prune.
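A sketch of the per-tag invocations implementing this policy (repo paths from §4; the §8.2 frequent script already runs a reduced variant of the first):
```bash
# §5 policy applied per tag on the local repos (sketch; run from the world timer or a daily job).
restic -r /home/user/restic/mc-frequent forget --tag playerdata \
  --keep-last 24 --keep-hourly 24 --keep-daily 7 --keep-weekly 4 --keep-monthly 12 --prune
restic -r /home/user/restic/mc-world forget --tag world \
  --keep-last 24 --keep-hourly 24 --keep-daily 7 --keep-weekly 4 --keep-monthly 12 --prune
```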
### Storage budget
- Class A: 1.3 MB raw; dedup (~20× on `.dat`, mostly empty NBT slots) nets ~70 KB per snapshot.
- 12 snapshots/h × 24 h × 7 d ≈ 2 000 snapshots/week, i.e. well under 300 MB/week at ~70 KB each.
- Class B/C/D: 18 G raw, ~6.5 G compressed (scaling the current 3.6 G figure for nether/end now being active). Restic dedup on hourly snapshots: ~50-200 MB delta per snapshot during active play.
- 24 hourly + 7 daily + 4 weekly + 12 monthly ≈ 47 retained snapshots; estimate **15-25 GB total** at steady state.
- E (off-host): same as above on onyx (1.6 TB free, > 30× headroom).
**Conclusion:** comfortably fits in nullstone's 142 G free. Onyx is essentially unconstrained.
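Once the repos exist, these estimates can be checked directly instead of argued from dedup ratios; a quick sketch:
```bash
# Sanity-check the storage budget after a week of snapshots (sketch).
restic -r /home/user/restic/mc-frequent stats --mode raw-data   # dedup'd size across all snapshots
restic -r /home/user/restic/mc-world    stats --mode raw-data
du -sh /home/user/restic/mc-*                                   # actual bytes on the NVMe
df -h /home                                                     # remaining headroom
```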
---
## 6. Off-host destination — onyx via Tailscale
**Choice:** `onyx` (100.64.0.1, 1.6 TB free on `/home`). Reasons:
- Already in the tailnet (`tag:admin`), already trusted, already SSH-reachable.
- 1.6 TB is 100× the dataset.
- Operator's daily-driver: a missed-backup alert on onyx is *seen*.
- Deferred (phase 2): replicate to friend's RTX 4080 PC (100.64.0.3) for true geographic separation. Tailnet IP is stable across the friend's ISP IP changes per memory `project_friend_gpu`.
**Mechanics:**
1. On onyx: create restricted user `mc-backup` with `~/backups/nullstone-mc-restic/` and a `~/.ssh/authorized_keys` entry that **only allows `internal-sftp` chrooted to that dir**, no shell, no port-forward (sshd_config: `Match User mc-backup` + `ChrootDirectory %h` + `ForceCommand internal-sftp -d /backups/nullstone-mc-restic`; a provisioning sketch follows this list).
2. On nullstone: install nullstone's ssh public key on onyx for that user. Ideally the onyx repo would be **append-only** so a compromised nullstone cannot run `forget`/`prune` against it; a second restic key (separate password) alone does not achieve that, since every valid repo key can prune. The harder lock comes from the sftp chroot perms (parent dir owned by root, `mc-backup` writable inside, but restic still needs to create and remove its own lock files there). Practical compromise: `restic copy` only ever adds data in normal operation, and any `forget` run against the mirror is done manually from onyx and audited.
3. Nightly job on nullstone: `restic -r sftp:mc-backup@100.64.0.1:/backups/nullstone-mc-restic copy --from-repo /home/user/restic/mc-world latest && ... mc-frequent ...`.
4. Onyx-side cron weekly: `restic check` on the mirror (independent verification).
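A provisioning sketch for step 1 on the onyx side (assumes stock Debian sshd with the `sshd_config.d` drop-in directory; review before running):
```bash
# Onyx side: sftp-only user, chrooted to its root-owned home dir (sketch only).
sudo useradd --create-home --home-dir /home/mc-backup --shell /usr/sbin/nologin mc-backup
sudo chown root:root /home/mc-backup && sudo chmod 755 /home/mc-backup  # ChrootDirectory must be root-owned
sudo install -d -o mc-backup -g mc-backup -m 700 /home/mc-backup/.ssh   # step 2 drops nullstone's pubkey here
sudo install -d -o mc-backup -g mc-backup -m 750 /home/mc-backup/backups/nullstone-mc-restic
sudo tee /etc/ssh/sshd_config.d/mc-backup.conf >/dev/null <<'EOF'
Match User mc-backup
    ChrootDirectory %h
    ForceCommand internal-sftp -d /backups/nullstone-mc-restic
    AllowTcpForwarding no
    X11Forwarding no
EOF
sudo systemctl reload ssh
```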
**Why not friend's GPU PC?** Windows host, no built-in SSH, asymmetric trust. Defer to phase 2 once an SMB or `rclone serve` target is set up there.
---
## 7. Restore drill (monthly, 1st Saturday 11:00)
Runbook: `docs/RUNBOOK-BACKUP-RESTORE.md` (created alongside this proposal).
Drill scenario: "YOU500 lost his inventory to a void death 6 minutes ago." Steps:
1. Pick a known UUID from `world/playerdata/` (operator's own UUID).
2. `restic -r /home/user/restic/mc-frequent snapshots --tag playerdata | tail -5`; confirm the freshest snapshot is ≤ 6 min old.
3. `restic -r ... restore latest --target /tmp/drill-$(date +%s) --include 'world/playerdata/<uuid>.dat'`.
4. Parse the `.dat` with `nbted` or `python -m nbtlib`; confirm it is a valid gzipped NBT structure (not zero bytes, not partial). A quick shell pre-check is sketched at the end of this section.
5. `diff` against the live `.dat` and log the differences (expected: at least the inventory NBT path differs, because the player kept playing).
6. Repeat from the **onyx mirror** repo to prove off-host works end-to-end.
7. Log result to `docs/RUNBOOK-BACKUP-RESTORE.md` § Drill log.
Drill is **non-destructive**: never overwrite the live `.dat` during a drill. Real restores follow §3 of the runbook.
Pass criteria: both restores complete in < 2 min wall-clock and the parsed NBT root tag is well-formed.
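The quick pre-check referenced in step 4, assuming the drill restore landed under `/tmp/drill-*`; it only proves the gzip layer, the NBT parse is still the real test:
```bash
# Drill step 4 pre-check (sketch): the restored .dat must exist, be non-empty and gzip-valid.
dat="$(find /tmp/drill-* -name '*.dat' -print -quit)"
if [ -s "$dat" ] && gzip -t "$dat"; then
  echo "PASS (gzip layer): $dat"
else
  echo "FAIL: $dat" >&2
fi
```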
---
## 8. Implementation — concrete drafts
Two layers: a **fix** to the existing daily script (Class C/E) and a **new sidecar timer** for Classes A/B/D.
### 8.1 Fix `/opt/docker/backup.sh` (F-backup-1)
Already documented in `infra/runbooks/MIGRATION-nullstone-to-cobblestone.md` §5. Minimum work:
- Drop dead `matrix-postgres` block (Synapse retired).
- Drop / fix `mongodb` block (RC stopped 2026-05-06).
- Remove the orphaned `chmod 600 ...synapse-signing-key...` block at L119-122 (causing a `set -e` exit before the MC block on most days).
- Wrap each module in `( ... ) || log "module FAILED"` so one module's failure doesn't skip the rest (pattern sketched below).
Out of scope for this strategy doc; tracked in the infra audit.
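The wrapping pattern from the last bullet, sketched; `log` is the script's existing helper, and the module function name is a placeholder:
```bash
# Sketch: run each backup module in a subshell so one failure can't abort the rest under set -e.
run_module() {
  local name="$1"; shift
  if ( "$@" ); then
    log "$name OK"
  else
    log "$name FAILED rc=$?"   # logged and alerted, but the remaining modules still run
  fi
}
run_module minecraft backup_minecraft_world   # backup_minecraft_world is a placeholder function
```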
### 8.2 New: `mc-backup-frequent` (Class A) and `mc-backup-world` (Classes B/C/D)
Drop-in files (operator review before deploy):
**`/etc/systemd/system/mc-backup-frequent.service`**
```ini
[Unit]
Description=Minecraft frequent backup (playerdata, every 5 min)
After=docker.service
Wants=docker.service
[Service]
Type=oneshot
User=user
Group=docker
EnvironmentFile=/etc/mc-backup.env
ExecStart=/usr/local/bin/mc-backup-frequent.sh
Nice=10
IOSchedulingClass=best-effort
IOSchedulingPriority=7
```
**`/etc/systemd/system/mc-backup-frequent.timer`**
```ini
[Unit]
Description=Run mc-backup-frequent every 5 minutes
[Timer]
OnBootSec=2min
OnUnitActiveSec=5min
AccuracySec=30s
Persistent=true
[Install]
WantedBy=timers.target
```
**`/etc/mc-backup.env`** (mode 0600, owner `user:docker`)
```
RESTIC_REPOSITORY_FREQUENT=/home/user/restic/mc-frequent
RESTIC_REPOSITORY_WORLD=/home/user/restic/mc-world
RESTIC_PASSWORD_FILE=/etc/mc-backup.pw
MC_DATA=/opt/docker/minecraft
RCON_HOST=127.0.0.1
RCON_PORT=25575
RCON_PASS=*redacted*
HEARTBEAT_URL=https://ntfy.s8n.ru/mc-backup-frequent
ALERT_URL=https://ntfy.s8n.ru/mc-backup-alerts
TS_OFFHOST_USER=mc-backup
TS_OFFHOST_HOST=100.64.0.1
TS_OFFHOST_PATH=/backups/nullstone-mc-restic
```
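The referenced `/etc/mc-backup.pw` has to exist before the first `restic init`. A sketch; losing this password means losing the repos, so a copy belongs off-host (e.g. the operator's password manager):
```bash
# Create the restic repo password file referenced by RESTIC_PASSWORD_FILE (sketch).
sudo install -m 600 -o user -g docker /dev/null /etc/mc-backup.pw
openssl rand -base64 32 | sudo tee /etc/mc-backup.pw >/dev/null
sudo cat /etc/mc-backup.pw   # store a copy in the password manager before initialising repos
```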
**`/usr/local/bin/mc-backup-frequent.sh`**
```bash
#!/usr/bin/env bash
set -euo pipefail
# Export vars so restic sees RESTIC_PASSWORD_FILE etc. when run by hand
# (under systemd, EnvironmentFile= already exports them).
set -a; . /etc/mc-backup.env; set +a
trap 'curl -fsS -m 10 -d "fail rc=$?" "$ALERT_URL" >/dev/null || true' ERR
# 1. Ask MC to flush via rcon (best-effort; don't fail backup if rcon down)
if command -v mcrcon >/dev/null 2>&1; then
  mcrcon -H "$RCON_HOST" -P "$RCON_PORT" -p "$RCON_PASS" -w 1 \
    "save-all flush" >/dev/null 2>&1 || true
fi
# 2. Snapshot just the small fast-changing things.
#    Failures here must reach the ERR trap (no "|| true"), otherwise we recreate T8:
#    a broken backup that still sends its heartbeat.
restic -r "$RESTIC_REPOSITORY_FREQUENT" backup \
  --tag playerdata \
  --tag auto-5min \
  --host nullstone \
  --exclude='*.lock' \
  "$MC_DATA/world/playerdata" \
  "$MC_DATA/world/stats" \
  "$MC_DATA/world/advancements" \
  "$MC_DATA/world/level.dat" \
  "$MC_DATA/world_nether/level.dat" \
  "$MC_DATA/world_the_end/level.dat" \
  "$MC_DATA/homestead_data.db" \
  "$MC_DATA/plugins/LuckPerms" \
  "$MC_DATA/plugins/CoreProtect/database.db"
# 3. Cheap retention (only on local repo)
restic -r "$RESTIC_REPOSITORY_FREQUENT" forget --tag auto-5min \
  --keep-last 24 --keep-hourly 24 --keep-daily 7 \
  --prune --quiet
# 4. Heartbeat — alert if NOT received in 15 min via ntfy server (only reached on success)
curl -fsS -m 5 "$HEARTBEAT_URL" >/dev/null || true
```
**`mc-backup-world.{service,timer,sh}`** same shape, runs hourly during play / 6h otherwise (use `OnCalendar=*-*-* 07,08,...,01:00:00` or two timers), backs up full `world*/`, configs, DB dumps. After local backup, runs:
```bash
# copy needs credentials for both repos; here they share the same password file
restic copy \
  --from-repo "$RESTIC_REPOSITORY_WORLD" \
  --from-password-file "$RESTIC_PASSWORD_FILE" \
  -r "sftp:$TS_OFFHOST_USER@$TS_OFFHOST_HOST:$TS_OFFHOST_PATH" \
  latest
```
And once nightly (separate timer) the same `copy` for `mc-frequent`.
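Deployment, once the unit files above are reviewed and dropped in place, is roughly (sketch; `restic` and `mcrcon` from the trixie repos per §11 phase 2):
```bash
# Enable the timers and confirm they are scheduled (sketch).
sudo apt install restic mcrcon
sudo systemctl daemon-reload
sudo systemctl enable --now mc-backup-frequent.timer mc-backup-world.timer
systemctl list-timers 'mc-backup-*'               # both timers should show a NEXT run
journalctl -u mc-backup-frequent.service -n 20    # spot-check the first run
```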
### 8.3 docker-compose.override.yml — alternative path (rejected)
Considered: itzg image supports `BACKUP_INTERVAL`, `BACKUP_METHOD=restic`. Pros: in-container, knows when world is loaded. Cons:
- Bind-mount to host restic repo crosses the userns-remap boundary (uid 100000 vs host uid 1000); already a known nullstone footgun (memory `project_nullstone_docker_userns`).
- Container restart wipes restic cache, slow first run after every reboot.
- Mixing in-image and host-cron backup logic doubles failure surfaces.
**Decision:** keep backups in systemd on the host; container is unaware. Override file is **not** part of this proposal.
---
## 9. Monitoring & alerting
Three signals, all routed to ntfy on the existing self-hosted `ntfy.s8n.ru` (assumed to exist; if not, add as part of phase 1 single-container deploy). DiscordSRV was dropped on 2026-04-30 per README.md L170, so Discord is not an option.
| Signal | Trigger | Channel |
|---|---|---|
| `mc-backup-frequent` heartbeat | timer fires successfully | ntfy topic `mc-backup-frequent` (silent on success) |
| Heartbeat **missing > 15 min** | dead-man's switch on ntfy server, or external (`healthchecks.io` is free + self-hostable) | ntfy topic `mc-backup-alerts` (high priority) |
| `restic check` weekly | non-zero rc | ntfy topic `mc-backup-alerts` (high priority) |
| Off-host mirror failure | `restic copy` non-zero rc | ntfy topic `mc-backup-alerts` (high priority) |
Operator subscribes onyx + phone to `mc-backup-alerts` only. The `-frequent` topic is a heartbeat sink (not a notification stream).
**Alternative if no ntfy yet:** write to `/var/log/mc-backup.log` AND a tiny status file `/var/lib/mc-backup/last-success` (mtime checked by an external monitor; Gatus and Beszel are both on the roadmap). Until either of those lands, a simple cron on **onyx** doing `ssh user@nullstone 'find /var/lib/mc-backup/last-success -mmin -15 | grep .'` and triggering a desktop `notify-send` is enough (sketched below).
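A sketch of that stop-gap check (cron on onyx, every 15 min; assumes the §8.2 scripts also `touch /var/lib/mc-backup/last-success` on success, which is not yet in the drafts):
```bash
#!/usr/bin/env bash
# Onyx-side dead-man's check (sketch). Cron: */15 * * * *
if ! ssh -o ConnectTimeout=10 user@nullstone \
     'find /var/lib/mc-backup/last-success -mmin -15 | grep -q .'; then
  notify-send -u critical "mc-backup: no heartbeat from nullstone in 15 min"
fi
```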
This addresses T8 (the silent-failure threat) directly.
---
## 10. Cost & capacity
**Hardware cost:** £0. Uses existing nullstone NVMe + onyx NVMe + existing Tailscale mesh.
**Disk consumption (steady state, both repos):**
| Where | Estimate | Headroom |
|---|---|---|
| nullstone `/home/user/restic/mc-frequent` | < 1 GB | 142 G free, ~140× |
| nullstone `/home/user/restic/mc-world` | 15-25 GB | ~6× |
| onyx `~/backups/nullstone-mc-restic/` | 16-26 GB | 1.6 T free, ~60× |
**Days of retention given current free space:** even if the world doubles to 36 GB raw, dedup keeps growth roughly linear at ~5 % per snapshot; well over a year of monthly retention fits.
**Network:** Tailscale LAN-direct (~5 ms onyx ↔ nullstone). Nightly delta typically < 500 MB after dedup. Negligible.
**Operator time:** ~2 h initial deploy, ~10 min/month for the drill, ~zero on autopilot.
---
## 11. Phase plan
| Phase | What | When | Blocker |
|---|---|---|---|
| 0 | This doc + runbook stub written, reviewed | TODAY | |
| 1 | Stop the bleeding: fix `backup.sh` orphan lines so daily MC tar at least runs again | TODAY (15 min) | |
| 2 | Stand up `mc-backup-frequent` timer + local restic repo (Class A) | this week | needs `apt install restic mcrcon` |
| 3 | Add `mc-backup-world` timer + Class B/C/D | this week | |
| 4 | Onyx off-host SFTP target + `restic copy` job | this week | onyx user provisioning + ssh key |
| 5 | First monthly drill | next 1st Saturday | |
| 6 | Wire ntfy alerts | when ntfy/Gatus deployed (infra roadmap) | external |
| 7 | Friend RTX 4080 PC as second off-host (geographic) | phase 2 | Windows-side tooling |
Phases 1-4 are doable today with what's on hand. Nothing in phases 1-5 requires purchasing.
---
## 12. Open questions for operator
1. **ntfy.s8n.ru — does it exist yet?** Memory hints at Tuwunel + Matrix on `txt.s8n.ru`. If ntfy isn't deployed, decide: deploy ntfy *now*, or use Matrix room via Tuwunel webhook bridge as alert sink.
2. **Onyx user `mc-backup`:** create today, or reuse the existing `admin` with a restricted authorized_keys entry? The restricted user is cleaner; reusing `admin` is faster.
3. **Append-only enforcement on the onyx side:** accept "sftp chroot + no shell" as good enough, or invest in a per-repo restic key with `--no-delete`-style isolation (more work, partial mitigation only)?
4. **Pre-flight world validation:** run `region-fixer` against the latest snapshot weekly to catch silent corruption (T3)? Adds ~5 min of compute weekly. Recommend yes.
5. **Class-E (host configs) — already in `live-server/` git repo via Syncthing/manual?** If yes, drop Class E from this scheme; if no, add it.
---
## 13. References
- `docs/BACKUP.md`: current (broken) state docs.
- `docs/RUNBOOK-BACKUP-RESTORE.md`: operational runbook (this commit).
- `scripts/backup.sh`: the to-be-fixed daily script (F-backup-1 in `infra/STATE.md`).
- `_github/infra/STATE.md`: Top-5 weaknesses #2 and #5 tracking this work.
- `_github/infra/runbooks/MIGRATION-nullstone-to-cobblestone.md` §5: F-backup-1 detail; nullstone-as-spare hint.
- Memory: `project_friend_gpu` (Tailscale stable IP for friend), `project_tailscale_mesh` (mesh layout), `project_nullstone_docker_userns` (why container-side backup is rejected).
- `CLAUDE.md` Device Registry: onyx 192.168.0.28 / 100.64.0.1.