# Minecraft Backup Strategy — racked.ru on nullstone

**Status:** PROPOSAL (2026-05-07) — not yet implemented.

**Author trigger:** Player lost full inventory to void death today; rollback impossible because the existing 02:00 daily backup had **silently failed for 5 of the last 7 days** and there is **zero off-host copy**.

**Owner:** `s8n` (operator).

**Target host:** `nullstone` (192.168.0.100, Debian 13 trixie).

---
## 0. Current state (audited 2026-05-07)

Existing system in `/opt/docker/backup.sh` + `cron.d/docker-backup` (02:00 daily, 7-day retention in `/opt/backups/`).

Findings from `/opt/backups/backup.log`:

| Date | MC world result | Backup dir total |
|------|-----------------|------------------|
| 2026-04-26 | FAILED | — |
| 2026-04-27 | FAILED | — |
| 2026-04-28 | FAILED | — |
| 2026-04-29 | OK (3.6 G) | — |
| 2026-04-30 | FAILED | — |
| 2026-05-01 | FAILED | — |
| 2026-05-02 | OK (3.6 G) | — |
| 2026-05-03 | (no MC log line) | 8 K |
| 2026-05-04 | (no MC log line) | 8 K |
| 2026-05-05 | (no MC log line) | 8 K |
| 2026-05-06 | (no MC log line) | 12 K |
| 2026-05-07 | (no MC log line) | 12 K |

After 2026-05-02 the entire MC block stopped emitting log lines. The script appears to be exiting before reaching it (the duplicated stray `chmod 600 ... synapse-signing-key` lines at L119–122 are orphaned from a botched edit and may now break `set -e`). Effective state: **two MC backups in the last 12 days**, both already pruned by 7-day retention. **No usable backup exists right now.**

Cross-references:

- `_github/infra/STATE.md` Top-5 weakness #2 ("backup.sh broken silently") and #5 ("No off-host backup").
- `_github/infra/runbooks/MIGRATION-nullstone-to-cobblestone.md` §5 already names this `F-backup-1` and proposes "Restic + autorestic to B2/Wasabi or to nullstone-as-spare". This strategy refines that to use on-hand resources rather than paid storage.

### Available resources (no purchasing required)

| Asset | Location | Free | Reachability | Role |
|---|---|---|---|---|
| nullstone `/home` | local NVMe (ext4 LVM) | 142 G of 399 G | local | Primary repo + restic cache |
| onyx `/home` | LUKS NVMe | 1.6 T of 1.9 T | Tailscale 100.64.0.1 (LAN ~5 ms) | **Off-host primary** |
| friend RTX 4080 PC | DESKTOP-LR0RILA | unknown (Windows, large) | Tailscale 100.64.0.3 (WAN, IP-stable via tailnet) | **Off-host secondary** (defer) |
| nullstone `/opt/backups` | same disk as `/opt/docker` | 142 G | local | *Not* a real backup target — same-disk SPOF |

**No purchased B2 / Wasabi / S3 in this proposal.** Tailscale + onyx covers off-host today. B2 stays in the future-options annex.

---

## 1. Threat model

| # | Threat | Concrete example | Frequency | Mitigation in this plan |
|---|---|---|---|---|
| T1 | Player accidental loss (void death, lava, fall) | YOU500, 2026-05-07 | weekly | 5-min playerdata snapshots (RPO ≤ 5 min) |
| T2 | Griefing / theft / chest emptied by ban-evader | possible | monthly | 5-min playerdata + 1-h world snapshots |
| T3 | World corruption (chunk error, region-file truncate) | rare | — | 6-h pre-flight validated full world snapshot |
| T4 | Plugin / config bad change (LuckPerms wipe, server.properties) | edits during ops | weekly | daily configs + DB dump + git history (`live-server/` repo) |
| T5 | Host disk failure (single NVMe) | low/year | — | nightly off-host copy to onyx (Tailscale) |
| T6 | Ransomware / host compromise | low | — | append-only Restic repo on onyx; nullstone holds **no** delete key |
| T7 | Operator `rm -rf` or wrong `docker compose down -v` | low | — | retention floor (4 weekly + 12 monthly) survives a recent rm |
| T8 | Backup script silently failing (current state) | OBSERVED | — | heartbeat alert + monthly restore drill (§7) |

T8 is the one that just bit us. The single most important addition is **alerting on missed runs**, not the storage tech.

---

## 2. RPO / RTO

| Class | Data | RPO | RTO | Backup mechanism |
|---|---|---|---|---|
| A | playerdata (`world/playerdata/*.dat`, `stats/`, `advancements/`) | **5 min** | < 2 min per player | rcon `save-all flush` → rsync to local snapshot, then restic-add |
| B | full world (region files, end + nether) | **1 h** during play, **6 h** otherwise | 15 min | restic of `world*/` |
| C | plugin configs + LuckPerms YAML | 24 h | 30 min | tar of `plugins/*/config*.yml` + LP file dump |
| D | LuckPerms / Homestead SQLite DBs (`*.db`, `homestead_data.db`) | 1 h | 5 min | sqlite `.backup` then restic-add |
| E | host-level configs (`docker-compose.yml`, `server.properties`, `purpur.yml`, `bukkit.yml`, `paper-*.yml`, `whitelist.json`, `ops.json`, `banned-*.json`, `config/`) | 24 h | 5 min | already in git repo `_github/minecraft-server/`; backup just covers drift |

**Justification for RPO=5 min on Class A:** the void-death case rebuilds in seconds — recovering one `<uuid>.dat` is a ~30 s operation if a 5-min-old snapshot exists. Snapshotting just the 1.3 MB `playerdata/` dir is cheap (single-digit MB/day after dedup).

---

## 3. Tool choice — Restic

Compared:

| Tool | Dedup | Encryption | Snapshots | Network destinations | Verdict |
|---|---|---|---|---|---|
| **restic** | content-addressed, very effective on MC region files | AES-256, repo-key | yes | sftp (Tailscale), local, B2, S3, Azure, rclone | **WINNER** |
| borgbackup | similar | yes | yes | ssh only, lock-on-write | Equally good; restic chosen because operator already plans `restic + autorestic` per `infra/STATE.md` line 112; sftp dest is simpler than borg's required server-side binary |
| rsnapshot | hardlinks, no dedup | none | rotated dirs | local + rsync | No encryption ⇒ off-host copy on Tailscale (already encrypted) is fine, but no dedup means 18 G × N snapshots is painful. Reject. |
| zfs send | block-level | (zfs native) | snapshots | yes | nullstone is **ext4/LVM**, no ZFS, no btrfs. Reject. |
| LVM snapshot | COW | none | yes | local only | Same-disk only, doesn't survive disk failure. Useful as a *staging* primitive only. |
| custom rsync + cp -al | hardlinks | none | yes | yes | Reinventing rsnapshot. Reject. |
| itzg `BACKUP_*` env | tar to volume | none | rotation | local | Already tried in spirit by current `backup.sh`; same-disk; not granular. Reject as primary. |

**Decision:** `restic` for Classes A, B, C, D. Continue using a thin tar wrapper for Class E (configs are already in the git repo, this is just safety).

Restic strengths for our case:

- Region files dedup *very* well (chunks unchanged across snapshots).
- A 5-min Class-A snapshot adds ~MB to the repo, not the full 1.3 MB × N.
- One repo on local disk + one mirror to onyx via `rclone serve restic` or direct `sftp:` — no agent needed on onyx beyond ssh.
- `restic check --read-data-subset=5%` is the canonical scrub.

Apt: `apt install restic` on trixie ships 0.16.x — sufficient.
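
A one-time setup sketch under the §8.2 layout (repo paths and the password file are this proposal's names, not existing state):

```bash
# One-time install + repo init (hedged sketch; paths per §8.2 drafts).
sudo apt install restic
sudo install -m 600 -o user -g docker /dev/null /etc/mc-backup.pw
# ...write a generated password into /etc/mc-backup.pw, then:
restic -r /home/user/restic/mc-frequent init --password-file /etc/mc-backup.pw
restic -r /home/user/restic/mc-world init --password-file /etc/mc-backup.pw
```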

---

## 4. Schedule

All times Europe/London (matches `TZ` in the compose file).

| Job | Cadence | Source | Destination | Mechanism |
|---|---|---|---|---|
| **A — playerdata** | every **5 min** | `world/playerdata/`, `world/stats/`, `world/advancements/`, `world*/level.dat`, `*.db` (LP+homestead) | restic repo `/home/user/restic/mc-frequent/` | systemd timer `mc-backup-frequent.timer` |
| **B — full world** | every **1 h** during play (07:00–01:00), **6 h** otherwise | `world/`, `world_nether/`, `world_the_end/` | restic repo `/home/user/restic/mc-world/` | systemd timer `mc-backup-world.timer` |
| **C — configs + plugins** | **daily 02:00** | `/opt/docker/minecraft/*.yml`, `*.json`, `plugins/*/config*.yml`, `plugins/LuckPerms/`, `docker-compose.yml` | restic repo `mc-world` (path-tagged) | reuse same timer with second backup target |
| **D — DB dumps** | every **1 h** | `homestead_data.db`, `plugins/CoreProtect/database.db`, `plugins/LuckPerms/luckperms-h2-*` | restic repo `mc-world` | timer hooks `sqlite3 .backup` first (sketched below) |
| **E — off-host mirror** | **nightly 03:30** | nullstone `/home/user/restic/` | onyx `100.64.0.1:/home/mc-backup/backups/nullstone-mc-restic/` | `restic copy` over sftp (Tailscale) — append-only key on onyx side |
| **F — verify** | **weekly Sun 04:00** | both repos | — | `restic check --read-data-subset=5%` then alert on rc |
| **G — drill** | **monthly 1st Sat 11:00** | random snapshot | scratch dir | §7 procedure |
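
Row D's dump-then-backup step, as a hedged sketch (DB file names from the table; the staging directory is hypothetical; sqlite's online backup API takes its own read lock, so the server keeps running):

```bash
# Class-D pre-dump sketch: copy live SQLite DBs safely, then let restic
# pick up the consistent copies instead of the hot files.
dump_dir=/opt/docker/minecraft/.db-dumps   # hypothetical staging dir
mkdir -p "$dump_dir"
for db in homestead_data.db plugins/CoreProtect/database.db; do
  sqlite3 "/opt/docker/minecraft/$db" ".backup '$dump_dir/$(basename "$db")'"
done
```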
### Why this works for the void-death case
T1 hits at 18:42. By 18:45 a Class-A snapshot exists containing the player's `<uuid>.dat` from 18:40. Restore: `restic -r ... restore --target /tmp/r --include 'world/playerdata/<uuid>.dat' latest`, stop the server (or `/save-off` plus a minimal in-place file swap), copy the file into place, `/save-on`. Total RTO < 2 min.
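
The same path as an end-to-end hedged sketch (the UUID is a placeholder; repo layout and rcon credentials per the §8.2 drafts; the player should be offline, or the server stopped, for the restored `.dat` to stick):

```bash
# T1 restore sketch — non-authoritative; review against the runbook.
uuid="<uuid>"                        # placeholder: the affected player
repo=/home/user/restic/mc-frequent
scratch=$(mktemp -d)

restic -r "$repo" restore latest \
  --target "$scratch" \
  --include "world/playerdata/$uuid.dat"

mcrcon -H 127.0.0.1 -P 25575 -p "$RCON_PASS" save-off
find "$scratch" -name "$uuid.dat" \
  -exec cp {} /opt/docker/minecraft/world/playerdata/ \;
mcrcon -H 127.0.0.1 -P 25575 -p "$RCON_PASS" save-on
```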

---

## 5. Retention

Restic policy (passed to `restic forget --keep-*`):

```
--keep-last 24     # 24 most recent (covers 2 h of 5-min snapshots)
--keep-hourly 24   # 24 h of hourly
--keep-daily 7     # 7 days
--keep-weekly 4    # 4 weeks
--keep-monthly 12  # 12 months
```
Applied per-tag — Class A snapshots tagged `playerdata`, B/C/D tagged `world`. Forget is run **only on the local repo**; the onyx mirror inherits via `restic copy` with same policy after the local forget+prune.
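
Concretely, a hedged sketch of the per-tag runs (tag names from above; executed by the §8 timers):

```bash
# Retention runs on the local repos only; onyx only ever receives copies.
restic -r /home/user/restic/mc-frequent forget --tag playerdata \
  --keep-last 24 --keep-hourly 24 --keep-daily 7 \
  --keep-weekly 4 --keep-monthly 12 --prune
restic -r /home/user/restic/mc-world forget --tag world \
  --keep-hourly 24 --keep-daily 7 --keep-weekly 4 --keep-monthly 12 --prune
```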

### Storage budget

- Class A: 1.3 MB raw × dedup (~20× on `.dat`, mostly empty NBT slots) → ~70 KB / snapshot **net**.
- 12/h × 24 h × 7 d = 2 016 snapshots/week before pruning → < 300 MB/week.
- Class B/C/D: 18 G raw → ~6.5 G compressed (per the current 3.6 G figure, adjusted for nether/end now active). Restic dedup on hourly snapshots: ~50–200 MB delta/snapshot during active play.
- 24 hourly + 7 daily + 4 weekly + 12 monthly ≈ 47 retained → estimate **15–25 GB total** at steady state.
- E (off-host): same as above on onyx (1.6 TB free — ~60× headroom).
**Conclusion:** comfortably fits in nullstone's 142 G free. Onyx is essentially unconstrained.

---

## 6. Off-host destination — onyx via Tailscale

**Choice:** `onyx` (100.64.0.1, 1.6 TB free on `/home`). Reasons:

- Already in the tailnet (`tag:admin`), already trusted, already SSH-reachable.
- 1.6 TB is 100× the dataset.
- Operator's daily-driver: a missed-backup alert on onyx is *seen*.
- Deferred (phase 2): replicate to friend's RTX 4080 PC (100.64.0.3) for true geographic separation. The tailnet IP is stable across the friend's ISP IP changes per memory `project_friend_gpu`.

**Mechanics:**

1. On onyx: create restricted user `mc-backup` with `~/backups/nullstone-mc-restic/` and an sshd `Match` block that **only allows `internal-sftp` chrooted to that dir** — no shell, no port-forwarding (`Match User mc-backup` with `ChrootDirectory %h` and `ForceCommand internal-sftp -d /backups/nullstone-mc-restic`; see the sketch after this list).
2. On nullstone: install nullstone's ssh public key on onyx for that user. Use a second restic key (separate password) for the mirror. Caveat on **append-only**: restic has no native append-only mode over plain sftp — that property comes from a server-side component such as `rest-server --append-only` — and sftp chroot permissions only partially enforce it. Practical compromise: rely on `restic copy` being add-only, and audit all `forget` runs.
3. Nightly job on nullstone: `restic -r sftp:mc-backup@100.64.0.1:/backups/nullstone-mc-restic copy --from-repo /home/user/restic/mc-world latest && ... mc-frequent ...`.
4. Onyx-side cron weekly: `restic check` on the mirror (independent verification).
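
A hedged sketch of the onyx-side provisioning (names and paths from above; the sshd fragment to be reviewed against onyx's existing config):

```bash
# On onyx — sftp-only, chrooted backup user.
sudo useradd --create-home --shell /usr/sbin/nologin mc-backup
sudo chown root:root /home/mc-backup      # ChrootDirectory must be root-owned
sudo chmod 755 /home/mc-backup
sudo install -d -o mc-backup -g mc-backup \
  /home/mc-backup/backups/nullstone-mc-restic

# /etc/ssh/sshd_config.d/mc-backup.conf:
#   Match User mc-backup
#       ChrootDirectory %h
#       ForceCommand internal-sftp -d /backups/nullstone-mc-restic
#       AllowTcpForwarding no
#       X11Forwarding no
sudo systemctl reload ssh
```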

**Why not friend's GPU PC?** Windows host, no SSH server enabled by default, asymmetric trust. Defer to phase 2 once an SMB or `rclone serve` target is set up there.

---

## 7. Restore drill (monthly, 1st Saturday 11:00)

Runbook: `docs/RUNBOOK-BACKUP-RESTORE.md` (created alongside this proposal).

Drill scenario: "YOU500 lost his inventory to a void death 6 minutes ago." Steps:

1. Pick a known UUID from `world/playerdata/` (the operator's own UUID).
2. `restic -r /home/user/restic/mc-frequent snapshots --tag playerdata | tail -5` — confirm the freshest snapshot is ≤ 6 min old.
3. `restic -r ... restore latest --target /tmp/drill-$(date +%s) --include 'world/playerdata/<uuid>.dat'`.
4. Parse the `.dat` with `nbted` or `python -m nbtlib` — confirm it is a valid gzipped NBT structure, not zero bytes or partial (steps 2–4 condense to the sketch after this list).
5. `diff` against the live `.dat` — log the differences (expected: at least the inventory NBT path differs because the player kept playing).
6. Repeat from the **onyx mirror** repo to prove off-host works end-to-end.
7. Log the result to `docs/RUNBOOK-BACKUP-RESTORE.md` § Drill log.
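
Steps 2–4, condensed into a hedged sketch (assumes `nbtlib` on the host; `<uuid>` stays a placeholder):

```bash
# Drill sketch — non-destructive by construction: restores to a scratch dir.
uuid="<uuid>"                       # placeholder: operator's own UUID
repo=/home/user/restic/mc-frequent
target=/tmp/drill-$(date +%s)

restic -r "$repo" snapshots --tag playerdata | tail -5
restic -r "$repo" restore latest --target "$target" \
  --include "world/playerdata/$uuid.dat"

# A healthy .dat is gzip-compressed NBT with a parseable root tag.
dat=$(find "$target" -name "$uuid.dat")
python3 -c "import nbtlib, sys; nbtlib.load(sys.argv[1]); print('NBT OK')" "$dat"
```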
Drill is **non-destructive** — never overwrite live `.dat` during a drill. Real restores follow §3 of the runbook.
Pass criteria: both restores complete in < 2 min wall-clock and the parsed NBT root tag is well-formed.

---

## 8. Implementation — concrete drafts

Two layers: a **fix** to the existing daily script (Classes C/E) and a **new sidecar timer** for Classes A/B/D.

### 8.1 Fix `/opt/docker/backup.sh` (F-backup-1)

Already documented in `infra/runbooks/MIGRATION-nullstone-to-cobblestone.md` §5. Minimum work:

- Drop the dead `matrix-postgres` block (Synapse retired).
- Drop / fix the `mongodb` block (RC stopped 2026-05-06).
- Remove the orphaned `chmod 600 ...synapse-signing-key...` block at L119–122 (causing a `set -e` exit before the MC block on most days).
- Wrap each module in `( ... ) || log "module FAILED"` so one module's failure doesn't skip the rest (sketched after this list).
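
The wrapping pattern, sketched (`log` is backup.sh's existing helper; the module names are hypothetical):

```bash
# Per-module isolation for backup.sh: each module runs in a guarded
# subshell so its failure can't abort the rest of the script.
# Caveat: bash suppresses `set -e` inside a `||`-guarded compound, so a
# module should be a single command (or check its own errors) for the
# guard to be meaningful.
backup_module() {
  local name=$1; shift
  ( "$@" ) || log "module $name FAILED (rc=$?)"
}
backup_module minecraft backup_minecraft_world   # hypothetical functions
backup_module configs   backup_host_configs
```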

Out-of-scope for this strategy doc — track in infra audit.

### 8.2 New: `mc-backup-frequent` (Class A) and `mc-backup-world` (Classes B/C/D)

Drop-in files (operator review before deploy):

**`/etc/systemd/system/mc-backup-frequent.service`**
```ini
[Unit]
Description=Minecraft frequent backup (playerdata, every 5 min)
After=docker.service
Wants=docker.service

[Service]
Type=oneshot
User=user
Group=docker
EnvironmentFile=/etc/mc-backup.env
ExecStart=/usr/local/bin/mc-backup-frequent.sh
Nice=10
IOSchedulingClass=best-effort
IOSchedulingPriority=7
```

**`/etc/systemd/system/mc-backup-frequent.timer`**

```ini
[Unit]
Description=Run mc-backup-frequent every 5 minutes

[Timer]
OnBootSec=2min
OnUnitActiveSec=5min
AccuracySec=30s
Persistent=true

[Install]
WantedBy=timers.target
```

**`/etc/mc-backup.env`** (mode 0600, owner `user:docker`)

```
RESTIC_REPOSITORY_FREQUENT=/home/user/restic/mc-frequent
RESTIC_REPOSITORY_WORLD=/home/user/restic/mc-world
RESTIC_PASSWORD_FILE=/etc/mc-backup.pw
MC_DATA=/opt/docker/minecraft
RCON_HOST=127.0.0.1
RCON_PORT=25575
RCON_PASS=*redacted*
HEARTBEAT_URL=https://ntfy.s8n.ru/mc-backup-frequent
ALERT_URL=https://ntfy.s8n.ru/mc-backup-alerts
TS_OFFHOST_USER=mc-backup
TS_OFFHOST_HOST=100.64.0.1
TS_OFFHOST_PATH=/backups/nullstone-mc-restic
```

**`/usr/local/bin/mc-backup-frequent.sh`**

```bash
#!/usr/bin/env bash
set -euo pipefail
set -a               # export everything we source so restic's children see it
. /etc/mc-backup.env
set +a

# Any unhandled failure pings the alert topic (best-effort, never masks rc).
trap 'curl -fsS -m 10 -d "fail rc=$?" "$ALERT_URL" >/dev/null || true' ERR

# 1. Ask MC to flush via rcon (best-effort; don't fail the backup if rcon is down)
if command -v mcrcon >/dev/null 2>&1; then
  mcrcon -H "$RCON_HOST" -P "$RCON_PORT" -p "$RCON_PASS" -w 1 \
    "save-all flush" >/dev/null 2>&1 || true
fi

# 2. Snapshot just the small fast-changing things.
#    Deliberately no `|| true` here: a failed backup must reach the ERR
#    trap, otherwise we recreate T8 (silent failure).
restic -r "$RESTIC_REPOSITORY_FREQUENT" backup \
  --tag playerdata \
  --tag auto-5min \
  --host nullstone \
  --exclude='*.lock' \
  "$MC_DATA/world/playerdata" \
  "$MC_DATA/world/stats" \
  "$MC_DATA/world/advancements" \
  "$MC_DATA/world/level.dat" \
  "$MC_DATA/world_nether/level.dat" \
  "$MC_DATA/world_the_end/level.dat" \
  "$MC_DATA/homestead_data.db" \
  "$MC_DATA/plugins/LuckPerms" \
  "$MC_DATA/plugins/CoreProtect/database.db"

# 3. Cheap retention (only on the local repo)
restic -r "$RESTIC_REPOSITORY_FREQUENT" forget --tag auto-5min \
  --keep-last 24 --keep-hourly 24 --keep-daily 7 \
  --prune --quiet

# 4. Heartbeat — the ntfy side alerts if this is NOT received within 15 min
curl -fsS -m 5 "$HEARTBEAT_URL" >/dev/null || true
```

**`mc-backup-world.{service,timer,sh}`** — same shape; runs hourly during play / every 6 h otherwise (use `OnCalendar=*-*-* 07,08,...,01:00:00` or two timers — see the sketch below), backs up the full `world*/`, configs, and DB dumps. After the local backup it runs:

```bash
# Copy the newest snapshot to the onyx mirror. The source repo needs its
# own password flag even when it matches the destination's.
restic copy \
  --from-repo "$RESTIC_REPOSITORY_WORLD" \
  --from-password-file "$RESTIC_PASSWORD_FILE" \
  -r "sftp:$TS_OFFHOST_USER@$TS_OFFHOST_HOST:$TS_OFFHOST_PATH" \
  latest
```
And once nightly (separate timer) the same `copy` for `mc-frequent`.
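
The play-hours cadence, as a hedged sketch — two timers driving the same service (hour ranges per the §4 window; adjust to taste; each file still needs the usual `[Unit]`/`[Install]` sections):

```ini
# mc-backup-world-play.timer — hourly during play hours (07:00–01:00)
[Timer]
OnCalendar=*-*-* 07..23:00:00
OnCalendar=*-*-* 00..01:00:00
Unit=mc-backup-world.service

# mc-backup-world-idle.timer — one off-hours run keeps the 6-h RPO
[Timer]
OnCalendar=*-*-* 04:00:00
Unit=mc-backup-world.service
```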

### 8.3 docker-compose.override.yml — alternative path (rejected)

Considered: the itzg image supports `BACKUP_INTERVAL`, `BACKUP_METHOD=restic`. Pros: in-container, knows when the world is loaded. Cons:

- Bind-mounting the host restic repo crosses the userns-remap boundary (uid 100000 vs host uid 1000) — already a known nullstone footgun (memory `project_nullstone_docker_userns`).
- A container restart wipes the restic cache, so the first run after every reboot is slow.
- Mixing in-image and host-cron backup logic doubles the failure surface.

**Decision:** keep backups in systemd on the host; container is unaware. Override file is **not** part of this proposal.

---

## 9. Monitoring & alerting

Four signals, all routed to ntfy on the self-hosted `ntfy.s8n.ru` (assumed to exist; if not, add it as part of phase 1 — a single-container deploy). DiscordSRV was dropped on 2026-04-30 per README.md L170, so Discord is not an option.

| Signal | Trigger | Channel |
|---|---|---|
| `mc-backup-frequent` heartbeat | timer fires successfully | ntfy topic `mc-backup-frequent` (silent on success) |
| Heartbeat **missing > 15 min** | dead-man's switch on the ntfy server, or external (`healthchecks.io` is free + self-hostable) | ntfy topic `mc-backup-alerts` (high priority) |
| `restic check` weekly | non-zero rc | ntfy topic `mc-backup-alerts` (high priority) |
| Off-host mirror failure | `restic copy` non-zero rc | ntfy topic `mc-backup-alerts` (high priority) |

Operator subscribes onyx + phone to `mc-backup-alerts` only. The `-frequent` topic is a heartbeat sink (not a notification stream).
**Alternative if no ntfy yet:** write to `/var/log/mc-backup.log` AND a tiny status file `/var/lib/mc-backup/last-success` (mtime checked by an external monitor — Gatus on roadmap, Beszel on roadmap). Until either of those lands, a simple cron on **onyx** doing `ssh user@nullstone 'find /var/lib/mc-backup/last-success -mmin -15 | grep .'` and triggering a desktop `notify-send` is enough.
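
That stopgap, sketched as an onyx crontab entry (assumes the backup script touches `/var/lib/mc-backup/last-success` on success, per above; `notify-send` from cron usually needs `DISPLAY`/`DBUS_SESSION_BUS_ADDRESS` plumbed in):

```bash
# onyx crontab — dead-man's switch until ntfy/Gatus lands: alert when
# nullstone's success marker is older than 15 minutes.
*/15 * * * * ssh user@nullstone 'find /var/lib/mc-backup/last-success -mmin -15 | grep -q .' || notify-send -u critical "mc-backup" "no success marker in 15 min"
```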
This addresses T8 (the silent-failure threat) directly.

---

## 10. Cost & capacity

**Hardware cost:** £0. Uses existing nullstone NVMe + onyx NVMe + existing Tailscale mesh.

**Disk consumption (steady state, both repos):**

| Where | Estimate | Headroom |
|---|---|---|
| nullstone `/home/user/restic/mc-frequent` | < 1 GB | 142 G free → ~140× |
| nullstone `/home/user/restic/mc-world` | 15–25 GB | ~6× |
| onyx `~/backups/nullstone-mc-restic/` | 16–26 GB | 1.6 T free → ~60× |

**Days of retention given current free space:** even if the world doubles to 36 GB raw, dedup keeps growth linear at ~5 % per snapshot — well over a year of monthly retention fits.
**Network:** Tailscale LAN-direct (5 ms onyx ↔ nullstone). Nightly delta typically < 500 MB after dedup. Negligible.
**Operator time:** ~2 h initial deploy, ~10 min/month for the drill, ~zero on autopilot.

---

## 11. Phase plan

| Phase | What | When | Blocker |
|---|---|---|---|
| 0 | This doc + runbook stub written, reviewed | TODAY | — |
| 1 | Stop the bleeding: fix `backup.sh` orphan lines so daily MC tar at least runs again | TODAY (15 min) | — |
| 2 | Stand up `mc-backup-frequent` timer + local restic repo (Class A) | this week | needs `apt install restic mcrcon` |
| 3 | Add `mc-backup-world` timer + Class B/C/D | this week | — |
| 4 | Onyx off-host SFTP target + `restic copy` job | this week | onyx user provisioning + ssh key |
| 5 | First monthly drill | next 1st Saturday | — |
| 6 | Wire ntfy alerts | when ntfy/Gatus deployed (infra roadmap) | external |
| 7 | Friend RTX 4080 PC as second off-host (geographic) | phase 2 | Windows-side tooling |

Phases 1–4 are doable today with what's on hand. Nothing in phases 1–5 requires purchasing.

---

## 12. Open questions for operator

1. **ntfy.s8n.ru — does it exist yet?** Memory hints at Tuwunel + Matrix on `txt.s8n.ru`. If ntfy isn't deployed, decide: deploy ntfy *now*, or use a Matrix room via a Tuwunel webhook bridge as the alert sink.
2. **Onyx user `mc-backup`** — create today, or reuse the existing `admin` with restricted authorized_keys? The restricted user is cleaner; reusing `admin` is faster.
3. **Append-only enforcement** on the onyx side — accept "sftp chroot + no shell" as good enough, or invest in server-side enforcement such as `rest-server --append-only` (more work, and still only a partial mitigation)?
4. **Pre-flight world validation** — run `region-fixer` against the latest snapshot weekly to catch silent corruption (T3)? Adds ~5 min of compute weekly. Recommend yes.
5. **Class E (host configs)** — already in the `live-server/` git repo via Syncthing/manual? If yes, drop Class E from this scheme; if no, add it.

---

## 13. References

- `docs/BACKUP.md` — current (broken) state docs.
- `docs/RUNBOOK-BACKUP-RESTORE.md` — operational runbook (this commit).
- `scripts/backup.sh` — to-be-fixed daily script (F-backup-1 in `infra/STATE.md`).
- `_github/infra/STATE.md` — Top-5 weakness #2 + #5 tracking this work.
- `_github/infra/runbooks/MIGRATION-nullstone-to-cobblestone.md` §5 — F-backup-1 detail; nullstone-as-spare hint.
- Memory: `project_friend_gpu` (Tailscale stable IP for friend), `project_tailscale_mesh` (mesh layout), `project_nullstone_docker_userns` (why container-side backup is rejected).
- `CLAUDE.md` Device Registry — onyx 192.168.0.28 / 100.64.0.1.