# Minecraft Backup Strategy — racked.ru on nullstone

**Status:** PROPOSAL (2026-05-07) — not yet implemented.
**Trigger:** A player lost a full inventory to a void death today; rollback was impossible because the existing 02:00 daily backup had silently failed on 5 of the last 7 days and there is no off-host copy.
**Owner:** s8n (operator). **Target host:** nullstone (192.168.0.100, Debian 13 trixie).


## 0. Current state (audited 2026-05-07)

Existing system in /opt/docker/backup.sh + cron.d/docker-backup (02:00 daily, 7-day retention in /opt/backups/).

Findings from /opt/backups/backup.log:

| Date | MC world result | Backup dir total |
|------------|------------------|------------------|
| 2026-04-26 | FAILED | |
| 2026-04-27 | FAILED | |
| 2026-04-28 | FAILED | |
| 2026-04-29 | OK | 3.6 G |
| 2026-04-30 | FAILED | |
| 2026-05-01 | FAILED | |
| 2026-05-02 | OK | 3.6 G |
| 2026-05-03 | (no MC log line) | 8 K |
| 2026-05-04 | (no MC log line) | 8 K |
| 2026-05-05 | (no MC log line) | 8 K |
| 2026-05-06 | (no MC log line) | 12 K |
| 2026-05-07 | (no MC log line) | 12 K |

After 2026-05-02 the entire MC block stopped emitting log lines. The script appears to be exiting before reaching it (the duplicated stray chmod 600 ... synapse-signing-key lines at L119–122 are orphaned from a botched edit and may now break set -e). Effective state: two MC backups in the last 12 days, both already pruned by 7-day retention. No usable backup exists right now.

Cross-references:

- _github/infra/STATE.md Top-5 weakness #2 ("backup.sh broken silently") and #5 ("No off-host backup").
- _github/infra/runbooks/MIGRATION-nullstone-to-cobblestone.md §5 already names this F-backup-1 and proposes "Restic + autorestic to B2/Wasabi or to nullstone-as-spare". This strategy refines that to use on-hand resources rather than paid storage.

### Available resources (no purchasing required)

| Asset | Location | Free | Reachability | Role |
|---|---|---|---|---|
| nullstone /home | local NVMe (ext4 LVM) | 142 G of 399 G | local | Primary repo + restic cache |
| onyx /home | LUKS NVMe | 1.6 T of 1.9 T | Tailscale 100.64.0.1 (LAN, ~5 ms) | Off-host primary |
| friend RTX 4080 PC | DESKTOP-LR0RILA | unknown (Windows, large) | Tailscale 100.64.0.3 (WAN, IP-stable via tailnet) | Off-host secondary (defer) |
| nullstone /opt/backups | same disk as /opt/docker | 142 G | local | Not a real backup target — same-disk SPOF |

No purchased B2 / Wasabi / S3 in this proposal. Tailscale + onyx covers off-host today. B2 stays in the future-options annex.


## 1. Threat model

| # | Threat | Concrete example | Frequency | Mitigation in this plan |
|---|---|---|---|---|
| T1 | Player accidental loss (void death, lava, fall) | YOU500, 2026-05-07 | weekly | 5-min playerdata snapshots (RPO ≤ 5 min) |
| T2 | Griefing / theft / chest emptied by ban-evader | possible | monthly | 5-min playerdata + 1-h world snapshots |
| T3 | World corruption (chunk error, region-file truncate) | | rare | 6-h pre-flight validated full world snapshot |
| T4 | Plugin / config bad change (LuckPerms wipe, server.properties) | edits during ops | weekly | daily configs + DB dump + git history (live-server/ repo) |
| T5 | Host disk failure (single NVMe) | | low/year | nightly off-host copy to onyx (Tailscale) |
| T6 | Ransomware / host compromise | | low | restic repo on onyx kept append-only by convention (sftp chroot; see §6 caveat) |
| T7 | Operator rm -rf or wrong docker compose down -v | | low | retention floor (4 weekly + 12 monthly) survives a recent rm |
| T8 | Backup script silently failing (current state) | | OBSERVED | heartbeat alert + monthly restore drill (§7) |

T8 is the one that just bit us. The single most important addition is alerting on missed runs, not the storage tech.


## 2. RPO / RTO

| Class | Data | RPO | RTO | Backup mechanism |
|---|---|---|---|---|
| A | playerdata (world/playerdata/*.dat, stats/, advancements/) | 5 min | < 2 min per player | rcon save-all flush → rsync to local snapshot, then restic-add |
| B | full world (region files, end + nether) | 1 h during play, 6 h otherwise | 15 min | restic of world*/ |
| C | plugin configs + LuckPerms YAML | 24 h | 30 min | tar of plugins/*/config*.yml + LP file dump |
| D | LuckPerms / Homestead SQLite DBs (*.db, homestead_data.db) | 1 h | 5 min | sqlite .backup then restic-add |
| E | host-level configs (docker-compose.yml, server.properties, purpur.yml, bukkit.yml, paper-*.yml, whitelist.json, ops.json, banned-*.json, config/) | 24 h | 5 min | already in git repo _github/minecraft-server/; backup just covers drift |

Justification for RPO=5 min on Class A: the void-death case rebuilds in seconds — recovering one <uuid>.dat is a ~30 s operation if a 5-min-old snapshot exists. Snapshotting just the 1.3 MB playerdata/ dir is cheap (single-digit MB/day after dedup).
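For Class D, the "sqlite .backup then restic-add" step means using SQLite's online backup command rather than cp on a live file. A minimal sketch, assuming a /tmp dump dir and the repo layout from §4:

```bash
# take a consistent copy via SQLite's backup API, then snapshot the copy
mkdir -p /tmp/mc-db-dumps
sqlite3 /opt/docker/minecraft/homestead_data.db \
    ".backup /tmp/mc-db-dumps/homestead_data.db"
restic -r /home/user/restic/mc-world backup --tag world /tmp/mc-db-dumps
```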


## 3. Tool choice — Restic

Compared:

| Tool | Dedup | Encryption | Snapshots | Network destinations | Verdict |
|---|---|---|---|---|---|
| restic | content-addressed, very effective on MC region files | AES-256, repo-key | yes | sftp (Tailscale), local, B2, S3, Azure, rclone | WINNER |
| borgbackup | similar | yes | yes | ssh only, lock-on-write | Equally good; restic chosen because the operator already plans restic + autorestic per infra/STATE.md line 112, and an sftp dest is simpler than borg's required server-side binary |
| rsnapshot | hardlinks, no dedup | none | rotated dirs | local + rsync | No encryption ⇒ off-host copy over Tailscale (already encrypted) is fine, but no dedup means 18 G × N snapshots is painful. Reject. |
| zfs send | block-level | zfs native | yes | | nullstone is ext4/LVM: no ZFS, no btrfs. Reject. |
| LVM snapshot | COW | none | yes | local only | Same-disk only, doesn't survive disk failure. Useful as a staging primitive only. |
| custom rsync + cp -al | hardlinks | none | yes | yes | Reinventing rsnapshot. Reject. |
| itzg BACKUP_* env | tar to volume | none | rotation | local | Already tried in spirit by the current backup.sh; same-disk; not granular. Reject as primary. |

Decision: restic for Classes A, B, C, D. Continue using a thin tar wrapper for Class E (configs are already in the git repo, this is just safety).

Restic strengths for our case:

- Region files dedup very well (chunks unchanged across snapshots).
- A 5-min Class-A snapshot adds ~70 KB net to the repo (see §5), not the full 1.3 MB × N.
- One repo on local disk + one mirror to onyx via rclone serve restic or direct sftp — no agent needed on onyx beyond ssh.
- restic check --read-data-subset=5% is the canonical scrub.

Apt: apt install restic on trixie ships 0.16.x — sufficient.
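One-off setup before the first timer run would look roughly like this (the password file is the one named in /etc/mc-backup.env, §8.2):

```bash
apt install restic mcrcon
restic -r /home/user/restic/mc-frequent init --password-file /etc/mc-backup.pw
restic -r /home/user/restic/mc-world    init --password-file /etc/mc-backup.pw
```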


## 4. Schedule

All times Europe/London (matches TZ in compose file).

| Job | Cadence | Source | Destination | Mechanism |
|---|---|---|---|---|
| A — playerdata | every 5 min | world/playerdata/, world/stats/, world/advancements/, world*/level.dat, *.db (LP + homestead) | restic repo /home/user/restic/mc-frequent/ | systemd timer mc-backup-frequent.timer |
| B — full world | every 1 h during play (07:00–01:00), every 6 h otherwise | world/, world_nether/, world_the_end/ | restic repo /home/user/restic/mc-world/ | systemd timer mc-backup-world.timer |
| C — configs + plugins | daily 02:00 | /opt/docker/minecraft/*.yml, *.json, plugins/*/config*.yml, plugins/LuckPerms/, docker-compose.yml | restic repo mc-world (path-tagged) | reuse same timer with a second backup target |
| D — DB dumps | every 1 h | homestead_data.db, plugins/CoreProtect/database.db, plugins/LuckPerms/luckperms-h2-* | restic repo mc-world | timer hook runs sqlite3 .backup first |
| E — off-host mirror | nightly 03:30 | nullstone /home/user/restic/ | onyx 100.64.0.1:/home/admin/backups/nullstone-mc-restic/ | restic copy over sftp (Tailscale) into the restricted onyx account (§6) |
| F — verify | weekly Sun 04:00 | both repos | | restic check --read-data-subset=5%, then alert on rc |
| G — drill | monthly, 1st Sat 11:00 | random snapshot | scratch dir | §7 procedure |

### Why this works for the void-death case

T1 hits at 18:42. By 18:45 a Class-A snapshot exists containing the player's <uuid>.dat from 18:40. Restore: restic -r ... restore --target /tmp/r --include 'world/playerdata/<uuid>.dat' latest, stop server (or /save-off + minimanip), copy file into place, /save-on. Total RTO < 2 min.
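Spelled out as a command sequence (illustrative; <uuid> stays a placeholder, and because the backup uses absolute paths the restored tree mirrors them):

```bash
# 1. pull the one .dat from the freshest snapshot into a scratch dir
restic -r /home/user/restic/mc-frequent restore latest \
    --target /tmp/r --include 'world/playerdata/<uuid>.dat'

# 2. pause saves, swap the file in (player must be offline), resume
mcrcon -H 127.0.0.1 -P 25575 -p "$RCON_PASS" "save-off"
cp /tmp/r/opt/docker/minecraft/world/playerdata/<uuid>.dat \
   /opt/docker/minecraft/world/playerdata/
mcrcon -H 127.0.0.1 -P 25575 -p "$RCON_PASS" "save-on"
```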


## 5. Retention

Restic policy (passed to restic forget --keep-*):

```
--keep-last 24            # 24 most recent (covers 2 h of 5-min snapshots)
--keep-hourly 24          # 24 h of hourly
--keep-daily 7            # 7 days
--keep-weekly 4           # 4 weeks
--keep-monthly 12         # 12 months
```

Applied per tag — Class A snapshots tagged playerdata, B/C/D tagged world. Forget is run only on the local repo; the onyx mirror inherits the same policy via restic copy after the local forget+prune.
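Per tag, that comes out to roughly (repo paths from §4):

```bash
restic -r /home/user/restic/mc-frequent forget --tag playerdata \
    --keep-last 24 --keep-hourly 24 --keep-daily 7 \
    --keep-weekly 4 --keep-monthly 12 --prune
restic -r /home/user/restic/mc-world forget --tag world \
    --keep-hourly 24 --keep-daily 7 --keep-weekly 4 --keep-monthly 12 --prune
```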

### Storage budget

- Class A: 1.3 MB raw × dedup (~20× on .dat, mostly empty NBT slots) → ~70 KB / snapshot net.
  - 12/h × 24 h × 7 = 2 016 snapshots/week → < 150 MB/week.
- Class B/C/D: 18 G raw → ~6.5 G compressed (per the current 3.6 G figure, adjusted for the nether/end now being active). Restic dedup on hourly snapshots: ~50–200 MB delta/snapshot during active play.
  - 24 hourly + 7 daily + 4 weekly + 12 monthly ≈ 47 retained → estimate 15–25 GB total at steady state.
- E (off-host): same as above on onyx (1.6 TB free — 30× headroom).

Conclusion: comfortably fits in nullstone's 142 G free. Onyx is essentially unconstrained.


## 6. Off-host destination — onyx via Tailscale

Choice: onyx (100.64.0.1, 1.6 TB free on /home). Reasons:

- Already in the tailnet (tag:admin), already trusted, already SSH-reachable.
- 1.6 TB is 100× the dataset.
- Operator's daily-driver: a missed-backup alert on onyx is seen.
- Deferred (phase 2): replicate to friend's RTX 4080 PC (100.64.0.3) for true geographic separation. The tailnet IP is stable across the friend's ISP IP changes per memory project_friend_gpu.

Mechanics:

  1. On onyx: create a restricted user mc-backup with ~/backups/nullstone-mc-restic/ and a ~/.ssh/authorized_keys entry that only allows internal-sftp chrooted to that dir, no shell, no port-forward (Match User mc-backup ... ChrootDirectory %h, ForceCommand internal-sftp -d /backups/nullstone-mc-restic; sketched below).
  2. On nullstone: install nullstone's ssh public key on onyx for that user. Caveat: restic keys carry no permission levels, and plain sftp cannot enforce append-only (that is a rest-server feature), so a compromised nullstone could still run forget/prune against the onyx repo; the chroot limits what else it can touch but cannot block deletion of pack files the user owns. Practical compromise: rely on restic copy being add-only in normal operation, audit forget runs, and see open question 3 for a stronger option.
  3. Nightly job on nullstone: restic -r sftp:mc-backup@100.64.0.1:/backups/nullstone-mc-restic copy --from-repo /home/user/restic/mc-world latest && ... mc-frequent ....
  4. Onyx-side cron weekly: restic check on the mirror (independent verification).
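The sshd stanza from step 1, sketched (drop-in path and exact options are assumptions; sshd additionally requires the chroot parent to be root-owned and not group- or world-writable):

```
# /etc/ssh/sshd_config.d/mc-backup.conf on onyx
Match User mc-backup
    ChrootDirectory %h
    ForceCommand internal-sftp -d /backups/nullstone-mc-restic
    AllowTcpForwarding no
    X11Forwarding no
    PermitTTY no
```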

Why not friend's GPU PC? Windows host, no built-in SSH, asymmetric trust. Defer to phase 2 once an SMB or rclone serve target is set up there.


## 7. Restore drill (monthly, 1st Saturday 11:00)

Runbook: docs/RUNBOOK-BACKUP-RESTORE.md (created alongside this proposal).

Drill scenario: "YOU500 lost his inventory to a void death 6 minutes ago." Steps:

  1. Pick a known UUID from world/playerdata/ (operator's own UUID).
  2. restic -r /home/user/restic/mc-frequent snapshots --tag playerdata | tail -5 — confirm freshest snapshot is ≤ 6 min old.
  3. restic -r ... restore latest --target /tmp/drill-$(date +%s) --include 'world/playerdata/<uuid>.dat'.
  4. nbted or python -m nbtlib parse the .dat — confirm it's a valid GZIP NBT structure (not zero bytes, not partial).
  5. diff against the live .dat — log the differences (expected: at least the inventory NBT path differs because player kept playing).
  6. Repeat from the onyx mirror repo to prove off-host works end-to-end.
  7. Log result to docs/RUNBOOK-BACKUP-RESTORE.md § Drill log.

Drill is non-destructive — never overwrite live .dat during a drill. Real restores follow §3 of the runbook.

Pass criteria: both restores complete in < 2 min wall-clock and the parsed NBT root tag is well-formed.
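For step 4, the cheapest first probe is the gzip layer, since player .dat files are gzip-compressed NBT; a sketch, with the drill path hypothetical:

```bash
# catches zero-byte and truncated restores before any NBT parsing
f="$(find /tmp/drill-* -name '<uuid>.dat' | head -1)"
gunzip -t "$f" && echo "gzip stream intact"
```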


## 8. Implementation — concrete drafts

Two layers: a fix to the existing daily script (Class E plus the stop-gap daily tar) and new sidecar timers for Classes A/B/C/D.

### 8.1 Fix /opt/docker/backup.sh (F-backup-1)

Already documented in infra/runbooks/MIGRATION-nullstone-to-cobblestone.md §5. Minimum work:

- Drop the dead matrix-postgres block (Synapse retired).
- Drop / fix the mongodb block (RC stopped 2026-05-06).
- Remove the orphaned chmod 600 ...synapse-signing-key... block at L119–122 (causing set -e to exit before the MC block on most days).
- Wrap each module in ( ... ) || log "module FAILED" so one module's failure doesn't skip the rest (sketched below).
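A minimal sketch of that wrapping pattern (illustrative; the log helper and tar invocation stand in for whatever each module already does):

```bash
log() { echo "$(date -Is) $*" >> /opt/backups/backup.log; }

# subshell per module: a failure is logged but doesn't abort the script
(
    tar -czf "/opt/backups/$(date +%F)/minecraft-world.tar.gz" \
        -C /opt/docker/minecraft world world_nether world_the_end
    log "minecraft OK"
) || log "minecraft FAILED"
```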

Out-of-scope for this strategy doc — track in infra audit.

### 8.2 New: mc-backup-frequent (Class A) and mc-backup-world (Classes B/C/D)

Drop-in files (operator review before deploy):

/etc/systemd/system/mc-backup-frequent.service

```ini
[Unit]
Description=Minecraft frequent backup (playerdata, every 5 min)
After=docker.service
Wants=docker.service

[Service]
Type=oneshot
User=user
Group=docker
EnvironmentFile=/etc/mc-backup.env
ExecStart=/usr/local/bin/mc-backup-frequent.sh
Nice=10
IOSchedulingClass=best-effort
IOSchedulingPriority=7
```

/etc/systemd/system/mc-backup-frequent.timer

```ini
[Unit]
Description=Run mc-backup-frequent every 5 minutes

[Timer]
OnBootSec=2min
OnUnitActiveSec=5min
AccuracySec=30s
Persistent=true

[Install]
WantedBy=timers.target
```

/etc/mc-backup.env (mode 0600, owner user:docker)

```
RESTIC_REPOSITORY_FREQUENT=/home/user/restic/mc-frequent
RESTIC_REPOSITORY_WORLD=/home/user/restic/mc-world
RESTIC_PASSWORD_FILE=/etc/mc-backup.pw
MC_DATA=/opt/docker/minecraft
RCON_HOST=127.0.0.1
RCON_PORT=25575
RCON_PASS=${RCON_PASSWORD}
HEARTBEAT_URL=https://ntfy.s8n.ru/mc-backup-frequent
ALERT_URL=https://ntfy.s8n.ru/mc-backup-alerts
TS_OFFHOST_USER=mc-backup
TS_OFFHOST_HOST=100.64.0.1
TS_OFFHOST_PATH=/backups/nullstone-mc-restic
```

/usr/local/bin/mc-backup-frequent.sh

```bash
#!/usr/bin/env bash
set -euo pipefail
set -a                 # export env-file vars so restic/curl child procs see them
. /etc/mc-backup.env   # (systemd also injects this via EnvironmentFile=)
set +a

trap 'curl -fsS -m 10 -d "fail rc=$?" "$ALERT_URL" >/dev/null || true' ERR

# 1. Ask MC to flush via rcon (best-effort; don't fail backup if rcon down)
if command -v mcrcon >/dev/null 2>&1; then
    mcrcon -H "$RCON_HOST" -P "$RCON_PORT" -p "$RCON_PASS" -w 1 \
        "save-all flush" >/dev/null 2>&1 || true
fi

# 2. Snapshot just the small fast-changing things
restic -r "$RESTIC_REPOSITORY_FREQUENT" backup \
    --tag playerdata \
    --tag auto-5min \
    --host nullstone \
    --exclude='*.lock' \
    "$MC_DATA/world/playerdata" \
    "$MC_DATA/world/stats" \
    "$MC_DATA/world/advancements" \
    "$MC_DATA/world/level.dat" \
    "$MC_DATA/world_nether/level.dat" \
    "$MC_DATA/world_the_end/level.dat" \
    "$MC_DATA/homestead_data.db" \
    "$MC_DATA/plugins/LuckPerms" \
    "$MC_DATA/plugins/CoreProtect/database.db" || {
    rc=$?
    # restic exit code 3 = some source files unreadable (tolerable mid-write);
    # anything else should alert and abort
    if [ "$rc" -ne 3 ]; then
        curl -fsS -m 10 -d "restic backup failed rc=$rc" "$ALERT_URL" >/dev/null || true
        exit "$rc"
    fi
}

# 3. Cheap retention (only on local repo)
restic -r "$RESTIC_REPOSITORY_FREQUENT" forget --tag auto-5min \
    --keep-last 24 --keep-hourly 24 --keep-daily 7 \
    --prune --quiet

# 4. Heartbeat — the ntfy-side dead-man switch alerts if this stops arriving
curl -fsS -m 5 -d "ok" "$HEARTBEAT_URL" >/dev/null || true
```

mc-backup-world.{service,timer,sh} — same shape; runs hourly during play and every 6 h otherwise (use OnCalendar=*-*-* 07,08,...,01:00:00 or two timers; sketched after the copy snippet), backs up the full world*/, configs, and DB dumps. After the local backup it runs:

```bash
restic copy \
    --from-repo "$RESTIC_REPOSITORY_WORLD" \
    -r "sftp:$TS_OFFHOST_USER@$TS_OFFHOST_HOST:$TS_OFFHOST_PATH" \
    latest
```

And once nightly (separate timer) the same copy for mc-frequent.
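A sketch of the split cadence as two calendar lines (times assumed from §4; note the resulting 01:00 → 07:00 gap is exactly the 6-h off-peak cadence):

```ini
[Timer]
# hourly through the play window: 07:00-23:00 plus 00:00-01:00
OnCalendar=*-*-* 07..23:00:00
OnCalendar=*-*-* 00..01:00:00
Persistent=true
```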

### 8.3 docker-compose.override.yml — alternative path (rejected)

Considered: itzg image supports BACKUP_INTERVAL, BACKUP_METHOD=restic. Pros: in-container, knows when world is loaded. Cons:

- Bind-mount to host restic repo crosses the userns-remap boundary (uid 100000 vs host uid 1000) — already a known nullstone footgun (memory project_nullstone_docker_userns).
- Container restart wipes the restic cache, slow first run after every reboot.
- Mixing in-image and host-cron backup logic doubles failure surfaces.

Decision: keep backups in systemd on the host; container is unaware. Override file is not part of this proposal.


## 9. Monitoring & alerting

Four signals, all routed to ntfy on the existing self-hosted ntfy.s8n.ru (assumed to exist; if not, add it as part of phase 1 — a single-container deploy). DiscordSRV was dropped on 2026-04-30 per README.md L170, so Discord is not an option.

| Signal | Trigger | Channel |
|---|---|---|
| mc-backup-frequent heartbeat | timer fires successfully | ntfy topic mc-backup-frequent (silent on success) |
| Heartbeat missing > 15 min | dead-man's switch on the ntfy server, or external (healthchecks.io is free + self-hostable) | ntfy topic mc-backup-alerts (high priority) |
| restic check | weekly, non-zero rc | ntfy topic mc-backup-alerts (high priority) |
| Off-host mirror failure | restic copy non-zero rc | ntfy topic mc-backup-alerts (high priority) |

Operator subscribes onyx + phone to mc-backup-alerts only. The -frequent topic is a heartbeat sink (not a notification stream).
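For reference, publishing an alert is a single HTTP POST; a sketch of the weekly-check failure case (Title and Priority are standard ntfy headers, topic from the table):

```bash
restic -r /home/user/restic/mc-world check --read-data-subset=5% || \
curl -fsS -m 10 \
    -H "Title: mc-world restic check FAILED" \
    -H "Priority: high" \
    -d "weekly restic check returned non-zero on nullstone" \
    https://ntfy.s8n.ru/mc-backup-alerts
```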

Alternative if no ntfy yet: write to /var/log/mc-backup.log AND touch a tiny status file /var/lib/mc-backup/last-success (mtime checked by an external monitor; Gatus and Beszel are both on the roadmap). Until either of those lands, a simple cron on onyx that checks the file's mtime over ssh and triggers a desktop notify-send is enough; sketched below.
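That interim check as a crontab entry (a sketch; assumes working key-based ssh from onyx, and note notify-send from cron additionally needs the session's DBUS address exported):

```bash
*/5 * * * * ssh -o ConnectTimeout=5 user@nullstone 'find /var/lib/mc-backup/last-success -mmin -15 | grep -q .' || notify-send -u critical "nullstone mc-backup heartbeat stale"
```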

This addresses T8 (the silent-failure threat) directly.


## 10. Cost & capacity

Hardware cost: £0. Uses existing nullstone NVMe + onyx NVMe + existing Tailscale mesh.

Disk consumption (steady state, both repos):

| Where | Estimate | Headroom |
|---|---|---|
| nullstone /home/user/restic/mc-frequent | < 1 GB | 142 G free → ~140× |
| nullstone /home/user/restic/mc-world | 15–25 GB | ~6× |
| onyx ~/backups/nullstone-mc-restic/ | 16–26 GB | 1.6 T free → ~60× |

Retention vs free space: even if the world doubles to 36 GB raw, dedup keeps growth roughly linear at ~5 % per snapshot — well over a year of monthly retention fits.

Network: Tailscale LAN-direct (5 ms onyx ↔ nullstone). Nightly delta typically < 500 MB after dedup. Negligible.

Operator time: ~2 h initial deploy, ~10 min/month for the drill, ~zero on autopilot.


## 11. Phase plan

| Phase | What | When | Blocker |
|---|---|---|---|
| 0 | This doc + runbook stub written, reviewed | TODAY | |
| 1 | Stop the bleeding: fix backup.sh orphan lines so the daily MC tar at least runs again | TODAY (15 min) | |
| 2 | Stand up mc-backup-frequent timer + local restic repo (Class A) | this week | needs apt install restic mcrcon |
| 3 | Add mc-backup-world timer + Classes B/C/D | this week | |
| 4 | Onyx off-host SFTP target + restic copy job | this week | onyx user provisioning + ssh key |
| 5 | First monthly drill | next 1st Saturday | |
| 6 | Wire ntfy alerts | when ntfy/Gatus deployed (infra roadmap) | external |
| 7 | Friend RTX 4080 PC as second off-host (geographic) | phase 2 | Windows-side tooling |

Phases 1–4 are doable today with what's on hand. Nothing in phases 1–5 requires purchasing.


## 12. Open questions for operator

  1. ntfy.s8n.ru — does it exist yet? Memory hints at Tuwunel + Matrix on txt.s8n.ru. If ntfy isn't deployed, decide: deploy ntfy now, or use Matrix room via Tuwunel webhook bridge as alert sink.
  2. Onyx user mc-backup — create today or reuse existing admin with restricted authorized_keys? Restricted user is cleaner; reusing admin is faster.
  3. Append-only enforcement on the onyx side — accept "sftp chroot + no shell" as good-enough, or invest in running restic's rest-server with --append-only on onyx (more work, but actual enforcement)?
  4. Pre-flight world validation — run region-fixer against the latest snapshot weekly to catch silent corruption (T3)? Adds ~5 min compute weekly. Recommend yes.
  5. Class-E (host configs) — already in live-server/ git repo via Syncthing/manual? If yes, drop Class E from this scheme; if no, add it.

## 13. References

- docs/BACKUP.md — current (broken) state docs.
- docs/RUNBOOK-BACKUP-RESTORE.md — operational runbook (this commit).
- scripts/backup.sh — to-be-fixed daily script (F-backup-1 in infra/STATE.md).
- _github/infra/STATE.md — Top-5 weakness #2 + #5 tracking this work.
- _github/infra/runbooks/MIGRATION-nullstone-to-cobblestone.md §5 — F-backup-1 detail; nullstone-as-spare hint.
- Memory: project_friend_gpu (Tailscale stable IP for friend), project_tailscale_mesh (mesh layout), project_nullstone_docker_userns (why container-side backup is rejected).
- CLAUDE.md Device Registry — onyx 192.168.0.28 / 100.64.0.1.