minecraft-server/docs/RUNBOOK-BACKUP-RESTORE.md
s8n 4c16cebb2b backup: phase 1 + phase 2 scripts; daily script repaired and deployed
Repairs the orphaned synapse-signing-key block at scripts/backup.sh
lines 119-122 that was exiting the script under set -e before the
Minecraft block could run, leaving 5 of the last 7 days without a
world backup and zero usable snapshots after 7-day retention.

Phase 1 (deployed today to /opt/docker/backup.sh on nullstone):
- Repaired script — orphan block removed, MC arm wrapped so failures
  in one tar don't kill the run
- tar exit code 1 ("file changed as we read it") now treated as
  success on the live MC world; spark profiler tmp file noise
  silenced via --ignore-failed-read --warning=no-file-changed
- Plugin DBs (homestead, AuthMe, CoreProtect, LuckPerms) and configs
  now backed up alongside the world
- Sentinel /opt/backups/.last-success stamped only when the world
  arm succeeds — gives outside monitors a single mtime to alert on
- Manually verified end-to-end: 12G world tarball, 492M plugins,
  279M dbs, 14 config files, sentinel updated. Pre-fix script saved
  at /opt/docker/backup.sh.bak-20260507-pre-phase1.

Phase 2 (scripts in repo, deployment pending operator sudo):
- scripts/restic-backup-playerdata.sh — Class A 5-min restic snapshots
  of playerdata/, stats/, advancements/, plugin DBs, LuckPerms;
  rcon save-all flush before snapshot; tag-scoped retention
- scripts/restic-init.sh — one-time bootstrap (root-only) for
  /etc/mc-backup.{env,pw} + repo init at /home/user/restic/
- scripts/systemd/mc-backup-playerdata.{service,timer} — 5-min timer
  with hardening (ProtectSystem=strict, ReadOnlyPaths, etc)
- docs/RUNBOOK-BACKUP-RESTORE.md updated with both phases'
  deployment steps and the operator-action checklist

Off-host mirror to onyx (Phase 4) and class B/C/D world snapshots
(Phase 3) are still TODO — see BACKUP-STRATEGY.md §11 phase plan.
2026-05-07 18:29:30 +01:00

Runbook — Backup & Restore (Minecraft, racked.ru on nullstone)

Strategy doc: ../BACKUP-STRATEGY.md. This runbook is the operator-facing procedure for the three scenarios that come up in practice. Keep it short, copy-paste-able, and reachable from the player support workflow.

Status (2026-05-07): Phase 1 (the daily /opt/docker/backup.sh MC world tarball) is deployed and verified — see "Phase 1 deployment" section near the bottom. Phase 2 (mc-backup-playerdata.timer, 5-min cadence) and the onyx off-host mirror are NOT yet deployed; deployment steps in "Phase 2 deployment" below. Until Phase 2 lands, the daily 02:00 tarball is the only safety net (RPO up to 24h).


TL;DR — restore one player's .dat from N minutes ago

# On nullstone, as `user`:
PUUID=<player-uuid>          # e.g. from /opt/docker/minecraft/usercache.json
PUUID_NICK=<player-name>     # in-game name of the same player, used for the kick below
WHEN=latest                  # or a snapshot id from restic snapshots (see Scenario 1)
RESTIC_PASSWORD_FILE=/etc/mc-backup.pw \
restic -r /home/user/restic/mc-frequent \
    restore "$WHEN" \
    --target /tmp/restore-$$ \
    --include "world/playerdata/${PUUID}.dat"

# Verify the file is well-formed NBT before applying:
file /tmp/restore-$$/opt/docker/minecraft/world/playerdata/${PUUID}.dat
# Expected: "gzip compressed data"

# Apply (server must be running so playerdata is writable; the player
# MUST be offline or we're racing the writer):
mcrcon -H 127.0.0.1 -P 25575 -p *redacted* "kick ${PUUID_NICK} Restore in progress"
mcrcon -H 127.0.0.1 -P 25575 -p *redacted* "save-off"
mcrcon -H 127.0.0.1 -P 25575 -p *redacted* "save-all flush"

cp /opt/docker/minecraft/world/playerdata/${PUUID}.dat \
   /opt/docker/minecraft/world/playerdata/${PUUID}.dat.preFix-$(date +%s)
cp /tmp/restore-$$/opt/docker/minecraft/world/playerdata/${PUUID}.dat \
   /opt/docker/minecraft/world/playerdata/${PUUID}.dat
chown 100000:100000 /opt/docker/minecraft/world/playerdata/${PUUID}.dat   # userns-remap

mcrcon -H 127.0.0.1 -P 25575 -p *redacted* "save-on"
# Tell the player to log back in.

Why kick + save-off: if the player is online, the server holds their NBT in memory and rewrites the .dat on next save tick — clobbering the restore. save-off halts auto-save; kicking guarantees the in-memory state for that player won't be flushed.

Userns-remap reminder: the host sees container-uid 100000 for files written by the MC process. Restored files written by user (uid 1000) will appear empty/permission-denied to the container. Always chown 100000:100000 (or chmod 666) after restore. Memory: project_nullstone_docker_userns.
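
A slightly stronger pre-flight check than file(1) alone, before overwriting the live .dat (a sketch; paths as in the TL;DR block above, DAT is just a convenience variable):

# gunzip -t walks the whole gzip stream, so a truncated restore fails here
# rather than after it has been copied into the live world.
DAT=/tmp/restore-$$/opt/docker/minecraft/world/playerdata/${PUUID}.dat
gunzip -t "$DAT" && echo "gzip stream OK"
stat -c '%s bytes, modified %y' "$DAT"

# After the copy + chown, confirm the container uid owns the live file:
stat -c '%u:%g %n' /opt/docker/minecraft/world/playerdata/${PUUID}.dat   # expect 100000:100000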


Scenario 1 — Player lost inventory (T1, the void-death case)

This is what the strategy was written for. RTO target: < 2 minutes.

  1. Find the UUID:
    grep -i 'NICK' /opt/docker/minecraft/usercache.json
    
  2. Pick a snapshot just before the loss. restic snapshots --tag playerdata shows timestamps; a full command is in the example after this list.
  3. Run the TL;DR block above with that snapshot id (or latest if loss happened in the last 5 min).
  4. Inform the player: "Your inventory from HH:MM has been restored. Anything you picked up after that point is gone."
  5. Log the incident: append to docs/INCIDENTS.md (create if absent) — date, player, snapshot id, cause.
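
A worked example of step 2, assuming the repo path and password file from the TL;DR block:

# Show the last few playerdata snapshots with timestamps; pick the id just
# before the reported loss, then feed it to the TL;DR restore as WHEN=<id>.
RESTIC_PASSWORD_FILE=/etc/mc-backup.pw \
restic -r /home/user/restic/mc-frequent snapshots --tag playerdata --latest 10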

Scenario 2 — Whole world rolled back (T2/T3, griefing or corruption)

RTO target: 15 minutes. Server downtime expected.

  1. Announce, kick, stop:
    mcrcon ... "say Server going down for restore — back in ~15 min"
    mcrcon ... "kick @a Restore in progress"
    cd /opt/docker/minecraft && docker compose down
    
  2. Move live data aside (do not delete):
    mv /opt/docker/minecraft /opt/docker/minecraft.broken-$(date +%F)
    mkdir -p /opt/docker/minecraft
    
  3. Restore from the world repo:
    RESTIC_PASSWORD_FILE=/etc/mc-backup.pw \
    restic -r /home/user/restic/mc-world \
        restore <snapshot-id> --target /tmp/world-restore
    rsync -aHAX /tmp/world-restore/opt/docker/minecraft/ /opt/docker/minecraft/
    
  4. Re-apply userns-remap perms (critical — see memory):
    chmod -R 777 /opt/docker/minecraft   # quickfix; or chown -R 100000:100000
    
  5. Boot:
    cd /opt/docker/minecraft && docker compose up -d
    docker logs -f minecraft-mc   # watch for "Done" line
    
  6. Verify with a known-good UUID's .dat parse (see the check after this list), then announce server up.
  7. Keep minecraft.broken-YYYY-MM-DD/ for at least 7 days for forensic comparison.
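
The step-6 check needs no NBT tooling; a minimal sketch, using any UUID you know was active recently:

KNOWN=/opt/docker/minecraft/world/playerdata/<known-good-uuid>.dat
file "$KNOWN"               # expect "gzip compressed data"
gunzip -t "$KNOWN" && echo "NBT container intact"
ls -l "$KNOWN"              # owner should read 100000 after the step-4 perms fix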

Scenario 3 — Host disk dead (T5)

RTO target: few hours, depends on hardware swap.

  1. New host: install Debian 13 + Docker per _github/infra/runbooks/MIGRATION-nullstone-to-cobblestone.md.
  2. apt install restic. Pull the password from operator's password manager into /etc/mc-backup.pw.
  3. Initialise the destination dir, then restore from the onyx mirror (not local — local is gone). Confirm the repo is reachable first (see the check after this list), then:
    RESTIC_PASSWORD_FILE=/etc/mc-backup.pw \
    restic -r sftp:mc-backup@100.64.0.1:/backups/nullstone-mc-restic \
        restore latest --target /tmp/world-restore
    
  4. Continue Scenario 2 from step 4.
  5. Stand up the timers on the new host. Do not point them at the same off-host repo until the new host has been re-keyed (rotate restic passwords as part of disaster recovery).
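
Reachability check for step 3 (a sketch, assuming the sftp URL and password file from steps 2 and 3):

# A failed SSH key or wrong password file is cheaper to discover here than
# twenty minutes into a world restore.
RESTIC_PASSWORD_FILE=/etc/mc-backup.pw \
restic -r sftp:mc-backup@100.64.0.1:/backups/nullstone-mc-restic snapshots --latest 5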

Drill log (monthly)

| Date | Operator | Snapshot age | Class A restore time | Off-host restore time | Result |
|------|----------|--------------|----------------------|-----------------------|--------|
| (first drill — 2026-06-06) | s8n | TBD | TBD | TBD | TBD |

Procedure: see BACKUP-STRATEGY.md §7.


What if no snapshot exists yet? (CURRENT REALITY 2026-05-07)

Until phases 1–4 of BACKUP-STRATEGY.md are deployed, the only recovery resources are:

| Source | What's there | Recoverable? |
|--------|--------------|--------------|
| /opt/backups/202604xx_020001/mc-world-backup-*.tar.gz | World tar from Apr 29 + May 2 (others FAILED) | GONE — pruned by 7-day retention |
| /opt/backups/mc-plugins-prerebrand-2026-04-30.tar.gz | Plugin jars only, no world | Not useful for player data |
| Live /opt/docker/minecraft/world/playerdata/<uuid>.dat_old | MC's own .dat_old shadow file from the previous save | YES — last save tick before current; first-line defence right now |
| CoreProtect DB (plugins/CoreProtect/database.db) | Block + container actions, NOT inventory state | Partial — can roll back grief, can't restore lost items |

Today's playbook for inventory-loss reports:

  1. Server console → co lookup u:NICK to confirm the loss event in CoreProtect.
  2. Stop the server immediately if the report comes in within the same play session — every save tick overwrites .dat_old. docker compose down buys time.
  3. Inspect world/playerdata/<uuid>.dat_old — if it predates the loss, copy it over <uuid>.dat, fix perms (uid 100000), and restart; commands are sketched after this list.
  4. If .dat_old is too new (already overwritten): the loss is unrecoverable until BACKUP-STRATEGY phases 1–4 are deployed. Apologise to the player. Spawn-in compensation per operator discretion (ops creative-mode replacement is the customary remedy).
  5. Log the incident — adds urgency to deploying the new strategy.
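
A minimal sketch of step 3, assuming the usual paths and a stopped server:

UUID=<player-uuid>
PD=/opt/docker/minecraft/world/playerdata
# Keep the current (post-loss) file for forensics, then promote the shadow copy.
cp "$PD/$UUID.dat" "$PD/$UUID.dat.loss-$(date +%s)"
cp "$PD/$UUID.dat_old" "$PD/$UUID.dat"
chown 100000:100000 "$PD/$UUID.dat"   # userns-remap: container uid must own it
# Then: cd /opt/docker/minecraft && docker compose up -d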

Phase 1 deployment — DONE 2026-05-07

The daily fallback (/opt/docker/backup.sh) was repaired and redeployed. It now backs up MC world (~12 G compressed), plugins (~490 M), plugin DBs (~280 M), and configs nightly at 02:00, prunes after 7 days, and writes a sentinel /opt/backups/.last-success on success.

External monitor (cron on onyx) — the simplest dead-man's switch until ntfy lands:

# Add to onyx crontab, e.g. every 30 min. Shown across several lines for
# readability; join into a single line in the actual crontab entry.
# The ssh check exits non-zero when the sentinel is older than 25h (1500 min),
# missing, or the host is unreachable, so mail is only sent on failure.
*/30 * * * * ssh user@192.168.0.100 \
  'find /opt/backups/.last-success -mmin -1500 | grep -q .' \
  || echo "ALERT: nullstone MC backup sentinel stale (>25h)" \
  | mail -s "MC backup stale" you@example.com

(swap mail for notify-send, ntfy publish, etc once those are wired)

A copy of the pre-fix script is preserved at /opt/docker/backup.sh.bak-20260507-pre-phase1 for forensic reference.


Phase 2 deployment — restic playerdata snapshots every 5 min

Implementation is in this repo:

  • scripts/restic-backup-playerdata.sh — the per-run script
  • scripts/restic-init.sh — one-time bootstrap (must run as root)
  • scripts/systemd/mc-backup-playerdata.{service,timer} — 5-min cadence
  • Strategy + retention + threat model in BACKUP-STRATEGY.md

Deployment status (2026-05-07): NOT YET DEPLOYED — operator action required. restic is not on nullstone; installing it needs sudo, and user's sudo is password-locked. Operator runs:

# On nullstone, as root (sudo -i or via console)
apt-get update && apt-get install -y restic mcrcon

cd /opt/docker
git -C /home/user/repos/minecraft-server pull \
   || git clone ssh://git@192.168.0.100:222/s8n/minecraft-server.git /home/user/repos/minecraft-server
cd /home/user/repos/minecraft-server

# 1) Bootstrap repos + env file
sudo bash scripts/restic-init.sh

# 2) Install systemd units + run script
sudo install -m 644 scripts/systemd/mc-backup-playerdata.service /etc/systemd/system/
sudo install -m 644 scripts/systemd/mc-backup-playerdata.timer   /etc/systemd/system/
sudo install -m 755 scripts/restic-backup-playerdata.sh           /usr/local/bin/

# 3) Enable + start
sudo systemctl daemon-reload
sudo systemctl enable --now mc-backup-playerdata.timer

# 4) Verify
systemctl list-timers mc-backup-playerdata.timer
journalctl -u mc-backup-playerdata.service -n 50 --no-pager
ls -la /home/user/restic/mc-frequent/
restic -r /home/user/restic/mc-frequent --password-file /etc/mc-backup.pw snapshots

The first run should appear within ~7 min (OnBootSec=2min + 5-min cadence).
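
For orientation, the per-run flow of scripts/restic-backup-playerdata.sh looks roughly like the sketch below. The repo copy is authoritative; the retention value, exact path list, and the RCON_PASS variable name here are illustrative only.

#!/usr/bin/env bash
# Sketch only — see scripts/restic-backup-playerdata.sh for the real thing.
set -euo pipefail
source /etc/mc-backup.env            # assumed to export RESTIC_REPOSITORY, RESTIC_PASSWORD_FILE, RCON_PASS

# Flush player state to disk so the snapshot sees current inventories.
mcrcon -H 127.0.0.1 -P 25575 -p "$RCON_PASS" "save-all flush"

# Class A paths: small, change-heavy, restored per-player (plugin DBs and
# LuckPerms are included in the repo copy per BACKUP-STRATEGY.md).
restic backup \
    /opt/docker/minecraft/world/playerdata \
    /opt/docker/minecraft/world/stats \
    /opt/docker/minecraft/world/advancements \
    --tag playerdata

# Tag-scoped retention so world snapshots (Phase 3) are never touched.
restic forget --tag playerdata --keep-within 48h --prune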

Off-host mirror to onyx (Phase 4 — separate)

After Phase 2 is running cleanly for ~24h, provision mc-backup user on onyx with chrooted SFTP, then add a nightly restic copy job from nullstone. See BACKUP-STRATEGY.md §6 for the SFTP chroot config and §11 phase plan.

Until then, the local nullstone repo is single-host — survives operator error and bad config edits, not disk failure. The Phase 1 daily tarball in /opt/backups/ is the only redundancy until §6 lands.
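
When Phase 4 lands, the nightly job is essentially one restic copy invocation; a sketch, assuming the repo paths above (flag names per recent restic releases; older versions use --repo2):

# Mirror local snapshots into the chrooted SFTP repo on onyx.
restic copy \
    --from-repo /home/user/restic/mc-frequent \
    --from-password-file /etc/mc-backup.pw \
    -r sftp:mc-backup@100.64.0.1:/backups/nullstone-mc-restic \
    --password-file /etc/mc-backup.pw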


Operator-action checklist

  • Phase 1: fix /opt/docker/backup.sh orphan-line bug (F-backup-1). Done 2026-05-07.
  • Phase 2: deploy mc-backup-playerdata.timer (Class A, 5-min). Scripts in repo; blocked on operator running apt install restic + restic-init.sh with sudo.
  • Phase 3: deploy mc-backup-world.timer (Class B/C/D, hourly). Script not yet drafted; will mirror playerdata script.
  • Phase 4: provision mc-backup user on onyx + restic copy job.
  • Phase 5: schedule monthly drill calendar entry, run first drill.
  • Phase 6: ntfy / Matrix alert wiring (depends on ntfy deployment).
  • Phase 7: friend RTX 4080 PC as secondary off-host.
  • Verify usercache.json on this host: confirm UUID lookup workflow above resolves to the right .dat.
  • Decide: mcrcon package vs lightweight Python mcrcon lib.
  • Document compensation policy for unrecoverable losses (operator discretion right now).
  • Drop the dead matrix-postgres + mongodb + synapse-* blocks from /opt/docker/backup.sh once retirement is complete (currently they skip as no-ops — minor log noise only).