minecraft-server/docs/RUNBOOK-BACKUP-RESTORE.md
s8n a1cc3940cf docs: 2026-05-07 incident audit + backup strategy
Player YOU500 lost full inventory to AuthLimbo void-death at 17:13:39.
Investigation revealed deployed /opt/docker/backup.sh is an 88-line stub
missing the Minecraft block; last successful world backup 2026-05-02
(already pruned). No recoverable .dat exists.

Files:
- AUDIT-2026-05-07.md — server-side findings F-01..F-06 (P0 backups,
  no-keepInventory, AuthLimbo silent failure, chunk preload race,
  Xmx > container headroom, container hardening gaps)
- BACKUP-HUNT-2026-05-07.md — exhaustive backup scan; only 6-week-old
  archive at _archive/minecraft-old-2026-04-27.tar.gz
- BACKUP-STRATEGY.md — restic-based plan; 5min/hourly/daily classes,
  off-host to onyx via Tailscale, monthly drill
- CROSS-REFERENCE-2026-05-07.md — repo+doc landing map; flags
  pre-existing infra/STATE.md backup-broken note + HA-CLUSTER restic
  draft to extend rather than duplicate
- docs/RUNBOOK-BACKUP-RESTORE.md — operator runbook for .dat restore,
  full-world restore, host-loss restore, drill log
2026-05-07 17:33:24 +01:00


Runbook — Backup & Restore (Minecraft, racked.ru on nullstone)

Strategy doc: ../BACKUP-STRATEGY.md. This runbook is the operator-facing procedure for the three scenarios that come up in practice. Keep it short, copy-paste-able, and reachable from the player support workflow.

Status (2026-05-07): This runbook is written ahead of the implementation it describes. The mc-backup-frequent timer and onyx mirror are NOT yet deployed. The "What if no snapshot exists yet?" section at the bottom covers today's reality.


TL;DR — restore one player's .dat from N minutes ago

# On nullstone, as `user`:
PUUID=<player-uuid>          # e.g. from /opt/docker/minecraft/usercache.json
WHEN=latest                  # or "5 min ago", or a snapshot id
RESTIC_PASSWORD_FILE=/etc/mc-backup.pw \
restic -r /home/user/restic/mc-frequent \
    restore "$WHEN" \
    --target /tmp/restore-$$ \
    --include "world/playerdata/${PUUID}.dat"

# Verify the file is well-formed NBT before applying:
file /tmp/restore-$$/opt/docker/minecraft/world/playerdata/${PUUID}.dat
# Expected: "gzip compressed data"

# Apply (server must be running so playerdata is writable; the player
# MUST be offline or we're racing the writer):
NICK=<player-name>           # kick takes the player's name, not the UUID
mcrcon -H 127.0.0.1 -P 25575 -p *redacted* "kick ${NICK} Restore in progress"
mcrcon -H 127.0.0.1 -P 25575 -p *redacted* "save-off"
mcrcon -H 127.0.0.1 -P 25575 -p *redacted* "save-all flush"

cp /opt/docker/minecraft/world/playerdata/${PUUID}.dat \
   /opt/docker/minecraft/world/playerdata/${PUUID}.dat.preFix-$(date +%s)
cp /tmp/restore-$$/opt/docker/minecraft/world/playerdata/${PUUID}.dat \
   /opt/docker/minecraft/world/playerdata/${PUUID}.dat
sudo chown 100000:100000 /opt/docker/minecraft/world/playerdata/${PUUID}.dat   # userns-remap; changing the owner needs root

mcrcon -H 127.0.0.1 -P 25575 -p *redacted* "save-on"
# Tell the player to log back in.
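The `file` check above only confirms the gzip magic bytes. A slightly stricter sanity check can be sketched as below. The function name `nbt_looks_valid` is hypothetical and not part of any existing tooling on nullstone; it assumes player .dat files are gzip-compressed NBT whose root tag is a compound (first decompressed byte 0x0a). It does not parse the full NBT structure.

```shell
# Hypothetical helper: returns 0 when FILE is an intact gzip stream whose
# first decompressed byte is 0x0a (TAG_Compound, the root of a player .dat).
nbt_looks_valid() {
    local f=$1 first
    [ -s "$f" ] || return 1                    # missing or empty file
    gzip -t < "$f" 2>/dev/null || return 1     # truncated or corrupt gzip stream
    # Read exactly one decompressed byte and render it as two hex digits.
    first=$(gzip -dc < "$f" 2>/dev/null | head -c 1 | od -An -tx1 | tr -d ' \n')
    [ "$first" = "0a" ]
}
```

Usage would look like `nbt_looks_valid "/tmp/restore-$$/opt/docker/minecraft/world/playerdata/${PUUID}.dat" || echo "do not apply this file"`, run before the kick and copy steps.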

Why kick + save-off: if the player is online, the server holds their NBT in memory and rewrites the .dat on the next save tick, clobbering the restore. Kicking makes the server write that player's state one last time on disconnect and then drop it from memory, so nothing rewrites the file after the copy; save-off halts auto-save while the copy is in progress.

Userns-remap reminder: the host sees container-uid 100000 for files written by the MC process. Restored files written by user (uid 1000) will appear empty/permission-denied to the container. Always chown 100000:100000 (or chmod 666) after restore. Memory: project_nullstone_docker_userns.
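Because a wrong owner fails silently from the operator's point of view, it may be worth confirming ownership after every restore. The following is a minimal sketch; `owned_by` is a hypothetical helper, and it assumes GNU `stat` (as shipped with Debian).

```shell
# Hypothetical helper: succeeds only if FILE is owned by the expected
# uid:gid. Defaults to 100000:100000, the userns-remapped container root.
owned_by() {
    local f=$1 want=${2:-100000:100000}
    [ "$(stat -c '%u:%g' "$f" 2>/dev/null)" = "$want" ]
}
```

For example, `owned_by "/opt/docker/minecraft/world/playerdata/${PUUID}.dat" || echo "fix ownership before save-on"`.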


Scenario 1 — Player lost inventory (T1, the void-death case)

This is what the strategy was written for. RTO target: < 2 minutes.

  1. Find the UUID:
    grep -i 'NICK' /opt/docker/minecraft/usercache.json
    
  2. Pick a snapshot just before the loss. restic snapshots --tag playerdata shows timestamps.
  3. Run the TL;DR block above with that snapshot id (or latest if loss happened in the last 5 min).
  4. Inform the player: "Your inventory from HH:MM has been restored. Anything you picked up after that point is gone."
  5. Log the incident: append to docs/INCIDENTS.md (create if absent) — date, player, snapshot id, cause.
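Step 5 can be made a one-liner so it does not get skipped under time pressure. This is a sketch only: `log_incident` is a hypothetical helper, and the header and row layout are assumptions, since docs/INCIDENTS.md has no defined format yet.

```shell
# Hypothetical helper: append one incident row to FILE, creating the file
# (and its parent directory) on first use. The layout is an assumption.
log_incident() {
    local file=$1 player=$2 snapshot=$3 cause=$4
    mkdir -p "$(dirname "$file")"
    [ -f "$file" ] || printf '# Incidents\n\n' > "$file"
    # Fields: UTC timestamp | player | snapshot id | cause
    printf '%s | %s | %s | %s\n' \
        "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$player" "$snapshot" "$cause" >> "$file"
}
```

For example, `log_incident docs/INCIDENTS.md YOU500 <snapshot-id> "AuthLimbo void-death"`.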

Scenario 2 — Whole world rolled back (T2/T3, griefing or corruption)

RTO target: 15 minutes. Server downtime expected.

  1. Announce, kick, stop:
    mcrcon ... "say Server going down for restore — back in ~15 min"
    mcrcon ... "kick @a Restore in progress"
    cd /opt/docker/minecraft && docker compose down
    
  2. Move live data aside (do not delete):
    mv /opt/docker/minecraft /opt/docker/minecraft.broken-$(date +%F)
    mkdir -p /opt/docker/minecraft
    
  3. Restore from the world repo:
    RESTIC_PASSWORD_FILE=/etc/mc-backup.pw \
    restic -r /home/user/restic/mc-world \
        restore <snapshot-id> --target /tmp/world-restore
    rsync -aHAX /tmp/world-restore/opt/docker/minecraft/ /opt/docker/minecraft/
    
  4. Re-apply userns-remap perms (critical — see memory):
    chmod -R 777 /opt/docker/minecraft   # quickfix; or chown -R 100000:100000
    
  5. Boot:
    cd /opt/docker/minecraft && docker compose up -d
    docker logs -f minecraft-mc   # watch for "Done" line
    
  6. Verify that a known-good player's .dat still parses, then announce that the server is back up.
  7. Keep minecraft.broken-YYYY-MM-DD/ for at least 7 days for forensic comparison.
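Watching the log by eye in step 5 works, but a scripted wait makes the 15-minute RTO easier to measure. The sketch below assumes the server prints the usual startup line of the form `Done (12.345s)! For help, type "help"`; `wait_for_done` is a hypothetical helper name.

```shell
# Hypothetical helper: reads server log lines on stdin and returns 0 as soon
# as the startup-complete line appears, or 1 if the stream ends without it.
wait_for_done() {
    # Accept '.' or ',' as the decimal separator in the elapsed time.
    grep -q -m1 -E 'Done \([0-9.,]+s\)!'
}
```

A possible invocation is `timeout 300 docker logs -f minecraft-mc 2>&1 | wait_for_done && echo "server up"`. The `timeout` bounds the wait, because `docker logs -f` only notices the closed pipe when it writes its next line.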

Scenario 3 — Host disk dead (T5)

RTO target: a few hours, depending on the hardware swap.

  1. New host: install Debian 13 + Docker per _github/infra/runbooks/MIGRATION-nullstone-to-cobblestone.md.
  2. apt install restic. Pull the password from operator's password manager into /etc/mc-backup.pw.
  3. Initialise destination dir, then restore from onyx mirror (not local — local is gone):
    RESTIC_PASSWORD_FILE=/etc/mc-backup.pw \
    restic -r sftp:mc-backup@100.64.0.1:/backups/nullstone-mc-restic \
        restore latest --target /tmp/world-restore
    
  4. Run the rsync from Scenario 2 step 3 to copy /tmp/world-restore into place, then continue Scenario 2 from step 4.
  5. Stand up the timers on the new host. Do not point them at the same off-host repo until the new host has been re-keyed (rotate restic passwords as part of disaster recovery).
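Before the first restic call on the new host, a quick check that the password file from step 2 is present and not world-readable can save a failed restore attempt. This is a sketch: `pw_file_ok` is hypothetical, and the owner-only mode requirement (400 or 600) is an assumption rather than something the strategy doc specifies.

```shell
# Hypothetical helper: succeeds if FILE exists, is non-empty, and has an
# owner-only mode. The accepted modes are an assumption.
pw_file_ok() {
    local f=$1 mode
    [ -s "$f" ] || return 1                  # must exist and be non-empty
    mode=$(stat -c '%a' "$f") || return 1    # octal permission bits (GNU stat)
    case $mode in
        400|600) return 0 ;;
        *)       return 1 ;;
    esac
}
```

For example, `pw_file_ok /etc/mc-backup.pw || echo "fix /etc/mc-backup.pw before restoring"`.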

Drill log (monthly)

| Date | Operator | Snapshot age | Class A restore time | Off-host restore time | Result |
| --- | --- | --- | --- | --- | --- |
| (first drill: 2026-06-06) | s8n | TBD | TBD | TBD | TBD |

Procedure: see BACKUP-STRATEGY.md §7.
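The two restore-time columns are easier to fill in consistently if the drill commands are wrapped in a timer. A minimal sketch follows; `timed` is a hypothetical helper and measures whole seconds only.

```shell
# Hypothetical helper: run a command, print its wall-clock duration in whole
# seconds on stdout, and pass through the command's exit status. The
# command's own stdout is sent to stderr so only the number is captured.
timed() {
    local start end rc=0
    start=$(date +%s)
    "$@" >&2 || rc=$?
    end=$(date +%s)
    echo "$((end - start))"
    return "$rc"
}
```

For example, `secs=$(timed restic -r /home/user/restic/mc-frequent restore latest --target /tmp/drill)` leaves the Class A restore time in `secs`, ready for the table above.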


What if no snapshot exists yet? (CURRENT REALITY 2026-05-07)

Until phases 1–4 of BACKUP-STRATEGY.md are deployed, the only recovery resources are:

| Source | What's there | Recoverable? |
| --- | --- | --- |
| /opt/backups/202604xx_020001/mc-world-backup-*.tar.gz | World tar from Apr 29 + May 2 (others FAILED) | GONE: pruned by 7-day retention |
| /opt/backups/mc-plugins-prerebrand-2026-04-30.tar.gz | Plugin jars only, no world | Not useful for player data |
| Live /opt/docker/minecraft/world/playerdata/<uuid>.dat_old | MC's own .dat_old shadow file from previous save | YES: last save tick before current. First-line defence right now. |
| CoreProtect DB (plugins/CoreProtect/database.db) | Block + container actions, NOT inventory state | Partial: can roll back grief, can't restore lost items |

Today's playbook for inventory-loss reports:

  1. Server console → co lookup u:NICK to confirm the loss event in CoreProtect.
  2. Stop the server immediately if the report comes in within the same play session — every save tick overwrites .dat_old. docker compose down buys time.
  3. Inspect world/playerdata/<uuid>.dat_old — if it predates the loss, copy it over <uuid>.dat, fix perms (uid 100000), restart.
  4. If .dat_old is too new (already overwritten): the loss is unrecoverable until BACKUP-STRATEGY phases 1–4 are deployed. Apologise to the player. Spawn-in compensation per operator discretion (ops creative-mode replacement is the customary remedy).
  5. Log the incident — adds urgency to deploying the new strategy.
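The "does .dat_old predate the loss?" decision in step 3 can be checked mechanically by comparing the file's modification time with the time of the loss event. The sketch below uses a hypothetical helper `predates` and assumes GNU `stat` and `date`. It may also be prudent to copy .dat_old aside before stopping the server, on the assumption that a graceful shutdown saves online players once more and rotates the file.

```shell
# Hypothetical helper: succeeds if FILE was last modified strictly before
# LOSS_EPOCH (seconds since the Unix epoch).
predates() {
    local f=$1 loss_epoch=$2 mtime
    mtime=$(stat -c '%Y' "$f" 2>/dev/null) || return 1   # missing file: fail
    [ "$mtime" -lt "$loss_epoch" ]
}
```

For example, `predates world/playerdata/<uuid>.dat_old "$(date -d '2026-05-07 17:13:39' +%s)" && echo "usable"`, with the loss time taken from the CoreProtect lookup in step 1.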

  • Phase 1: fix /opt/docker/backup.sh orphan-line bug (F-backup-1).
  • Phase 2: deploy mc-backup-frequent.timer (Class A, 5-min playerdata).
  • Phase 3: deploy mc-backup-world.timer (Class B/C/D, hourly).
  • Phase 4: provision mc-backup user on onyx + restic copy job.
  • Phase 5: schedule monthly drill calendar entry, run first drill.
  • Phase 6: ntfy / Matrix alert wiring (depends on ntfy deployment).
  • Phase 7: friend RTX 4080 PC as secondary off-host.
  • Verify usercache.json on this host: confirm UUID lookup workflow above resolves to the right .dat.
  • Decide: mcrcon package vs lightweight Python mcrcon lib.
  • Document compensation policy for unrecoverable losses (operator discretion right now).