minecraft-server/docs/RUNBOOK-BACKUP-RESTORE.md
s8n a1cc3940cf docs: 2026-05-07 incident audit + backup strategy
Player YOU500 lost full inventory to AuthLimbo void-death at 17:13:39.
Investigation revealed deployed /opt/docker/backup.sh is an 88-line stub
missing the Minecraft block; last successful world backup 2026-05-02
(already pruned). No recoverable .dat exists.

Files:
- AUDIT-2026-05-07.md — server-side findings F-01..F-06 (P0 backups,
  no-keepInventory, AuthLimbo silent failure, chunk preload race,
  Xmx > container headroom, container hardening gaps)
- BACKUP-HUNT-2026-05-07.md — exhaustive backup scan; only 6-week-old
  archive at _archive/minecraft-old-2026-04-27.tar.gz
- BACKUP-STRATEGY.md — restic-based plan; 5min/hourly/daily classes,
  off-host to onyx via Tailscale, monthly drill
- CROSS-REFERENCE-2026-05-07.md — repo+doc landing map; flags
  pre-existing infra/STATE.md backup-broken note + HA-CLUSTER restic
  draft to extend rather than duplicate
- docs/RUNBOOK-BACKUP-RESTORE.md — operator runbook for .dat restore,
  full-world restore, host-loss restore, drill log
2026-05-07 17:33:24 +01:00


# Runbook — Backup & Restore (Minecraft, racked.ru on nullstone)
Strategy doc: [`../BACKUP-STRATEGY.md`](../BACKUP-STRATEGY.md). This runbook is the **operator-facing** procedure for the three scenarios that come up in practice. Keep it short, copy-paste-able, and reachable from the player support workflow.
> **Status (2026-05-07):** This runbook is written **ahead** of the implementation it describes. The `mc-backup-frequent` timer and onyx mirror are NOT yet deployed. The "What if no snapshot exists yet?" section at the bottom covers today's reality.
---
## TL;DR — restore one player's `.dat` from N minutes ago
```bash
# On nullstone, as `user`:
PUUID=<player-uuid>   # e.g. from /opt/docker/minecraft/usercache.json
NICK=<player-name>    # in-game name, needed for the kick below
WHEN=latest           # or a snapshot id from `restic snapshots`
RESTIC_PASSWORD_FILE=/etc/mc-backup.pw \
restic -r /home/user/restic/mc-frequent \
  restore "$WHEN" \
  --target /tmp/restore-$$ \
  --include "world/playerdata/${PUUID}.dat"
# Verify the file is well-formed NBT before applying:
file /tmp/restore-$$/opt/docker/minecraft/world/playerdata/${PUUID}.dat
# Expected: "gzip compressed data"
# Apply (server stays running; we only pause saves — but the player
# MUST be offline or we're racing the writer):
mcrcon -H 127.0.0.1 -P 25575 -p *redacted* "kick ${NICK} Restore in progress"
mcrcon -H 127.0.0.1 -P 25575 -p *redacted* "save-off"
mcrcon -H 127.0.0.1 -P 25575 -p *redacted* "save-all flush"
cp /opt/docker/minecraft/world/playerdata/${PUUID}.dat \
/opt/docker/minecraft/world/playerdata/${PUUID}.dat.preFix-$(date +%s)
cp /tmp/restore-$$/opt/docker/minecraft/world/playerdata/${PUUID}.dat \
/opt/docker/minecraft/world/playerdata/${PUUID}.dat
chown 100000:100000 /opt/docker/minecraft/world/playerdata/${PUUID}.dat # userns-remap
mcrcon -H 127.0.0.1 -P 25575 -p *redacted* "save-on"
# Tell the player to log back in.
```
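The `file` check above only proves gzip framing. A slightly stronger sketch (the helper name is ours, not a restic or Minecraft tool): every player `.dat` is gzipped NBT whose root tag is a TAG_Compound, so the first decompressed byte must be `0x0a`.

```shell
# Stricter pre-apply check: intact gzip AND NBT TAG_Compound (0x0a) root.
is_player_dat() {   # usage: is_player_dat <path-to.dat>
  gzip -t "$1" 2>/dev/null || return 1
  [ "$(gunzip -c "$1" | head -c1 | od -An -tx1 | tr -d ' ')" = "0a" ]
}
# e.g.: is_player_dat /tmp/restore-$$/opt/docker/minecraft/world/playerdata/${PUUID}.dat
```

If this fails, pick an older snapshot rather than copying a corrupt file over live data.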
**Why kick + `save-off`:** if the player is online, the server holds their NBT in memory and rewrites the `.dat` on next save tick — clobbering the restore. `save-off` halts auto-save; kicking guarantees the in-memory state for that player won't be flushed.
**Userns-remap reminder:** the host sees container-uid `100000` for files written by the MC process. Restored files written by `user` (uid 1000) will appear empty/permission-denied to the container. Always `chown 100000:100000` (or `chmod 666`) after restore. Memory: `project_nullstone_docker_userns`.
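A pre-flight sketch for the ownership reminder above, worth running just before `save-on` so a missed chown surfaces here instead of as a silent empty inventory in-game (helper name is ours; 100000 is this host's userns-remap base uid):

```shell
# Check a restored file carries the container uid before re-enabling saves.
owned_by() {   # usage: owned_by <file> <expected-uid>
  [ "$(stat -c '%u' "$1")" = "$2" ]
}
owned_by /opt/docker/minecraft/world/playerdata/${PUUID}.dat 100000 \
  || echo "WRONG OWNER — run: chown 100000:100000 world/playerdata/${PUUID}.dat"
```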
---
## Scenario 1 — Player lost inventory (T1, the void-death case)
This is what the strategy was written for. RTO target: **< 2 minutes**.
1. Find the UUID:
```bash
grep -i 'NICK' /opt/docker/minecraft/usercache.json
```
2. Pick a snapshot just **before** the loss. `restic snapshots --tag playerdata` shows timestamps.
3. Run the TL;DR block above with that snapshot id (or `latest` if loss happened in the last 5 min).
4. Inform the player: "Your inventory from HH:MM has been restored. Anything you picked up after that point is gone."
5. Log the incident: append to `docs/INCIDENTS.md` (create if absent) date, player, snapshot id, cause.
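A sketch for step 2, assuming `jq` is available (not confirmed as installed on nullstone): `restic snapshots --json` emits `time` and `short_id` per snapshot, so picking the newest snapshot strictly before the reported loss time is a one-liner filter.

```shell
# Print playerdata snapshot ids taken BEFORE the loss, newest first.
pick_before() {   # usage: restic ... snapshots --json | pick_before <iso-time>
  jq -r --arg t "$1" \
    '[.[] | select(.time < $t)] | sort_by(.time) | reverse | .[].short_id'
}
# e.g.:
# RESTIC_PASSWORD_FILE=/etc/mc-backup.pw \
# restic -r /home/user/restic/mc-frequent snapshots --tag playerdata --json \
#   | pick_before "2026-05-07T17:13:39+01:00"
```

The top line of the output is the snapshot to feed into the TL;DR block as `WHEN`.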
---
## Scenario 2 — Whole world rolled back (T2/T3, griefing or corruption)
RTO target: **15 minutes**. Server downtime expected.
1. Announce, kick, stop:
```bash
mcrcon ... "say Server going down for restore back in ~15 min"
mcrcon ... "kick @a Restore in progress"
cd /opt/docker/minecraft && docker compose down
```
2. Move live data aside (do not delete):
```bash
mv /opt/docker/minecraft /opt/docker/minecraft.broken-$(date +%F)
mkdir -p /opt/docker/minecraft
```
3. Restore from the world repo:
```bash
RESTIC_PASSWORD_FILE=/etc/mc-backup.pw \
restic -r /home/user/restic/mc-world \
  restore <snapshot-id> --target /tmp/world-restore
rsync -aHAX /tmp/world-restore/opt/docker/minecraft/ /opt/docker/minecraft/
```
4. **Re-apply userns-remap perms** (critical — see memory):
```bash
chmod -R 777 /opt/docker/minecraft # quickfix; or chown -R 100000:100000
```
5. Boot:
```bash
cd /opt/docker/minecraft && docker compose up -d
docker logs -f minecraft-mc # watch for "Done" line
```
6. Verify with a known-good UUID's `.dat` parse, then announce server up.
7. Keep `minecraft.broken-YYYY-MM-DD/` for at least 7 days for forensic comparison.
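For step 5, a sketch that blocks until the vanilla "Done (…s)!" ready line appears instead of eyeballing the log tail (helper name, container name, and the 180 s timeout are our assumptions for this compose setup):

```shell
# Wait for the server's ready line in a log stream, with a hard timeout.
wait_for_done() {   # usage: wait_for_done '<log command>' <timeout-seconds>
  timeout "$2" sh -c "$1 | grep -m1 -q 'Done ('"
}
# e.g.: wait_for_done 'docker logs -f minecraft-mc 2>&1' 180 && echo "server up"
```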
---
## Scenario 3 — Host disk dead (T5)
RTO target: **few hours, depends on hardware swap**.
1. New host: install Debian 13 + Docker per `_github/infra/runbooks/MIGRATION-nullstone-to-cobblestone.md`.
2. `apt install restic`. Pull the password from operator's password manager into `/etc/mc-backup.pw`.
3. Create the destination dir (`mkdir -p /opt/docker/minecraft`), then restore from the **onyx mirror** (not local — local is gone):
```bash
RESTIC_PASSWORD_FILE=/etc/mc-backup.pw \
restic -r sftp:mc-backup@100.64.0.1:/backups/nullstone-mc-restic \
  restore latest --target /tmp/world-restore
```
4. Sync the restored tree into place with the `rsync` from Scenario 2 step 3, then continue Scenario 2 from step 4.
5. Stand up the timers on the new host. **Do not** point them at the same off-host repo until the new host has been re-keyed (rotate restic passwords as part of disaster recovery).
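A sketch of the re-key step. restic repositories hold multiple keys, so rotation is add-new-then-remove-old; commands are left commented because they touch the live repo, and the `key add`/`key remove` flags should be verified against the installed restic version:

```shell
# Rotate the restic repo key after standing up a replacement host.
REPO=sftp:mc-backup@100.64.0.1:/backups/nullstone-mc-restic
# restic -r "$REPO" key add --new-password-file /etc/mc-backup.pw
# restic -r "$REPO" key list      # note the old key's ID, then:
# restic -r "$REPO" key remove <old-key-id>
```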
---
## Drill log (monthly)
| Date | Operator | Snapshot age | Class A restore time | Off-host restore time | Result |
|------|----------|--------------|----------------------|------------------------|--------|
| (first drill — 2026-06-06) | s8n | TBD | TBD | TBD | TBD |
Procedure: see `BACKUP-STRATEGY.md` §7.
---
## What if no snapshot exists yet? (CURRENT REALITY 2026-05-07)
Until phases 1–4 of `BACKUP-STRATEGY.md` are deployed, the only recovery resources are:
| Source | What's there | Recoverable? |
|---|---|---|
| `/opt/backups/202604xx_020001/mc-world-backup-*.tar.gz` | World tar from Apr 29 + May 2 (others FAILED) | **GONE** — pruned by 7-day retention |
| `/opt/backups/mc-plugins-prerebrand-2026-04-30.tar.gz` | Plugin jars only, no world | Not useful for player data |
| Live `/opt/docker/minecraft/world/playerdata/<uuid>.dat_old` | MC's own .dat_old shadow file from previous save | **YES** — last save tick before current. **First-line defence right now.** |
| CoreProtect DB (`plugins/CoreProtect/database.db`) | Block + container actions, NOT inventory state | Partial — can roll back grief, can't restore lost items |
**Today's playbook for inventory-loss reports:**
1. Server console → `co lookup u:NICK` to confirm the loss event in CoreProtect.
2. **Stop the server immediately** if the report comes in within the same play session — every save tick overwrites `.dat_old`. `docker compose down` buys time.
3. Inspect `world/playerdata/<uuid>.dat_old` — if it predates the loss, copy it over `<uuid>.dat`, fix perms (uid 100000), restart.
4. If `.dat_old` is too new (already overwritten): **the loss is unrecoverable until BACKUP-STRATEGY phases 1–4 are deployed.** Apologise to the player. Spawn-in compensation per operator discretion (ops creative-mode replacement is the customary remedy).
5. Log the incident — adds urgency to deploying the new strategy.
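A sketch for step 3's "if it predates the loss" check, comparing the shadow file's mtime against the reported loss time before trusting it (helper name is ours; GNU `stat`/`date` assumed, which holds on this Debian host):

```shell
# True iff the .dat_old was last written BEFORE the loss happened.
datold_usable() {   # usage: datold_usable <dat_old-path> <loss-epoch-seconds>
  [ "$(stat -c '%Y' "$1")" -lt "$2" ]
}
# e.g.:
# loss=$(date -d '2026-05-07 17:13:39' +%s)
# datold_usable /opt/docker/minecraft/world/playerdata/${PUUID}.dat_old "$loss" \
#   && echo "restorable from .dat_old"
```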
---
## TODO — open items (links into BACKUP-STRATEGY.md §11)
- [ ] Phase 1: fix `/opt/docker/backup.sh` orphan-line bug (F-backup-1).
- [ ] Phase 2: deploy `mc-backup-frequent.timer` (Class A, 5-min playerdata).
- [ ] Phase 3: deploy `mc-backup-world.timer` (Class B/C/D, hourly).
- [ ] Phase 4: provision `mc-backup` user on onyx + `restic copy` job.
- [ ] Phase 5: schedule monthly drill calendar entry, run first drill.
- [ ] Phase 6: ntfy / Matrix alert wiring (depends on ntfy deployment).
- [ ] Phase 7: friend RTX 4080 PC as secondary off-host.
- [ ] Verify `usercache.json` on this host: confirm UUID lookup workflow above resolves to the right `.dat`.
- [ ] Decide: `mcrcon` package vs lightweight Python `mcrcon` lib.
- [ ] Document compensation policy for unrecoverable losses (operator discretion right now).