Player YOU500 lost their full inventory to an AuthLimbo void-death at 17:13:39. Investigation revealed that the deployed /opt/docker/backup.sh is an 88-line stub missing the Minecraft block; the last successful world backup (2026-05-02) has already been pruned. No recoverable .dat exists. Files:

- AUDIT-2026-05-07.md — server-side findings F-01..F-06 (P0 backups, no keepInventory, AuthLimbo silent failure, chunk preload race, Xmx > container headroom, container hardening gaps)
- BACKUP-HUNT-2026-05-07.md — exhaustive backup scan; only a 6-week-old archive at _archive/minecraft-old-2026-04-27.tar.gz
- BACKUP-STRATEGY.md — restic-based plan; 5-min/hourly/daily classes, off-host to onyx via Tailscale, monthly drill
- CROSS-REFERENCE-2026-05-07.md — repo and doc landing map; flags the pre-existing infra/STATE.md backup-broken note and the HA-CLUSTER restic draft to extend rather than duplicate
- docs/RUNBOOK-BACKUP-RESTORE.md — operator runbook for .dat restore, full-world restore, host-loss restore, drill log
# Runbook — Backup & Restore (Minecraft, racked.ru on nullstone)
Strategy doc: [`../BACKUP-STRATEGY.md`](../BACKUP-STRATEGY.md). This runbook is the **operator-facing** procedure for the three scenarios that come up in practice. Keep it short, copy-paste-able, and reachable from the player support workflow.
> **Status (2026-05-07):** This runbook is written **ahead** of the implementation it describes. The `mc-backup-frequent` timer and onyx mirror are NOT yet deployed. The "What if no snapshot exists yet?" section at the bottom covers today's reality.
---
## TL;DR — restore one player's `.dat` from N minutes ago
```bash
# On nullstone, as `user`:
PUUID=<player-uuid>        # e.g. from /opt/docker/minecraft/usercache.json
PUUID_NICK=<player-nick>   # in-game name, used for the kick below
WHEN=latest                # or a snapshot id from `restic snapshots`

RESTIC_PASSWORD_FILE=/etc/mc-backup.pw \
  restic -r /home/user/restic/mc-frequent \
    restore "$WHEN" \
    --target /tmp/restore-$$ \
    --include "world/playerdata/${PUUID}.dat"

# Verify the file is well-formed NBT before applying:
file /tmp/restore-$$/opt/docker/minecraft/world/playerdata/${PUUID}.dat
# Expected: "gzip compressed data"

# Apply (server must be running so playerdata is writable; the player
# MUST be offline or we're racing the writer):
mcrcon -H 127.0.0.1 -P 25575 -p '*redacted*' "kick ${PUUID_NICK} Restore in progress"
mcrcon -H 127.0.0.1 -P 25575 -p '*redacted*' "save-off"
mcrcon -H 127.0.0.1 -P 25575 -p '*redacted*' "save-all flush"

cp /opt/docker/minecraft/world/playerdata/${PUUID}.dat \
   /opt/docker/minecraft/world/playerdata/${PUUID}.dat.preFix-$(date +%s)
cp /tmp/restore-$$/opt/docker/minecraft/world/playerdata/${PUUID}.dat \
   /opt/docker/minecraft/world/playerdata/${PUUID}.dat
chown 100000:100000 /opt/docker/minecraft/world/playerdata/${PUUID}.dat  # userns-remap

mcrcon -H 127.0.0.1 -P 25575 -p '*redacted*' "save-on"
# Tell the player to log back in.
```
**Why kick + `save-off`:** if the player is online, the server holds their NBT in memory and rewrites the `.dat` on next save tick — clobbering the restore. `save-off` halts auto-save; kicking guarantees the in-memory state for that player won't be flushed.
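The `file` check in the TL;DR block can be wrapped as a reusable guard. A minimal sketch (hypothetical helper, not in the repo): player `.dat` files are gzip-compressed NBT, so gzip stream integrity is a cheap first-pass check.

```bash
# Hypothetical helper: cheap corruption check for a restored .dat.
# Only verifies the gzip stream; it does not parse the NBT inside.
check_dat() {
  if gzip -t "$1" 2>/dev/null; then
    echo "OK: $1"
  else
    echo "CORRUPT: $1" >&2
    return 1
  fi
}
```

Run it on the file under `/tmp/restore-$$/...` before the `cp`; a failure means pick an older snapshot.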
**Userns-remap reminder:** the host sees container-uid `100000` for files written by the MC process. Restored files written by `user` (uid 1000) will appear empty/permission-denied to the container. Always `chown 100000:100000` (or `chmod 666`) after restore. Memory: `project_nullstone_docker_userns`.
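A quick way to catch a missed `chown` after any restore, as a sketch assuming the remap base uid 100000 noted above (hypothetical helper, not in the repo):

```bash
# Hypothetical helper: list .dat files under DIR that the container does
# NOT own (uid 100000). Anything printed still needs chown 100000:100000.
unowned_dats() {
  find "${1:-/opt/docker/minecraft/world/playerdata}" -name '*.dat' ! -uid 100000 -print
}
```

An empty result means the restore is safe to hand back to the server.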
---
## Scenario 1 — Player lost inventory (T1, the void-death case)
This is what the strategy was written for. RTO target: **< 2 minutes**.
1. Find the UUID:
   ```bash
   grep -i 'NICK' /opt/docker/minecraft/usercache.json
   ```
2. Pick a snapshot just **before** the loss. `restic snapshots --tag playerdata` shows timestamps.
3. Run the TL;DR block above with that snapshot id (or `latest` if the loss happened in the last 5 min).
4. Inform the player: "Your inventory from HH:MM has been restored. Anything you picked up after that point is gone."
5. Log the incident: append to `docs/INCIDENTS.md` (create if absent) — date, player, snapshot id, cause.
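The grep in step 1 also matches substrings of other nicks. An exact-match sketch (hypothetical `mc_uuid` helper; python3 ships with Debian, and `usercache.json` is a JSON array of `{"name","uuid","expiresOn"}` entries):

```bash
# Hypothetical helper: print the UUID for an exact (case-insensitive)
# nick match from usercache.json; prints nothing if the nick is absent.
mc_uuid() {
  python3 - "$1" "${2:-/opt/docker/minecraft/usercache.json}" <<'PY'
import json, sys
nick, path = sys.argv[1], sys.argv[2]
for entry in json.load(open(path)):
    if entry["name"].lower() == nick.lower():
        print(entry["uuid"])
        break
PY
}
# mc_uuid NICK   # then feed the result into PUUID above
```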
---
## Scenario 2 — Whole world rolled back (T2/T3, griefing or corruption)
RTO target: **15 minutes**. Server downtime expected.
1. Announce, kick, stop:
   ```bash
   mcrcon ... "say Server going down for restore — back in ~15 min"
   mcrcon ... "kick @a Restore in progress"
   cd /opt/docker/minecraft && docker compose down
   ```
2. Move the live data aside (do not delete it):
   ```bash
   mv /opt/docker/minecraft /opt/docker/minecraft.broken-$(date +%F)
   mkdir -p /opt/docker/minecraft
   ```
3. Restore from the world repo:
   ```bash
   RESTIC_PASSWORD_FILE=/etc/mc-backup.pw \
     restic -r /home/user/restic/mc-world \
       restore <snapshot-id> --target /tmp/world-restore
   rsync -aHAX /tmp/world-restore/opt/docker/minecraft/ /opt/docker/minecraft/
   ```
4. **Re-apply userns-remap perms** (critical — see memory):
   ```bash
   chown -R 100000:100000 /opt/docker/minecraft   # or chmod -R 777 as a quickfix
   ```
5. Boot:
   ```bash
   cd /opt/docker/minecraft && docker compose up -d
   docker logs -f minecraft-mc   # watch for the "Done" line
   ```
6. Verify by parsing a known-good UUID's `.dat`, then announce the server is up.
7. Keep `minecraft.broken-YYYY-MM-DD/` for at least 7 days for forensic comparison.
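`restic restore` accepts `latest` or an explicit snapshot id, but not "the newest snapshot before time T". A sketch of that selection step (hypothetical `pick_before` helper) over `restic snapshots --json` output:

```bash
# Hypothetical helper: read `restic snapshots --json` on stdin and print
# the short_id of the newest snapshot strictly before CUTOFF (an RFC3339
# prefix, e.g. 2026-05-07T17:00). Prints nothing if none qualify.
pick_before() {
  python3 - "$1" <<'PY'
import json, sys
cutoff = sys.argv[1]
snaps = sorted(json.load(sys.stdin), key=lambda s: s["time"])
before = [s for s in snaps if s["time"] < cutoff]
print(before[-1]["short_id"] if before else "")
PY
}
# RESTIC_PASSWORD_FILE=/etc/mc-backup.pw \
#   restic -r /home/user/restic/mc-world snapshots --json | pick_before "2026-05-07T17:00"
```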
---
## Scenario 3 — Host disk dead (T5)
RTO target: **few hours, depends on hardware swap**.
1. New host: install Debian 13 + Docker per `_github/infra/runbooks/MIGRATION-nullstone-to-cobblestone.md`.
2. `apt install restic`. Pull the repo password from the operator's password manager into `/etc/mc-backup.pw`.
3. Initialise the destination dir, then restore from the **onyx mirror** (not local — local is gone):
   ```bash
   RESTIC_PASSWORD_FILE=/etc/mc-backup.pw \
     restic -r sftp:mc-backup@100.64.0.1:/backups/nullstone-mc-restic \
       restore latest --target /tmp/world-restore
   ```
4. Continue Scenario 2 from step 4.
5. Stand up the timers on the new host. **Do not** point them at the same off-host repo until the new host has been re-keyed (rotate restic passwords as part of disaster recovery).
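Step 2 can be done with a small sketch (hypothetical helper) so the password never lands in shell history and the file is root-readable only:

```bash
# Hypothetical helper: write the restic repo password from stdin with
# owner-only permissions. Run as root for /etc/mc-backup.pw. The body
# is a subshell so the umask change does not leak into the caller.
install_pw() (
  umask 077
  cat > "$1"
  chmod 600 "$1"
)
# <password-manager-cli> show mc-backup | install_pw /etc/mc-backup.pw
```

`<password-manager-cli>` is a placeholder for whatever the operator actually uses.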
---
## Drill log (monthly)
| Date | Operator | Snapshot age | Class A restore time | Off-host restore time | Result |
|------|----------|--------------|----------------------|-----------------------|--------|
| (first drill — 2026-06-06) | s8n | TBD | TBD | TBD | TBD |
Procedure: see `BACKUP-STRATEGY.md` §7.
---
## What if no snapshot exists yet? (CURRENT REALITY 2026-05-07)
Until phases 1–4 of `BACKUP-STRATEGY.md` are deployed, the only recovery resources are:
| Source | What's there | Recoverable? |
|---|---|---|
| `/opt/backups/202604xx_020001/mc-world-backup-*.tar.gz` | World tars from Apr 29 + May 2 (other runs FAILED) | **GONE** — pruned by 7-day retention |
| `/opt/backups/mc-plugins-prerebrand-2026-04-30.tar.gz` | Plugin jars only, no world | Not useful for player data |
| Live `/opt/docker/minecraft/world/playerdata/<uuid>.dat_old` | MC's own `.dat_old` shadow file from the previous save | **YES** — last save tick before the current one. **First-line defence right now.** |
| CoreProtect DB (`plugins/CoreProtect/database.db`) | Block + container actions, NOT inventory state | Partial — can roll back grief, can't restore lost items |
**Today's playbook for inventory-loss reports:**
1. Server console → `co lookup u:NICK` to confirm the loss event in CoreProtect.
2. **Stop the server immediately** if the report comes in within the same play session — every save tick overwrites `.dat_old`. `docker compose down` buys time.
3. Inspect `world/playerdata/<uuid>.dat_old` — if it predates the loss, copy it over `<uuid>.dat`, fix perms (uid 100000), restart.
4. If `.dat_old` is too new (already overwritten): **the loss is unrecoverable until BACKUP-STRATEGY phases 1–4 are deployed.** Apologise to the player; spawn-in compensation is at operator discretion (ops creative-mode replacement is the customary remedy).
5. Log the incident — it adds urgency to deploying the new strategy.
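Step 3's "if it predates the loss" check can be made explicit with a sketch (hypothetical helper; compares file mtime against the incident time as a Unix timestamp):

```bash
# Hypothetical helper: exit 0 only if FILE's mtime is before LOSS_EPOCH.
# Use it on <uuid>.dat_old before trusting it as a restore source.
older_than() {
  [ "$(stat -c %Y "$1")" -lt "$2" ]
}
# LOSS=$(date -d '17:13:39' +%s)   # today's date at the reported time
# older_than world/playerdata/<uuid>.dat_old "$LOSS" && echo "worth restoring"
```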
---
## TODO — open items (links into BACKUP-STRATEGY.md §11)
- [ ] Phase 1: fix `/opt/docker/backup.sh` orphan-line bug (F-backup-1).
- [ ] Phase 2: deploy `mc-backup-frequent.timer` (Class A, 5-min playerdata).
- [ ] Phase 3: deploy `mc-backup-world.timer` (Class B/C/D, hourly).
- [ ] Phase 4: provision `mc-backup` user on onyx + `restic copy` job.
- [ ] Phase 5: schedule the monthly drill calendar entry, run the first drill.
- [ ] Phase 6: ntfy / Matrix alert wiring (depends on ntfy deployment).
- [ ] Phase 7: friend's RTX 4080 PC as secondary off-host.
- [ ] Verify `usercache.json` on this host: confirm the UUID lookup workflow above resolves to the right `.dat`.
- [ ] Decide: `mcrcon` package vs lightweight Python `mcrcon` lib.
- [ ] Document the compensation policy for unrecoverable losses (operator discretion right now).
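For orientation only, a sketch of what the Phase 2 timer unit might look like (illustrative config, not deployed; the real unit ships with `BACKUP-STRATEGY.md`):

```ini
# /etc/systemd/system/mc-backup-frequent.timer  (sketch, not deployed)
[Unit]
Description=Class A playerdata backup, every 5 minutes

[Timer]
OnCalendar=*:0/5
Persistent=true

[Install]
WantedBy=timers.target
```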