# Runbook — Backup & Restore (Minecraft, racked.ru on nullstone)
Strategy doc: [`../BACKUP-STRATEGY.md`](../BACKUP-STRATEGY.md). This runbook is the **operator-facing** procedure for the three scenarios that come up in practice. Keep it short, copy-paste-able, and reachable from the player support workflow.
> **Status (2026-05-07):** Phase 1 (the daily `/opt/docker/backup.sh` MC world tarball) is **deployed and verified** — see "Phase 1 deployment" section near the bottom. Phase 2 (`mc-backup-playerdata.timer`, 5-min cadence) and the onyx off-host mirror are NOT yet deployed; deployment steps in "Phase 2 deployment" below. Until Phase 2 lands, the daily 02:00 tarball is the only safety net (RPO up to 24h).
---
## TL;DR — restore one player's `.dat` from N minutes ago
```bash
# On nullstone, as `user`:
PUUID=<player-uuid> # e.g. from /opt/docker/minecraft/usercache.json
WHEN=latest # or a snapshot id; restore accepts an id or "latest" (for "5 min ago", pick an id from `restic snapshots`)
RESTIC_PASSWORD_FILE=/etc/mc-backup.pw \
restic -r /home/user/restic/mc-frequent \
restore "$WHEN" \
--target /tmp/restore-$$ \
--include "world/playerdata/${PUUID}.dat"
# Verify the file is well-formed NBT before applying:
file /tmp/restore-$$/opt/docker/minecraft/world/playerdata/${PUUID}.dat
# Expected: "gzip compressed data"
# Apply (server must be running so playerdata is writable; the player
# MUST be offline or we're racing the writer):
NICK=<player-name> # in-game name for $PUUID; kick targets names, not UUIDs
mcrcon -H 127.0.0.1 -P 25575 -p ${RCON_PASSWORD} "kick ${NICK} Restore in progress"
mcrcon -H 127.0.0.1 -P 25575 -p ${RCON_PASSWORD} "save-off"
mcrcon -H 127.0.0.1 -P 25575 -p ${RCON_PASSWORD} "save-all flush"
cp /opt/docker/minecraft/world/playerdata/${PUUID}.dat \
/opt/docker/minecraft/world/playerdata/${PUUID}.dat.preFix-$(date +%s)
cp /tmp/restore-$$/opt/docker/minecraft/world/playerdata/${PUUID}.dat \
/opt/docker/minecraft/world/playerdata/${PUUID}.dat
chown 100000:100000 /opt/docker/minecraft/world/playerdata/${PUUID}.dat # userns-remap
mcrcon -H 127.0.0.1 -P 25575 -p ${RCON_PASSWORD} "save-on"
# Tell the player to log back in.
```
**Why kick + `save-off`:** if the player is online, the server holds their NBT in memory and rewrites the `.dat` on next save tick — clobbering the restore. `save-off` halts auto-save; kicking guarantees the in-memory state for that player won't be flushed.
**Userns-remap reminder:** the host sees container-uid `100000` for files written by the MC process. Restored files written by `user` (uid 1000) will appear empty/permission-denied to the container. Always `chown 100000:100000` (or `chmod 666`) after restore. Memory: `project_nullstone_docker_userns`.
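A quick way to confirm the remap took, assuming GNU `stat` on the host:
```bash
# List owner:group for playerdata files; anything not 100000:100000
# (and not world-writable) will be unreadable inside the container.
stat -c '%u:%g %n' /opt/docker/minecraft/world/playerdata/*.dat \
  | grep -v '^100000:100000 ' || echo "ownership OK"
```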
---
## Scenario 1 — Player lost inventory (T1, the void-death case)
This is what the strategy was written for. RTO target: **< 2 minutes**.
1. Find the UUID:
```bash
grep -i 'NICK' /opt/docker/minecraft/usercache.json
```
2. Pick a snapshot just **before** the loss. `restic snapshots --tag playerdata` shows timestamps (full command in the sketch after this list).
3. Run the TL;DR block above with that snapshot id (or `latest` if loss happened in the last 5 min).
4. Inform the player: "Your inventory from HH:MM has been restored. Anything you picked up after that point is gone."
5. Log the incident: append to `docs/INCIDENTS.md` (create if absent) date, player, snapshot id, cause.
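The full listing command for step 2, a minimal sketch against the frequent repo from the TL;DR:
```bash
# Timestamps in the output tell you which snapshot predates the loss.
RESTIC_PASSWORD_FILE=/etc/mc-backup.pw \
restic -r /home/user/restic/mc-frequent snapshots --tag playerdata
```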
---
## Scenario 2 — Whole world rolled back (T2/T3, griefing or corruption)
RTO target: **15 minutes**. Server downtime expected.
1. Announce, kick, stop:
```bash
mcrcon ... "say Server going down for restore back in ~15 min"
mcrcon ... "kick @a Restore in progress"
cd /opt/docker/minecraft && docker compose down
```
2. Move live data aside (do not delete):
```bash
mv /opt/docker/minecraft /opt/docker/minecraft.broken-$(date +%F)
mkdir -p /opt/docker/minecraft
```
3. Restore from the world repo:
```bash
RESTIC_PASSWORD_FILE=/etc/mc-backup.pw \
restic -r /home/user/restic/mc-world \
restore <snapshot-id> --target /tmp/world-restore
rsync -aHAX /tmp/world-restore/opt/docker/minecraft/ /opt/docker/minecraft/
```
4. **Re-apply userns-remap perms** (critical — see memory):
```bash
chmod -R 777 /opt/docker/minecraft # quickfix; or chown -R 100000:100000
```
5. Boot:
```bash
cd /opt/docker/minecraft && docker compose up -d
docker logs -f minecraft-mc # watch for "Done" line
```
6. Verify with a known-good UUID's `.dat` parse (sketch below), then announce server up.
7. Keep `minecraft.broken-YYYY-MM-DD/` for at least 7 days for forensic comparison.
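For step 6, a minimal integrity check, assuming vanilla-style gzip-compressed NBT (a full parse needs an NBT tool; this only proves the gzip stream is intact):
```bash
DAT=/opt/docker/minecraft/world/playerdata/<known-good-uuid>.dat
file "$DAT"                      # expect: "gzip compressed data"
zcat "$DAT" > /dev/null && echo "gzip stream OK"
```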
---
## Scenario 3 — Host disk dead (T5)
RTO target: **few hours, depends on hardware swap**.
1. New host: install Debian 13 + Docker per `_github/infra/runbooks/MIGRATION-nullstone-to-cobblestone.md`.
2. `apt install restic`. Pull the password from the operator's password manager into `/etc/mc-backup.pw` and `chmod 600` it.
3. Initialise destination dir, then restore from **onyx mirror** (not local — local is gone):
```bash
RESTIC_PASSWORD_FILE=/etc/mc-backup.pw \
restic -r sftp:mc-backup@100.64.0.1:/backups/nullstone-mc-restic \
restore latest --target /tmp/world-restore
```
4. Continue Scenario 2 from step 4.
5. Stand up the timers on the new host. **Do not** point them at the same off-host repo until the new host has been re-keyed (rotate restic passwords as part of disaster recovery; sketch below).
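A hedged re-key sketch using restic's standard key management subcommands (`key list` / `key add` / `key remove`); the key id and `.pw.new` file are placeholders, and `--new-password-file` needs a reasonably recent restic (older versions prompt interactively):
```bash
export RESTIC_PASSWORD_FILE=/etc/mc-backup.pw
REPO=sftp:mc-backup@100.64.0.1:/backups/nullstone-mc-restic
restic -r "$REPO" key list                                      # note the old key id
restic -r "$REPO" key add --new-password-file /etc/mc-backup.pw.new
restic -r "$REPO" key remove <old-key-id>                       # then swap the .pw files into place
```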
---
## Drill log (monthly)
| Date | Operator | Snapshot age | Class A restore time | Off-host restore time | Result |
|------|----------|--------------|----------------------|------------------------|--------|
| (first drill — 2026-06-06) | s8n | TBD | TBD | TBD | TBD |
Procedure: see `BACKUP-STRATEGY.md` §7.
---
## What if no snapshot exists yet? (CURRENT REALITY 2026-05-07)
Until phases 1–4 of `BACKUP-STRATEGY.md` are deployed, the only recovery resources are:
| Source | What's there | Recoverable? |
|---|---|---|
| `/opt/backups/202604xx_020001/mc-world-backup-*.tar.gz` | World tar from Apr 29 + May 2 (others FAILED) | **GONE** — pruned by 7-day retention |
| `/opt/backups/mc-plugins-prerebrand-2026-04-30.tar.gz` | Plugin jars only, no world | Not useful for player data |
| Live `/opt/docker/minecraft/world/playerdata/<uuid>.dat_old` | MC's own .dat_old shadow file from previous save | **YES** — last save tick before current. **First-line defence right now.** |
| CoreProtect DB (`plugins/CoreProtect/database.db`) | Block + container actions, NOT inventory state | Partial — can roll back grief, can't restore lost items |
**Today's playbook for inventory-loss reports:**
1. Server console → `co lookup u:NICK` to confirm the loss event in CoreProtect.
2. **Stop the server immediately** if the report comes in within the same play session — every save tick overwrites `.dat_old`. `docker compose down` buys time.
3. Inspect `world/playerdata/<uuid>.dat_old` — if it predates the loss, copy it over `<uuid>.dat`, fix perms (uid 100000), restart. See the sketch after this list.
4. If `.dat_old` is too new (already overwritten): **the loss is unrecoverable until BACKUP-STRATEGY phases 1–4 are deployed.** Apologise to the player. Spawn-in compensation per operator discretion (ops creative-mode replacement is the customary remedy).
5. Log the incident — adds urgency to deploying the new strategy.
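A minimal sketch of step 3, assuming the server is already stopped and `.dat_old` predates the loss:
```bash
UUID=<player-uuid>
PD=/opt/docker/minecraft/world/playerdata
cp "$PD/$UUID.dat" "$PD/$UUID.dat.preFix-$(date +%s)"  # keep the bad state for forensics
cp "$PD/$UUID.dat_old" "$PD/$UUID.dat"
chown 100000:100000 "$PD/$UUID.dat"                    # userns-remap perms
```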
---
## Phase 1 deployment — DONE 2026-05-07
The daily fallback (`/opt/docker/backup.sh`) was repaired and redeployed. It now backs up MC world (~12 G compressed), plugins (~490 M), plugin DBs (~280 M), and configs nightly at 02:00, prunes after 7 days, and writes a sentinel `/opt/backups/.last-success` on success.
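A quick local freshness check on nullstone (sentinel path as described above; 1500 min = 25h, one nightly run plus slack):
```bash
stat -c '%y %n' /opt/backups/.last-success   # timestamp of the last successful run
# Prints STALE only if the sentinel is older than 25h:
find /opt/backups/.last-success -mmin +1500 | grep -q . && echo "STALE: last success >25h ago"
```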
External monitor (cron on onyx) — the simplest dead-man's switch until ntfy lands:
```bash
# Add to onyx crontab, e.g. every 30 min. One line only: crontab has no
# line continuation. Mail fires only when the sentinel is stale (or ssh
# fails entirely), not on every run:
*/30 * * * * ssh user@192.168.0.100 'find /opt/backups/.last-success -mmin -1500 | grep -q .' || echo "ALERT: nullstone MC backup sentinel stale (>25h)" | mail -s "MC backup stale" you@example.com
```
(Swap `mail` for `notify-send`, `ntfy publish`, etc. once those are wired.)
A copy of the pre-fix script is preserved at `/opt/docker/backup.sh.bak-20260507-pre-phase1` for forensic reference.
---
## Phase 2 deployment — restic playerdata snapshots every 5 min
Implementation is in this repo:
- `scripts/restic-backup-playerdata.sh` — the per-run script
- `scripts/restic-init.sh` — one-time bootstrap (must run as root)
- `scripts/systemd/mc-backup-playerdata.{service,timer}` — 5-min cadence
- Strategy + retention + threat model in `BACKUP-STRATEGY.md`
**Deployment status (2026-05-07): NOT YET DEPLOYED — operator action required.** `restic` is not on nullstone; installing it needs sudo, and `user`'s sudo is password-locked. Operator runs:
```bash
# On nullstone, as root (sudo -i or via console)
apt-get update && apt-get install -y restic mcrcon
git -C /home/user/repos/minecraft-server pull \
|| git clone ssh://git@192.168.0.100:222/s8n/minecraft-server.git /home/user/repos/minecraft-server
cd /home/user/repos/minecraft-server
# 1) Bootstrap repos + env file
sudo bash scripts/restic-init.sh
# 2) Install systemd units + run script
sudo install -m 644 scripts/systemd/mc-backup-playerdata.service /etc/systemd/system/
sudo install -m 644 scripts/systemd/mc-backup-playerdata.timer /etc/systemd/system/
sudo install -m 755 scripts/restic-backup-playerdata.sh /usr/local/bin/
# 3) Enable + start
sudo systemctl daemon-reload
sudo systemctl enable --now mc-backup-playerdata.timer
# 4) Verify
systemctl list-timers mc-backup-playerdata.timer
journalctl -u mc-backup-playerdata.service -n 50 --no-pager
ls -la /home/user/restic/mc-frequent/
restic -r /home/user/restic/mc-frequent --password-file /etc/mc-backup.pw snapshots
```
The first run should appear within ~7 min (`OnBootSec=2min` + 5-min cadence).
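For reference, a hedged sketch of what the timer unit is expected to contain; the canonical file is `scripts/systemd/mc-backup-playerdata.timer` in this repo, and the values here are inferred from the `OnBootSec=2min` + 5-min cadence above:
```ini
[Unit]
Description=Frequent Minecraft playerdata snapshot (restic)

[Timer]
OnBootSec=2min
OnUnitActiveSec=5min

[Install]
WantedBy=timers.target
```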
### Off-host mirror to onyx (Phase 4 — separate)
After Phase 2 is running cleanly for ~24h, provision `mc-backup` user on onyx with chrooted SFTP, then add a nightly `restic copy` job from nullstone (sketch below). See `BACKUP-STRATEGY.md` §6 for the SFTP chroot config and §11 phase plan.
Until then, the local nullstone repo is single-host — survives operator error and bad config edits, **not** disk failure. The Phase 1 daily tarball in `/opt/backups/` is the only redundancy until §6 lands.
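A hedged sketch of the eventual nightly job; `restic copy` with `--from-repo`/`--from-password-file` is restic ≥0.14 syntax, and the exact repo paths are to be confirmed against §6:
```bash
# Destination repo/password via -r and RESTIC_PASSWORD_FILE,
# source repo/password via the --from-* flags.
RESTIC_PASSWORD_FILE=/etc/mc-backup.pw \
restic -r sftp:mc-backup@100.64.0.1:/backups/nullstone-mc-restic \
  copy --from-repo /home/user/restic/mc-frequent \
       --from-password-file /etc/mc-backup.pw
```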
---
## TODO — open items (links into BACKUP-STRATEGY.md §11)
- [x] Phase 1: fix `/opt/docker/backup.sh` orphan-line bug (F-backup-1). **Done 2026-05-07.**
- [ ] Phase 2: deploy `mc-backup-playerdata.timer` (Class A, 5-min). Scripts in repo; **blocked on operator running `apt install restic` + `restic-init.sh` with sudo**.
- [ ] Phase 3: deploy `mc-backup-world.timer` (Class B/C/D, hourly). Script not yet drafted; will mirror playerdata script.
- [ ] Phase 4: provision `mc-backup` user on onyx + `restic copy` job.
- [ ] Phase 5: schedule monthly drill calendar entry, run first drill.
- [ ] Phase 6: ntfy / Matrix alert wiring (depends on ntfy deployment).
- [ ] Phase 7: friend RTX 4080 PC as secondary off-host.
- [ ] Verify `usercache.json` on this host: confirm UUID lookup workflow above resolves to the right `.dat`.
- [ ] Decide: `mcrcon` package vs lightweight Python `mcrcon` lib.
- [ ] Document compensation policy for unrecoverable losses (operator discretion right now).
- [ ] Drop dead `matrix-postgres` + `mongodb` + `synapse-*` blocks from `/opt/docker/backup.sh` once retirement is complete (currently they no-op-skip — minor noise in log only).