docs: 2026-05-07 incident audit + backup strategy
Player YOU500 lost full inventory to AuthLimbo void-death at 17:13:39.
Investigation revealed deployed /opt/docker/backup.sh is an 88-line stub
missing the Minecraft block; last successful world backup 2026-05-02
(already pruned). No recoverable .dat exists.
Files:
- AUDIT-2026-05-07.md — server-side findings F-01..F-06 (P0 backups,
no-keepInventory, AuthLimbo silent failure, chunk preload race,
Xmx > container headroom, container hardening gaps)
- BACKUP-HUNT-2026-05-07.md — exhaustive backup scan; only 6-week-old
archive at _archive/minecraft-old-2026-04-27.tar.gz
- BACKUP-STRATEGY.md — restic-based plan; 5min/hourly/daily classes,
off-host to onyx via Tailscale, monthly drill
- CROSS-REFERENCE-2026-05-07.md — repo+doc landing map; flags
pre-existing infra/STATE.md backup-broken note + HA-CLUSTER restic
draft to extend rather than duplicate
- docs/RUNBOOK-BACKUP-RESTORE.md — operator runbook for .dat restore,
full-world restore, host-loss restore, drill log
2026-05-07 17:33:24 +01:00
# Runbook — Backup & Restore (Minecraft, racked.ru on nullstone)
Strategy doc: [`../BACKUP-STRATEGY.md`](../BACKUP-STRATEGY.md). This runbook is the **operator-facing** procedure for the three scenarios that come up in practice. Keep it short, copy-pasteable, and reachable from the player support workflow.
> **Status (2026-05-07):** Phase 1 (the daily `/opt/docker/backup.sh` MC world tarball) is **deployed and verified** — see "Phase 1 deployment" section near the bottom. Phase 2 (`mc-backup-playerdata.timer`, 5-min cadence) and the onyx off-host mirror are NOT yet deployed; deployment steps in "Phase 2 deployment" below. Until Phase 2 lands, the daily 02:00 tarball is the only safety net (RPO up to 24h).
---
## TL;DR — restore one player's `.dat` from N minutes ago
```bash
# On nullstone, as `user`:
PUUID=<player-uuid>   # e.g. from /opt/docker/minecraft/usercache.json
WHEN=latest           # or a snapshot id (pick via: restic snapshots --tag playerdata)
RESTIC_PASSWORD_FILE=/etc/mc-backup.pw \
restic -r /home/user/restic/mc-frequent \
restore "$WHEN" \
--target /tmp/restore-$$ \
--include "world/playerdata/${PUUID}.dat"
# Verify the file is well-formed NBT before applying:
file /tmp/restore-$$/opt/docker/minecraft/world/playerdata/${PUUID}.dat
# Expected: "gzip compressed data"
# Apply (server must be running so playerdata is writable; the player
# MUST be offline or we're racing the writer):
mcrcon -H 127.0.0.1 -P 25575 -p ${RCON_PASSWORD} "kick ${PUUID_NICK} Restore in progress"
mcrcon -H 127.0.0.1 -P 25575 -p ${RCON_PASSWORD} "save-off"
mcrcon -H 127.0.0.1 -P 25575 -p ${RCON_PASSWORD} "save-all flush"
cp /opt/docker/minecraft/world/playerdata/${PUUID}.dat \
/opt/docker/minecraft/world/playerdata/${PUUID}.dat.preFix-$(date +%s)
cp /tmp/restore-$$/opt/docker/minecraft/world/playerdata/${PUUID}.dat \
/opt/docker/minecraft/world/playerdata/${PUUID}.dat
chown 100000:100000 /opt/docker/minecraft/world/playerdata/${PUUID}.dat # userns-remap
mcrcon -H 127.0.0.1 -P 25575 -p ${RCON_PASSWORD} "save-on"
# Tell the player to log back in.
```
**Why kick + `save-off`:** if the player is online, the server holds their NBT in memory and rewrites the `.dat` on the next save tick — clobbering the restore. `save-off` halts auto-save; kicking guarantees the in-memory state for that player won't be flushed.
**Userns-remap reminder:** the host sees container-uid `100000` for files written by the MC process. Restored files written by `user` (uid 1000) will appear empty/permission-denied to the container. Always `chown 100000:100000` (or `chmod 666`) after restore. Memory: `project_nullstone_docker_userns`.
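A quick post-restore check, as a sketch: every `.dat` should show the remapped owner.
```bash
# Ownership sanity check after any restore; expect every file to be 100000:100000:
stat -c '%u:%g %n' /opt/docker/minecraft/world/playerdata/*.dat \
  | grep -v '^100000:100000 ' || echo "ownership OK"
# (grep prints only mismatched files; no output means all good)
```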
---
## Scenario 1 — Player lost inventory (T1, the void-death case)
This is what the strategy was written for. RTO target: **< 2 minutes**.
1. Find the UUID:
```bash
grep -i 'NICK' /opt/docker/minecraft/usercache.json
```
2. Pick a snapshot just **before** the loss. `restic snapshots --tag playerdata` shows timestamps; a jq helper for steps 1–2 follows this list.
3. Run the TL;DR block above with that snapshot id (or `latest` if loss happened in the last 5 min).
4. Inform the player: "Your inventory from HH:MM has been restored. Anything you picked up after that point is gone."
5. Log the incident: append to `docs/INCIDENTS.md` (create if absent) — date, player, snapshot id, cause.
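The same lookup plus the snapshot pick, as a jq sketch; assumes `jq` is present on nullstone, and the timestamp is this incident's (substitute per case):
```bash
# Step 1: UUID for a nick (usercache.json is a JSON array of {name, uuid, expiresOn}):
PUUID=$(jq -r --arg n NICK '.[] | select(.name == $n) | .uuid' \
        /opt/docker/minecraft/usercache.json)
# Step 2: newest playerdata snapshot before the loss. String compare on .time;
# assumes all snapshot timestamps carry the same UTC offset:
RESTIC_PASSWORD_FILE=/etc/mc-backup.pw \
restic -r /home/user/restic/mc-frequent snapshots --tag playerdata --json \
  | jq -r '[.[] | select(.time < "2026-05-07T17:13:39+01:00")] | last | .short_id'
```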
---
## Scenario 2 — Whole world rolled back (T2/T3, griefing or corruption)
RTO target: **15 minutes**. Server downtime expected.
1. Announce, kick, stop:
```bash
mcrcon ... "say Server going down for restore — back in ~15 min"
mcrcon ... "kick @a Restore in progress"
cd /opt/docker/minecraft && docker compose down
```
2. Move live data aside (do not delete):
```bash
mv /opt/docker/minecraft /opt/docker/minecraft.broken-$(date +%F)
mkdir -p /opt/docker/minecraft
```
3. Restore from the world repo:
```bash
RESTIC_PASSWORD_FILE=/etc/mc-backup.pw \
restic -r /home/user/restic/mc-world \
  restore <snapshot-id> --target /tmp/world-restore
rsync -aHAX /tmp/world-restore/opt/docker/minecraft/ /opt/docker/minecraft/
```
4. **Re-apply userns-remap perms** (critical — see memory):
```bash
chmod -R 777 /opt/docker/minecraft # quickfix; or chown -R 100000:100000
```
5. Boot:
```bash
cd /opt/docker/minecraft && docker compose up -d
docker logs -f minecraft-mc # watch for "Done" line
```
6. Verify with a known-good UUID's `.dat` parse (probe sketch after this list), then announce server up.
7. Keep `minecraft.broken-YYYY-MM-DD/` for at least 7 days for forensic comparison.
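For step 6, a minimal integrity probe; it checks the gzip wrapper and the NBT root-tag byte, while a full parse would need an NBT library:
```bash
f=/opt/docker/minecraft/world/playerdata/<uuid>.dat   # a known-good player
file "$f"                                             # expect: gzip compressed data
# Decompressed NBT must start with TAG_Compound (tag id 0x0a):
[ "$(gzip -dc "$f" | head -c1 | od -An -tx1 | tr -d ' ')" = "0a" ] \
  && echo "root tag OK" || echo "BAD .dat, investigate before booting"
```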
---
## Scenario 3 — Host disk dead (T5)
RTO target: **a few hours, depending on hardware swap**.
1. New host: install Debian 13 + Docker per `_github/infra/runbooks/MIGRATION-nullstone-to-cobblestone.md` .
2. `apt install restic`. Pull the password from the operator's password manager into `/etc/mc-backup.pw`.
3. Initialise destination dir, then restore from **onyx mirror** (not local — local is gone):
```bash
RESTIC_PASSWORD_FILE=/etc/mc-backup.pw \
restic -r sftp:mc-backup@100.64.0.1:/backups/nullstone-mc-restic \
  restore latest --target /tmp/world-restore
```
4. Continue Scenario 2 from step 4.
5. Stand up the timers on the new host. **Do not** point them at the same off-host repo until the new host has been re-keyed (rotate restic passwords as part of disaster recovery; sketch after this list).
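A re-key sketch for step 5, run from the new host against the onyx repo (key id illustrative):
```bash
export RESTIC_REPOSITORY=sftp:mc-backup@100.64.0.1:/backups/nullstone-mc-restic
export RESTIC_PASSWORD_FILE=/etc/mc-backup.pw
restic key list                  # note the id of the old (lost-host) key
restic key add                   # prompts for the new repo password
restic key remove <old-key-id>   # old password no longer opens the repo
```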
---
## Drill log (monthly)
| Date | Operator | Snapshot age | Class A restore time | Off-host restore time | Result |
|------|----------|--------------|----------------------|------------------------|--------|
| (first drill — 2026-06-06) | s8n | TBD | TBD | TBD | TBD |
Procedure: see `BACKUP-STRATEGY.md` §7.
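A timing sketch for the restore columns (paths per this runbook; the off-host column uses the onyx repo once Phase 4 lands):
```bash
# Class A (local) restore, timed:
time (RESTIC_PASSWORD_FILE=/etc/mc-backup.pw \
      restic -r /home/user/restic/mc-frequent \
      restore latest --target /tmp/drill --include world/playerdata)
rm -rf /tmp/drill
# Off-host column: same command against sftp:mc-backup@100.64.0.1:/backups/nullstone-mc-restic
```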
---
## What if no snapshot exists yet? (CURRENT REALITY 2026-05-07)
Until phases 1–4 of `BACKUP-STRATEGY.md` are deployed, the only recovery resources are:
| Source | What's there | Recoverable? |
|---|---|---|
| `/opt/backups/202604xx_020001/mc-world-backup-*.tar.gz` | World tar from Apr 29 + May 2 (others FAILED) | **GONE** — pruned by 7-day retention |
| `/opt/backups/mc-plugins-prerebrand-2026-04-30.tar.gz` | Plugin jars only, no world | Not useful for player data |
| Live `/opt/docker/minecraft/world/playerdata/<uuid>.dat_old` | MC's own .dat_old shadow file from previous save | **YES** — last save tick before current. **First-line defence right now.** |
| CoreProtect DB (`plugins/CoreProtect/database.db`) | Block + container actions, NOT inventory state | Partial — can roll back grief, can't restore lost items |
**Today's playbook for inventory-loss reports:**
1. Server console → `co lookup u:NICK` to confirm the loss event in CoreProtect.
2. **Stop the server immediately** if the report comes in within the same play session — every save tick overwrites `.dat_old`. `docker compose down` buys time.
3. Inspect `world/playerdata/<uuid>.dat_old` — if it predates the loss, copy it over `<uuid>.dat`, fix perms (uid 100000), restart (sketch after this list).
4. If `.dat_old` is too new (already overwritten): **the loss is unrecoverable until BACKUP-STRATEGY phases 1–4 are deployed.** Apologise to the player. Spawn-in compensation per operator discretion (ops creative-mode replacement is the customary remedy).
5. Log the incident — it adds urgency to deploying the new strategy.
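Step 3 as a sketch, with `<uuid>` from the usercache lookup; stopping the server first is the whole point:
```bash
cd /opt/docker/minecraft && docker compose down   # stop save ticks FIRST
cd world/playerdata
cp <uuid>.dat "<uuid>.dat.lost-$(date +%s)"       # preserve the bad state for forensics
cp <uuid>.dat_old <uuid>.dat                      # MC's previous-save shadow copy
chown 100000:100000 <uuid>.dat                    # userns-remap, see reminder above
cd /opt/docker/minecraft && docker compose up -d
```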
---
backup: phase 1 + phase 2 scripts; daily script repaired and deployed
Repairs the orphaned synapse-signing-key block at scripts/backup.sh
lines 119-122 that was exiting the script under set -e before the
Minecraft block could run, leaving 5 of the last 7 days without a
world backup and zero usable snapshots after 7-day retention.
Phase 1 (deployed today to /opt/docker/backup.sh on nullstone):
- Repaired script — orphan block removed, MC arm wrapped so failures
in one tar don't kill the run
- tar exit code 1 ("file changed as we read it") now treated as
success on the live MC world; spark profiler tmp file noise
silenced via --ignore-failed-read --warning=no-file-changed
- Plugin DBs (homestead, AuthMe, CoreProtect, LuckPerms) and configs
now backed up alongside the world
- Sentinel /opt/backups/.last-success stamped only when the world
arm succeeds — gives outside monitors a single mtime to alert on
- Manually verified end-to-end: 12G world tarball, 492M plugins,
279M dbs, 14 config files, sentinel updated. Pre-fix script saved
at /opt/docker/backup.sh.bak-20260507-pre-phase1.
Phase 2 (scripts in repo, deployment pending operator sudo):
- scripts/restic-backup-playerdata.sh — Class A 5-min restic snapshots
of playerdata/, stats/, advancements/, plugin DBs, LuckPerms;
rcon save-all flush before snapshot; tag-scoped retention
- scripts/restic-init.sh — one-time bootstrap (root-only) for
/etc/mc-backup.{env,pw} + repo init at /home/user/restic/
- scripts/systemd/mc-backup-playerdata.{service,timer} — 5-min timer
with hardening (ProtectSystem=strict, ReadOnlyPaths, etc)
- docs/RUNBOOK-BACKUP-RESTORE.md updated with both phases'
deployment steps and the operator-action checklist
Off-host mirror to onyx (Phase 4) and class B/C/D world snapshots
(Phase 3) are still TODO — see BACKUP-STRATEGY.md §11 phase plan.
2026-05-07 18:29:30 +01:00
## Phase 1 deployment — DONE 2026-05-07
The daily fallback (`/opt/docker/backup.sh`) was repaired and redeployed. It now backs up MC world (~12 G compressed), plugins (~490 M), plugin DBs (~280 M), and configs nightly at 02:00, prunes after 7 days, and writes a sentinel `/opt/backups/.last-success` on success.
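An illustrative sketch of the repaired world arm's exit-code handling; `$BACKUP_DIR` stands in for the script's dated destination, and the deployed `/opt/docker/backup.sh` remains the source of truth:
```bash
rc=0
tar --ignore-failed-read --warning=no-file-changed \
    -czf "$BACKUP_DIR/mc-world-backup-$(date +%Y%m%d_%H%M%S).tar.gz" \
    -C /opt/docker/minecraft world || rc=$?
# GNU tar exit 1 means "file changed as we read it": expected on a live world; >1 is fatal.
if [ "$rc" -le 1 ]; then
    touch /opt/backups/.last-success    # sentinel only when the world arm succeeds
else
    echo "world arm failed rc=$rc" >&2  # no sentinel, so the monitor fires
fi
```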
External monitor (cron on onyx) — the simplest dead-man's switch until ntfy lands:
```bash
# Add to onyx crontab, e.g. every 30 min. Must be one physical line (cron has no
# backslash continuation), and mail fires only when the check actually outputs an alert:
*/30 * * * * out=$(ssh user@192.168.0.100 'find /opt/backups/.last-success -mmin -1500 | grep -q . || echo "ALERT: nullstone MC backup sentinel stale (>25h)"'); [ -n "$out" ] && echo "$out" | mail -s "MC backup stale" you@example.com
```
(swap `mail` for `notify-send`, `ntfy publish`, etc. once those are wired)
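An ntfy variant of the same alert, assuming a hypothetical `mc-backups` topic on a self-hosted instance; ntfy accepts a plain POST body, and `out` comes from the cron line above:
```bash
[ -n "$out" ] && curl -s -d "$out" https://ntfy.example.com/mc-backups
```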
A copy of the pre-fix script is preserved at `/opt/docker/backup.sh.bak-20260507-pre-phase1` for forensic reference.
---
## Phase 2 deployment — restic playerdata snapshots every 5 min
Implementation is in this repo:
- `scripts/restic-backup-playerdata.sh` — the per-run script
- `scripts/restic-init.sh` — one-time bootstrap (must run as root)
- `scripts/systemd/mc-backup-playerdata.{service,timer}` — 5-min cadence
- Strategy + retention + threat model in `BACKUP-STRATEGY.md`
**Deployment status (2026-05-07): NOT YET DEPLOYED — operator action required.** `restic` is not on nullstone; installing it needs sudo, and `user`'s sudo is password-locked. Operator runs:
```bash
# On nullstone, as root (sudo -i or via console)
apt-get update && apt-get install -y restic mcrcon
cd /opt/docker
git -C /home/user/repos/minecraft-server pull \
|| git clone ssh://git@192.168.0.100:222/s8n/minecraft-server.git /home/user/repos/minecraft-server
cd /home/user/repos/minecraft-server
# 1) Bootstrap repos + env file
sudo bash scripts/restic-init.sh
# 2) Install systemd units + run script
sudo install -m 644 scripts/systemd/mc-backup-playerdata.service /etc/systemd/system/
sudo install -m 644 scripts/systemd/mc-backup-playerdata.timer /etc/systemd/system/
sudo install -m 755 scripts/restic-backup-playerdata.sh /usr/local/bin/
# 3) Enable + start
sudo systemctl daemon-reload
sudo systemctl enable --now mc-backup-playerdata.timer
# 4) Verify
systemctl list-timers mc-backup-playerdata.timer
journalctl -u mc-backup-playerdata.service -n 50 --no-pager
ls -la /home/user/restic/mc-frequent/
restic -r /home/user/restic/mc-frequent --password-file /etc/mc-backup.pw snapshots
```
The first run should appear within ~7 min (`OnBootSec=2min` + 5-min cadence).
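Once the first snapshot lands, a restore smoke test is worth 30 seconds; diff noise is expected for anything written since the snapshot:
```bash
RESTIC_PASSWORD_FILE=/etc/mc-backup.pw \
restic -r /home/user/restic/mc-frequent \
  restore latest --target /tmp/smoke --include world/playerdata
diff -r /tmp/smoke/opt/docker/minecraft/world/playerdata \
        /opt/docker/minecraft/world/playerdata | head
rm -rf /tmp/smoke
```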
### Off-host mirror to onyx (Phase 4 — separate)
After Phase 2 is running cleanly for ~24h, provision `mc-backup` user on onyx with chrooted SFTP, then add a nightly `restic copy` job from nullstone. See `BACKUP-STRATEGY.md` §6 for the SFTP chroot config and §11 phase plan.
Until then, the local nullstone repo is single-host — survives operator error and bad config edits, **not** disk failure. The Phase 1 daily tarball in `/opt/backups/` is the only redundancy until §6 lands.
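For reference, the Phase 4 mirror job would look roughly like this (restic ≥ 0.14 `copy` syntax, endpoint per Scenario 3; not yet provisioned):
```bash
# Nightly on nullstone: copy local Class A snapshots into the onyx repo.
RESTIC_PASSWORD_FILE=/etc/mc-backup.pw \
restic -r sftp:mc-backup@100.64.0.1:/backups/nullstone-mc-restic copy \
  --from-repo /home/user/restic/mc-frequent \
  --from-password-file /etc/mc-backup.pw
```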
---
## TODO — open items (links into BACKUP-STRATEGY.md §11)
- [x] Phase 1: fix `/opt/docker/backup.sh` orphan-line bug (F-backup-1). **Done 2026-05-07.**
- [ ] Phase 2: deploy `mc-backup-playerdata.timer` (Class A, 5-min). Scripts in repo; **blocked on operator running `apt install restic` + `restic-init.sh` with sudo**.
- [ ] Phase 3: deploy `mc-backup-world.timer` (Class B/C/D, hourly). Script not yet drafted; will mirror playerdata script.
- [ ] Phase 4: provision `mc-backup` user on onyx + `restic copy` job.
- [ ] Phase 5: schedule monthly drill calendar entry, run first drill.
- [ ] Phase 6: ntfy / Matrix alert wiring (depends on ntfy deployment).
- [ ] Phase 7: friend RTX 4080 PC as secondary off-host.
- [ ] Verify `usercache.json` on this host: confirm UUID lookup workflow above resolves to the right `.dat` .
- [ ] Decide: `mcrcon` package vs lightweight Python `mcrcon` lib.
- [ ] Document compensation policy for unrecoverable losses (operator discretion right now).
- [ ] Drop dead `matrix-postgres` + `mongodb` + `synapse-*` blocks from `/opt/docker/backup.sh` once retirement is complete (currently they no-op-skip — minor noise in log only).
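A read-only grep to locate those arms first (pattern assumed from the service names):
```bash
grep -nE 'matrix-postgres|mongodb|synapse' /opt/docker/backup.sh
```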