docs: 2026-05-07 incident audit + backup strategy
Player YOU500 lost his full inventory to an AuthLimbo void-death at 17:13:39. Investigation revealed that the deployed /opt/docker/backup.sh is an 88-line stub missing the Minecraft block; the last successful world backup was 2026-05-02 (already pruned). No recoverable .dat exists.

Files:

- AUDIT-2026-05-07.md — server-side findings F-01..F-06 (P0: backups, no-keepInventory, AuthLimbo silent failure, chunk preload race, Xmx > container headroom, container hardening gaps)
- BACKUP-HUNT-2026-05-07.md — exhaustive backup scan; only a 6-week-old archive at _archive/minecraft-old-2026-04-27.tar.gz
- BACKUP-STRATEGY.md — restic-based plan; 5min/hourly/daily classes, off-host to onyx via Tailscale, monthly drill
- CROSS-REFERENCE-2026-05-07.md — repo+doc landing map; flags the pre-existing infra/STATE.md backup-broken note + the HA-CLUSTER restic draft to extend rather than duplicate
- docs/RUNBOOK-BACKUP-RESTORE.md — operator runbook for .dat restore, full-world restore, host-loss restore, drill log
This commit is contained in:
parent
909eb7bbd6
commit
a1cc3940cf
5 changed files with 1215 additions and 0 deletions
184
AUDIT-2026-05-07.md
Normal file
@@ -0,0 +1,184 @@
# Minecraft Server Audit — racked.ru

**Container:** `minecraft-mc` on nullstone (192.168.0.100)

**Date:** 2026-05-07

**Audit type:** Operational / data-integrity (NOT a network-security audit)

**Auditor:** Claude (Opus 4.7) via SSH read-only inspection

**Catalyst:** Player **YOU500** void-died at login (~17:13:39 BST), inventory lost. No usable backup existed.

---
## Executive Summary

**Status:** Critical issues found.

**Risk score model:** Likelihood (1-5) x Impact (1-5) = 1-25. >= 15 = High, >= 20 = Critical.

A live AuthLimbo `teleportAsync returned false` warning fired during YOU500's first login of the day, immediately after `YOU500 left the confines of this world` (void death in the `auth_limbo` world). The player retried twice. On retry #3 they were teleported to (-264.6, 86, -49.8) and 23 seconds later `was blown up by Creeper`. The console operator (s8n) attempted recovery via RCON, but neither the void death nor the creeper death had item-restore data. Investigation surfaced five critical gaps:

1. **No working backups.** `/opt/docker/backup.sh` as deployed on nullstone is a stale 88-line copy missing the entire Minecraft block. The repo version (`scripts/backup.sh`) has the block but **was never deployed**. The daily 02:00 cron has been running for at least 7 days, producing 8-12 KB archives that contain no world, playerdata, or plugins. `BACKUP.md` claims the script handles MC; it does not.

2. **CoreProtect tracks inventory transactions but not death drops.** `co inspect` will not surface "dropped on death" entries the way it does pickups/drops, and even if it did, the 1.5 GB SQLite blob is approaching the point where `/co rollback` over an inventory radius is operationally slow.

3. **No `keepInventory` rule, no death-drop rescue plugin.** With `difficulty=hard`, `gamemode=survival`, and no Essentials `keepinv` permission flow visible, every death is a total loss.

4. **AuthLimbo has no death listener and no failure remediation.** When `teleportAsync` returns false, the player is dropped at limbo spawn and the warning is logged at WARN level only — no alert, no rollback, no temporary stash of the inventory.

5. **JVM heap sized larger than the container limit.** `JVM_OPTS=-Xmx16384M` inside an `18G` container limit with `MEMORY_SIZE=16G`; if the Aikar-tuned G1 heap actually grows to Xmx, off-heap usage (Netty, mmaps, zip cache) of >2 GB pushes past the limit and the **kernel OOM-kills the container**. Restart-on-OOM has no warning hook to Discord/Matrix.

**Three biggest exposures**

1. Backups silently broken for 7+ days. (Critical — 5x4=20)
2. No item-loss safety net for any cause of death. (Critical — 4x5=20)
3. AuthLimbo failure path has no recovery. (High — 4x4=16)

---
## Findings Table

Severity = Likelihood x Impact. P0 = act this week, P1 = this month, P2 = this quarter.

| ID | Severity | Finding | Recommendation | Effort |
|----|----------|---------|----------------|--------|
| F-01 | **P0 / 20** | `/opt/docker/backup.sh` on nullstone is missing the entire MC backup block. Repo `scripts/backup.sh` has it but was never deployed. Daily backups since 2026-04-30 are 8-12 KB (effectively empty). | Sync the deployed script with the repo, run a manual backup, verify the world tarball is >= 5 GB. Add a sentinel check to backup.sh that fails the run if `mc-world-backup-*.tar.gz` is < 1 GB. | 30 min |
| F-02 | **P0 / 20** | No `keepInventory` rule and no `essentials.keepinv` permission. Every death is a total loss. | Decide policy: (a) `gamerule keepInventory true` server-wide, (b) keep-inv only when the death cause is "void"/"plugin teleport", or (c) auto-restore on AuthLimbo failure. The narrow option (b) preserves survival pain while plugging the AuthLimbo data-loss vector. Plugin candidates: `KeepInventoryOnVoid`, `DeathChestPro`, custom listener in AuthLimbo. | 1-2h research, 1d implement |
| F-03 | **P0 / 18** | AuthLimbo logs WARN on teleport failure but has no alerting or recovery. The player is left at limbo spawn (y128 platform), where they re-disconnect and on retry get teleported normally — but the warning never reaches an operator. | (a) Bump `teleportAsync returned false` to ERROR. (b) Add a Discord/Matrix webhook alert via the existing webhook stack. (c) On failure: snapshot player inventory, kick with a friendly message, write a recovery file `auth_limbo/incident-<uuid>-<ts>.dat` for ops replay. | 1d |
| F-04 | **P0 / 18** | YOU500's first failed teleport target was (2380.4, 69.9, -11358.4) — 11k blocks out, and the chunk was likely not loaded yet. AuthLimbo's `preload-chunks: true` setting fires on `AuthMeAsyncPreLoginEvent`, which may not run before `LoginEvent` in HaHaWTH's AuthMe fork. The exact timing race is unverified. | Add a chunk-loaded assertion in AuthLimbo before calling `teleportAsync`; if not loaded, force-load synchronously OR delay the teleport another 10-20 ticks. Add debug logging of chunk-load state to the WARN line. | 0.5d |
| F-05 | **P0 / 16** | JVM `-Xmx16384M` inside container `mem_limit=18G` leaves no headroom for off-heap (Netty buffers, native mmaps, mod metadata). Aikar flags + 25 plugins easily push native usage to 2-3 GB. A kernel OOM kill is silent. | Either (a) lower `-Xmx` to 12-14 GB (or switch to a `MaxRAMPercentage`-style flag), OR (b) raise `mem_limit` to 24 GB. Also add `oom_score_adj` and a `docker events --filter event=oom` watcher that pings Discord. | 1h config + 2h alerting |
| F-06 | **P0 / 16** | No `pids_limit`, no `cap_drop: ALL`, no `read_only: true`. The container runs with the default Docker capability set (CAP_NET_RAW, CAP_SYS_CHROOT, etc.) it does not need. | Add `cap_drop: [ALL]`, `cap_add: [NET_BIND_SERVICE]` (only if binding <1024; 25565 is high, so likely none), `pids_limit: 4096`, `security_opt: [no-new-privileges:true]`. Test boot, watch for startup failures. | 1h test |
| F-07 | **P1 / 15** | CoreProtect SQLite at 1.5 GB. Performance and reliability degrade past 2-3 GB. `database.db` is the only copy; no WAL checkpoint or vacuum schedule. | (a) Migrate to a MySQL/MariaDB sidecar container. (b) Add a monthly cron `co purge t:30d` (purge entries older than 30 days; per CoreProtect docs). (c) Schedule `VACUUM` after the purge. | 1d for MySQL migration, 1h for purge cron |
| F-08 | **P1 / 12** | AuthMe is still on `passwordHash: SHA256` (legacy). The SHA256 -> BCRYPT migration plan is on the TODO list and still pending. | Set `legacyHashes: [SHA256]` and `passwordHash: BCRYPT`. AuthMe re-hashes on the next successful login. Communicate "your password works as before, no action needed". | 30 min config + monitoring |
| F-09 | **P1 / 12** | `online-mode=false`. The server depends entirely on AuthMe + EpicGuard for identity. EpicGuard config was not audited in this pass. | Verify that `enableProtection: false` in AuthMe (currently false) is intentional, since geofencing is `US, GB, LOCALHOST` only — any user from another country is locked out if protection is re-enabled. Document the choice in `RULES.md`. | 1h doc only |
| F-10 | **P1 / 12** | `auto-save-interval: 2400` (= 2 minutes at 20 TPS) is fine, BUT `paper-global.yml` has `player-auto-save: rate: -1` (= use auto-save-interval, so also 2 min). A player who joins, dies, and disconnects within 2 min may have NO post-death snapshot persisted before their player.dat is overwritten on next login. Player save *does* fire on quit, but if the death happens and the player keeps moving / interacting before logout, items in chunks not yet saved are at risk for tar-while-running backups. | Set `player-auto-save: rate: 1200` (= 1 min). Switch the backup strategy to `save-off` + `save-all flush` + tar + `save-on` to guarantee consistency, OR snapshot the host bind-mount with a filesystem-level snapshot (LVM / btrfs / ZFS). | 30 min config, 0.5d for snapshot path |
| F-11 | **P2 / 10** | `EZShop-1.0-SNAPSHOT.jar` is bundled alongside `AuctionHouse-1.4.6.jar`. A PLUGIN_ALTERNATIVES.md TODO calls for dropping EZShop. | Remove EZShop, migrate any active shops to AuctionHouse, document the migration in `docs/migrations/`. | 0.5d player communication, 1h technical |
| F-12 | **P2 / 10** | Spigot `entity-tracking-range`: monsters 96, misc 96. The roadmap suggests tightening to monsters=32, misc=16 for TPS / network savings. | Tune in the next maintenance window, re-baseline TPS with a `spark` profile. | 1h config, 1d to verify under load |
| F-13 | **P2 / 9** | 21 plugin folders without a matching jar (orphans): `bStats`, `CarbonChat`, `ComfyWhitelist`, `EpicGuard`, `Essentials`, `faststats`, `GrimAC`, `Homestead`, `Lands`, `LPC`, `MarriageMaster`, `MiniMOTD`, `Multiverse-Core`, `PhantomSMP`, `TAB`, `UltimateTimber`, `UnexpectedSpawn`, `Vault`, `WorldEdit`, plus `.bak-*` directories. Most have a renamed jar (`carbonchat-paper-...jar`, `EssentialsX-...jar`), so this is mostly cosmetic. `Lands`, `LPC`, `MarriageMaster`, `PhantomSMP`, `UltimateTimber`, `UnexpectedSpawn` are truly orphaned (jars not present). | Audit each: delete the data dirs of plugins truly removed; the bStats/Essentials/Vault names are normal. Document the plugin-name <-> jar-name pattern in `PLUGINS.md`. | 1h |
| F-14 | **P2 / 9** | No TPS Discord webhook alert (mentioned on the TODO list). spark is installed but auto-profile + alerting are not wired up. | spark already supports `spark profile --thresholds`; route to Discord via the existing webhook stack. | 0.5d |
| F-15 | **P2 / 8** | RCON output for async commands (CoreProtect, LuckPerms) does not return to the issuing rcon-cli session. Found while trying `co inspect` from RCON. Async command results land in the console only. | Document this in `docs/OPERATIONS.md` (does not exist yet — create it). For automation, attach to `docker logs -f minecraft-mc` in parallel. | 30 min doc |
| F-16 | **P2 / 8** | `gamerule keepInventory` could not be queried via `rcon-cli` due to an `execute in <world> run` argument-parsing bug in itzg's rcon-cli wrapper (or RCON quoting). State unknown without an in-game console. | Verify in-game as an op user, document the rcon-cli limitation. | 5 min in-game |
| F-17 | **P2 / 6** | `RCON_PASSWORD` is committed to `docker-compose.yml` in plaintext (`*redacted*`). The RCON port (25575) is bound to `127.0.0.1`, so the blast radius is local only — but the secret is still in git history. | Rotate the password, move it to `.env` (gitignored), confirm the `127.0.0.1`-only binding stays. | 30 min |
| F-18 | **P2 / 6** | `restart: unless-stopped` with no `start_period` re-evaluation on rapid OOM-restart loops. If the container OOMs every 60 s, Docker keeps restarting it indefinitely. | Add `restart_policy: { condition: on-failure, max_attempts: 5, window: 300s }` (compose v3+ deploy block) and a watchdog alert. | 30 min |

---
## Detailed Methodology

### Inputs inspected (read-only, no writes)

| Source | Path | Method |
|--------|------|--------|
| Container env | `docker inspect minecraft-mc` | host shell |
| docker-compose | `/opt/docker/minecraft/docker-compose.yml` | host cat |
| AuthLimbo config | `/data/plugins/AuthLimbo/config.yml` | `docker exec cat` |
| AuthLimbo logs | `/data/plugins/AuthLimbo/` (no log files exist; only `config.yml`) | `docker exec ls` |
| AuthMe config | `/data/plugins/AuthMe/config.yml` | `docker exec cat` |
| AuthMe DB record for YOU500 | `/data/plugins/AuthMe/authme.db` | `docker exec python3 sqlite3` |
| CoreProtect config | `/data/plugins/CoreProtect/config.yml` | `docker exec cat` |
| CoreProtect DB size | `/data/plugins/CoreProtect/database.db` | `docker exec du -sh` |
| Server log | `/data/logs/latest.log` | `docker exec grep` |
| Paper / Spigot / Purpur configs | `/data/config/paper-*.yml`, `/data/spigot.yml`, `/data/purpur.yml` | `docker exec cat` |
| World sizes | `/data/world*/` | `docker exec du -sh` |
| Backup script (deployed) | `/opt/docker/backup.sh` | host cat |
| Backup script (repo) | `/home/admin/ai-lab/_github/minecraft-server/scripts/backup.sh` | local cat |
| Backup output | `/opt/backups/` | host stat |
| Backup log | `/opt/backups/backup.log` | host tail |
| Live state | RCON `tps`, `list` | `docker exec rcon-cli` |
### YOU500 incident timeline (reconstructed from `latest.log`)

| Time (BST 2026-05-07) | Event |
|-----------------------|-------|
| 17:13:34 | Login from 45.157.234.219, UUID c7c2df8e-...-686b |
| 17:13:35 | Spawned in `auth_limbo` (0.5, 128, 0.5) per the AuthLimbo platform default |
| 17:13:38 | AuthMe: "YOU500 logged in" |
| 17:13:39 | AuthLimbo: "Restoring YOU500 to world(2380.4, 69.9, -11358.4)" |
| 17:13:39 | **`YOU500 left the confines of this world`** — void death |
| 17:13:39 | **`[AuthLimbo] teleportAsync returned false for YOU500 — Paper may have rejected the location.`** |
| 17:15:33 | Disconnect |
| 17:15:39 | Re-login from 82.22.5.229. The stored auth-loc has now been UPDATED to (-264.6, 86, -49.8) — different from the first attempt. Either the user `/sethome`'d previously or AuthMe overwrote it on the void death. |
| 17:15:44 | AuthLimbo: "Restoring YOU500 to world(-264.6, 86.0, -49.8)" — no WARN this time |
| 17:15:53 | Disconnect |
| 17:16:00 | Re-login from 82.22.5.230 |
| 17:16:05 | AuthLimbo: "Restoring YOU500 to world(-264.6, 86.0, -49.8)" |
| 17:16:28 | **`YOU500 was blown up by Creeper`** |
| 17:16:57 | Operator (s8n) via RCON: `tpa YOU500 -264 86 -50` + `tell YOU500 grab items fast 5min despawn` |
| 17:17:02 | RCON teleport executed |
| 17:18:22 | s8n in-game: `/tp2p YOU500 s8n` |

The void death at 17:13:39 is the data-loss event. AuthMe had `SaveQuitLocation: true`, so (2380, 70, -11358) was a real prior position, but the chunk was almost certainly not loaded yet (11k blocks out, no recent player there). `teleportAsync` returned false either because:

- the chunk failed to load within Paper's async generation budget, or
- the entity was already dead (the void death raced ahead of the teleport).
### What CoreProtect WOULD have caught (and didn't)

CoreProtect inventory tracking is enabled (`item-transactions: true`, `item-drops: true`, `item-pickups: true`, `rollback-items: true`). However:

- A death drops items into the world for ~5 min before they despawn. Drops are item entities, not container transactions; CoreProtect logs them as drops only if a player was the immediate cause of the drop.
- A death-drop in the `auth_limbo` world (where the void death happened) falls into y<0 air, which is itself a non-event for CP.
- Thus there was no item-rollback path even if `co inspect` had been run within minutes.

**Implication:** CoreProtect is the wrong tool for death-drop recovery. A real death-drop plugin or `keepInventory` is the only fix.
### Backup script forensics

- Deployed: 88 lines; the last block is "Prune old backups". No Minecraft block. No `umask 077`.
- Repo: 131 lines (with malformed lines 119-122 left over from a bad merge — ALSO a bug to fix on the next push). Has the Minecraft block. Has `umask 077`.
- `/opt/backups/backup.log` shows the last 5 days of "Backup complete" entries averaging 8-12 KB. None contain MC data. None mention MC. The log line `Configs: partial (some files missing)` is the configs section misfiring on Matrix paths; it was never the MC block.
- Last verified-good MC archive on host: `/opt/backups/mc-plugins-prerebrand-2026-04-30.tar.gz` (one-shot pre-rebrand snapshot; contents not verified in this audit).

---
## Action Items (Prioritised)

### P0 — this week (by 2026-05-14)
1. **F-01 / Backups.** Sync deployed backup.sh with repo. Fix the lines 119-122 corruption in repo first. Add post-run sentinel: `[ "$(stat -c%s mc-world-backup-*.tar.gz)" -gt 1073741824 ] || log "WORLD BACKUP TOO SMALL — ABORT"`. Run manual backup, verify >= 5 GB on disk. Test a restore into a scratch dir.
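A minimal sketch of that sentinel, written as a function so it can gate the run. Note that `stat -c%s` on a glob emits one size per match, so the newest archive should be picked explicitly first; the 1 GiB floor and `/opt/backups` paths are the finding's values, to be raised once a real baseline exists.

```shell
#!/usr/bin/env bash
# Sentinel for F-01: fail the backup run if the newest world archive is implausibly small.
min_bytes=$((1024 * 1024 * 1024))  # 1 GiB floor; real worlds should be >= 5 GB

check_world_archive() {
    local archive="$1" size
    size=$(stat -c%s "$archive" 2>/dev/null) || { echo "WORLD BACKUP MISSING: $archive"; return 1; }
    if [ "$size" -lt "$min_bytes" ]; then
        echo "WORLD BACKUP TOO SMALL ($size bytes): $archive"
        return 1
    fi
    echo "world backup ok ($size bytes)"
}

# Usage inside backup.sh: gate the run on the newest archive.
# newest=$(ls -t /opt/backups/mc-world-backup-*.tar.gz 2>/dev/null | head -1)
# check_world_archive "$newest" || exit 1
```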
2. **F-02 / Item-loss safety net.** Decide policy. Recommend: enable `keepInventory true` in `auth_limbo` world only (cheap, narrow), and write a 50-line AuthLimbo extension `OnPlayerDeath` listener that detects "death in auth_limbo" -> restore inventory snapshot taken at AuthMeAsyncPreLogin. Survival pain preserved everywhere else.
3. **F-03 / AuthLimbo recovery.** Bump WARN to ERROR. Wire to existing Discord webhook (per workspace memory: webhook stack on nullstone). On failure, write player snapshot to `auth_limbo/incidents/<uuid>-<ts>.dat`.
4. **F-04 / Chunk preload race.** Add chunk-loaded check + sync force-load before `teleportAsync`. If still false, kick with friendly message instead of letting the player drop into limbo.
5. **F-05 / OOM headroom.** Lower `-Xmx` to 14 GB and add `docker events` watcher.
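A sketch of the watcher half of this item. The webhook URL is a placeholder, and the only assumed API shape is that Discord webhooks accept a JSON body with a `content` field; the `docker events` filter and format fields are standard Docker CLI.

```shell
#!/usr/bin/env bash
# Sketch for F-05: alert on kernel OOM kills seen by Docker.
# WEBHOOK_URL is a placeholder -- point it at the existing Discord/Matrix hook.
WEBHOOK_URL="${WEBHOOK_URL:-https://discord.com/api/webhooks/CHANGE_ME}"

oom_payload() {
    # Discord webhooks accept a JSON body with a "content" field.
    printf '{"content":"OOM kill: container %s at %s"}' "$1" "$2"
}

watch_oom() {
    # Blocks forever; run under systemd or a tmux session on nullstone.
    docker events --filter event=oom \
        --format '{{.Actor.Attributes.name}} {{.Time}}' |
    while read -r name ts; do
        curl -fsS -H 'Content-Type: application/json' \
            -d "$(oom_payload "$name" "$ts")" "$WEBHOOK_URL"
    done
}
```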
6. **F-06 / Container hardening.** Add `cap_drop`, `pids_limit`, `no-new-privileges`. Boot test in a window.
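The compose-level change amounts to a fragment like the following. The service key name is an assumption (only the container name `minecraft-mc` is known from this audit); the values are F-06's recommendations and must be boot-tested before keeping.

```yaml
services:
  minecraft:            # service name assumed; container_name is minecraft-mc
    # F-06 hardening — test boot after applying
    cap_drop: [ALL]     # no cap_add needed: 25565 is an unprivileged port
    pids_limit: 4096
    security_opt:
      - no-new-privileges:true
```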
### P1 — this month
7. **F-07** CoreProtect prune cron, plan MySQL migration.
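The purge half of item 7 can be cron-driven through the container's rcon-cli. A sketch, with the exec prefix injectable so the command assembly is testable without Docker; the `co purge t:30d` syntax follows CoreProtect's docs and the container name follows this audit.

```shell
#!/usr/bin/env bash
# Sketch for F-07: monthly CoreProtect purge via RCON.
# MC_EXEC is injectable so the command assembly can be tested without Docker.
MC_EXEC="${MC_EXEC:-docker exec minecraft-mc rcon-cli}"

mc_cmd() { $MC_EXEC "$@"; }

coreprotect_purge() {
    # Drop entries older than the given window (default 30 days).
    mc_cmd "co purge t:${1:-30d}"
}

# Example cron entry (first of the month, 04:00; path hypothetical):
# 0 4 1 * * /opt/docker/coreprotect-purge.sh >> /opt/backups/coreprotect-purge.log 2>&1
```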
8. **F-08** SHA256 -> BCRYPT migration with legacyHashes fallback.
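Item 8 is a two-key config change. A sketch of the relevant `plugins/AuthMe/config.yml` fragment follows; the nesting under `settings.security` reflects my understanding of AuthMe's settings tree and should be verified against the deployed file before editing.

```yaml
settings:
  security:
    # New hashes are written as BCRYPT...
    passwordHash: BCRYPT
    # ...while existing SHA256 hashes keep verifying until re-hashed on login.
    legacyHashes:
      - SHA256
```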
9. **F-09** Document `online-mode=false` rationale in RULES.md.
10. **F-10** Consider LVM/ZFS snapshot for backup atomicity.
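F-10's `save-off` / `save-all flush` / tar / `save-on` sequence can be sketched as below. The rcon-cli invocation and the `world` directory under the bind mount are assumptions based on this audit's paths; nether/end dirs would be added the same way.

```shell
#!/usr/bin/env bash
# Sketch for F-10: a crash-consistent world tarball.
# MC_EXEC is injectable for testing; the real value runs rcon-cli in the container.
MC_EXEC="${MC_EXEC:-docker exec minecraft-mc rcon-cli}"
MC_DATA="${MC_DATA:-/home/docker/minecraft}"

mc_cmd() { $MC_EXEC "$@"; }

backup_name() { printf 'mc-world-backup-%s.tar.gz' "$(date -u +%Y%m%d_%H%M%S)"; }

consistent_world_backup() {
    local out_dir="$1"
    mc_cmd save-off            # stop chunk writes
    mc_cmd "save-all flush"    # force everything dirty to disk
    # save-on must run even if tar fails, hence the || true.
    tar -czf "$out_dir/$(backup_name)" -C "$MC_DATA" world || true
    mc_cmd save-on
}
```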
### P2 — this quarter

11. **F-11** Drop EZShop after a player communication window.
12. **F-12** Tighten entity tracking range, re-profile with spark.
13. **F-13** Clean orphan plugin folders.
14. **F-14** Wire spark TPS alerts to Discord.
15. **F-15** Document RCON async-command behaviour.
16. **F-17** Rotate RCON password, move to .env.
17. **F-18** Add restart-policy max_attempts.
---
## Open Questions for the Operator

1. **Inventory restoration policy.** Is silent `keepInventory` only in `auth_limbo` acceptable, or do you want a manual ops-restore-from-snapshot approval gate?
2. **YOU500 specifically.** Is there an out-of-band record of what they were carrying (Discord screenshot, witness)? If yes, manual NBT injection into player.dat is feasible. CoreProtect cannot help.
3. **Chunk preload trade-off.** Force-loading distant chunks at login adds 200-2000 ms to login time. Acceptable vs the void-death risk?
4. **MySQL for CoreProtect.** Adds an operational dependency (another container, another backup target). Worth the complexity, or is a monthly purge that keeps SQLite under 1 GB sufficient?
5. **RCON password rotation.** The committed value should be rotated on principle. Schedule a maintenance window?
6. **online-mode=false.** Confirm the long-term stance. Mojang ToS implications for racked.ru?
7. **Backups offsite.** Currently `/opt/backups/` is on the same host. Plan for an offsite copy (B2, restic to a friend's PC, anything)?
## What was NOT in scope this audit

- Network firewall, fail2ban, host-side security (nullstone-server has its own audit folder).
- Plugin source-supply-chain audit (covered by `docs/ROADMAP.md` "plugin acquisition overhaul").
- Performance profiling under load (deferred per F-12).
- LuckPerms permission graph correctness.
- Rules / chat-format / prefix audit (workspace memory: do NOT touch LP prefixes).
- Per-region (Lands / Homestead) data integrity.

---
## Sign-off

| Field | Value |
|-------|-------|
| Audit date | 2026-05-07 |
| Method | Read-only SSH inspection, no fixes applied |
| Workspace rule applied | "Audit findings -> docs first, then fix" |
| Next action | Operator review + go/no-go on each P0 item |
| Next audit due | 2026-08-07 (quarterly), or sooner once backups are remediated |
118
BACKUP-HUNT-2026-05-07.md
Normal file
@@ -0,0 +1,118 @@
# YOU500 Inventory Recovery — Backup Hunt Report

**Date:** 2026-05-07

**Player:** YOU500 (UUID `c7c2df8e-8783-30b5-891c-86ec9343686b`)

**Incident:** Full inventory loss at 17:13:39 BST. AuthLimbo `teleportAsync returned false`; the teleport from auth_limbo into the world failed → `YOU500 left the confines of this world` (void death). Vanilla `/data/world/playerdata` was overwritten on respawn with an empty inventory; a vanilla void death leaves no drops in the world.

**Host:** nullstone (192.168.0.100), live MC data at `/home/docker/minecraft/` (== `/opt/docker/minecraft/`, same FS, inode 18877649 confirmed).

**SSH user:** `user` (no sudo). All `/opt/backups/2026*` dated subdirs are root-owned 0700 → unreadable. `/var/lib/docker/volumes/` unreadable.

---
## Summary

**Recoverable backup exists: YES — partial.** The pre-rebrand world archive `/home/user/ai-lab/_archive/minecraft-old-2026-04-27.tar.gz` contains YOU500's playerdata `.dat` from **2026-03-25 18:53** (9617 B vs the current 9192 B; the larger size suggests a populated inventory). It is the **only known full-inventory snapshot for this UUID** anywhere on the host.

**Caveat:** This is a 6-week-old snapshot. Items gained between 2026-03-25 and 2026-05-07 17:13 are NOT recoverable from any file backup. **CoreProtect** is installed and has been logging since 2026-05-01 → use `/co inventory YOU500` and `/co rollback` to retrieve anything stored in containers post-2026-05-01.

**No scheduled world backups exist.** `/opt/docker/backup.sh` stopped backing up the MC world after 2026-05-02 (the world-backup branch was removed when the script was last edited; only configs/Matrix/RC are dumped now). Last MC-related tarball that landed on disk: `/opt/backups/20260430_020001/minecraft-configs-20260430_020001.tar.gz` (12 KB → configs only, no playerdata).

---
## Inventory of Backup Artifacts (oldest → newest)

| When | Path | Size | Owner | Contains YOU500 .dat? | Notes |
|------|------|------|-------|----------------------|-------|
| 2026-03-25 18:53 (file mtime inside) | `/home/user/ai-lab/_archive/minecraft-old-2026-04-27.tar.gz` | large (not recorded) | user | **YES** — `minecraft/world/playerdata/c7c2df8e-…dat` 9617 B + `.dat_old` 9616 B (2026-03-25 18:49) | **Best candidate.** 133 player .dat files, full world tree, Essentials/LitePlaytimeRewards/LandClaim DBs, advancements, stats. |
| 2026-04-30 02:01 | `/opt/backups/20260430_020001/minecraft-configs-20260430_020001.tar.gz` | 12 KB | root (UNREADABLE) | NO — configs only | Cannot read without sudo; the size implies no world data anyway. |
| 2026-04-30 02:01 | `/opt/backups/20260430_020001/configs-20260430_020001.tar.gz` | 2.4 KB | root | NO | Traefik/Matrix/RC configs. |
| 2026-04-30 19:21 | `/opt/backups/mc-plugins-prerebrand-2026-04-30.tar.gz` | 224 MB | user | NO playerdata `.dat` files. Has `plugins/AuthMe/playerdata/` (empty), `plugins/AuthMe.bak-20260430-144204/playerdata/` (empty), `plugins/SkinsRestorer/cache/YOU500.mojangcache`. Vanilla world NOT included. | Plugin trees only — useful for the password DB (`plugins/AuthMe.bak-…/authme.db`), not inventory. |
| 2026-05-03 02:00 | `/opt/backups/20260503_020001/configs-20260503_020001.tar.gz` | 2.4 KB | root | NO | Configs. |
| 2026-05-04 02:00 → 2026-05-07 02:00 | `/opt/backups/20260504_020001` … `20260507_020001` | 0700 dirs | root (UNREADABLE) | Inferred NO from the log: backup.log shows only "configs OK" / "Matrix Postgres skipping" / "Volumes skipping" — the world was not touched after 2026-05-02. | All four dirs report 12 KB. |
| 2026-05-07 17:15 | `/home/docker/minecraft/world/playerdata/c7c2df8e-…dat_old` | 9181 B | uid 101000 | YES — but POST-DEATH (empty inventory). | Identical to the live state right after the first respawn. |
| 2026-05-07 17:21 | `/home/docker/minecraft/world/playerdata/c7c2df8e-…dat` | 9192 B | uid 101000 | YES — current live, empty inventory. | |
| 2026-05-07 17:15 | `/tmp/you500.dat` | 9181 B | user | YES — but byte-identical in size to `.dat_old`; gunzip strings show only the base attribute schema (no item/Slot tags) → already empty. | Someone (you) already extracted the empty post-death dat. Useless for recovery. |
### Misc archives checked, NOT relevant

- `/opt/source-endpoint/source.tar.gz` — Misskey AGPL source dump.
- `/opt/backups/misskey/*` — Misskey DB/files.
- `/home/user/ai-lab/.stversions/_projects/_minecraft/launcher/java/java21.tar~*.gz` — JDK.
- `/home/user/ai-lab/_projects/_minecraft/resources/racked.ru.-.minecraft.7z` — launcher resources.
- `/home/user/ai-lab/.stversions/**` — Syncthing versions hold only **server config files** (`server.properties`, `bukkit.yml`, `purpur.yml`, etc.) under `_github/online/minecraft-server/config/`. **No `.dat` or `playerdata/`** anywhere in `.stversions`. `.stignore` does not list `world/`, but the synced repo never contained the world dir to begin with (it's `_github/minecraft-server/` = configs + docker-compose only).

---
## CoreProtect — Live Rollback Source

| Path | Size | Born | Last modified |
|------|------|------|---------------|
| `/data/plugins/CoreProtect/database.db` (in container) | 1.59 GB | 2026-05-01 10:11:53 | 2026-05-07 17:27 |

CoreProtect has logged container interactions, item drops, deaths, and inventory changes since **2026-05-01**. For YOU500's items stored in chests/shulkers/ender chests within the world, an in-game rollback can recover them:

- Inspect deaths: `/co lookup user:YOU500 action:#kill time:1d`
- Inspect inventory transactions: `/co inventory YOU500` (CoreProtect-CE feature)
- Rollback drops/voids near death: `/co rollback time:1h user:YOU500 radius:#global action:-drop,#kill`

(Items YOU500 carried in person and lost to the void at 17:13:39 are unlikely to appear in CoreProtect — a vanilla void death deletes drops without a kill event in some versions, so CoreProtect's `#kill` may or may not have logged it. Worth a `/co lookup user:YOU500 time:30m` to confirm.)

---
## Best Recovery Candidate

**File:** `/home/user/ai-lab/_archive/minecraft-old-2026-04-27.tar.gz`

**Internal path:** `minecraft/world/playerdata/c7c2df8e-8783-30b5-891c-86ec9343686b.dat`

**Snapshot date:** 2026-03-25 18:53 (~6 weeks before the incident).

### Extraction command (DO NOT RUN — for review only)
```bash
# Extract just the YOU500 dat to a staging area, do NOT touch live data
mkdir -p /tmp/you500-recovery
tar -xzvf /home/user/ai-lab/_archive/minecraft-old-2026-04-27.tar.gz \
    -C /tmp/you500-recovery \
    minecraft/world/playerdata/c7c2df8e-8783-30b5-891c-86ec9343686b.dat \
    minecraft/world/playerdata/c7c2df8e-8783-30b5-891c-86ec9343686b.dat_old

# Confirm and inspect (NBT viewer or zcat | strings) before any restore
ls -la /tmp/you500-recovery/minecraft/world/playerdata/
zcat /tmp/you500-recovery/minecraft/world/playerdata/c7c2df8e-8783-30b5-891c-86ec9343686b.dat \
    | strings | grep -E 'Slot|count|minecraft:diamond|minecraft:netherite' | head -40
```
### Restore plan (operator decision — NOT executed)

1. Stop the server (or kick YOU500) so the file is not held open.
2. With sudo (uid 101000 owns the file): copy the extracted `.dat` over `/home/docker/minecraft/world/playerdata/c7c2df8e-8783-30b5-891c-86ec9343686b.dat`, preserving mode/owner.
3. Also overwrite `.dat_old`.
4. Optional: replace `Essentials/userdata/c7c2df8e-…yml` from the same archive if the YML matters.
5. Restart the server. The player rejoins with their March 25 inventory and position.

**Tradeoff:** YOU500 will lose all progress from 2026-03-25 → 2026-05-07. Communicate before applying. Combine with a CoreProtect rollback to minimise the loss.
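Steps 2-3 can be sketched like this (run as root on nullstone). The safety-copy naming is my addition; the paths in the commented usage follow the plan above.

```shell
#!/usr/bin/env bash
# Sketch of the restore: overwrite the live .dat, keeping a safety copy first.
restore_dat() {
    local src="$1" dest="$2"
    [ -f "$src" ] || { echo "source missing: $src" >&2; return 1; }
    # Keep the current (post-death) file around, even though it is known-empty.
    [ -f "$dest" ] && cp -p "$dest" "$dest.pre-restore.$(date +%s)"
    # -p preserves mode and, when run as root, the uid-101000 ownership.
    cp -p "$src" "$dest"
}

# pd=/home/docker/minecraft/world/playerdata
# uuid=c7c2df8e-8783-30b5-891c-86ec9343686b
# restore_dat /tmp/you500-recovery/minecraft/world/playerdata/$uuid.dat     "$pd/$uuid.dat"
# restore_dat /tmp/you500-recovery/minecraft/world/playerdata/$uuid.dat_old "$pd/$uuid.dat_old"
```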
---
## Gaps

- **No scheduled world backups since 2026-05-02.** `/opt/docker/backup.sh` no longer dumps `world/`. The 2026-04-30 daily contains a 12 KB "minecraft-configs" tarball (configs, not world). Action: re-add a world tarball to the daily script.
- **No off-host backup.** No restic / borg / duplicity / rsnapshot installed. No rclone. No second host pulling MC data. Syncthing does not sync the world dir.
- **No filesystem snapshots.** Root is ext4 on LVM (no LVM thin-pool snapshots in use); `/home` is ext4 (no btrfs/ZFS).
- **`/var/lib/docker/volumes/` is unreadable** without sudo. `docker volume ls | grep -iE 'mine|back|world'` returns empty, confirming named volumes are not used for MC (bind mount only).
- **`/opt/backups/2026*_020001` subdirs are unreadable** (mode 0700, root). Cannot diff their contents byte-for-byte; relied on `backup.log` text + indirect listing. They almost certainly contain only configs (12 KB dirs; log entries match).
- **`docker exec minecraft-mc env | grep -i backup` returned nothing** — no env-driven autosave/backup plugin is enabled (the `itzg/mc-backup` sidecar is absent; no AutomatedBackup / EasyBackup jar in `/data/plugins`).
- **AuthMe `playerdata/` dirs are empty** in both live and `.bak-20260430-144204` — AuthMe is configured without inventory protection (no logged-out inventory snapshots).
- **No InvSee / InventoryRollback plugin.** Only CoreProtect (logs, not snapshots).

---
## Permission-Limited Reads (no sudo via SSH)

| Path | What we couldn't see | Likely contents |
|------|----------------------|-----------------|
| `/opt/backups/20260504_020001/` … `20260507_020001/` | Directory listings (0700 root) | Daily configs tarballs, ~12 KB each — confirmed via `du` in backup.log |
| `/opt/backups/20260430_020001/minecraft-configs-20260430_020001.tar.gz` | tar listing (root-owned, 0600) | MC config bind-mount tarball, 12914 B |
| `/var/lib/docker/volumes/` | Directory listing | Named volumes — not used by MC (bind mount only) |
| `/var/backups/` (host) | Listing | Standard Debian dpkg/apt backups, not MC |
| `/root/` | Anything | — |

Re-run with `sudo` if any of these need confirmation, but their contents are unlikely to change the conclusion.

393 BACKUP-STRATEGY.md Normal file
@@ -0,0 +1,393 @@

# Minecraft Backup Strategy — racked.ru on nullstone

**Status:** PROPOSAL (2026-05-07) — not yet implemented.
**Author trigger:** Player lost full inventory to void death today; rollback impossible because the existing 02:00 daily backup had **silently failed for 5 of the last 7 days** and there is **zero off-host copy**.
**Owner:** `s8n` (operator).
**Target host:** `nullstone` (192.168.0.100, Debian 13 trixie).

---

## 0. Current state (audited 2026-05-07)

Existing system in `/opt/docker/backup.sh` + `cron.d/docker-backup` (02:00 daily, 7-day retention in `/opt/backups/`).

Findings from `/opt/backups/backup.log`:

| Date | MC world result | Backup dir total |
|------|-----------------|------------------|
| 2026-04-26 | FAILED | — |
| 2026-04-27 | FAILED | — |
| 2026-04-28 | FAILED | — |
| 2026-04-29 | OK (3.6 G) | — |
| 2026-04-30 | FAILED | — |
| 2026-05-01 | FAILED | — |
| 2026-05-02 | OK (3.6 G) | — |
| 2026-05-03 | (no MC log line) | 8 K |
| 2026-05-04 | (no MC log line) | 8 K |
| 2026-05-05 | (no MC log line) | 8 K |
| 2026-05-06 | (no MC log line) | 12 K |
| 2026-05-07 | (no MC log line) | 12 K |

After 2026-05-02 the entire MC block stopped emitting log lines. The script appears to be exiting before reaching it (the duplicated stray `chmod 600 ... synapse-signing-key` lines at L119–122 are orphaned from a botched edit and may now break `set -e`). Effective state: **two MC backups in the last 12 days**, both already pruned by 7-day retention. **No usable backup exists right now.**

Cross-references:
- `_github/infra/STATE.md` Top-5 weakness #2 ("backup.sh broken silently") and #5 ("No off-host backup").
- `_github/infra/runbooks/MIGRATION-nullstone-to-cobblestone.md` §5 already names this `F-backup-1` and proposes "Restic + autorestic to B2/Wasabi or to nullstone-as-spare". This strategy refines that to use on-hand resources rather than paid storage.

### Available resources (no purchasing required)

| Asset | Location | Free | Reachability | Role |
|---|---|---|---|---|
| nullstone `/home` | local NVMe (ext4 LVM) | 142 G of 399 G | local | Primary repo + restic cache |
| onyx `/home` | LUKS NVMe | 1.6 T of 1.9 T | Tailscale 100.64.0.1 (LAN ~5 ms) | **Off-host primary** |
| friend RTX 4080 PC | DESKTOP-LR0RILA | unknown (Windows, large) | Tailscale 100.64.0.3 (WAN, IP-stable via tailnet) | **Off-host secondary** (defer) |
| nullstone `/opt/backups` | same disk as `/opt/docker` | 142 G | local | *Not* a real backup target — same-disk SPOF |

**No purchased B2 / Wasabi / S3 in this proposal.** Tailscale + onyx covers off-host today. B2 stays in the future-options annex.

---

## 1. Threat model

| # | Threat | Concrete example | Frequency | Mitigation in this plan |
|---|---|---|---|---|
| T1 | Player accidental loss (void death, lava, fall) | YOU500, 2026-05-07 | weekly | 5-min playerdata snapshots (RPO ≤ 5 min) |
| T2 | Griefing / theft / chest emptied by ban-evader | possible | monthly | 5-min playerdata + 1-h world snapshots |
| T3 | World corruption (chunk error, region-file truncate) | rare | — | 6-h pre-flight validated full world snapshot |
| T4 | Plugin / config bad change (LuckPerms wipe, server.properties) | edits during ops | weekly | daily configs + DB dump + git history (`live-server/` repo) |
| T5 | Host disk failure (single NVMe) | low/year | — | nightly off-host copy to onyx (Tailscale) |
| T6 | Ransomware / host compromise | low | — | append-only Restic repo on onyx; nullstone holds **no** delete key |
| T7 | Operator `rm -rf` or wrong `docker compose down -v` | low | — | retention floor (4 weekly + 12 monthly) survives a recent rm |
| T8 | Backup script silently failing (current state) | OBSERVED | — | heartbeat alert + monthly restore drill (§7) |

T8 is the one that just bit us. The single most important addition is **alerting on missed runs**, not the storage tech.

---

## 2. RPO / RTO

| Class | Data | RPO | RTO | Backup mechanism |
|---|---|---|---|---|
| A | playerdata (`world/playerdata/*.dat`, `stats/`, `advancements/`) | **5 min** | < 2 min per player | rcon `save-all flush` → rsync to local snapshot, then restic-add |
| B | full world (region files, end + nether) | **1 h** during play, **6 h** otherwise | 15 min | restic of `world*/` |
| C | plugin configs + LuckPerms YAML | 24 h | 30 min | tar of `plugins/*/config*.yml` + LP file dump |
| D | LuckPerms / Homestead SQLite DBs (`*.db`, `homestead_data.db`) | 1 h | 5 min | sqlite `.backup` then restic-add |
| E | host-level configs (`docker-compose.yml`, `server.properties`, `purpur.yml`, `bukkit.yml`, `paper-*.yml`, `whitelist.json`, `ops.json`, `banned-*.json`, `config/`) | 24 h | 5 min | already in git repo `_github/minecraft-server/`; backup just covers drift |

**Justification for RPO=5 min on Class A:** the void-death case rebuilds in seconds — recovering one `<uuid>.dat` is a ~30 s operation if a 5-min-old snapshot exists. Snapshotting just the 1.3 MB `playerdata/` dir is cheap (single-digit MB/day after dedup).

---

## 3. Tool choice — Restic

Compared:

| Tool | Dedup | Encryption | Snapshots | Network destinations | Verdict |
|---|---|---|---|---|---|
| **restic** | content-addressed, very effective on MC region files | AES-256, repo-key | yes | sftp (Tailscale), local, B2, S3, Azure, rclone | **WINNER** |
| borgbackup | similar | yes | yes | ssh only, lock-on-write | Equally good; restic chosen because operator already plans `restic + autorestic` per `infra/STATE.md` line 112; sftp dest is simpler than borg's required server-side binary |
| rsnapshot | hardlinks, no dedup | none | rotated dirs | local + rsync | No encryption ⇒ off-host copy over Tailscale (already encrypted) is fine, but no dedup means 18 G × N snapshots is painful. Reject. |
| zfs send | block-level | (zfs native) | snapshots | yes | nullstone is **ext4/LVM**, no ZFS, no btrfs. Reject. |
| LVM snapshot | COW | none | yes | local only | Same-disk only, doesn't survive disk failure. Useful as a *staging* primitive only. |
| custom rsync + cp -al | hardlinks | none | yes | yes | Reinventing rsnapshot. Reject. |
| itzg `BACKUP_*` env | tar to volume | none | rotation | local | Already tried in spirit by current `backup.sh`; same-disk; not granular. Reject as primary. |

**Decision:** `restic` for Classes A, B, C, D. Continue using a thin tar wrapper for Class E (configs are already in the git repo, this is just safety).

Restic strengths for our case:
- Region files dedup *very* well (chunks unchanged across snapshots).
- A 5-min Class-A snapshot adds only the changed blocks (kilobytes) to the repo, not a fresh 1.3 MB copy each run.
- One repo on local disk + one mirror to onyx via `rclone serve restic` or direct `sftp:` — no agent needed on onyx beyond ssh.
- `restic check --read-data-subset=5%` is the canonical scrub.

Apt: `apt install restic` on trixie ships 0.16.x — sufficient.

---

## 4. Schedule

All times Europe/London (matches `TZ` in compose file).

| Job | Cadence | Source | Destination | Mechanism |
|---|---|---|---|---|
| **A — playerdata** | every **5 min** | `world/playerdata/`, `world/stats/`, `world/advancements/`, `world*/level.dat`, `*.db` (LP+homestead) | restic repo `/home/user/restic/mc-frequent/` | systemd timer `mc-backup-frequent.timer` |
| **B — full world** | every **1 h** during play (07:00–01:00), **6 h** otherwise | `world/`, `world_nether/`, `world_the_end/` | restic repo `/home/user/restic/mc-world/` | systemd timer `mc-backup-world.timer` |
| **C — configs + plugins** | **daily 02:00** | `/opt/docker/minecraft/*.yml`, `*.json`, `plugins/*/config*.yml`, `plugins/LuckPerms/`, `docker-compose.yml` | restic repo `mc-world` (path-tagged) | reuse same timer with second backup target |
| **D — DB dumps** | every **1 h** | `homestead_data.db`, `plugins/CoreProtect/database.db`, `plugins/LuckPerms/luckperms-h2-*` | restic repo `mc-world` | timer hooks `sqlite3 .backup` first |
| **E — off-host mirror** | **nightly 03:30** | nullstone `/home/user/restic/` | onyx `100.64.0.1:/home/admin/backups/nullstone-mc-restic/` | `restic copy` over sftp (Tailscale) — append-only key on onyx side |
| **F — verify** | **weekly Sun 04:00** | both repos | — | `restic check --read-data-subset=5%` then alert on rc |
| **G — drill** | **monthly 1st Sat 11:00** | random snapshot | scratch dir | §7 procedure |

### Why this works for the void-death case

T1 hits at 18:42. By 18:45 a Class-A snapshot exists containing the player's `<uuid>.dat` from 18:40. Restore: `restic -r ... restore --target /tmp/r --include 'world/playerdata/<uuid>.dat' latest`, stop server (or `/save-off` plus a manual file swap), copy the file into place, `/save-on`. Total RTO < 2 min.

---

## 5. Retention

Restic policy (passed to `restic forget --keep-*`):

```
--keep-last 24      # 24 most recent (covers 2h of 5-min snapshots)
--keep-hourly 24    # 24h of hourly
--keep-daily 7      # 7 days
--keep-weekly 4     # 4 weeks
--keep-monthly 12   # 12 months
```

Applied per-tag — Class A snapshots tagged `playerdata`, B/C/D tagged `world`. Forget is run **only on the local repo**; the onyx mirror inherits via `restic copy` with same policy after the local forget+prune.

### Storage budget

- Class A: 1.3 MB raw × dedup (~20× on `.dat`, mostly empty NBT slots) → ~70 KB / snapshot **net**.
- 12/h × 24 h × 7 d = 2 016 snapshots/week → < 150 MB/week.
- Class B/C/D: 18 G raw → ~6.5 G compressed (per current 3.6 G figure × adjustment for nether/end now active). Restic dedup on hourly snapshots: ~50–200 MB delta/snapshot during active play.
- 24 hourly + 7 daily + 4 weekly + 12 monthly ≈ 47 retained → estimate **15–25 GB total** at steady state.
- E (off-host): same as above on onyx (1.6 TB free — ~60× headroom, per §10).

**Conclusion:** comfortably fits in nullstone's 142 G free. Onyx is essentially unconstrained.
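
The bullet arithmetic can be double-checked in a few lines (every number here is this section's estimate, not a measurement):

```python
# Sanity-check the Class-A budget: snapshots per week at a 5-min cadence,
# implied weekly repo growth, and the retained-snapshot count from §5.
SNAPSHOT_INTERVAL_MIN = 5
per_hour = 60 // SNAPSHOT_INTERVAL_MIN      # 12 snapshots/hour
per_week = per_hour * 24 * 7                # snapshots/week
net_kb_per_snapshot = 70                    # ~70 KB net after dedup (estimate above)
weekly_mb = per_week * net_kb_per_snapshot / 1024

retained = 24 + 7 + 4 + 12                  # hourly + daily + weekly + monthly (§5)
print(per_week, round(weekly_mb), retained)
```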

---

## 6. Off-host destination — onyx via Tailscale

**Choice:** `onyx` (100.64.0.1, 1.6 TB free on `/home`). Reasons:
- Already in the tailnet (`tag:admin`), already trusted, already SSH-reachable.
- 1.6 TB is 100× the dataset.
- Operator's daily-driver: a missed-backup alert on onyx is *seen*.
- Deferred (phase 2): replicate to friend's RTX 4080 PC (100.64.0.3) for true geographic separation. Tailnet IP is stable across the friend's ISP IP changes per memory `project_friend_gpu`.

**Mechanics:**
1. On onyx: create restricted user `mc-backup` with `~/backups/nullstone-mc-restic/` and a `~/.ssh/authorized_keys` entry that **only allows `internal-sftp` chrooted to that dir**, no shell, no port-forward. (`Match User mc-backup ... ChrootDirectory %h, ForceCommand internal-sftp -d /backups/nullstone-mc-restic`).
2. On nullstone: install nullstone's ssh public key on onyx for that user. Ideally the onyx repo is append-only, so a compromised nullstone cannot run `forget`/`prune` against it. A plain sftp backend cannot enforce that (restic's real append-only mode comes from `rest-server --append-only`, which onyx isn't running), so the practical compromise for now: rely on `restic copy` being add-only, keep the account locked to the chrooted sftp dir, and audit any `forget` runs.
3. Nightly job on nullstone: `restic -r sftp:mc-backup@100.64.0.1:/backups/nullstone-mc-restic copy --from-repo /home/user/restic/mc-world latest && ... mc-frequent ...`.
4. Onyx-side cron weekly: `restic check` on the mirror (independent verification).
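
Step 1's sshd restriction could look like the following fragment (a sketch of the Match block described above; directive names are standard OpenSSH `sshd_config`, paths are the ones named in this section):

```
# /etc/ssh/sshd_config.d/mc-backup.conf on onyx (sketch)
Match User mc-backup
    ChrootDirectory %h
    ForceCommand internal-sftp -d /backups/nullstone-mc-restic
    AllowTcpForwarding no
    X11Forwarding no
    PermitTTY no
```

Note that `ChrootDirectory` requires the chroot target to be root-owned and not writable by group/other — which is also what gives the partial append-only property discussed in step 2.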

**Why not friend's GPU PC?** Windows host, no built-in SSH, asymmetric trust. Defer to phase 2 once an SMB or `rclone serve` target is set up there.

---

## 7. Restore drill (monthly, 1st Saturday 11:00)

Runbook: `docs/RUNBOOK-BACKUP-RESTORE.md` (created alongside this proposal).

Drill scenario: "YOU500 lost his inventory to a void death 6 minutes ago." Steps:

1. Pick a known UUID from `world/playerdata/` (operator's own UUID).
2. `restic -r /home/user/restic/mc-frequent snapshots --tag playerdata | tail -5` — confirm freshest snapshot is ≤ 6 min old.
3. `restic -r ... restore latest --target /tmp/drill-$(date +%s) --include 'world/playerdata/<uuid>.dat'`.
4. `nbted` or `python -m nbtlib` parse the `.dat` — confirm it's a valid GZIP NBT structure (not zero bytes, not partial).
5. `diff` against the live `.dat` — log the differences (expected: at least the inventory NBT path differs because player kept playing).
6. Repeat from the **onyx mirror** repo to prove off-host works end-to-end.
7. Log result to `docs/RUNBOOK-BACKUP-RESTORE.md` § Drill log.

Drill is **non-destructive** — never overwrite live `.dat` during a drill. Real restores follow §3 of the runbook.

Pass criteria: both restores complete in < 2 min wall-clock and the parsed NBT root tag is well-formed.
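
Step 4's validity check can also be sketched without nbtlib, using only the stdlib: a restored player `.dat` should be non-empty gzip data whose decompressed stream starts with an NBT compound tag (byte `0x0A`). The function name and checks here are illustrative, not the runbook's exact procedure:

```python
import gzip
import zlib

def looks_like_player_dat(path: str) -> bool:
    """Cheap drill check: non-empty, gzip-wrapped, NBT compound root."""
    with open(path, "rb") as f:
        raw = f.read()
    if len(raw) < 2 or raw[:2] != b"\x1f\x8b":   # gzip magic bytes
        return False
    try:
        nbt = gzip.decompress(raw)               # raises on corrupt/truncated data
    except (OSError, EOFError, zlib.error):
        return False
    return len(nbt) > 0 and nbt[0] == 0x0A       # TAG_Compound root
```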

---

## 8. Implementation — concrete drafts

Two layers: a **fix** to the existing daily script (Class C/E) and a **new sidecar timer** for Classes A/B/D.

### 8.1 Fix `/opt/docker/backup.sh` (F-backup-1)

Already documented in `infra/runbooks/MIGRATION-nullstone-to-cobblestone.md` §5. Minimum work:
- Drop dead `matrix-postgres` block (Synapse retired).
- Drop / fix `mongodb` block (RC stopped 2026-05-06).
- Remove orphaned `chmod 600 ...synapse-signing-key...` block at L119–122 (causing `set -e` exit before MC block on most days).
- Wrap each module in `( ... ) || log "module FAILED"` so one module's failure doesn't skip the rest.
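
That wrapping pattern in isolation (module bodies here are stand-ins, not the real `backup.sh` modules): under `set -e` a bare failing command aborts the whole script, while a subshell with a logging fallback lets the next module — the Minecraft block — still run.

```bash
#!/usr/bin/env bash
set -euo pipefail

LOG=""
log() { LOG="${LOG}${1};"; }

# A failing module (stand-in for the broken synapse block): the || keeps
# set -e from killing the script here.
( false ) || log "synapse FAILED"

# The next module still runs (stand-in for the Minecraft world tarball).
( true ) && log "minecraft OK"

echo "$LOG"
```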

Out-of-scope for this strategy doc — track in infra audit.

### 8.2 New: `mc-backup-frequent` (Class A) and `mc-backup-world` (Classes B/C/D)

Drop-in files (operator review before deploy):

**`/etc/systemd/system/mc-backup-frequent.service`**
```ini
[Unit]
Description=Minecraft frequent backup (playerdata, every 5 min)
After=docker.service
Wants=docker.service

[Service]
Type=oneshot
User=user
Group=docker
EnvironmentFile=/etc/mc-backup.env
ExecStart=/usr/local/bin/mc-backup-frequent.sh
Nice=10
IOSchedulingClass=best-effort
IOSchedulingPriority=7
```

**`/etc/systemd/system/mc-backup-frequent.timer`**
```ini
[Unit]
Description=Run mc-backup-frequent every 5 minutes

[Timer]
OnBootSec=2min
OnUnitActiveSec=5min
AccuracySec=30s
Persistent=true

[Install]
WantedBy=timers.target
```

**`/etc/mc-backup.env`** (mode 0600, owner `user:docker`)
```
RESTIC_REPOSITORY_FREQUENT=/home/user/restic/mc-frequent
RESTIC_REPOSITORY_WORLD=/home/user/restic/mc-world
RESTIC_PASSWORD_FILE=/etc/mc-backup.pw
MC_DATA=/opt/docker/minecraft
RCON_HOST=127.0.0.1
RCON_PORT=25575
RCON_PASS=*redacted*
HEARTBEAT_URL=https://ntfy.s8n.ru/mc-backup-frequent
ALERT_URL=https://ntfy.s8n.ru/mc-backup-alerts
TS_OFFHOST_USER=mc-backup
TS_OFFHOST_HOST=100.64.0.1
TS_OFFHOST_PATH=/backups/nullstone-mc-restic
```

**`/usr/local/bin/mc-backup-frequent.sh`** (note the explicit `-r "$RESTIC_REPOSITORY_FREQUENT"`: restic reads `RESTIC_REPOSITORY`, not our `_FREQUENT`/`_WORLD` variants, so the repo must be passed per call)
```bash
#!/usr/bin/env bash
set -euo pipefail
. /etc/mc-backup.env

trap 'curl -fsS -m 10 -d "fail rc=$?" "$ALERT_URL" >/dev/null || true' ERR

# 1. Ask MC to flush via rcon (best-effort; don't fail backup if rcon down)
if command -v mcrcon >/dev/null 2>&1; then
    mcrcon -H "$RCON_HOST" -P "$RCON_PORT" -p "$RCON_PASS" -w 1 \
        "save-all flush" >/dev/null 2>&1 || true
fi

# 2. Snapshot just the small fast-changing things
restic -r "$RESTIC_REPOSITORY_FREQUENT" backup \
    --tag playerdata \
    --tag auto-5min \
    --host nullstone \
    --exclude='*.lock' \
    "$MC_DATA/world/playerdata" \
    "$MC_DATA/world/stats" \
    "$MC_DATA/world/advancements" \
    "$MC_DATA/world/level.dat" \
    "$MC_DATA/world_nether/level.dat" \
    "$MC_DATA/world_the_end/level.dat" \
    "$MC_DATA/homestead_data.db" \
    "$MC_DATA/plugins/LuckPerms" \
    "$MC_DATA/plugins/CoreProtect/database.db" 2>/dev/null || true

# 3. Cheap retention (only on local repo)
restic -r "$RESTIC_REPOSITORY_FREQUENT" forget --tag auto-5min \
    --keep-last 24 --keep-hourly 24 --keep-daily 7 \
    --prune --quiet

# 4. Heartbeat — alert if NOT received in 15 min via ntfy server
curl -fsS -m 5 "$HEARTBEAT_URL" >/dev/null || true
```

**`mc-backup-world.{service,timer,sh}`** — same shape, runs hourly during play / 6h otherwise (use `OnCalendar=*-*-* 07,08,...,01:00:00` or two timers), backs up full `world*/`, configs, DB dumps. After local backup, runs:

```bash
restic copy \
    --from-repo "$RESTIC_REPOSITORY_WORLD" \
    -r "sftp:$TS_OFFHOST_USER@$TS_OFFHOST_HOST:$TS_OFFHOST_PATH" \
    latest
```

And once nightly (separate timer) the same `copy` for `mc-frequent`.
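
One way to express that cadence in a single timer (a sketch, assuming the 07:00–01:00 play window from §4; systemd calendar expressions accept hour ranges): hourly triggers from 07:00 through 01:00 leave exactly one six-hour gap, 01:00 → 07:00, which matches the off-hours cadence without a second timer.

```ini
# mc-backup-world.timer (sketch) — hourly during play hours
[Timer]
OnCalendar=*-*-* 07..23:00:00
OnCalendar=*-*-* 00..01:00:00
Persistent=true
```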

### 8.3 docker-compose.override.yml — alternative path (rejected)

Considered: itzg image supports `BACKUP_INTERVAL`, `BACKUP_METHOD=restic`. Pros: in-container, knows when world is loaded. Cons:
- Bind-mount to host restic repo crosses userns-remap boundary (uid 100000 vs host uid 1000) — already a known nullstone footgun (memory `project_nullstone_docker_userns`).
- Container restart wipes restic cache, slow first run after every reboot.
- Mixing in-image and host-cron backup logic doubles failure surfaces.

**Decision:** keep backups in systemd on the host; container is unaware. Override file is **not** part of this proposal.

---

## 9. Monitoring & alerting

All signals route to ntfy on the existing self-hosted `ntfy.s8n.ru` (assumed to exist; if not, add as part of phase 1 — single-container deploy). DiscordSRV was dropped on 2026-04-30 per README.md L170, so Discord is not an option.

| Signal | Trigger | Channel |
|---|---|---|
| `mc-backup-frequent` heartbeat | timer fires successfully | ntfy topic `mc-backup-frequent` (silent on success) |
| Heartbeat **missing > 15 min** | dead-man's switch on ntfy server, or external (`healthchecks.io` is free + self-hostable) | ntfy topic `mc-backup-alerts` (high priority) |
| `restic check` weekly | non-zero rc | ntfy topic `mc-backup-alerts` (high priority) |
| Off-host mirror failure | `restic copy` non-zero rc | ntfy topic `mc-backup-alerts` (high priority) |

Operator subscribes onyx + phone to `mc-backup-alerts` only. The `-frequent` topic is a heartbeat sink (not a notification stream).

**Alternative if no ntfy yet:** write to `/var/log/mc-backup.log` AND a tiny status file `/var/lib/mc-backup/last-success` (mtime checked by an external monitor — Gatus on roadmap, Beszel on roadmap). Until either of those lands, a simple cron on **onyx** doing `ssh user@nullstone 'find /var/lib/mc-backup/last-success -mmin -15 | grep .'` and triggering a desktop `notify-send` is enough.
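
The monitor side of that fallback reduces to one mtime comparison; a sketch (the status-file path and 15-minute threshold come from this section, the `notify-send` call is left to the caller):

```python
import os
import time

STATUS_FILE = "/var/lib/mc-backup/last-success"   # touched by the backup job
MAX_AGE_S = 15 * 60

def heartbeat_fresh(path: str = STATUS_FILE, max_age_s: int = MAX_AGE_S) -> bool:
    """True if the last-success marker was touched within the window."""
    try:
        age = time.time() - os.path.getmtime(path)
    except FileNotFoundError:
        return False                               # never succeeded => alert
    return age <= max_age_s
```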

This addresses T8 (the silent-failure threat) directly.

---

## 10. Cost & capacity

**Hardware cost:** £0. Uses existing nullstone NVMe + onyx NVMe + existing Tailscale mesh.

**Disk consumption (steady state, both repos):**

| Where | Estimate | Headroom |
|---|---|---|
| nullstone `/home/user/restic/mc-frequent` | < 1 GB | 142 G free → ~140× |
| nullstone `/home/user/restic/mc-world` | 15–25 GB | ~6× |
| onyx `~/backups/nullstone-mc-restic/` | 16–26 GB | 1.6 T free → ~60× |

**Days of retention given current free space:** even if the world doubles to 36 GB raw, dedup keeps growth linear at ~5 % per snapshot — well over a year of monthly retention fits.

**Network:** Tailscale LAN-direct (5 ms onyx ↔ nullstone). Nightly delta typically < 500 MB after dedup. Negligible.

**Operator time:** ~2 h initial deploy, ~10 min/month for the drill, ~zero on autopilot.

---

## 11. Phase plan

| Phase | What | When | Blocker |
|---|---|---|---|
| 0 | This doc + runbook stub written, reviewed | TODAY | — |
| 1 | Stop the bleeding: fix `backup.sh` orphan lines so daily MC tar at least runs again | TODAY (15 min) | — |
| 2 | Stand up `mc-backup-frequent` timer + local restic repo (Class A) | this week | needs `apt install restic mcrcon` |
| 3 | Add `mc-backup-world` timer + Class B/C/D | this week | — |
| 4 | Onyx off-host SFTP target + `restic copy` job | this week | onyx user provisioning + ssh key |
| 5 | First monthly drill | next 1st Saturday | — |
| 6 | Wire ntfy alerts | when ntfy/Gatus deployed (infra roadmap) | external |
| 7 | Friend RTX 4080 PC as second off-host (geographic) | phase 2 | Windows-side tooling |

Phases 1–4 are doable today with what's on hand. Nothing in phases 1–5 requires purchasing.

---

## 12. Open questions for operator

1. **ntfy.s8n.ru — does it exist yet?** Memory hints at Tuwunel + Matrix on `txt.s8n.ru`. If ntfy isn't deployed, decide: deploy ntfy *now*, or use Matrix room via Tuwunel webhook bridge as alert sink.
2. **Onyx user `mc-backup`** — create today or reuse existing `admin` with restricted authorized_keys? Restricted user is cleaner; reusing `admin` is faster.
3. **Append-only enforcement** on the onyx side — accept "sftp chroot + no shell" as good-enough, or invest in a per-repo restic key with `--no-delete`-style isolation (more work, partial mitigation only)?
4. **Pre-flight world validation** — run `region-fixer` against the latest snapshot weekly to catch silent corruption (T3)? Adds ~5 min compute weekly. Recommend yes.
5. **Class-E (host configs) — already in `live-server/` git repo via Syncthing/manual?** If yes, drop Class E from this scheme; if no, add it.

---

## 13. References

- `docs/BACKUP.md` — current (broken) state docs.
- `docs/RUNBOOK-BACKUP-RESTORE.md` — operational runbook (this commit).
- `scripts/backup.sh` — to-be-fixed daily script (F-backup-1 in `infra/STATE.md`).
- `_github/infra/STATE.md` — Top-5 weakness #2 + #5 tracking this work.
- `_github/infra/runbooks/MIGRATION-nullstone-to-cobblestone.md` §5 — F-backup-1 detail; nullstone-as-spare hint.
- Memory: `project_friend_gpu` (Tailscale stable IP for friend), `project_tailscale_mesh` (mesh layout), `project_nullstone_docker_userns` (why container-side backup is rejected).
- `CLAUDE.md` Device Registry — onyx 192.168.0.28 / 100.64.0.1.

364 CROSS-REFERENCE-2026-05-07.md Normal file
@@ -0,0 +1,364 @@

<!--
Cross-reference survey for the 2026-05-07 racked.ru / YOU500 incident.
Read-only inventory of existing docs across local repo clones, written
to help the four parallel investigation outputs (backup hunt, AuthLimbo
audit, backup strategy, server audit) integrate without conflict.

Author: cross-reference agent (read-only)
Status: survey only — no fixes proposed here, that's the other agents' job.
-->

# Cross-Reference Survey — 2026-05-07

**Trigger:** racked.ru player **YOU500** void-died via AuthLimbo
`teleportAsync` failure, lost full inventory, no backups exist.
Four parallel agents are writing audit + plan docs. This doc maps
them onto existing infra so nothing collides or gets orphaned.

---

## 1. Per-repo state snapshot

### `auth-limbo` (Paper plugin source)

| Field | Value |
|---|---|
| Origin | `ssh://git@192.168.0.100:222/s8n-ru/auth-limbo.git` ⚠️ stale (`s8n-ru` rename) |
| Latest tag in CHANGELOG | **1.0.0** (2026-04-30) — single release |
| Last commit | `b686380 readme: restyle to match minecraft-launcher format` |
| Recent commits | README rewrites, AGPL switch, rename chain `RackedLimbo → LoginLimbo → AuthLimbo` |
| CI | `.github/workflows/build.yml` + `release.yml` (GitHub Actions, **not** `.forgejo/`) |
| Tests | **None.** `src/test/` does not exist. |
| Source | 5 Java files: `AuthLimbo`, `AuthMeDatabase`, `LimboWorldManager`, `LoginListener`, `VoidGenerator` |
| Docs | `docs/{compatibility,configuration,how-it-works,installation}.md` |
| CHANGELOG style | **Keep a Changelog + SemVer**, date-suffixed `## [1.0.0] - 2026-04-30` |
| License | AGPL-3.0-or-later, SPDX header in every Java file |

**Key existing detail relevant to the bug** — `LoginListener.java`
already implements the documented Paper #4085 fix (chunk-ticket pin
in `AuthMeAsyncPreLoginEvent` + `getChunkAtAsyncUrgently` chained
with `teleportAsync` at MONITOR priority on `LoginEvent`, with
configurable `authme.teleport-delay-ticks`). If YOU500 still
void-died, the bug is in **how** that chain handled a return value
of `false` / a thrown exception — the current code only logs a
`warning` and lets the player stay wherever they were (which on
login is the limbo void). See `LoginListener.java:166-191`.

The AuthLimbo audit agent's findings should land as:
- **`docs/INCIDENT-2026-05-07-you500.md`** (new) — forensic root-cause
  doc, follow `docs/REBRAND_2026-04-30.md` style (date-prefixed,
  scope/apply/result/rollback sections — convention shown below).
- **`CHANGELOG.md`** — bump to `## [1.0.1] - 2026-05-07` with
  `### Fixed` block, follow Keep-a-Changelog format.
- **`src/main/java/ru/authlimbo/LoginListener.java`** — code patch.
  Likely changes: handle `success == false` and `exceptionally`
  with a kick or retry rather than silent log; consider raising
  default `teleport-delay-ticks` from 10 → 20.
- **`src/test/`** (new directory) — unit tests for the listener.
  No precedent here, but pom.xml needs JUnit added.

---

### `minecraft-server` (server repo — this repo)

| Field | Value |
|---|---|
| Origin | `ssh://git@192.168.0.100:222/s8n-ru/minecraft-server.git` ⚠️ stale |
| Last commit | `ede6029 proantitab: allow lp/luckperms in global; deny essentials.motd default` |
| Top-level docs | `MISSION.md`, `README.md`, `RULES.md`, `THANKS.md`, `VIBE.md`, `TELEMETRY_AUDIT.md` |
| `docs/` | `BACKUP.md`, `DEPLOY.md`, `PERMISSIONS.md`, `PLUGINS.md`, `PLUGIN_ALTERNATIVES.md`, `RACKED_BRAND.md`, `REBRAND_2026-04-30.md`, `ROADMAP.md`, `migrations/lands-to-landclaim.md`, `plugins/<name>.md` (20 files) |
| Existing TODO | The README "Roadmap / TODO" section (lines 91-180) is the canonical living checklist. Tagged `[P0]` blocker / `[P1]` vision / `[P2]` improvement / `[P3]` nice-to-have. `docs/ROADMAP.md` is **scoped narrowly** to plugin-acquisition overhaul (Phases 1-3). |
| `live-server/` | live config snapshot (purpur.yml, server.properties, ops.json, plugins/) — **mirrors prod state**, not a build input. |
| Backup script | `scripts/backup.sh` — note **bug at line 119** (orphaned `"${BACKUP_PATH}/synapse-signing-key-${TIMESTAMP}.key"` block sits outside any `if`, will fail at runtime if signing-key path absent) |
| CI | `.github/workflows/` is empty. `.github/ISSUE_TEMPLATE/` empty. No `.forgejo/`. |

**No existing files named** `AUDIT*`, `INCIDENT*`, `RUNBOOK*`,
`TODO*`, `CHANGELOG*` at root or in `docs/`. The closest precedents:
- `docs/REBRAND_2026-04-30.md` — date-prefixed event log w/
  Apply/Side incident/Rollback sections. **Use this as the format
  template for any new INCIDENT-* doc.**
- `docs/migrations/lands-to-landclaim.md` — multi-section migration
  plan (Current State / Target / Plan / Rollback). Format template
  for future strategy docs.
- `MISSION.md` / `VIBE.md` / `RULES.md` — top-level "values" docs.
  Don't add new top-level capitalised md files unless the doc is
  similarly load-bearing for the project's identity. Detail goes in
  `docs/`.

---
|
||||||
|
|
||||||
|
### `infra` (nullstone+cobblestone runbooks)

| Field | Value |
|---|---|
| Origin | `ssh://git@192.168.0.100:222/veilor-org/infra.git` ✅ org-scoped, no rename impact |
| Last commit | `381f923 runbook: distribute load + sync data (operator's HA vision)` |
| Layout | `forgejo/`, `runbooks/`, `repos/`, root `STATE.md` + `AUDIT-2026-05-05.md` |
| Runbooks | `COBBLESTONE-INTAKE.md`, `DE-DECISION-cobblestone.md`, **`HA-CLUSTER-distribute-and-sync.md`** (already covers MC backup placement!), `MIGRATION-nullstone-to-cobblestone.md` |

**Critical pre-existing context:**

- `STATE.md` already lists *"`/opt/docker/backup.sh` fixes — matrix-postgres + rocketchat-mongodb + literal CHANGE_ME pw"* as an open issue (line 97), AND lists Restic+autorestic as the **#1** recommended addition (lines 113, 283-285 of `AUDIT-2026-05-05.md`).
- `runbooks/HA-CLUSTER-distribute-and-sync.md` line 51 already plans *"Backups (offsite) — Restic to B2/Wasabi nightly"* and line 72 pins MC to nullstone with *"World data ZFS-replicated for DR only"*. The backup-strategy agent's plan must reconcile with this — don't propose a parallel scheme; either extend the HA runbook or cross-link it as the parent design.
- `AUDIT-2026-05-05.md` lines 200-203 already flag the backup script as silently broken (RC + ex-Matrix not dumping). Confirms the symptom that caused YOU500's loss.

**Format conventions in `infra/`:**

- Audit reports: `# 5-Agent Audit Report — YYYY-MM-DD` header, TL;DR section, severity-ordered Action items section, file index.
- Runbooks: `# Runbook — <topic>` header, Goal blockquote, North-star diagram if applicable, phase plan, failure scenarios + RTO table, open decisions, related links.
- Dating: filenames always `<TYPE>-YYYY-MM-DD.md`.

---

### `minecraft-launcher`

| Field | Value |
|---|---|
| Origin | `ssh://git@192.168.0.100:222/s8n-ru/minecraft-launcher.git` ⚠️ stale |
| Last commit | `31d25f8 readme: shrink license section to single sub line` |
| Relevance to incident | None direct. Would only matter if the incident agent recommends a launcher-side patch (e.g. forced relog on void-death detection) — unlikely. |

### `minecraft-client`

**Not a git repo** (`fatal: not a git repository`). No remote to worry about. Excluded from any rewrite list.

### `veilor-os`

| Field | Value |
|---|---|
| Origin | `ssh://git@192.168.0.100:222/veilor-org/veilor-os.git` ✅ no rename impact |
| Relevance | None — separate brand (security distro), not Minecraft. Skipped per instructions. |

---

## 2. Stale `s8n-ru` origin URLs (per 2026-05-07 rename)

Per workspace memory `user_git_identity.md`, the Forgejo user `s8n-ru` was renamed to `s8n` on 2026-05-07. Forgejo serves a 307 redirect for now, but the canonical path is `s8n/<repo>`. The following local clones still have the old origin:

| Repo (local clone) | Current origin | Should become |
|---|---|---|
| `_github/auth-limbo` | `ssh://git@192.168.0.100:222/s8n-ru/auth-limbo.git` | `ssh://git@192.168.0.100:222/s8n/auth-limbo.git` |
| `_github/minecraft-server` | `ssh://git@192.168.0.100:222/s8n-ru/minecraft-server.git` | `ssh://git@192.168.0.100:222/s8n/minecraft-server.git` |
| `_github/minecraft-launcher` | `ssh://git@192.168.0.100:222/s8n-ru/minecraft-launcher.git` | `ssh://git@192.168.0.100:222/s8n/minecraft-launcher.git` |

**No rename required for:** `_github/infra` (`veilor-org/`), `_github/veilor-os` (`veilor-org/`), `_github/minecraft-client` (not a repo).

Recommended one-shot fix (deferred — not part of these four agents):

```bash
for r in auth-limbo minecraft-server minecraft-launcher; do
  git -C /home/admin/ai-lab/_github/$r remote set-url origin \
    ssh://git@192.168.0.100:222/s8n/$r.git
done
```

Also update the in-doc URL references:

- `auth-limbo/src/main/resources/plugin.yml` line 7: `website: https://github.com/s8n-ru/auth-limbo`
- `auth-limbo/src/main/java/ru/authlimbo/*.java` SPDX header: `Copyright (C) 2026 s8n-ru`
- `minecraft-server/VIBE.md` line 38: `github.com/s8n-ru/auth-limbo`
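The in-doc references can be previewed and rewritten with a plain string swap. A minimal dry-run sketch — the temp file stands in for the three real paths above, so the block is safe to run anywhere:

```shell
# The temp file stands in for the real files listed above; point the
# commands at those paths when applying for real (Linux GNU sed -i).
f=$(mktemp)
printf 'website: https://github.com/s8n-ru/auth-limbo\n' > "$f"
grep -n 's8n-ru' "$f"          # preview every line that will change
sed -i 's/s8n-ru/s8n/g' "$f"   # the rename is a plain string swap
cat "$f"                       # -> website: https://github.com/s8n/auth-limbo
rm -f "$f"
```

Run the `grep` preview first on the real files; the SPDX `Copyright (C) 2026 s8n-ru` lines are caught by the same pattern.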

---

## 3. Overlap with session-noted TODO items

The session noted these TODOs that the four agents may want to fold into recommendations. State as of HEAD:

| Item | Existing mention? | Where | Status |
|---|---|---|---|
| **SHA256 → BCRYPT** (AuthMe hashing) | ✅ flagged 2026-05-02 | `security/nullstone-server/2026-05-02-mc-audit.md` summary: *"AuthMe also uses unsalted SHA-256, no tempban, no captcha, and 5-char minimum passwords"* | **Not yet addressed in repo.** No TODO entry in README. New. |
| **EZShop drop** | ⚠️ Plugin loaded via `PLUGINS:` in `docker-compose.yml:51` | docker-compose.yml | No TODO entry yet. New. |
| **CapDrop** (Linux capabilities) | ❌ No mention | — | Net-new infra-side item (deploy-security level). Belongs in the server-audit agent's report. |
| **tracking-range** | ❌ No mention | — | Net-new (purpur.yml tuning). New. |
| **CO DB → MySQL** (CoreProtect) | ❌ No mention | — | Net-new. Touches plugin policy (CoreProtect-CE is the one acknowledged license exception per MISSION.md — a CO config change is OK, a plugin swap is not). |
| **TPS webhook** | ⚠️ A "Prometheus exporter + Grafana" entry exists at README:105 (P2). A webhook would be a lighter-weight alternative. | README.md:105 | Adjacent to existing TODO; consider replacing or augmenting it. |
| **spark baseline** | ✅ spark already loaded in `PLUGINS:` (compose:54) and listed in VIBE.md:78 | docker-compose.yml, VIBE.md | "Baseline" = capture a profiling run for reference. Net-new. |
| **plugin folder cleanup** | ⚠️ `live-server/plugins/` is a checked-in live config snapshot. Past cleanup happened in REBRAND_2026-04-30 (Side incident — disk full). | docs/REBRAND_2026-04-30.md:65-74 | Operational, not docs. Net-new. |

**None of the eight overlap with the existing `docs/ROADMAP.md`** (which is scoped narrowly to *plugin-acquisition* — manifest + lockfile + CI). They all belong in the **README.md "Roadmap / TODO" checklist** by current convention. The server-audit agent should append them there, not create a new ROADMAP-* doc.

---

## 4. Existing backup-related mentions

| File | Line | Content |
|---|---|---|
| `docs/BACKUP.md` | all | Documents the daily 02:00 cron + retention. **Critical drift:** describes worlds being backed up, but VIBE.md:54-58 says *"no world backups"*. Direct contradiction. |
| `scripts/backup.sh` | 80-117 | Minecraft block: docker-exec tar of world/world_nether/world_the_end + configs. **Real, working code.** |
| `scripts/backup.sh` | 119-122 | **Orphaned dead-code block** outside any `if` (dangling from `synapse-signing-key`). Will trigger script failure if the signing-key path is missing. |
| `README.md` | 23, 45, 164, 179 | Mentions the backup feature. README:179 records "freed 11G+ (old backups, ...)". |
| `VIBE.md` | 54-58 | *"Daily configs, no world backups (it'd eat too much disk). If you lose a base to grief, that's the game."* — **conflicts with reality.** |
| `docs/REBRAND_2026-04-30.md` | 53, 65-74 | Records the 2026-04-30 backup tarball and the 2026-05-01 disk-full incident from accumulated backups. Confirms backups *were* running. |
| `SYSTEM.md` | 737-749 | Workspace-level system reference says backups run daily, ~5-7GB compressed. Out-of-date plugin counts (says 25, actual ~16) and Purpur version (says 1.21.10, actual 1.21.11). |

**Major contradiction the backup-strategy agent must resolve:** either VIBE.md must drop the *"no world backups"* line (recommended — reality is that worlds **are** being backed up), or the operator must accept that the YOU500 loss happened because the worlds were **logically excluded from the policy** even though they were mechanically being archived. The latter is unlikely — a daily 02:00 tarball would have caught a 2026-05-07 daytime void death.

**Backup-hunt agent finding to verify:** does `/opt/backups/` on nullstone actually contain any usable `mc-world-backup-*.tar.gz` files? `STATE.md` line 97 + `AUDIT-2026-05-05.md` lines 200-203 suggest the script *runs* but its other arms are failing silently; the MC arm at lines 80-117 of backup.sh has no obvious bug, so backups should exist. If they don't, that's the deepest finding.

---

## 5. Forgejo runner / CI integration

Per memory `project_forgejo_nullstone.md` and `STATE.md` lines 26-27, nullstone runs a Forgejo runner with labels `ubuntu-24.04 + nullstone`. **No repo currently has a `.forgejo/` directory** — neither auth-limbo nor minecraft-server nor infra. CI in `auth-limbo` is GitHub Actions (`.github/workflows/`).

`STATE.md` lines 121-129 note the v0.5.32 veilor-os ship is pending on flipping `runs-on:` to `nullstone` to use the Forgejo runner.

**Implication for the audit agents:** if the AuthLimbo agent wants the fix to land via CI, two options:

1. Keep `.github/workflows/build.yml`. Since the GH mirror is manual-only post-2026-05-06 (`STATE.md`:14-18), the workflow won't trigger automatically anymore and would need a manual mirror push.
2. Migrate to `.forgejo/workflows/build.yml` with `runs-on: ubuntu-24.04` (compatible with the runner). Cleaner, matches the new direction. **Recommended.**

Either path: the pre-existing dependency on the `AUTHME_JAR_URL` repo secret (see `.github/workflows/build.yml:21-26`) needs to be re-added on Forgejo if path 2 is taken.
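If path 2 is taken, a minimal sketch of what the migrated workflow could look like. The file name comes from the recommendation above; the checkout action, Maven invocation, and secret wiring are assumptions (Forgejo Actions is GitHub-Actions-compatible syntax), not the repo's actual CI:

```yaml
# .forgejo/workflows/build.yml — hypothetical sketch, not deployed
name: build
on: [push]

jobs:
  build:
    runs-on: ubuntu-24.04        # matches the nullstone runner label
    steps:
      - uses: actions/checkout@v4
      - name: Fetch AuthMe jar   # AUTHME_JAR_URL must be re-added as a Forgejo secret
        run: curl -fsSL -o authme.jar "${{ secrets.AUTHME_JAR_URL }}"
      - name: Build
        run: mvn -B package      # auth-limbo is a Maven project (pom.xml)
```

The build step mirrors what a Maven plugin build normally needs; adapt paths to wherever the existing GH workflow places the AuthMe dependency.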

---

## 6. Workspace-level `SYSTEM.md` updates needed after backup-strategy lands

`/home/admin/ai-lab/SYSTEM.md` lines 665-779 hold the canonical workspace-level Minecraft section. After the backup-strategy doc lands, the following blocks need editing (one PR, one paragraph each):

| SYSTEM.md location | Existing content | Drift |
|---|---|---|
| Line 677 | "Minecraft Version: 1.21.10 (Purpur build 2532)" | Actual: 1.21.11 (compose line 10) |
| Lines 686-690 | "25 plugins loaded ... bulk-updated 2026-04-17" | The plugin set has shifted heavily since (LandClaimPlugin → Homestead, WorldEdit → FAWE, Vault → VaultUnlocked, LoginSecurity → AuthMe, AuthLimbo added, EZShop+AuctionHouse added). Real count ≈ 16. |
| Lines 692-706 | RAM 7GB idle, Purpur 1.21.10-2535, startup 47s | Out of date; would benefit from a re-measure as part of the "spark baseline" TODO. |
| Lines 765-771 | "Known Issues" block | Add a YOU500 incident closure note (post-fix); F10 RCON wildcard already promised in Wave 2. |
| Line 776 | "Backup frequency: Add 6-hourly world snapshots for active play sessions" | This is the existing wishlist item the backup-strategy agent will likely satisfy. Strike, or replace with "Done — see infra/runbooks/MC-BACKUP-2026-05-07.md" (or wherever the strategy lands). |

**Per `CLAUDE.md` workspace rules**, technical detail belongs in SYSTEM.md, not README.md. The README device-table line for nullstone won't change.

---

## 7. Integration recommendations — where each parallel agent's doc lands

| Agent | Output should land at | Rationale |
|---|---|---|
| **Backup hunt** (find existing backups) | `_github/minecraft-server/docs/INCIDENT-2026-05-07-you500-backup-hunt.md` | Date-prefixed, follows the REBRAND_2026-04-30.md format. Forensic in nature → minecraft-server `docs/`. |
| **AuthLimbo audit** (root-cause + code patch) | (1) `_github/auth-limbo/docs/INCIDENT-2026-05-07-teleportasync-failure.md` for the forensic write-up; (2) source patch + `CHANGELOG.md` bump in the same repo; (3) optional cross-link from `minecraft-server/docs/INCIDENT-2026-05-07-you500-backup-hunt.md` | The plugin source repo owns plugin bugs. The INCIDENT- naming convention matches REBRAND_*.md. |
| **Backup strategy** (forward-looking design) | `_github/infra/runbooks/MC-BACKUP-strategy-2026-05-07.md` (or extend `HA-CLUSTER-distribute-and-sync.md` with a Phase 1.5 sub-section) | infra owns the nullstone-side cron + restic. Cross-link from `minecraft-server/docs/BACKUP.md` (replace its current contents with a thin pointer). |
| **Server audit** (broader hardening — CapDrop, plugin folder, MySQL, etc.) | `_github/minecraft-server/docs/AUDIT-2026-05-07.md` (synthesis), then **append individual TODOs to the README.md "Roadmap / TODO"** | Matches the `infra/AUDIT-2026-05-05.md` precedent. The README is the canonical TODO surface for this repo per existing convention. |

**Files needing edits AFTER all four agents finish:**

| File | Change |
|---|---|
| `_github/minecraft-server/README.md` | Append new TODO entries from the server-audit agent: SHA256→BCRYPT, EZShop drop, CapDrop, tracking-range, CO MySQL, TPS webhook, spark baseline, plugin folder cleanup. Add `[x]` for the YOU500 incident under "Done" once the fix has shipped. |
| `_github/minecraft-server/docs/BACKUP.md` | Rewrite to point to the infra runbook; the current Schedule/Strategy/Manual sections move to infra. Or replace contents with a thin "see infra/runbooks/MC-BACKUP-strategy-2026-05-07.md". |
| `_github/minecraft-server/VIBE.md` | Drop or revise lines 54-58 — *"no world backups"* contradicts reality and is the philosophical claim that may have justified treating backups as low-priority. Important narrative fix. |
| `_github/minecraft-server/scripts/backup.sh` | Fix the orphaned lines 119-122 dead-code block. Independent of the strategy agent's output. |
| `_github/minecraft-server/docker-compose.yml` | If the EZShop drop is accepted: remove line 51. (Server-audit agent decision.) |
| `_github/auth-limbo/CHANGELOG.md` | New `## [1.0.1] - 2026-05-07` entry. |
| `_github/auth-limbo/pom.xml` | Version bump 1.0.0 → 1.0.1 if the patch ships. |
| `_github/auth-limbo/src/main/java/ru/authlimbo/LoginListener.java` | Code fix per the AuthLimbo agent. |
| `_github/infra/STATE.md` | Add a 2026-05-07 changelog entry referencing the incident; check off the "/opt/docker/backup.sh fixes" pending decision (line 97) when the backup script is repaired. |
| `_github/infra/AUDIT-2026-05-05.md` | Append an addendum or leave dated; the new audit replaces/augments the F-numbered findings related to MC backups. |
| `/home/admin/ai-lab/SYSTEM.md` | Update the Minecraft section per §6 above. Add a note in Known Issues (line 765). Update Last Updated. |
| `/home/admin/ai-lab/README.md` | "Last Updated" stamp; a one-line status mention if the user wants it surfaced at workspace level. |

---

## 8. Open conflicts and duplications

1. **VIBE.md vs reality** (the most important narrative conflict). VIBE says no world backups; backup.sh + BACKUP.md + REBRAND_2026-04-30 prove worlds **are** archived nightly. The YOU500 inventory loss means either (a) backups didn't run that day, (b) a backup ran but the rollback isn't operationally feasible (it would lose other players' progress between 02:00 and the death), or (c) the operator chose not to roll back. **The backup-strategy agent must address this explicitly** rather than just propose a new scheme.

2. **`docs/ROADMAP.md` scope vs README "Roadmap / TODO"** — the docs file is narrowly about plugin-acquisition Phases 1-3, while the README has the all-up living checklist. Future agents should not put generic TODO items into `docs/ROADMAP.md`. Keep its scope tight or rename it `docs/PLUGIN-ACQUISITION-ROADMAP.md`.

3. **infra `HA-CLUSTER-distribute-and-sync.md` vs the new MC-backup strategy** — there's a real risk the backup-strategy agent designs Restic-to-B2 in isolation while HA-CLUSTER already plans that exact service for both nullstone+cobblestone. The strategy doc must reference and extend the HA-CLUSTER plan (specifically the "Backups (offsite)" row in its layer table, line 51).

4. **CoreProtect MySQL migration** — proposed in the session TODOs. `MISSION.md:24` codifies CoreProtect-CE as "the one acknowledged license exception". Switching its DB backend to MySQL is fine under that policy (config, not plugin swap), but the server-audit agent should explicitly note "this is a config change, not a plugin swap, so MISSION.md:24 still holds" so the policy isn't accidentally diluted.

5. **AuthLimbo CI host** — `.github/workflows/` lives in the repo, but the GH push-mirror is off as of 2026-05-06. Builds will only run if someone manually pushes to GH. Worth flagging to the AuthLimbo agent that any CI step they propose may need a `.forgejo/` variant, otherwise the patched 1.0.1 release won't auto-build.

6. **`_github/minecraft-client` is not a git repo** — nothing to worry about for this incident, but anyone iterating on the incident later may try to commit something there expecting it to work. Worth recording.

---

## 9. Summary table — convention by repo

| Repo | Audit doc convention | Incident doc convention | TODO surface | CHANGELOG style |
|---|---|---|---|---|
| `auth-limbo` | (none yet) | (none yet — recommend `docs/INCIDENT-YYYY-MM-DD-<slug>.md`) | (none — small repo) | Keep a Changelog + SemVer, `## [X.Y.Z] - YYYY-MM-DD` |
| `minecraft-server` | (none yet — recommend `docs/AUDIT-YYYY-MM-DD.md` matching infra style) | follow the `docs/REBRAND_2026-04-30.md` template | README "Roadmap / TODO" with `[P0..P3]` tags | (none — uses git log) |
| `infra` | `AUDIT-YYYY-MM-DD.md` at root | (use runbooks for forward-looking work; no incident files yet) | `STATE.md` "Pending decisions" table | (none — uses git log + STATE.md) |
| `minecraft-launcher` | n/a | n/a | (none) | (none) |
| `veilor-os` | (separate brand — out of scope) | — | — | — |

---

*End of survey. Read-only. No files modified. No commits pushed.*
156
docs/RUNBOOK-BACKUP-RESTORE.md
Normal file

@ -0,0 +1,156 @@
# Runbook — Backup & Restore (Minecraft, racked.ru on nullstone)

Strategy doc: [`../BACKUP-STRATEGY.md`](../BACKUP-STRATEGY.md). This runbook is the **operator-facing** procedure for the three scenarios that come up in practice. Keep it short, copy-paste-able, and reachable from the player support workflow.

> **Status (2026-05-07):** This runbook is written **ahead** of the implementation it describes. The `mc-backup-frequent` timer and onyx mirror are NOT yet deployed. The "What if no snapshot exists yet?" section at the bottom covers today's reality.

---

## TL;DR — restore one player's `.dat` from N minutes ago

```bash
# On nullstone, as `user`:
PUUID=<player-uuid>       # e.g. from /opt/docker/minecraft/usercache.json
PUUID_NICK=<player-nick>  # in-game name, used for the kick below
WHEN=latest               # or a snapshot id from `restic snapshots`
RESTIC_PASSWORD_FILE=/etc/mc-backup.pw \
  restic -r /home/user/restic/mc-frequent \
  restore "$WHEN" \
  --target /tmp/restore-$$ \
  --include "world/playerdata/${PUUID}.dat"

# Verify the file is well-formed NBT before applying:
file /tmp/restore-$$/opt/docker/minecraft/world/playerdata/${PUUID}.dat
# Expected: "gzip compressed data"

# Apply (server must be running so playerdata is writable; the player
# MUST be offline or we're racing the writer):
mcrcon -H 127.0.0.1 -P 25575 -p *redacted* "kick ${PUUID_NICK} Restore in progress"
mcrcon -H 127.0.0.1 -P 25575 -p *redacted* "save-off"
mcrcon -H 127.0.0.1 -P 25575 -p *redacted* "save-all flush"

cp /opt/docker/minecraft/world/playerdata/${PUUID}.dat \
   /opt/docker/minecraft/world/playerdata/${PUUID}.dat.preFix-$(date +%s)
cp /tmp/restore-$$/opt/docker/minecraft/world/playerdata/${PUUID}.dat \
   /opt/docker/minecraft/world/playerdata/${PUUID}.dat
chown 100000:100000 /opt/docker/minecraft/world/playerdata/${PUUID}.dat  # userns-remap

mcrcon -H 127.0.0.1 -P 25575 -p *redacted* "save-on"
# Tell the player to log back in.
```

**Why kick + `save-off`:** if the player is online, the server holds their NBT in memory and rewrites the `.dat` on the next save tick — clobbering the restore. `save-off` halts auto-save; kicking guarantees the in-memory state for that player won't be flushed.

**Userns-remap reminder:** the host sees container-uid `100000` for files written by the MC process. Restored files written by `user` (uid 1000) will appear empty/permission-denied to the container. Always `chown 100000:100000` (or `chmod 666`) after the restore. Memory: `project_nullstone_docker_userns`.

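The `file` check in the TL;DR only reads the gzip magic bytes, so a truncated restore can still pass it. A self-contained sketch of a stricter end-to-end check — the generated temp file stands in for a restored `.dat` (player files are gzip-wrapped NBT):

```shell
# Stand-in .dat so the sketch runs anywhere; swap in the restored path.
DAT=$(mktemp)
printf 'nbt-payload' | gzip > "$DAT"
if gzip -t "$DAT" 2>/dev/null; then   # decompresses the whole stream
  echo "OK: gzip stream intact"
else
  echo "CORRUPT: do not copy over live playerdata"
fi
rm -f "$DAT"
```

`gzip -t` walks the entire stream and checks the trailing CRC, which catches truncation that a magic-byte check misses.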

---

## Scenario 1 — Player lost inventory (T1, the void-death case)

This is what the strategy was written for. RTO target: **< 2 minutes**.

1. Find the UUID:
   ```bash
   grep -i 'NICK' /opt/docker/minecraft/usercache.json
   ```
2. Pick a snapshot just **before** the loss. `restic snapshots --tag playerdata` shows timestamps.
3. Run the TL;DR block above with that snapshot id (or `latest` if the loss happened in the last 5 min).
4. Inform the player: "Your inventory from HH:MM has been restored. Anything you picked up after that point is gone."
5. Log the incident: append to `docs/INCIDENTS.md` (create if absent) — date, player, snapshot id, cause.

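Step 1's `grep -i` can substring-match the wrong player (e.g. `NICK` inside a longer nick). A sketch of an exact-name lookup — the here-doc stands in for the real `usercache.json`, which the vanilla server writes as an array of `{"name","uuid","expiresOn"}` objects:

```shell
# Sample file stands in for /opt/docker/minecraft/usercache.json.
UC=$(mktemp)
cat > "$UC" <<'EOF'
[{"name":"YOU500","uuid":"11111111-2222-3333-4444-555555555555","expiresOn":"2026-06-06 17:13:39 +0100"}]
EOF
python3 -c '
import json, sys
path, nick = sys.argv[1], sys.argv[2]
for entry in json.load(open(path)):
    if entry["name"].lower() == nick.lower():   # exact match, case-insensitive
        print(entry["uuid"])
' "$UC" YOU500
rm -f "$UC"
```

The UUID printed is the demo value above; against the real file it resolves the `.dat` filename directly.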

---

## Scenario 2 — Whole world rolled back (T2/T3, griefing or corruption)

RTO target: **15 minutes**. Server downtime expected.

1. Announce, kick, stop:
   ```bash
   mcrcon ... "say Server going down for restore — back in ~15 min"
   mcrcon ... "kick @a Restore in progress"
   cd /opt/docker/minecraft && docker compose down
   ```
2. Move live data aside (do not delete):
   ```bash
   mv /opt/docker/minecraft /opt/docker/minecraft.broken-$(date +%F)
   mkdir -p /opt/docker/minecraft
   ```
3. Restore from the world repo:
   ```bash
   RESTIC_PASSWORD_FILE=/etc/mc-backup.pw \
     restic -r /home/user/restic/mc-world \
     restore <snapshot-id> --target /tmp/world-restore
   rsync -aHAX /tmp/world-restore/opt/docker/minecraft/ /opt/docker/minecraft/
   ```
4. **Re-apply userns-remap perms** (critical — see memory):
   ```bash
   chmod -R 777 /opt/docker/minecraft  # quickfix; or chown -R 100000:100000
   ```
5. Boot:
   ```bash
   cd /opt/docker/minecraft && docker compose up -d
   docker logs -f minecraft-mc  # watch for the "Done" line
   ```
6. Verify with a known-good UUID's `.dat` parse, then announce the server up.
7. Keep `minecraft.broken-YYYY-MM-DD/` for at least 7 days for forensic comparison.


---

## Scenario 3 — Host disk dead (T5)

RTO target: **a few hours, depending on the hardware swap**.

1. New host: install Debian 13 + Docker per `_github/infra/runbooks/MIGRATION-nullstone-to-cobblestone.md`.
2. `apt install restic`. Pull the password from the operator's password manager into `/etc/mc-backup.pw`.
3. Initialise the destination dir, then restore from the **onyx mirror** (not local — local is gone):
   ```bash
   restic -r sftp:mc-backup@100.64.0.1:/backups/nullstone-mc-restic \
     restore latest --target /tmp/world-restore
   ```
4. Continue Scenario 2 from step 4.
5. Stand up the timers on the new host. **Do not** point them at the same off-host repo until the new host has been re-keyed (rotate restic passwords as part of disaster recovery).


---

## Drill log (monthly)

| Date | Operator | Snapshot age | Class A restore time | Off-host restore time | Result |
|------|----------|--------------|----------------------|------------------------|--------|
| (first drill — 2026-06-06) | s8n | TBD | TBD | TBD | TBD |

Procedure: see `BACKUP-STRATEGY.md` §7.


---

## What if no snapshot exists yet? (CURRENT REALITY 2026-05-07)

Until phases 1–4 of `BACKUP-STRATEGY.md` are deployed, the only recovery resources are:

| Source | What's there | Recoverable? |
|---|---|---|
| `/opt/backups/202604xx_020001/mc-world-backup-*.tar.gz` | World tars from Apr 29 + May 2 (others FAILED) | **GONE** — pruned by the 7-day retention |
| `/opt/backups/mc-plugins-prerebrand-2026-04-30.tar.gz` | Plugin jars only, no world | Not useful for player data |
| Live `/opt/docker/minecraft/world/playerdata/<uuid>.dat_old` | MC's own `.dat_old` shadow file from the previous save | **YES** — last save tick before the current one. **First-line defence right now.** |
| CoreProtect DB (`plugins/CoreProtect/database.db`) | Block + container actions, NOT inventory state | Partial — can roll back grief, can't restore lost items |

**Today's playbook for inventory-loss reports:**

1. Server console → `co lookup u:NICK` to confirm the loss event in CoreProtect.
2. **Stop the server immediately** if the report comes in within the same play session — every save tick overwrites `.dat_old`. `docker compose down` buys time.
3. Inspect `world/playerdata/<uuid>.dat_old` — if it predates the loss, copy it over `<uuid>.dat`, fix perms (uid 100000), restart.
4. If `.dat_old` is too new (already overwritten): **the loss is unrecoverable until BACKUP-STRATEGY phases 1–4 are deployed.** Apologise to the player. Spawn-in compensation is per operator discretion (ops creative-mode replacement is the customary remedy).
5. Log the incident — it adds urgency to deploying the new strategy.

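Step 3, sketched against a throwaway directory so it is safe to dry-run anywhere. The directory and UUID are stand-ins; substitute the real `world/playerdata` path and UUID when acting for real, and remember the userns perms fix afterwards:

```shell
PD=$(mktemp -d)                                    # stands in for world/playerdata
UUID=11111111-2222-3333-4444-555555555555          # hypothetical UUID
printf 'previous-save' > "$PD/$UUID.dat_old"       # MC's shadow copy
printf 'post-loss-save' > "$PD/$UUID.dat"          # the bad current state
cp "$PD/$UUID.dat" "$PD/$UUID.dat.preFix-$(date +%s)"  # keep the broken state
cp "$PD/$UUID.dat_old" "$PD/$UUID.dat"                 # promote the previous save
cat "$PD/$UUID.dat"                                    # -> previous-save
rm -rf "$PD"
```

The `.preFix-<epoch>` copy mirrors the TL;DR convention, so a restore gone wrong can itself be rolled back.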

---

## TODO — open items (links into BACKUP-STRATEGY.md §11)

- [ ] Phase 1: fix the `/opt/docker/backup.sh` orphan-line bug (F-backup-1).
- [ ] Phase 2: deploy `mc-backup-frequent.timer` (Class A, 5-min playerdata).
- [ ] Phase 3: deploy `mc-backup-world.timer` (Class B/C/D, hourly).
- [ ] Phase 4: provision the `mc-backup` user on onyx + a `restic copy` job.
- [ ] Phase 5: schedule the monthly drill calendar entry, run the first drill.
- [ ] Phase 6: ntfy / Matrix alert wiring (depends on the ntfy deployment).
- [ ] Phase 7: friend's RTX 4080 PC as a secondary off-host.
- [ ] Verify `usercache.json` on this host: confirm the UUID lookup workflow above resolves to the right `.dat`.
- [ ] Decide: `mcrcon` package vs a lightweight Python `mcrcon` lib.
- [ ] Document the compensation policy for unrecoverable losses (operator discretion right now).
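For the Phase 2 item, a sketch of what the unit pair could look like. The unit names come from the TODO above; the repo path, tag, and restic invocation are assumptions carried over from the TL;DR block, not deployed units:

```ini
# /etc/systemd/system/mc-backup-frequent.service — hypothetical sketch
[Unit]
Description=Class A playerdata snapshot (restic)

[Service]
Type=oneshot
Environment=RESTIC_PASSWORD_FILE=/etc/mc-backup.pw
ExecStart=/usr/bin/restic -r /home/user/restic/mc-frequent backup \
    /opt/docker/minecraft/world/playerdata --tag playerdata

# /etc/systemd/system/mc-backup-frequent.timer — hypothetical sketch
[Unit]
Description=Run the Class A snapshot every 5 minutes

[Timer]
OnCalendar=*:0/5
Persistent=true

[Install]
WantedBy=timers.target
```

Enable with `systemctl enable --now mc-backup-frequent.timer`; `OnCalendar=*:0/5` fires on every 5th minute, matching the Class A window in `BACKUP-STRATEGY.md`.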