From a1cc3940cfe0bc8ac72c6ca4d3ea7f3291226fe0 Mon Sep 17 00:00:00 2001 From: s8n Date: Thu, 7 May 2026 17:33:24 +0100 Subject: [PATCH] docs: 2026-05-07 incident audit + backup strategy MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Player YOU500 lost full inventory to AuthLimbo void-death at 17:13:39. Investigation revealed deployed /opt/docker/backup.sh is an 88-line stub missing the Minecraft block; last successful world backup 2026-05-02 (already pruned). No recoverable .dat exists. Files: - AUDIT-2026-05-07.md — server-side findings F-01..F-06 (P0 backups, no-keepInventory, AuthLimbo silent failure, chunk preload race, Xmx > container headroom, container hardening gaps) - BACKUP-HUNT-2026-05-07.md — exhaustive backup scan; only 6-week-old archive at _archive/minecraft-old-2026-04-27.tar.gz - BACKUP-STRATEGY.md — restic-based plan; 5min/hourly/daily classes, off-host to onyx via Tailscale, monthly drill - CROSS-REFERENCE-2026-05-07.md — repo+doc landing map; flags pre-existing infra/STATE.md backup-broken note + HA-CLUSTER restic draft to extend rather than duplicate - docs/RUNBOOK-BACKUP-RESTORE.md — operator runbook for .dat restore, full-world restore, host-loss restore, drill log --- AUDIT-2026-05-07.md | 184 +++++++++++++++ BACKUP-HUNT-2026-05-07.md | 118 ++++++++++ BACKUP-STRATEGY.md | 393 +++++++++++++++++++++++++++++++++ CROSS-REFERENCE-2026-05-07.md | 364 ++++++++++++++++++++++++++++++ docs/RUNBOOK-BACKUP-RESTORE.md | 156 +++++++++++++ 5 files changed, 1215 insertions(+) create mode 100644 AUDIT-2026-05-07.md create mode 100644 BACKUP-HUNT-2026-05-07.md create mode 100644 BACKUP-STRATEGY.md create mode 100644 CROSS-REFERENCE-2026-05-07.md create mode 100644 docs/RUNBOOK-BACKUP-RESTORE.md diff --git a/AUDIT-2026-05-07.md b/AUDIT-2026-05-07.md new file mode 100644 index 0000000..0b638cc --- /dev/null +++ b/AUDIT-2026-05-07.md @@ -0,0 +1,184 @@ +# Minecraft Server Audit — racked.ru +**Container:** `minecraft-mc` on nullstone (192.168.0.100) +**Date:** 2026-05-07 +**Audit type:** Operational / data-integrity (NOT a network-security audit) +**Auditor:** Claude (Opus 4.7) via SSH read-only inspection +**Catalyst:** Player **YOU500** void-died at login (~17:13:39 BST), inventory lost. No usable backup existed. + +--- + +## Executive Summary + +**Status:** Critical issues found. +**Risk score model:** Likelihood (1-5) x Impact (1-5) = 1-25. >=15 = High, >=20 = Critical. + +A live AuthLimbo `teleportAsync returned false` warning fired during YOU500's first login of the day, immediately after `YOU500 left the confines of this world` (void death in `auth_limbo` world). The player retried twice. On retry #3 they were teleported to (-264.6, 86, -49.8) and 23 seconds later `was blown up by Creeper`. Console operator (s8n) attempted recovery via RCON but neither the void death nor the creeper death had item-restore data because: + +1. **No working backups.** `/opt/docker/backup.sh` deployed on nullstone is a stale 88-line copy missing the entire Minecraft block. The repo version (`scripts/backup.sh`) has the block but **was never deployed**. Daily 02:00 cron has been running for at least 7 days producing 8-12K archives that contain no world / playerdata / plugins. `BACKUP.md` claims the script handles MC; it does not. +2. 
**CoreProtect tracks inventory transactions but not death drops.** `co inspect` will not surface "dropped on death" entries the way it does pickup/drop, and even if it did, the 1.5 GB SQLite blob is approaching the point where `/co rollback` over an inventory radius is operationally slow. +3. **No `keepInventory` rule, no death-drop rescue plugin.** With `difficulty=hard`, `gamemode=survival`, and no Essentials `keepinv` permission flow visible, every death is a total loss. +4. **AuthLimbo has no death-listener and no failure remediation.** When `teleportAsync` returns false, the player is dropped at limbo spawn and the warning is logged at WARN level only — no alert, no rollback, no temp-stash of inventory. +5. **JVM heap sized larger than container limit.** `JVM_OPTS=-Xmx16384M` inside an `18G` container limit with `MEMORY_SIZE=16G`; if Aikar G1 heap actually grows to Xmx, plus off-heap (Netty, mmaps, zip cache) >2 GB, **kernel OOM kills the container**. Restart-on-OOM has no warning hook to discord/Matrix. + +**Three biggest exposures** +1. Backups silently broken for 7+ days. (Critical — 5x4=20) +2. No item-loss safety net for any cause of death. (Critical — 4x5=20) +3. AuthLimbo failure path has no recovery. (High — 4x4=16) + +--- + +## Findings Table + +Severity = Likelihood x Impact. P0 = act this week, P1 = this month, P2 = this quarter. + +| ID | Severity | Finding | Recommendation | Effort | +|----|----------|---------|----------------|--------| +| F-01 | **P0 / 20** | `/opt/docker/backup.sh` on nullstone is missing the entire MC backup block. Repo `scripts/backup.sh` has it but was never deployed. Daily backups since 2026-04-30 are 8-12K (effectively empty). | Sync the deployed script with repo, run a manual backup, verify world tarball >= 5 GB. Add a sentinel check to backup.sh that fails the run if `mc-world-backup-*.tar.gz` < 1 GB. | 30 min | +| F-02 | **P0 / 20** | No `keepInventory` rule and no `essentials.keepinv` permission. Every death is total loss. | Decide policy: (a) `gamerule keepInventory true` server-wide, (b) keep-inv only when death cause is "void"/"plugin teleport", or (c) auto-restore-on-AuthLimbo-failure. The narrow option (b) preserves survival pain while plugging the AuthLimbo data-loss vector. Plugin candidates: `KeepInventoryOnVoid`, `DeathChestPro`, custom listener in AuthLimbo. | 1-2h research, 1d implement | +| F-03 | **P0 / 18** | AuthLimbo logs WARN on teleport failure but has no alerting or recovery. The player is left at limbo spawn (y128 platform) where they re-disconnect and on retry get teleported normally — but the warning never reaches an operator. | (a) Bump `teleportAsync returned false` to ERROR. (b) Add a Discord/Matrix webhook alert via existing webhook stack. (c) On failure: snapshot player inventory, kick with friendly message, write recovery file `auth_limbo/incident--.dat` for ops replay. | 1d | +| F-04 | **P0 / 18** | YOU500's first failed teleport target was (2380.4, 69.9, -11358.4) — that's 11k blocks out and the chunk likely was not loaded yet. AuthLimbo's `preload-chunks: true` setting fires on `AuthMeAsyncPreLoginEvent` which may not run before `LoginEvent` in HaHaWTH's AuthMe fork. Exact timing race is unverified. | Add chunk-loaded assertion in AuthLimbo before calling `teleportAsync`; if not loaded, force-load synchronously OR delay teleport another 10-20 ticks. Add debug logging of chunk-load state in the WARN line. 
| 0.5d | +| F-05 | **P0 / 16** | JVM `-Xmx16384M` inside container `mem_limit=18G` with no headroom for off-heap (Netty buffers, native mmaps, mod metadata). Aikar flags + 25 plugins easily push native to 2-3 GB. Kernel OOM kill is silent. | Either (a) lower `-Xmx` to 12-14 GB and `MaxRAMPercentage`-style flag, OR (b) raise `mem_limit` to 24 GB. Also add `oom_score_adj` and a `docker events --filter event=oom` watcher that pings Discord. | 1h config + 2h alerting | +| F-06 | **P0 / 16** | No `pids_limit`, no `cap_drop: ALL`, no `read_only: true`. Container runs with the default Docker capability set (CAP_NET_RAW, CAP_SYS_CHROOT, etc.) it does not need. | Add `cap_drop: [ALL]`, `cap_add: [NET_BIND_SERVICE]` (only if binding <1024; 25565 is high so likely none), `pids_limit: 4096`, `security_opt: [no-new-privileges:true]`. Test boot, watch for startup failures. | 1h test | +| F-07 | **P1 / 15** | CoreProtect SQLite at 1.5 GB. Performance and reliability degrade past 2-3 GB. `database.db` is the only copy; no WAL checkpoint or vacuum schedule. | (a) Migrate to MySQL/MariaDB sidecar container. (b) Add monthly cron `co purge t:30d` (purge entries older than 30 days; CoreProtect docs). (c) Schedule `VACUUM` after purge. | 1d for MySQL migration, 1h for purge cron | +| F-08 | **P1 / 12** | AuthMe still on `passwordHash: SHA256` (legacy). Migration plan for SHA256 -> BCRYPT is on TODO list and still pending. | Set `legacyHashes: [SHA256]` and `passwordHash: BCRYPT`. AuthMe re-hashes on next successful login. Communicate "your password works as before, no action needed". | 30 min config + monitoring | +| F-09 | **P1 / 12** | `online-mode=false`. Server depends entirely on AuthMe + EpicGuard for identity. EpicGuard config not audited in this pass. | Verify `enableProtection: false` in AuthMe (currently false) is intentional, since geofencing is `US, GB, LOCALHOST` only — any user from another country is locked out if protection re-enabled. Document the choice in `RULES.md`. | 1h doc only | +| F-10 | **P1 / 12** | `auto-save-interval: 2400` (= 2 minutes at 20 TPS) is fine, BUT `paper-global.yml` has `player-auto-save: rate: -1` (= use auto-save-interval, so also 2 min). A player who joins, dies, and disconnects within 2 min may have NO post-death snapshot persisted before the player.dat is overwritten by their next login. Player save *does* fire on quit, but if the death happens and the player keeps moving / interacting before logout, items in chunks not yet saved are at risk for tar-while-running backups. | Set `player-auto-save: rate: 1200` (= 1 min). Switch backup strategy to `save-off` + `save-all flush` + tar + `save-on` to guarantee consistency, OR snapshot the host bind-mount with a filesystem-level snapshot (LVM / btrfs / ZFS). | 30 min config, 0.5d for snapshot path | +| F-11 | **P2 / 10** | `EZShop-1.0-SNAPSHOT.jar` is bundled alongside `AuctionHouse-1.4.6.jar`. PLUGIN_ALTERNATIVES.md TODO calls for dropping EZShop. | Remove EZShop, migrate any active shops to AuctionHouse, document the migration in `docs/migrations/`. | 0.5d player communication, 1h technical | +| F-12 | **P2 / 10** | Spigot `entity-tracking-range`: monsters 96, misc 96. Roadmap suggests tightening to monster=32, misc=16 for TPS / network savings. | Tune on next maintenance window, re-baseline TPS with `spark` profile. 
| 1h config, 1d to verify under load | +| F-13 | **P2 / 9** | 21 plugin folders without matching jar (orphans): `bStats`, `CarbonChat`, `ComfyWhitelist`, `EpicGuard`, `Essentials`, `faststats`, `GrimAC`, `Homestead`, `Lands`, `LPC`, `MarriageMaster`, `MiniMOTD`, `Multiverse-Core`, `PhantomSMP`, `TAB`, `UltimateTimber`, `UnexpectedSpawn`, `Vault`, `WorldEdit`, plus `.bak-*` directories. Most have a renamed jar (`carbonchat-paper-...jar`, `EssentialsX-...jar`) so this is mostly cosmetic. `Lands`, `LPC`, `MarriageMaster`, `PhantomSMP`, `UltimateTimber`, `UnexpectedSpawn` truly orphaned: jars not present. | Audit each: delete data dirs of plugins truly removed; the bStats/Essentials/Vault names are normal. Document plugin-name <-> jar-name pattern in `PLUGINS.md`. | 1h | +| F-14 | **P2 / 9** | No TPS Discord webhook alert (mentioned on TODO). spark is installed but auto-profile + alerting are not wired up. | spark already supports `spark profile --thresholds`; route to Discord via existing webhook stack. | 0.5d | +| F-15 | **P2 / 8** | RCON output for async commands (CoreProtect, LuckPerms) does not return to the issuing rcon-cli session. Found while trying `co inspect` from RCON. Async command results land in console only. | Document this in `docs/OPERATIONS.md` (does not exist yet — create it). For automation, attach to `docker logs -f minecraft-mc` in parallel. | 30 min doc | +| F-16 | **P2 / 8** | `gamerule keepInventory` could not be queried via `rcon-cli` due to `execute in run` argument parsing bug in itzg's rcon-cli wrapper (or RCON quoting). State unknown without in-game console. | Verify in-game by op user, document the rcon-cli limitation. | 5 min in-game | +| F-17 | **P2 / 6** | `RCON_PASSWORD` is committed to `docker-compose.yml` in plaintext (`*redacted*`). RCON port (25575) is bound to `127.0.0.1` so the blast radius is local only — but the secret is still in git history. | Rotate password, move to `.env` (gitignored), confirm `127.0.0.1`-only binding stays. | 30 min | +| F-18 | **P2 / 6** | `restart: unless-stopped` with no `start_period` re-evaluation on rapid OOM-restart loops. If the container OOMs every 60s, Docker keeps restarting indefinitely. | Add `restart_policy: { condition: on-failure, max_attempts: 5, window: 300s }` (compose v3+ deploy block) and a watchdog alert. 
| 30 min | + +--- + +## Detailed Methodology + +### Inputs inspected (read-only, no writes) + +| Source | Path | Method | +|--------|------|--------| +| Container env | `docker inspect minecraft-mc` | host shell | +| docker-compose | `/opt/docker/minecraft/docker-compose.yml` | host cat | +| AuthLimbo config | `/data/plugins/AuthLimbo/config.yml` | `docker exec cat` | +| AuthLimbo logs | `/data/plugins/AuthLimbo/` (no log files exist; only `config.yml`) | `docker exec ls` | +| AuthMe config | `/data/plugins/AuthMe/config.yml` | `docker exec cat` | +| AuthMe DB record for YOU500 | `/data/plugins/AuthMe/authme.db` | `docker exec python3 sqlite3` | +| CoreProtect config | `/data/plugins/CoreProtect/config.yml` | `docker exec cat` | +| CoreProtect DB size | `/data/plugins/CoreProtect/database.db` | `docker exec du -sh` | +| Server log | `/data/logs/latest.log` | `docker exec grep` | +| Paper / Spigot / Purpur configs | `/data/config/paper-*.yml`, `/data/spigot.yml`, `/data/purpur.yml` | `docker exec cat` | +| World sizes | `/data/world*/` | `docker exec du -sh` | +| Backup script (deployed) | `/opt/docker/backup.sh` | host cat | +| Backup script (repo) | `/home/admin/ai-lab/_github/minecraft-server/scripts/backup.sh` | local cat | +| Backup output | `/opt/backups/` | host stat | +| Backup log | `/opt/backups/backup.log` | host tail | +| Live state | RCON `tps`, `list` | `docker exec rcon-cli` | + +### YOU500 incident timeline (reconstructed from `latest.log`) + +| Time (BST 2026-05-07) | Event | +|-----------------------|-------| +| 17:13:34 | Login from 45.157.234.219, UUID c7c2df8e-...-686b | +| 17:13:35 | Spawned in `auth_limbo` (0.5, 128, 0.5) per AuthLimbo platform default | +| 17:13:38 | AuthMe: "YOU500 logged in" | +| 17:13:39 | AuthLimbo: "Restoring YOU500 to world(2380.4, 69.9, -11358.4)" | +| 17:13:39 | **`YOU500 left the confines of this world`** — void death | +| 17:13:39 | **`[AuthLimbo] teleportAsync returned false for YOU500 — Paper may have rejected the location.`** | +| 17:15:33 | Disconnect | +| 17:15:39 | Re-login from 82.22.5.229. Stored auth-loc has now been UPDATED to (-264.6, 86, -49.8) — different from the first attempt. Either user `/sethome`'d previously or AuthMe overwrote on the void death. | +| 17:15:44 | AuthLimbo: "Restoring YOU500 to world(-264.6, 86.0, -49.8)" — no WARN this time | +| 17:15:53 | Disconnect | +| 17:16:00 | Re-login from 82.22.5.230 | +| 17:16:05 | AuthLimbo: "Restoring YOU500 to world(-264.6, 86.0, -49.8)" | +| 17:16:28 | **`YOU500 was blown up by Creeper`** | +| 17:16:57 | Operator (s8n) RCON: `tpa YOU500 -264 86 -50` + `tell YOU500 grab items fast 5min despawn` | +| 17:17:02 | RCON teleport executed | +| 17:18:22 | s8n in-game: `/tp2p YOU500 s8n` | + +The void death at 17:13:39 is the data-loss event. AuthMe had `SaveQuitLocation: true` so the (2380, 70, -11358) was a real prior position but the chunk was almost certainly not loaded yet (11k blocks out, no recent player there). `teleportAsync` returned false either because: +- the chunk failed to load within Paper's async generation budget, or +- the entity was already dead (void death raced ahead of teleport). + +### What CoreProtect WOULD have caught (and didn't) + +CoreProtect inventory tracking is enabled (`item-transactions: true`, `item-drops: true`, `item-pickups: true`, `rollback-items: true`). However: +- A void-death drops items into the world for ~5 min then despawns. 
Drops are item entities, not container transactions; CoreProtect logs them as drops only if a player was the immediate cause of the drop. +- A death-drop in the `auth_limbo` world (where the void death happened) drops into y<0 air which is itself a non-event for CP. +- Thus there was no item-rollback path even if `co inspect` had been run within minutes. + +**Implication:** CoreProtect is the wrong tool for death-drop recovery. A real death-drop plugin or `keepInventory` is the only fix. + +### Backup script forensics + +- Deployed: 88 lines, last block is "Prune old backups". No Minecraft block. No `umask 077`. +- Repo: 131 lines (with malformed lines 119-122 leftover from a bad merge — ALSO a bug to fix on the next push). Has the Minecraft block. Has `umask 077`. +- `/opt/backups/backup.log` shows last 5 days of "Backup complete" entries averaging 8-12K. None contain MC data. None mention MC. The log line `Configs: partial (some files missing)` is the configs section misfiring on Matrix paths and was never the MC block. +- Last verified-good MC archive on host: `/opt/backups/mc-plugins-prerebrand-2026-04-30.tar.gz` (one-shot pre-rebrand snapshot; contents not verified in this audit). + +--- + +## Action Items (Prioritised) + +### P0 — this week (by 2026-05-14) + +1. **F-01 / Backups.** Sync deployed backup.sh with repo. Fix the lines 119-122 corruption in repo first. Add post-run sentinel: `[ "$(stat -c%s mc-world-backup-*.tar.gz)" -gt 1073741824 ] || log "WORLD BACKUP TOO SMALL — ABORT"`. Run manual backup, verify >= 5 GB on disk. Test a restore into a scratch dir. +2. **F-02 / Item-loss safety net.** Decide policy. Recommend: enable `keepInventory true` in `auth_limbo` world only (cheap, narrow), and write a 50-line AuthLimbo extension `OnPlayerDeath` listener that detects "death in auth_limbo" -> restore inventory snapshot taken at AuthMeAsyncPreLogin. Survival pain preserved everywhere else. +3. **F-03 / AuthLimbo recovery.** Bump WARN to ERROR. Wire to existing Discord webhook (per workspace memory: webhook stack on nullstone). On failure, write player snapshot to `auth_limbo/incidents/-.dat`. +4. **F-04 / Chunk preload race.** Add chunk-loaded check + sync force-load before `teleportAsync`. If still false, kick with friendly message instead of letting the player drop into limbo. +5. **F-05 / OOM headroom.** Lower `-Xmx` to 14 GB and add `docker events` watcher. +6. **F-06 / Container hardening.** Add `cap_drop`, `pids_limit`, `no-new-privileges`. Boot test in a window. + +### P1 — this month + +7. **F-07** CoreProtect prune cron, plan MySQL migration. +8. **F-08** SHA256 -> BCRYPT migration with legacyHashes fallback. +9. **F-09** Document `online-mode=false` rationale in RULES.md. +10. **F-10** Consider LVM/ZFS snapshot for backup atomicity. + +### P2 — this quarter + +11. **F-11** Drop EZShop after player communication window. +12. **F-12** Tighten entity tracking range, re-profile with spark. +13. **F-13** Clean orphan plugin folders. +14. **F-14** Wire spark TPS alerts to Discord. +15. **F-15** Document RCON async-command behaviour. +16. **F-17** Rotate RCON password, move to .env. +17. **F-18** Add restart-policy max_attempts. + +--- + +## Open Questions for the Operator + +1. **Inventory restoration policy.** Is silent `keepInventory` only in `auth_limbo` acceptable, or do you want a manual ops-restore-from-snapshot approval gate? +2. **YOU500 specifically.** Is there an out-of-band record of what they were carrying (Discord screenshot, witness)? 
If yes, manual NBT injection into player.dat is feasible. CoreProtect cannot help. +3. **Chunk preload trade-off.** Force-loading distant chunks at login adds 200-2000ms to login time. Acceptable vs the void-death risk? +4. **MySQL for CoreProtect.** Adds an operational dependency (another container, another backup target). Worth the complexity, or is monthly purge to keep SQLite under 1 GB sufficient? +5. **RCON password rotation.** The committed value should be rotated on principle. Schedule a maintenance window? +6. **online-mode=false.** Confirm long-term stance. Mojang ToS implications for racked.ru? +7. **Backups offsite.** Currently `/opt/backups/` is on the same host. Plan for offsite copy (B2, restic to friend-PC, anything)? + +--- + +## What was NOT in scope this audit + +- Network firewall, fail2ban, host-side security (nullstone-server has its own audit folder). +- Plugin source-supply-chain audit (covered by `docs/ROADMAP.md` "plugin acquisition overhaul"). +- Performance profiling under load (deferred per F-12). +- LuckPerms permission graph correctness. +- Rules / chat-format / prefix audit (workspace memory: do NOT touch LP prefixes). +- Per-region (Lands / Homestead) data integrity. + +--- + +## Sign-off + +| Field | Value | +|-------|-------| +| Audit date | 2026-05-07 | +| Method | Read-only SSH inspection, no fixes applied | +| Workspace rule applied | "Audit findings -> docs first, then fix" | +| Next action | Operator review + go/no-go on each P0 item | +| Next audit due | 2026-08-07 (quarterly), or sooner after backups remediated | diff --git a/BACKUP-HUNT-2026-05-07.md b/BACKUP-HUNT-2026-05-07.md new file mode 100644 index 0000000..08091fe --- /dev/null +++ b/BACKUP-HUNT-2026-05-07.md @@ -0,0 +1,118 @@ +# YOU500 Inventory Recovery — Backup Hunt Report + +**Date:** 2026-05-07 +**Player:** YOU500 (UUID `c7c2df8e-8783-30b5-891c-86ec9343686b`) +**Incident:** Full inventory loss at 17:13:39 BST. AuthLimbo `teleportAsync returned false`, player teleport into world from auth_limbo failed → `YOU500 left the confines of this world` (void death). Vanilla `/data/world/playerdata` overwritten on respawn with empty inventory; vanilla void = no drops in world. +**Host:** nullstone (192.168.0.100), live MC data at `/home/docker/minecraft/` (== `/opt/docker/minecraft/`, same FS, inode 18877649 confirmed). +**SSH user:** `user` (no sudo). All `/opt/backups/2026*` dated subdirs are root-owned 0700 → unreadable. `/var/lib/docker/volumes/` unreadable. + +--- + +## Summary + +**Recoverable backup exists: YES — partial.** The pre-rebrand world archive `/home/user/ai-lab/_archive/minecraft-old-2026-04-27.tar.gz` contains YOU500's playerdata `.dat` from **2026-03-25 18:53** (size 9617 B vs current 9192 B — bigger = inventory likely populated). It is the **only known full-inventory snapshot for this UUID** anywhere on the host. + +**Caveat:** This is a 6-week-old snapshot. Items gained between 2026-03-25 and 2026-05-07 17:13 are NOT recoverable from any file backup. **CoreProtect** is installed and has been logging since 2026-05-01 → use `/co inventory YOU500` and `/co rollback` to retrieve anything stored in containers post-2026-05-01. + +**No scheduled world backups exist.** `/opt/docker/backup.sh` stopped backing up the MC world after 2026-05-02 (the world-backup branch was removed when the script was last edited; only configs/Matrix/RC are now dumped). 
Last world tarball that landed on disk: `/opt/backups/20260430_020001/minecraft-configs-20260430_020001.tar.gz` (12 KB → configs only, no playerdata). + +--- + +## Inventory of Backup Artifacts (oldest → newest) + +| When | Path | Size | Owner | Contains YOU500 .dat? | Notes | +|------|------|------|-------|----------------------|-------| +| 2026-03-25 18:53 (file mtime inside) | `/home/user/ai-lab/_archive/minecraft-old-2026-04-27.tar.gz` | ~? large | user | **YES** — `minecraft/world/playerdata/c7c2df8e-…dat` 9617 B + `.dat_old` 9616 B (2026-03-25 18:49) | **Best candidate.** 133 player .dat files, full world tree, Essentials/LitePlaytimeRewards/LandClaim DBs, advancements, stats. | +| 2026-04-30 02:01 | `/opt/backups/20260430_020001/minecraft-configs-20260430_020001.tar.gz` | 12 KB | root (UNREADABLE) | NO — configs only | Cannot read without sudo; size implies no world data anyway. | +| 2026-04-30 02:01 | `/opt/backups/20260430_020001/configs-20260430_020001.tar.gz` | 2.4 KB | root | NO | Traefik/Matrix/RC configs. | +| 2026-04-30 19:21 | `/opt/backups/mc-plugins-prerebrand-2026-04-30.tar.gz` | 224 MB | user | NO playerdata `.dat` files. Has `plugins/AuthMe/playerdata/` (empty), `plugins/AuthMe.bak-20260430-144204/playerdata/` (empty), `plugins/SkinsRestorer/cache/YOU500.mojangcache`. Vanilla world NOT included. | Plugin trees only — useful for password DB (`plugins/AuthMe.bak-…/authme.db`), not inventory. | +| 2026-05-03 02:00 | `/opt/backups/20260503_020001/configs-20260503_020001.tar.gz` | 2.4 KB | root | NO | Configs. | +| 2026-05-04 02:00 → 2026-05-07 02:00 | `/opt/backups/20260504_020001` … `20260507_020001` | 0700 dirs | root (UNREADABLE) | Inferred NO from log: backup.log shows only "configs OK" / "Matrix Postgres skipping" / "Volumes skipping" — world not touched after 2026-05-02. | All four dirs report 12 KB. | +| 2026-05-07 17:15 | `/home/docker/minecraft/world/playerdata/c7c2df8e-…dat_old` | 9181 B | uid 101000 | YES — but POST-DEATH (empty inventory). | Identical to live state right after first respawn. | +| 2026-05-07 17:21 | `/home/docker/minecraft/world/playerdata/c7c2df8e-…dat` | 9192 B | uid 101000 | YES — current live, empty inventory. | | +| 2026-05-07 17:15 | `/tmp/you500.dat` | 9181 B | user | YES — but byte-identical-size to `.dat_old`; gunzip strings show only base attribute schema (no item/Slot tags) → already empty. | Someone (you) already extracted the empty post-death dat. Useless for recovery. | + +### Misc archives checked, NOT relevant + +- `/opt/source-endpoint/source.tar.gz` — Misskey AGPL source dump. +- `/opt/backups/misskey/*` — Misskey DB/files. +- `/home/user/ai-lab/.stversions/_projects/_minecraft/launcher/java/java21.tar~*.gz` — JDK. +- `/home/user/ai-lab/_projects/_minecraft/resources/racked.ru.-.minecraft.7z` — launcher resources. +- `/home/user/ai-lab/.stversions/**` — Syncthing versions hold only **server config files** (`server.properties`, `bukkit.yml`, `purpur.yml` etc.) under `_github/online/minecraft-server/config/`. **No `.dat` or `playerdata/`** anywhere in `.stversions`. `.stignore` does not list `world/`, but the synced repo never contained the world dir to begin with (it's `_github/minecraft-server/` = configs + docker-compose only). 
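For reproducibility, here is a minimal read-only sketch of the scan behind the tables above. The UUID is YOU500's from the header; the archive globs are the paths inspected in this hunt. This is a sketch, not the exact commands run — and the root-owned 0700 dirs under `/opt/backups/` will not even glob without sudo:

```bash
#!/usr/bin/env bash
# Read-only: which archives contain playerdata entries for the incident UUID?
# Sketch only — root-owned 0700 dirs won't appear in the glob without sudo.
shopt -s globstar nullglob
UUID="c7c2df8e-8783-30b5-891c-86ec9343686b"
for a in /home/user/ai-lab/_archive/*.tar.gz /opt/backups/**/*.tar.gz; do
  if ! [ -r "$a" ]; then
    printf '%s\tUNREADABLE (needs sudo)\n' "$a"
    continue
  fi
  n=$(tar -tzf "$a" 2>/dev/null | grep -c "playerdata/${UUID}" || true)
  printf '%s\t%s playerdata entries for UUID\n' "$a" "${n:-0}"
done
```

Note this only covers `.tar.gz` artifacts; the `.7z` launcher-resources archive was checked separately and is irrelevant per the list above.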
+ +--- + +## CoreProtect — Live Rollback Source + +| Path | Size | Born | Last modified | +|------|------|------|---------------| +| `/data/plugins/CoreProtect/database.db` (in container) | 1.59 GB | 2026-05-01 10:11:53 | 2026-05-07 17:27 | + +CoreProtect logs container interactions, item drops, deaths, inventory changes since **2026-05-01**. For YOU500's items stored in chests/shulkers/ender chests within the world, an in-game rollback can recover them: + +- Inspect deaths: `/co lookup user:YOU500 action:#kill time:1d` +- Inspect inventory transactions: `/co inventory YOU500` (CoreProtect-CE feature) +- Rollback drops/voids near death: `/co rollback time:1h user:YOU500 radius:#global action:-drop,#kill` + +(Items YOU500 carried in person and lost to void at 17:13:39 are unlikely to appear in CoreProtect — vanilla void death deletes drops without a kill event in some versions; CoreProtect's `#kill` may or may not have logged it. Worth a `/co lookup user:YOU500 time:30m` to confirm.) + +--- + +## Best Recovery Candidate + +**File:** `/home/user/ai-lab/_archive/minecraft-old-2026-04-27.tar.gz` +**Internal path:** `minecraft/world/playerdata/c7c2df8e-8783-30b5-891c-86ec9343686b.dat` +**Snapshot date:** 2026-03-25 18:53 (~6 weeks before incident). + +### Extraction command (DO NOT RUN — for review only) + +```bash +# Extract just the YOU500 dat to a staging area, do NOT touch live data +mkdir -p /tmp/you500-recovery +tar -xzvf /home/user/ai-lab/_archive/minecraft-old-2026-04-27.tar.gz \ + -C /tmp/you500-recovery \ + minecraft/world/playerdata/c7c2df8e-8783-30b5-891c-86ec9343686b.dat \ + minecraft/world/playerdata/c7c2df8e-8783-30b5-891c-86ec9343686b.dat_old + +# Confirm and inspect (NBT viewer or zcat | strings) before any restore +ls -la /tmp/you500-recovery/minecraft/world/playerdata/ +zcat /tmp/you500-recovery/minecraft/world/playerdata/c7c2df8e-8783-30b5-891c-86ec9343686b.dat \ + | strings | grep -E 'Slot|count|minecraft:diamond|minecraft:netherite' | head -40 +``` + +### Restore plan (operator decision — NOT executed) + +1. Stop the server (or kick YOU500) so file is not held open. +2. With sudo (uid 101000 owns the file): copy the extracted `.dat` over `/home/docker/minecraft/world/playerdata/c7c2df8e-8783-30b5-891c-86ec9343686b.dat`, preserve mode/owner. +3. Also overwrite `.dat_old`. +4. Optional: replace `Essentials/userdata/c7c2df8e-…yml` from same archive if the YML matters. +5. Restart server. Player rejoins with March 25 inventory + position. + +**Tradeoff:** YOU500 will lose all progress 2026-03-25 → 2026-05-07. Communicate before applying. Combine with CoreProtect rollback to minimise loss. + +--- + +## Gaps + +- **No scheduled world backups since 2026-05-02.** `/opt/docker/backup.sh` no longer dumps `world/`. The 2026-04-30 daily contains a 12 KB "minecraft-configs" tarball (configs, not world). Action: re-add a world tarball to the daily script. +- **No off-host backup.** No restic / borg / duplicity / rsnapshot installed. No rclone. No second host pulling MC data. Syncthing does not sync the world dir. +- **No filesystem snapshots.** Root is ext4 on LVM (no LVM thinpool snapshots in use), `/home` is ext4 (no btrfs/ZFS). +- **`/var/lib/docker/volumes/` unreadable** without sudo. Confirmed via `docker volume ls | grep -iE mine|back|world` returning empty (named volumes not used for MC — bind mount only). +- **`/opt/backups/2026*_020001` subdirs unreadable** (mode 0700 root). Cannot diff their contents byte-for-byte; relied on `backup.log` text + indirect listing. 
They almost certainly contain only configs (12 KB dirs, log entries match). +- **`docker exec minecraft-mc env | grep -i backup` returned nothing** — no env-driven autosave/backup plugin enabled (e.g. `itzg/mc-backup` sidecar absent, no AutomatedBackup / EasyBackup jar in `/data/plugins`). +- **AuthMe `playerdata/` dirs are empty** in both live and `.bak-20260430-144204` — AuthMe is configured without inventory protection (no logged-out inv snapshots). +- **No InvSee / InventoryRollback plugin.** Only CoreProtect (logs, not snapshots). + +--- + +## Permission-Limited Reads (no sudo via SSH) + +| Path | What we couldn't see | Likely contents | +|------|----------------------|-----------------| +| `/opt/backups/20260504_020001/` … `20260507_020001/` | Directory listings (0700 root) | Daily configs tarballs, ~12 KB each — confirmed via `du` in backup.log | +| `/opt/backups/20260430_020001/minecraft-configs-20260430_020001.tar.gz` | tar listing (root-owned, 0600) | MC config bind-mount tarball, 12914 B | +| `/var/lib/docker/volumes/` | Directory listing | Named volumes — not used by MC (bind mount only) | +| `/var/backups/` (host) | Listing | Standard Debian dpkg/apt backups, not MC | +| `/root/` | Anything | — | + +Re-run with `sudo` if any of these need confirmation, but content is improbable to change the conclusion. diff --git a/BACKUP-STRATEGY.md b/BACKUP-STRATEGY.md new file mode 100644 index 0000000..2a69bb8 --- /dev/null +++ b/BACKUP-STRATEGY.md @@ -0,0 +1,393 @@ +# Minecraft Backup Strategy — racked.ru on nullstone + +**Status:** PROPOSAL (2026-05-07) — not yet implemented. +**Author trigger:** Player lost full inventory to void death today; rollback impossible because the existing 02:00 daily backup had **silently failed for 5 of the last 7 days** and there is **zero off-host copy**. +**Owner:** `s8n` (operator). +**Target host:** `nullstone` (192.168.0.100, Debian 13 trixie). + +--- + +## 0. Current state (audited 2026-05-07) + +Existing system in `/opt/docker/backup.sh` + `cron.d/docker-backup` (02:00 daily, 7-day retention in `/opt/backups/`). + +Findings from `/opt/backups/backup.log`: + +| Date | MC world result | Backup dir total | +|------|-----------------|------------------| +| 2026-04-26 | FAILED | — | +| 2026-04-27 | FAILED | — | +| 2026-04-28 | FAILED | — | +| 2026-04-29 | OK (3.6 G) | — | +| 2026-04-30 | FAILED | — | +| 2026-05-01 | FAILED | — | +| 2026-05-02 | OK (3.6 G) | — | +| 2026-05-03 | (no MC log line) | 8 K | +| 2026-05-04 | (no MC log line) | 8 K | +| 2026-05-05 | (no MC log line) | 8 K | +| 2026-05-06 | (no MC log line) | 12 K | +| 2026-05-07 | (no MC log line) | 12 K | + +After 2026-05-02 the entire MC block stopped emitting log lines. The script appears to be exiting before reaching it (the duplicated stray `chmod 600 ... synapse-signing-key` lines at L119–122 are orphaned from a botched edit and may now break `set -e`). Effective state: **two MC backups in the last 12 days**, both already pruned by 7-day retention. **No usable backup exists right now.** + +Cross-references: +- `_github/infra/STATE.md` Top-5 weakness #2 ("backup.sh broken silently") and #5 ("No off-host backup"). +- `_github/infra/runbooks/MIGRATION-nullstone-to-cobblestone.md` §5 already names this `F-backup-1` and proposes "Restic + autorestic to B2/Wasabi or to nullstone-as-spare". This strategy refines that to use on-hand resources rather than paid storage. 
+ +### Available resources (no purchasing required) + +| Asset | Location | Free | Reachability | Role | +|---|---|---|---|---| +| nullstone `/home` | local NVMe (ext4 LVM) | 142 G of 399 G | local | Primary repo + restic cache | +| onyx `/home` | LUKS NVMe | 1.6 T of 1.9 T | Tailscale 100.64.0.1 (LAN ~5 ms) | **Off-host primary** | +| friend RTX 4080 PC | DESKTOP-LR0RILA | unknown (Windows, large) | Tailscale 100.64.0.3 (WAN, IP-stable via tailnet) | **Off-host secondary** (defer) | +| nullstone `/opt/backups` | same disk as `/opt/docker` | 142 G | local | *Not* a real backup target — same-disk SPOF | + +**No purchased B2 / Wasabi / S3 in this proposal.** Tailscale + onyx covers off-host today. B2 stays in the future-options annex. + +--- + +## 1. Threat model + +| # | Threat | Concrete example | Frequency | Mitigation in this plan | +|---|---|---|---|---| +| T1 | Player accidental loss (void death, lava, fall) | YOU500, 2026-05-07 | weekly | 5-min playerdata snapshots (RPO ≤ 5 min) | +| T2 | Griefing / theft / chest emptied by ban-evader | possible | monthly | 5-min playerdata + 1-h world snapshots | +| T3 | World corruption (chunk error, region-file truncate) | rare | — | 6-h pre-flight validated full world snapshot | +| T4 | Plugin / config bad change (LuckPerms wipe, server.properties) | edits during ops | weekly | daily configs + DB dump + git history (`live-server/` repo) | +| T5 | Host disk failure (single NVMe) | low/year | — | nightly off-host copy to onyx (Tailscale) | +| T6 | Ransomware / host compromise | low | — | append-only Restic repo on onyx; nullstone holds **no** delete key | +| T7 | Operator `rm -rf` or wrong `docker compose down -v` | low | — | retention floor (4 weekly + 12 monthly) survives a recent rm | +| T8 | Backup script silently failing (current state) | OBSERVED | — | heartbeat alert + monthly restore drill (§7) | + +T8 is the one that just bit us. The single most important addition is **alerting on missed runs**, not the storage tech. + +--- + +## 2. RPO / RTO + +| Class | Data | RPO | RTO | Backup mechanism | +|---|---|---|---|---| +| A | playerdata (`world/playerdata/*.dat`, `stats/`, `advancements/`) | **5 min** | < 2 min per player | rcon `save-all flush` → rsync to local snapshot, then restic-add | +| B | full world (region files, end + nether) | **1 h** during play, **6 h** otherwise | 15 min | restic of `world*/` | +| C | plugin configs + LuckPerms YAML | 24 h | 30 min | tar of `plugins/*/config*.yml` + LP file dump | +| D | LuckPerms / Homestead SQLite DBs (`*.db`, `homestead_data.db`) | 1 h | 5 min | sqlite `.backup` then restic-add | +| E | host-level configs (`docker-compose.yml`, `server.properties`, `purpur.yml`, `bukkit.yml`, `paper-*.yml`, `whitelist.json`, `ops.json`, `banned-*.json`, `config/`) | 24 h | 5 min | already in git repo `_github/minecraft-server/`; backup just covers drift | + +**Justification for RPO=5 min on Class A:** the void-death case rebuilds in seconds — recovering one `.dat` is a ~30 s operation if a 5-min-old snapshot exists. Snapshotting just the 1.3 MB `playerdata/` dir is cheap (single-digit MB/day after dedup). + +--- + +## 3. 
Tool choice — Restic + +Compared: + +| Tool | Dedup | Encryption | Snapshots | Network destinations | Verdict | +|---|---|---|---|---|---| +| **restic** | content-addressed, very effective on MC region files | AES-256, repo-key | yes | sftp (Tailscale), local, B2, S3, Azure, rclone | **WINNER** | +| borgbackup | similar | yes | yes | ssh only, lock-on-write | Equally good; restic chosen because operator already plans `restic + autorestic` per `infra/STATE.md` line 112; sftp dest is simpler than borg's required serverside binary | +| rsnapshot | hardlinks, no dedup | none | rotated dirs | local + rsync | No encryption ⇒ off-host copy on Tailscale (already encrypted) is fine, but no dedup means 18 G × N snapshots is painful. Reject. | +| zfs send | block-level | (zfs native) | snapshots | yes | nullstone is **ext4/LVM**, no ZFS, no btrfs. Reject. | +| LVM snapshot | COW | none | yes | local only | Same-disk only, doesn't survive disk failure. Useful as a *staging* primitive only. | +| custom rsync + cp -al | hardlinks | none | yes | yes | Reinventing rsnapshot. Reject. | +| itzg `BACKUP_*` env | tar to volume | none | rotation | local | Already tried in spirit by current `backup.sh`; same-disk; not granular. Reject as primary. | + +**Decision:** `restic` for Classes A, B, C, D. Continue using a thin tar wrapper for Class E (configs are already in the git repo, this is just safety). + +Restic strengths for our case: +- Region files dedup *very* well (chunks unchanged across snapshots). +- A 5-min Class-A snapshot adds ~MB to the repo, not the full 1.3 MB × N. +- One repo on local disk + one mirror to onyx via `rclone serve restic` or direct `sftp:` — no agent needed on onyx beyond ssh. +- `restic check --read-data-subset=5%` is the canonical scrub. + +Apt: `apt install restic` on trixie ships 0.16.x — sufficient. + +--- + +## 4. Schedule + +All times Europe/London (matches `TZ` in compose file). + +| Job | Cadence | Source | Destination | Mechanism | +|---|---|---|---|---| +| **A — playerdata** | every **5 min** | `world/playerdata/`, `world/stats/`, `world/advancements/`, `world*/level.dat`, `*.db` (LP+homestead) | restic repo `/home/user/restic/mc-frequent/` | systemd timer `mc-backup-frequent.timer` | +| **B — full world** | every **1 h** during play (07:00–01:00), **6 h** otherwise | `world/`, `world_nether/`, `world_the_end/` | restic repo `/home/user/restic/mc-world/` | systemd timer `mc-backup-world.timer` | +| **C — configs + plugins** | **daily 02:00** | `/opt/docker/minecraft/*.yml`, `*.json`, `plugins/*/config*.yml`, `plugins/LuckPerms/`, `docker-compose.yml` | restic repo `mc-world` (path-tagged) | reuse same timer with second backup target | +| **D — DB dumps** | every **1 h** | `homestead_data.db`, `plugins/CoreProtect/database.db`, `plugins/LuckPerms/luckperms-h2-*` | restic repo `mc-world` | timer hooks `sqlite3 .backup` first | +| **E — off-host mirror** | **nightly 03:30** | nullstone `/home/user/restic/` | onyx `100.64.0.1:/home/admin/backups/nullstone-mc-restic/` | `restic copy` over sftp (Tailscale) — append-only key on onyx side | +| **F — verify** | **weekly Sun 04:00** | both repos | — | `restic check --read-data-subset=5%` then alert on rc | +| **G — drill** | **monthly 1st Sat 11:00** | random snapshot | scratch dir | §7 procedure | + +### Why this works for the void-death case + +T1 hits at 18:42. By 18:45 a Class-A snapshot exists containing the player's `.dat` from 18:40. Restore: `restic -r ... 
restore --target /tmp/r --include 'world/playerdata/.dat' latest`, stop the server (or `/save-off` plus a brief manual file swap), copy the file into place, `/save-on`. Total RTO < 2 min.

---

## 5. Retention

Restic policy (passed to `restic forget --keep-*`):

```
--keep-last 24 # 24 most recent (covers 2h of 5-min snapshots)
--keep-hourly 24 # 24h of hourly
--keep-daily 7 # 7 days
--keep-weekly 4 # 4 weeks
--keep-monthly 12 # 12 months
```

Applied per tag — Class A snapshots tagged `playerdata`, B/C/D tagged `world`. Forget is run **only on the local repo**; the onyx mirror effectively inherits the policy because `restic copy` runs after the local forget+prune and therefore only transfers surviving snapshots.

### Storage budget

- Class A: 1.3 MB raw × dedup (~20× on `.dat`, mostly empty NBT slots) → ~70 KB / snapshot **net**.
  - 12/h × 24h × 7 = 2 016 snapshots/week → ≈ 140 MB/week.
- Class B/C/D: 18 G raw → ~6.5 G compressed (per the current 3.6 G figure, adjusted for nether/end now active). Restic dedup on hourly snapshots: ~50–200 MB delta/snapshot during active play.
  - 24 hourly + 7 daily + 4 weekly + 12 monthly ≈ 47 retained → estimate **15–25 GB total** at steady state.
- E (off-host): same as above on onyx (1.6 TB free — ~60× headroom).

**Conclusion:** comfortably fits in nullstone's 142 G free. Onyx is essentially unconstrained.

---

## 6. Off-host destination — onyx via Tailscale

**Choice:** `onyx` (100.64.0.1, 1.6 TB free on `/home`). Reasons:
- Already in the tailnet (`tag:admin`), already trusted, already SSH-reachable.
- 1.6 TB is ~60–100× the dataset.
- Operator's daily-driver: a missed-backup alert on onyx is *seen*.
- Deferred (phase 2): replicate to friend's RTX 4080 PC (100.64.0.3) for true geographic separation. The tailnet IP is stable across the friend's ISP IP changes per memory `project_friend_gpu`.

**Mechanics:**
1. On onyx: create restricted user `mc-backup` with `~/backups/nullstone-mc-restic/` and an `~/.ssh/authorized_keys` entry that **only allows `internal-sftp` chrooted to that dir** — no shell, no port-forward (`Match User mc-backup` + `ChrootDirectory %h` + `ForceCommand internal-sftp -d /backups/nullstone-mc-restic` in `sshd_config`).
2. On nullstone: install nullstone's ssh public key on onyx for that user. Give the nullstone job its own restic repo key (separate password) so the primary credential stays on onyx — but note restic keys are equal-privilege: a compromised nullstone could still run `forget`/`prune` over sftp. True append-only enforcement is a `rest-server --append-only` feature, not something plain sftp provides. Practical compromise: rely on `restic copy` being add-only in normal operation, keep the chroot's parent dir root-owned, and audit any `forget` runs on the mirror from the onyx side.
3. Nightly job on nullstone: `restic -r sftp:mc-backup@100.64.0.1:/backups/nullstone-mc-restic copy --from-repo /home/user/restic/mc-world latest && ... mc-frequent ...`.
4. Onyx-side cron weekly: `restic check` on the mirror (independent verification).

**Why not friend's GPU PC?** Windows host, no built-in SSH, asymmetric trust. Defer to phase 2 once an SMB or `rclone serve` target is set up there.

---

## 7. Restore drill (monthly, 1st Saturday 11:00)

Runbook: `docs/RUNBOOK-BACKUP-RESTORE.md` (created alongside this proposal).

Drill scenario: "YOU500 lost his inventory to a void death 6 minutes ago." Steps:

1. Pick a known UUID from `world/playerdata/` (operator's own UUID).
2. `restic -r /home/user/restic/mc-frequent snapshots --tag playerdata | tail -5` — confirm the freshest snapshot is ≤ 6 min old.
3. `restic -r ... 
restore latest --target /tmp/drill-$(date +%s) --include 'world/playerdata/.dat'`. +4. `nbted` or `python -m nbtlib` parse the `.dat` — confirm it's a valid GZIP NBT structure (not zero bytes, not partial). +5. `diff` against the live `.dat` — log the differences (expected: at least the inventory NBT path differs because player kept playing). +6. Repeat from the **onyx mirror** repo to prove off-host works end-to-end. +7. Log result to `docs/RUNBOOK-BACKUP-RESTORE.md` § Drill log. + +Drill is **non-destructive** — never overwrite live `.dat` during a drill. Real restores follow §3 of the runbook. + +Pass criteria: both restores complete in < 2 min wall-clock and the parsed NBT root tag is well-formed. + +--- + +## 8. Implementation — concrete drafts + +Two layers: a **fix** to the existing daily script (Class C/E) and a **new sidecar timer** for Classes A/B/D. + +### 8.1 Fix `/opt/docker/backup.sh` (F-backup-1) + +Already documented in `infra/runbooks/MIGRATION-nullstone-to-cobblestone.md` §5. Minimum work: +- Drop dead `matrix-postgres` block (Synapse retired). +- Drop / fix `mongodb` block (RC stopped 2026-05-06). +- Remove orphaned `chmod 600 ...synapse-signing-key...` block at L119–122 (causing `set -e` exit before MC block on most days). +- Wrap each module in `( ... ) || log "module FAILED"` so one module's failure doesn't skip the rest. + +Out-of-scope for this strategy doc — track in infra audit. + +### 8.2 New: `mc-backup-frequent` (Class A) and `mc-backup-world` (Classes B/C/D) + +Drop-in files (operator review before deploy): + +**`/etc/systemd/system/mc-backup-frequent.service`** +```ini +[Unit] +Description=Minecraft frequent backup (playerdata, every 5 min) +After=docker.service +Wants=docker.service + +[Service] +Type=oneshot +User=user +Group=docker +EnvironmentFile=/etc/mc-backup.env +ExecStart=/usr/local/bin/mc-backup-frequent.sh +Nice=10 +IOSchedulingClass=best-effort +IOSchedulingPriority=7 +``` + +**`/etc/systemd/system/mc-backup-frequent.timer`** +```ini +[Unit] +Description=Run mc-backup-frequent every 5 minutes + +[Timer] +OnBootSec=2min +OnUnitActiveSec=5min +AccuracySec=30s +Persistent=true + +[Install] +WantedBy=timers.target +``` + +**`/etc/mc-backup.env`** (mode 0600, owner `user:docker`) +``` +RESTIC_REPOSITORY_FREQUENT=/home/user/restic/mc-frequent +RESTIC_REPOSITORY_WORLD=/home/user/restic/mc-world +RESTIC_PASSWORD_FILE=/etc/mc-backup.pw +MC_DATA=/opt/docker/minecraft +RCON_HOST=127.0.0.1 +RCON_PORT=25575 +RCON_PASS=*redacted* +HEARTBEAT_URL=https://ntfy.s8n.ru/mc-backup-frequent +ALERT_URL=https://ntfy.s8n.ru/mc-backup-alerts +TS_OFFHOST_USER=mc-backup +TS_OFFHOST_HOST=100.64.0.1 +TS_OFFHOST_PATH=/backups/nullstone-mc-restic +``` + +**`/usr/local/bin/mc-backup-frequent.sh`** +```bash +#!/usr/bin/env bash +set -euo pipefail +. /etc/mc-backup.env + +trap 'curl -fsS -m 10 -d "fail rc=$?" "$ALERT_URL" >/dev/null || true' ERR + +# 1. Ask MC to flush via rcon (best-effort; don't fail backup if rcon down) +if command -v mcrcon >/dev/null 2>&1; then + mcrcon -H "$RCON_HOST" -P "$RCON_PORT" -p "$RCON_PASS" -w 1 \ + "save-all flush" >/dev/null 2>&1 || true +fi + +# 2. 
Snapshot just the small fast-changing things. No `|| true` here:
# a failed backup must trip the ERR trap and skip the heartbeat (§9).
restic backup \
  -r "$RESTIC_REPOSITORY_FREQUENT" \
  --tag playerdata \
  --tag auto-5min \
  --host nullstone \
  --exclude='*.lock' \
  "$MC_DATA/world/playerdata" \
  "$MC_DATA/world/stats" \
  "$MC_DATA/world/advancements" \
  "$MC_DATA/world/level.dat" \
  "$MC_DATA/world_nether/level.dat" \
  "$MC_DATA/world_the_end/level.dat" \
  "$MC_DATA/homestead_data.db" \
  "$MC_DATA/plugins/LuckPerms" \
  "$MC_DATA/plugins/CoreProtect/database.db"

# 3. Cheap retention (only on local repo)
restic forget \
  -r "$RESTIC_REPOSITORY_FREQUENT" \
  --tag auto-5min \
  --keep-last 24 --keep-hourly 24 --keep-daily 7 \
  --prune --quiet

# 4. Heartbeat — only reached on success; the ntfy-side dead-man's
# switch alerts if this is not received within 15 min
curl -fsS -m 5 "$HEARTBEAT_URL" >/dev/null || true
```

(The env file defines `RESTIC_REPOSITORY_FREQUENT`/`_WORLD`, not `RESTIC_REPOSITORY`, so every restic call passes `-r` explicitly; and the backup step is deliberately unguarded so a failure alerts instead of being swallowed.)

**`mc-backup-world.{service,timer,sh}`** — same shape: runs hourly during play / every 6 h otherwise (use `OnCalendar=*-*-* 07,08,...,01:00:00` or two timers), backs up full `world*/`, configs, and DB dumps. After the local backup, it runs:

```bash
restic copy \
  --from-repo "$RESTIC_REPOSITORY_WORLD" \
  -r "sftp:$TS_OFFHOST_USER@$TS_OFFHOST_HOST:$TS_OFFHOST_PATH" \
  latest
```

And once nightly (separate timer) the same `copy` for `mc-frequent`.

### 8.3 docker-compose.override.yml — alternative path (rejected)

Considered: the itzg image supports `BACKUP_INTERVAL`, `BACKUP_METHOD=restic`. Pros: in-container, knows when the world is loaded. Cons:
- Bind-mounting the host restic repo crosses the userns-remap boundary (uid 100000 vs host uid 1000) — already a known nullstone footgun (memory `project_nullstone_docker_userns`).
- A container restart wipes the restic cache, making the first run after every reboot slow.
- Mixing in-image and host-cron backup logic doubles the failure surface.

**Decision:** keep backups in systemd on the host; the container is unaware. An override file is **not** part of this proposal.

---

## 9. Monitoring & alerting

Three signals, all routed to ntfy on the existing self-hosted `ntfy.s8n.ru` (assumed to exist; if not, add it as part of phase 1 — a single-container deploy). DiscordSRV was dropped on 2026-04-30 per README.md L170, so Discord is not an option.

| Signal | Trigger | Channel |
|---|---|---|
| `mc-backup-frequent` heartbeat | timer fires successfully | ntfy topic `mc-backup-frequent` (silent on success) |
| Heartbeat **missing > 15 min** | dead-man's switch on the ntfy server, or external (`healthchecks.io` is free + self-hostable) | ntfy topic `mc-backup-alerts` (high priority) |
| `restic check` weekly | non-zero rc | ntfy topic `mc-backup-alerts` (high priority) |
| Off-host mirror failure | `restic copy` non-zero rc | ntfy topic `mc-backup-alerts` (high priority) |

Operator subscribes onyx + phone to `mc-backup-alerts` only. The `-frequent` topic is a heartbeat sink (not a notification stream).

**Alternative if no ntfy yet:** write to `/var/log/mc-backup.log` AND a tiny status file `/var/lib/mc-backup/last-success` (mtime checked by an external monitor — Gatus on roadmap, Beszel on roadmap). Until either of those lands, a simple cron on **onyx** doing `ssh user@nullstone 'find /var/lib/mc-backup/last-success -mmin -15 | grep .'` and triggering a desktop `notify-send` is enough.

This addresses T8 (the silent-failure threat) directly.

---

## 10. Cost & capacity

**Hardware cost:** £0. Uses the existing nullstone NVMe + onyx NVMe + the existing Tailscale mesh.
+ +**Disk consumption (steady state, both repos):** + +| Where | Estimate | Headroom | +|---|---|---| +| nullstone `/home/user/restic/mc-frequent` | < 1 GB | 142 G free → ~140× | +| nullstone `/home/user/restic/mc-world` | 15–25 GB | ~6× | +| onyx `~/backups/nullstone-mc-restic/` | 16–26 GB | 1.6 T free → ~60× | + +**Days of retention given current free space:** even if the world doubles to 36 GB raw, dedup keeps growth linear at ~5 % per snapshot — well over a year of monthly retention fits. + +**Network:** Tailscale LAN-direct (5 ms onyx ↔ nullstone). Nightly delta typically < 500 MB after dedup. Negligible. + +**Operator time:** ~2 h initial deploy, ~10 min/month for the drill, ~zero on autopilot. + +--- + +## 11. Phase plan + +| Phase | What | When | Blocker | +|---|---|---|---| +| 0 | This doc + runbook stub written, reviewed | TODAY | — | +| 1 | Stop the bleeding: fix `backup.sh` orphan lines so daily MC tar at least runs again | TODAY (15 min) | — | +| 2 | Stand up `mc-backup-frequent` timer + local restic repo (Class A) | this week | needs `apt install restic mcrcon` | +| 3 | Add `mc-backup-world` timer + Class B/C/D | this week | — | +| 4 | Onyx off-host SFTP target + `restic copy` job | this week | onyx user provisioning + ssh key | +| 5 | First monthly drill | next 1st Saturday | — | +| 6 | Wire ntfy alerts | when ntfy/Gatus deployed (infra roadmap) | external | +| 7 | Friend RTX 4080 PC as second off-host (geographic) | phase 2 | Windows-side tooling | + +Phases 1–4 are doable today with what's on hand. Nothing in phases 1–5 requires purchasing. + +--- + +## 12. Open questions for operator + +1. **ntfy.s8n.ru — does it exist yet?** Memory hints at Tuwunel + Matrix on `txt.s8n.ru`. If ntfy isn't deployed, decide: deploy ntfy *now*, or use Matrix room via Tuwunel webhook bridge as alert sink. +2. **Onyx user `mc-backup`** — create today or reuse existing `admin` with restricted authorized_keys? Restricted user is cleaner; reusing `admin` is faster. +3. **Append-only enforcement** on the onyx side — accept "sftp chroot + no shell" as good-enough, or invest in a per-repo restic key with `--no-delete`-style isolation (more work, partial mitigation only)? +4. **Pre-flight world validation** — run `region-fixer` against the latest snapshot weekly to catch silent corruption (T3)? Adds ~5 min compute weekly. Recommend yes. +5. **Class-E (host configs) — already in `live-server/` git repo via Syncthing/manual?** If yes, drop Class E from this scheme; if no, add it. + +--- + +## 13. References + +- `docs/BACKUP.md` — current (broken) state docs. +- `docs/RUNBOOK-BACKUP-RESTORE.md` — operational runbook (this commit). +- `scripts/backup.sh` — to-be-fixed daily script (F-backup-1 in `infra/STATE.md`). +- `_github/infra/STATE.md` — Top-5 weakness #2 + #5 tracking this work. +- `_github/infra/runbooks/MIGRATION-nullstone-to-cobblestone.md` §5 — F-backup-1 detail; nullstone-as-spare hint. +- Memory: `project_friend_gpu` (Tailscale stable IP for friend), `project_tailscale_mesh` (mesh layout), `project_nullstone_docker_userns` (why container-side backup is rejected). +- `CLAUDE.md` Device Registry — onyx 192.168.0.28 / 100.64.0.1. diff --git a/CROSS-REFERENCE-2026-05-07.md b/CROSS-REFERENCE-2026-05-07.md new file mode 100644 index 0000000..946b4a9 --- /dev/null +++ b/CROSS-REFERENCE-2026-05-07.md @@ -0,0 +1,364 @@ + + +# Cross-Reference Survey — 2026-05-07 + +**Trigger:** racked.ru player **YOU500** void-died via AuthLimbo +`teleportAsync` failure, lost full inventory, no backups exist. 
+Four parallel agents are writing audit + plan docs. This doc maps +them onto existing infra so nothing collides or gets orphaned. + +--- + +## 1. Per-repo state snapshot + +### `auth-limbo` (Paper plugin source) + +| Field | Value | +|---|---| +| Origin | `ssh://git@192.168.0.100:222/s8n-ru/auth-limbo.git` ⚠️ stale (`s8n-ru` rename) | +| Latest tag in CHANGELOG | **1.0.0** (2026-04-30) — single release | +| Last commit | `b686380 readme: restyle to match minecraft-launcher format` | +| Recent commits | README rewrites, AGPL switch, rename chain `RackedLimbo → LoginLimbo → AuthLimbo` | +| CI | `.github/workflows/build.yml` + `release.yml` (GitHub Actions, **not** `.forgejo/`) | +| Tests | **None.** `src/test/` does not exist. | +| Source | 5 Java files: `AuthLimbo`, `AuthMeDatabase`, `LimboWorldManager`, `LoginListener`, `VoidGenerator` | +| Docs | `docs/{compatibility,configuration,how-it-works,installation}.md` | +| CHANGELOG style | **Keep a Changelog + SemVer**, date-suffixed `## [1.0.0] - 2026-04-30` | +| License | AGPL-3.0-or-later, SPDX header in every Java file | + +**Key existing detail relevant to the bug** — `LoginListener.java` +already implements the documented Paper #4085 fix (chunk-ticket pin +in `AuthMeAsyncPreLoginEvent` + `getChunkAtAsyncUrgently` chained +with `teleportAsync` at MONITOR priority on `LoginEvent`, with +configurable `authme.teleport-delay-ticks`). If YOU500 still +void-died, the bug is in **how** that chain handled a return-value +of `false` / a thrown exception — the current code only logs a +`warning` and lets the player stay wherever they were (which on +login is the limbo void). See `LoginListener.java:166-191`. + +The AuthLimbo audit agent's findings should land as: +- **`docs/INCIDENT-2026-05-07-you500.md`** (new) — forensic root-cause + doc, follow `docs/REBRAND_2026-04-30.md` style (date-prefixed, + scope/apply/result/rollback sections — convention shown below). +- **`CHANGELOG.md`** — bump to `## [1.0.1] - 2026-05-07` with + `### Fixed` block, follow Keep-a-Changelog format. +- **`src/main/java/ru/authlimbo/LoginListener.java`** — code patch. + Likely changes: handle `success == false` and `exceptionally` + with a kick or retry rather than silent log; consider raising + default `teleport-delay-ticks` from 10 → 20. +- **`src/test/`** (new directory) — unit tests for the listener. + No precedent here, but pom.xml needs JUnit added. + +--- + +### `minecraft-server` (server repo — this repo) + +| Field | Value | +|---|---| +| Origin | `ssh://git@192.168.0.100:222/s8n-ru/minecraft-server.git` ⚠️ stale | +| Last commit | `ede6029 proantitab: allow lp/luckperms in global; deny essentials.motd default` | +| Top-level docs | `MISSION.md`, `README.md`, `RULES.md`, `THANKS.md`, `VIBE.md`, `TELEMETRY_AUDIT.md` | +| `docs/` | `BACKUP.md`, `DEPLOY.md`, `PERMISSIONS.md`, `PLUGINS.md`, `PLUGIN_ALTERNATIVES.md`, `RACKED_BRAND.md`, `REBRAND_2026-04-30.md`, `ROADMAP.md`, `migrations/lands-to-landclaim.md`, `plugins/.md` (20 files) | +| Existing TODO | The README "Roadmap / TODO" section (lines 91-180) is the canonical living checklist. Tagged `[P0]` blocker / `[P1]` vision / `[P2]` improvement / `[P3]` nice-to-have. `docs/ROADMAP.md` is **scoped narrowly** to plugin-acquisition overhaul (Phases 1-3). | +| `live-server/` | live config snapshot (purpur.yml, server.properties, ops.json, plugins/) — **mirrors prod state**, not a build input. 
| +| Backup script | `scripts/backup.sh` — note **bug at line 119** (orphaned `"${BACKUP_PATH}/synapse-signing-key-${TIMESTAMP}.key"` block sits outside any `if`, will fail at runtime if signing-key path absent) | +| CI | `.github/workflows/` is empty. `.github/ISSUE_TEMPLATE/` empty. No `.forgejo/`. | + +**No existing files named** `AUDIT*`, `INCIDENT*`, `RUNBOOK*`, +`TODO*`, `CHANGELOG*` at root or in `docs/`. The closest precedents: +- `docs/REBRAND_2026-04-30.md` — date-prefixed event log w/ + Apply/Side incident/Rollback sections. **Use this as the format + template for any new INCIDENT-* doc.** +- `docs/migrations/lands-to-landclaim.md` — multi-section migration + plan (Current State / Target / Plan / Rollback). Format template + for future strategy docs. +- `MISSION.md` / `VIBE.md` / `RULES.md` — top-level "values" docs. + Don't add new top-level capitalised md files unless the doc is + similarly load-bearing for the project's identity. Detail goes in + `docs/`. + +--- + +### `infra` (nullstone+cobblestone runbooks) + +| Field | Value | +|---|---| +| Origin | `ssh://git@192.168.0.100:222/veilor-org/infra.git` ✅ org-scoped, no rename impact | +| Last commit | `381f923 runbook: distribute load + sync data (operator's HA vision)` | +| Layout | `forgejo/`, `runbooks/`, `repos/`, root `STATE.md` + `AUDIT-2026-05-05.md` | +| Runbooks | `COBBLESTONE-INTAKE.md`, `DE-DECISION-cobblestone.md`, **`HA-CLUSTER-distribute-and-sync.md`** (already covers MC backup placement!), `MIGRATION-nullstone-to-cobblestone.md` | + +**Critical pre-existing context:** +- `STATE.md` already lists *"`/opt/docker/backup.sh` fixes — + matrix-postgres + rocketchat-mongodb + literal CHANGE_ME pw"* as + open issue (line 97), AND lists Restic+autorestic as the **#1** + recommended addition (lines 113, 283-285 of `AUDIT-2026-05-05.md`). +- `runbooks/HA-CLUSTER-distribute-and-sync.md` line 51 already plans + *"Backups (offsite) — Restic to B2/Wasabi nightly"* and line 72 + pins MC to nullstone with *"World data ZFS-replicated for DR + only"*. The backup-strategy agent's plan must reconcile with this + — don't propose a parallel scheme; either extend the HA runbook or + cross-link it as the parent design. +- `AUDIT-2026-05-05.md` lines 200-203 already flag the backup script + as silently broken (RC + ex-Matrix not dumping). Confirms the + symptom that caused YOU500's loss. + +**Format conventions in `infra/`:** +- Audit reports: `# 5-Agent Audit Report — YYYY-MM-DD` header, + TL;DR section, severity-ordered Action items section, file index. +- Runbooks: `# Runbook — ` header, Goal blockquote, North-star + diagram if applicable, phase plan, failure scenarios + RTO table, + open decisions, related links. +- Dating: filenames always `-YYYY-MM-DD.md`. + +--- + +### `minecraft-launcher` + +| Field | Value | +|---|---| +| Origin | `ssh://git@192.168.0.100:222/s8n-ru/minecraft-launcher.git` ⚠️ stale | +| Last commit | `31d25f8 readme: shrink license section to single sub line` | +| Relevance to incident | None direct. Would only matter if the incident agent recommends a launcher-side patch (e.g. forced relog on void death detection) — unlikely. | + +### `minecraft-client` + +**Not a git repo** (`fatal: not a git repository`). No remote to +worry about. Excluded from any rewrite list. + +### `veilor-os` + +| Field | Value | +|---|---| +| Origin | `ssh://git@192.168.0.100:222/veilor-org/veilor-os.git` ✅ no rename impact | +| Relevance | None — separate brand (security distro), not Minecraft. Skipped per instructions. 
| + +--- + +## 2. Stale `s8n-ru` origin URLs (per 2026-05-07 rename) + +Per workspace memory `user_git_identity.md` the Forgejo user `s8n-ru` +was renamed to `s8n` on 2026-05-07. Forgejo serves a 307 redirect for +now but the canonical path is `s8n/`. The following local +clones still have the old origin: + +| Repo (local clone) | Current origin | Should become | +|---|---|---| +| `_github/auth-limbo` | `ssh://git@192.168.0.100:222/s8n-ru/auth-limbo.git` | `ssh://git@192.168.0.100:222/s8n/auth-limbo.git` | +| `_github/minecraft-server` | `ssh://git@192.168.0.100:222/s8n-ru/minecraft-server.git` | `ssh://git@192.168.0.100:222/s8n/minecraft-server.git` | +| `_github/minecraft-launcher` | `ssh://git@192.168.0.100:222/s8n-ru/minecraft-launcher.git` | `ssh://git@192.168.0.100:222/s8n/minecraft-launcher.git` | + +**No rename required for:** `_github/infra` (`veilor-org/`), +`_github/veilor-os` (`veilor-org/`), `_github/minecraft-client` (not +a repo). + +Recommended one-shot fix (deferred — not part of these four agents): + +```bash +for r in auth-limbo minecraft-server minecraft-launcher; do + cd /home/admin/ai-lab/_github/$r + git remote set-url origin ssh://git@192.168.0.100:222/s8n/$r.git +done +``` + +Also update the in-doc URL references: +- `auth-limbo/src/main/resources/plugin.yml` line 7: `website: https://github.com/s8n-ru/auth-limbo` +- `auth-limbo/src/main/java/ru/authlimbo/*.java` SPDX header: `Copyright (C) 2026 s8n-ru` +- `minecraft-server/VIBE.md` line 38: `github.com/s8n-ru/auth-limbo` + +--- + +## 3. Overlap with session-noted TODO items + +The session noted these TODOs that the four agents may want to fold +into recommendations. State as of HEAD: + +| Item | Existing mention? | Where | Status | +|---|---|---|---| +| **SHA256 → BCRYPT** (AuthMe hashing) | ✅ flagged 2026-05-02 | `security/nullstone-server/2026-05-02-mc-audit.md` summary: *"AuthMe also uses unsalted SHA-256, no tempban, no captcha, and 5-char minimum passwords"* | **Not yet addressed in repo.** No TODO entry in README. New. | +| **EZShop drop** | ⚠️ Plugin loaded via `PLUGINS:` in `docker-compose.yml:51` | docker-compose.yml | No TODO entry yet. New. | +| **CapDrop** (Linux capabilities) | ❌ No mention | — | Net-new infra-side item (deploy.security level). Belongs in server-audit agent's report. | +| **tracking-range** | ❌ No mention | — | Net-new (purpur.yml tuning). New. | +| **CO DB → MySQL** (CoreProtect) | ❌ No mention | — | Net-new. Touches plugin policy (CoreProtect-CE is the one acknowledged license exception per MISSION.md — CO config change OK, plugin swap not). | +| **TPS webhook** | ⚠️ "Prometheus exporter + Grafana" entry exists in README:105 (P2). Webhook would be lighter-weight alternative. | README.md:105 | Adjacent to existing TODO; consider replacing or augmenting it. | +| **spark baseline** | ✅ spark already loaded in `PLUGINS:` (compose:54) and listed in VIBE.md:78 | docker-compose.yml, VIBE.md | "Baseline" = capture a profiling run for ref. Net-new. | +| **plugin folder cleanup** | ⚠️ `live-server/plugins/` is checked-in live config snapshot. Past cleanup happened in REBRAND_2026-04-30 (Side incident — disk full). | docs/REBRAND_2026-04-30.md:65-74 | Operational, not docs. Net-new. | + +**None of the eight overlap with the existing `docs/ROADMAP.md`** +(which is scoped narrowly to *plugin-acquisition* — manifest + +lockfile + CI). They all belong in the **README.md "Roadmap / TODO" +checklist** by current convention. 
The server-audit agent should append them there, not create a new ROADMAP-* doc.
+
+---
+
+## 4. Existing backup-related mentions
+
+| File | Line | Content |
+|---|---|---|
+| `docs/BACKUP.md` | all | Documents the daily 02:00 cron + retention. **Critical drift:** describes worlds being backed up, but VIBE.md:54-58 says *"no world backups"*. Direct contradiction. |
+| `scripts/backup.sh` | 80-117 | Minecraft block: docker-exec tar of world/world_nether/world_the_end + configs. **Real, working code.** |
+| `scripts/backup.sh` | 119-122 | **Orphaned dead-code block** outside any `if` (dangling from `synapse-signing-key`). Will fail the script if the signing-key path is missing. |
+| `README.md` | 23, 45, 164, 179 | Mentions the backup feature. README:179 records "freed 11G+ (old backups, ...)". |
+| `VIBE.md` | 54-58 | *"Daily configs, no world backups (it'd eat too much disk). If you lose a base to grief, that's the game."* — **conflicts with reality.** |
+| `docs/REBRAND_2026-04-30.md` | 53, 65-74 | Records the 2026-04-30 backup tarball and the 2026-05-01 disk-full incident from accumulated backups. Confirms backups *were* running. |
+| `SYSTEM.md` | 737-749 | Workspace-level system reference says backups run daily, ~5-7GB compressed. Out-of-date plugin count (says 25, actual ~16) and Purpur version (says 1.21.10, actual 1.21.11). |
+
+**Major contradiction the backup-strategy agent must resolve:** either VIBE.md must drop the *"no world backups"* line (recommended — reality is that worlds **are** being backed up), or the operator must accept that the YOU500 loss happened because the worlds were **logically excluded from the policy** even though they were mechanically being archived. The latter is unlikely — a daily 02:00 tarball would have caught a 2026-05-07 daytime void death.
+
+**Backup-hunt agent finding to verify:** does `/opt/backups/` on nullstone actually contain any usable `mc-world-backup-*.tar.gz` files? `STATE.md` line 97 and `AUDIT-2026-05-05.md` lines 200-203 suggest the script *runs* but that its other arms fail silently; the MC arm at lines 80-117 of backup.sh has no obvious bug, so backups should exist. If they don't, that's the deepest finding.
+
+---
+
+## 5. Forgejo runner / CI integration
+
+Per memory `project_forgejo_nullstone.md` and `STATE.md` lines 26-27, nullstone runs a Forgejo runner with labels `ubuntu-24.04 + nullstone`. **No repo currently has a `.forgejo/` directory** — neither auth-limbo nor minecraft-server nor infra. CI in `auth-limbo` is GitHub Actions (`.github/workflows/`).
+
+`STATE.md` lines 121-129 note that the v0.5.32 veilor-os ship is blocked on flipping `runs-on:` to `nullstone` to use the Forgejo runner.
+
+**Implication for the audit agents:** if the AuthLimbo agent wants the fix to land via CI, there are two options:
+1. Keep `.github/workflows/build.yml`. The GH mirror is manual-only post-2026-05-06 (`STATE.md`:14-18), so the workflow won't trigger automatically anymore; it would need a manual mirror push.
+2. Migrate to `.forgejo/workflows/build.yml` with `runs-on: ubuntu-24.04` (compatible with the runner). Cleaner, matches the new direction. **Recommended.**
+
+If path 2 is taken, the pre-existing dependency on the `AUTHME_JAR_URL` repo secret (see `.github/workflows/build.yml:21-26`) must be re-created on Forgejo.
+
+---
+
+## 6. Workspace-level `SYSTEM.md` updates needed after backup-strategy lands
+
+`/home/admin/ai-lab/SYSTEM.md` lines 665-779 hold the canonical workspace-level Minecraft section.
After the backup-strategy doc lands, the following blocks need editing (one PR, one paragraph each):
+
+| SYSTEM.md location | Existing content | Drift |
+|---|---|---|
+| Line 677 | "Minecraft Version: 1.21.10 (Purpur build 2532)" | Actual: 1.21.11 (compose line 10) |
+| Lines 686-690 | "25 plugins loaded ... bulk-updated 2026-04-17" | Plugin set has shifted heavily since (LandClaimPlugin → Homestead, WorldEdit → FAWE, Vault → VaultUnlocked, LoginSecurity → AuthMe, AuthLimbo added, EZShop+AuctionHouse added). Real count ≈ 16. |
+| Lines 692-706 | RAM 7GB idle, Purpur 1.21.10-2535, startup 47s | Out of date; would benefit from a re-measure as part of the "spark baseline" TODO. |
+| Lines 765-771 | "Known Issues" block | Add a YOU500 incident closure note (post-fix); F10 RCON wildcard is already promised in Wave 2. |
+| Line 776 | "Backup frequency: Add 6-hourly world snapshots for active play sessions" | This is the existing wishlist item the backup-strategy agent will likely satisfy. Strike, or replace with "Done — see infra/runbooks/MC-BACKUP-2026-05-07.md" (or wherever the strategy lands). |
+
+**Per `CLAUDE.md` workspace rules**, technical detail belongs in SYSTEM.md, not README.md. The README device-table line for nullstone won't change.
+
+---
+
+## 7. Integration recommendations — where each parallel agent's doc lands
+
+| Agent | Output should land at | Rationale |
+|---|---|---|
+| **Backup hunt** (find existing backups) | `_github/minecraft-server/docs/INCIDENT-2026-05-07-you500-backup-hunt.md` | Date-prefixed, follows REBRAND_2026-04-30.md format. Forensic in nature → minecraft-server `docs/`. |
+| **AuthLimbo audit** (root-cause + code patch) | (1) `_github/auth-limbo/docs/INCIDENT-2026-05-07-teleportasync-failure.md` for the forensic write-up; (2) source patch + `CHANGELOG.md` bump in the same repo; (3) optional cross-link from `minecraft-server/docs/INCIDENT-2026-05-07-you500-backup-hunt.md` | Plugin source repo owns plugin bugs. INCIDENT- naming convention matches REBRAND_*.md. |
+| **Backup strategy** (forward-looking design) | `_github/infra/runbooks/MC-BACKUP-strategy-2026-05-07.md` (or extend `HA-CLUSTER-distribute-and-sync.md` with a Phase 1.5 sub-section) | infra owns the nullstone-side cron + restic. Cross-link from `minecraft-server/docs/BACKUP.md` (replace its current contents with a thin pointer). |
+| **Server audit** (broader hardening — CapDrop, plugin folder, MySQL, etc) | `_github/minecraft-server/docs/AUDIT-2026-05-07.md` (synthesis), then **append individual TODOs to README.md "Roadmap / TODO"** | Matches the `infra/AUDIT-2026-05-05.md` precedent. README is the canonical TODO surface for this repo per existing convention. |
+
+**Files needing edits AFTER all four agents finish:**
+
+| File | Change |
+|---|---|
+| `_github/minecraft-server/README.md` | Append new TODO entries from the server-audit agent: SHA256→BCRYPT, EZShop drop, CapDrop, tracking-range, CO MySQL, TPS webhook, spark baseline, plugin folder cleanup. Add `[x]` for the YOU500 incident under "Done" once the fix ships. |
+| `_github/minecraft-server/docs/BACKUP.md` | Rewrite to point to the infra runbook; current Schedule/Strategy/Manual sections move to infra. Or replace contents with a thin "see infra/runbooks/MC-BACKUP-strategy-2026-05-07.md". |
+| `_github/minecraft-server/VIBE.md` | Drop or revise lines 54-58 — *"no world backups"* contradicts reality and is the philosophical claim that may have justified treating backups as low-priority. Important narrative fix. |
+| `_github/minecraft-server/scripts/backup.sh` | Fix the orphaned dead-code block at lines 119-122. Independent of the strategy agent's output. |
+| `_github/minecraft-server/docker-compose.yml` | If the EZShop drop is accepted: remove line 51. (Server-audit agent decision.) |
+| `_github/auth-limbo/CHANGELOG.md` | New `## [1.0.1] - 2026-05-07` entry. |
+| `_github/auth-limbo/pom.xml` | Version bump 1.0.0 → 1.0.1 if the patch ships. |
+| `_github/auth-limbo/src/main/java/ru/authlimbo/LoginListener.java` | Code fix per the AuthLimbo agent. |
+| `_github/infra/STATE.md` | Add a 2026-05-07 changelog entry referencing the incident; check off the "/opt/docker/backup.sh fixes" pending decision (line 97) once the backup script is repaired. |
+| `_github/infra/AUDIT-2026-05-05.md` | Append an addendum or leave dated; the new audit replaces/augments the F-numbered findings related to MC backups. |
+| `/home/admin/ai-lab/SYSTEM.md` | Update the Minecraft section per §6 above. Add a note in Known Issues (line 765). Update Last Updated. |
+| `/home/admin/ai-lab/README.md` | "Last Updated" stamp; a one-line status mention if the user wants it surfaced at workspace level. |
+
+---
+
+## 8. Open conflicts and duplications
+
+1. **VIBE.md vs reality** (most important narrative conflict). VIBE says no world backups; backup.sh + BACKUP.md + REBRAND_2026-04-30 prove worlds **are** archived nightly. The YOU500 inventory loss means either (a) backups didn't run that day, (b) a backup ran but the rollback isn't operationally feasible (it would lose other players' progress between 02:00 and the death), or (c) the operator chose not to roll back. **The backup-strategy agent must address this explicitly** rather than just propose a new scheme.
+
+2. **`docs/ROADMAP.md` scope vs README "Roadmap / TODO"** — the docs file is narrowly about plugin-acquisition Phases 1-3, while the README has the all-up living checklist. Future agents should not put generic TODO items into `docs/ROADMAP.md`. Keep its scope tight or rename it `docs/PLUGIN-ACQUISITION-ROADMAP.md`.
+
+3. **infra `HA-CLUSTER-distribute-and-sync.md` vs new MC-backup strategy** — there's a real risk the backup-strategy agent designs Restic-to-B2 in isolation while HA-CLUSTER already plans that exact service for both nullstone+cobblestone. The strategy doc must reference and extend the HA-CLUSTER plan (specifically the "Backups (offsite)" row in its layer table, line 51).
+
+4. **CoreProtect MySQL migration** — proposed in the session TODOs. `MISSION.md:24` codifies CoreProtect-CE as "the one acknowledged license exception". Switching its DB backend to MySQL is fine under that policy (config, not plugin swap), but the server-audit agent should explicitly note "this is a config change, not a plugin swap, so MISSION.md:24 still holds" so the policy isn't accidentally diluted.
+
+5. **AuthLimbo CI host** — `.github/workflows/` lives in the repo but the GH push-mirror is off as of 2026-05-06. Builds will only run if someone manually pushes to GH. Worth flagging to the AuthLimbo agent that any CI step they propose may need a `.forgejo/` variant, otherwise the patched 1.0.1 release won't auto-build.
+
+6. **`_github/minecraft-client` is not a git repo** — nothing to worry about for this incident, but anyone iterating on the incident later may try to commit something there expecting it to work. Worth recording.
+
+---
+
+## 9. Summary table — convention by repo
+
+| Repo | Audit doc convention | Incident doc convention | TODO surface | CHANGELOG style |
+|---|---|---|---|---|
+| `auth-limbo` | (none yet) | (none yet — recommend `docs/INCIDENT-YYYY-MM-DD-<slug>.md`) | (none — small repo) | Keep a Changelog + SemVer, `## [X.Y.Z] - YYYY-MM-DD` |
+| `minecraft-server` | (none yet — recommend `docs/AUDIT-YYYY-MM-DD.md` matching infra style) | follow `docs/REBRAND_2026-04-30.md` template | README "Roadmap / TODO" with `[P0..P3]` tags | (none — uses git log) |
+| `infra` | `AUDIT-YYYY-MM-DD.md` at root | (use runbooks for forward-looking; no incident files yet) | `STATE.md` "Pending decisions" table | (none — uses git log + STATE.md) |
+| `minecraft-launcher` | n/a | n/a | (none) | (none) |
+| `veilor-os` | (separate brand — out of scope) | — | — | — |
+
+---
+
+*End of survey. Read-only. No files modified. No commits pushed.*
diff --git a/docs/RUNBOOK-BACKUP-RESTORE.md b/docs/RUNBOOK-BACKUP-RESTORE.md
new file mode 100644
index 0000000..c8b467d
--- /dev/null
+++ b/docs/RUNBOOK-BACKUP-RESTORE.md
@@ -0,0 +1,156 @@
+# Runbook — Backup & Restore (Minecraft, racked.ru on nullstone)
+
+Strategy doc: [`../BACKUP-STRATEGY.md`](../BACKUP-STRATEGY.md). This runbook is the **operator-facing** procedure for the three scenarios that come up in practice. Keep it short, copy-paste-able, and reachable from the player support workflow.
+
+> **Status (2026-05-07):** This runbook is written **ahead** of the implementation it describes. The `mc-backup-frequent` timer and onyx mirror are NOT yet deployed. The "What if no snapshot exists yet?" section at the bottom covers today's reality.
+
+---
+
+## TL;DR — restore one player's `.dat` from N minutes ago
+
+```bash
+# On nullstone, as `user`:
+PUUID=      # e.g. from /opt/docker/minecraft/usercache.json
+PUUID_NICK= # the player's current nick; `kick` takes a name, not a UUID
+WHEN=latest # or a snapshot id from `restic snapshots` (restic does not
+            # parse relative times like "5 min ago")
+RESTIC_PASSWORD_FILE=/etc/mc-backup.pw \
+restic -r /home/user/restic/mc-frequent \
+  restore "$WHEN" \
+  --target /tmp/restore-$$ \
+  --include "world/playerdata/${PUUID}.dat"
+
+# Verify the file is well-formed NBT before applying:
+file /tmp/restore-$$/opt/docker/minecraft/world/playerdata/${PUUID}.dat
+# Expected: "gzip compressed data"
+
+# Apply (server must be running so playerdata is writable; the player
+# MUST be offline or we're racing the writer):
+mcrcon -H 127.0.0.1 -P 25575 -p *redacted* "kick ${PUUID_NICK} Restore in progress"
+mcrcon -H 127.0.0.1 -P 25575 -p *redacted* "save-off"
+mcrcon -H 127.0.0.1 -P 25575 -p *redacted* "save-all flush"
+
+cp /opt/docker/minecraft/world/playerdata/${PUUID}.dat \
+   /opt/docker/minecraft/world/playerdata/${PUUID}.dat.preFix-$(date +%s)
+cp /tmp/restore-$$/opt/docker/minecraft/world/playerdata/${PUUID}.dat \
+   /opt/docker/minecraft/world/playerdata/${PUUID}.dat
+chown 100000:100000 /opt/docker/minecraft/world/playerdata/${PUUID}.dat # userns-remap
+
+mcrcon -H 127.0.0.1 -P 25575 -p *redacted* "save-on"
+# Tell the player to log back in.
+```
+
+**Why kick + `save-off`:** if the player is online, the server holds their NBT in memory and rewrites the `.dat` on the next save tick — clobbering the restore. `save-off` halts auto-save; kicking guarantees the in-memory state for that player won't be flushed.
+
+**Userns-remap reminder:** the host sees container-uid `100000` for files written by the MC process. Restored files written by `user` (uid 1000) will appear empty/permission-denied to the container. Always `chown 100000:100000` (or `chmod 666`) after restore. Memory: `project_nullstone_docker_userns`.
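+
+The apply steps above are the part most easily fumbled mid-incident. A minimal helper sketch, assuming the same paths and userns uid as the TL;DR block (`verify_and_apply_dat` is a hypothetical name, not a deployed script):
+
+```bash
+# Sketch only — same layout/uid assumptions as the TL;DR block above.
+verify_and_apply_dat() {
+  local src=$1 dst=$2
+  # .dat files are gzip-compressed NBT; refuse anything that fails a gzip check.
+  gzip -t "$src" || { echo "refusing: $src failed gzip test" >&2; return 1; }
+  cp "$dst" "${dst}.preFix-$(date +%s)"   # keep the clobbered state for forensics
+  cp "$src" "$dst"
+  chown 100000:100000 "$dst"              # userns-remap: container-side uid
+}
+```
+
+Called between `save-all flush` and `save-on` with the restored file as `$src` and the live file as `$dst`, it replaces the two `cp` lines and the `chown` above.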
+
+---
+
+## Scenario 1 — Player lost inventory (T1, the void-death case)
+
+This is what the strategy was written for. RTO target: **< 2 minutes**.
+
+1. Find the UUID:
+   ```bash
+   grep -i 'NICK' /opt/docker/minecraft/usercache.json
+   ```
+2. Pick a snapshot just **before** the loss. `restic snapshots --tag playerdata` shows timestamps.
+3. Run the TL;DR block above with that snapshot id (or `latest` if the loss happened in the last 5 min).
+4. Inform the player: "Your inventory from HH:MM has been restored. Anything you picked up after that point is gone."
+5. Log the incident: append to `docs/INCIDENTS.md` (create if absent) — date, player, snapshot id, cause.
+
+---
+
+## Scenario 2 — Whole world rolled back (T2/T3, griefing or corruption)
+
+RTO target: **15 minutes**. Server downtime expected.
+
+1. Announce, kick, stop:
+   ```bash
+   mcrcon ... "say Server going down for restore — back in ~15 min"
+   mcrcon ... "kick @a Restore in progress"
+   cd /opt/docker/minecraft && docker compose down
+   ```
+2. Move live data aside (do not delete):
+   ```bash
+   mv /opt/docker/minecraft /opt/docker/minecraft.broken-$(date +%F)
+   mkdir -p /opt/docker/minecraft
+   ```
+3. Restore from the world repo:
+   ```bash
+   RESTIC_PASSWORD_FILE=/etc/mc-backup.pw \
+   restic -r /home/user/restic/mc-world \
+     restore latest --target /tmp/world-restore
+   rsync -aHAX /tmp/world-restore/opt/docker/minecraft/ /opt/docker/minecraft/
+   ```
+4. **Re-apply userns-remap perms** (critical — see memory):
+   ```bash
+   chmod -R 777 /opt/docker/minecraft   # quickfix; or chown -R 100000:100000
+   ```
+5. Boot:
+   ```bash
+   cd /opt/docker/minecraft && docker compose up -d
+   docker logs -f minecraft-mc   # watch for the "Done" line
+   ```
+6. Verify by parsing a known-good UUID's `.dat`, then announce the server is up.
+7. Keep `minecraft.broken-YYYY-MM-DD/` for at least 7 days for forensic comparison.
+
+---
+
+## Scenario 3 — Host disk dead (T5)
+
+RTO target: **a few hours, depending on the hardware swap**.
+
+1. New host: install Debian 13 + Docker per `_github/infra/runbooks/MIGRATION-nullstone-to-cobblestone.md`.
+2. `apt install restic`. Pull the password from the operator's password manager into `/etc/mc-backup.pw`.
+3. Initialise the destination dir, then restore from the **onyx mirror** (not local — local is gone):
+   ```bash
+   restic -r sftp:mc-backup@100.64.0.1:/backups/nullstone-mc-restic \
+     restore latest --target /tmp/world-restore
+   ```
+4. Rsync into place as in Scenario 2 step 3, then continue Scenario 2 from step 4.
+5. Stand up the timers on the new host. **Do not** point them at the same off-host repo until the new host has been re-keyed (rotate restic passwords as part of disaster recovery).
+
+---
+
+## Drill log (monthly)
+
+| Date | Operator | Snapshot age | Class A restore time | Off-host restore time | Result |
+|------|----------|--------------|----------------------|------------------------|--------|
+| (first drill — 2026-06-06) | s8n | TBD | TBD | TBD | TBD |
+
+Procedure: see `BACKUP-STRATEGY.md` §7.
+
+---
+
+## What if no snapshot exists yet? (CURRENT REALITY 2026-05-07)
+
+Until phases 1–4 of `BACKUP-STRATEGY.md` are deployed, the only recovery resources are:
+
+| Source | What's there | Recoverable? |
+|---|---|---|
+| `/opt/backups/202604xx_020001/mc-world-backup-*.tar.gz` | World tars from Apr 29 + May 2 (others FAILED) | **GONE** — pruned by 7-day retention |
+| `/opt/backups/mc-plugins-prerebrand-2026-04-30.tar.gz` | Plugin jars only, no world | Not useful for player data |
+| Live `/opt/docker/minecraft/world/playerdata/<uuid>.dat_old` | MC's own `.dat_old` shadow file from the previous save | **YES** — last save tick before the current one. **First-line defence right now.** |
+| CoreProtect DB (`plugins/CoreProtect/database.db`) | Block + container actions, NOT inventory state | Partial — can roll back grief, can't restore lost items |
+
+**Today's playbook for inventory-loss reports:**
+
+1. Server console → `co lookup u:NICK` to confirm the loss event in CoreProtect.
+2. **Stop the server immediately** if the report comes in within the same play session — every save tick overwrites `.dat_old`. `docker compose down` buys time.
+3. Inspect `world/playerdata/<uuid>.dat_old` — if it predates the loss, copy it over `.dat`, fix perms (uid 100000), restart. (A scripted sketch of this rescue is in the appendix below.)
+4. If `.dat_old` is too new (already overwritten): **the loss is unrecoverable until BACKUP-STRATEGY phases 1–4 are deployed.** Apologise to the player. Spawn-in compensation per operator discretion (ops creative-mode replacement is the customary remedy).
+5. Log the incident — it adds urgency to deploying the new strategy.
+
+---
+
+## TODO — open items (links into BACKUP-STRATEGY.md §11)
+
+- [ ] Phase 1: fix `/opt/docker/backup.sh` orphan-line bug (F-backup-1).
+- [ ] Phase 2: deploy `mc-backup-frequent.timer` (Class A, 5-min playerdata).
+- [ ] Phase 3: deploy `mc-backup-world.timer` (Class B/C/D, hourly).
+- [ ] Phase 4: provision the `mc-backup` user on onyx + the `restic copy` job.
+- [ ] Phase 5: schedule a monthly drill calendar entry, run the first drill.
+- [ ] Phase 6: ntfy / Matrix alert wiring (depends on ntfy deployment).
+- [ ] Phase 7: friend's RTX 4080 PC as a secondary off-host.
+- [ ] Verify `usercache.json` on this host: confirm the UUID lookup workflow above resolves to the right `.dat`.
+- [ ] Decide: `mcrcon` package vs a lightweight Python `mcrcon` lib.
+- [ ] Document the compensation policy for unrecoverable losses (operator discretion right now).
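+
+---
+
+## Appendix — `.dat_old` rescue, scripted (sketch)
+
+A sketch of playbook steps 2-3 in script form, under the same assumptions as the rest of this runbook (compose dir `/opt/docker/minecraft`, userns uid `100000`). Untested; treat it as a starting point, not a deployed tool:
+
+```bash
+#!/usr/bin/env bash
+# Promote MC's own <uuid>.dat_old shadow copy after an inventory-loss report.
+set -euo pipefail
+UUID=${1:?usage: dat-old-rescue <uuid>}   # UUID from usercache.json
+PD=/opt/docker/minecraft/world/playerdata
+
+cd /opt/docker/minecraft
+docker compose down                        # stop save ticks before .dat_old is overwritten
+
+ls -l --time-style=full-iso "$PD/$UUID".dat*   # operator eyeballs: is .dat_old old enough?
+gzip -t "$PD/$UUID.dat_old"                    # sanity: .dat files are gzip-compressed NBT
+
+cp "$PD/$UUID.dat" "$PD/$UUID.dat.preFix-$(date +%s)"   # keep the broken state
+cp "$PD/$UUID.dat_old" "$PD/$UUID.dat"
+chown 100000:100000 "$PD/$UUID.dat"            # userns-remap uid
+
+docker compose up -d
+```
+
+The mtime check stays manual on purpose: a timestamp only proves `.dat_old` is older than the report, not that it predates the loss itself.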