docs: 2026-05-07 incident audit + backup strategy
Player YOU500 lost his full inventory to an AuthLimbo void-death at 17:13:39. Investigation revealed that the deployed /opt/docker/backup.sh is an 88-line stub missing the Minecraft block; the last successful world backup was 2026-05-02 (already pruned). No recoverable .dat exists.

Files:

- AUDIT-2026-05-07.md — server-side findings F-01..F-06 (P0: backups, no-keepInventory, AuthLimbo silent failure, chunk preload race, Xmx > container headroom, container hardening gaps)
- BACKUP-HUNT-2026-05-07.md — exhaustive backup scan; only a 6-week-old archive at _archive/minecraft-old-2026-04-27.tar.gz
- BACKUP-STRATEGY.md — restic-based plan; 5min/hourly/daily classes, off-host to onyx via Tailscale, monthly drill
- CROSS-REFERENCE-2026-05-07.md — repo+doc landing map; flags the pre-existing infra/STATE.md backup-broken note + the HA-CLUSTER restic draft to extend rather than duplicate
- docs/RUNBOOK-BACKUP-RESTORE.md — operator runbook for .dat restore, full-world restore, host-loss restore, drill log
This commit is contained in:
parent
909eb7bbd6
commit
a1cc3940cf
5 changed files with 1215 additions and 0 deletions
184
AUDIT-2026-05-07.md
Normal file
@@ -0,0 +1,184 @@
# Minecraft Server Audit — racked.ru

**Container:** `minecraft-mc` on nullstone (192.168.0.100)

**Date:** 2026-05-07

**Audit type:** Operational / data-integrity (NOT a network-security audit)

**Auditor:** Claude (Opus 4.7) via SSH read-only inspection

**Catalyst:** Player **YOU500** void-died at login (~17:13:39 BST), inventory lost. No usable backup existed.

---
## Executive Summary

**Status:** Critical issues found.

**Risk score model:** Likelihood (1-5) x Impact (1-5) = 1-25. >= 15 = High, >= 20 = Critical.

A live AuthLimbo `teleportAsync returned false` warning fired during YOU500's first login of the day, immediately after `YOU500 left the confines of this world` (void death in the `auth_limbo` world). The player retried twice. On retry #3 they were teleported to (-264.6, 86, -49.8) and 23 seconds later `was blown up by Creeper`. The console operator (s8n) attempted recovery via RCON, but neither the void death nor the creeper death had item-restore data. Investigation surfaced five critical gaps:

1. **No working backups.** `/opt/docker/backup.sh` as deployed on nullstone is a stale 88-line copy missing the entire Minecraft block. The repo version (`scripts/backup.sh`) has the block but **was never deployed**. The daily 02:00 cron has been running for at least 7 days, producing 8-12 KB archives that contain no world, playerdata, or plugins. `BACKUP.md` claims the script handles MC; it does not.

2. **CoreProtect tracks inventory transactions but not death drops.** `co inspect` will not surface "dropped on death" entries the way it does pickups/drops, and even if it did, the 1.5 GB SQLite blob is approaching the point where `/co rollback` over an inventory radius is operationally slow.

3. **No `keepInventory` rule, no death-drop rescue plugin.** With `difficulty=hard`, `gamemode=survival`, and no Essentials `keepinv` permission flow visible, every death is a total loss.

4. **AuthLimbo has no death listener and no failure remediation.** When `teleportAsync` returns false, the player is dropped at limbo spawn and the warning is logged at WARN level only — no alert, no rollback, no temporary stash of the inventory.

5. **JVM heap sized larger than the container limit.** `JVM_OPTS=-Xmx16384M` inside an `18G` container limit with `MEMORY_SIZE=16G`; if the Aikar-tuned G1 heap actually grows to Xmx, off-heap usage (Netty, mmaps, zip cache) of >2 GB pushes past the limit and the **kernel OOM-kills the container**. Restart-on-OOM has no warning hook to Discord/Matrix.

**Three biggest exposures**

1. Backups silently broken for 7+ days. (Critical — 5x4=20)
2. No item-loss safety net for any cause of death. (Critical — 4x5=20)
3. AuthLimbo failure path has no recovery. (High — 4x4=16)

---
## Findings Table

Severity = Likelihood x Impact. P0 = act this week, P1 = this month, P2 = this quarter.

| ID | Severity | Finding | Recommendation | Effort |
|----|----------|---------|----------------|--------|
| F-01 | **P0 / 20** | `/opt/docker/backup.sh` on nullstone is missing the entire MC backup block. Repo `scripts/backup.sh` has it but was never deployed. Daily backups since 2026-04-30 are 8-12 KB (effectively empty). | Sync the deployed script with the repo, run a manual backup, verify the world tarball is >= 5 GB. Add a sentinel check to backup.sh that fails the run if `mc-world-backup-*.tar.gz` is < 1 GB. | 30 min |
| F-02 | **P0 / 20** | No `keepInventory` rule and no `essentials.keepinv` permission. Every death is a total loss. | Decide policy: (a) `gamerule keepInventory true` server-wide, (b) keep-inv only when the death cause is "void"/"plugin teleport", or (c) auto-restore on AuthLimbo failure. The narrow option (b) preserves survival pain while plugging the AuthLimbo data-loss vector. Plugin candidates: `KeepInventoryOnVoid`, `DeathChestPro`, custom listener in AuthLimbo. | 1-2h research, 1d implement |
| F-03 | **P0 / 18** | AuthLimbo logs WARN on teleport failure but has no alerting or recovery. The player is left at limbo spawn (y128 platform), where they re-disconnect and on retry get teleported normally — but the warning never reaches an operator. | (a) Bump `teleportAsync returned false` to ERROR. (b) Add a Discord/Matrix webhook alert via the existing webhook stack. (c) On failure: snapshot player inventory, kick with a friendly message, write a recovery file `auth_limbo/incident-<uuid>-<ts>.dat` for ops replay. | 1d |
| F-04 | **P0 / 18** | YOU500's first failed teleport target was (2380.4, 69.9, -11358.4) — 11k blocks out, and the chunk was likely not loaded yet. AuthLimbo's `preload-chunks: true` setting fires on `AuthMeAsyncPreLoginEvent`, which may not run before `LoginEvent` in HaHaWTH's AuthMe fork. The exact timing race is unverified. | Add a chunk-loaded assertion in AuthLimbo before calling `teleportAsync`; if not loaded, force-load synchronously OR delay the teleport another 10-20 ticks. Add debug logging of chunk-load state to the WARN line. | 0.5d |
| F-05 | **P0 / 16** | JVM `-Xmx16384M` inside container `mem_limit=18G` leaves no headroom for off-heap (Netty buffers, native mmaps, mod metadata). Aikar flags + 25 plugins easily push native usage to 2-3 GB. A kernel OOM kill is silent. | Either (a) lower `-Xmx` to 12-14 GB (or switch to a `MaxRAMPercentage`-style flag), OR (b) raise `mem_limit` to 24 GB. Also add `oom_score_adj` and a `docker events --filter event=oom` watcher that pings Discord. | 1h config + 2h alerting |
| F-06 | **P0 / 16** | No `pids_limit`, no `cap_drop: ALL`, no `read_only: true`. The container runs with the default Docker capability set (CAP_NET_RAW, CAP_SYS_CHROOT, etc.) it does not need. | Add `cap_drop: [ALL]`, `cap_add: [NET_BIND_SERVICE]` (only if binding <1024; 25565 is high, so likely none), `pids_limit: 4096`, `security_opt: [no-new-privileges:true]`. Test boot, watch for startup failures. | 1h test |
| F-07 | **P1 / 15** | CoreProtect SQLite at 1.5 GB. Performance and reliability degrade past 2-3 GB. `database.db` is the only copy; no WAL checkpoint or vacuum schedule. | (a) Migrate to a MySQL/MariaDB sidecar container. (b) Add a monthly cron `co purge t:30d` (purge entries older than 30 days; per CoreProtect docs). (c) Schedule `VACUUM` after the purge. | 1d for MySQL migration, 1h for purge cron |
| F-08 | **P1 / 12** | AuthMe is still on `passwordHash: SHA256` (legacy). The SHA256 -> BCRYPT migration plan is on the TODO list and still pending. | Set `legacyHashes: [SHA256]` and `passwordHash: BCRYPT`. AuthMe re-hashes on the next successful login. Communicate "your password works as before, no action needed". | 30 min config + monitoring |
| F-09 | **P1 / 12** | `online-mode=false`. The server depends entirely on AuthMe + EpicGuard for identity. EpicGuard config was not audited in this pass. | Verify that `enableProtection: false` in AuthMe (currently false) is intentional, since geofencing is `US, GB, LOCALHOST` only — any user from another country is locked out if protection is re-enabled. Document the choice in `RULES.md`. | 1h doc only |
| F-10 | **P1 / 12** | `auto-save-interval: 2400` (= 2 minutes at 20 TPS) is fine, BUT `paper-global.yml` has `player-auto-save: rate: -1` (= use auto-save-interval, so also 2 min). A player who joins, dies, and disconnects within 2 min may have NO post-death snapshot persisted before their player.dat is overwritten on next login. Player save *does* fire on quit, but if the death happens and the player keeps moving / interacting before logout, items in chunks not yet saved are at risk for tar-while-running backups. | Set `player-auto-save: rate: 1200` (= 1 min). Switch the backup strategy to `save-off` + `save-all flush` + tar + `save-on` to guarantee consistency, OR snapshot the host bind-mount with a filesystem-level snapshot (LVM / btrfs / ZFS). | 30 min config, 0.5d for snapshot path |
| F-11 | **P2 / 10** | `EZShop-1.0-SNAPSHOT.jar` is bundled alongside `AuctionHouse-1.4.6.jar`. A PLUGIN_ALTERNATIVES.md TODO calls for dropping EZShop. | Remove EZShop, migrate any active shops to AuctionHouse, document the migration in `docs/migrations/`. | 0.5d player communication, 1h technical |
| F-12 | **P2 / 10** | Spigot `entity-tracking-range`: monsters 96, misc 96. The roadmap suggests tightening to monsters=32, misc=16 for TPS / network savings. | Tune in the next maintenance window, re-baseline TPS with a `spark` profile. | 1h config, 1d to verify under load |
| F-13 | **P2 / 9** | 21 plugin folders without a matching jar (orphans): `bStats`, `CarbonChat`, `ComfyWhitelist`, `EpicGuard`, `Essentials`, `faststats`, `GrimAC`, `Homestead`, `Lands`, `LPC`, `MarriageMaster`, `MiniMOTD`, `Multiverse-Core`, `PhantomSMP`, `TAB`, `UltimateTimber`, `UnexpectedSpawn`, `Vault`, `WorldEdit`, plus `.bak-*` directories. Most have a renamed jar (`carbonchat-paper-...jar`, `EssentialsX-...jar`), so this is mostly cosmetic. `Lands`, `LPC`, `MarriageMaster`, `PhantomSMP`, `UltimateTimber`, `UnexpectedSpawn` are truly orphaned (jars not present). | Audit each: delete the data dirs of plugins truly removed; the bStats/Essentials/Vault names are normal. Document the plugin-name <-> jar-name pattern in `PLUGINS.md`. | 1h |
| F-14 | **P2 / 9** | No TPS Discord webhook alert (mentioned on the TODO list). spark is installed but auto-profile + alerting are not wired up. | spark already supports `spark profile --thresholds`; route to Discord via the existing webhook stack. | 0.5d |
| F-15 | **P2 / 8** | RCON output for async commands (CoreProtect, LuckPerms) does not return to the issuing rcon-cli session. Found while trying `co inspect` from RCON. Async command results land in the console only. | Document this in `docs/OPERATIONS.md` (does not exist yet — create it). For automation, attach to `docker logs -f minecraft-mc` in parallel. | 30 min doc |
| F-16 | **P2 / 8** | `gamerule keepInventory` could not be queried via `rcon-cli` due to an `execute in <world> run` argument-parsing bug in itzg's rcon-cli wrapper (or RCON quoting). State unknown without an in-game console. | Verify in-game as an op user, document the rcon-cli limitation. | 5 min in-game |
| F-17 | **P2 / 6** | `RCON_PASSWORD` is committed to `docker-compose.yml` in plaintext (`*redacted*`). The RCON port (25575) is bound to `127.0.0.1`, so the blast radius is local only — but the secret is still in git history. | Rotate the password, move it to `.env` (gitignored), confirm the `127.0.0.1`-only binding stays. | 30 min |
| F-18 | **P2 / 6** | `restart: unless-stopped` with no `start_period` re-evaluation on rapid OOM-restart loops. If the container OOMs every 60 s, Docker keeps restarting it indefinitely. | Add `restart_policy: { condition: on-failure, max_attempts: 5, window: 300s }` (compose v3+ deploy block) and a watchdog alert. | 30 min |

---
## Detailed Methodology

### Inputs inspected (read-only, no writes)

| Source | Path | Method |
|--------|------|--------|
| Container env | `docker inspect minecraft-mc` | host shell |
| docker-compose | `/opt/docker/minecraft/docker-compose.yml` | host cat |
| AuthLimbo config | `/data/plugins/AuthLimbo/config.yml` | `docker exec cat` |
| AuthLimbo logs | `/data/plugins/AuthLimbo/` (no log files exist; only `config.yml`) | `docker exec ls` |
| AuthMe config | `/data/plugins/AuthMe/config.yml` | `docker exec cat` |
| AuthMe DB record for YOU500 | `/data/plugins/AuthMe/authme.db` | `docker exec python3 sqlite3` |
| CoreProtect config | `/data/plugins/CoreProtect/config.yml` | `docker exec cat` |
| CoreProtect DB size | `/data/plugins/CoreProtect/database.db` | `docker exec du -sh` |
| Server log | `/data/logs/latest.log` | `docker exec grep` |
| Paper / Spigot / Purpur configs | `/data/config/paper-*.yml`, `/data/spigot.yml`, `/data/purpur.yml` | `docker exec cat` |
| World sizes | `/data/world*/` | `docker exec du -sh` |
| Backup script (deployed) | `/opt/docker/backup.sh` | host cat |
| Backup script (repo) | `/home/admin/ai-lab/_github/minecraft-server/scripts/backup.sh` | local cat |
| Backup output | `/opt/backups/` | host stat |
| Backup log | `/opt/backups/backup.log` | host tail |
| Live state | RCON `tps`, `list` | `docker exec rcon-cli` |
### YOU500 incident timeline (reconstructed from `latest.log`)

| Time (BST 2026-05-07) | Event |
|-----------------------|-------|
| 17:13:34 | Login from 45.157.234.219, UUID c7c2df8e-...-686b |
| 17:13:35 | Spawned in `auth_limbo` (0.5, 128, 0.5) per the AuthLimbo platform default |
| 17:13:38 | AuthMe: "YOU500 logged in" |
| 17:13:39 | AuthLimbo: "Restoring YOU500 to world(2380.4, 69.9, -11358.4)" |
| 17:13:39 | **`YOU500 left the confines of this world`** — void death |
| 17:13:39 | **`[AuthLimbo] teleportAsync returned false for YOU500 — Paper may have rejected the location.`** |
| 17:15:33 | Disconnect |
| 17:15:39 | Re-login from 82.22.5.229. The stored auth-loc has now been UPDATED to (-264.6, 86, -49.8) — different from the first attempt. Either the user `/sethome`'d previously or AuthMe overwrote it on the void death. |
| 17:15:44 | AuthLimbo: "Restoring YOU500 to world(-264.6, 86.0, -49.8)" — no WARN this time |
| 17:15:53 | Disconnect |
| 17:16:00 | Re-login from 82.22.5.230 |
| 17:16:05 | AuthLimbo: "Restoring YOU500 to world(-264.6, 86.0, -49.8)" |
| 17:16:28 | **`YOU500 was blown up by Creeper`** |
| 17:16:57 | Operator (s8n) via RCON: `tpa YOU500 -264 86 -50` + `tell YOU500 grab items fast 5min despawn` |
| 17:17:02 | RCON teleport executed |
| 17:18:22 | s8n in-game: `/tp2p YOU500 s8n` |

The void death at 17:13:39 is the data-loss event. AuthMe had `SaveQuitLocation: true`, so (2380, 70, -11358) was a real prior position, but the chunk was almost certainly not loaded yet (11k blocks out, no recent player there). `teleportAsync` returned false either because:

- the chunk failed to load within Paper's async generation budget, or
- the entity was already dead (the void death raced ahead of the teleport).
### What CoreProtect WOULD have caught (and didn't)

CoreProtect inventory tracking is enabled (`item-transactions: true`, `item-drops: true`, `item-pickups: true`, `rollback-items: true`). However:

- A death drops items into the world for ~5 min before they despawn. Drops are item entities, not container transactions; CoreProtect logs them as drops only if a player was the immediate cause of the drop.
- A death-drop in the `auth_limbo` world (where the void death happened) falls into y<0 air, which is itself a non-event for CP.
- Thus there was no item-rollback path even if `co inspect` had been run within minutes.

**Implication:** CoreProtect is the wrong tool for death-drop recovery. A real death-drop plugin or `keepInventory` is the only fix.
### Backup script forensics

- Deployed: 88 lines; the last block is "Prune old backups". No Minecraft block. No `umask 077`.
- Repo: 131 lines (with malformed lines 119-122 left over from a bad merge — ALSO a bug to fix on the next push). Has the Minecraft block. Has `umask 077`.
- `/opt/backups/backup.log` shows the last 5 days of "Backup complete" entries averaging 8-12 KB. None contain MC data. None mention MC. The log line `Configs: partial (some files missing)` is the configs section misfiring on Matrix paths; it was never the MC block.
- Last verified-good MC archive on host: `/opt/backups/mc-plugins-prerebrand-2026-04-30.tar.gz` (one-shot pre-rebrand snapshot; contents not verified in this audit).

---
## Action Items (Prioritised)

### P0 — this week (by 2026-05-14)
1. **F-01 / Backups.** Sync deployed backup.sh with repo. Fix the lines 119-122 corruption in repo first. Add post-run sentinel: `[ "$(stat -c%s mc-world-backup-*.tar.gz)" -gt 1073741824 ] || log "WORLD BACKUP TOO SMALL — ABORT"`. Run manual backup, verify >= 5 GB on disk. Test a restore into a scratch dir.
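A minimal sketch of that sentinel, written as a function so it can gate the run. Note that `stat -c%s` on a glob emits one size per match, so the newest archive should be picked explicitly first; the 1 GiB floor and `/opt/backups` paths are the finding's values, to be raised once a real baseline exists.

```shell
#!/usr/bin/env bash
# Sentinel for F-01: fail the backup run if the newest world archive is implausibly small.
min_bytes=$((1024 * 1024 * 1024))  # 1 GiB floor; real worlds should be >= 5 GB

check_world_archive() {
    local archive="$1" size
    size=$(stat -c%s "$archive" 2>/dev/null) || { echo "WORLD BACKUP MISSING: $archive"; return 1; }
    if [ "$size" -lt "$min_bytes" ]; then
        echo "WORLD BACKUP TOO SMALL ($size bytes): $archive"
        return 1
    fi
    echo "world backup ok ($size bytes)"
}

# Usage inside backup.sh: gate the run on the newest archive.
# newest=$(ls -t /opt/backups/mc-world-backup-*.tar.gz 2>/dev/null | head -1)
# check_world_archive "$newest" || exit 1
```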
2. **F-02 / Item-loss safety net.** Decide policy. Recommend: enable `keepInventory true` in `auth_limbo` world only (cheap, narrow), and write a 50-line AuthLimbo extension `OnPlayerDeath` listener that detects "death in auth_limbo" -> restore inventory snapshot taken at AuthMeAsyncPreLogin. Survival pain preserved everywhere else.
3. **F-03 / AuthLimbo recovery.** Bump WARN to ERROR. Wire to existing Discord webhook (per workspace memory: webhook stack on nullstone). On failure, write player snapshot to `auth_limbo/incidents/<uuid>-<ts>.dat`.
4. **F-04 / Chunk preload race.** Add chunk-loaded check + sync force-load before `teleportAsync`. If still false, kick with friendly message instead of letting the player drop into limbo.
5. **F-05 / OOM headroom.** Lower `-Xmx` to 14 GB and add `docker events` watcher.
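A sketch of the watcher half of this item. The webhook URL is a placeholder, and the only assumed API shape is that Discord webhooks accept a JSON body with a `content` field; the `docker events` filter and format fields are standard Docker CLI.

```shell
#!/usr/bin/env bash
# Sketch for F-05: alert on kernel OOM kills seen by Docker.
# WEBHOOK_URL is a placeholder -- point it at the existing Discord/Matrix hook.
WEBHOOK_URL="${WEBHOOK_URL:-https://discord.com/api/webhooks/CHANGE_ME}"

oom_payload() {
    # Discord webhooks accept a JSON body with a "content" field.
    printf '{"content":"OOM kill: container %s at %s"}' "$1" "$2"
}

watch_oom() {
    # Blocks forever; run under systemd or a tmux session on nullstone.
    docker events --filter event=oom \
        --format '{{.Actor.Attributes.name}} {{.Time}}' |
    while read -r name ts; do
        curl -fsS -H 'Content-Type: application/json' \
            -d "$(oom_payload "$name" "$ts")" "$WEBHOOK_URL"
    done
}
```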
6. **F-06 / Container hardening.** Add `cap_drop`, `pids_limit`, `no-new-privileges`. Boot test in a window.
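The compose-level change amounts to a fragment like the following. The service key name is an assumption (only the container name `minecraft-mc` is known from this audit); the values are F-06's recommendations and must be boot-tested before keeping.

```yaml
services:
  minecraft:            # service name assumed; container_name is minecraft-mc
    # F-06 hardening — test boot after applying
    cap_drop: [ALL]     # no cap_add needed: 25565 is an unprivileged port
    pids_limit: 4096
    security_opt:
      - no-new-privileges:true
```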
### P1 — this month
7. **F-07** CoreProtect prune cron, plan MySQL migration.
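The purge half of item 7 can be cron-driven through the container's rcon-cli. A sketch, with the exec prefix injectable so the command assembly is testable without Docker; the `co purge t:30d` syntax follows CoreProtect's docs and the container name follows this audit.

```shell
#!/usr/bin/env bash
# Sketch for F-07: monthly CoreProtect purge via RCON.
# MC_EXEC is injectable so the command assembly can be tested without Docker.
MC_EXEC="${MC_EXEC:-docker exec minecraft-mc rcon-cli}"

mc_cmd() { $MC_EXEC "$@"; }

coreprotect_purge() {
    # Drop entries older than the given window (default 30 days).
    mc_cmd "co purge t:${1:-30d}"
}

# Example cron entry (first of the month, 04:00; path hypothetical):
# 0 4 1 * * /opt/docker/coreprotect-purge.sh >> /opt/backups/coreprotect-purge.log 2>&1
```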
8. **F-08** SHA256 -> BCRYPT migration with legacyHashes fallback.
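Item 8 is a two-key config change. A sketch of the relevant `plugins/AuthMe/config.yml` fragment follows; the nesting under `settings.security` reflects my understanding of AuthMe's settings tree and should be verified against the deployed file before editing.

```yaml
settings:
  security:
    # New hashes are written as BCRYPT...
    passwordHash: BCRYPT
    # ...while existing SHA256 hashes keep verifying until re-hashed on login.
    legacyHashes:
      - SHA256
```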
9. **F-09** Document `online-mode=false` rationale in RULES.md.
10. **F-10** Consider LVM/ZFS snapshot for backup atomicity.
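F-10's `save-off` / `save-all flush` / tar / `save-on` sequence can be sketched as below. The rcon-cli invocation and the `world` directory under the bind mount are assumptions based on this audit's paths; nether/end dirs would be added the same way.

```shell
#!/usr/bin/env bash
# Sketch for F-10: a crash-consistent world tarball.
# MC_EXEC is injectable for testing; the real value runs rcon-cli in the container.
MC_EXEC="${MC_EXEC:-docker exec minecraft-mc rcon-cli}"
MC_DATA="${MC_DATA:-/home/docker/minecraft}"

mc_cmd() { $MC_EXEC "$@"; }

backup_name() { printf 'mc-world-backup-%s.tar.gz' "$(date -u +%Y%m%d_%H%M%S)"; }

consistent_world_backup() {
    local out_dir="$1"
    mc_cmd save-off            # stop chunk writes
    mc_cmd "save-all flush"    # force everything dirty to disk
    # save-on must run even if tar fails, hence the || true.
    tar -czf "$out_dir/$(backup_name)" -C "$MC_DATA" world || true
    mc_cmd save-on
}
```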
### P2 — this quarter

11. **F-11** Drop EZShop after a player communication window.
12. **F-12** Tighten entity tracking range, re-profile with spark.
13. **F-13** Clean orphan plugin folders.
14. **F-14** Wire spark TPS alerts to Discord.
15. **F-15** Document RCON async-command behaviour.
16. **F-17** Rotate RCON password, move to .env.
17. **F-18** Add restart-policy max_attempts.
---
## Open Questions for the Operator

1. **Inventory restoration policy.** Is silent `keepInventory` only in `auth_limbo` acceptable, or do you want a manual ops-restore-from-snapshot approval gate?
2. **YOU500 specifically.** Is there an out-of-band record of what they were carrying (Discord screenshot, witness)? If yes, manual NBT injection into player.dat is feasible. CoreProtect cannot help.
3. **Chunk preload trade-off.** Force-loading distant chunks at login adds 200-2000 ms to login time. Acceptable vs the void-death risk?
4. **MySQL for CoreProtect.** Adds an operational dependency (another container, another backup target). Worth the complexity, or is a monthly purge that keeps SQLite under 1 GB sufficient?
5. **RCON password rotation.** The committed value should be rotated on principle. Schedule a maintenance window?
6. **online-mode=false.** Confirm the long-term stance. Mojang ToS implications for racked.ru?
7. **Backups offsite.** Currently `/opt/backups/` is on the same host. Plan for an offsite copy (B2, restic to a friend's PC, anything)?
## What was NOT in scope this audit

- Network firewall, fail2ban, host-side security (nullstone-server has its own audit folder).
- Plugin source-supply-chain audit (covered by `docs/ROADMAP.md` "plugin acquisition overhaul").
- Performance profiling under load (deferred per F-12).
- LuckPerms permission graph correctness.
- Rules / chat-format / prefix audit (workspace memory: do NOT touch LP prefixes).
- Per-region (Lands / Homestead) data integrity.

---
## Sign-off

| Field | Value |
|-------|-------|
| Audit date | 2026-05-07 |
| Method | Read-only SSH inspection, no fixes applied |
| Workspace rule applied | "Audit findings -> docs first, then fix" |
| Next action | Operator review + go/no-go on each P0 item |
| Next audit due | 2026-08-07 (quarterly), or sooner once backups are remediated |
118
BACKUP-HUNT-2026-05-07.md
Normal file
@@ -0,0 +1,118 @@
# YOU500 Inventory Recovery — Backup Hunt Report

**Date:** 2026-05-07

**Player:** YOU500 (UUID `c7c2df8e-8783-30b5-891c-86ec9343686b`)

**Incident:** Full inventory loss at 17:13:39 BST. AuthLimbo `teleportAsync returned false`; the teleport from auth_limbo into the world failed → `YOU500 left the confines of this world` (void death). Vanilla `/data/world/playerdata` was overwritten on respawn with an empty inventory; a vanilla void death leaves no drops in the world.

**Host:** nullstone (192.168.0.100), live MC data at `/home/docker/minecraft/` (== `/opt/docker/minecraft/`, same FS, inode 18877649 confirmed).

**SSH user:** `user` (no sudo). All `/opt/backups/2026*` dated subdirs are root-owned 0700 → unreadable. `/var/lib/docker/volumes/` unreadable.

---
## Summary

**Recoverable backup exists: YES — partial.** The pre-rebrand world archive `/home/user/ai-lab/_archive/minecraft-old-2026-04-27.tar.gz` contains YOU500's playerdata `.dat` from **2026-03-25 18:53** (9617 B vs the current 9192 B; the larger size suggests a populated inventory). It is the **only known full-inventory snapshot for this UUID** anywhere on the host.

**Caveat:** This is a 6-week-old snapshot. Items gained between 2026-03-25 and 2026-05-07 17:13 are NOT recoverable from any file backup. **CoreProtect** is installed and has been logging since 2026-05-01 → use `/co inventory YOU500` and `/co rollback` to retrieve anything stored in containers post-2026-05-01.

**No scheduled world backups exist.** `/opt/docker/backup.sh` stopped backing up the MC world after 2026-05-02 (the world-backup branch was removed when the script was last edited; only configs/Matrix/RC are dumped now). Last MC-related tarball that landed on disk: `/opt/backups/20260430_020001/minecraft-configs-20260430_020001.tar.gz` (12 KB → configs only, no playerdata).

---
## Inventory of Backup Artifacts (oldest → newest)

| When | Path | Size | Owner | Contains YOU500 .dat? | Notes |
|------|------|------|-------|----------------------|-------|
| 2026-03-25 18:53 (file mtime inside) | `/home/user/ai-lab/_archive/minecraft-old-2026-04-27.tar.gz` | large (not recorded) | user | **YES** — `minecraft/world/playerdata/c7c2df8e-…dat` 9617 B + `.dat_old` 9616 B (2026-03-25 18:49) | **Best candidate.** 133 player .dat files, full world tree, Essentials/LitePlaytimeRewards/LandClaim DBs, advancements, stats. |
| 2026-04-30 02:01 | `/opt/backups/20260430_020001/minecraft-configs-20260430_020001.tar.gz` | 12 KB | root (UNREADABLE) | NO — configs only | Cannot read without sudo; the size implies no world data anyway. |
| 2026-04-30 02:01 | `/opt/backups/20260430_020001/configs-20260430_020001.tar.gz` | 2.4 KB | root | NO | Traefik/Matrix/RC configs. |
| 2026-04-30 19:21 | `/opt/backups/mc-plugins-prerebrand-2026-04-30.tar.gz` | 224 MB | user | NO playerdata `.dat` files. Has `plugins/AuthMe/playerdata/` (empty), `plugins/AuthMe.bak-20260430-144204/playerdata/` (empty), `plugins/SkinsRestorer/cache/YOU500.mojangcache`. Vanilla world NOT included. | Plugin trees only — useful for the password DB (`plugins/AuthMe.bak-…/authme.db`), not inventory. |
| 2026-05-03 02:00 | `/opt/backups/20260503_020001/configs-20260503_020001.tar.gz` | 2.4 KB | root | NO | Configs. |
| 2026-05-04 02:00 → 2026-05-07 02:00 | `/opt/backups/20260504_020001` … `20260507_020001` | 0700 dirs | root (UNREADABLE) | Inferred NO from the log: backup.log shows only "configs OK" / "Matrix Postgres skipping" / "Volumes skipping" — the world was not touched after 2026-05-02. | All four dirs report 12 KB. |
| 2026-05-07 17:15 | `/home/docker/minecraft/world/playerdata/c7c2df8e-…dat_old` | 9181 B | uid 101000 | YES — but POST-DEATH (empty inventory). | Identical to the live state right after the first respawn. |
| 2026-05-07 17:21 | `/home/docker/minecraft/world/playerdata/c7c2df8e-…dat` | 9192 B | uid 101000 | YES — current live, empty inventory. | |
| 2026-05-07 17:15 | `/tmp/you500.dat` | 9181 B | user | YES — but byte-identical in size to `.dat_old`; gunzip strings show only the base attribute schema (no item/Slot tags) → already empty. | Someone (you) already extracted the empty post-death dat. Useless for recovery. |
### Misc archives checked, NOT relevant

- `/opt/source-endpoint/source.tar.gz` — Misskey AGPL source dump.
- `/opt/backups/misskey/*` — Misskey DB/files.
- `/home/user/ai-lab/.stversions/_projects/_minecraft/launcher/java/java21.tar~*.gz` — JDK.
- `/home/user/ai-lab/_projects/_minecraft/resources/racked.ru.-.minecraft.7z` — launcher resources.
- `/home/user/ai-lab/.stversions/**` — Syncthing versions hold only **server config files** (`server.properties`, `bukkit.yml`, `purpur.yml`, etc.) under `_github/online/minecraft-server/config/`. **No `.dat` or `playerdata/`** anywhere in `.stversions`. `.stignore` does not list `world/`, but the synced repo never contained the world dir to begin with (it's `_github/minecraft-server/` = configs + docker-compose only).

---
## CoreProtect — Live Rollback Source

| Path | Size | Born | Last modified |
|------|------|------|---------------|
| `/data/plugins/CoreProtect/database.db` (in container) | 1.59 GB | 2026-05-01 10:11:53 | 2026-05-07 17:27 |

CoreProtect has logged container interactions, item drops, deaths, and inventory changes since **2026-05-01**. For YOU500's items stored in chests/shulkers/ender chests within the world, an in-game rollback can recover them:

- Inspect deaths: `/co lookup user:YOU500 action:#kill time:1d`
- Inspect inventory transactions: `/co inventory YOU500` (CoreProtect-CE feature)
- Rollback drops/voids near death: `/co rollback time:1h user:YOU500 radius:#global action:-drop,#kill`

(Items YOU500 carried in person and lost to the void at 17:13:39 are unlikely to appear in CoreProtect — a vanilla void death deletes drops without a kill event in some versions, so CoreProtect's `#kill` may or may not have logged it. Worth a `/co lookup user:YOU500 time:30m` to confirm.)

---
## Best Recovery Candidate

**File:** `/home/user/ai-lab/_archive/minecraft-old-2026-04-27.tar.gz`

**Internal path:** `minecraft/world/playerdata/c7c2df8e-8783-30b5-891c-86ec9343686b.dat`

**Snapshot date:** 2026-03-25 18:53 (~6 weeks before the incident).

### Extraction command (DO NOT RUN — for review only)
```bash
# Extract just the YOU500 dat to a staging area, do NOT touch live data
mkdir -p /tmp/you500-recovery
tar -xzvf /home/user/ai-lab/_archive/minecraft-old-2026-04-27.tar.gz \
    -C /tmp/you500-recovery \
    minecraft/world/playerdata/c7c2df8e-8783-30b5-891c-86ec9343686b.dat \
    minecraft/world/playerdata/c7c2df8e-8783-30b5-891c-86ec9343686b.dat_old

# Confirm and inspect (NBT viewer or zcat | strings) before any restore
ls -la /tmp/you500-recovery/minecraft/world/playerdata/
zcat /tmp/you500-recovery/minecraft/world/playerdata/c7c2df8e-8783-30b5-891c-86ec9343686b.dat \
    | strings | grep -E 'Slot|count|minecraft:diamond|minecraft:netherite' | head -40
```
### Restore plan (operator decision — NOT executed)

1. Stop the server (or kick YOU500) so the file is not held open.
2. With sudo (uid 101000 owns the file): copy the extracted `.dat` over `/home/docker/minecraft/world/playerdata/c7c2df8e-8783-30b5-891c-86ec9343686b.dat`, preserving mode/owner.
3. Also overwrite `.dat_old`.
4. Optional: replace `Essentials/userdata/c7c2df8e-…yml` from the same archive if the YML matters.
5. Restart the server. The player rejoins with their March 25 inventory and position.

**Tradeoff:** YOU500 will lose all progress from 2026-03-25 → 2026-05-07. Communicate before applying. Combine with a CoreProtect rollback to minimise the loss.
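Steps 2-3 can be sketched like this (run as root on nullstone). The safety-copy naming is my addition; the paths in the commented usage follow the plan above.

```shell
#!/usr/bin/env bash
# Sketch of the restore: overwrite the live .dat, keeping a safety copy first.
restore_dat() {
    local src="$1" dest="$2"
    [ -f "$src" ] || { echo "source missing: $src" >&2; return 1; }
    # Keep the current (post-death) file around, even though it is known-empty.
    [ -f "$dest" ] && cp -p "$dest" "$dest.pre-restore.$(date +%s)"
    # -p preserves mode and, when run as root, the uid-101000 ownership.
    cp -p "$src" "$dest"
}

# pd=/home/docker/minecraft/world/playerdata
# uuid=c7c2df8e-8783-30b5-891c-86ec9343686b
# restore_dat /tmp/you500-recovery/minecraft/world/playerdata/$uuid.dat     "$pd/$uuid.dat"
# restore_dat /tmp/you500-recovery/minecraft/world/playerdata/$uuid.dat_old "$pd/$uuid.dat_old"
```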
---
## Gaps

- **No scheduled world backups since 2026-05-02.** `/opt/docker/backup.sh` no longer dumps `world/`. The 2026-04-30 daily contains a 12 KB "minecraft-configs" tarball (configs, not world). Action: re-add a world tarball to the daily script.
- **No off-host backup.** No restic / borg / duplicity / rsnapshot installed. No rclone. No second host pulling MC data. Syncthing does not sync the world dir.
- **No filesystem snapshots.** Root is ext4 on LVM (no LVM thin-pool snapshots in use); `/home` is ext4 (no btrfs/ZFS).
- **`/var/lib/docker/volumes/` is unreadable** without sudo. `docker volume ls | grep -iE 'mine|back|world'` returns empty, confirming named volumes are not used for MC (bind mount only).
- **`/opt/backups/2026*_020001` subdirs are unreadable** (mode 0700, root). Cannot diff their contents byte-for-byte; relied on `backup.log` text + indirect listing. They almost certainly contain only configs (12 KB dirs; log entries match).
- **`docker exec minecraft-mc env | grep -i backup` returned nothing** — no env-driven autosave/backup plugin is enabled (the `itzg/mc-backup` sidecar is absent; no AutomatedBackup / EasyBackup jar in `/data/plugins`).
- **AuthMe `playerdata/` dirs are empty** in both live and `.bak-20260430-144204` — AuthMe is configured without inventory protection (no logged-out inventory snapshots).
- **No InvSee / InventoryRollback plugin.** Only CoreProtect (logs, not snapshots).

---
## Permission-Limited Reads (no sudo via SSH)

| Path | What we couldn't see | Likely contents |
|------|----------------------|-----------------|
| `/opt/backups/20260504_020001/` … `20260507_020001/` | Directory listings (0700 root) | Daily configs tarballs, ~12 KB each — confirmed via `du` in backup.log |
| `/opt/backups/20260430_020001/minecraft-configs-20260430_020001.tar.gz` | tar listing (root-owned, 0600) | MC config bind-mount tarball, 12914 B |
| `/var/lib/docker/volumes/` | Directory listing | Named volumes — not used by MC (bind mount only) |
| `/var/backups/` (host) | Listing | Standard Debian dpkg/apt backups, not MC |
| `/root/` | Anything | — |

Re-run with `sudo` if any of these need confirmation, but their contents are unlikely to change the conclusion.

393 BACKUP-STRATEGY.md Normal file
@@ -0,0 +1,393 @@

# Minecraft Backup Strategy — racked.ru on nullstone

**Status:** PROPOSAL (2026-05-07) — not yet implemented.
**Author trigger:** Player lost full inventory to void death today; rollback impossible because the existing 02:00 daily backup had **silently failed for 5 of the last 7 days** and there is **zero off-host copy**.
**Owner:** `s8n` (operator).
**Target host:** `nullstone` (192.168.0.100, Debian 13 trixie).

---

## 0. Current state (audited 2026-05-07)

Existing system in `/opt/docker/backup.sh` + `cron.d/docker-backup` (02:00 daily, 7-day retention in `/opt/backups/`).

Findings from `/opt/backups/backup.log`:

| Date | MC world result | Backup dir total |
|------|-----------------|------------------|
| 2026-04-26 | FAILED | — |
| 2026-04-27 | FAILED | — |
| 2026-04-28 | FAILED | — |
| 2026-04-29 | OK (3.6 G) | — |
| 2026-04-30 | FAILED | — |
| 2026-05-01 | FAILED | — |
| 2026-05-02 | OK (3.6 G) | — |
| 2026-05-03 | (no MC log line) | 8 K |
| 2026-05-04 | (no MC log line) | 8 K |
| 2026-05-05 | (no MC log line) | 8 K |
| 2026-05-06 | (no MC log line) | 12 K |
| 2026-05-07 | (no MC log line) | 12 K |

After 2026-05-02 the entire MC block stopped emitting log lines. The script appears to be exiting before reaching it (the duplicated stray `chmod 600 ... synapse-signing-key` lines at L119–122 are orphaned from a botched edit and may now break `set -e`). Effective state: **two MC backups in the last 12 days**, both already pruned by 7-day retention. **No usable backup exists right now.**

Cross-references:
- `_github/infra/STATE.md` Top-5 weakness #2 ("backup.sh broken silently") and #5 ("No off-host backup").
- `_github/infra/runbooks/MIGRATION-nullstone-to-cobblestone.md` §5 already names this `F-backup-1` and proposes "Restic + autorestic to B2/Wasabi or to nullstone-as-spare". This strategy refines that to use on-hand resources rather than paid storage.

### Available resources (no purchasing required)

| Asset | Location | Free | Reachability | Role |
|---|---|---|---|---|
| nullstone `/home` | local NVMe (ext4 LVM) | 142 G of 399 G | local | Primary repo + restic cache |
| onyx `/home` | LUKS NVMe | 1.6 T of 1.9 T | Tailscale 100.64.0.1 (LAN ~5 ms) | **Off-host primary** |
| friend RTX 4080 PC | DESKTOP-LR0RILA | unknown (Windows, large) | Tailscale 100.64.0.3 (WAN, IP-stable via tailnet) | **Off-host secondary** (defer) |
| nullstone `/opt/backups` | same disk as `/opt/docker` | 142 G | local | *Not* a real backup target — same-disk SPOF |

**No purchased B2 / Wasabi / S3 in this proposal.** Tailscale + onyx covers off-host today. B2 stays in the future-options annex.

---

## 1. Threat model

| # | Threat | Concrete example | Frequency | Mitigation in this plan |
|---|---|---|---|---|
| T1 | Player accidental loss (void death, lava, fall) | YOU500, 2026-05-07 | weekly | 5-min playerdata snapshots (RPO ≤ 5 min) |
| T2 | Griefing / theft / chest emptied by ban-evader | possible | monthly | 5-min playerdata + 1-h world snapshots |
| T3 | World corruption (chunk error, region-file truncate) | rare | — | 6-h pre-flight validated full world snapshot |
| T4 | Plugin / config bad change (LuckPerms wipe, server.properties) | edits during ops | weekly | daily configs + DB dump + git history (`live-server/` repo) |
| T5 | Host disk failure (single NVMe) | low/year | — | nightly off-host copy to onyx (Tailscale) |
| T6 | Ransomware / host compromise | low | — | append-only Restic repo on onyx; nullstone holds **no** delete key |
| T7 | Operator `rm -rf` or wrong `docker compose down -v` | low | — | retention floor (4 weekly + 12 monthly) survives a recent rm |
| T8 | Backup script silently failing (current state) | OBSERVED | — | heartbeat alert + monthly restore drill (§7) |

T8 is the one that just bit us. The single most important addition is **alerting on missed runs**, not the storage tech.

---

## 2. RPO / RTO

| Class | Data | RPO | RTO | Backup mechanism |
|---|---|---|---|---|
| A | playerdata (`world/playerdata/*.dat`, `stats/`, `advancements/`) | **5 min** | < 2 min per player | rcon `save-all flush` → rsync to local snapshot, then restic-add |
| B | full world (region files, end + nether) | **1 h** during play, **6 h** otherwise | 15 min | restic of `world*/` |
| C | plugin configs + LuckPerms YAML | 24 h | 30 min | tar of `plugins/*/config*.yml` + LP file dump |
| D | LuckPerms / Homestead SQLite DBs (`*.db`, `homestead_data.db`) | 1 h | 5 min | sqlite `.backup` then restic-add |
| E | host-level configs (`docker-compose.yml`, `server.properties`, `purpur.yml`, `bukkit.yml`, `paper-*.yml`, `whitelist.json`, `ops.json`, `banned-*.json`, `config/`) | 24 h | 5 min | already in git repo `_github/minecraft-server/`; backup just covers drift |

**Justification for RPO=5 min on Class A:** the void-death case rebuilds in seconds — recovering one `<uuid>.dat` is a ~30 s operation if a 5-min-old snapshot exists. Snapshotting just the 1.3 MB `playerdata/` dir is cheap (single-digit MB/day after dedup).

---

## 3. Tool choice — Restic

Compared:

| Tool | Dedup | Encryption | Snapshots | Network destinations | Verdict |
|---|---|---|---|---|---|
| **restic** | content-addressed, very effective on MC region files | AES-256, repo-key | yes | sftp (Tailscale), local, B2, S3, Azure, rclone | **WINNER** |
| borgbackup | similar | yes | yes | ssh only, lock-on-write | Equally good; restic chosen because operator already plans `restic + autorestic` per `infra/STATE.md` line 112; sftp dest is simpler than borg's required server-side binary |
| rsnapshot | hardlinks, no dedup | none | rotated dirs | local + rsync | No encryption ⇒ off-host copy over Tailscale (already encrypted) is fine, but no dedup means 18 G × N snapshots is painful. Reject. |
| zfs send | block-level | (zfs native) | snapshots | yes | nullstone is **ext4/LVM**, no ZFS, no btrfs. Reject. |
| LVM snapshot | COW | none | yes | local only | Same-disk only, doesn't survive disk failure. Useful as a *staging* primitive only. |
| custom rsync + cp -al | hardlinks | none | yes | yes | Reinventing rsnapshot. Reject. |
| itzg `BACKUP_*` env | tar to volume | none | rotation | local | Already tried in spirit by current `backup.sh`; same-disk; not granular. Reject as primary. |

**Decision:** `restic` for Classes A, B, C, D. Continue using a thin tar wrapper for Class E (configs are already in the git repo, this is just safety).

Restic strengths for our case:
- Region files dedup *very* well (chunks unchanged across snapshots).
- A 5-min Class-A snapshot adds only the changed blocks (kilobytes) to the repo, not a fresh 1.3 MB copy each run.
- One repo on local disk + one mirror to onyx via `rclone serve restic` or direct `sftp:` — no agent needed on onyx beyond ssh.
- `restic check --read-data-subset=5%` is the canonical scrub.

Apt: `apt install restic` on trixie ships 0.16.x — sufficient.

---

## 4. Schedule

All times Europe/London (matches `TZ` in compose file).

| Job | Cadence | Source | Destination | Mechanism |
|---|---|---|---|---|
| **A — playerdata** | every **5 min** | `world/playerdata/`, `world/stats/`, `world/advancements/`, `world*/level.dat`, `*.db` (LP+homestead) | restic repo `/home/user/restic/mc-frequent/` | systemd timer `mc-backup-frequent.timer` |
| **B — full world** | every **1 h** during play (07:00–01:00), **6 h** otherwise | `world/`, `world_nether/`, `world_the_end/` | restic repo `/home/user/restic/mc-world/` | systemd timer `mc-backup-world.timer` |
| **C — configs + plugins** | **daily 02:00** | `/opt/docker/minecraft/*.yml`, `*.json`, `plugins/*/config*.yml`, `plugins/LuckPerms/`, `docker-compose.yml` | restic repo `mc-world` (path-tagged) | reuse same timer with second backup target |
| **D — DB dumps** | every **1 h** | `homestead_data.db`, `plugins/CoreProtect/database.db`, `plugins/LuckPerms/luckperms-h2-*` | restic repo `mc-world` | timer hooks `sqlite3 .backup` first |
| **E — off-host mirror** | **nightly 03:30** | nullstone `/home/user/restic/` | onyx `100.64.0.1:/home/admin/backups/nullstone-mc-restic/` | `restic copy` over sftp (Tailscale) — append-only key on onyx side |
| **F — verify** | **weekly Sun 04:00** | both repos | — | `restic check --read-data-subset=5%` then alert on rc |
| **G — drill** | **monthly 1st Sat 11:00** | random snapshot | scratch dir | §7 procedure |

### Why this works for the void-death case

T1 hits at 18:42. By 18:45 a Class-A snapshot exists containing the player's `<uuid>.dat` from 18:40. Restore: `restic -r ... restore --target /tmp/r --include 'world/playerdata/<uuid>.dat' latest`, stop server (or `/save-off` plus a manual file swap), copy the file into place, `/save-on`. Total RTO < 2 min.

---

## 5. Retention

Restic policy (passed to `restic forget --keep-*`):

```
--keep-last 24      # 24 most recent (covers 2h of 5-min snapshots)
--keep-hourly 24    # 24h of hourly
--keep-daily 7      # 7 days
--keep-weekly 4     # 4 weeks
--keep-monthly 12   # 12 months
```

Applied per-tag — Class A snapshots tagged `playerdata`, B/C/D tagged `world`. Forget is run **only on the local repo**; the onyx mirror inherits via `restic copy` with same policy after the local forget+prune.

### Storage budget

- Class A: 1.3 MB raw × dedup (~20× on `.dat`, mostly empty NBT slots) → ~70 KB / snapshot **net**.
- 12/h × 24 h × 7 d = 2 016 snapshots/week → < 150 MB/week.
- Class B/C/D: 18 G raw → ~6.5 G compressed (per current 3.6 G figure × adjustment for nether/end now active). Restic dedup on hourly snapshots: ~50–200 MB delta/snapshot during active play.
- 24 hourly + 7 daily + 4 weekly + 12 monthly ≈ 47 retained → estimate **15–25 GB total** at steady state.
- E (off-host): same as above on onyx (1.6 TB free — ~60× headroom, per §10).

**Conclusion:** comfortably fits in nullstone's 142 G free. Onyx is essentially unconstrained.
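
The bullet arithmetic can be double-checked in a few lines (every number here is this section's estimate, not a measurement):

```python
# Sanity-check the Class-A budget: snapshots per week at a 5-min cadence,
# implied weekly repo growth, and the retained-snapshot count from §5.
SNAPSHOT_INTERVAL_MIN = 5
per_hour = 60 // SNAPSHOT_INTERVAL_MIN      # 12 snapshots/hour
per_week = per_hour * 24 * 7                # snapshots/week
net_kb_per_snapshot = 70                    # ~70 KB net after dedup (estimate above)
weekly_mb = per_week * net_kb_per_snapshot / 1024

retained = 24 + 7 + 4 + 12                  # hourly + daily + weekly + monthly (§5)
print(per_week, round(weekly_mb), retained)
```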

---

## 6. Off-host destination — onyx via Tailscale

**Choice:** `onyx` (100.64.0.1, 1.6 TB free on `/home`). Reasons:
- Already in the tailnet (`tag:admin`), already trusted, already SSH-reachable.
- 1.6 TB is 100× the dataset.
- Operator's daily-driver: a missed-backup alert on onyx is *seen*.
- Deferred (phase 2): replicate to friend's RTX 4080 PC (100.64.0.3) for true geographic separation. Tailnet IP is stable across the friend's ISP IP changes per memory `project_friend_gpu`.

**Mechanics:**
1. On onyx: create restricted user `mc-backup` with `~/backups/nullstone-mc-restic/` and a `~/.ssh/authorized_keys` entry that **only allows `internal-sftp` chrooted to that dir**, no shell, no port-forward. (`Match User mc-backup ... ChrootDirectory %h, ForceCommand internal-sftp -d /backups/nullstone-mc-restic`).
2. On nullstone: install nullstone's ssh public key on onyx for that user. Ideally the onyx repo is append-only, so a compromised nullstone cannot run `forget`/`prune` against it. A plain sftp backend cannot enforce that (restic's real append-only mode comes from `rest-server --append-only`, which onyx isn't running), so the practical compromise for now: rely on `restic copy` being add-only, keep the account locked to the chrooted sftp dir, and audit any `forget` runs.
3. Nightly job on nullstone: `restic -r sftp:mc-backup@100.64.0.1:/backups/nullstone-mc-restic copy --from-repo /home/user/restic/mc-world latest && ... mc-frequent ...`.
4. Onyx-side cron weekly: `restic check` on the mirror (independent verification).
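
Step 1's sshd restriction could look like the following fragment (a sketch of the Match block described above; directive names are standard OpenSSH `sshd_config`, paths are the ones named in this section):

```
# /etc/ssh/sshd_config.d/mc-backup.conf on onyx (sketch)
Match User mc-backup
    ChrootDirectory %h
    ForceCommand internal-sftp -d /backups/nullstone-mc-restic
    AllowTcpForwarding no
    X11Forwarding no
    PermitTTY no
```

Note that `ChrootDirectory` requires the chroot target to be root-owned and not writable by group/other — which is also what gives the partial append-only property discussed in step 2.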

**Why not friend's GPU PC?** Windows host, no built-in SSH, asymmetric trust. Defer to phase 2 once an SMB or `rclone serve` target is set up there.

---

## 7. Restore drill (monthly, 1st Saturday 11:00)

Runbook: `docs/RUNBOOK-BACKUP-RESTORE.md` (created alongside this proposal).

Drill scenario: "YOU500 lost his inventory to a void death 6 minutes ago." Steps:

1. Pick a known UUID from `world/playerdata/` (operator's own UUID).
2. `restic -r /home/user/restic/mc-frequent snapshots --tag playerdata | tail -5` — confirm freshest snapshot is ≤ 6 min old.
3. `restic -r ... restore latest --target /tmp/drill-$(date +%s) --include 'world/playerdata/<uuid>.dat'`.
4. `nbted` or `python -m nbtlib` parse the `.dat` — confirm it's a valid GZIP NBT structure (not zero bytes, not partial).
5. `diff` against the live `.dat` — log the differences (expected: at least the inventory NBT path differs because player kept playing).
6. Repeat from the **onyx mirror** repo to prove off-host works end-to-end.
7. Log result to `docs/RUNBOOK-BACKUP-RESTORE.md` § Drill log.

Drill is **non-destructive** — never overwrite live `.dat` during a drill. Real restores follow §3 of the runbook.

Pass criteria: both restores complete in < 2 min wall-clock and the parsed NBT root tag is well-formed.
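
Step 4's validity check can also be sketched without nbtlib, using only the stdlib: a restored player `.dat` should be non-empty gzip data whose decompressed stream starts with an NBT compound tag (byte `0x0A`). The function name and checks here are illustrative, not the runbook's exact procedure:

```python
import gzip
import zlib

def looks_like_player_dat(path: str) -> bool:
    """Cheap drill check: non-empty, gzip-wrapped, NBT compound root."""
    with open(path, "rb") as f:
        raw = f.read()
    if len(raw) < 2 or raw[:2] != b"\x1f\x8b":   # gzip magic bytes
        return False
    try:
        nbt = gzip.decompress(raw)               # raises on corrupt/truncated data
    except (OSError, EOFError, zlib.error):
        return False
    return len(nbt) > 0 and nbt[0] == 0x0A       # TAG_Compound root
```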

---

## 8. Implementation — concrete drafts

Two layers: a **fix** to the existing daily script (Class C/E) and a **new sidecar timer** for Classes A/B/D.

### 8.1 Fix `/opt/docker/backup.sh` (F-backup-1)

Already documented in `infra/runbooks/MIGRATION-nullstone-to-cobblestone.md` §5. Minimum work:
- Drop dead `matrix-postgres` block (Synapse retired).
- Drop / fix `mongodb` block (RC stopped 2026-05-06).
- Remove orphaned `chmod 600 ...synapse-signing-key...` block at L119–122 (causing `set -e` exit before MC block on most days).
- Wrap each module in `( ... ) || log "module FAILED"` so one module's failure doesn't skip the rest.
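
That wrapping pattern in isolation (module bodies here are stand-ins, not the real `backup.sh` modules): under `set -e` a bare failing command aborts the whole script, while a subshell with a logging fallback lets the next module — the Minecraft block — still run.

```bash
#!/usr/bin/env bash
set -euo pipefail

LOG=""
log() { LOG="${LOG}${1};"; }

# A failing module (stand-in for the broken synapse block): the || keeps
# set -e from killing the script here.
( false ) || log "synapse FAILED"

# The next module still runs (stand-in for the Minecraft world tarball).
( true ) && log "minecraft OK"

echo "$LOG"
```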

Out-of-scope for this strategy doc — track in infra audit.

### 8.2 New: `mc-backup-frequent` (Class A) and `mc-backup-world` (Classes B/C/D)

Drop-in files (operator review before deploy):

**`/etc/systemd/system/mc-backup-frequent.service`**
```ini
[Unit]
Description=Minecraft frequent backup (playerdata, every 5 min)
After=docker.service
Wants=docker.service

[Service]
Type=oneshot
User=user
Group=docker
EnvironmentFile=/etc/mc-backup.env
ExecStart=/usr/local/bin/mc-backup-frequent.sh
Nice=10
IOSchedulingClass=best-effort
IOSchedulingPriority=7
```

**`/etc/systemd/system/mc-backup-frequent.timer`**
```ini
[Unit]
Description=Run mc-backup-frequent every 5 minutes

[Timer]
OnBootSec=2min
OnUnitActiveSec=5min
AccuracySec=30s
Persistent=true

[Install]
WantedBy=timers.target
```

**`/etc/mc-backup.env`** (mode 0600, owner `user:docker`)
```
RESTIC_REPOSITORY_FREQUENT=/home/user/restic/mc-frequent
RESTIC_REPOSITORY_WORLD=/home/user/restic/mc-world
RESTIC_PASSWORD_FILE=/etc/mc-backup.pw
MC_DATA=/opt/docker/minecraft
RCON_HOST=127.0.0.1
RCON_PORT=25575
RCON_PASS=*redacted*
HEARTBEAT_URL=https://ntfy.s8n.ru/mc-backup-frequent
ALERT_URL=https://ntfy.s8n.ru/mc-backup-alerts
TS_OFFHOST_USER=mc-backup
TS_OFFHOST_HOST=100.64.0.1
TS_OFFHOST_PATH=/backups/nullstone-mc-restic
```

**`/usr/local/bin/mc-backup-frequent.sh`** (note the explicit `-r "$RESTIC_REPOSITORY_FREQUENT"`: restic reads `RESTIC_REPOSITORY`, not our `_FREQUENT`/`_WORLD` variants, so the repo must be passed per call)
```bash
#!/usr/bin/env bash
set -euo pipefail
. /etc/mc-backup.env

trap 'curl -fsS -m 10 -d "fail rc=$?" "$ALERT_URL" >/dev/null || true' ERR

# 1. Ask MC to flush via rcon (best-effort; don't fail backup if rcon down)
if command -v mcrcon >/dev/null 2>&1; then
    mcrcon -H "$RCON_HOST" -P "$RCON_PORT" -p "$RCON_PASS" -w 1 \
        "save-all flush" >/dev/null 2>&1 || true
fi

# 2. Snapshot just the small fast-changing things
restic -r "$RESTIC_REPOSITORY_FREQUENT" backup \
    --tag playerdata \
    --tag auto-5min \
    --host nullstone \
    --exclude='*.lock' \
    "$MC_DATA/world/playerdata" \
    "$MC_DATA/world/stats" \
    "$MC_DATA/world/advancements" \
    "$MC_DATA/world/level.dat" \
    "$MC_DATA/world_nether/level.dat" \
    "$MC_DATA/world_the_end/level.dat" \
    "$MC_DATA/homestead_data.db" \
    "$MC_DATA/plugins/LuckPerms" \
    "$MC_DATA/plugins/CoreProtect/database.db" 2>/dev/null || true

# 3. Cheap retention (only on local repo)
restic -r "$RESTIC_REPOSITORY_FREQUENT" forget --tag auto-5min \
    --keep-last 24 --keep-hourly 24 --keep-daily 7 \
    --prune --quiet

# 4. Heartbeat — alert if NOT received in 15 min via ntfy server
curl -fsS -m 5 "$HEARTBEAT_URL" >/dev/null || true
```

**`mc-backup-world.{service,timer,sh}`** — same shape, runs hourly during play / 6h otherwise (use `OnCalendar=*-*-* 07,08,...,01:00:00` or two timers), backs up full `world*/`, configs, DB dumps. After local backup, runs:

```bash
restic copy \
    --from-repo "$RESTIC_REPOSITORY_WORLD" \
    -r "sftp:$TS_OFFHOST_USER@$TS_OFFHOST_HOST:$TS_OFFHOST_PATH" \
    latest
```

And once nightly (separate timer) the same `copy` for `mc-frequent`.
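
One way to express that cadence in a single timer (a sketch, assuming the 07:00–01:00 play window from §4; systemd calendar expressions accept hour ranges): hourly triggers from 07:00 through 01:00 leave exactly one six-hour gap, 01:00 → 07:00, which matches the off-hours cadence without a second timer.

```ini
# mc-backup-world.timer (sketch) — hourly during play hours
[Timer]
OnCalendar=*-*-* 07..23:00:00
OnCalendar=*-*-* 00..01:00:00
Persistent=true
```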

### 8.3 docker-compose.override.yml — alternative path (rejected)

Considered: itzg image supports `BACKUP_INTERVAL`, `BACKUP_METHOD=restic`. Pros: in-container, knows when world is loaded. Cons:
- Bind-mount to host restic repo crosses userns-remap boundary (uid 100000 vs host uid 1000) — already a known nullstone footgun (memory `project_nullstone_docker_userns`).
- Container restart wipes restic cache, slow first run after every reboot.
- Mixing in-image and host-cron backup logic doubles failure surfaces.

**Decision:** keep backups in systemd on the host; container is unaware. Override file is **not** part of this proposal.

---

## 9. Monitoring & alerting

All signals route to ntfy on the existing self-hosted `ntfy.s8n.ru` (assumed to exist; if not, add as part of phase 1 — single-container deploy). DiscordSRV was dropped on 2026-04-30 per README.md L170, so Discord is not an option.

| Signal | Trigger | Channel |
|---|---|---|
| `mc-backup-frequent` heartbeat | timer fires successfully | ntfy topic `mc-backup-frequent` (silent on success) |
| Heartbeat **missing > 15 min** | dead-man's switch on ntfy server, or external (`healthchecks.io` is free + self-hostable) | ntfy topic `mc-backup-alerts` (high priority) |
| `restic check` weekly | non-zero rc | ntfy topic `mc-backup-alerts` (high priority) |
| Off-host mirror failure | `restic copy` non-zero rc | ntfy topic `mc-backup-alerts` (high priority) |

Operator subscribes onyx + phone to `mc-backup-alerts` only. The `-frequent` topic is a heartbeat sink (not a notification stream).

**Alternative if no ntfy yet:** write to `/var/log/mc-backup.log` AND a tiny status file `/var/lib/mc-backup/last-success` (mtime checked by an external monitor — Gatus on roadmap, Beszel on roadmap). Until either of those lands, a simple cron on **onyx** doing `ssh user@nullstone 'find /var/lib/mc-backup/last-success -mmin -15 | grep .'` and triggering a desktop `notify-send` is enough.
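
The monitor side of that fallback reduces to one mtime comparison; a sketch (the status-file path and 15-minute threshold come from this section, the `notify-send` call is left to the caller):

```python
import os
import time

STATUS_FILE = "/var/lib/mc-backup/last-success"   # touched by the backup job
MAX_AGE_S = 15 * 60

def heartbeat_fresh(path: str = STATUS_FILE, max_age_s: int = MAX_AGE_S) -> bool:
    """True if the last-success marker was touched within the window."""
    try:
        age = time.time() - os.path.getmtime(path)
    except FileNotFoundError:
        return False                               # never succeeded => alert
    return age <= max_age_s
```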

This addresses T8 (the silent-failure threat) directly.

---

## 10. Cost & capacity

**Hardware cost:** £0. Uses existing nullstone NVMe + onyx NVMe + existing Tailscale mesh.

**Disk consumption (steady state, both repos):**

| Where | Estimate | Headroom |
|---|---|---|
| nullstone `/home/user/restic/mc-frequent` | < 1 GB | 142 G free → ~140× |
| nullstone `/home/user/restic/mc-world` | 15–25 GB | ~6× |
| onyx `~/backups/nullstone-mc-restic/` | 16–26 GB | 1.6 T free → ~60× |

**Days of retention given current free space:** even if the world doubles to 36 GB raw, dedup keeps growth linear at ~5 % per snapshot — well over a year of monthly retention fits.

**Network:** Tailscale LAN-direct (5 ms onyx ↔ nullstone). Nightly delta typically < 500 MB after dedup. Negligible.

**Operator time:** ~2 h initial deploy, ~10 min/month for the drill, ~zero on autopilot.

---

## 11. Phase plan

| Phase | What | When | Blocker |
|---|---|---|---|
| 0 | This doc + runbook stub written, reviewed | TODAY | — |
| 1 | Stop the bleeding: fix `backup.sh` orphan lines so daily MC tar at least runs again | TODAY (15 min) | — |
| 2 | Stand up `mc-backup-frequent` timer + local restic repo (Class A) | this week | needs `apt install restic mcrcon` |
| 3 | Add `mc-backup-world` timer + Class B/C/D | this week | — |
| 4 | Onyx off-host SFTP target + `restic copy` job | this week | onyx user provisioning + ssh key |
| 5 | First monthly drill | next 1st Saturday | — |
| 6 | Wire ntfy alerts | when ntfy/Gatus deployed (infra roadmap) | external |
| 7 | Friend RTX 4080 PC as second off-host (geographic) | phase 2 | Windows-side tooling |

Phases 1–4 are doable today with what's on hand. Nothing in phases 1–5 requires purchasing.

---

## 12. Open questions for operator

1. **ntfy.s8n.ru — does it exist yet?** Memory hints at Tuwunel + Matrix on `txt.s8n.ru`. If ntfy isn't deployed, decide: deploy ntfy *now*, or use Matrix room via Tuwunel webhook bridge as alert sink.
2. **Onyx user `mc-backup`** — create today or reuse existing `admin` with restricted authorized_keys? Restricted user is cleaner; reusing `admin` is faster.
3. **Append-only enforcement** on the onyx side — accept "sftp chroot + no shell" as good-enough, or invest in a per-repo restic key with `--no-delete`-style isolation (more work, partial mitigation only)?
4. **Pre-flight world validation** — run `region-fixer` against the latest snapshot weekly to catch silent corruption (T3)? Adds ~5 min compute weekly. Recommend yes.
5. **Class-E (host configs) — already in `live-server/` git repo via Syncthing/manual?** If yes, drop Class E from this scheme; if no, add it.

---

## 13. References

- `docs/BACKUP.md` — current (broken) state docs.
- `docs/RUNBOOK-BACKUP-RESTORE.md` — operational runbook (this commit).
- `scripts/backup.sh` — to-be-fixed daily script (F-backup-1 in `infra/STATE.md`).
- `_github/infra/STATE.md` — Top-5 weakness #2 + #5 tracking this work.
- `_github/infra/runbooks/MIGRATION-nullstone-to-cobblestone.md` §5 — F-backup-1 detail; nullstone-as-spare hint.
- Memory: `project_friend_gpu` (Tailscale stable IP for friend), `project_tailscale_mesh` (mesh layout), `project_nullstone_docker_userns` (why container-side backup is rejected).
- `CLAUDE.md` Device Registry — onyx 192.168.0.28 / 100.64.0.1.

364 CROSS-REFERENCE-2026-05-07.md Normal file
@@ -0,0 +1,364 @@

<!--
Cross-reference survey for the 2026-05-07 racked.ru / YOU500 incident.
Read-only inventory of existing docs across local repo clones, written
to help the four parallel investigation outputs (backup hunt, AuthLimbo
audit, backup strategy, server audit) integrate without conflict.

Author: cross-reference agent (read-only)
Status: survey only — no fixes proposed here, that's the other agents' job.
-->

# Cross-Reference Survey — 2026-05-07

**Trigger:** racked.ru player **YOU500** void-died via AuthLimbo
`teleportAsync` failure, lost full inventory, no backups exist.
Four parallel agents are writing audit + plan docs. This doc maps
them onto existing infra so nothing collides or gets orphaned.

---

## 1. Per-repo state snapshot

### `auth-limbo` (Paper plugin source)

| Field | Value |
|---|---|
| Origin | `ssh://git@192.168.0.100:222/s8n-ru/auth-limbo.git` ⚠️ stale (`s8n-ru` rename) |
| Latest tag in CHANGELOG | **1.0.0** (2026-04-30) — single release |
| Last commit | `b686380 readme: restyle to match minecraft-launcher format` |
| Recent commits | README rewrites, AGPL switch, rename chain `RackedLimbo → LoginLimbo → AuthLimbo` |
| CI | `.github/workflows/build.yml` + `release.yml` (GitHub Actions, **not** `.forgejo/`) |
| Tests | **None.** `src/test/` does not exist. |
| Source | 5 Java files: `AuthLimbo`, `AuthMeDatabase`, `LimboWorldManager`, `LoginListener`, `VoidGenerator` |
| Docs | `docs/{compatibility,configuration,how-it-works,installation}.md` |
| CHANGELOG style | **Keep a Changelog + SemVer**, date-suffixed `## [1.0.0] - 2026-04-30` |
| License | AGPL-3.0-or-later, SPDX header in every Java file |

**Key existing detail relevant to the bug** — `LoginListener.java`
already implements the documented Paper #4085 fix (chunk-ticket pin
in `AuthMeAsyncPreLoginEvent` + `getChunkAtAsyncUrgently` chained
with `teleportAsync` at MONITOR priority on `LoginEvent`, with
configurable `authme.teleport-delay-ticks`). If YOU500 still
void-died, the bug is in **how** that chain handled a return value
of `false` / a thrown exception — the current code only logs a
`warning` and lets the player stay wherever they were (which on
login is the limbo void). See `LoginListener.java:166-191`.

The AuthLimbo audit agent's findings should land as:
- **`docs/INCIDENT-2026-05-07-you500.md`** (new) — forensic root-cause
  doc, follow `docs/REBRAND_2026-04-30.md` style (date-prefixed,
  scope/apply/result/rollback sections — convention shown below).
- **`CHANGELOG.md`** — bump to `## [1.0.1] - 2026-05-07` with
  `### Fixed` block, follow Keep-a-Changelog format.
- **`src/main/java/ru/authlimbo/LoginListener.java`** — code patch.
  Likely changes: handle `success == false` and `exceptionally`
  with a kick or retry rather than silent log; consider raising
  default `teleport-delay-ticks` from 10 → 20.
- **`src/test/`** (new directory) — unit tests for the listener.
  No precedent here, but pom.xml needs JUnit added.

---

### `minecraft-server` (server repo — this repo)

| Field | Value |
|---|---|
| Origin | `ssh://git@192.168.0.100:222/s8n-ru/minecraft-server.git` ⚠️ stale |
| Last commit | `ede6029 proantitab: allow lp/luckperms in global; deny essentials.motd default` |
| Top-level docs | `MISSION.md`, `README.md`, `RULES.md`, `THANKS.md`, `VIBE.md`, `TELEMETRY_AUDIT.md` |
| `docs/` | `BACKUP.md`, `DEPLOY.md`, `PERMISSIONS.md`, `PLUGINS.md`, `PLUGIN_ALTERNATIVES.md`, `RACKED_BRAND.md`, `REBRAND_2026-04-30.md`, `ROADMAP.md`, `migrations/lands-to-landclaim.md`, `plugins/<name>.md` (20 files) |
| Existing TODO | The README "Roadmap / TODO" section (lines 91-180) is the canonical living checklist. Tagged `[P0]` blocker / `[P1]` vision / `[P2]` improvement / `[P3]` nice-to-have. `docs/ROADMAP.md` is **scoped narrowly** to plugin-acquisition overhaul (Phases 1-3). |
| `live-server/` | live config snapshot (purpur.yml, server.properties, ops.json, plugins/) — **mirrors prod state**, not a build input. |
| Backup script | `scripts/backup.sh` — note **bug at line 119** (orphaned `"${BACKUP_PATH}/synapse-signing-key-${TIMESTAMP}.key"` block sits outside any `if`, will fail at runtime if signing-key path absent) |
| CI | `.github/workflows/` is empty. `.github/ISSUE_TEMPLATE/` empty. No `.forgejo/`. |

**No existing files named** `AUDIT*`, `INCIDENT*`, `RUNBOOK*`,
`TODO*`, `CHANGELOG*` at root or in `docs/`. The closest precedents:
- `docs/REBRAND_2026-04-30.md` — date-prefixed event log w/
  Apply/Side incident/Rollback sections. **Use this as the format
  template for any new INCIDENT-* doc.**
- `docs/migrations/lands-to-landclaim.md` — multi-section migration
  plan (Current State / Target / Plan / Rollback). Format template
  for future strategy docs.
- `MISSION.md` / `VIBE.md` / `RULES.md` — top-level "values" docs.
  Don't add new top-level capitalised md files unless the doc is
  similarly load-bearing for the project's identity. Detail goes in
  `docs/`.

---
|
||||||
|
|
||||||
|
### `infra` (nullstone+cobblestone runbooks)

| Field | Value |
|---|---|
| Origin | `ssh://git@192.168.0.100:222/veilor-org/infra.git` ✅ org-scoped, no rename impact |
| Last commit | `381f923 runbook: distribute load + sync data (operator's HA vision)` |
| Layout | `forgejo/`, `runbooks/`, `repos/`, root `STATE.md` + `AUDIT-2026-05-05.md` |
| Runbooks | `COBBLESTONE-INTAKE.md`, `DE-DECISION-cobblestone.md`, **`HA-CLUSTER-distribute-and-sync.md`** (already covers MC backup placement!), `MIGRATION-nullstone-to-cobblestone.md` |

**Critical pre-existing context:**

- `STATE.md` already lists *"`/opt/docker/backup.sh` fixes — matrix-postgres + rocketchat-mongodb + literal CHANGE_ME pw"* as an open issue (line 97), AND lists Restic+autorestic as the **#1** recommended addition (lines 113, 283-285 of `AUDIT-2026-05-05.md`).
- `runbooks/HA-CLUSTER-distribute-and-sync.md` line 51 already plans *"Backups (offsite) — Restic to B2/Wasabi nightly"* and line 72 pins MC to nullstone with *"World data ZFS-replicated for DR only"*. The backup-strategy agent's plan must reconcile with this — don't propose a parallel scheme; either extend the HA runbook or cross-link it as the parent design.
- `AUDIT-2026-05-05.md` lines 200-203 already flag the backup script as silently broken (RC + ex-Matrix not dumping). Confirms the symptom that caused YOU500's loss.

**Format conventions in `infra/`:**

- Audit reports: `# 5-Agent Audit Report — YYYY-MM-DD` header, TL;DR section, severity-ordered Action items section, file index.
- Runbooks: `# Runbook — <topic>` header, Goal blockquote, North-star diagram if applicable, phase plan, failure scenarios + RTO table, open decisions, related links.
- Dating: filenames always `<TYPE>-YYYY-MM-DD.md`.

---

### `minecraft-launcher`

| Field | Value |
|---|---|
| Origin | `ssh://git@192.168.0.100:222/s8n-ru/minecraft-launcher.git` ⚠️ stale |
| Last commit | `31d25f8 readme: shrink license section to single sub line` |
| Relevance to incident | None direct. Would only matter if the incident agent recommends a launcher-side patch (e.g. forced relog on void-death detection) — unlikely. |

### `minecraft-client`

**Not a git repo** (`fatal: not a git repository`). No remote to worry about. Excluded from any rewrite list.

### `veilor-os`

| Field | Value |
|---|---|
| Origin | `ssh://git@192.168.0.100:222/veilor-org/veilor-os.git` ✅ no rename impact |
| Relevance | None — separate brand (security distro), not Minecraft. Skipped per instructions. |

---

## 2. Stale `s8n-ru` origin URLs (per 2026-05-07 rename)

Per workspace memory `user_git_identity.md`, the Forgejo user `s8n-ru` was renamed to `s8n` on 2026-05-07. Forgejo serves a 307 redirect for now, but the canonical path is `s8n/<repo>`. The following local clones still have the old origin:

| Repo (local clone) | Current origin | Should become |
|---|---|---|
| `_github/auth-limbo` | `ssh://git@192.168.0.100:222/s8n-ru/auth-limbo.git` | `ssh://git@192.168.0.100:222/s8n/auth-limbo.git` |
| `_github/minecraft-server` | `ssh://git@192.168.0.100:222/s8n-ru/minecraft-server.git` | `ssh://git@192.168.0.100:222/s8n/minecraft-server.git` |
| `_github/minecraft-launcher` | `ssh://git@192.168.0.100:222/s8n-ru/minecraft-launcher.git` | `ssh://git@192.168.0.100:222/s8n/minecraft-launcher.git` |

**No rename required for:** `_github/infra` (`veilor-org/`), `_github/veilor-os` (`veilor-org/`), `_github/minecraft-client` (not a repo).

Recommended one-shot fix (deferred — not part of these four agents):

```bash
for r in auth-limbo minecraft-server minecraft-launcher; do
  git -C /home/admin/ai-lab/_github/$r remote set-url origin \
    ssh://git@192.168.0.100:222/s8n/$r.git
done
```

Also update the in-doc URL references:

- `auth-limbo/src/main/resources/plugin.yml` line 7: `website: https://github.com/s8n-ru/auth-limbo`
- `auth-limbo/src/main/java/ru/authlimbo/*.java` SPDX header: `Copyright (C) 2026 s8n-ru`
- `minecraft-server/VIBE.md` line 38: `github.com/s8n-ru/auth-limbo`
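The in-doc references can be previewed and rewritten with a plain string swap. A minimal dry-run sketch — the temp file stands in for the three real paths above, so the block is safe to run anywhere:

```shell
# The temp file stands in for the real files listed above; point the
# commands at those paths when applying for real (Linux GNU sed -i).
f=$(mktemp)
printf 'website: https://github.com/s8n-ru/auth-limbo\n' > "$f"
grep -n 's8n-ru' "$f"          # preview every line that will change
sed -i 's/s8n-ru/s8n/g' "$f"   # the rename is a plain string swap
cat "$f"                       # -> website: https://github.com/s8n/auth-limbo
rm -f "$f"
```

Run the `grep` preview first on the real files; the SPDX `Copyright (C) 2026 s8n-ru` lines are caught by the same pattern.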

---

## 3. Overlap with session-noted TODO items

The session noted these TODOs that the four agents may want to fold into recommendations. State as of HEAD:

| Item | Existing mention? | Where | Status |
|---|---|---|---|
| **SHA256 → BCRYPT** (AuthMe hashing) | ✅ flagged 2026-05-02 | `security/nullstone-server/2026-05-02-mc-audit.md` summary: *"AuthMe also uses unsalted SHA-256, no tempban, no captcha, and 5-char minimum passwords"* | **Not yet addressed in repo.** No TODO entry in README. New. |
| **EZShop drop** | ⚠️ Plugin loaded via `PLUGINS:` in `docker-compose.yml:51` | docker-compose.yml | No TODO entry yet. New. |
| **CapDrop** (Linux capabilities) | ❌ No mention | — | Net-new infra-side item (deploy-security level). Belongs in the server-audit agent's report. |
| **tracking-range** | ❌ No mention | — | Net-new (purpur.yml tuning). New. |
| **CO DB → MySQL** (CoreProtect) | ❌ No mention | — | Net-new. Touches plugin policy (CoreProtect-CE is the one acknowledged license exception per MISSION.md — a CO config change is OK, a plugin swap is not). |
| **TPS webhook** | ⚠️ A "Prometheus exporter + Grafana" entry exists at README:105 (P2). A webhook would be a lighter-weight alternative. | README.md:105 | Adjacent to existing TODO; consider replacing or augmenting it. |
| **spark baseline** | ✅ spark already loaded in `PLUGINS:` (compose:54) and listed in VIBE.md:78 | docker-compose.yml, VIBE.md | "Baseline" = capture a profiling run for reference. Net-new. |
| **plugin folder cleanup** | ⚠️ `live-server/plugins/` is a checked-in live config snapshot. Past cleanup happened in REBRAND_2026-04-30 (Side incident — disk full). | docs/REBRAND_2026-04-30.md:65-74 | Operational, not docs. Net-new. |

**None of the eight overlap with the existing `docs/ROADMAP.md`** (which is scoped narrowly to *plugin-acquisition* — manifest + lockfile + CI). They all belong in the **README.md "Roadmap / TODO" checklist** by current convention. The server-audit agent should append them there, not create a new ROADMAP-* doc.

---

## 4. Existing backup-related mentions

| File | Line | Content |
|---|---|---|
| `docs/BACKUP.md` | all | Documents the daily 02:00 cron + retention. **Critical drift:** describes worlds being backed up, but VIBE.md:54-58 says *"no world backups"*. Direct contradiction. |
| `scripts/backup.sh` | 80-117 | Minecraft block: docker-exec tar of world/world_nether/world_the_end + configs. **Real, working code.** |
| `scripts/backup.sh` | 119-122 | **Orphaned dead-code block** outside any `if` (dangling from `synapse-signing-key`). Will trigger script failure if the signing-key path is missing. |
| `README.md` | 23, 45, 164, 179 | Mentions the backup feature. README:179 records "freed 11G+ (old backups, ...)". |
| `VIBE.md` | 54-58 | *"Daily configs, no world backups (it'd eat too much disk). If you lose a base to grief, that's the game."* — **conflicts with reality.** |
| `docs/REBRAND_2026-04-30.md` | 53, 65-74 | Records the 2026-04-30 backup tarball and the 2026-05-01 disk-full incident from accumulated backups. Confirms backups *were* running. |
| `SYSTEM.md` | 737-749 | Workspace-level system reference says backups run daily, ~5-7GB compressed. Out-of-date plugin counts (says 25, actual ~16) and Purpur version (says 1.21.10, actual 1.21.11). |

**Major contradiction the backup-strategy agent must resolve:** either VIBE.md must drop the *"no world backups"* line (recommended — reality is that worlds **are** being backed up), or the operator must accept that the YOU500 loss happened because the worlds were **logically excluded from the policy** even though they were mechanically being archived. The latter is unlikely — a daily 02:00 tarball would have caught a 2026-05-07 daytime void death.

**Backup-hunt agent finding to verify:** does `/opt/backups/` on nullstone actually contain any usable `mc-world-backup-*.tar.gz` files? `STATE.md` line 97 + `AUDIT-2026-05-05.md` lines 200-203 suggest the script *runs* but its other arms are failing silently; the MC arm at lines 80-117 of backup.sh has no obvious bug, so backups should exist. If they don't, that's the deepest finding.

---

## 5. Forgejo runner / CI integration

Per memory `project_forgejo_nullstone.md` and `STATE.md` lines 26-27, nullstone runs a Forgejo runner with labels `ubuntu-24.04 + nullstone`. **No repo currently has a `.forgejo/` directory** — neither auth-limbo nor minecraft-server nor infra. CI in `auth-limbo` is GitHub Actions (`.github/workflows/`).

`STATE.md` lines 121-129 note the v0.5.32 veilor-os ship is pending on flipping `runs-on:` to `nullstone` to use the Forgejo runner.

**Implication for the audit agents:** if the AuthLimbo agent wants the fix to land via CI, two options:

1. Keep `.github/workflows/build.yml`. Since the GH mirror is manual-only post-2026-05-06 (`STATE.md`:14-18), the workflow won't trigger automatically anymore and would need a manual mirror push.
2. Migrate to `.forgejo/workflows/build.yml` with `runs-on: ubuntu-24.04` (compatible with the runner). Cleaner, matches the new direction. **Recommended.**

Either path: the pre-existing dependency on the `AUTHME_JAR_URL` repo secret (see `.github/workflows/build.yml:21-26`) needs to be re-added on Forgejo if path 2 is taken.
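If path 2 is taken, a minimal sketch of what the migrated workflow could look like. The file name comes from the recommendation above; the checkout action, Maven invocation, and secret wiring are assumptions (Forgejo Actions is GitHub-Actions-compatible syntax), not the repo's actual CI:

```yaml
# .forgejo/workflows/build.yml — hypothetical sketch, not deployed
name: build
on: [push]

jobs:
  build:
    runs-on: ubuntu-24.04        # matches the nullstone runner label
    steps:
      - uses: actions/checkout@v4
      - name: Fetch AuthMe jar   # AUTHME_JAR_URL must be re-added as a Forgejo secret
        run: curl -fsSL -o authme.jar "${{ secrets.AUTHME_JAR_URL }}"
      - name: Build
        run: mvn -B package      # auth-limbo is a Maven project (pom.xml)
```

The build step mirrors what a Maven plugin build normally needs; adapt paths to wherever the existing GH workflow places the AuthMe dependency.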

---

## 6. Workspace-level `SYSTEM.md` updates needed after backup-strategy lands

`/home/admin/ai-lab/SYSTEM.md` lines 665-779 hold the canonical workspace-level Minecraft section. After the backup-strategy doc lands, the following blocks need editing (one PR, one paragraph each):

| SYSTEM.md location | Existing content | Drift |
|---|---|---|
| Line 677 | "Minecraft Version: 1.21.10 (Purpur build 2532)" | Actual: 1.21.11 (compose line 10) |
| Lines 686-690 | "25 plugins loaded ... bulk-updated 2026-04-17" | The plugin set has shifted heavily since (LandClaimPlugin → Homestead, WorldEdit → FAWE, Vault → VaultUnlocked, LoginSecurity → AuthMe, AuthLimbo added, EZShop+AuctionHouse added). Real count ≈ 16. |
| Lines 692-706 | RAM 7GB idle, Purpur 1.21.10-2535, startup 47s | Out of date; would benefit from a re-measure as part of the "spark baseline" TODO. |
| Lines 765-771 | "Known Issues" block | Add a YOU500 incident closure note (post-fix); F10 RCON wildcard already promised in Wave 2. |
| Line 776 | "Backup frequency: Add 6-hourly world snapshots for active play sessions" | This is the existing wishlist item the backup-strategy agent will likely satisfy. Strike, or replace with "Done — see infra/runbooks/MC-BACKUP-2026-05-07.md" (or wherever the strategy lands). |

**Per `CLAUDE.md` workspace rules**, technical detail belongs in SYSTEM.md, not README.md. The README device-table line for nullstone won't change.

---

## 7. Integration recommendations — where each parallel agent's doc lands

| Agent | Output should land at | Rationale |
|---|---|---|
| **Backup hunt** (find existing backups) | `_github/minecraft-server/docs/INCIDENT-2026-05-07-you500-backup-hunt.md` | Date-prefixed, follows the REBRAND_2026-04-30.md format. Forensic in nature → minecraft-server `docs/`. |
| **AuthLimbo audit** (root-cause + code patch) | (1) `_github/auth-limbo/docs/INCIDENT-2026-05-07-teleportasync-failure.md` for the forensic write-up; (2) source patch + `CHANGELOG.md` bump in the same repo; (3) optional cross-link from `minecraft-server/docs/INCIDENT-2026-05-07-you500-backup-hunt.md` | The plugin source repo owns plugin bugs. The INCIDENT- naming convention matches REBRAND_*.md. |
| **Backup strategy** (forward-looking design) | `_github/infra/runbooks/MC-BACKUP-strategy-2026-05-07.md` (or extend `HA-CLUSTER-distribute-and-sync.md` with a Phase 1.5 sub-section) | infra owns the nullstone-side cron + restic. Cross-link from `minecraft-server/docs/BACKUP.md` (replace its current contents with a thin pointer). |
| **Server audit** (broader hardening — CapDrop, plugin folder, MySQL, etc.) | `_github/minecraft-server/docs/AUDIT-2026-05-07.md` (synthesis), then **append individual TODOs to the README.md "Roadmap / TODO"** | Matches the `infra/AUDIT-2026-05-05.md` precedent. The README is the canonical TODO surface for this repo per existing convention. |

**Files needing edits AFTER all four agents finish:**

| File | Change |
|---|---|
| `_github/minecraft-server/README.md` | Append new TODO entries from the server-audit agent: SHA256→BCRYPT, EZShop drop, CapDrop, tracking-range, CO MySQL, TPS webhook, spark baseline, plugin folder cleanup. Add `[x]` for the YOU500 incident under "Done" once the fix has shipped. |
| `_github/minecraft-server/docs/BACKUP.md` | Rewrite to point to the infra runbook; the current Schedule/Strategy/Manual sections move to infra. Or replace contents with a thin "see infra/runbooks/MC-BACKUP-strategy-2026-05-07.md". |
| `_github/minecraft-server/VIBE.md` | Drop or revise lines 54-58 — *"no world backups"* contradicts reality and is the philosophical claim that may have justified treating backups as low-priority. Important narrative fix. |
| `_github/minecraft-server/scripts/backup.sh` | Fix the orphaned lines 119-122 dead-code block. Independent of the strategy agent's output. |
| `_github/minecraft-server/docker-compose.yml` | If the EZShop drop is accepted: remove line 51. (Server-audit agent decision.) |
| `_github/auth-limbo/CHANGELOG.md` | New `## [1.0.1] - 2026-05-07` entry. |
| `_github/auth-limbo/pom.xml` | Version bump 1.0.0 → 1.0.1 if the patch ships. |
| `_github/auth-limbo/src/main/java/ru/authlimbo/LoginListener.java` | Code fix per the AuthLimbo agent. |
| `_github/infra/STATE.md` | Add a 2026-05-07 changelog entry referencing the incident; check off the "/opt/docker/backup.sh fixes" pending decision (line 97) when the backup script is repaired. |
| `_github/infra/AUDIT-2026-05-05.md` | Append an addendum or leave dated; the new audit replaces/augments the F-numbered findings related to MC backups. |
| `/home/admin/ai-lab/SYSTEM.md` | Update the Minecraft section per §6 above. Add a note in Known Issues (line 765). Update Last Updated. |
| `/home/admin/ai-lab/README.md` | "Last Updated" stamp; a one-line status mention if the user wants it surfaced at workspace level. |

---

## 8. Open conflicts and duplications

1. **VIBE.md vs reality** (the most important narrative conflict). VIBE says no world backups; backup.sh + BACKUP.md + REBRAND_2026-04-30 prove worlds **are** archived nightly. The YOU500 inventory loss means either (a) backups didn't run that day, (b) a backup ran but the rollback isn't operationally feasible (it would lose other players' progress between 02:00 and the death), or (c) the operator chose not to roll back. **The backup-strategy agent must address this explicitly** rather than just propose a new scheme.

2. **`docs/ROADMAP.md` scope vs README "Roadmap / TODO"** — the docs file is narrowly about plugin-acquisition Phases 1-3, while the README has the all-up living checklist. Future agents should not put generic TODO items into `docs/ROADMAP.md`. Keep its scope tight or rename it `docs/PLUGIN-ACQUISITION-ROADMAP.md`.

3. **infra `HA-CLUSTER-distribute-and-sync.md` vs the new MC-backup strategy** — there's a real risk the backup-strategy agent designs Restic-to-B2 in isolation while HA-CLUSTER already plans that exact service for both nullstone+cobblestone. The strategy doc must reference and extend the HA-CLUSTER plan (specifically the "Backups (offsite)" row in its layer table, line 51).

4. **CoreProtect MySQL migration** — proposed in the session TODOs. `MISSION.md:24` codifies CoreProtect-CE as "the one acknowledged license exception". Switching its DB backend to MySQL is fine under that policy (config, not plugin swap), but the server-audit agent should explicitly note "this is a config change, not a plugin swap, so MISSION.md:24 still holds" so the policy isn't accidentally diluted.

5. **AuthLimbo CI host** — `.github/workflows/` lives in the repo, but the GH push-mirror is off as of 2026-05-06. Builds will only run if someone manually pushes to GH. Worth flagging to the AuthLimbo agent that any CI step they propose may need a `.forgejo/` variant, otherwise the patched 1.0.1 release won't auto-build.

6. **`_github/minecraft-client` is not a git repo** — nothing to worry about for this incident, but anyone iterating on the incident later may try to commit something there expecting it to work. Worth recording.

---

## 9. Summary table — convention by repo

| Repo | Audit doc convention | Incident doc convention | TODO surface | CHANGELOG style |
|---|---|---|---|---|
| `auth-limbo` | (none yet) | (none yet — recommend `docs/INCIDENT-YYYY-MM-DD-<slug>.md`) | (none — small repo) | Keep a Changelog + SemVer, `## [X.Y.Z] - YYYY-MM-DD` |
| `minecraft-server` | (none yet — recommend `docs/AUDIT-YYYY-MM-DD.md` matching infra style) | follow the `docs/REBRAND_2026-04-30.md` template | README "Roadmap / TODO" with `[P0..P3]` tags | (none — uses git log) |
| `infra` | `AUDIT-YYYY-MM-DD.md` at root | (use runbooks for forward-looking work; no incident files yet) | `STATE.md` "Pending decisions" table | (none — uses git log + STATE.md) |
| `minecraft-launcher` | n/a | n/a | (none) | (none) |
| `veilor-os` | (separate brand — out of scope) | — | — | — |

---

*End of survey. Read-only. No files modified. No commits pushed.*
156
docs/RUNBOOK-BACKUP-RESTORE.md
Normal file

@ -0,0 +1,156 @@
# Runbook — Backup & Restore (Minecraft, racked.ru on nullstone)

Strategy doc: [`../BACKUP-STRATEGY.md`](../BACKUP-STRATEGY.md). This runbook is the **operator-facing** procedure for the three scenarios that come up in practice. Keep it short, copy-paste-able, and reachable from the player support workflow.

> **Status (2026-05-07):** This runbook is written **ahead** of the implementation it describes. The `mc-backup-frequent` timer and onyx mirror are NOT yet deployed. The "What if no snapshot exists yet?" section at the bottom covers today's reality.

---

## TL;DR — restore one player's `.dat` from N minutes ago

```bash
# On nullstone, as `user`:
PUUID=<player-uuid>       # e.g. from /opt/docker/minecraft/usercache.json
PUUID_NICK=<player-nick>  # in-game name, used for the kick below
WHEN=latest               # or a snapshot id from `restic snapshots`
RESTIC_PASSWORD_FILE=/etc/mc-backup.pw \
  restic -r /home/user/restic/mc-frequent \
  restore "$WHEN" \
  --target /tmp/restore-$$ \
  --include "world/playerdata/${PUUID}.dat"

# Verify the file is well-formed NBT before applying:
file /tmp/restore-$$/opt/docker/minecraft/world/playerdata/${PUUID}.dat
# Expected: "gzip compressed data"

# Apply (server must be running so playerdata is writable; the player
# MUST be offline or we're racing the writer):
mcrcon -H 127.0.0.1 -P 25575 -p *redacted* "kick ${PUUID_NICK} Restore in progress"
mcrcon -H 127.0.0.1 -P 25575 -p *redacted* "save-off"
mcrcon -H 127.0.0.1 -P 25575 -p *redacted* "save-all flush"

cp /opt/docker/minecraft/world/playerdata/${PUUID}.dat \
   /opt/docker/minecraft/world/playerdata/${PUUID}.dat.preFix-$(date +%s)
cp /tmp/restore-$$/opt/docker/minecraft/world/playerdata/${PUUID}.dat \
   /opt/docker/minecraft/world/playerdata/${PUUID}.dat
chown 100000:100000 /opt/docker/minecraft/world/playerdata/${PUUID}.dat  # userns-remap

mcrcon -H 127.0.0.1 -P 25575 -p *redacted* "save-on"
# Tell the player to log back in.
```

**Why kick + `save-off`:** if the player is online, the server holds their NBT in memory and rewrites the `.dat` on the next save tick — clobbering the restore. `save-off` halts auto-save; kicking guarantees the in-memory state for that player won't be flushed.

**Userns-remap reminder:** the host sees container-uid `100000` for files written by the MC process. Restored files written by `user` (uid 1000) will appear empty/permission-denied to the container. Always `chown 100000:100000` (or `chmod 666`) after the restore. Memory: `project_nullstone_docker_userns`.

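The `file` check in the TL;DR only reads the gzip magic bytes, so a truncated restore can still pass it. A self-contained sketch of a stricter end-to-end check — the generated temp file stands in for a restored `.dat` (player files are gzip-wrapped NBT):

```shell
# Stand-in .dat so the sketch runs anywhere; swap in the restored path.
DAT=$(mktemp)
printf 'nbt-payload' | gzip > "$DAT"
if gzip -t "$DAT" 2>/dev/null; then   # decompresses the whole stream
  echo "OK: gzip stream intact"
else
  echo "CORRUPT: do not copy over live playerdata"
fi
rm -f "$DAT"
```

`gzip -t` walks the entire stream and checks the trailing CRC, which catches truncation that a magic-byte check misses.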

---

## Scenario 1 — Player lost inventory (T1, the void-death case)

This is what the strategy was written for. RTO target: **< 2 minutes**.

1. Find the UUID:
   ```bash
   grep -i 'NICK' /opt/docker/minecraft/usercache.json
   ```
2. Pick a snapshot just **before** the loss. `restic snapshots --tag playerdata` shows timestamps.
3. Run the TL;DR block above with that snapshot id (or `latest` if the loss happened in the last 5 min).
4. Inform the player: "Your inventory from HH:MM has been restored. Anything you picked up after that point is gone."
5. Log the incident: append to `docs/INCIDENTS.md` (create if absent) — date, player, snapshot id, cause.

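Step 1's `grep -i` can substring-match the wrong player (e.g. `NICK` inside a longer nick). A sketch of an exact-name lookup — the here-doc stands in for the real `usercache.json`, which the vanilla server writes as an array of `{"name","uuid","expiresOn"}` objects:

```shell
# Sample file stands in for /opt/docker/minecraft/usercache.json.
UC=$(mktemp)
cat > "$UC" <<'EOF'
[{"name":"YOU500","uuid":"11111111-2222-3333-4444-555555555555","expiresOn":"2026-06-06 17:13:39 +0100"}]
EOF
python3 -c '
import json, sys
path, nick = sys.argv[1], sys.argv[2]
for entry in json.load(open(path)):
    if entry["name"].lower() == nick.lower():   # exact match, case-insensitive
        print(entry["uuid"])
' "$UC" YOU500
rm -f "$UC"
```

The UUID printed is the demo value above; against the real file it resolves the `.dat` filename directly.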

---

## Scenario 2 — Whole world rolled back (T2/T3, griefing or corruption)

RTO target: **15 minutes**. Server downtime expected.

1. Announce, kick, stop:
   ```bash
   mcrcon ... "say Server going down for restore — back in ~15 min"
   mcrcon ... "kick @a Restore in progress"
   cd /opt/docker/minecraft && docker compose down
   ```
2. Move live data aside (do not delete):
   ```bash
   mv /opt/docker/minecraft /opt/docker/minecraft.broken-$(date +%F)
   mkdir -p /opt/docker/minecraft
   ```
3. Restore from the world repo:
   ```bash
   RESTIC_PASSWORD_FILE=/etc/mc-backup.pw \
     restic -r /home/user/restic/mc-world \
     restore <snapshot-id> --target /tmp/world-restore
   rsync -aHAX /tmp/world-restore/opt/docker/minecraft/ /opt/docker/minecraft/
   ```
4. **Re-apply userns-remap perms** (critical — see memory):
   ```bash
   chmod -R 777 /opt/docker/minecraft  # quickfix; or chown -R 100000:100000
   ```
5. Boot:
   ```bash
   cd /opt/docker/minecraft && docker compose up -d
   docker logs -f minecraft-mc  # watch for the "Done" line
   ```
6. Verify with a known-good UUID's `.dat` parse, then announce the server up.
7. Keep `minecraft.broken-YYYY-MM-DD/` for at least 7 days for forensic comparison.


---

## Scenario 3 — Host disk dead (T5)

RTO target: **a few hours, depending on the hardware swap**.

1. New host: install Debian 13 + Docker per `_github/infra/runbooks/MIGRATION-nullstone-to-cobblestone.md`.
2. `apt install restic`. Pull the password from the operator's password manager into `/etc/mc-backup.pw`.
3. Initialise the destination dir, then restore from the **onyx mirror** (not local — local is gone):
   ```bash
   restic -r sftp:mc-backup@100.64.0.1:/backups/nullstone-mc-restic \
     restore latest --target /tmp/world-restore
   ```
4. Continue Scenario 2 from step 4.
5. Stand up the timers on the new host. **Do not** point them at the same off-host repo until the new host has been re-keyed (rotate restic passwords as part of disaster recovery).


---

## Drill log (monthly)

| Date | Operator | Snapshot age | Class A restore time | Off-host restore time | Result |
|------|----------|--------------|----------------------|------------------------|--------|
| (first drill — 2026-06-06) | s8n | TBD | TBD | TBD | TBD |

Procedure: see `BACKUP-STRATEGY.md` §7.


---

## What if no snapshot exists yet? (CURRENT REALITY 2026-05-07)

Until phases 1–4 of `BACKUP-STRATEGY.md` are deployed, the only recovery resources are:

| Source | What's there | Recoverable? |
|---|---|---|
| `/opt/backups/202604xx_020001/mc-world-backup-*.tar.gz` | World tars from Apr 29 + May 2 (others FAILED) | **GONE** — pruned by the 7-day retention |
| `/opt/backups/mc-plugins-prerebrand-2026-04-30.tar.gz` | Plugin jars only, no world | Not useful for player data |
| Live `/opt/docker/minecraft/world/playerdata/<uuid>.dat_old` | MC's own `.dat_old` shadow file from the previous save | **YES** — last save tick before the current one. **First-line defence right now.** |
| CoreProtect DB (`plugins/CoreProtect/database.db`) | Block + container actions, NOT inventory state | Partial — can roll back grief, can't restore lost items |

**Today's playbook for inventory-loss reports:**

1. Server console → `co lookup u:NICK` to confirm the loss event in CoreProtect.
2. **Stop the server immediately** if the report comes in within the same play session — every save tick overwrites `.dat_old`. `docker compose down` buys time.
3. Inspect `world/playerdata/<uuid>.dat_old` — if it predates the loss, copy it over `<uuid>.dat`, fix perms (uid 100000), restart.
4. If `.dat_old` is too new (already overwritten): **the loss is unrecoverable until BACKUP-STRATEGY phases 1–4 are deployed.** Apologise to the player. Spawn-in compensation is per operator discretion (ops creative-mode replacement is the customary remedy).
5. Log the incident — it adds urgency to deploying the new strategy.

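Step 3, sketched against a throwaway directory so it is safe to dry-run anywhere. The directory and UUID are stand-ins; substitute the real `world/playerdata` path and UUID when acting for real, and remember the userns perms fix afterwards:

```shell
PD=$(mktemp -d)                                    # stands in for world/playerdata
UUID=11111111-2222-3333-4444-555555555555          # hypothetical UUID
printf 'previous-save' > "$PD/$UUID.dat_old"       # MC's shadow copy
printf 'post-loss-save' > "$PD/$UUID.dat"          # the bad current state
cp "$PD/$UUID.dat" "$PD/$UUID.dat.preFix-$(date +%s)"  # keep the broken state
cp "$PD/$UUID.dat_old" "$PD/$UUID.dat"                 # promote the previous save
cat "$PD/$UUID.dat"                                    # -> previous-save
rm -rf "$PD"
```

The `.preFix-<epoch>` copy mirrors the TL;DR convention, so a restore gone wrong can itself be rolled back.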

---

## TODO — open items (links into BACKUP-STRATEGY.md §11)

- [ ] Phase 1: fix the `/opt/docker/backup.sh` orphan-line bug (F-backup-1).
- [ ] Phase 2: deploy `mc-backup-frequent.timer` (Class A, 5-min playerdata).
- [ ] Phase 3: deploy `mc-backup-world.timer` (Class B/C/D, hourly).
- [ ] Phase 4: provision the `mc-backup` user on onyx + a `restic copy` job.
- [ ] Phase 5: schedule the monthly drill calendar entry, run the first drill.
- [ ] Phase 6: ntfy / Matrix alert wiring (depends on the ntfy deployment).
- [ ] Phase 7: friend's RTX 4080 PC as a secondary off-host.
- [ ] Verify `usercache.json` on this host: confirm the UUID lookup workflow above resolves to the right `.dat`.
- [ ] Decide: `mcrcon` package vs a lightweight Python `mcrcon` lib.
- [ ] Document the compensation policy for unrecoverable losses (operator discretion right now).
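For the Phase 2 item, a sketch of what the unit pair could look like. The unit names come from the TODO above; the repo path, tag, and restic invocation are assumptions carried over from the TL;DR block, not deployed units:

```ini
# /etc/systemd/system/mc-backup-frequent.service — hypothetical sketch
[Unit]
Description=Class A playerdata snapshot (restic)

[Service]
Type=oneshot
Environment=RESTIC_PASSWORD_FILE=/etc/mc-backup.pw
ExecStart=/usr/bin/restic -r /home/user/restic/mc-frequent backup \
    /opt/docker/minecraft/world/playerdata --tag playerdata

# /etc/systemd/system/mc-backup-frequent.timer — hypothetical sketch
[Unit]
Description=Run the Class A snapshot every 5 minutes

[Timer]
OnCalendar=*:0/5
Persistent=true

[Install]
WantedBy=timers.target
```

Enable with `systemctl enable --now mc-backup-frequent.timer`; `OnCalendar=*:0/5` fires on every 5th minute, matching the Class A window in `BACKUP-STRATEGY.md`.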