minecraft-server/AUDIT-2026-05-07.md
s8n a1cc3940cf docs: 2026-05-07 incident audit + backup strategy
Player YOU500 lost full inventory to AuthLimbo void-death at 17:13:39.
Investigation revealed that the deployed /opt/docker/backup.sh is an 88-line
stub missing the Minecraft block; the last successful world backup was
2026-05-02 (already pruned). No recoverable .dat exists.

Files:
- AUDIT-2026-05-07.md — server-side findings F-01..F-06 (P0 backups,
  no-keepInventory, AuthLimbo silent failure, chunk preload race,
  Xmx > container headroom, container hardening gaps)
- BACKUP-HUNT-2026-05-07.md — exhaustive backup scan; only 6-week-old
  archive at _archive/minecraft-old-2026-04-27.tar.gz
- BACKUP-STRATEGY.md — restic-based plan; 5min/hourly/daily classes,
  off-host to onyx via Tailscale, monthly drill
- CROSS-REFERENCE-2026-05-07.md — repo+doc landing map; flags
  pre-existing infra/STATE.md backup-broken note + HA-CLUSTER restic
  draft to extend rather than duplicate
- docs/RUNBOOK-BACKUP-RESTORE.md — operator runbook for .dat restore,
  full-world restore, host-loss restore, drill log
2026-05-07 17:33:24 +01:00

Minecraft Server Audit — racked.ru

Container: minecraft-mc on nullstone (192.168.0.100)
Date: 2026-05-07
Audit type: Operational / data-integrity (NOT a network-security audit)
Auditor: Claude (Opus 4.7) via SSH read-only inspection
Catalyst: Player YOU500 void-died at login (~17:13:39 BST), inventory lost. No usable backup existed.


Executive Summary

Status: Critical issues found. Risk score model: Likelihood (1-5) x Impact (1-5) = 1-25. >=15 = High, >=20 = Critical.

A live AuthLimbo "teleportAsync returned false" warning fired during YOU500's first login of the day, immediately after the player "left the confines of this world" (void death in the auth_limbo world). The player retried twice. On retry #3 they were teleported to (-264.6, 86, -49.8) and 23 seconds later were blown up by a Creeper. The console operator (s8n) attempted recovery via RCON, but neither the void death nor the Creeper death had item-restore data, because:

  1. No working backups. /opt/docker/backup.sh as deployed on nullstone is a stale 88-line copy missing the entire Minecraft block. The repo version (scripts/backup.sh) has the block but was never deployed. The daily 02:00 cron has been running for at least 7 days, producing 8-12 KB archives that contain no world, playerdata, or plugins content. BACKUP.md claims the script handles MC; it does not.
  2. CoreProtect tracks inventory transactions but not death drops. co inspect will not surface "dropped on death" entries the way it does pickup/drop, and even if it did, the 1.5 GB SQLite blob is approaching the point where /co rollback over an inventory radius is operationally slow.
  3. No keepInventory rule, no death-drop rescue plugin. With difficulty=hard, gamemode=survival, and no Essentials keepinv permission flow visible, every death is a total loss.
  4. AuthLimbo has no death-listener and no failure remediation. When teleportAsync returns false, the player is dropped at limbo spawn and the warning is logged at WARN level only — no alert, no rollback, no temp-stash of inventory.
  5. JVM heap sized larger than the container headroom allows. JVM_OPTS=-Xmx16384M inside an 18G container limit with MEMORY_SIZE=16G; if the Aikar G1 heap actually grows to Xmx, the >2 GB of off-heap usage (Netty, mmaps, zip cache) on top triggers a kernel OOM kill of the container. Restart-on-OOM has no warning hook to Discord/Matrix (a minimal watcher sketch follows this list).
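
On the last point, a minimal OOM watcher can be driven straight off the Docker events stream. The sketch below assumes a Discord webhook URL exported as DISCORD_WEBHOOK_URL and a host with GNU date; both the variable and the script name are placeholders, not something that exists on nullstone today.

```bash
#!/usr/bin/env bash
# oom-watch.sh (sketch): alert when any container is OOM-killed.
# DISCORD_WEBHOOK_URL is a hypothetical env var; point it at the existing webhook stack.
set -euo pipefail

docker events --filter 'event=oom' --format '{{.Time}} {{.Actor.Attributes.name}}' |
while read -r ts name; do
  msg="OOM kill: container ${name} at $(date -d "@${ts}" '+%F %T')"
  logger -t oom-watch "$msg"
  curl -fsS -X POST -H 'Content-Type: application/json' \
       -d "{\"content\": \"${msg}\"}" \
       "$DISCORD_WEBHOOK_URL" || true
done
```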

Three biggest exposures

  1. Backups silently broken for 7+ days. (Critical — 5x4=20)
  2. No item-loss safety net for any cause of death. (Critical — 4x5=20)
  3. AuthLimbo failure path has no recovery. (High — 4x4=16)

Findings Table

Severity = Likelihood x Impact. P0 = act this week, P1 = this month, P2 = this quarter.

| ID | Severity | Finding | Recommendation | Effort |
| --- | --- | --- | --- | --- |
| F-01 | P0 / 20 | /opt/docker/backup.sh on nullstone is missing the entire MC backup block. Repo scripts/backup.sh has it but was never deployed. Daily backups since 2026-04-30 are 8-12K (effectively empty). | Sync the deployed script with the repo, run a manual backup, verify the world tarball is >= 5 GB. Add a sentinel check to backup.sh that fails the run if mc-world-backup-*.tar.gz < 1 GB. | 30 min |
| F-02 | P0 / 20 | No keepInventory rule and no essentials.keepinv permission. Every death is total loss. | Decide policy: (a) gamerule keepInventory true server-wide, (b) keep-inv only when death cause is "void"/"plugin teleport", or (c) auto-restore-on-AuthLimbo-failure. The narrow option (b) preserves survival pain while plugging the AuthLimbo data-loss vector. Plugin candidates: KeepInventoryOnVoid, DeathChestPro, custom listener in AuthLimbo. | 1-2h research, 1d implement |
| F-03 | P0 / 18 | AuthLimbo logs WARN on teleport failure but has no alerting or recovery. The player is left at limbo spawn (y128 platform) where they re-disconnect and on retry get teleported normally — but the warning never reaches an operator. | (a) Bump "teleportAsync returned false" to ERROR. (b) Add a Discord/Matrix webhook alert via the existing webhook stack. (c) On failure: snapshot player inventory, kick with a friendly message, write recovery file auth_limbo/incident-<uuid>-<ts>.dat for ops replay. | 1d |
| F-04 | P0 / 18 | YOU500's first failed teleport target was (2380.4, 69.9, -11358.4) — that's 11k blocks out and the chunk likely was not loaded yet. AuthLimbo's preload-chunks: true setting fires on AuthMeAsyncPreLoginEvent, which may not run before LoginEvent in HaHaWTH's AuthMe fork. Exact timing race is unverified. | Add a chunk-loaded assertion in AuthLimbo before calling teleportAsync; if not loaded, force-load synchronously OR delay the teleport another 10-20 ticks. Add debug logging of chunk-load state in the WARN line. | 0.5d |
| F-05 | P0 / 16 | JVM -Xmx16384M inside container mem_limit=18G with no headroom for off-heap (Netty buffers, native mmaps, mod metadata). Aikar flags + 25 plugins easily push native to 2-3 GB. Kernel OOM kill is silent. | Either (a) lower -Xmx to 12-14 GB and use a MaxRAMPercentage-style flag, OR (b) raise mem_limit to 24 GB. Also add oom_score_adj and a docker events --filter event=oom watcher that pings Discord. | 1h config + 2h alerting |
| F-06 | P0 / 16 | No pids_limit, no cap_drop: ALL, no read_only: true. Container runs with the default Docker capability set (CAP_NET_RAW, CAP_SYS_CHROOT, etc.) it does not need. | Add cap_drop: [ALL], cap_add: [NET_BIND_SERVICE] (only if binding <1024; 25565 is high so likely none), pids_limit: 4096, security_opt: [no-new-privileges:true]. Test boot, watch for startup failures. | 1h test |
| F-07 | P1 / 15 | CoreProtect SQLite at 1.5 GB. Performance and reliability degrade past 2-3 GB. database.db is the only copy; no WAL checkpoint or vacuum schedule. | (a) Migrate to a MySQL/MariaDB sidecar container. (b) Add a monthly cron co purge t:30d (purge entries older than 30 days; CoreProtect docs). (c) Schedule VACUUM after the purge. | 1d for MySQL migration, 1h for purge cron |
| F-08 | P1 / 12 | AuthMe still on passwordHash: SHA256 (legacy). Migration plan for SHA256 -> BCRYPT is on the TODO list and still pending. | Set legacyHashes: [SHA256] and passwordHash: BCRYPT. AuthMe re-hashes on the next successful login. Communicate "your password works as before, no action needed". | 30 min config + monitoring |
| F-09 | P1 / 12 | online-mode=false. Server depends entirely on AuthMe + EpicGuard for identity. EpicGuard config not audited in this pass. | Verify enableProtection: false in AuthMe (currently false) is intentional, since geofencing is US, GB, LOCALHOST only — any user from another country is locked out if protection is re-enabled. Document the choice in RULES.md. | 1h doc only |
| F-10 | P1 / 12 | auto-save-interval: 2400 (= 2 minutes at 20 TPS) is fine, BUT paper-global.yml has player-auto-save: rate: -1 (= use auto-save-interval, so also 2 min). A player who joins, dies, and disconnects within 2 min may have NO post-death snapshot persisted before the player.dat is overwritten by their next login. Player save does fire on quit, but if the death happens and the player keeps moving / interacting before logout, items in chunks not yet saved are at risk for tar-while-running backups. | Set player-auto-save: rate: 1200 (= 1 min). Switch the backup strategy to save-off + save-all flush + tar + save-on to guarantee consistency (sketch after this table), OR snapshot the host bind-mount with a filesystem-level snapshot (LVM / btrfs / ZFS). | 30 min config, 0.5d for snapshot path |
| F-11 | P2 / 10 | EZShop-1.0-SNAPSHOT.jar is bundled alongside AuctionHouse-1.4.6.jar. A PLUGIN_ALTERNATIVES.md TODO calls for dropping EZShop. | Remove EZShop, migrate any active shops to AuctionHouse, document the migration in docs/migrations/. | 0.5d player communication, 1h technical |
| F-12 | P2 / 10 | Spigot entity-tracking-range: monsters 96, misc 96. Roadmap suggests tightening to monster=32, misc=16 for TPS / network savings. | Tune in the next maintenance window, re-baseline TPS with a spark profile. | 1h config, 1d to verify under load |
| F-13 | P2 / 9 | 21 plugin folders without a matching jar (orphans): bStats, CarbonChat, ComfyWhitelist, EpicGuard, Essentials, faststats, GrimAC, Homestead, Lands, LPC, MarriageMaster, MiniMOTD, Multiverse-Core, PhantomSMP, TAB, UltimateTimber, UnexpectedSpawn, Vault, WorldEdit, plus .bak-* directories. Most have a renamed jar (carbonchat-paper-...jar, EssentialsX-...jar) so this is mostly cosmetic. Lands, LPC, MarriageMaster, PhantomSMP, UltimateTimber, UnexpectedSpawn are truly orphaned: jars not present. | Audit each: delete data dirs of plugins truly removed; the bStats/Essentials/Vault names are normal. Document the plugin-name <-> jar-name pattern in PLUGINS.md. | 1h |
| F-14 | P2 / 9 | No TPS Discord webhook alert (mentioned on TODO). spark is installed but auto-profile + alerting are not wired up. | spark already supports spark profile --thresholds; route to Discord via the existing webhook stack. | 0.5d |
| F-15 | P2 / 8 | RCON output for async commands (CoreProtect, LuckPerms) does not return to the issuing rcon-cli session. Found while trying co inspect from RCON. Async command results land in the console only. | Document this in docs/OPERATIONS.md (does not exist yet — create it). For automation, attach to docker logs -f minecraft-mc in parallel. | 30 min doc |
| F-16 | P2 / 8 | gamerule keepInventory could not be queried via rcon-cli due to an execute in <world> run argument-parsing bug in itzg's rcon-cli wrapper (or RCON quoting). State unknown without an in-game console. | Verify in-game as an op user, document the rcon-cli limitation. | 5 min in-game |
| F-17 | P2 / 6 | RCON_PASSWORD is committed to docker-compose.yml in plaintext (*redacted*). The RCON port (25575) is bound to 127.0.0.1 so the blast radius is local only — but the secret is still in git history. | Rotate the password, move it to .env (gitignored), confirm the 127.0.0.1-only binding stays. | 30 min |
| F-18 | P2 / 6 | restart: unless-stopped with no start_period re-evaluation on rapid OOM-restart loops. If the container OOMs every 60s, Docker keeps restarting indefinitely. | Add restart_policy: { condition: on-failure, max_attempts: 5, window: 300s } (compose v3+ deploy block) and a watchdog alert. | 30 min |
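
F-10's consistency recommendation, expressed as commands: a minimal sketch assuming the itzg image's bundled rcon-cli and the world folder layout seen under /data; the WORLD_SRC host bind-mount path is a placeholder, not the verified path on nullstone.

```bash
#!/usr/bin/env bash
# Consistent world backup (F-10 sketch): pause saves, flush, tar, resume.
# WORLD_SRC is a placeholder for the host bind-mount behind /data.
set -euo pipefail

WORLD_SRC="/opt/docker/minecraft/data"   # placeholder path
DEST="/opt/backups/mc-world-backup-$(date +%F-%H%M).tar.gz"

docker exec minecraft-mc rcon-cli save-off               # stop background chunk saves
trap 'docker exec minecraft-mc rcon-cli save-on' EXIT    # always re-enable saving
docker exec minecraft-mc rcon-cli save-all flush         # force pending data to disk
sleep 10                                                 # give the flush time to settle

( cd "$WORLD_SRC" && tar -czf "$DEST" world*/ )          # world, world_nether, world_the_end, ...
ls -lh "$DEST"
```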

Detailed Methodology

Inputs inspected (read-only, no writes)

| Source | Path / command | Method |
| --- | --- | --- |
| Container env | docker inspect minecraft-mc | host shell |
| docker-compose | /opt/docker/minecraft/docker-compose.yml | host cat |
| AuthLimbo config | /data/plugins/AuthLimbo/config.yml | docker exec cat |
| AuthLimbo logs | /data/plugins/AuthLimbo/ (no log files exist; only config.yml) | docker exec ls |
| AuthMe config | /data/plugins/AuthMe/config.yml | docker exec cat |
| AuthMe DB record for YOU500 | /data/plugins/AuthMe/authme.db | docker exec python3 sqlite3 |
| CoreProtect config | /data/plugins/CoreProtect/config.yml | docker exec cat |
| CoreProtect DB size | /data/plugins/CoreProtect/database.db | docker exec du -sh |
| Server log | /data/logs/latest.log | docker exec grep |
| Paper / Spigot / Purpur configs | /data/config/paper-*.yml, /data/spigot.yml, /data/purpur.yml | docker exec cat |
| World sizes | /data/world*/ | docker exec du -sh |
| Backup script (deployed) | /opt/docker/backup.sh | host cat |
| Backup script (repo) | /home/admin/ai-lab/_github/minecraft-server/scripts/backup.sh | local cat |
| Backup output | /opt/backups/ | host stat |
| Backup log | /opt/backups/backup.log | host tail |
| Live state | RCON tps, list | docker exec rcon-cli |
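
Representative commands behind the table, for reproducibility (a sketch; all read-only, run from a host shell on nullstone):

```bash
# Read-only inspection commands behind the table above (sketch).
docker inspect minecraft-mc                                        # env, limits, restart policy
cat /opt/docker/minecraft/docker-compose.yml
docker exec minecraft-mc cat /data/plugins/AuthLimbo/config.yml
docker exec minecraft-mc du -sh /data/plugins/CoreProtect/database.db
docker exec minecraft-mc sh -c 'du -sh /data/world*/'
docker exec minecraft-mc grep -n "YOU500" /data/logs/latest.log
stat /opt/backups/*.tar.gz
tail -n 50 /opt/backups/backup.log
docker exec minecraft-mc rcon-cli tps
```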

YOU500 incident timeline (reconstructed from latest.log)

| Time (BST 2026-05-07) | Event |
| --- | --- |
| 17:13:34 | Login from 45.157.234.219, UUID c7c2df8e-...-686b |
| 17:13:35 | Spawned in auth_limbo (0.5, 128, 0.5) per AuthLimbo platform default |
| 17:13:38 | AuthMe: "YOU500 logged in" |
| 17:13:39 | AuthLimbo: "Restoring YOU500 to world(2380.4, 69.9, -11358.4)" |
| 17:13:39 | YOU500 left the confines of this world — void death |
| 17:13:39 | [AuthLimbo] teleportAsync returned false for YOU500 — Paper may have rejected the location. |
| 17:15:33 | Disconnect |
| 17:15:39 | Re-login from 82.22.5.229. Stored auth-loc has now been UPDATED to (-264.6, 86, -49.8) — different from the first attempt. Either user /sethome'd previously or AuthMe overwrote on the void death. |
| 17:15:44 | AuthLimbo: "Restoring YOU500 to world(-264.6, 86.0, -49.8)" — no WARN this time |
| 17:15:53 | Disconnect |
| 17:16:00 | Re-login from 82.22.5.230 |
| 17:16:05 | AuthLimbo: "Restoring YOU500 to world(-264.6, 86.0, -49.8)" |
| 17:16:28 | YOU500 was blown up by Creeper |
| 17:16:57 | Operator (s8n) RCON: tpa YOU500 -264 86 -50 + tell YOU500 grab items fast 5min despawn |
| 17:17:02 | RCON teleport executed |
| 17:18:22 | s8n in-game: /tp2p YOU500 s8n |

The void death at 17:13:39 is the data-loss event. AuthMe had SaveQuitLocation: true, so (2380, 70, -11358) was a real prior position, but the chunk was almost certainly not loaded yet (11k blocks out, no recent player there; a quick read-only check follows the list below). teleportAsync returned false either because:

  • the chunk failed to load within Paper's async generation budget, or
  • the entity was already dead (void death raced ahead of teleport).
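
Whether that area was ever generated can be checked without touching the server, assuming the default Anvil region layout (one .mca file per 512x512-block region, regionX = blockX >> 9):

```bash
# Was the failed target (2380, ~70, -11358) ever generated? With 512x512-block
# Anvil regions, 2380 >> 9 = 4 and -11358 >> 9 = -23, i.e. r.4.-23.mca.
docker exec minecraft-mc ls -lh /data/world/region/r.4.-23.mca
# A "No such file" result would mean the chunk had to be generated on demand,
# which is consistent with the async-generation-budget theory above.
```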

What CoreProtect WOULD have caught (and didn't)

CoreProtect inventory tracking is enabled (item-transactions: true, item-drops: true, item-pickups: true, rollback-items: true). However:

  • A void death drops items into the world, where they despawn after ~5 min. Drops are item entities, not container transactions; CoreProtect logs them as drops only if a player was the immediate cause of the drop.
  • A death-drop in the auth_limbo world (where the void death happened) falls into y<0 air, which is itself a non-event for CoreProtect.
  • Thus there was no item-rollback path even if co inspect had been run within minutes.

Implication: CoreProtect is the wrong tool for death-drop recovery. A real death-drop plugin or keepInventory is the only fix.

Backup script forensics

  • Deployed: 88 lines, last block is "Prune old backups". No Minecraft block. No umask 077.
  • Repo: 131 lines (with malformed lines 119-122 leftover from a bad merge — ALSO a bug to fix on the next push). Has the Minecraft block. Has umask 077.
  • /opt/backups/backup.log shows the last 5 days of "Backup complete" entries averaging 8-12K. None contain MC data. None mention MC. The log line "Configs: partial (some files missing)" comes from the configs section misfiring on Matrix paths; it never referred to the MC block. (Commands to repeat these checks follow this list.)
  • Last verified-good MC archive on host: /opt/backups/mc-plugins-prerebrand-2026-04-30.tar.gz (one-shot pre-rebrand snapshot; contents not verified in this audit).
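
The forensic checks above can be repeated read-only; this is a sketch, and the archive glob follows the naming seen in /opt/backups/ rather than a confirmed pattern:

```bash
# List archive sizes and confirm the nightly tarballs contain no Minecraft data.
ls -lh /opt/backups/*.tar.gz
for f in /opt/backups/*.tar.gz; do
  printf '%s: ' "$f"
  tar -tzf "$f" | grep -cE 'world/|playerdata/|plugins/' || true   # 0 = no MC paths
done

# Compare the deployed 88-line stub with the 131-line repo version.
diff -u /opt/docker/backup.sh \
        /home/admin/ai-lab/_github/minecraft-server/scripts/backup.sh | less
```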

Action Items (Prioritised)

P0 — this week (by 2026-05-14)

  1. F-01 / Backups. Sync the deployed backup.sh with the repo. Fix the lines 119-122 corruption in the repo first. Add a post-run sentinel: [ "$(stat -c%s mc-world-backup-*.tar.gz)" -gt 1073741824 ] || log "WORLD BACKUP TOO SMALL — ABORT" (a slightly more defensive sketch follows this list). Run a manual backup, verify >= 5 GB on disk. Test a restore into a scratch dir.
  2. F-02 / Item-loss safety net. Decide policy. Recommend: enable keepInventory true in auth_limbo world only (cheap, narrow), and write a 50-line AuthLimbo extension OnPlayerDeath listener that detects "death in auth_limbo" -> restore inventory snapshot taken at AuthMeAsyncPreLogin. Survival pain preserved everywhere else.
  3. F-03 / AuthLimbo recovery. Bump WARN to ERROR. Wire to existing Discord webhook (per workspace memory: webhook stack on nullstone). On failure, write player snapshot to auth_limbo/incidents/<uuid>-<ts>.dat.
  4. F-04 / Chunk preload race. Add chunk-loaded check + sync force-load before teleportAsync. If still false, kick with friendly message instead of letting the player drop into limbo.
  5. F-05 / OOM headroom. Lower -Xmx to 14 GB and add docker events watcher.
  6. F-06 / Container hardening. Add cap_drop, pids_limit, no-new-privileges. Boot test in a window.
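
A slightly more defensive version of the item 1 sentinel, as a sketch: it picks the newest archive explicitly (the stat glob in the one-liner breaks once more than one archive matches). The archive name pattern and the log() helper are carried over from backup.sh and are assumptions about that script, not verified contents.

```bash
# Post-run sentinel for backup.sh (sketch): fail the run if the newest world
# archive is missing or under 1 GiB. log() is assumed to exist in backup.sh.
latest=$(ls -t /opt/backups/mc-world-backup-*.tar.gz 2>/dev/null | head -n 1)
if [ -z "$latest" ] || [ "$(stat -c%s "$latest")" -le 1073741824 ]; then
  log "WORLD BACKUP TOO SMALL OR MISSING - ABORT"
  exit 1
fi
log "World backup OK: ${latest} ($(du -h "$latest" | cut -f1))"
```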

P1 — this month

  1. F-07 CoreProtect prune cron (sketch after this list), plan MySQL migration.
  2. F-08 SHA256 -> BCRYPT migration with legacyHashes fallback.
  3. F-09 Document online-mode=false rationale in RULES.md.
  4. F-10 Consider LVM/ZFS snapshot for backup atomicity.
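
For the F-07 purge, a monthly cron entry is enough; a sketch, assuming the itzg image's bundled rcon-cli and a schedule picked arbitrarily here. Remember F-15: the command's output lands in the server console, not in the rcon-cli session, so check docker logs afterwards.

```bash
# /etc/cron.d/coreprotect-purge (sketch): purge CoreProtect entries older than
# 30 days on the 1st of each month at 05:00. Schedule and file name are examples.
0 5 1 * * root docker exec minecraft-mc rcon-cli co purge t:30d >> /var/log/coreprotect-purge.log 2>&1
```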

P2 — this quarter

  1. F-11 Drop EZShop after player communication window.
  2. F-12 Tighten entity tracking range, re-profile with spark.
  3. F-13 Clean orphan plugin folders (orphan-check sketch after this list).
  4. F-14 Wire spark TPS alerts to Discord.
  5. F-15 Document RCON async-command behaviour.
  6. F-17 Rotate RCON password, move to .env.
  7. F-18 Add restart-policy max_attempts.
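
A first pass at the F-13 orphan check can be scripted; this is a heuristic sketch only (it matches the part of the folder name before the first dash against jar filenames, so renamed jars like EssentialsX or carbonchat-paper still match), and nothing should be deleted without reviewing the output.

```bash
# Heuristic orphan check (F-13 sketch): flag plugin data folders with no jar
# whose filename contains the folder base name. Review before deleting anything.
docker exec minecraft-mc sh -c '
  cd /data/plugins || exit 1
  for d in */; do
    n=${d%/}
    ls *.jar 2>/dev/null | grep -qiF -- "${n%%-*}" || echo "no matching jar: $n"
  done
'
```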

Open Questions for the Operator

  1. Inventory restoration policy. Is silent keepInventory only in auth_limbo acceptable, or do you want a manual ops-restore-from-snapshot approval gate?
  2. YOU500 specifically. Is there an out-of-band record of what they were carrying (Discord screenshot, witness)? If yes, manual NBT injection into player.dat is feasible. CoreProtect cannot help.
  3. Chunk preload trade-off. Force-loading distant chunks at login adds 200-2000ms to login time. Acceptable vs the void-death risk?
  4. MySQL for CoreProtect. Adds an operational dependency (another container, another backup target). Worth the complexity, or is monthly purge to keep SQLite under 1 GB sufficient?
  5. RCON password rotation. The committed value should be rotated on principle. Schedule a maintenance window?
  6. online-mode=false. Confirm long-term stance. Mojang ToS implications for racked.ru?
  7. Backups offsite. Currently /opt/backups/ is on the same host as the server. Plan for an offsite copy (B2, restic to friend-PC, anything)? A minimal restic sketch follows this list.
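
If the answer to question 7 is restic (as BACKUP-STRATEGY.md proposes, off-host to onyx via Tailscale), the core loop is small. A sketch; the repository path, SSH user, and password file are placeholders, not existing config.

```bash
# Offsite copy of /opt/backups with restic (sketch). Repo path, SSH user and
# password file are placeholders; onyx reachability assumes the Tailscale link.
export RESTIC_REPOSITORY="sftp:backup@onyx:/srv/restic/minecraft"
export RESTIC_PASSWORD_FILE="/root/.restic-pass"

restic init                                   # first run only
restic backup /opt/backups --tag minecraft
restic forget --tag minecraft --keep-daily 7 --keep-weekly 4 --keep-monthly 6 --prune
restic check                                  # periodic integrity verification
```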

What was NOT in scope this audit

  • Network firewall, fail2ban, host-side security (nullstone-server has its own audit folder).
  • Plugin source-supply-chain audit (covered by docs/ROADMAP.md "plugin acquisition overhaul").
  • Performance profiling under load (deferred per F-12).
  • LuckPerms permission graph correctness.
  • Rules / chat-format / prefix audit (workspace memory: do NOT touch LP prefixes).
  • Per-region (Lands / Homestead) data integrity.

Sign-off

| Field | Value |
| --- | --- |
| Audit date | 2026-05-07 |
| Method | Read-only SSH inspection, no fixes applied |
| Workspace rule applied | "Audit findings -> docs first, then fix" |
| Next action | Operator review + go/no-go on each P0 item |
| Next audit due | 2026-08-07 (quarterly), or sooner after backups remediated |