minecraft-server/AUDIT-2026-05-07.md
s8n a1cc3940cf docs: 2026-05-07 incident audit + backup strategy
Player YOU500 lost full inventory to AuthLimbo void-death at 17:13:39.
Investigation revealed that the deployed /opt/docker/backup.sh is an 88-line
stub missing the Minecraft block; the last successful world backup was
2026-05-02 (already pruned). No recoverable .dat exists.

Files:
- AUDIT-2026-05-07.md — server-side findings F-01..F-06 (P0 backups,
  no-keepInventory, AuthLimbo silent failure, chunk preload race,
  Xmx > container headroom, container hardening gaps)
- BACKUP-HUNT-2026-05-07.md — exhaustive backup scan; only 6-week-old
  archive at _archive/minecraft-old-2026-04-27.tar.gz
- BACKUP-STRATEGY.md — restic-based plan; 5min/hourly/daily classes,
  off-host to onyx via Tailscale, monthly drill
- CROSS-REFERENCE-2026-05-07.md — repo+doc landing map; flags
  pre-existing infra/STATE.md backup-broken note + HA-CLUSTER restic
  draft to extend rather than duplicate
- docs/RUNBOOK-BACKUP-RESTORE.md — operator runbook for .dat restore,
  full-world restore, host-loss restore, drill log
2026-05-07 17:33:24 +01:00

Minecraft Server Audit — racked.ru

Container: minecraft-mc on nullstone (192.168.0.100)
Date: 2026-05-07
Audit type: Operational / data-integrity (NOT a network-security audit)
Auditor: Claude (Opus 4.7) via SSH read-only inspection
Catalyst: Player YOU500 void-died at login (~17:13:39 BST), inventory lost. No usable backup existed.


Executive Summary

Status: Critical issues found. Risk score model: Likelihood (1-5) x Impact (1-5) = 1-25. >=15 = High, >=20 = Critical.

A live AuthLimbo "teleportAsync returned false" warning fired during YOU500's first login of the day, immediately after the player "left the confines of this world" (void death in the auth_limbo world). The player retried twice. On retry #3 they were teleported to (-264.6, 86, -49.8) and 23 seconds later were blown up by a Creeper. The console operator (s8n) attempted recovery via RCON, but neither the void death nor the Creeper death had item-restore data, because:

  1. No working backups. /opt/docker/backup.sh as deployed on nullstone is a stale 88-line copy missing the entire Minecraft block. The repo version (scripts/backup.sh) has the block but was never deployed. The daily 02:00 cron has been running for at least 7 days, producing 8-12 KB archives that contain no world, playerdata, or plugins content. BACKUP.md claims the script handles MC; it does not.
  2. CoreProtect tracks inventory transactions but not death drops. co inspect will not surface "dropped on death" entries the way it does pickup/drop, and even if it did, the 1.5 GB SQLite blob is approaching the point where /co rollback over an inventory radius is operationally slow.
  3. No keepInventory rule, no death-drop rescue plugin. With difficulty=hard, gamemode=survival, and no Essentials keepinv permission flow visible, every death is a total loss.
  4. AuthLimbo has no death-listener and no failure remediation. When teleportAsync returns false, the player is dropped at limbo spawn and the warning is logged at WARN level only — no alert, no rollback, no temp-stash of inventory.
  5. JVM heap sized larger than the container headroom allows. JVM_OPTS=-Xmx16384M inside an 18G container limit with MEMORY_SIZE=16G; if the Aikar G1 heap actually grows to Xmx, the >2 GB of off-heap usage (Netty, mmaps, zip cache) on top triggers a kernel OOM kill of the container. Restart-on-OOM has no warning hook to Discord/Matrix (a minimal watcher sketch follows this list).
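
On the last point, a minimal OOM watcher can be driven straight off the Docker events stream. The sketch below assumes a Discord webhook URL exported as DISCORD_WEBHOOK_URL and a host with GNU date; both the variable and the script name are placeholders, not something that exists on nullstone today.

```bash
#!/usr/bin/env bash
# oom-watch.sh (sketch): alert when any container is OOM-killed.
# DISCORD_WEBHOOK_URL is a hypothetical env var; point it at the existing webhook stack.
set -euo pipefail

docker events --filter 'event=oom' --format '{{.Time}} {{.Actor.Attributes.name}}' |
while read -r ts name; do
  msg="OOM kill: container ${name} at $(date -d "@${ts}" '+%F %T')"
  logger -t oom-watch "$msg"
  curl -fsS -X POST -H 'Content-Type: application/json' \
       -d "{\"content\": \"${msg}\"}" \
       "$DISCORD_WEBHOOK_URL" || true
done
```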

Three biggest exposures

  1. Backups silently broken for 7+ days. (Critical — 5x4=20)
  2. No item-loss safety net for any cause of death. (Critical — 4x5=20)
  3. AuthLimbo failure path has no recovery. (High — 4x4=16)

Findings Table

Severity = Likelihood x Impact. P0 = act this week, P1 = this month, P2 = this quarter.

| ID | Severity | Finding | Recommendation | Effort |
| --- | --- | --- | --- | --- |
| F-01 | P0 / 20 | /opt/docker/backup.sh on nullstone is missing the entire MC backup block. Repo scripts/backup.sh has it but was never deployed. Daily backups since 2026-04-30 are 8-12K (effectively empty). | Sync the deployed script with the repo, run a manual backup, verify the world tarball is >= 5 GB. Add a sentinel check to backup.sh that fails the run if mc-world-backup-*.tar.gz < 1 GB. | 30 min |
| F-02 | P0 / 20 | No keepInventory rule and no essentials.keepinv permission. Every death is total loss. | Decide policy: (a) gamerule keepInventory true server-wide, (b) keep-inv only when death cause is "void"/"plugin teleport", or (c) auto-restore-on-AuthLimbo-failure. The narrow option (b) preserves survival pain while plugging the AuthLimbo data-loss vector. Plugin candidates: KeepInventoryOnVoid, DeathChestPro, custom listener in AuthLimbo. | 1-2h research, 1d implement |
| F-03 | P0 / 18 | AuthLimbo logs WARN on teleport failure but has no alerting or recovery. The player is left at limbo spawn (y128 platform) where they re-disconnect and on retry get teleported normally — but the warning never reaches an operator. | (a) Bump "teleportAsync returned false" to ERROR. (b) Add a Discord/Matrix webhook alert via the existing webhook stack. (c) On failure: snapshot player inventory, kick with a friendly message, write recovery file auth_limbo/incident-<uuid>-<ts>.dat for ops replay. | 1d |
| F-04 | P0 / 18 | YOU500's first failed teleport target was (2380.4, 69.9, -11358.4) — that's 11k blocks out and the chunk likely was not loaded yet. AuthLimbo's preload-chunks: true setting fires on AuthMeAsyncPreLoginEvent, which may not run before LoginEvent in HaHaWTH's AuthMe fork. Exact timing race is unverified. | Add a chunk-loaded assertion in AuthLimbo before calling teleportAsync; if not loaded, force-load synchronously OR delay the teleport another 10-20 ticks. Add debug logging of chunk-load state in the WARN line. | 0.5d |
| F-05 | P0 / 16 | JVM -Xmx16384M inside container mem_limit=18G with no headroom for off-heap (Netty buffers, native mmaps, mod metadata). Aikar flags + 25 plugins easily push native to 2-3 GB. Kernel OOM kill is silent. | Either (a) lower -Xmx to 12-14 GB and use a MaxRAMPercentage-style flag, OR (b) raise mem_limit to 24 GB. Also add oom_score_adj and a docker events --filter event=oom watcher that pings Discord. | 1h config + 2h alerting |
| F-06 | P0 / 16 | No pids_limit, no cap_drop: ALL, no read_only: true. Container runs with the default Docker capability set (CAP_NET_RAW, CAP_SYS_CHROOT, etc.) it does not need. | Add cap_drop: [ALL], cap_add: [NET_BIND_SERVICE] (only if binding <1024; 25565 is high so likely none), pids_limit: 4096, security_opt: [no-new-privileges:true]. Test boot, watch for startup failures. | 1h test |
| F-07 | P1 / 15 | CoreProtect SQLite at 1.5 GB. Performance and reliability degrade past 2-3 GB. database.db is the only copy; no WAL checkpoint or vacuum schedule. | (a) Migrate to a MySQL/MariaDB sidecar container. (b) Add a monthly cron co purge t:30d (purge entries older than 30 days; CoreProtect docs). (c) Schedule VACUUM after the purge. | 1d for MySQL migration, 1h for purge cron |
| F-08 | P1 / 12 | AuthMe still on passwordHash: SHA256 (legacy). Migration plan for SHA256 -> BCRYPT is on the TODO list and still pending. | Set legacyHashes: [SHA256] and passwordHash: BCRYPT. AuthMe re-hashes on the next successful login. Communicate "your password works as before, no action needed". | 30 min config + monitoring |
| F-09 | P1 / 12 | online-mode=false. Server depends entirely on AuthMe + EpicGuard for identity. EpicGuard config not audited in this pass. | Verify enableProtection: false in AuthMe (currently false) is intentional, since geofencing is US, GB, LOCALHOST only — any user from another country is locked out if protection is re-enabled. Document the choice in RULES.md. | 1h doc only |
| F-10 | P1 / 12 | auto-save-interval: 2400 (= 2 minutes at 20 TPS) is fine, BUT paper-global.yml has player-auto-save: rate: -1 (= use auto-save-interval, so also 2 min). A player who joins, dies, and disconnects within 2 min may have NO post-death snapshot persisted before the player.dat is overwritten by their next login. Player save does fire on quit, but if the death happens and the player keeps moving / interacting before logout, items in chunks not yet saved are at risk for tar-while-running backups. | Set player-auto-save: rate: 1200 (= 1 min). Switch the backup strategy to save-off + save-all flush + tar + save-on to guarantee consistency (sketch after this table), OR snapshot the host bind-mount with a filesystem-level snapshot (LVM / btrfs / ZFS). | 30 min config, 0.5d for snapshot path |
| F-11 | P2 / 10 | EZShop-1.0-SNAPSHOT.jar is bundled alongside AuctionHouse-1.4.6.jar. A PLUGIN_ALTERNATIVES.md TODO calls for dropping EZShop. | Remove EZShop, migrate any active shops to AuctionHouse, document the migration in docs/migrations/. | 0.5d player communication, 1h technical |
| F-12 | P2 / 10 | Spigot entity-tracking-range: monsters 96, misc 96. Roadmap suggests tightening to monster=32, misc=16 for TPS / network savings. | Tune in the next maintenance window, re-baseline TPS with a spark profile. | 1h config, 1d to verify under load |
| F-13 | P2 / 9 | 21 plugin folders without a matching jar (orphans): bStats, CarbonChat, ComfyWhitelist, EpicGuard, Essentials, faststats, GrimAC, Homestead, Lands, LPC, MarriageMaster, MiniMOTD, Multiverse-Core, PhantomSMP, TAB, UltimateTimber, UnexpectedSpawn, Vault, WorldEdit, plus .bak-* directories. Most have a renamed jar (carbonchat-paper-...jar, EssentialsX-...jar) so this is mostly cosmetic. Lands, LPC, MarriageMaster, PhantomSMP, UltimateTimber, UnexpectedSpawn are truly orphaned: jars not present. | Audit each: delete data dirs of plugins truly removed; the bStats/Essentials/Vault names are normal. Document the plugin-name <-> jar-name pattern in PLUGINS.md. | 1h |
| F-14 | P2 / 9 | No TPS Discord webhook alert (mentioned on TODO). spark is installed but auto-profile + alerting are not wired up. | spark already supports spark profile --thresholds; route to Discord via the existing webhook stack. | 0.5d |
| F-15 | P2 / 8 | RCON output for async commands (CoreProtect, LuckPerms) does not return to the issuing rcon-cli session. Found while trying co inspect from RCON. Async command results land in the console only. | Document this in docs/OPERATIONS.md (does not exist yet — create it). For automation, attach to docker logs -f minecraft-mc in parallel. | 30 min doc |
| F-16 | P2 / 8 | gamerule keepInventory could not be queried via rcon-cli due to an execute in <world> run argument-parsing bug in itzg's rcon-cli wrapper (or RCON quoting). State unknown without an in-game console. | Verify in-game as an op user, document the rcon-cli limitation. | 5 min in-game |
| F-17 | P2 / 6 | RCON_PASSWORD is committed to docker-compose.yml in plaintext (*redacted*). The RCON port (25575) is bound to 127.0.0.1 so the blast radius is local only — but the secret is still in git history. | Rotate the password, move it to .env (gitignored), confirm the 127.0.0.1-only binding stays. | 30 min |
| F-18 | P2 / 6 | restart: unless-stopped with no start_period re-evaluation on rapid OOM-restart loops. If the container OOMs every 60s, Docker keeps restarting indefinitely. | Add restart_policy: { condition: on-failure, max_attempts: 5, window: 300s } (compose v3+ deploy block) and a watchdog alert. | 30 min |
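
F-10's consistency recommendation, expressed as commands: a minimal sketch assuming the itzg image's bundled rcon-cli and the world folder layout seen under /data; the WORLD_SRC host bind-mount path is a placeholder, not the verified path on nullstone.

```bash
#!/usr/bin/env bash
# Consistent world backup (F-10 sketch): pause saves, flush, tar, resume.
# WORLD_SRC is a placeholder for the host bind-mount behind /data.
set -euo pipefail

WORLD_SRC="/opt/docker/minecraft/data"   # placeholder path
DEST="/opt/backups/mc-world-backup-$(date +%F-%H%M).tar.gz"

docker exec minecraft-mc rcon-cli save-off               # stop background chunk saves
trap 'docker exec minecraft-mc rcon-cli save-on' EXIT    # always re-enable saving
docker exec minecraft-mc rcon-cli save-all flush         # force pending data to disk
sleep 10                                                 # give the flush time to settle

( cd "$WORLD_SRC" && tar -czf "$DEST" world*/ )          # world, world_nether, world_the_end, ...
ls -lh "$DEST"
```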

Detailed Methodology

Inputs inspected (read-only, no writes)

| Source | Path / command | Method |
| --- | --- | --- |
| Container env | docker inspect minecraft-mc | host shell |
| docker-compose | /opt/docker/minecraft/docker-compose.yml | host cat |
| AuthLimbo config | /data/plugins/AuthLimbo/config.yml | docker exec cat |
| AuthLimbo logs | /data/plugins/AuthLimbo/ (no log files exist; only config.yml) | docker exec ls |
| AuthMe config | /data/plugins/AuthMe/config.yml | docker exec cat |
| AuthMe DB record for YOU500 | /data/plugins/AuthMe/authme.db | docker exec python3 sqlite3 |
| CoreProtect config | /data/plugins/CoreProtect/config.yml | docker exec cat |
| CoreProtect DB size | /data/plugins/CoreProtect/database.db | docker exec du -sh |
| Server log | /data/logs/latest.log | docker exec grep |
| Paper / Spigot / Purpur configs | /data/config/paper-*.yml, /data/spigot.yml, /data/purpur.yml | docker exec cat |
| World sizes | /data/world*/ | docker exec du -sh |
| Backup script (deployed) | /opt/docker/backup.sh | host cat |
| Backup script (repo) | /home/admin/ai-lab/_github/minecraft-server/scripts/backup.sh | local cat |
| Backup output | /opt/backups/ | host stat |
| Backup log | /opt/backups/backup.log | host tail |
| Live state | RCON tps, list | docker exec rcon-cli |
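
Representative commands behind the table, for reproducibility (a sketch; all read-only, run from a host shell on nullstone):

```bash
# Read-only inspection commands behind the table above (sketch).
docker inspect minecraft-mc                                        # env, limits, restart policy
cat /opt/docker/minecraft/docker-compose.yml
docker exec minecraft-mc cat /data/plugins/AuthLimbo/config.yml
docker exec minecraft-mc du -sh /data/plugins/CoreProtect/database.db
docker exec minecraft-mc sh -c 'du -sh /data/world*/'
docker exec minecraft-mc grep -n "YOU500" /data/logs/latest.log
stat /opt/backups/*.tar.gz
tail -n 50 /opt/backups/backup.log
docker exec minecraft-mc rcon-cli tps
```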

YOU500 incident timeline (reconstructed from latest.log)

| Time (BST 2026-05-07) | Event |
| --- | --- |
| 17:13:34 | Login from 45.157.234.219, UUID c7c2df8e-...-686b |
| 17:13:35 | Spawned in auth_limbo (0.5, 128, 0.5) per AuthLimbo platform default |
| 17:13:38 | AuthMe: "YOU500 logged in" |
| 17:13:39 | AuthLimbo: "Restoring YOU500 to world(2380.4, 69.9, -11358.4)" |
| 17:13:39 | YOU500 left the confines of this world — void death |
| 17:13:39 | [AuthLimbo] teleportAsync returned false for YOU500 — Paper may have rejected the location. |
| 17:15:33 | Disconnect |
| 17:15:39 | Re-login from 82.22.5.229. Stored auth-loc has now been UPDATED to (-264.6, 86, -49.8) — different from the first attempt. Either user /sethome'd previously or AuthMe overwrote on the void death. |
| 17:15:44 | AuthLimbo: "Restoring YOU500 to world(-264.6, 86.0, -49.8)" — no WARN this time |
| 17:15:53 | Disconnect |
| 17:16:00 | Re-login from 82.22.5.230 |
| 17:16:05 | AuthLimbo: "Restoring YOU500 to world(-264.6, 86.0, -49.8)" |
| 17:16:28 | YOU500 was blown up by Creeper |
| 17:16:57 | Operator (s8n) RCON: tpa YOU500 -264 86 -50 + tell YOU500 grab items fast 5min despawn |
| 17:17:02 | RCON teleport executed |
| 17:18:22 | s8n in-game: /tp2p YOU500 s8n |

The void death at 17:13:39 is the data-loss event. AuthMe had SaveQuitLocation: true, so (2380, 70, -11358) was a real prior position, but the chunk was almost certainly not loaded yet (11k blocks out, no recent player there; a quick read-only check follows the list below). teleportAsync returned false either because:

  • the chunk failed to load within Paper's async generation budget, or
  • the entity was already dead (void death raced ahead of teleport).
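
Whether that area was ever generated can be checked without touching the server, assuming the default Anvil region layout (one .mca file per 512x512-block region, regionX = blockX >> 9):

```bash
# Was the failed target (2380, ~70, -11358) ever generated? With 512x512-block
# Anvil regions, 2380 >> 9 = 4 and -11358 >> 9 = -23, i.e. r.4.-23.mca.
docker exec minecraft-mc ls -lh /data/world/region/r.4.-23.mca
# A "No such file" result would mean the chunk had to be generated on demand,
# which is consistent with the async-generation-budget theory above.
```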

What CoreProtect WOULD have caught (and didn't)

CoreProtect inventory tracking is enabled (item-transactions: true, item-drops: true, item-pickups: true, rollback-items: true). However:

  • A void death drops items into the world, where they despawn after ~5 min. Drops are item entities, not container transactions; CoreProtect logs them as drops only if a player was the immediate cause of the drop.
  • A death-drop in the auth_limbo world (where the void death happened) falls into y<0 air, which is itself a non-event for CoreProtect.
  • Thus there was no item-rollback path even if co inspect had been run within minutes.

Implication: CoreProtect is the wrong tool for death-drop recovery. A real death-drop plugin or keepInventory is the only fix.

Backup script forensics

  • Deployed: 88 lines, last block is "Prune old backups". No Minecraft block. No umask 077.
  • Repo: 131 lines (with malformed lines 119-122 leftover from a bad merge — ALSO a bug to fix on the next push). Has the Minecraft block. Has umask 077.
  • /opt/backups/backup.log shows the last 5 days of "Backup complete" entries averaging 8-12K. None contain MC data. None mention MC. The log line "Configs: partial (some files missing)" comes from the configs section misfiring on Matrix paths; it never referred to the MC block. (Commands to repeat these checks follow this list.)
  • Last verified-good MC archive on host: /opt/backups/mc-plugins-prerebrand-2026-04-30.tar.gz (one-shot pre-rebrand snapshot; contents not verified in this audit).
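
The forensic checks above can be repeated read-only; this is a sketch, and the archive glob follows the naming seen in /opt/backups/ rather than a confirmed pattern:

```bash
# List archive sizes and confirm the nightly tarballs contain no Minecraft data.
ls -lh /opt/backups/*.tar.gz
for f in /opt/backups/*.tar.gz; do
  printf '%s: ' "$f"
  tar -tzf "$f" | grep -cE 'world/|playerdata/|plugins/' || true   # 0 = no MC paths
done

# Compare the deployed 88-line stub with the 131-line repo version.
diff -u /opt/docker/backup.sh \
        /home/admin/ai-lab/_github/minecraft-server/scripts/backup.sh | less
```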

Action Items (Prioritised)

P0 — this week (by 2026-05-14)

  1. F-01 / Backups. Sync the deployed backup.sh with the repo. Fix the lines 119-122 corruption in the repo first. Add a post-run sentinel: [ "$(stat -c%s mc-world-backup-*.tar.gz)" -gt 1073741824 ] || log "WORLD BACKUP TOO SMALL — ABORT" (a slightly more defensive sketch follows this list). Run a manual backup, verify >= 5 GB on disk. Test a restore into a scratch dir.
  2. F-02 / Item-loss safety net. Decide policy. Recommend: enable keepInventory true in auth_limbo world only (cheap, narrow), and write a 50-line AuthLimbo extension OnPlayerDeath listener that detects "death in auth_limbo" -> restore inventory snapshot taken at AuthMeAsyncPreLogin. Survival pain preserved everywhere else.
  3. F-03 / AuthLimbo recovery. Bump WARN to ERROR. Wire to existing Discord webhook (per workspace memory: webhook stack on nullstone). On failure, write player snapshot to auth_limbo/incidents/<uuid>-<ts>.dat.
  4. F-04 / Chunk preload race. Add chunk-loaded check + sync force-load before teleportAsync. If still false, kick with friendly message instead of letting the player drop into limbo.
  5. F-05 / OOM headroom. Lower -Xmx to 14 GB and add docker events watcher.
  6. F-06 / Container hardening. Add cap_drop, pids_limit, no-new-privileges. Boot test in a window.
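
A slightly more defensive version of the item 1 sentinel, as a sketch: it picks the newest archive explicitly (the stat glob in the one-liner breaks once more than one archive matches). The archive name pattern and the log() helper are carried over from backup.sh and are assumptions about that script, not verified contents.

```bash
# Post-run sentinel for backup.sh (sketch): fail the run if the newest world
# archive is missing or under 1 GiB. log() is assumed to exist in backup.sh.
latest=$(ls -t /opt/backups/mc-world-backup-*.tar.gz 2>/dev/null | head -n 1)
if [ -z "$latest" ] || [ "$(stat -c%s "$latest")" -le 1073741824 ]; then
  log "WORLD BACKUP TOO SMALL OR MISSING - ABORT"
  exit 1
fi
log "World backup OK: ${latest} ($(du -h "$latest" | cut -f1))"
```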

P1 — this month

  1. F-07 CoreProtect prune cron (sketch after this list), plan MySQL migration.
  2. F-08 SHA256 -> BCRYPT migration with legacyHashes fallback.
  3. F-09 Document online-mode=false rationale in RULES.md.
  4. F-10 Consider LVM/ZFS snapshot for backup atomicity.
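
For the F-07 purge, a monthly cron entry is enough; a sketch, assuming the itzg image's bundled rcon-cli and a schedule picked arbitrarily here. Remember F-15: the command's output lands in the server console, not in the rcon-cli session, so check docker logs afterwards.

```bash
# /etc/cron.d/coreprotect-purge (sketch): purge CoreProtect entries older than
# 30 days on the 1st of each month at 05:00. Schedule and file name are examples.
0 5 1 * * root docker exec minecraft-mc rcon-cli co purge t:30d >> /var/log/coreprotect-purge.log 2>&1
```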

P2 — this quarter

  1. F-11 Drop EZShop after player communication window.
  2. F-12 Tighten entity tracking range, re-profile with spark.
  3. F-13 Clean orphan plugin folders (orphan-check sketch after this list).
  4. F-14 Wire spark TPS alerts to Discord.
  5. F-15 Document RCON async-command behaviour.
  6. F-17 Rotate RCON password, move to .env.
  7. F-18 Add restart-policy max_attempts.
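
A first pass at the F-13 orphan check can be scripted; this is a heuristic sketch only (it matches the part of the folder name before the first dash against jar filenames, so renamed jars like EssentialsX or carbonchat-paper still match), and nothing should be deleted without reviewing the output.

```bash
# Heuristic orphan check (F-13 sketch): flag plugin data folders with no jar
# whose filename contains the folder base name. Review before deleting anything.
docker exec minecraft-mc sh -c '
  cd /data/plugins || exit 1
  for d in */; do
    n=${d%/}
    ls *.jar 2>/dev/null | grep -qiF -- "${n%%-*}" || echo "no matching jar: $n"
  done
'
```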

Open Questions for the Operator

  1. Inventory restoration policy. Is silent keepInventory only in auth_limbo acceptable, or do you want a manual ops-restore-from-snapshot approval gate?
  2. YOU500 specifically. Is there an out-of-band record of what they were carrying (Discord screenshot, witness)? If yes, manual NBT injection into player.dat is feasible. CoreProtect cannot help.
  3. Chunk preload trade-off. Force-loading distant chunks at login adds 200-2000ms to login time. Acceptable vs the void-death risk?
  4. MySQL for CoreProtect. Adds an operational dependency (another container, another backup target). Worth the complexity, or is monthly purge to keep SQLite under 1 GB sufficient?
  5. RCON password rotation. The committed value should be rotated on principle. Schedule a maintenance window?
  6. online-mode=false. Confirm long-term stance. Mojang ToS implications for racked.ru?
  7. Backups offsite. Currently /opt/backups/ is on the same host as the server. Plan for an offsite copy (B2, restic to friend-PC, anything)? A minimal restic sketch follows this list.
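
If the answer to question 7 is restic (as BACKUP-STRATEGY.md proposes, off-host to onyx via Tailscale), the core loop is small. A sketch; the repository path, SSH user, and password file are placeholders, not existing config.

```bash
# Offsite copy of /opt/backups with restic (sketch). Repo path, SSH user and
# password file are placeholders; onyx reachability assumes the Tailscale link.
export RESTIC_REPOSITORY="sftp:backup@onyx:/srv/restic/minecraft"
export RESTIC_PASSWORD_FILE="/root/.restic-pass"

restic init                                   # first run only
restic backup /opt/backups --tag minecraft
restic forget --tag minecraft --keep-daily 7 --keep-weekly 4 --keep-monthly 6 --prune
restic check                                  # periodic integrity verification
```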

What was NOT in scope this audit

  • Network firewall, fail2ban, host-side security (nullstone-server has its own audit folder).
  • Plugin source-supply-chain audit (covered by docs/ROADMAP.md "plugin acquisition overhaul").
  • Performance profiling under load (deferred per F-12).
  • LuckPerms permission graph correctness.
  • Rules / chat-format / prefix audit (workspace memory: do NOT touch LP prefixes).
  • Per-region (Lands / Homestead) data integrity.

Sign-off

| Field | Value |
| --- | --- |
| Audit date | 2026-05-07 |
| Method | Read-only SSH inspection, no fixes applied |
| Workspace rule applied | "Audit findings -> docs first, then fix" |
| Next action | Operator review + go/no-go on each P0 item |
| Next audit due | 2026-08-07 (quarterly), or sooner after backups remediated |