**Post-audit hardening status (2026-05-07):**

- **H2 (F-06):** `cap_drop: ALL` plus minimum `cap_add` (CHOWN, SETUID, SETGID, FOWNER), `no-new-privileges`, `deploy.resources.limits.pids: 4096`. `compose config` validates. DAC_OVERRIDE deliberately omitted; re-add only if the entrypoint chown fails.
- **H3 (F-05):** `-Xmx` 16384M -> 14336M, `MEMORY_SIZE` 16G -> 14G. Leaves ~3.5G headroom for off-heap inside the unchanged 18G container limit. The host has no spare RAM to raise the cap (other workloads).
- **H1 (F-02):** Server-wide `gamerule keepInventory true` planned, but the RCON path for `gamerule` is broken (F-16), so it is deferred to the operator in-game on the next op session. Documented in `INTERIM-MITIGATIONS.md` with a clear revert trigger (when AuthLimbo F1+F2+F4 ship).
- **H4:** Pre-edit compose backed up to `docker-compose.yml.bak-2026-05-07-before-H2H3` (deployed and repo). Restore commands are in `INTERIM-MITIGATIONS.md`. Live restart deferred: 2 players online (s8n actively restoring YOU500's gear via `/give`). H2/H3 go live on the next compose recreate.
# Minecraft Server Audit — racked.ru

**Container:** `minecraft-mc` on nullstone (192.168.0.100)

**Date:** 2026-05-07

**Audit type:** Operational / data-integrity (NOT a network-security audit)

**Auditor:** Claude (Opus 4.7) via SSH read-only inspection

**Catalyst:** Player **YOU500** void-died at login (~17:13:39 BST), inventory lost. No usable backup existed.

---

## Executive Summary

**Status:** Critical issues found.

**Risk score model:** Likelihood (1-5) x Impact (1-5) = 1-25. >=15 = High, >=20 = Critical.

A live AuthLimbo `teleportAsync returned false` warning fired during YOU500's first login of the day, immediately after `YOU500 left the confines of this world` (void death in the `auth_limbo` world). The player retried twice. On retry #3 they were teleported to (-264.6, 86, -49.8) and 23 seconds later `was blown up by Creeper`. Console operator (s8n) attempted recovery via RCON, but neither the void death nor the creeper death left recoverable item data, because:

1. **No working backups.** `/opt/docker/backup.sh` deployed on nullstone is a stale 88-line copy missing the entire Minecraft block. The repo version (`scripts/backup.sh`) has the block but **was never deployed**. The daily 02:00 cron has been running for at least 7 days, producing 8-12K archives that contain no world / playerdata / plugins. `BACKUP.md` claims the script handles MC; it does not.

2. **CoreProtect tracks inventory transactions but not death drops.** `co inspect` will not surface "dropped on death" entries the way it does pickup/drop, and even if it did, the 1.5 GB SQLite blob is approaching the point where `/co rollback` over an inventory radius is operationally slow.

3. **No `keepInventory` rule, no death-drop rescue plugin.** With `difficulty=hard`, `gamemode=survival`, and no Essentials `keepinv` permission flow visible, every death is a total loss.

4. **AuthLimbo has no death listener and no failure remediation.** When `teleportAsync` returns false, the player is dropped at limbo spawn and the warning is logged at WARN level only — no alert, no rollback, no temporary stash of inventory.

5. **JVM heap leaves too little headroom under the container limit.** `JVM_OPTS=-Xmx16384M` inside an `18G` container limit with `MEMORY_SIZE=16G`; if the Aikar-tuned G1 heap actually grows to Xmx and off-heap (Netty, mmaps, zip cache) exceeds 2 GB, the **kernel OOM-kills the container**. Restart-on-OOM has no warning hook to Discord/Matrix.

**Three biggest exposures**

1. Backups silently broken for 7+ days. (Critical — 5x4=20)
2. No item-loss safety net for any cause of death. (Critical — 4x5=20)
3. AuthLimbo failure path has no recovery. (High — 4x4=16)

---
## Findings Table

Severity = Likelihood x Impact. P0 = act this week, P1 = this month, P2 = this quarter.

| ID | Severity | Finding | Recommendation | Effort |
|----|----------|---------|----------------|--------|
| F-01 | **P0 / 20** | `/opt/docker/backup.sh` on nullstone is missing the entire MC backup block. Repo `scripts/backup.sh` has it but was never deployed. Daily backups since 2026-04-30 are 8-12K (effectively empty). | Sync the deployed script with the repo, run a manual backup, verify the world tarball >= 5 GB. Add a sentinel check to backup.sh that fails the run if `mc-world-backup-*.tar.gz` < 1 GB. | 30 min |
| F-02 | **P0 / 20** | No `keepInventory` rule and no `essentials.keepinv` permission. Every death is total loss. | Decide policy: (a) `gamerule keepInventory true` server-wide, (b) keep-inv only when death cause is "void"/"plugin teleport", or (c) auto-restore-on-AuthLimbo-failure. The narrow option (b) preserves survival pain while plugging the AuthLimbo data-loss vector. Plugin candidates: `KeepInventoryOnVoid`, `DeathChestPro`, custom listener in AuthLimbo. | 1-2h research, 1d implement |
| F-03 | **P0 / 18** | AuthLimbo logs WARN on teleport failure but has no alerting or recovery. The player is left at limbo spawn (y128 platform) where they re-disconnect and on retry get teleported normally — but the warning never reaches an operator. | (a) Bump `teleportAsync returned false` to ERROR. (b) Add a Discord/Matrix webhook alert via the existing webhook stack. (c) On failure: snapshot player inventory, kick with a friendly message, write recovery file `auth_limbo/incident-<uuid>-<ts>.dat` for ops replay. | 1d |
| F-04 | **P0 / 18** | YOU500's first failed teleport target was (2380.4, 69.9, -11358.4) — that's 11k blocks out and the chunk likely was not loaded yet. AuthLimbo's `preload-chunks: true` setting fires on `AuthMeAsyncPreLoginEvent`, which may not run before `LoginEvent` in HaHaWTH's AuthMe fork. Exact timing race is unverified. | Add a chunk-loaded assertion in AuthLimbo before calling `teleportAsync`; if not loaded, force-load synchronously OR delay the teleport another 10-20 ticks. Add debug logging of chunk-load state in the WARN line. | 0.5d |
| F-05 | **P0 / 16** | JVM `-Xmx16384M` inside container `mem_limit=18G` with no headroom for off-heap (Netty buffers, native mmaps, mod metadata). Aikar flags + 25 plugins easily push native to 2-3 GB. Kernel OOM kill is silent. | Either (a) lower `-Xmx` to 12-14 GB (or switch to a `MaxRAMPercentage`-style flag), OR (b) raise `mem_limit` to 24 GB. Also add `oom_score_adj` and a `docker events --filter event=oom` watcher that pings Discord. | 1h config + 2h alerting |
| F-06 | **P0 / 16** | No `pids_limit`, no `cap_drop: ALL`, no `read_only: true`. Container runs with the default Docker capability set (CAP_NET_RAW, CAP_SYS_CHROOT, etc.) it does not need. | Add `cap_drop: [ALL]`, `cap_add: [NET_BIND_SERVICE]` (only if binding <1024; 25565 is high so likely none), `pids_limit: 4096`, `security_opt: [no-new-privileges:true]`. Test boot, watch for startup failures. | 1h test |
| F-07 | **P1 / 15** | CoreProtect SQLite at 1.5 GB. Performance and reliability degrade past 2-3 GB. `database.db` is the only copy; no WAL checkpoint or vacuum schedule. | (a) Migrate to a MySQL/MariaDB sidecar container. (b) Add a monthly cron `co purge t:30d` (purge entries older than 30 days; CoreProtect docs). (c) Schedule `VACUUM` after purge. | 1d for MySQL migration, 1h for purge cron |
| F-08 | **P1 / 12** | AuthMe still on `passwordHash: SHA256` (legacy). The SHA256 -> BCRYPT migration is on the TODO list and still pending. | Set `legacyHashes: [SHA256]` and `passwordHash: BCRYPT`. AuthMe re-hashes on the next successful login. Communicate "your password works as before, no action needed". | 30 min config + monitoring |
| F-09 | **P1 / 12** | `online-mode=false`. Server depends entirely on AuthMe + EpicGuard for identity. EpicGuard config not audited in this pass. | Verify that `enableProtection: false` in AuthMe (currently false) is intentional, since geofencing covers `US, GB, LOCALHOST` only — any user from another country would be locked out if protection were re-enabled. Document the choice in `RULES.md`. | 1h doc only |
| F-10 | **P1 / 12** | `auto-save-interval: 2400` (= 2 minutes at 20 TPS) is fine, BUT `paper-global.yml` has `player-auto-save: rate: -1` (= use auto-save-interval, so also 2 min). A player who joins, dies, and disconnects within 2 min may have NO post-death snapshot persisted before the player.dat is overwritten by their next login. Player save *does* fire on quit, but if the death happens and the player keeps moving / interacting before logout, items in chunks not yet saved are at risk for tar-while-running backups. | Set `player-auto-save: rate: 1200` (= 1 min). Switch the backup strategy to `save-off` + `save-all flush` + tar + `save-on` to guarantee consistency, OR snapshot the host bind-mount with a filesystem-level snapshot (LVM / btrfs / ZFS). | 30 min config, 0.5d for snapshot path |
| F-11 | **P2 / 10** | `EZShop-1.0-SNAPSHOT.jar` is bundled alongside `AuctionHouse-1.4.6.jar`. PLUGIN_ALTERNATIVES.md TODO calls for dropping EZShop. | Remove EZShop, migrate any active shops to AuctionHouse, document the migration in `docs/migrations/`. | 0.5d player communication, 1h technical |
| F-12 | **P2 / 10** | Spigot `entity-tracking-range`: monsters 96, misc 96. Roadmap suggests tightening to monster=32, misc=16 for TPS / network savings. | Tune on the next maintenance window, re-baseline TPS with a `spark` profile. | 1h config, 1d to verify under load |
| F-13 | **P2 / 9** | 21 plugin folders without a matching jar (orphans): `bStats`, `CarbonChat`, `ComfyWhitelist`, `EpicGuard`, `Essentials`, `faststats`, `GrimAC`, `Homestead`, `Lands`, `LPC`, `MarriageMaster`, `MiniMOTD`, `Multiverse-Core`, `PhantomSMP`, `TAB`, `UltimateTimber`, `UnexpectedSpawn`, `Vault`, `WorldEdit`, plus `.bak-*` directories. Most have a renamed jar (`carbonchat-paper-...jar`, `EssentialsX-...jar`), so this is mostly cosmetic. `Lands`, `LPC`, `MarriageMaster`, `PhantomSMP`, `UltimateTimber`, `UnexpectedSpawn` are truly orphaned: jars not present. | Audit each: delete the data dirs of plugins truly removed; the bStats/Essentials/Vault names are normal. Document the plugin-name <-> jar-name pattern in `PLUGINS.md`. | 1h |
| F-14 | **P2 / 9** | No TPS Discord webhook alert (mentioned on TODO). spark is installed but auto-profile + alerting are not wired up. | spark already supports `spark profile --thresholds`; route to Discord via the existing webhook stack. | 0.5d |
| F-15 | **P2 / 8** | RCON output for async commands (CoreProtect, LuckPerms) does not return to the issuing rcon-cli session. Found while trying `co inspect` from RCON. Async command results land in the console only. | Document this in `docs/OPERATIONS.md` (does not exist yet — create it). For automation, attach to `docker logs -f minecraft-mc` in parallel. | 30 min doc |
| F-16 | **P2 / 8** | `gamerule keepInventory` could not be queried via `rcon-cli` due to an `execute in <world> run` argument parsing bug in itzg's rcon-cli wrapper (or RCON quoting). State unknown without an in-game console. | Verify in-game as an op user, document the rcon-cli limitation. | 5 min in-game |
| F-17 | **P2 / 6** | `RCON_PASSWORD` is committed to `docker-compose.yml` in plaintext (`*redacted*`). The RCON port (25575) is bound to `127.0.0.1`, so the blast radius is local only — but the secret is still in git history. | Rotate the password, move it to `.env` (gitignored), confirm the `127.0.0.1`-only binding stays. | 30 min |
| F-18 | **P2 / 6** | `restart: unless-stopped` with no `start_period` re-evaluation on rapid OOM-restart loops. If the container OOMs every 60s, Docker keeps restarting it indefinitely. | Add `restart_policy: { condition: on-failure, max_attempts: 5, window: 300s }` (compose v3+ deploy block) and a watchdog alert. | 30 min |
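
F-05's watcher recommendation, spelled out as a minimal sketch. The webhook URL is a placeholder and no such watcher exists on nullstone yet; run it under a `Restart=always` systemd unit so it survives daemon restarts:

```bash
#!/usr/bin/env bash
# oom-watch.sh: hypothetical helper; pings Discord when a container is OOM-killed.
set -euo pipefail

DISCORD_WEBHOOK_URL="${DISCORD_WEBHOOK_URL:?export a real webhook URL first}"

# `docker events` blocks forever and emits one line per matching event.
docker events --filter 'event=oom' --format '{{.Actor.Attributes.name}}' |
while read -r container; do
  curl -sS -H 'Content-Type: application/json' \
    -d "{\"content\": \"OOM kill: container ${container} on nullstone\"}" \
    "$DISCORD_WEBHOOK_URL"
done
```
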
---
## Detailed Methodology

### Inputs inspected (read-only, no writes)

| Source | Path | Method |
|--------|------|--------|
| Container env | `docker inspect minecraft-mc` | host shell |
| docker-compose | `/opt/docker/minecraft/docker-compose.yml` | host cat |
| AuthLimbo config | `/data/plugins/AuthLimbo/config.yml` | `docker exec cat` |
| AuthLimbo logs | `/data/plugins/AuthLimbo/` (no log files exist; only `config.yml`) | `docker exec ls` |
| AuthMe config | `/data/plugins/AuthMe/config.yml` | `docker exec cat` |
| AuthMe DB record for YOU500 | `/data/plugins/AuthMe/authme.db` | `docker exec python3 sqlite3` |
| CoreProtect config | `/data/plugins/CoreProtect/config.yml` | `docker exec cat` |
| CoreProtect DB size | `/data/plugins/CoreProtect/database.db` | `docker exec du -sh` |
| Server log | `/data/logs/latest.log` | `docker exec grep` |
| Paper / Spigot / Purpur configs | `/data/config/paper-*.yml`, `/data/spigot.yml`, `/data/purpur.yml` | `docker exec cat` |
| World sizes | `/data/world*/` | `docker exec du -sh` |
| Backup script (deployed) | `/opt/docker/backup.sh` | host cat |
| Backup script (repo) | `/home/admin/ai-lab/_github/minecraft-server/scripts/backup.sh` | local cat |
| Backup output | `/opt/backups/` | host stat |
| Backup log | `/opt/backups/backup.log` | host tail |
| Live state | RCON `tps`, `list` | `docker exec rcon-cli` |
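
For reproducibility, a few of the table's methods expanded into full commands (standard docker CLI; everything here reads, nothing writes):

```bash
# container memory limit in bytes, from the host
docker inspect minecraft-mc --format '{{.HostConfig.Memory}}'

# CoreProtect database size, inside the container
docker exec minecraft-mc du -sh /data/plugins/CoreProtect/database.db

# live TPS via RCON (the itzg image wires rcon-cli to the server's RCON port)
docker exec minecraft-mc rcon-cli tps
```
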
### YOU500 incident timeline (reconstructed from `latest.log`)
| Time (BST 2026-05-07) | Event |
|-----------------------|-------|
| 17:13:34 | Login from 45.157.234.219, UUID c7c2df8e-...-686b |
| 17:13:35 | Spawned in `auth_limbo` (0.5, 128, 0.5) per AuthLimbo platform default |
| 17:13:38 | AuthMe: "YOU500 logged in" |
| 17:13:39 | AuthLimbo: "Restoring YOU500 to world(2380.4, 69.9, -11358.4)" |
| 17:13:39 | **`YOU500 left the confines of this world`** — void death |
| 17:13:39 | **`[AuthLimbo] teleportAsync returned false for YOU500 — Paper may have rejected the location.`** |
| 17:15:33 | Disconnect |
| 17:15:39 | Re-login from 82.22.5.229. Stored auth-loc has now been UPDATED to (-264.6, 86, -49.8) — different from the first attempt. Either the user `/sethome`'d previously or AuthMe overwrote it on the void death. |
| 17:15:44 | AuthLimbo: "Restoring YOU500 to world(-264.6, 86.0, -49.8)" — no WARN this time |
| 17:15:53 | Disconnect |
| 17:16:00 | Re-login from 82.22.5.230 |
| 17:16:05 | AuthLimbo: "Restoring YOU500 to world(-264.6, 86.0, -49.8)" |
| 17:16:28 | **`YOU500 was blown up by Creeper`** |
| 17:16:57 | Operator (s8n) RCON: `tpa YOU500 -264 86 -50` + `tell YOU500 grab items fast 5min despawn` |
| 17:17:02 | RCON teleport executed |
| 17:18:22 | s8n in-game: `/tp2p YOU500 s8n` |

The void death at 17:13:39 is the data-loss event. AuthMe had `SaveQuitLocation: true`, so (2380, 70, -11358) was a real prior position, but the chunk was almost certainly not loaded yet (11k blocks out, no recent player there). `teleportAsync` returned false either because:

- the chunk failed to load within Paper's async generation budget, or
- the entity was already dead (the void death raced ahead of the teleport).
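
The timeline was reconstructed with greps along these lines; the exact patterns used during the audit were not recorded, so treat these as illustrative:

```bash
# every log line touching the player, then the plugin warning specifically
docker exec minecraft-mc grep -n 'YOU500' /data/logs/latest.log
docker exec minecraft-mc grep -n 'teleportAsync returned false' /data/logs/latest.log
```
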
### What CoreProtect WOULD have caught (and didn't)

CoreProtect inventory tracking is enabled (`item-transactions: true`, `item-drops: true`, `item-pickups: true`, `rollback-items: true`). However:

- A void death drops items into the world for ~5 min, then they despawn. Drops are item entities, not container transactions; CoreProtect logs them as drops only if a player was the immediate cause of the drop.
- A death-drop in the `auth_limbo` world (where the void death happened) falls into the air below y=0, which is itself a non-event for CP.
- Thus there was no item-rollback path even if `co inspect` had been run within minutes.

**Implication:** CoreProtect is the wrong tool for death-drop recovery. A real death-drop plugin or `keepInventory` is the only fix.

### Backup script forensics

- Deployed: 88 lines; the last block is "Prune old backups". No Minecraft block. No `umask 077`.
- Repo: 131 lines (with malformed lines 119-122 left over from a bad merge — ALSO a bug to fix on the next push). Has the Minecraft block. Has `umask 077`.
- `/opt/backups/backup.log` shows the last 5 days of "Backup complete" entries averaging 8-12K. None contain MC data. None mention MC. The log line `Configs: partial (some files missing)` is the configs section misfiring on Matrix paths and was never the MC block.
- Last verified-good MC archive on host: `/opt/backups/mc-plugins-prerebrand-2026-04-30.tar.gz` (one-shot pre-rebrand snapshot; contents not verified in this audit).
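
Two quick checks that reproduce the forensics above (paths as listed; the missing Minecraft block shows up as one large hunk in the diff):

```bash
# deployed vs repo script drift
diff -u /opt/docker/backup.sh \
        /home/admin/ai-lab/_github/minecraft-server/scripts/backup.sh

# archive sizes: anything in the 8-12K range contains no world data
ls -lh /opt/backups/
```
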
---
## Action Items (Prioritised)
### P0 — this week (by 2026-05-14)
1. **F-01 / Backups.** Sync the deployed backup.sh with the repo. Fix the lines 119-122 corruption in the repo first. Add a post-run sentinel: `[ "$(stat -c%s mc-world-backup-*.tar.gz)" -gt 1073741824 ] || log "WORLD BACKUP TOO SMALL — ABORT"` (a hardened variant is sketched after this list). Run a manual backup, verify >= 5 GB on disk. Test a restore into a scratch dir.

2. **F-02 / Item-loss safety net.** Decide policy. Recommend: enable `keepInventory true` in the `auth_limbo` world only (cheap, narrow), and write a 50-line AuthLimbo extension `OnPlayerDeath` listener that detects "death in auth_limbo" -> restore the inventory snapshot taken at AuthMeAsyncPreLogin. Survival pain preserved everywhere else. **[H1, 2026-05-07]** Interim: server-wide `gamerule keepInventory true` planned but **deferred** — the RCON command path can't reach `gamerule` (see F-16). Operator must run `/gamerule keepInventory true` in-game on the next op session. Revert plan documented in `INTERIM-MITIGATIONS.md` (revert when AuthLimbo F1+F2+F4 ship).

3. **F-03 / AuthLimbo recovery.** Bump WARN to ERROR. Wire to the existing Discord webhook (per workspace memory: webhook stack on nullstone). On failure, write a player snapshot to `auth_limbo/incidents/<uuid>-<ts>.dat`.

4. **F-04 / Chunk preload race.** Add a chunk-loaded check + sync force-load before `teleportAsync`. If still false, kick with a friendly message instead of letting the player drop into limbo.

5. **F-05 / OOM headroom.** Lower `-Xmx` to 14 GB and add a `docker events` watcher. **[H3, 2026-05-07]** `-Xms8192M -Xmx14336M` + `MEMORY_SIZE: "14G"` written to `docker-compose.yml` (both deployed + repo). Container limit unchanged at 18G — the host is 31G total / ~13G free, and other workloads need the rest. Goes live on the next compose recreate (deferred — 2 players online). The `docker events` watcher remains TODO.

6. **F-06 / Container hardening.** Add `cap_drop`, `pids_limit`, `no-new-privileges`. Boot-test in a window (verification one-liner after this list). **[H2, 2026-05-07]** `cap_drop: [ALL]` + `cap_add: [CHOWN, SETUID, SETGID, FOWNER]` + `security_opt: [no-new-privileges:true]` + `deploy.resources.limits.pids: 4096` written to `docker-compose.yml`. `compose config --quiet` validates clean. DAC_OVERRIDE deliberately omitted — add only if the entrypoint chown fails. Goes live on next recreate. Backup of the pre-edit compose at `/opt/docker/minecraft/docker-compose.yml.bak-2026-05-07-before-H2H3`.
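
For item 1, a hardened variant of the sentinel: a minimal sketch assuming backup.sh already defines a `log()` helper, and using a `BACKUP_DIR` placeholder for wherever the archives land:

```bash
# Fail the run if the newest world archive is missing or under 1 GiB.
latest=$(ls -t "${BACKUP_DIR}"/mc-world-backup-*.tar.gz 2>/dev/null | head -n1)
if [ -z "$latest" ] || [ "$(stat -c%s "$latest")" -lt 1073741824 ]; then
  log "WORLD BACKUP TOO SMALL — ABORT (got: ${latest:-none})"
  exit 1
fi
```

For item 6, a post-recreate verification one-liner (`json` is a built-in `docker inspect` template function):

```bash
docker inspect minecraft-mc --format \
  '{{json .HostConfig.CapDrop}} {{json .HostConfig.CapAdd}} pids={{json .HostConfig.PidsLimit}}'
```
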
### P1 — this month
7. **F-07** CoreProtect purge cron, plan the MySQL migration (cron sketch after this list).

8. **F-08** SHA256 -> BCRYPT migration with `legacyHashes` fallback.

9. **F-09** Document the `online-mode=false` rationale in RULES.md.

10. **F-10** Consider LVM/ZFS snapshot for backup atomicity (consistent-backup sketch after this list).
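
For item 7, the purge cron as a sketch. `co purge t:30d` is the syntax already cited in F-07; the script path and schedule are placeholders, and per F-15 the command's output lands in the server console, not in the rcon-cli session:

```bash
#!/usr/bin/env bash
# coreprotect-purge.sh (hypothetical path: /opt/docker/coreprotect-purge.sh)
# cron entry: 30 4 1 * *  root  /opt/docker/coreprotect-purge.sh
set -euo pipefail

# Purge CoreProtect entries older than 30 days.
docker exec minecraft-mc rcon-cli co purge t:30d
```

Run the follow-up `VACUUM` only with the server stopped; SQLite locking against a live server process is unsafe.

For item 10's cheap interim (the F-10 `save-off` path), a consistency-first backup sketch; the world folder names and the host bind-mount path are assumptions:

```bash
#!/usr/bin/env bash
set -euo pipefail
# Consistent world tarball: pause saves, flush to disk, tar, resume.
# The trap re-enables saving even if tar fails midway.
trap 'docker exec minecraft-mc rcon-cli save-on' EXIT
docker exec minecraft-mc rcon-cli save-off
docker exec minecraft-mc rcon-cli save-all flush
tar -czf "/opt/backups/mc-world-backup-$(date +%F).tar.gz" \
    -C /opt/docker/minecraft/data world world_nether world_the_end
```
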
### P2 — this quarter
11. **F-11** Drop EZShop after a player communication window.

12. **F-12** Tighten entity tracking range, re-profile with spark.

13. **F-13** Clean orphan plugin folders.

14. **F-14** Wire spark TPS alerts to Discord.

15. **F-15** Document RCON async-command behaviour.

16. **F-17** Rotate the RCON password, move it to `.env` (rotation sketch after this list).

17. **F-18** Add restart-policy max_attempts.
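
For item 16, a rotation sketch. It assumes the compose file is edited in the same change to read the secret via variable substitution (`RCON_PASSWORD: "${RCON_PASSWORD}"`), which Compose resolves from `.env`; the openssl recipe is illustrative:

```bash
cd /opt/docker/minecraft
# Generate and store the new secret outside git.
echo "RCON_PASSWORD=$(openssl rand -base64 24)" >> .env
chmod 600 .env
grep -qx '.env' .gitignore || echo '.env' >> .gitignore
# Applying recreates the container; fold this into the pending H2/H3 recreate.
docker compose up -d
```
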
---
## Open Questions for the Operator
1. **Inventory restoration policy.** Is silent `keepInventory` only in `auth_limbo` acceptable, or do you want a manual ops-restore-from-snapshot approval gate?

2. **YOU500 specifically.** Is there an out-of-band record of what they were carrying (Discord screenshot, witness)? If yes, manual NBT injection into player.dat is feasible. CoreProtect cannot help.

3. **Chunk preload trade-off.** Force-loading distant chunks at login adds 200-2000 ms to login time. Acceptable vs the void-death risk?

4. **MySQL for CoreProtect.** Adds an operational dependency (another container, another backup target). Worth the complexity, or is a monthly purge to keep SQLite under 1 GB sufficient?

5. **RCON password rotation.** The committed value should be rotated on principle. Schedule a maintenance window?

6. **online-mode=false.** Confirm the long-term stance. Mojang ToS implications for racked.ru?

7. **Backups offsite.** Currently `/opt/backups/` is on the same host. Plan for an offsite copy (B2, restic to a friend's PC, anything)?
---
## What was NOT in scope this audit
- Network firewall, fail2ban, host-side security (nullstone-server has its own audit folder).
- Plugin source-supply-chain audit (covered by `docs/ROADMAP.md` "plugin acquisition overhaul").
- Performance profiling under load (deferred per F-12).
- LuckPerms permission-graph correctness.
- Rules / chat-format / prefix audit (workspace memory: do NOT touch LP prefixes).
- Per-region (Lands / Homestead) data integrity.
---
## Sign-off
| Field | Value |
|-------|-------|
| Audit date | 2026-05-07 |
| Method | Read-only SSH inspection; no fixes applied during the audit itself (post-audit hardening H1-H4 is tracked in the status note at the top) |
| Workspace rule applied | "Audit findings -> docs first, then fix" |
| Next action | Operator review + go/no-go on each P0 item |
| Next audit due | 2026-08-07 (quarterly), or sooner once backups are remediated |
|