diff --git a/AUDIT-2026-05-07.md b/AUDIT-2026-05-07.md
index 0b638cc..a930556 100644
--- a/AUDIT-2026-05-07.md
+++ b/AUDIT-2026-05-07.md
@@ -125,11 +125,11 @@ CoreProtect inventory tracking is enabled (`item-transactions: true`, `item-drop
 ### P0 — this week (by 2026-05-14)
 
 1. **F-01 / Backups.** Sync deployed backup.sh with repo. Fix the lines 119-122 corruption in repo first. Add post-run sentinel: `[ "$(stat -c%s mc-world-backup-*.tar.gz)" -gt 1073741824 ] || log "WORLD BACKUP TOO SMALL — ABORT"`. Run manual backup, verify >= 5 GB on disk. Test a restore into a scratch dir.
-2. **F-02 / Item-loss safety net.** Decide policy. Recommend: enable `keepInventory true` in `auth_limbo` world only (cheap, narrow), and write a 50-line AuthLimbo extension `OnPlayerDeath` listener that detects "death in auth_limbo" -> restore inventory snapshot taken at AuthMeAsyncPreLogin. Survival pain preserved everywhere else.
+2. **F-02 / Item-loss safety net.** Decide policy. Recommend: enable `keepInventory true` in `auth_limbo` world only (cheap, narrow), and write a 50-line AuthLimbo extension `OnPlayerDeath` listener that detects "death in auth_limbo" -> restore inventory snapshot taken at AuthMeAsyncPreLogin. Survival pain preserved everywhere else. **[H1, 2026-05-07]** Interim: server-wide `gamerule keepInventory true` planned but **deferred** — RCON command path can't reach `gamerule` (see F-16). Operator must run `/gamerule keepInventory true` in-game on next op session. Revert plan documented in `INTERIM-MITIGATIONS.md` (revert when AuthLimbo F1+F2+F4 ship).
 3. **F-03 / AuthLimbo recovery.** Bump WARN to ERROR. Wire to existing Discord webhook (per workspace memory: webhook stack on nullstone). On failure, write player snapshot to `auth_limbo/incidents/-.dat`.
 4. **F-04 / Chunk preload race.** Add chunk-loaded check + sync force-load before `teleportAsync`. If still false, kick with friendly message instead of letting the player drop into limbo.
-5. **F-05 / OOM headroom.** Lower `-Xmx` to 14 GB and add `docker events` watcher.
-6. **F-06 / Container hardening.** Add `cap_drop`, `pids_limit`, `no-new-privileges`. Boot test in a window.
+5. **F-05 / OOM headroom.** Lower `-Xmx` to 14 GB and add `docker events` watcher. **[H3, 2026-05-07]** `-Xms8192M -Xmx14336M` + `MEMORY_SIZE: "14G"` written to `docker-compose.yml` (both deployed + repo). Container limit unchanged at 18G — host is 31G total / ~13G free, other workloads need the rest. Goes live on next compose recreate (deferred — 2 players online). `docker events` watcher remains TODO.
+6. **F-06 / Container hardening.** Add `cap_drop`, `pids_limit`, `no-new-privileges`. Boot test in a window. **[H2, 2026-05-07]** `cap_drop: [ALL]` + `cap_add: [CHOWN, SETUID, SETGID, FOWNER]` + `security_opt: [no-new-privileges:true]` + `deploy.resources.limits.pids: 4096` written to `docker-compose.yml`. `compose config --quiet` validates clean. DAC_OVERRIDE deliberately omitted — add only if entrypoint chown fails. Goes live on next recreate. Backup of pre-edit compose at `/opt/docker/minecraft/docker-compose.yml.bak-2026-05-07-before-H2H3`.
 
 ### P1 — this month
 
diff --git a/INTERIM-MITIGATIONS.md b/INTERIM-MITIGATIONS.md
new file mode 100644
index 0000000..bcc3b55
--- /dev/null
+++ b/INTERIM-MITIGATIONS.md
@@ -0,0 +1,146 @@
+# Interim Mitigations — 2026-05-07
+
+Server-level temporary workarounds applied while permanent fixes are pending.
+Each item lists its **revert trigger** so we don't carry these forever.
+
+---
+
+## H1 — `gamerule keepInventory true` (server-wide)
+
+**Status:** **NOT YET APPLIED LIVE.** The `gamerule` command is unreachable
+via the current RCON path — every variant attempted (`gamerule keepInventory
+true`, `minecraft:gamerule …`, `execute in minecraft:overworld run gamerule
+…`, lowercase, no value) returned `Incorrect argument for command` from
+Paper's command parser, and the command never appears as a "Rcon issued
+server command" line in `/data/logs/latest.log`. This matches AUDIT-2026-05-07
+finding **F-16**: rcon-cli quoting / Paper 1.21.11 brigadier interaction
+appears to swallow the gamerule command client-side.
+
+**Why:** Until AuthLimbo F1 (void-damage guard) and F2 (`teleportAsync` retry)
+ship in production, ANY login race that void-kills a transiting player
+results in full inventory + xp loss (see YOU500 incident, 2026-05-07
+17:13:39 BST). `keepInventory=true` server-wide is a blunt but sound safety
+net during the gap. Trade-off: removes survival death penalty everywhere,
+not just on auth-flow deaths.
+
+**To apply (operator action required, in-game):**
+
+1. Op-login as `s8n` (or any rank-4 op).
+2. In chat, run: `/gamerule keepInventory true`
+3. Verify: `/gamerule keepInventory` should reply `keepInventory is set to true`
+4. Note the date in this file under "Applied".
+
+**Applied:** _pending — deferred to operator while RCON gamerule path is
+broken (see F-16). Ask s8n to run it next time they're logged in. They
+were online 2026-05-07 17:47 BST restoring YOU500's gear — ideal moment
+missed; do it on next op session._
+
+**Revert trigger (drop this safety net):**
+
+When **AuthLimbo 1.1.0** is deployed with **all of**:
+- F1 (void-damage guard for `pendingTransit` UUIDs)
+- F2 (post-`teleportAsync==false` recovery: snap to limbo spawn + retry)
+- F4 (pre-empt AuthMe's own broken teleport at LOGIN-LOWEST)
+
+…AND those have been observed handling at least one production void-death
+race correctly (look for `[AuthLimbo] void-damage cancelled for ` or
+`teleportAsync recovered after retry` lines in latest.log).
+
+**Revert command (in-game):** `/gamerule keepInventory false`
+
+**Cross-reference:**
+- Audit: `/home/admin/ai-lab/_github/minecraft-server/AUDIT-2026-05-07.md` F-02, F-16
+- Plugin audit: `/home/admin/ai-lab/_github/auth-limbo/AUDIT-2026-05-07.md` F1, F2, F4
+- Plugin roadmap: `/home/admin/ai-lab/_github/auth-limbo/ROADMAP.md`
+
+---
+
+## H2 — Container capability hardening (compose)
+
+**Status:** Applied to compose file 2026-05-07. **NOT yet applied to running
+container** — change goes live on next `docker compose up -d --force-recreate`.
+
+**Reason for deferral:** 2 players online (s8n + YOU500) at the time of edit;
+operator was actively restoring inventory via `/give`. Restart deferred to
+avoid a second incident on the same player on the same day.
+
+**Restart command (when window opens, no players online or with announcement):**
+```bash
+ssh user@192.168.0.100 'docker compose -f /opt/docker/minecraft/docker-compose.yml down && docker compose -f /opt/docker/minecraft/docker-compose.yml up -d'
+```
+
+**Post-restart verification:**
+```bash
+# 1. Container came up healthy:
+docker ps --filter name=minecraft-mc --format '{{.Status}}'
+# Expected: "Up X seconds (healthy)" — wait 4-5 min for healthcheck.
+
+# 2. itzg entrypoint did its chowns successfully:
+docker logs minecraft-mc 2>&1 | grep -iE "(error|denied|cannot)" | head
+
+# 3. RCON still reachable:
+echo "list" | docker exec -i minecraft-mc rcon-cli
+```
+
+**If the container fails to start** (most likely cause: missing capability):
+1. Check logs for `chown: ... Operation not permitted` -> add `DAC_OVERRIDE`.
+2. Check for `setuid` / `setgid` errors -> already in cap_add, but verify spelling.
+3. Roll back: `cp /opt/docker/minecraft/docker-compose.yml.bak-2026-05-07-before-H2H3 /opt/docker/minecraft/docker-compose.yml && docker compose -f /opt/docker/minecraft/docker-compose.yml up -d --force-recreate`.
+
+**No revert trigger** — this is a permanent hardening, not a workaround.
+
+---
+
+## H3 — JVM Xmx lowered 16384M → 14336M (compose)
+
+**Status:** Applied to compose file 2026-05-07. **NOT yet applied to running
+container** — change goes live on the same restart that activates H2.
+
+**Reason:** AUDIT-2026-05-07 F-05 — original `-Xmx16384M` inside an 18 GB
+container leaves only 2 GB headroom for off-heap (Netty buffers, native mmaps,
+plugin metadata). With 25 plugins on Aikar G1 flags, native memory regularly
+sits 2-3 GB above heap. A player surge that pushes G1 to its full 16 GB
+ceiling results in a silent kernel OOM kill of the container.
+
+**Decision:** Lower Xmx (14 GB), do NOT raise the container limit. Host has
+31 GB RAM total with ~13 GB free at edit time, but nullstone runs other
+docker workloads (matrix, rocketchat, traefik, forgejo, etc.) and the 18 GB
+budget for MC was already aggressive. New layout: 14 GB heap + ~3.5 GB
+native + 0.5 GB direct buffers fits comfortably in 18 GB.
+
+**No revert trigger** — permanent. If TPS regresses under load due to
+heap pressure, raise Xmx in 1 GB steps and re-evaluate; don't blanket-revert.
+
+---
+
+## H4 — Compose backups (defence-in-depth)
+
+**Status:** Applied 2026-05-07.
+
+**Files saved:**
+- Deployed: `/opt/docker/minecraft/docker-compose.yml.bak-2026-05-07-before-H2H3`
+- Repo: `/home/admin/ai-lab/_github/minecraft-server/docker-compose.yml.bak-2026-05-07-before-H2H3`
+
+**Restore commands (if H2/H3 prove broken after restart):**
+```bash
+# Deployed (revert + restart):
+ssh user@192.168.0.100 'cp /opt/docker/minecraft/docker-compose.yml.bak-2026-05-07-before-H2H3 /opt/docker/minecraft/docker-compose.yml && docker compose -f /opt/docker/minecraft/docker-compose.yml up -d --force-recreate'
+
+# Repo:
+cp /home/admin/ai-lab/_github/minecraft-server/docker-compose.yml.bak-2026-05-07-before-H2H3 /home/admin/ai-lab/_github/minecraft-server/docker-compose.yml
+```
+
+**Backup retention:** keep both `.bak-2026-05-07-before-H2H3` files until
+the post-restart verification has been signed off (i.e. one full day of
+healthy uptime under load).
+
+---
+
+## Index of applied measures
+
+| ID  | Status               | Applied (live) | Reverts when                  |
+|-----|----------------------|----------------|-------------------------------|
+| H1  | Pending in-game op   | NO (deferred to operator: F-16 RCON path broken) | AuthLimbo 1.1.0 (F1+F2+F4) ships and proves itself in prod |
+| H2  | Compose edits saved  | NO (next restart) | never — permanent hardening |
+| H3  | Compose edits saved  | NO (next restart) | never — permanent |
+| H4  | Backups created      | YES | after H2/H3 prove healthy |
diff --git a/docker-compose.yml b/docker-compose.yml
index 77a5132..85c59ed 100644
--- a/docker-compose.yml
+++ b/docker-compose.yml
@@ -9,8 +9,11 @@ services:
       CUSTOM_SERVER: "https://api.purpurmc.org/v2/purpur/1.21.11/latest/download"
       VERSION: "1.21.11"
 
-      MEMORY_SIZE: "16G"
-      JVM_OPTS: "-Xms8192M -Xmx16384M"
+      # H3 (2026-05-07): Xmx lowered 16384M -> 14336M to leave ~4G headroom
+      # for off-heap (Netty buffers, native mmaps, plugin metadata) inside the
+      # 18G container limit. See AUDIT-2026-05-07.md F-05.
+      MEMORY_SIZE: "14G"
+      JVM_OPTS: "-Xms8192M -Xmx14336M"
 
       DIFFICULTY: hard
       GAMEMODE: survival
@@ -72,6 +75,20 @@ services:
     networks:
       - proxy
     restart: unless-stopped
+    # H2 (2026-05-07): Container hardening per AUDIT-2026-05-07.md F-06.
+    # Drop the default Docker capability set (CAP_NET_RAW, CAP_SYS_CHROOT, ...)
+    # which the JVM/Paper does not need. Re-add only the minimum needed by
+    # itzg's entrypoint chown/gosu flow. DAC_OVERRIDE intentionally omitted —
+    # add back only if entrypoint fails. NOT applied live until next restart.
+    cap_drop:
+      - ALL
+    cap_add:
+      - CHOWN
+      - SETUID
+      - SETGID
+      - FOWNER
+    security_opt:
+      - no-new-privileges:true
     healthcheck:
       test: ["CMD", "mc-health"]
       interval: 30s
@@ -83,6 +100,7 @@ services:
         limits:
           memory: 18G
           cpus: '6'
+          pids: 4096
         reservations:
           memory: 8G
     labels:
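
Reviewer note: the headroom reasoning behind H3 can be checked with plain integer arithmetic. The sketch below is illustrative only. `headroom_mib` is a hypothetical helper that does not exist in the repo, and it assumes the compose `memory: 18G` limit means 18 x 1024 MiB and that `-Xmx` is given in MiB.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the repo): MiB left for off-heap memory
# once the JVM heap ceiling is subtracted from the container limit.
# Assumes the container limit is given in GiB and -Xmx in MiB.
headroom_mib() {
  local limit_gib=$1 xmx_mib=$2
  echo $(( limit_gib * 1024 - xmx_mib ))
}

echo "before H3: $(headroom_mib 18 16384) MiB"   # 18G limit, -Xmx16384M
echo "after H3:  $(headroom_mib 18 14336) MiB"   # 18G limit, -Xmx14336M
```

Under these assumptions the change raises the off-heap budget from 2048 MiB to 4096 MiB, which lines up with the "~3.5 GB native + 0.5 GB direct buffers" layout described in H3.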