# AuthLimbo v2 — Architecture

Status: **Design draft** (no code). Drafted 2026-05-07 by the auth-limbo v2 design pass after the YOU500 / second-player void-death incidents. Audience: operator (P) and future contributors.

Companion docs:

- [`AUDIT-2026-05-07.md`](../AUDIT-2026-05-07.md) — root-cause forensic.
- [`ROADMAP.md`](../ROADMAP.md) — v1.x tracking (F1-F7).
- [`V2-ROADMAP.md`](V2-ROADMAP.md) — milestones M0-M5 for v2.

---

## 1. Why v2

v1 is a single-jar Paper plugin glued onto AuthMe. It works *most* of the time, but its core failure modes are now well understood and can't be patched away inside the v1 design:

| v1 limitation | v2 must address |
|---------------|-----------------|
| Player object exists on the main server *before* auth — coords/inventory technically restorable from RAM by buggy plugins; world chunk activity is observable. | Strong isolation: limbo is the only state the player can touch pre-auth. |
| Restore relies on AuthMe firing `LoginEvent`. AuthMe's own broken teleport runs in the same window — F4 pre-empts it, but the design still races. | Authoritative state machine that doesn't trust AuthMe's teleport at all. |
| Inventory loss on transit-death depends on F1 + F5 holding. There is no inventory-of-record outside live game state. | Snapshot-on-pre-login + snapshot-restore is a first-class subsystem, not a defensive add-on. |
| No metrics, no audit log, no admin alerting. Bugs only surface when a player loses gear. | Built-in observability: Prometheus + JSON Lines audit + Discord webhook. |
| No queue / login throttle. If 50 bots connect at once, AuthMe stalls. | Bounded concurrency with a transparent FIFO and trust tiers (NOT pay tiers). |

v2 is a clean break (`v2.0.0`), not a v1 patch. v1 keeps receiving F3, F5, F6, F7 backports for as long as racked.ru still runs the old jar.

---

## 2. Stack decision — **Paper-only**, with a Velocity-ready seam

**Recommendation: Paper-only single-server plugin for v2.0.0.** Velocity mode is a v2.x deferrable behind a feature flag.

### Reasoning

racked.ru today is one Purpur 1.21.11 server in the `minecraft-mc` itzg container on nullstone. There is no Velocity / BungeeCord, no second backend, no Forced Hosts, no proxy network. Adding Velocity just to ship a gatekeeper plugin would mean:

- standing up a new container and opening a new public port (or keeping 25565 on the proxy and 25566 internal),
- migrating the 12+ existing Paper plugins through the Velocity-Paper bridge contract for chat / commands / placeholders,
- a new TLS / RCON / proxy-protocol surface to harden,
- breaking changes to AuthMe's data flow (proxy-side login flow vs Paper-side `AuthMeAsyncPreLoginEvent`),
- one more thing for the operator to babysit.

The privacy property the operator cares about — *no other player sees pre-auth coords / inventory* — is achievable on Paper-only via a strictly isolated limbo world + audience scoping (see §4). Velocity adds *stronger* isolation (the player never reaches the backend at all), but the incremental privacy gain is small for a 0-10 player community, and the operational cost is large.

### When Velocity becomes worth it

Codify trip-wires up front so the decision isn't dragged out:

1. racked.ru splits into ≥2 backends (e.g. `survival` + `creative`) — you need a proxy anyway.
2. The cobblestone server comes online and shares an account/auth pool.
3. Botting attempts cross 100 connections/minute and `connection-throttle` + `firewalld` rate-limiting are no longer enough. Velocity + a queue plugin (Ajax / VelocityQueue) become operationally cheaper than chasing botnets at the application layer.

Until any of those, Paper-only is the right answer.
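Trip-wire 3 presumes the gatekeeper can measure connection rate at all. A minimal sketch of a trailing-window counter in plain Java — no Paper APIs; the class name `ConnectionRateWindow` is illustrative, not an existing class:

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Counts events in a trailing time window, e.g. connections in the last 60s. */
final class ConnectionRateWindow {
    private final long windowMillis;
    private final Deque<Long> stamps = new ArrayDeque<>();

    ConnectionRateWindow(long windowMillis) {
        this.windowMillis = windowMillis;
    }

    /** Record one connection at {@code nowMillis}; return the count inside the window. */
    synchronized int record(long nowMillis) {
        stamps.addLast(nowMillis);
        // Drop timestamps that have fallen out of the trailing window.
        while (!stamps.isEmpty() && nowMillis - stamps.peekFirst() >= windowMillis) {
            stamps.removeFirst();
        }
        return stamps.size();
    }
}
```

A gatekeeper would call `record(System.currentTimeMillis())` once per incoming pre-login and compare the result against the 100/minute trip-wire; the same structure, keyed per IP, would give the 1/minute `new`-tier throttle described in §3.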
### The Velocity-ready seam

The v2 internal API is split into two layers so the proxy migration is mechanical:

```
+-------------------------------+     +-------------------------------+
| Gatekeeper (proxy or paper)   |     | Restore (paper only)          |
|  - accept connection          |     |  - read snapshot              |
|  - check ban / rate limit     |     |  - chunk preload              |
|  - hold in limbo / queue      |     |  - authoritative TP           |
|  - hand off on auth-success   |     |  - publish metrics            |
+--------------+----------------+     +-------------------------------+
               |
               |  hand-off event (UUID, target Location, source IP)
               v
```

In v2.0 both layers live in the Paper plugin and the hand-off is just a local method call. In a future "v2-velo" the layers split: the gatekeeper runs as a Velocity plugin, restore stays on Paper, and the hand-off becomes a plugin-message channel. No code outside those two layers needs to change.

---

## 3. Queue model — login throttle + transparent trust tiers, NO 2b2t-style sale

**For 0-10 player normal load, queue depth is always 0 and players never see "queued" UI. The queue exists for crisis scenarios (bot flood, restart drain, AuthMe DB stall) and to define explicit policy even if it's rarely hit.**

### Policy

| Tier | Definition | Effect |
|------|------------|--------|
| `staff` | Player has the `authlimbo.queue.priority.staff` permission (LP-managed). | Always passes. Bypasses the queue entirely. |
| `returning` | Player is in the AuthMe DB AND has logged in within the last 30 days. | Default tier for everyone who isn't new. Normal FIFO ordering by connect time. |
| `new` | Player is NOT in the AuthMe DB OR was last seen >30 days ago. | Same FIFO as `returning`, BUT with a per-IP 1/minute throttle. Stops bot floods. |
| `flagged` | Player IP matches a Pi-hole/CrowdSec/abuse-DB block. | Rejected at the gatekeeper; never enters the queue. |

Hard rules — written into `V2-ARCHITECTURE.md` so they outlive any one operator's mood:

1. **No paid priority. Ever.** No "priority queue pass", no "supporter rank skip", no Patreon tier. The 2b2t community collapsed under that grift; we don't repeat it.
2. **No hidden veteran tier.** Every tier is documented in this file and in `/authlimbo queue policy` in-game. If a player can't see why they're in tier X, the tier is illegitimate.
3. **No in-game bidding / griefing for queue spots.** Queue position is purely connect time + tier; no player action affects it.
4. **Ops-staff bypass is logged.** Every staff bypass writes a JSON-L audit row.

### Capacity

- `gatekeeper.max-concurrent-auth: 5` — at most 5 players in the pre-auth limbo at once. Defaults are sized for racked.ru. AuthMe DB reads + chunk pins per concurrent player are roughly free, but bound it anyway.
- `gatekeeper.max-queue-depth: 50` — beyond 50 waiting, new connections get a "server is starting up, try again in 30s" kick. Better UX than a 5-minute black-screen wait.
- `gatekeeper.queue-timeout-seconds: 120` — anyone in the queue >2 minutes gets the same kick, and a Discord webhook fires.

### What queue UX looks like

In limbo, a `BossBar` (Adventure API) shows tier + position:

```
[returning] Queue position: 3 / 7   ETA: ~15s
```

When position == 0 and AuthMe accepts, the bar disappears. There is no hidden state. `/queue` in chat re-displays the same info.

---

## 4. Privacy isolation

This is the original feature; v2 must not regress it.

### Limbo world

- Separate Bukkit world `auth_limbo`, `Environment.THE_END`, `VoidGenerator`. Same as v1.
- `keepSpawnInMemory=true`. Game rules: no daylight cycle, no weather, no mobs, no fire tick, no PvP, `doImmediateRespawn=true`, `keepInventory=true` (defence in depth — limbo should never see a death event, but if it does, no item drops happen).
- Per-player view distance forced to 2 in limbo via Paper's `Player#setViewDistance`. They see 5x5 chunks, all empty.
- Limbo platform: 5x5 of `BARRIER` blocks at y=127, plus a single `BARRIER` ceiling block at y=129 to prevent flying out. y=0..126 and y=130+ are pure void.
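The platform geometry above is simple enough to generate deterministically. A plain-Java sketch of the block set a `LimboWorldManager` could iterate over when placing `BARRIER` blocks — `BlockPos` here is a hypothetical stand-in for block coordinates, not Bukkit's block API:

```java
import java.util.ArrayList;
import java.util.List;

/** Limbo platform geometry: 5x5 BARRIER floor at y=127 plus one ceiling block at y=129. */
final class LimboPlatform {
    /** Hypothetical coordinate triple; the real plugin would use Bukkit block positions. */
    record BlockPos(int x, int y, int z) {}

    /** Every position that must be BARRIER, centred on (cx, cz). */
    static List<BlockPos> barrierPositions(int cx, int cz) {
        List<BlockPos> out = new ArrayList<>();
        for (int dx = -2; dx <= 2; dx++) {       // 5x5 floor at y=127
            for (int dz = -2; dz <= 2; dz++) {
                out.add(new BlockPos(cx + dx, 127, cz + dz));
            }
        }
        out.add(new BlockPos(cx, 129, cz));      // single ceiling block at y=129
        return out;
    }
}
```

Re-placing exactly this set on every startup is what makes construction idempotent — the property the §10 unit test for `LimboWorldManager` checks.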
### Adventure-API audience scoping

A `PlayerChatEvent` listener at `EventPriority.HIGHEST`:

- If the sender is in the main worlds, the recipient list is filtered: anyone whose `World#getName().equals("auth_limbo")` is dropped. Pre-auth players never see overworld chat.
- If the sender is in limbo (they would normally not chat — AuthMe blocks it — but defence in depth), the recipient list is set to *only* the sender. They cannot leak messages to the main world.
- `PlayerJoinEvent` join messages are suppressed for `auth_limbo`-spawn joins. The main world sees a join announcement only *after* the authoritative restore TP succeeds (M2, §"join-message shifting" below).

### Tablist scoping

Hook `PaperPlayerListEntryEvent` (or fall back to `PlayerJoinEvent` + `Player#hidePlayer`):

- Limbo players are hidden from the main-world tablist.
- Main-world players are hidden from the limbo tablist.
- Limbo players cannot see each other (each limbo player sees only themselves).

### What main-world observers can detect

After scoping:

- They cannot see the player's name in the tablist pre-auth.
- They cannot see chat from the player.
- They cannot see the player's world or coordinates (AuthMe blocks movement output anyway, but we don't rely on it).
- They CAN see the connection event in server logs (operator-only).
- They see "PLAYER joined the game" only AFTER the restore succeeds — the join message is shifted to fire on restore success, not on initial connect.

This matches the v1 privacy posture and tightens the join-message leak.

---

## 5. Login flow — explicit state machine

```
[CONNECT] ---throttle ok---> [GATE]
    |                          |
    | failed throttle / ban    v
    v                      [SNAPSHOT] <-- read AuthMe DB,
[REJECTED]                     |           dump current inventory + xp + loc
                               |           to plugins/AuthLimbo/snapshots/<uuid>.nbt
                               v
                           [LIMBO]
                               |  AuthMe /login ok
                               v
                           [PRELOAD] <-- 3x3 chunk pin around target
                               |
                               v
                           [RESTORE] <-- teleportAsync, retry up to 3
                               |
                         +-----+-----+
                         |           |
                      success     fail x3
                         |           |
                         v           v
                      [LIVE]   [SPECTATOR-AT-LIMBO + admin alert]
```

Each transition has:

1. **Trigger event** (e.g. `LoginEvent` at MONITOR priority).
2. **Pre-conditions** (e.g. UUID in `pendingTransit`).
3. **Side-effects** (e.g. metric counter, audit-log row).
4. **Failure handler** (next state on error).

States persist in `plugins/AuthLimbo/state/<uuid>.json` so a plugin crash mid-flow can resume on rejoin. The state file is deleted on [LIVE] entry.

### Snapshot subsystem

**This is the operator-bug-survives-everything layer.**

- On `AuthMeAsyncPreLoginEvent` (player just connected, NOT yet auth'd): if a player file `world/playerdata/<uuid>.dat` exists, read it and shadow-copy it to `plugins/AuthLimbo/snapshots/<uuid>.nbt` with a timestamp. The SHA-256 of the file content is logged.
- `/authlimbo restore <player>` can roll back any restore by feeding the snapshot through nbtlib (same as the void-death recovery protocol from `feedback_mc_tp_safety.md`).
- Snapshots are retained for 7 days, then GC'd. Configurable.
- On `PlayerDeathEvent` while the UUID is in `pendingTransit`: `keepInventory=true`, `event.getDrops().clear()`, log SEVERE, trigger the Discord webhook, schedule a restore-from-snapshot on respawn.

### Restore step (replaces v1's `doTeleport` + 10-tick delay)

1. Read the saved location from the AuthMe DB (cached from pre-login — a single in-memory hashmap keyed by UUID, evicted on transit clear).
2. Compute the 3x3 chunk grid centred on the saved location.
3. `addPluginChunkTicket` on all 9 chunks.
4. `CompletableFuture.allOf(getChunkAtAsyncUrgently x9)` — wait for all 9 to actually be loaded, not just the centre one (closes the "loaded but neighbour unloaded" race).
5. `teleportAsync(saved, PLUGIN)`. If it returns `false`: F2 retry loop (already in v1.1.0, carries over).
6. On success: 5-tick delay, then verify `player.getLocation().distance(saved) < 2.0`. If not, treat it as a silent failure → retry.
7. Release the tickets 5s post-success.
8. Mark the transition to [LIVE], publish the `authlimbo_login_success_total` metric, write an audit-log row, send the delayed join message to the main world, and clear the snapshot.
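Steps 5-6 are a retry loop around an async teleport plus a post-hoc position check. A plain-Java sketch of that control flow, with the teleport and the distance verification abstracted into suppliers so it carries no Bukkit dependency — `RestoreRetry` and `attemptWithVerify` are illustrative names:

```java
import java.util.function.BooleanSupplier;

/** Retry control flow for the restore TP: attempt, then verify, up to maxAttempts times. */
final class RestoreRetry {
    /**
     * @param teleport returns true if the (async) teleport call itself was accepted
     * @param verified returns true if the player ended up within 2.0 blocks of the target
     * @return attempts used on success, or -1 after maxAttempts failures (caller falls
     *         back to SPECTATOR-at-limbo + admin alert)
     */
    static int attemptWithVerify(BooleanSupplier teleport, BooleanSupplier verified, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (!teleport.getAsBoolean()) {
                continue;       // teleportAsync returned false -> retry
            }
            if (verified.getAsBoolean()) {
                return attempt; // position confirmed -> [LIVE]
            }
            // "silent failure": TP reported success but the player isn't at the target -> retry
        }
        return -1;
    }
}
```

In the plugin, `teleport` would wrap `teleportAsync(saved, PLUGIN)` and `verified` the 5-tick-delayed `distance(saved) < 2.0` check; a `-1` result maps to the [SPECTATOR-AT-LIMBO + admin alert] state.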
### F8 — drop the SPECTATOR pre-TP trick

v1 considered "set `GameMode.SPECTATOR` before TP, revert after". v2 does NOT do this — spectator mode has its own client-side render races on chunk load and silently swallows damage events that the F1 guard *needs to see*. Instead, invariant-driven recovery (snapshot + retry + admin alert) is the safety net. SPECTATOR is the final fallback after 3 failed retries (F6 in v1, kept for v2).

---

## 6. Anti-drama checklist (2b2t lessons)

Codified up front so future "monetisation" pressure is rejected by reference, not by argument.

- [x] No pay-to-skip. The tier list above is the entire policy.
- [x] No hidden tier or undocumented bypass (staff bypass is logged).
- [x] No queue-spot trading / selling.
- [x] No "queue position visible to others" — your position is only visible to you. No social-pressure surface.
- [x] Queue is purely FIFO + tier; no algorithm tweaks, no "lottery".
- [x] AGPL-3.0 means anyone can fork and self-host an alt gatekeeper if they distrust ours. Operator-friendly.
- [x] Audit log is local-file JSON-L, not phoned home, not centralised. Operator-readable, no hidden telemetry.

---

## 7. Operational surface

### Metrics (Prometheus)

Exposed via an embedded HTTP server bound to `127.0.0.1:9091` (loopback only — Prometheus on nullstone scrapes via localhost):

| Metric | Type | Labels |
|--------|------|--------|
| `authlimbo_connections_total` | counter | `tier`, `outcome={accepted, queued, rejected}` |
| `authlimbo_queue_depth` | gauge | — |
| `authlimbo_login_success_total` | counter | `tier` |
| `authlimbo_login_fail_total` | counter | `reason={timeout, authme_db, tp_failed_3x, ...}` |
| `authlimbo_void_damage_blocked_total` | counter | — |
| `authlimbo_snapshot_restored_total` | counter | — |
| `authlimbo_restore_duration_seconds` | histogram | `tier` |

Trip-wire alerts (configured server-side, in `prometheus/alerts.yml`, not in the plugin):

- `authlimbo_login_fail_total{reason="tp_failed_3x"}` rate > 0 for 5m.
- `authlimbo_void_damage_blocked_total` rate > 0 for 1m.
- `authlimbo_queue_depth` > 10 for 5m.

### Discord webhooks

The plugin-side webhook fires on:

- Snapshot restored (gear was about to be lost).
- 3x-retry give-up (manual `/authlimbo tp` needed).
- Queue depth > config threshold.
- AuthMe DB unreachable.
- Plugin reload / crash.

The webhook URL is in config, redacted from `/authlimbo dump`.

### Audit log

`plugins/AuthLimbo/audit.log` — JSON Lines, one row per state transition. Fields: `ts`, `uuid`, `name`, `ip`, `tier`, `state`, `prev_state`, `extra` (free-form JSON). Logrotate-compatible; rotates at 100 MB, keeps 7 files.

### Reload-without-restart

`/authlimbo reload`:

- Re-reads `config.yml`.
- Drains in-flight transits to completion (no new joins accepted during the drain, max 30s wait).
- Re-binds the metrics HTTP server if the port changed.
- Re-creates the limbo world if the name/spawn changed.
- The Discord webhook fires "reload completed in Xs".

---

## 8. Failure modes & recovery

| Failure | Detection | Recovery |
|---------|-----------|----------|
| Plugin crashes mid-restore | On startup, scan `state/*.json` files older than 30s. | For each: if the player is offline, leave the snapshot; if online, treat as a new transit and force a re-restore from the saved AuthMe loc. |
| Snapshot file corrupt / unreadable | NBT parse exception. | Fall back to the AuthMe DB saved-loc; log SEVERE; webhook. The player may lose their newest items but not the entire inventory. |
| World save corrupts | Paper `World#getChunkAtAsync` throws. | After 3 retries: kick the player with "server experiencing storage issue, try again in 5min"; webhook. |
| AuthMe DB unreachable | JDBC `getConnection` throws / read times out > 5s. | **Fail closed.** Reject the connection at the gatekeeper with a kick: "auth service degraded". Log + webhook. Do NOT let the player onto the main world without auth. |
| Server `/stop` mid-login window | Paper shutdown hook. | `clearTransit` for all UUIDs, force-save snapshots, kick all limbo players with "server restarting, your gear is safe". |
| Race: AuthMe `LoginEvent` fires twice (HaHaWTH bug) | UUID already in `pendingTransit` and not in the `RESTORE` state. | Idempotent — the restore handler is a no-op if the UUID is past [PRELOAD]. Log INFO. |
| Player disconnects in [LIMBO] | `PlayerQuitEvent`. | Clear `pendingTransit` + retry counter. Snapshot retained 7d. State file kept until snapshot GC. |

Fail-open is never the right choice for an auth gatekeeper. Every failure mode resolves to one of two outcomes: keep the player in limbo, or kick them. Never advance them to the main world unauth'd.

---

## 9. Migration from v1

In-place upgrade path (`v1.1.x` → `v2.0.0`):

1. Stop the server.
2. Drop the new jar in `plugins/`. The v2 jar is not v1-compatible — the old `AuthLimbo-1.x.jar` must be removed.
3. v2 detects the v1 `plugins/AuthLimbo/config.yml` and rewrites it to the v2 schema, leaving a `config.v1.bak` backup.
4. v2 detects the `auth_limbo` world dir on disk and re-uses it (no recreation, no data loss).
5. The AuthMe DB schema is unchanged — v2 still treats `authme.db` as read-only authoritative.
6. New: `plugins/AuthLimbo/snapshots/` and `plugins/AuthLimbo/state/` directories are created, owned by the same uid as the itzg container's runtime user.
7. Start the server. v2 startup logs walk through the migration steps.

There is no DB migration and no mandatory player action. Permission node names change (`authlimbo.admin` is now `authlimbo.command.admin`, etc.) — the operator must update LP groups (noted in the CHANGELOG).

---

## 10. Test plan

### Unit (JUnit 5 + Mockito)

- `LimboWorldManager` — barrier-platform construction is idempotent.
- `AuthMeDatabase.getQuitLocation` — returns a `Location` for a present row, null for an absent row, null for a malformed row.
- `Snapshot.serialize` / `deserialize` round-trip.
- State machine: every transition rejects from an invalid prev-state.

### Integration (Paper test-server harness)

- Stand up Paper 1.21.x in CI (Forgejo Actions runner on nullstone).
- Mock AuthMe via a stub plugin that fires `AuthMeAsyncPreLoginEvent` and `LoginEvent` programmatically.
- Test scenarios: §5.1-5.6 from `AUDIT-2026-05-07.md` plus v2-specific: queue overflow, snapshot-restore on death, reload-without-restart, fail-closed on AuthMe DB down.

### Stress (bot flood)

- 1000 fake connections in 60s using mineflayer or [`MCBotsPro`](https://github.com/Sammy1Am/MCBotsPro). Verify:
  - queue depth stays bounded (the gatekeeper kicks beyond max-queue-depth);
  - no `pendingTransit` leak (size returns to 0 after);
  - metrics counters are consistent with the audit log.

### Chaos

- Kill the plugin (`/plugman unload AuthLimbo`) mid-restore; verify state recovery on rejoin.
- `iptables -A OUTPUT -d <authme-db-host> -j DROP` and verify fail-closed.
- `kill -9` the itzg container during transit; verify the next startup walks `state/*.json` and recovers.

---

## 11. Versioning + release

- v2.0.0 = breaking redesign (this doc), AGPL-3.0 retained.
- v2.1.0 = polish (BossBar UX, `/queue` command, more metrics).
- v2.2.0 = Velocity mode behind a feature flag.
- v1.x = receives F3, F5, F6, F7 backports until racked.ru cuts over to v2; then archived.

Coordinate naming: when the codename migration completes (onyx→obsidian, nullstone→bedrock per `gravel-laptop-build/ROADMAP.md`), the racked.ru server moves to bedrock. v2.0.0 must run on both naming worlds without config drift.

---

## 12. Open questions

- BossBar UI — does the operator want it visible to limbo players, or silent? Proposed default: visible.
- Snapshot retention — 7 days is the proposed default. Storage cost is ~1 KB/snapshot for vanilla inventories, up to ~50 KB for shulker-stuffed players. 100 active players → ~5 MB max.
- Webhook destination — the same Discord channel as the `s8n-ru` server-status alerts, or a new channel? Proposed default: same channel, prefixed `[AuthLimbo]`.
- v2.2 Velocity migration — needs a separate design pass once cobblestone or a second backend is real.

Sign-off pending operator review.