Output of 4 parallel research agents (2026-05-07):
- RESEARCH-2B2T-QUEUE.md — 2b2t queue tech deep-dive: architecture, drama
  timeline, 5 patterns to copy + 5 to avoid
- RESEARCH-LIMBO-PLUGIN-SURVEY.md — open-source plugin survey: STEAL list
  (Elytrium LimboAPI/LimboAuth + PistonQueue), PATTERN list, SKIP list
- V2-ARCHITECTURE.md — Paper-only stack with Velocity-ready seam, 7-state
  login flow, snapshot-on-pre-login, transparent FIFO trust tiers
- V2-ROADMAP.md — M0-M5 milestones with acceptance criteria + dep graph

Stack decision: Paper-only for now (no proxy required), but architecture
split into Gatekeeper + Restore layers so future Velocity migration is
mechanical. Trip-wires codified for when to reconsider.

Anti-drama policy locked in code (not config): no paid priority, no
hidden veteran tier, transparent ban appeals.

Bootstrap repo at git.s8n.ru/s8n/auth-limbo-v2 ready for M0 work.

AuthLimbo v2 — Architecture

Status: Design draft (no code). Drafted 2026-05-07 during the auth-limbo v2 design pass, after the YOU500 / second-player void-death incidents. Audience: operator (P) and future contributors.

Companion docs:

  • RESEARCH-2B2T-QUEUE.md — 2b2t queue tech deep-dive
  • RESEARCH-LIMBO-PLUGIN-SURVEY.md — open-source plugin survey
  • V2-ROADMAP.md — M0-M5 milestones with acceptance criteria + dependency graph

1. Why v2

v1 is a single-jar Paper plugin glued onto AuthMe. It works most of the time, but its core failure modes are now well-understood and can't be patched away inside the v1 design:

| v1 limitation | v2 must address |
| --- | --- |
| Player object exists on the main server before auth — coords/inventory technically restorable from RAM by buggy plugins, world chunk activity is observable. | Strong isolation: limbo is the only state the player can touch pre-auth. |
| Restore relies on AuthMe firing LoginEvent. AuthMe's own broken teleport runs in the same window — F4 pre-empts it but the design still races. | Authoritative state machine that doesn't trust AuthMe's teleport at all. |
| Inventory loss on transit-death depends on F1 + F5 holding. There is no inventory-of-record outside live game state. | Snapshot-on-pre-login + snapshot-restore is a first-class subsystem, not a defensive add-on. |
| No metrics, no audit log, no admin alerting. Bugs only surface when a player loses gear. | Built-in observability: Prometheus + JSON-Lines audit + Discord webhook. |
| No queue / login-throttle. If 50 bots connect at once, AuthMe stalls. | Bounded concurrency with transparent FIFO and trust tiers (NOT pay tiers). |

v2 is a clean break (v2.0.0), not a v1 patch. v1 keeps receiving F3, F5, F6, and F7 backports for as long as racked.ru still runs the old jar.


2. Stack decision — Paper-only, with a Velocity-ready seam

Recommendation: Paper-only single-server plugin for v2.0.0. Velocity-mode is a v2.x deferrable behind a feature flag.

Reasoning

racked.ru today is a single Purpur 1.21.11 server in the minecraft-mc itzg container on nullstone. There is no Velocity / BungeeCord, no second backend, no Forced Hosts, no proxy network. Adding Velocity just to ship a gatekeeper plugin would mean:

  • standing up a new container, opening a new public port (or keeping 25565 on the proxy and 25566 internal),
  • migrating the 12+ existing Paper plugins through the velocity-paper bridge contract for chat / commands / placeholders,
  • new TLS / RCON / proxy-protocol surface to harden,
  • breaking changes to AuthMe's data flow (proxy-side login flow vs paper-side AuthMeAsyncPreLoginEvent),
  • one more thing for the operator to babysit.

The privacy property the operator cares about — no other player sees pre-auth coords / inventory — is achievable on Paper-only via a strictly isolated limbo world + audience scoping (see §4). Velocity adds stronger isolation (player never reaches the backend at all) but the incremental privacy gain is small for a 0-10 player community, and the operational cost is large.

When Velocity becomes worth it

Codify trip-wires up front so the decision isn't dragged out:

  1. racked.ru splits into ≥2 backends (e.g. survival + creative) — you need a proxy anyway.
  2. cobblestone server comes online and shares an account/auth pool.
  3. Botting attempts cross 100 connections / minute and connection-throttle + firewalld rate-limit are no longer enough. Velocity + a queue plugin (Ajax / VelocityQueue) become operationally cheaper than chasing botnets at the application layer.

Until any of those, Paper-only is the right answer.

The Velocity-ready seam

v2 internal API is split into two layers so the proxy migration is mechanical:

+-------------------------------+   +-------------------------------+
|  Gatekeeper (proxy or paper)  |   |  Restore (paper only)         |
|  - accept connection          |   |  - read snapshot              |
|  - check ban / rate limit     |   |  - chunk preload              |
|  - hold in limbo / queue      |   |  - authoritative TP           |
|  - hand off on auth-success   |   |  - publish metrics            |
+--------------+----------------+   +-------------------------------+
               | hand-off event (UUID, target Location, source IP)
               v

In v2.0 both layers live in the Paper plugin and the hand-off is just a local method call. In a future "v2-velo" both layers split: gatekeeper runs as a Velocity plugin, restore stays on Paper, hand-off becomes a plugin-message channel. No code outside those two layers needs to change.
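The seam can be made concrete as a single interface plus one payload type. The sketch below is illustrative (all names are assumptions, not the real plugin API): in v2.0 the hand-off is a direct method call, and only the `LocalHandOff` class would be replaced by a plugin-message bridge in a future v2-velo build.

```java
import java.util.UUID;

// Hypothetical sketch of the Gatekeeper -> Restore hand-off contract.
// Names are illustrative; only this seam would change for v2-velo.
public final class HandOffSeam {

    /** Immutable hand-off payload: everything Restore needs, nothing more. */
    public record HandOff(UUID uuid, String world, double x, double y, double z, String sourceIp) {}

    /** The only surface Gatekeeper knows about Restore. */
    public interface RestoreLayer {
        void onAuthSuccess(HandOff handOff);
    }

    /** v2.0 wiring: a local call. v2-velo swaps this one class for a channel bridge. */
    public static final class LocalHandOff implements RestoreLayer {
        private HandOff last; // captured here only for demonstration

        @Override public void onAuthSuccess(HandOff h) { last = h; }

        public HandOff last() { return last; }
    }

    public static void main(String[] args) {
        LocalHandOff restore = new LocalHandOff();
        restore.onAuthSuccess(new HandOff(UUID.randomUUID(), "world", 100.5, 64.0, -20.5, "203.0.113.7"));
        System.out.println(restore.last().world()); // prints "world"
    }
}
```

Keeping the payload to (UUID, target Location, source IP) — exactly the hand-off event in the diagram — is what makes the migration mechanical: nothing behind the interface can depend on live Player state.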


3. Queue model — login-throttle + transparent trust tiers, NO 2b2t-style sale

For 0-10 player normal load: queue depth is always 0 and players never see "queued" UI. The queue exists for crisis scenarios (bot flood, restart drain, AuthMe DB stall) and to define explicit policy even if it's rarely hit.

Policy

| Tier | Definition | Effect |
| --- | --- | --- |
| staff | Player has authlimbo.queue.priority.staff permission (LP-managed). | Always passes. Bypasses queue entirely. |
| returning | Player is in AuthMe DB AND has logged in within last 30 days. Default tier for everyone who isn't new. | Normal FIFO ordering by connect-time. |
| new | Player is NOT in AuthMe DB OR last seen >30 days ago. | Same FIFO as returning BUT with a per-IP 1/minute throttle. Stops bot-floods. |
| flagged | Player IP matches a Pi-hole/CrowdSec/abuse-DB block. | Rejected at gatekeeper, never enters the queue. |

Hard rules — written into V2-ARCHITECTURE.md so they outlive any one operator's mood:

  1. No paid priority. Ever. No "priority queue pass", no "supporter rank skip", no Patreon tier. The 2b2t community collapsed under that grift; we don't repeat it.
  2. No hidden veteran tier. Every tier is documented in this file and in /authlimbo queue policy in-game. If a player can't see why they're in tier X, the tier is illegitimate.
  3. No in-game bidding / griefing for queue spots. Queue position is purely connect-time + tier; no player action affects it.
  4. Ops-staff bypass is logged. Every staff bypass writes a JSON-L audit row.
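The tier table and hard rule 3 together fit in a few lines of logic. This is a minimal sketch, not the real plugin code — the method names and the boolean inputs (permission check, AuthMe DB lookup, abuse-list match) are assumptions about how the data sources would be wired in:

```java
import java.util.Comparator;

// Illustrative sketch of the §3 tier policy. Tier affects throttling and
// admission only; queue order stays pure connect-time FIFO (hard rule 3).
public final class QueuePolicy {

    public enum Tier { STAFF, RETURNING, NEW, FLAGGED }

    public record Connection(String name, String ip, long connectMillis, Tier tier) {}

    /** Hard rule 3: position is connect-time only; tier never reorders the queue. */
    public static final Comparator<Connection> FIFO =
            Comparator.comparingLong(Connection::connectMillis);

    /** Classify a connection per the tier table above. */
    public static Tier tierFor(boolean hasStaffPermission, boolean inAuthMeDb,
                               long daysSinceLastSeen, boolean ipBlocked) {
        if (ipBlocked) return Tier.FLAGGED;               // rejected at gatekeeper
        if (hasStaffPermission) return Tier.STAFF;        // bypasses queue (bypass is logged)
        if (inAuthMeDb && daysSinceLastSeen <= 30) return Tier.RETURNING;
        return Tier.NEW;                                  // per-IP 1/minute throttle applies
    }

    public static void main(String[] args) {
        System.out.println(tierFor(false, true, 10, false));  // RETURNING
        System.out.println(tierFor(false, true, 45, false));  // NEW (stale >30d)
        System.out.println(tierFor(true, false, 0, false));   // STAFF
        System.out.println(tierFor(false, true, 1, true));    // FLAGGED wins over everything
    }
}
```

Note that the flagged check runs first: a blocked IP is rejected even if the account would otherwise be staff or returning.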

Capacity

  • gatekeeper.max-concurrent-auth: 5 — at most 5 players in the pre-auth limbo at once. Defaults sized for racked.ru. AuthMe DB reads + chunk pins per concurrent player are roughly free, but bound it anyway.
  • gatekeeper.max-queue-depth: 50 — beyond 50 waiting, new connections get a "server is starting up, try again in 30s" kick. Better UX than a 5-minute black-screen wait.
  • gatekeeper.queue-timeout-seconds: 120 — anyone in the queue >2 minutes gets the same kick + a Discord webhook fires.
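The three capacity knobs would land in config.yml roughly as follows — the key names come straight from the bullets above, but the exact nesting is an assumption:

```yaml
# Sketch of the gatekeeper section of config.yml (nesting illustrative).
gatekeeper:
  max-concurrent-auth: 5        # players allowed in pre-auth limbo at once
  max-queue-depth: 50           # beyond this, kick with "try again in 30s"
  queue-timeout-seconds: 120    # >2 min waiting => same kick + Discord webhook
```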

What queue UX looks like

In limbo, a BossBar (Adventure API) shows tier + position:

[returning]  Queue position: 3 / 7   ETA: ~15s

When position == 0 and AuthMe accepts, the bar disappears. There's no hidden state. /queue in-chat re-displays the same info.


4. Privacy isolation

This is the original feature; v2 must not regress it.

Limbo world

  • Separate Bukkit world auth_limbo, Environment.THE_END, VoidGenerator. Same as v1.
  • keepSpawnInMemory=true. Game-rules: no daylight cycle, no weather, no mobs, no fire-tick, no PvP, doImmediateRespawn=true, keepInventory=true (defence-in-depth — limbo should never see a death event, but if it does, no items drop).
  • Per-player view-distance forced to 2 in limbo via Paper's Player#setViewDistance. They see 5x5 chunks, all empty.
  • Limbo platform: 5x5 of BARRIER blocks at y=127, with a single layer of BARRIER ceiling at y=129 to prevent flying out. y=0..126 and y=130+ are pure void.

Adventure-API audience scoping

AsyncChatEvent listener at EventPriority.HIGHEST:

  • If sender is in main worlds, recipient list is filtered: anyone whose World#getName().equals("auth_limbo") is dropped. Pre-auth players never see overworld chat.
  • If sender is in limbo (would normally not chat — AuthMe blocks it — but defence in depth), recipient list is set to only the sender. They cannot leak messages to the main world.
  • PlayerJoinEvent join messages are suppressed for auth_limbo-spawn joins. Main world only sees a join announcement after the authoritative restore TP succeeds (M2 §"join-message shifting" below).
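The two scoping rules reduce to a pure function over (sender, proposed recipients, limbo membership). In the plugin this would mutate the chat event's viewer set; the sketch below models it as plain Java so the rules are easy to unit-test — the function name and signature are illustrative, not the real API:

```java
import java.util.List;
import java.util.Set;

// Pure-logic sketch of the §4 chat scoping rules.
public final class ChatScoping {

    /**
     * @param sender     the chatting player's name
     * @param recipients proposed recipient names
     * @param limbo      names of players currently in auth_limbo
     * @return recipients actually allowed to see the message
     */
    public static List<String> scope(String sender, List<String> recipients, Set<String> limbo) {
        if (limbo.contains(sender)) {
            // Limbo sender (defence in depth — AuthMe should already block chat):
            // the message stays with the sender only.
            return List.of(sender);
        }
        // Main-world sender: drop every pre-auth recipient.
        return recipients.stream().filter(r -> !limbo.contains(r)).toList();
    }

    public static void main(String[] args) {
        Set<String> limbo = Set.of("newbie");
        System.out.println(scope("vet", List.of("vet", "friend", "newbie"), limbo)); // [vet, friend]
        System.out.println(scope("newbie", List.of("vet", "newbie"), limbo));        // [newbie]
    }
}
```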

Tablist scoping

Hook PaperPlayerListEntryEvent (or fall back to PlayerJoinEvent + Player#hidePlayer):

  • Limbo players are hidden from main-world tablist.
  • Main-world players are hidden from limbo tablist.
  • Limbo players cannot see each other (each limbo player sees only themselves).

What main world observers can detect

After scoping:

  • They cannot see the player's name in tablist pre-auth.
  • They cannot see chat from the player.
  • They cannot see the player's world or coordinates (AuthMe blocks movement output anyway, but we don't rely on it).
  • They CAN see the connection event in server logs (operator-only).
  • They can see "PLAYER joined the game" only AFTER restore succeeds — join message is shifted to fire on restore-success, not on initial connect.

This matches the v1 privacy posture and tightens the join-message leak.


5. Login flow — explicit state machine

[CONNECT] ---throttle ok---> [GATE]
                                |
       failed throttle / ban    |
              v                 v
           [REJECTED]      [SNAPSHOT]   <-- read AuthMe DB,
                                |          dump current inventory + xp + loc
                                v          to plugins/AuthLimbo/snapshots/<uuid>.nbt
                            [LIMBO]
                                |
                          AuthMe /login ok
                                |
                                v
                           [PRELOAD]   <-- 3x3 chunk pin around target
                                |
                                v
                           [RESTORE]   <-- teleportAsync, retry up to 3
                                |
                          +-----+-----+
                          |           |
                       success     fail x3
                          |           |
                          v           v
                       [LIVE]    [SPECTATOR-AT-LIMBO + admin alert]

Each transition has:

  1. Trigger event (e.g. LoginEvent MONITOR).
  2. Pre-conditions (e.g. UUID in pendingTransit).
  3. Side-effects (e.g. metric counter, audit-log row).
  4. Failure handler (next state on error).

States persist in plugins/AuthLimbo/state/<uuid>.json so a plugin crash mid-flow can resume on rejoin. State file is deleted on [LIVE] entry.
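The "every transition rejects from invalid prev-state" property (also the target of the §10 unit tests) can be expressed as a transition whitelist. A minimal sketch, with state names taken from the diagram and an illustrative API:

```java
import java.util.Map;
import java.util.Set;

// Minimal sketch of the §5 state machine: transitions are whitelisted and
// anything off the table throws, which callers treat as a bug + audit row.
public final class LoginStateMachine {

    public enum State { CONNECT, GATE, REJECTED, SNAPSHOT, LIMBO, PRELOAD, RESTORE, LIVE, SPECTATOR_AT_LIMBO }

    private static final Map<State, Set<State>> ALLOWED = Map.of(
            State.CONNECT,  Set.of(State.GATE),
            State.GATE,     Set.of(State.REJECTED, State.SNAPSHOT),
            State.SNAPSHOT, Set.of(State.LIMBO),
            State.LIMBO,    Set.of(State.PRELOAD),
            State.PRELOAD,  Set.of(State.RESTORE),
            State.RESTORE,  Set.of(State.LIVE, State.SPECTATOR_AT_LIMBO));

    private State current = State.CONNECT;

    /** Advance to next, or throw if the transition is not in the table. */
    public State advance(State next) {
        if (!ALLOWED.getOrDefault(current, Set.of()).contains(next)) {
            throw new IllegalStateException(current + " -> " + next + " is not a legal transition");
        }
        current = next;
        return current;
    }

    public State current() { return current; }

    public static void main(String[] args) {
        LoginStateMachine sm = new LoginStateMachine();
        sm.advance(State.GATE);
        sm.advance(State.SNAPSHOT);
        sm.advance(State.LIMBO);
        System.out.println(sm.current()); // LIMBO
    }
}
```

Terminal states ([REJECTED], [LIVE], [SPECTATOR-AT-LIMBO]) have no outgoing entries, so any further advance throws by construction.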

Snapshot subsystem

This is the operator-bug-survives-everything layer.

  • On AuthMeAsyncPreLoginEvent (player just connected, NOT yet auth'd): if a player file world/playerdata/<uuid>.dat exists, read it and shadow-copy to plugins/AuthLimbo/snapshots/<uuid>.nbt with timestamp. SHA-256 of file content is logged.
  • /authlimbo restore <player> can roll back any restore by feeding the snapshot through nbtlib (same as the void-death recovery protocol from feedback_mc_tp_safety.md).
  • Snapshots retained 7 days, then GC'd. Configurable.
  • On PlayerDeathEvent while UUID in pendingTransit: keepInventory=true, event.getDrops().clear(), log SEVERE, trigger Discord webhook, schedule restore-from-snapshot on respawn.
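The shadow-copy + SHA-256 step is ordinary file plumbing; a self-contained sketch follows. Paths, method names, and the timestamped filename scheme are assumptions for illustration — only the behaviour (copy bytes, log a SHA-256 of the exact content copied) is from the design above:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;

// Illustrative sketch of the snapshot shadow-copy step from §5.
public final class SnapshotCopier {

    /** Hex-encoded SHA-256 — this is the value the audit log records. */
    public static String sha256Hex(byte[] bytes) {
        try {
            return HexFormat.of().formatHex(MessageDigest.getInstance("SHA-256").digest(bytes));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-256 is mandatory in every JDK
        }
    }

    /** Copies src to snapshotsDir/<name>-<epochMillis>.nbt; returns the SHA-256 hex. */
    public static String shadowCopy(Path src, Path snapshotsDir, long epochMillis) throws IOException {
        byte[] bytes = Files.readAllBytes(src);          // hash and copy the same bytes
        Files.createDirectories(snapshotsDir);
        String base = src.getFileName().toString().replaceFirst("\\.dat$", "");
        Files.write(snapshotsDir.resolve(base + "-" + epochMillis + ".nbt"), bytes);
        return sha256Hex(bytes);
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempDirectory("authlimbo-demo");
        Path dat = Files.write(tmp.resolve("player.dat"), new byte[] {1, 2, 3});
        System.out.println(shadowCopy(dat, tmp.resolve("snapshots"), 1234L).length()); // 64 hex chars
    }
}
```

Hashing the in-memory byte array (rather than re-reading the destination file) guarantees the logged digest matches what was actually snapshotted, even if the source .dat changes a tick later.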

Restore step (replaces v1's doTeleport + 10-tick delay)

  1. Read saved location from AuthMe DB (cached from pre-login — single in-memory hashmap keyed by UUID, evicted on transit clear).
  2. Compute 3x3 chunk grid centred on saved location.
  3. addPluginChunkTicket on all 9 chunks.
  4. CompletableFuture.allOf(getChunkAtAsyncUrgently x9) — wait for all 9 to actually be loaded, not just the centre one (closes the "loaded but neighbour unloaded" race).
  5. teleportAsync(saved, PLUGIN). If false: F2 retry loop (already in v1.1.0, carries over).
  6. On success: 5-tick delay, then verify player.getLocation().distance(saved) < 2.0. If not, treat as a silent failure → retry.
  7. Release tickets 5s post-success.
  8. Mark transition [LIVE], publish the authlimbo_login_success_total metric, write an audit-log row, send the delayed join-message to the main world, clear the snapshot.
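Steps 2 and 6 are pure arithmetic and worth pinning down exactly. The sketch below is not plugin code — it models the 3x3 chunk grid (chunk coords are block coords >> 4, as in vanilla) and the post-TP distance check; tickets, async loads, and teleportAsync are Paper API and are not reproduced:

```java
import java.util.ArrayList;
import java.util.List;

// Pure-math sketch of restore steps 2 (grid) and 6 (distance verify).
public final class RestoreMath {

    public record ChunkPos(int x, int z) {}

    /** Step 2: the 9 chunks to pin around block position (bx, bz). */
    public static List<ChunkPos> grid3x3(int bx, int bz) {
        int cx = bx >> 4, cz = bz >> 4;   // floor-division by 16, correct for negatives
        List<ChunkPos> out = new ArrayList<>(9);
        for (int dx = -1; dx <= 1; dx++)
            for (int dz = -1; dz <= 1; dz++)
                out.add(new ChunkPos(cx + dx, cz + dz));
        return out;
    }

    /** Step 6: the TP counts as failed if the player ended up >= 2.0 blocks from saved. */
    public static boolean landedClose(double[] actual, double[] saved) {
        double dx = actual[0] - saved[0], dy = actual[1] - saved[1], dz = actual[2] - saved[2];
        return Math.sqrt(dx * dx + dy * dy + dz * dz) < 2.0;
    }

    public static void main(String[] args) {
        System.out.println(grid3x3(100, -35).size());                                            // 9
        System.out.println(landedClose(new double[] {100.3, 64, -35}, new double[] {100, 64, -35})); // true
    }
}
```

The `>> 4` form matters: integer division by 16 rounds toward zero for negative coordinates and would pin the wrong chunk column on the negative side of an axis.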

F8 — drop the SPECTATOR pre-TP trick

v1 considered "set GameMode.SPECTATOR before TP, revert after". v2 does NOT do this — spectator mode has its own client-side render races on chunk-load and silently swallows damage events that the F1 guard needs to see. Instead: invariant-driven recovery (snapshot + retry + admin alert) is the safety net. SPECTATOR is the final fallback after 3 failed retries (F6 in v1, kept for v2).


6. Anti-drama checklist (2b2t lessons)

Codified up-front so future "monetisation" pressure is rejected by reference, not by argument.

  • No pay-to-skip. Tier list above is the entire policy.
  • No hidden tier or undocumented bypass (staff bypass is logged).
  • No queue spot trading / selling.
  • No "queue position visible to others" — your position is only visible to you. No social pressure surface.
  • Queue is purely FIFO + tier; no algorithm tweaks, no "lottery".
  • AGPL-3.0 means anyone can fork and self-host an alt gatekeeper if they distrust ours. Operator-friendly.
  • Audit log is local-file JSON-L, not phoned home, not centralised. Operator-readable, no hidden telemetry.

7. Operational surface

Metrics (Prometheus)

Exposed via embedded HTTP server bound to 127.0.0.1:9091 (loopback only — Prometheus on nullstone scrapes via localhost):

| Metric | Type | Labels |
| --- | --- | --- |
| authlimbo_connections_total | counter | tier, outcome={accepted, queued, rejected} |
| authlimbo_queue_depth | gauge | — |
| authlimbo_login_success_total | counter | tier |
| authlimbo_login_fail_total | counter | reason={timeout, authme_db, tp_failed_3x, ...} |
| authlimbo_void_damage_blocked_total | counter | — |
| authlimbo_snapshot_restored_total | counter | — |
| authlimbo_restore_duration_seconds | histogram | tier |

Trip-wire alerts (configured server-side, in prometheus/alerts.yml, not in the plugin):

  • authlimbo_login_fail_total{reason="tp_failed_3x"} rate > 0 for 5m.
  • authlimbo_void_damage_blocked_total rate > 0 for 1m.
  • authlimbo_queue_depth > 10 for 5m.
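In Prometheus rule syntax the three trip-wires would look roughly like this — thresholds are copied from the bullets above, while the alert names and file layout are assumptions:

```yaml
# Sketch of prometheus/alerts.yml (alert names illustrative).
groups:
  - name: authlimbo
    rules:
      - alert: AuthLimboTpFailed3x
        expr: rate(authlimbo_login_fail_total{reason="tp_failed_3x"}[5m]) > 0
        for: 5m
      - alert: AuthLimboVoidDamageBlocked
        expr: rate(authlimbo_void_damage_blocked_total[1m]) > 0
        for: 1m
      - alert: AuthLimboQueueDeep
        expr: authlimbo_queue_depth > 10
        for: 5m
```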

Discord webhooks

Plugin-side webhook fires on:

  • Snapshot restored (gear was about to be lost).
  • 3x retry give-up (manual /authlimbo tp needed).
  • Queue depth > config threshold.
  • AuthMe DB unreachable.
  • Plugin reload / crash.

Webhook URL is in config, redacted from /authlimbo dump.

Audit log

plugins/AuthLimbo/audit.log — JSON Lines, one row per state transition. Fields: ts, uuid, name, ip, tier, state, prev_state, extra (free-form JSON). Logrotate-compatible; rotates at 100MB, keeps 7 files.
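A hypothetical example row, with every value illustrative, showing how the fields above lay out on one line:

```json
{"ts":"2026-05-07T19:31:40+01:00","uuid":"069a79f4-44e9-4726-a5be-fca90e38aaf5","name":"ExamplePlayer","ip":"203.0.113.7","tier":"returning","state":"RESTORE","prev_state":"PRELOAD","extra":{"retry":0}}
```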

Reload-without-restart

/authlimbo reload:

  • Re-reads config.yml.
  • Drains in-flight transits to completion (no new joins accepted during drain, max 30s wait).
  • Re-binds metrics HTTP server if port changed.
  • Re-creates limbo world if name/spawn changed.
  • Discord webhook fires "reload completed in Xs".

8. Failure modes & recovery

| Failure | Detection | Recovery |
| --- | --- | --- |
| Plugin crashes mid-restore | On startup, scan state/*.json files older than 30s. | For each: if player offline, leave snapshot; if online, treat as new transit, force re-restore from saved AuthMe loc. |
| Snapshot file corrupt / unreadable | NBT parse exception. | Fall back to AuthMe DB saved-loc; log SEVERE; webhook. Player may lose newest items but not entire inventory. |
| World save corrupts | Paper World#getChunkAtAsync throws. | After 3 retries: kick player with "server experiencing storage issue, try again in 5min"; webhook. |
| AuthMe DB unreachable | JDBC getConnection throws / read times out > 5s. | Fail closed. Reject connection at gatekeeper with kick: "auth service degraded". Log + webhook. Do NOT let player onto main world without auth. |
| Server /stop mid-login window | Paper shutdown hook. | clearTransit for all UUIDs, force-save snapshots, kick all limbo players with "server restarting, your gear is safe". |
| Race: AuthMe LoginEvent fires twice (HaHaWTH bug) | UUID already in pendingTransit and not in RESTORE state. | Idempotent — restore handler is a no-op if UUID is past [PRELOAD]. Log INFO. |
| Player disconnects in [LIMBO] | PlayerQuitEvent. | Clear pendingTransit + retry counter. Snapshot retained 7d. State file kept until snapshot GC. |

Fail-open is never the right choice for an auth gatekeeper. Every failure mode resolves to one of two outcomes: keep the player in limbo, or kick them. Never advance them to the main world unauthenticated.
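The first row of the table — the startup scan — has one subtlety worth encoding: files younger than 30s are skipped, since they may belong to a transit still in flight across a fast restart. A pure-logic sketch (file I/O elided, names illustrative):

```java
// Sketch of the startup recovery decision from the §8 failure table.
public final class StartupRecovery {

    public enum Action { SKIP_TOO_YOUNG, LEAVE_SNAPSHOT, FORCE_RE_RESTORE }

    /** Decide what to do with one state/<uuid>.json found at startup. */
    public static Action decide(long fileAgeSeconds, boolean playerOnline) {
        if (fileAgeSeconds <= 30) return Action.SKIP_TOO_YOUNG;      // maybe still in flight
        return playerOnline
                ? Action.FORCE_RE_RESTORE   // treat as a new transit from saved AuthMe loc
                : Action.LEAVE_SNAPSHOT;    // keep the snapshot for when they rejoin
    }

    public static void main(String[] args) {
        System.out.println(decide(5, true));    // SKIP_TOO_YOUNG
        System.out.println(decide(120, false)); // LEAVE_SNAPSHOT
        System.out.println(decide(120, true));  // FORCE_RE_RESTORE
    }
}
```

Both non-skip branches honour the fail-closed rule: neither path ever advances an unauthenticated player to the main world.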


9. Migration from v1

In-place upgrade path (v1.1.x → v2.0.0):

  1. Stop server.
  2. Drop new jar in plugins/. v2 jar is not v1-compatible — old AuthLimbo-1.x.jar must be removed.
  3. v2 detects plugins/AuthLimbo/config.yml from v1 and rewrites it to v2 schema, leaving a config.v1.bak backup.
  4. v2 detects auth_limbo world dir on disk and re-uses it (no recreation, no data loss).
  5. AuthMe DB schema unchanged — v2 still treats authme.db as read-only authoritative.
  6. New: plugins/AuthLimbo/snapshots/ and plugins/AuthLimbo/state/ directories created, owned by the same uid as the itzg container's runtime user.
  7. Start server. v2 startup logs walk through migration steps.

There is no DB migration and no mandatory player action. Permission node names change (authlimbo.admin is now authlimbo.command.admin, etc.) — the operator must update LP groups (noted in CHANGELOG).


10. Test plan

Unit (JUnit 5 + Mockito)

  • LimboWorldManager — barrier-platform construction is idempotent.
  • AuthMeDatabase.getQuitLocation — returns Location for present row, null for absent, null for malformed row.
  • Snapshot.serialize / deserialize round-trip.
  • State-machine: every transition rejects from invalid prev-state.

Integration (Paper test-server harness)

  • Stand up Paper 1.21.x in CI (Forgejo Actions runner on nullstone).
  • Mock AuthMe via a stub plugin that fires AuthMeAsyncPreLoginEvent and LoginEvent programmatically.
  • Test scenarios: §5.1-5.6 from AUDIT-2026-05-07.md plus v2-specific: queue overflow, snapshot-restore on death, reload-without-restart, fail-closed on AuthMe DB down.

Stress (Bot flood)

  • 1000 fake connections in 60s using mineflayer or MCBotsPro. Verify:
    • queue-depth bounded (gatekeeper kicks beyond max-queue-depth);
    • no pendingTransit leak (size returns to 0 after);
    • metrics counters consistent with audit log.

Chaos

  • Kill plugin (/plugman unload AuthLimbo) mid-restore, verify state recovery on rejoin.
  • iptables -A OUTPUT -d <authme-db-host> -j DROP and verify fail-closed.
  • kill -9 itzg container during transit, verify next-startup walks state/*.json and recovers.

11. Versioning + release

  • v2.0.0 = breaking redesign (this doc), AGPL-3.0 retained.
  • v2.1.0 = polish (BossBar UX, /queue command, more metrics).
  • v2.2.0 = Velocity-mode behind feature flag.
  • v1.x = receives F3, F5, F6, F7 backports until racked.ru cuts over to v2; then archived.

Coordinate naming: when the codename migration completes (onyx→obsidian, nullstone→bedrock per gravel-laptop-build/ROADMAP.md), the racked.ru server moves to bedrock. v2.0.0 must run on both naming worlds without config drift.


12. Open questions

  • BossBar UI — does the operator want it visible to limbo players, or silent? Default proposed: visible.
  • Snapshot retention — 7 days is the proposed default. Storage cost is ~1 KB/snapshot for vanilla inventories, up to ~50 KB for shulker-stuffed players. 100 active players → ~5 MB max.
  • Webhook destination — same Discord channel as s8n-ru server-status alerts, or a new channel? Default proposed: same channel, prefixed [AuthLimbo].
  • v2.2 Velocity migration — needs a separate design pass once cobblestone or a second backend is real.

Sign-off pending operator review.