
# AuthLimbo v2 — Architecture
Status: **Design draft** (no code). Drafted 2026-05-07 by the auth-limbo
v2 design pass after the YOU500 / second-player void-death incidents.
Audience: operator (P) and future contributors.
Companion docs:
- [`AUDIT-2026-05-07.md`](../AUDIT-2026-05-07.md) — root-cause forensic analysis.
- [`ROADMAP.md`](../ROADMAP.md) — v1.x tracking (F1-F7).
- [`V2-ROADMAP.md`](V2-ROADMAP.md) — milestones M0-M5 for v2.
---
## 1. Why v2
v1 is a single-jar Paper plugin glued onto AuthMe. It works *most* of
the time, but its core failure modes are now well-understood and can't
be patched away inside the v1 design:
| v1 limitation | v2 must address |
|---------------|------------------|
| Player object exists on the main server *before* auth — coords/inventory technically restorable from RAM by buggy plugins, world chunk activity is observable. | Strong isolation: limbo is the only state the player can touch pre-auth. |
| Restore relies on AuthMe firing `LoginEvent`. AuthMe's own broken teleport runs in the same window — F4 pre-empts it but the design still races. | Authoritative state machine that doesn't trust AuthMe's teleport at all. |
| Inventory loss on transit-death depends on F1 + F5 holding. There is no inventory-of-record outside live game state. | Snapshot-on-pre-login + snapshot-restore is a first-class subsystem, not a defensive add-on. |
| No metrics, no audit log, no admin alerting. Bugs only surface when a player loses gear. | Built-in observability: Prometheus + JSON-Lines audit + Discord webhook. |
| No queue / login-throttle. If 50 bots connect at once, AuthMe stalls. | Bounded concurrency with transparent FIFO and trust tiers (NOT pay tiers). |
v2 is a clean break (`v2.0.0`), not a v1 patch. v1 continues to receive
F3, F5, F6, and F7 backports for as long as racked.ru still runs the old jar.
---
## 2. Stack decision — **Paper-only**, with a Velocity-ready seam
**Recommendation: Paper-only single-server plugin for v2.0.0.**
Velocity-mode is a v2.x deferrable behind a feature flag.
### Reasoning
racked.ru today is one Purpur 1.21.11 server in the `minecraft-mc` itzg
container on nullstone. There is no Velocity / BungeeCord, no second
backend, no Forced Hosts, no proxy network. Adding Velocity to ship a
gatekeeper plugin would mean:
- standing up a new container, opening a new public port (or keeping
25565 on the proxy and 25566 internal),
- migrating the 12+ existing Paper plugins through the velocity-paper
bridge contract for chat / commands / placeholders,
- new TLS / RCON / proxy-protocol surface to harden,
- breaking changes to AuthMe's data flow (proxy-side login flow vs
paper-side `AuthMeAsyncPreLoginEvent`),
- one more thing for the operator to babysit.
The privacy property the operator cares about — *no other player sees
pre-auth coords / inventory* — is achievable on Paper-only via a
strictly isolated limbo world + audience scoping (see §4). Velocity adds
*stronger* isolation (player never reaches the backend at all) but the
incremental privacy gain is small for a 0-10 player community, and the
operational cost is large.
### When Velocity becomes worth it
Codify trip-wires up front so the decision isn't dragged out:
1. racked.ru splits into ≥2 backends (e.g. `survival` + `creative`) —
you need a proxy anyway.
2. cobblestone server comes online and shares an account/auth pool.
3. Botting attempts cross 100 connections / minute and `connection-throttle` +
`firewalld rate-limit` are no longer enough. Velocity + a queue
plugin (Ajax / VelocityQueue) become operationally cheaper than
chasing botnets at the application layer.
Until any of those, Paper-only is the right answer.
### The Velocity-ready seam
v2 internal API is split into two layers so the proxy migration is
mechanical:
```
+-------------------------------+     +-------------------------------+
|  Gatekeeper (proxy or paper)  |     |      Restore (paper only)     |
|  - accept connection          |     |  - read snapshot              |
|  - check ban / rate limit     |     |  - chunk preload              |
|  - hold in limbo / queue      |     |  - authoritative TP           |
|  - hand off on auth-success   |     |  - publish metrics            |
+--------------+----------------+     +-------------------------------+
               |
               | hand-off event (UUID, target Location, source IP)
               v
```
In v2.0 both layers live in the Paper plugin and the hand-off is just a
local method call. In a future "v2-velo" both layers split: gatekeeper
runs as a Velocity plugin, restore stays on Paper, hand-off becomes a
plugin-message channel. No code outside those two layers needs to
change.
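A minimal sketch of that seam in plain Java (all names here are illustrative, not a committed API): the hand-off payload is an immutable record carrying exactly the fields named in the diagram, so swapping the local method call for a plugin-message channel later changes only the transport, not the types.

```java
import java.util.UUID;

public final class HandOff {
    /** Immutable hand-off payload: UUID, target location, source IP. */
    public record AuthHandOff(UUID uuid, String world,
                              double x, double y, double z, String sourceIp) {}

    /** Gatekeeper side: holds the player pre-auth, then emits a hand-off. */
    public interface Gatekeeper {
        void onAuthSuccess(AuthHandOff handOff);
    }

    /** Restore side: consumes the hand-off; Paper-only in v2.0. */
    public interface Restorer {
        void restore(AuthHandOff handOff);
    }
}
```

In v2.0 the `Gatekeeper` implementation simply calls `Restorer#restore` in-process; a "v2-velo" build would serialise the record onto a plugin-message channel instead.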
---
## 3. Queue model — login-throttle + transparent trust tiers, NO 2b2t-style sale
**For 0-10 player normal load: queue depth is always 0 and players
never see "queued" UI. The queue exists for crisis scenarios (bot
flood, restart drain, AuthMe DB stall) and to define explicit policy
even if it's rarely hit.**
### Policy
| Tier | Definition | Effect |
|------|-----------|--------|
| `staff` | Player has `authlimbo.queue.priority.staff` permission (LP-managed). | Always passes. Bypasses queue entirely. |
| `returning` | Player is in AuthMe DB AND has logged in within last 30 days. | Default tier for everyone who isn't new. Normal FIFO ordering by connect-time. |
| `new` | Player is NOT in AuthMe DB OR last seen >30 days ago. | Same FIFO as `returning` BUT with a per-IP 1/minute throttle. Stops bot-floods. |
| `flagged` | Player IP matches a Pi-hole/CrowdSec/abuse-DB block. | Rejected at gatekeeper, never enters the queue. |
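The tier table reduces to a single pure classifier. The sketch below uses boolean/timestamp stand-ins for the LuckPerms, AuthMe-DB, and block-list lookups (helper names are hypothetical; whether `flagged` is checked before `staff` is an open policy detail — this sketch rejects flagged IPs first):

```java
import java.time.Duration;
import java.time.Instant;

public final class TierPolicy {
    public enum Tier { STAFF, RETURNING, NEW, FLAGGED }

    /** Classify a connecting player per the tier table above. */
    public static Tier classify(boolean hasStaffPerm, boolean ipBlocked,
                                boolean inAuthMeDb, Instant lastLogin, Instant now) {
        if (ipBlocked) return Tier.FLAGGED;   // rejected at gatekeeper, never queued
        if (hasStaffPerm) return Tier.STAFF;  // bypasses queue entirely
        // returning = in AuthMe DB AND logged in within the last 30 days
        boolean recent = inAuthMeDb && lastLogin != null
                && Duration.between(lastLogin, now).toDays() <= 30;
        return recent ? Tier.RETURNING : Tier.NEW;
    }
}
```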
Hard rules — written into `V2-ARCHITECTURE.md` so they outlive any one
operator's mood:
1. **No paid priority. Ever.** No "priority queue pass", no
"supporter rank skip", no Patreon tier. The 2b2t community
collapsed under that grift; we don't repeat it.
2. **No hidden veteran tier.** Every tier is documented in this file
and in `/authlimbo queue policy` in-game. If a player can't see why
they're in tier X, the tier is illegitimate.
3. **No in-game bidding / griefing for queue spots.** Queue position
is purely connect-time + tier; no player action affects it.
4. **Ops-staff bypass is logged.** Every staff bypass writes a JSON-L
audit row.
### Capacity
- `gatekeeper.max-concurrent-auth: 5` — at most 5 players in the
pre-auth limbo at once. Defaults sized for racked.ru. AuthMe DB
reads + chunk pins per concurrent player are roughly free, but bound
it anyway.
- `gatekeeper.max-queue-depth: 50` — beyond 50 waiting, new
connections get a "server is starting up, try again in 30s" kick.
Better UX than a 5-minute black-screen wait.
- `gatekeeper.queue-timeout-seconds: 120` — anyone in the queue >2
minutes gets the same kick + a Discord webhook fires.
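In `config.yml` terms (a sketch — the key names follow the bullets above, which are themselves proposals):

```yaml
gatekeeper:
  max-concurrent-auth: 5      # players held in pre-auth limbo at once
  max-queue-depth: 50         # beyond this, kick with a retry-later message
  queue-timeout-seconds: 120  # queued longer than this: kick + webhook
```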
### What queue UX looks like
In limbo, a `BossBar` (Adventure API) shows tier + position:
```
[returning] Queue position: 3 / 7 ETA: ~15s
```
When position == 0 and AuthMe accepts, the bar disappears. There's no
hidden state. `/queue` in-chat re-displays the same info.
---
## 4. Privacy isolation
This is the original feature; v2 must not regress it.
### Limbo world
- Separate Bukkit world `auth_limbo`, `Environment.THE_END`,
`VoidGenerator`. Same as v1.
- `keepSpawnInMemory=true`. Game-rules: no daylight, no weather, no
mobs, no fire-tick, no PvP, `doImmediateRespawn=true`,
`keepInventory=true` (defence-in-depth — limbo never *should* see a
death event but if it does, no item drops happen).
- Per-player view-distance forced to 2 in limbo via Paper's
`Player#setViewDistance`. They see 5x5 chunks, all empty.
- Limbo platform: 5x5 of `BARRIER` blocks at y=127, single block of
`BARRIER` ceiling at y=129 to prevent flying out. y=0..126 and
y=130+ are pure void.
### Adventure-API audience scoping
Chat listener (`AsyncChatEvent` on modern Paper, which exposes the Adventure viewer set) at `EventPriority.HIGHEST`:
- If sender is in main worlds, recipient list is filtered: anyone
whose `World#getName().equals("auth_limbo")` is dropped. Pre-auth
players never see overworld chat.
- If sender is in limbo (would normally not chat — AuthMe blocks it
— but defence in depth), recipient list is set to *only* the
sender. They cannot leak messages to the main world.
- `PlayerJoinEvent` join messages are suppressed for
`auth_limbo`-spawn joins. Main world only sees a join announcement
*after* the authoritative restore TP succeeds (M2 §"join-message
shifting" below).
### Tablist scoping
Hook `PaperPlayerListEntryEvent` (or fall back to
`PlayerJoinEvent` + `Player#hidePlayer`):
- Limbo players are hidden from main-world tablist.
- Main-world players are hidden from limbo tablist.
- Limbo players cannot see each other (each limbo player sees only
themselves).
### What main world observers can detect
After scoping:
- They cannot see the player's name in tablist pre-auth.
- They cannot see chat from the player.
- They cannot see the player's world or coordinates (AuthMe blocks
movement output anyway, but we don't rely on it).
- They CAN see the connection event in server logs (operator-only).
- They can see "PLAYER joined the game" only AFTER restore succeeds
— join message is shifted to fire on restore-success, not on
initial connect.
This matches the v1 privacy posture and tightens the join-message
leak.
---
## 5. Login flow — explicit state machine
```
[CONNECT] ---throttle ok---> [GATE]
    |                          |
    | failed throttle / ban    |
    v                          v
[REJECTED]                [SNAPSHOT] <-- read AuthMe DB,
                               |         dump current inventory + xp + loc
                               v         to plugins/AuthLimbo/snapshots/<uuid>.nbt
                           [LIMBO]
                               |
                       AuthMe /login ok
                               |
                               v
                          [PRELOAD] <-- 3x3 chunk pin around target
                               |
                               v
                          [RESTORE] <-- teleportAsync, retry up to 3
                               |
                         +-----+-----+
                         |           |
                      success     fail x3
                         |           |
                         v           v
                      [LIVE]  [SPECTATOR-AT-LIMBO + admin alert]
```
Each transition has:
1. **Trigger event** (e.g. `LoginEvent` MONITOR).
2. **Pre-conditions** (e.g. UUID in `pendingTransit`).
3. **Side-effects** (e.g. metric counter, audit-log row).
4. **Failure handler** (next state on error).
States persist in `plugins/AuthLimbo/state/<uuid>.json` so a plugin
crash mid-flow can resume on rejoin. State file is deleted on
[LIVE] entry.
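The transition table implied by the diagram can be encoded as a small map of legal edges — which is also what the §10 unit test ("every transition rejects from invalid prev-state") would exercise. A sketch, with state names taken directly from the diagram:

```java
import java.util.Map;
import java.util.Set;

public final class LoginStateMachine {
    public enum State { CONNECT, GATE, REJECTED, SNAPSHOT, LIMBO,
                        PRELOAD, RESTORE, LIVE, SPECTATOR_AT_LIMBO }

    // Legal edges from the diagram above; anything else is rejected.
    private static final Map<State, Set<State>> EDGES = Map.of(
        State.CONNECT,  Set.of(State.GATE, State.REJECTED),
        State.GATE,     Set.of(State.SNAPSHOT),
        State.SNAPSHOT, Set.of(State.LIMBO),
        State.LIMBO,    Set.of(State.PRELOAD),
        State.PRELOAD,  Set.of(State.RESTORE),
        State.RESTORE,  Set.of(State.LIVE, State.SPECTATOR_AT_LIMBO));

    /** Advance, or throw if the transition is not an edge in the diagram. */
    public static State advance(State from, State to) {
        if (!EDGES.getOrDefault(from, Set.of()).contains(to))
            throw new IllegalStateException(from + " -> " + to);
        return to;
    }
}
```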
### Snapshot subsystem
**This is the operator-bug-survives-everything layer.**
- On `AuthMeAsyncPreLoginEvent` (player just connected, NOT yet
auth'd): if a player file `world/playerdata/<uuid>.dat` exists,
read it and shadow-copy to `plugins/AuthLimbo/snapshots/<uuid>.nbt`
with timestamp. SHA-256 of file content is logged.
- `/authlimbo restore <player>` can roll back any restore by
feeding the snapshot through nbtlib (same as the void-death recovery
protocol from `feedback_mc_tp_safety.md`).
- Snapshots retained 7 days, then GC'd. Configurable.
- On `PlayerDeathEvent` while UUID in `pendingTransit`:
`keepInventory=true`, `event.getDrops().clear()`, log SEVERE,
trigger Discord webhook, schedule restore-from-snapshot on respawn.
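The pre-login shadow-copy step could look like the following Bukkit-free sketch (paths are passed in by the caller; the timestamped `.nbt` filename scheme is an assumption, not the committed layout):

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.util.HexFormat;

public final class SnapshotCopy {
    /**
     * Shadow-copy a playerdata file into the snapshot dir and return the
     * SHA-256 of its content (which the design says gets logged).
     */
    public static String shadowCopy(Path playerData, Path snapshotDir) throws Exception {
        byte[] content = Files.readAllBytes(playerData);
        Files.createDirectories(snapshotDir);
        Path dest = snapshotDir.resolve(
            playerData.getFileName() + "." + System.currentTimeMillis() + ".nbt");
        Files.write(dest, content);  // never mutate the original .dat
        byte[] digest = MessageDigest.getInstance("SHA-256").digest(content);
        return HexFormat.of().formatHex(digest);
    }
}
```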
### Restore step (replaces v1's `doTeleport` + 10-tick delay)
1. Read saved location from AuthMe DB (cached from pre-login —
single in-memory hashmap keyed by UUID, evicted on transit clear).
2. Compute 3x3 chunk grid centred on saved location.
3. `addPluginChunkTicket` on all 9 chunks.
4. `CompletableFuture.allOf(getChunkAtAsyncUrgently x9)` — wait for
all 9 to actually be loaded, not just the centre one (closes the
"loaded but neighbour unloaded" race).
5. `teleportAsync(saved, PLUGIN)`. If `false`: F2 retry loop (already
in v1.1.0, carries over).
6. On success: 5-tick delay, then verify
`player.getLocation().distance(saved) < 2.0`. If not, treat as a
silent failure → retry.
7. Release tickets 5s post-success.
8. Mark transition [LIVE], publish `auth_login_success_total`
metric, write audit-log row, send delayed join-message to main
world, clear snapshot.
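Step 2's chunk grid is pure arithmetic — block coordinates shift right by 4 to give chunk coordinates. A sketch:

```java
import java.util.ArrayList;
import java.util.List;

public final class ChunkGrid {
    /** A chunk coordinate pair (block coords >> 4 in Minecraft). */
    public record ChunkPos(int x, int z) {}

    /** The 3x3 chunk grid centred on a saved block position (step 2). */
    public static List<ChunkPos> grid3x3(int blockX, int blockZ) {
        int cx = blockX >> 4, cz = blockZ >> 4;  // block -> chunk coords
        List<ChunkPos> grid = new ArrayList<>(9);
        for (int dx = -1; dx <= 1; dx++)
            for (int dz = -1; dz <= 1; dz++)
                grid.add(new ChunkPos(cx + dx, cz + dz));
        return grid;
    }
}
```

Each of the nine positions gets a plugin chunk ticket in step 3, and step 4 waits for all nine loads, not just the centre.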
### F8 — drop the SPECTATOR pre-TP trick
v1 considered "set GameMode.SPECTATOR before TP, revert after". v2
does NOT do this — spectator mode has its own client-side render races
on chunk-load and silently swallows damage events that the F1 guard
*needs to see*. Instead: invariant-driven recovery (snapshot + retry +
admin alert) is the safety net. SPECTATOR is the final fallback after
3 failed retries (F6 in v1, kept for v2).
---
## 6. Anti-drama checklist (2b2t lessons)
Codified up-front so future "monetisation" pressure is rejected by
reference, not by argument.
- [x] No pay-to-skip. Tier list above is the entire policy.
- [x] No hidden tier or undocumented bypass (staff bypass is logged).
- [x] No queue spot trading / selling.
- [x] No "queue position visible to others" — your position is only
visible to you. No social pressure surface.
- [x] Queue is purely FIFO + tier; no algorithm tweaks, no "lottery".
- [x] AGPL-3.0 means anyone can fork and self-host an alt
gatekeeper if they distrust ours. Operator-friendly.
- [x] Audit log is local-file JSON-L, not phoned home, not
centralised. Operator-readable, no hidden telemetry.
---
## 7. Operational surface
### Metrics (Prometheus)
Exposed via embedded HTTP server bound to `127.0.0.1:9091` (loopback
only — Prometheus on nullstone scrapes via localhost):
| Metric | Type | Labels |
|--------|------|--------|
| `authlimbo_connections_total` | counter | `tier`, `outcome={accepted, queued, rejected}` |
| `authlimbo_queue_depth` | gauge | — |
| `authlimbo_login_success_total` | counter | `tier` |
| `authlimbo_login_fail_total` | counter | `reason={timeout, authme_db, tp_failed_3x, ...}` |
| `authlimbo_void_damage_blocked_total` | counter | — |
| `authlimbo_snapshot_restored_total` | counter | — |
| `authlimbo_restore_duration_seconds` | histogram | `tier` |
Trip-wire alerts (configured server-side, in
`prometheus/alerts.yml`, not in the plugin):
- `authlimbo_login_fail_total{reason="tp_failed_3x"}` rate > 0 for 5m.
- `authlimbo_void_damage_blocked_total` rate > 0 for 1m.
- `authlimbo_queue_depth` > 10 for 5m.
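Those trip-wires translate into `prometheus/alerts.yml` rules roughly like this (thresholds from the list above; group and alert names are placeholders):

```yaml
groups:
  - name: authlimbo
    rules:
      - alert: AuthLimboTeleportFailures
        expr: rate(authlimbo_login_fail_total{reason="tp_failed_3x"}[5m]) > 0
        for: 5m
        labels: {severity: page}
        annotations:
          summary: "AuthLimbo restore TP failing repeatedly"
      - alert: AuthLimboQueueBackedUp
        expr: authlimbo_queue_depth > 10
        for: 5m
        labels: {severity: warn}
        annotations:
          summary: "AuthLimbo queue depth above 10 for 5m"
```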
### Discord webhooks
Plugin-side webhook fires on:
- Snapshot restored (gear was about to be lost).
- 3x retry give-up (manual `/authlimbo tp` needed).
- Queue depth > config threshold.
- AuthMe DB unreachable.
- Plugin reload / crash.
Webhook URL is in config, redacted from `/authlimbo dump`.
### Audit log
`plugins/AuthLimbo/audit.log` — JSON Lines, one row per state
transition. Fields: `ts`, `uuid`, `name`, `ip`, `tier`, `state`,
`prev_state`, `extra` (free-form JSON). Logrotate-compatible; rotates
at 100MB, keeps 7 files.
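One audit row could be serialised like this minimal sketch (hand-rolled JSON with only quote/backslash escaping — a real implementation would use a JSON library; the free-form `extra` field is omitted here):

```java
import java.time.Instant;
import java.util.UUID;

public final class AuditLog {
    /** Build one JSON-Lines audit row with the fields listed above. */
    public static String row(Instant ts, UUID uuid, String name, String ip,
                             String tier, String state, String prevState) {
        return String.format(
            "{\"ts\":\"%s\",\"uuid\":\"%s\",\"name\":\"%s\",\"ip\":\"%s\","
            + "\"tier\":\"%s\",\"state\":\"%s\",\"prev_state\":\"%s\"}",
            ts, uuid, esc(name), ip, tier, state, prevState);
    }

    private static String esc(String s) {  // player names shouldn't need this, but
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }
}
```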
### Reload-without-restart
`/authlimbo reload`:
- Re-reads `config.yml`.
- Drains in-flight transits to completion (no new joins accepted
during drain, max 30s wait).
- Re-binds metrics HTTP server if port changed.
- Re-creates limbo world if name/spawn changed.
- Discord webhook fires "reload completed in Xs".
---
## 8. Failure modes & recovery
| Failure | Detection | Recovery |
|---------|-----------|----------|
| Plugin crashes mid-restore | On startup, scan `state/*.json` files older than 30s. | For each: if player offline, leave snapshot; if online, treat as new transit, force re-restore from saved AuthMe loc. |
| Snapshot file corrupt / unreadable | NBT parse exception. | Fall back to AuthMe DB saved-loc; log SEVERE; webhook. Player may lose newest items but not entire inventory. |
| World save corrupts | Paper World#getChunkAtAsync throws. | After 3 retries: kick player with "server experiencing storage issue, try again in 5min"; webhook. |
| AuthMe DB unreachable | JDBC `getConnection` throws / read times out > 5s. | **Fail closed.** Reject connection at gatekeeper with kick: "auth service degraded". Log + webhook. Do NOT let player onto main world without auth. |
| Server `/stop` mid-login window | Paper shutdown hook. | `clearTransit` for all UUIDs, force-save snapshots, kick all limbo players with "server restarting, your gear is safe". |
| Race: AuthMe LoginEvent fires twice (HaHaWTH bug) | UUID already in `pendingTransit` and not in `RESTORE` state. | Idempotent — restore handler is a no-op if UUID is past [PRELOAD]. Log INFO. |
| Player disconnects in [LIMBO] | `PlayerQuitEvent`. | Clear pendingTransit + retry counter. Snapshot retained 7d. State file kept until snapshot GC. |
`fail-open` is never the right choice for an auth gatekeeper. Every
failure mode resolves to either: keep player in limbo, or kick them.
Never advance them to main-world unauth'd.
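The first recovery row (startup scan of `state/*.json` older than 30s) reduces to a directory walk with an mtime cutoff; a Bukkit-free sketch:

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

public final class StateRecovery {
    /** Collect state/<uuid>.json files last modified before the cutoff. */
    public static List<Path> staleStateFiles(Path stateDir, Instant cutoff)
            throws IOException {
        List<Path> stale = new ArrayList<>();
        if (!Files.isDirectory(stateDir)) return stale;  // nothing to recover
        try (DirectoryStream<Path> files = Files.newDirectoryStream(stateDir, "*.json")) {
            for (Path p : files) {
                if (Files.getLastModifiedTime(p).toInstant().isBefore(cutoff)) {
                    stale.add(p);
                }
            }
        }
        return stale;
    }
}
```

Each returned path then feeds the per-file decision in the table: player offline → leave the snapshot alone; player online → force re-restore from the saved AuthMe location.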
---
## 9. Migration from v1
In-place upgrade path (`v1.1.x` → `v2.0.0`):
1. Stop server.
2. Drop new jar in `plugins/`. v2 jar is not v1-compatible — old
`AuthLimbo-1.x.jar` must be removed.
3. v2 detects `plugins/AuthLimbo/config.yml` from v1 and rewrites it
to v2 schema, leaving a `config.v1.bak` backup.
4. v2 detects `auth_limbo` world dir on disk and re-uses it (no
recreation, no data loss).
5. AuthMe DB schema unchanged — v2 still treats `authme.db` as
read-only authoritative.
6. New: `plugins/AuthLimbo/snapshots/` and
`plugins/AuthLimbo/state/` directories created, owned by the same
uid as the itzg container's runtime user.
7. Start server. v2 startup logs walk through migration steps.
There is no DB migration. No mandatory player action. Permissions
node names change (`authlimbo.admin` is now
`authlimbo.command.admin`, etc.) — operator must update LP groups
(noted in CHANGELOG).
---
## 10. Test plan
### Unit (JUnit 5 + Mockito)
- `LimboWorldManager` — barrier-platform construction is idempotent.
- `AuthMeDatabase.getQuitLocation` — returns `Location` for present row,
null for absent, null for malformed row.
- `Snapshot.serialize` / `deserialize` round-trip.
- State-machine: every transition rejects from invalid prev-state.
### Integration (Paper test-server harness)
- Stand up Paper 1.21.x in CI (Forgejo Actions runner on nullstone).
- Mock AuthMe via a stub plugin that fires `AuthMeAsyncPreLoginEvent`
and `LoginEvent` programmatically.
- Test scenarios: §5.1-5.6 from `AUDIT-2026-05-07.md` plus
v2-specific: queue overflow, snapshot-restore on death,
reload-without-restart, fail-closed on AuthMe DB down.
### Stress (Bot flood)
- 1000 fake connections in 60s using mineflayer or
[`MCBotsPro`](https://github.com/Sammy1Am/MCBotsPro). Verify:
- queue-depth bounded (gatekeeper kicks beyond max-queue-depth);
- no `pendingTransit` leak (size returns to 0 after);
- metrics counters consistent with audit log.
### Chaos
- Kill plugin (`/plugman unload AuthLimbo`) mid-restore, verify
state recovery on rejoin.
- `iptables -A OUTPUT -d <authme-db-host> -j DROP` and verify
fail-closed.
- `kill -9` itzg container during transit, verify next-startup
walks `state/*.json` and recovers.
---
## 11. Versioning + release
- v2.0.0 = breaking redesign (this doc), AGPL-3.0 retained.
- v2.1.0 = polish (BossBar UX, /queue command, more metrics).
- v2.2.0 = Velocity-mode behind feature flag.
- v1.x = receives F3, F5, F6, F7 backports until racked.ru cuts over
to v2; then archived.
Naming coordination: when the codename migration completes
(onyx→obsidian, nullstone→bedrock per
`gravel-laptop-build/ROADMAP.md`), the racked.ru server moves to
bedrock. v2.0.0 must run under both naming schemes without config drift.
---
## 12. Open questions
- BossBar UI — does the operator want it visible to limbo players, or
silent? Default proposed: visible.
- Snapshot retention — 7 days is the proposed default. Storage cost
is ~1 KB/snapshot for vanilla inventories, up to ~50 KB for
shulker-stuffed players. 100 active players → ~5 MB max.
- Webhook destination — same Discord channel as `s8n-ru` server-status
alerts, or a new channel? Default proposed: same channel, prefixed
`[AuthLimbo]`.
- v2.2 Velocity migration — needs a separate design pass once
cobblestone or a second backend is real.
Sign-off pending operator review.