infra/AUDIT-2026-05-05.md


# 5-Agent Audit Report — 2026-05-05
Synthesis of 5 parallel agents covering: GitHub→Forgejo migration,
ai-lab structure, nullstone services, stack rating, recommended
additions.

Source agent outputs:
1. Migration agent → `nullstone-server/forgejo/migration-report-2026-05-05.md`
2. ai-lab structural audit
3. nullstone services + deployment audit
4. Stack rating (10 axes)
5. Recommended service additions
---
## TL;DR
- **GH → Forgejo migration: complete.** 6/6 repos mirrored
(5× s8n-ru/* + veilor-org/veilor-os). All HEADs match, branches
match, tags match, push-mirrors back to GH all green. Repaired one
default-branch metadata drift on `s8n-ru/x`. Zero failures.
- **Stack rating: 7/10.** Above-average self-hosted setup. Audit
discipline + identity/CA story unusually strong. Fragile on
monitoring + offsite backup + single-host.
- **Top 5 weaknesses (severity-ordered):** F4 no LUKS on nullstone
(regression), no monitoring/alerting, backups local-only with
silently broken script, `:latest` floats on most stacks, single
point of failure (nullstone + home WAN).
- **Top 5 services to add (priority):** Restic+autorestic, Vaultwarden,
Gatus, CrowdSec, Beszel.
- **Top 4 anti-recommendations:** Nextcloud, full LGTM stack, Mastodon,
HashiCorp Vault.
---
## 1 — GitHub repo migration
**Status: complete.** Per migration agent's report.
- 6 repos enumerated under `s8n-ru` user + admin'd orgs.
- 6 mirrored to `git.s8n.ru` (Forgejo); 5 fresh, 1 already pre-migrated
(`veilor-org/veilor-os`).
- HEADs / branches / tags match GH for all 6.
- Push-mirrors Forgejo → GH configured (8h interval + sync-on-commit),
all green.
- One repair: `s8n-ru/x` default branch was stuck on
`KisaragiEffective-patch-1` from Misskey upstream; PATCHed to
`master`.

Detail: `nullstone-server/forgejo/migration-report-2026-05-05.md`.

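
A minimal sketch of how the HEAD/branch/tag match could be re-checked later, assuming `git` is installed and both remotes are reachable. The helper names (`parse_ls_remote`, `diff_refs`, `ls_remote`) are hypothetical and not part of the migration agent's tooling. Refs outside `HEAD`, `refs/heads/` and `refs/tags/` are ignored because GitHub also advertises pull-request refs that a mirror may not carry. Calling `diff_refs(ls_remote(github_url), ls_remote(forgejo_url))` with the two clone URLs of one repo should return an empty dict when the mirror is in sync.

```python
import subprocess


def parse_ls_remote(output: str) -> dict:
    """Map ref name -> commit SHA from `git ls-remote` text output."""
    refs = {}
    for line in output.splitlines():
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        sha, ref = parts
        # Keep only HEAD, branches and tags; drop e.g. refs/pull/*.
        if ref == "HEAD" or ref.startswith(("refs/heads/", "refs/tags/")):
            refs[ref] = sha
    return refs


def diff_refs(source: dict, mirror: dict) -> dict:
    """Return refs whose SHA differs or that exist on only one side."""
    return {
        ref: (source.get(ref), mirror.get(ref))
        for ref in sorted(set(source) | set(mirror))
        if source.get(ref) != mirror.get(ref)
    }


def ls_remote(url: str) -> dict:
    """Query a remote for its refs (needs git and network access)."""
    result = subprocess.run(
        ["git", "ls-remote", url],
        check=True, capture_output=True, text=True,
    )
    return parse_ls_remote(result.stdout)
```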
---
## 2 — ai-lab structural audit
### Devices
| codename | type | OS | role |
|---|---|---|---|
| onyx | laptop | Fedora 43 KDE | Dev workstation (DHCP `.28`, registry says `.6` — drift) |
| nullstone | server | Debian 13 | Infra host — Docker stack, mesh, Matrix/Misskey/RC |
| office | workstation | Fedora 43 KDE (pending install since 2026-04-19) | Office/sales (.5) |

External: friend PC `100.64.0.3` (RTX 4080, vLLM in WSL2).
### Active projects (`_github/`)
| repo | purpose | status |
|---|---|---|
| veilor-os | Hardened Fedora 43 KDE remix | actively iterating, BlueBuild + kickstart |
| auth-limbo | Paper plugin (racked.ru AuthMe fix) | active, released jars |
| minecraft-launcher | Custom MC launcher (PrismLauncher fork) | active, v1 build script |
| minecraft-server | Purpur MC at `mc.racked.ru:25565` | live in prod |
| minecraft-client | racked.ru MC client (FO 11.3.2 fork) | active |
### Per-device security audit cadence
| device | last audit | folder |
|---|---|---|
| nullstone | 2026-05-05 (ACL hardening); full 2026-05-02 | `security/nullstone-server/` (9 reports) |
| onyx | 2026-04-15 | `security/onyx-laptop/` (2 reports) |
| office | never | `security/office-workstation/` (empty) |
### Memory record (31 files, 1 index)
- 2 user, 7 feedback, 1 reference, 21 project memos.
- Top-active: matrix_veilor, txt_cinny, x_misskey_fork, tailscale_mesh,
friend_gpu, org_charter, brand_separation, simplex_org_chat.
### What this lab is
The operator runs a small home-lab/3-member CTO-style org
(`P M=CTO, nullstone=Runtime Owner, onyx-ai=Research/Review`) split
cleanly across **two brands** (per `project_brand_separation.md`):
1. **racked.ru** — privacy-first Minecraft platform (MC server +
client + custom launcher + AuthLimbo plugin)
2. **veilor** — security company stack (veilor-os hardened Fedora
ISO, veilor-server-bootstrap Debian preseed, Matrix at veilor.uk,
Misskey-fork at x.veilor)

All self-hosted on nullstone behind Traefik+Headscale+Pi-hole. Mesh
includes friend's RTX 4080 for remote LLM inference via Tailscale.
### Drift / gaps
- `office-workstation/` registered in CLAUDE.md but install pending
since 2026-04-19; no audit folder populated.
- README onyx IP `.6` vs actual DHCP `.28`.
- README folder tree doesn't match real repo (lists `_project_code/`
+ `scripts/`; reality has `_github/`, `_projects/`, `_archive/`,
`archive/`, `github/`, several `.sync-conflict-*` files, 30 MB
binary `re` at root).
- Two parallel `nullstone-server/` and `server/` device folders —
drift candidate.
- `MEMORY.md` index missing entry for `project_forgejo_nullstone.md`
(file present, index not updated).
- Sync-conflict files for CLAUDE.md / README.md / SYSTEM.md from
Syncthing merge never resolved.
- SYSTEM.md still mentions Jitsi/coturn and the MAS Element X test,
both retired per `project_matrix_veilor.md`; TODO list not pruned.
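
The unresolved Syncthing conflicts can be listed with a short sketch like the one below. It assumes the conflict copies carry the `.sync-conflict-` marker in their file names, as the folder listing above reports; the function name is illustrative. Running `find_sync_conflicts(".")` from the repo root would return the paths to review and delete or merge by hand.

```python
from pathlib import Path


def find_sync_conflicts(root: str) -> list:
    """Return files under root whose name contains Syncthing's conflict marker."""
    return sorted(
        p for p in Path(root).rglob("*.sync-conflict-*") if p.is_file()
    )
```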
---
## 3 — nullstone services + deployment audit
### Hardware
- **CPU:** AMD Ryzen 5 2600X (6c/12t)
- **RAM:** 32 GiB (15 GiB used, 15 GiB free); swap 24 GiB (256 KiB used)
- **GPU:** GTX 1660 Ti 6 GB (Ollama)
- **Disk:** 477 GiB NVMe, LVM (`keystone-vg`):
- root 30 G (35% used)
- var 12 G (15%)
- **home 399 G (60%, 227 G used / 153 G free)** — watch growth
- tmp 2.7 G, swap 24 G
- **OS:** Debian 13, kernel 6.12.85+deb13
- **Docker:** v29.4.2, overlay2, **userns-remap=default**,
live-restore=true, icc=false, no-new-privileges=true. Data root
symlinked `/var/lib/docker → /home/user/docker-data`.
### Active services (28 containers)
Including: traefik, socket-proxy, authentik (server+worker+pg+redis),
forgejo + forgejo-runner, misskey + db + redis, x-source nginx,
rocketchat + mongodb, tuwunel + tuwunel-txt, cinny-txt, commet-web,
signup-page + signup-txt, livekit + lk-jwt-service, dl-veilor, pihole,
headscale, n8n + postgres, step-ca, filebrowser-mc, minecraft-mc,
anythingllm, plus 2 stale `alpine:3` shells from userns-host bypass.
### Domain → service map (`*.s8n.ru`, `*.veilor.uk`, plus `x.veilor` and `dl.veilor.org`)
`sys.s8n.ru` (traefik dash), `git.s8n.ru` (forgejo, NEW), `auth.s8n.ru`
(authentik), `pihole.s8n.ru`, `signup.txt.s8n.ru`, `hs.s8n.ru`
(headscale), `rc.s8n.ru` (rocketchat), `n8n.s8n.ru`, `txt.s8n.ru`
(cinny), `mx.s8n.ru` (tuwunel-txt), `x.veilor` (misskey),
`matrix.veilor.uk`, `chat.veilor.uk` (commet), `livekit.veilor.uk`,
`signup.veilor.uk`, `dl.veilor.org`.
### Deployment patterns
- Compose: `/opt/docker/<svc>/docker-compose.yml`
- Data: named docker volumes under
`/home/user/docker-data/100000.100000/volumes/` + per-service
bind mounts. Newer services (forgejo, forgejo-runner, minecraft)
on `/home/docker/<svc>/` to dodge 30 G root.
- userns-remap quirk: container UIDs are shifted by +100000 on the host.
Workaround: one-shot alpine root container, or chown to 101000
(container UID 1000 as seen from the host).
- Docker socket exposure: traefik does NOT mount docker.sock; goes
via tecnativa/docker-socket-proxy on socket-proxy-net.
- Networks: `proxy` + `socket-proxy-net` + `misskey-frontend` +
per-stack internals (authentik-internal, misskey-internal, etc.).
- Middleware chain: `trusted-only@file → security-headers@file
→ rate-limit@file → <service-specific>` with `no-guest@file`
for routers needing tailnet+LAN but blocking public.
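
The UID shift can be made concrete with a small sketch. It assumes the remap base is 100000, as the `100000.100000` volume path above suggests; the authoritative value is the `dockremap` entry in `/etc/subuid`, which this audit does not quote. The function name is illustrative.

```python
SUBUID_BASE = 100000  # assumption taken from the volume path; check /etc/subuid


def host_uid(container_uid: int, base: int = SUBUID_BASE) -> int:
    """Host-side UID that owns files written by container_uid under userns-remap."""
    if container_uid < 0:
        raise ValueError("UID must be non-negative")
    return base + container_uid


print(host_uid(0))     # container root on the host
print(host_uid(1000))  # the chown target mentioned above
```

Under that assumption, container root maps to host UID 100000 and container UID 1000 maps to 101000, which is why a bind mount owned by the host's own UID 1000 looks unwritable from inside the container.
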
### Auth patterns
- **Authentik (auth.s8n.ru)** — central OIDC, all 4 components healthy.
**Currently mostly unwired.** Forgejo runs native auth (no OAUTH
section in app.ini). RC, n8n, anythingllm, filebrowser likely
local-auth too. Authentik present but underused.
- **Forgejo** — local users + PAT, admin `s8n-ru`, SSH 222.
- **Headscale** — preauthkey enrollment + `headscale-deny-leaks@file`.
- **Traefik dashboard** — basicauth + trusted-only@file.
### Backup state
- `/etc/cron.d/docker-backup` runs `/opt/docker/backup.sh` at 02:00
daily, 7-day rotation to `/opt/backups/`.
- **Script silently broken (HIGH):** matrix-postgres container is
gone (Synapse retired); rocketchat-mongodb name mismatch (script
expects `mongodb`); Mongo password reads literal
`CHANGE_ME_MONGO_ADMIN_PASSWORD`. So Rocket.Chat + (former) Matrix
dumps **not happening**. Misskey side-script works.
- **No off-host replication.** Single NVMe = total loss on disk
failure.
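
Because the failure here was silent, a freshness check on the dump directory is one way to make it loud. The sketch below is an assumption-heavy illustration: the glob patterns and the 26-hour threshold are hypothetical, and the real file names written by `/opt/docker/backup.sh` are not shown in this audit.

```python
import time
from pathlib import Path


def check_dumps(backup_dir: str, patterns: list, max_age_hours: float = 26.0) -> list:
    """Return a list of problems; empty means every pattern has a fresh, non-empty dump."""
    problems = []
    now = time.time()
    for pattern in patterns:
        matches = sorted(Path(backup_dir).glob(pattern), key=lambda p: p.stat().st_mtime)
        if not matches:
            problems.append(f"{pattern}: no dump found")
            continue
        newest = matches[-1]  # most recently modified file for this pattern
        age_hours = (now - newest.stat().st_mtime) / 3600
        if newest.stat().st_size == 0:
            problems.append(f"{newest.name}: empty file")
        elif age_hours > max_age_hours:
            problems.append(f"{newest.name}: {age_hours:.0f} h old")
    return problems
```

Something like `check_dumps("/opt/backups", ["rocketchat-*", "misskey-*"])` could run after the 02:00 job and push any non-empty result to whatever alert channel exists; the patterns would need adjusting to the script's actual output names.
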
### Drift / risk register
- 🔴 Backup script broken (RC + ex-Matrix not dumping)
- 🔴 `anythingllm` listens 0.0.0.0:3001 with no traefik label,
bypasses entire L7 trust model. Either bind LAN-only or front via
traefik.
- 🟠 Resource limits: only minecraft-mc has memory/CPU limits.
All other containers are unbounded, so a single runaway can OOM-kill its neighbours.
- 🟠 No service-level health checks on ~half the containers.
- 🔴 `no-guest@file` IPAllowList stub: declares only
`sourceRange: ["127.0.0.0/8"]`. Routers chained with `no-guest`
reject everything except loopback unless XFF restores client IP.
**Verify** entryPoint forwardedHeaders.trustedIPs + middleware
ipStrategy.depth — misconfig either 403s real users or accepts
spoofed XFF.
- 🟡 office (100.64.0.4) not in `trusted-only@file` despite
`tag:infra` per SYSTEM.md.
- 🟠 RocketChat: first-admin setup still pending — wizard endpoint
takeover risk until claimed.
- 🟡 Stale `alpine:3` shell containers (userns-host bypass leftovers).
`docker rm -f` after each one-shot.
- 🟡 Archived compose dirs (`pocket-id.archived-*`, `matrix-old`)
contain secrets — move off docker tree.
- 🟡 `/home` 60% with growing volumes (Ollama, mongo, postgres ×3).
No quotas.
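
The `no-guest@file` item is easier to reason about with a small model of the two knobs involved. The sketch below is not Traefik code. It mimics, as far as I understand the documented behaviour, how `ipStrategy.depth` picks an address from `X-Forwarded-For` (counting from the right, with a depth of 0 or less falling back to the socket address) and how a `sourceRange` list is matched. The LAN range `192.168.0.0/16` is a hypothetical stand-in; `100.64.0.0/10` is assumed for the tailnet because the addresses quoted in this report fall inside it.

```python
import ipaddress
from typing import Optional


def effective_client_ip(remote_addr: str, xff: str, depth: int) -> Optional[str]:
    """Model of depth-based client IP selection from X-Forwarded-For."""
    if depth <= 0:
        return remote_addr  # depth ignored: use the TCP peer address
    hops = [h.strip() for h in xff.split(",") if h.strip()]
    if depth > len(hops):
        return None  # not enough entries: no usable client IP
    return hops[-depth]  # depth=1 is the rightmost entry


def is_allowed(ip: Optional[str], source_ranges: list) -> bool:
    """True if ip falls inside any CIDR in source_ranges."""
    if ip is None:
        return False
    addr = ipaddress.ip_address(ip)
    return any(addr in ipaddress.ip_network(cidr) for cidr in source_ranges)


STUB = ["127.0.0.0/8"]                              # what the audit found
WIDER = STUB + ["192.168.0.0/16", "100.64.0.0/10"]  # hypothetical LAN + tailnet

# A client sends a forged header; the proxy appends the real peer address.
xff = "100.64.0.4, 203.0.113.9"
print(is_allowed(effective_client_ip("127.0.0.1", xff, 1), WIDER))  # real IP is checked
print(is_allowed(effective_client_ip("127.0.0.1", xff, 2), WIDER))  # forged IP is checked
```

With depth 1 the real public address is evaluated and rejected; with depth 2 the forged tailnet address is evaluated and accepted. That is the spoofing half of the risk. The 403 half is the stub itself: `is_allowed("100.64.0.4", STUB)` is false.
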
### Mem pressure: none right now
Top consumer minecraft 9.35 / 18 GiB cap (52% of cap, ~30% host).
All others < 2.2%. Headroom good.

---
## 4 — Stack rating (10 axes)
| Axis | Score | Top weakness |
|---|---|---|
| Architectural coherence | 8 | Drift artifacts (sync-conflict files, parallel `_archive`/`archive`) |
| Security posture | 7 | F4 no LUKS on server (regression); F30 ip_forward=1; F12 partial revert |
| Reproducibility | 6 | Most stacks on `:latest`; no IaC; admin bootstrap uncoded |
| Operational maturity | **4** | **No metrics/alerts; backups untested; on-call="user reads logs"** |
| Cost discipline | 9 | Single residential ISP + single home server is "cheap because fragile" |
| Threat model clarity | 6 | No written THREAT_MODEL.md; AGPL §13 source endpoint deferred |
| Update hygiene | 5 | `:latest` floats; no staged rollout; recovery = "edit compose, restart" |
| Documentation quality | 8 | SYSTEM.md is 979 lines; CV + team-msg.txt + sync-conflicts in repo root |
| Network resilience | 5 | Single residential WAN; control + data plane same box; no Tor/SOCKS fallback |
| Branding/product discipline | 9 | "X" rebrand close to veilor — easy to confuse in logs/docs |
### Overall: **7/10**
Above-average self-hosted stack. Better-documented than 90% of
homelabs, with audit discipline most small SaaS shops don't reach,
and a coherent identity/CA story (own root CA via step-ca, own VPN
control plane via Headscale, own Matrix homeserver). Loses points on
operational maturity (no monitoring, no offsite/tested backups, no
rollback), one critical regression (no LUKS on nullstone), and
inherent fragility from single-host single-ISP design.

The gap between **known weaknesses** and **fixed weaknesses** is the
limiting factor: the operator clearly *can* fix these (the audit closed
27/35 findings in 3 days) but hasn't yet.
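
The report does not say how the overall figure is derived from the ten axes. As a sanity check only, the sketch below assumes an unweighted mean of the scores in the table above, which is my assumption rather than the rating agent's stated method.

```python
from statistics import mean

AXIS_SCORES = {
    "Architectural coherence": 8,
    "Security posture": 7,
    "Reproducibility": 6,
    "Operational maturity": 4,
    "Cost discipline": 9,
    "Threat model clarity": 6,
    "Update hygiene": 5,
    "Documentation quality": 8,
    "Network resilience": 5,
    "Branding/product discipline": 9,
}

average = mean(AXIS_SCORES.values())
weakest = min(AXIS_SCORES, key=AXIS_SCORES.get)
print(f"{average:.1f} -> {round(average)}/10, weakest axis: {weakest}")
```

The unweighted mean is 6.7, which rounds to the stated 7/10, and the lowest axis is operational maturity, consistent with the narrative above.
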
### Comparison
- vs **Stock Fedora desktop + GitHub:** wins decisively (8 vs 3) on
network/identity/AGPL discipline.
- vs **secureblue + GH Actions:** stronger on server-side sovereignty;
weaker on client posture and CI. Roughly tied overall, different axes.
- vs **Hetzner-VPS hobbyist stack:** loses on resilience + update
hygiene, wins on cost + GPU inference + identity depth. This stack
more ambitious; Hetzner more boring-and-reliable.
- vs **Cloudflare/Workers managed:** wins on sovereignty + GPU + Matrix
ability. Loses on uptime + DDoS + zero-patching. This stack's whole
reason to exist is the inverse tradeoff — and it makes that tradeoff
coherently.
---
## 5 — Recommended service additions
### Top 5 priority (deploy in this order)
| # | Service | Why now | Effort | Maintenance |
|---|---|---|---|---|
| 1 | **Restic + autorestic** | Single biggest gap. nullstone NVMe failure = total loss right now. Encrypted incremental to B2/Wasabi or to onyx. | M | S |
| 2 | **Vaultwarden** | N services with N storage methods for secrets. Centralize before count grows. | S | S |
| 3 | **Gatus** | Otherwise you find out about a downed service from a friend on Matrix. Cert-expiry alone catches the silent killer. Alerts via Tuwunel webhook or ntfy. | S | S |
| 4 | **CrowdSec** | Pi-hole only sees DNS layer. Public Matrix fed candidates + RC + Misskey + signup pages = HTTP attack surface. Bouncer plugin blocks at Traefik. | M | S |
| 5 | **Beszel** | Once Restic is filling disk + CrowdSec flagging IPs, you want one dashboard. | S | S |
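
To illustrate what the cert-expiry check in row 3 boils down to, here is a minimal sketch using only the Python standard library. It is not how Gatus is configured and not a replacement for it; the function names are hypothetical. Note one assumption: `ssl.create_default_context()` verifies against the system trust store, so certificates issued by the private step-ca root would fail the handshake unless that root is installed or loaded into the context.

```python
import socket
import ssl
import time
from typing import Optional


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Days left before a certificate's notAfter timestamp (getpeercert format)."""
    expires = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return (expires - current) / 86400


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Open a TLS connection and return the peer certificate's notAfter field."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```

A cron job could call `days_until_expiry(fetch_not_after("git.s8n.ru"))` for each public hostname and alert below some threshold such as 14 days; the threshold is an arbitrary example.
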
### Anti-recommendations
| Service | Why NOT |
|---|---|
| **Nextcloud** | Heavy (1.5 GB+ RAM idle), notorious upgrade pain. Use Seafile if you need files. |
| **Full LGTM stack** (Grafana+Prom+Loki+Alertmanager) | Four or more services to do what Beszel+Gatus do for a solo operator. |
| **Mastodon** | You already run Misskey-fork. Federating two ActivityPub silos doubles moderation. |
| **HashiCorp Vault** | Complexity-to-benefit ratio terrible for one operator. Infisical or pass-with-git enough. |
| **Authelia** | Duplicates Authentik. Pick one. |
### Consolidation suggestions
- **Cinny + various Element/Commet forks:** pick **one** web client
per Matrix instance. Each fork = separate audit + CSP + branding burden.
- **n8n:** if only used for 2-3 simple flows, replace with shell
scripts in Forgejo Actions cron. n8n's value is the GUI for
non-coders; you're a coder.
- **Step-CA + Let's Encrypt:** confirm zero overlap. If step-ca only
issues one cert, kill it.
- **dl-veilor + signup pages:** if static, fold into single Caddy
container behind Traefik. Two containers for static HTML is two
too many.
### Other notable picks (lower priority)
- **Seafile CE** — file sync (much lighter than Nextcloud)
- **Karakeep** (formerly Hoarder) — bookmarks/RSS/read-later, AI tags
via your local Ollama / friend RTX 4080
- **ntfy** — formalize the push-notification target you're already
using ad-hoc
- **Forgejo Packages** — already implicit, just enable for container
registry + npm/cargo/maven/generic
---
## 6 — Action items (severity-ordered)
### Ship-blocking (do this week)
1. **Fix `/opt/docker/backup.sh`** — remove dead matrix-postgres,
correct rocketchat-mongodb container name, replace literal
`CHANGE_ME_MONGO_ADMIN_PASSWORD`. Verify that the next 02:00 run produces
non-empty Rocket.Chat and Mongo dump files.
2. **Bind anythingllm to LAN-only** OR add traefik front with
`no-guest@file`. Currently listens on `0.0.0.0:3001` with no traefik label.
3. **Verify `no-guest@file` ACL** — confirm `sourceRange` covers
LAN + tailnet, not just loopback. Verify XFF chain restores
real client IP.
4. **Claim RocketChat first-admin** — takeover risk until then.
5. **Enable LUKS2 on nullstone** (F4 regression) — schedule reinstall
window with TPM2 unlock; or until then, LUKS-on-file loopback
for step-ca root key + acme.json + Mongo keyfile.
### High-value next (do this month)
6. Deploy **Restic + autorestic** with B2/Wasabi target + restore drill.
7. Deploy **Vaultwarden** + migrate secrets out of `.env` files.
8. Deploy **Gatus** with cert-expiry checks + Matrix/ntfy alerts.
9. Resolve **sync-conflict files** at ai-lab repo root.
10. **Pin docker images by digest** for critical stacks (already done
for Misskey; do tuwunel/livekit/cinny/pihole/RC/Traefik next).
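
For item 10, a quick way to see which stacks still float is to scan the compose files for `image:` references that lack a digest. The sketch below is a regex-based approximation, not a YAML parser, and the sample compose text is invented for illustration (including the placeholder digest); real files under `/opt/docker/<svc>/docker-compose.yml` may use forms it does not handle, such as variable interpolation.

```python
import re

# Capture the image reference after "image:", tolerating optional quotes.
IMAGE_LINE = re.compile(r"^\s*image:\s*[\"']?([^\s\"'#]+)", re.MULTILINE)


def floating_images(compose_text: str) -> list:
    """Return image references that are not pinned by digest."""
    return [
        ref for ref in IMAGE_LINE.findall(compose_text)
        if "@sha256:" not in ref
    ]


SAMPLE = """
services:
  traefik:
    image: traefik:latest
  misskey:
    image: "misskey/misskey@sha256:0123abcd"
  pihole:
    image: pihole/pihole
"""

print(floating_images(SAMPLE))
```

On the sample this reports `traefik:latest` and the untagged `pihole/pihole`, and skips the digest-pinned entry.
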
### Defer / planned
- Office workstation install + first audit
- Fold dl-veilor + signup pages into single Caddy
- Replace n8n with Forgejo Actions cron (if usage <5 flows)
- Move Headscale + step-ca to $4/mo VPS for SPOF mitigation
---
## 7 — File index
| Output | Path |
|---|---|
| This synthesis | `~/ai-lab/nullstone-server/audit-report-2026-05-05.md` |
| Migration detail | `~/ai-lab/nullstone-server/forgejo/migration-report-2026-05-05.md` |
| Forgejo runbook | `~/ai-lab/nullstone-server/forgejo/deploy-runbook.md` |
| Forgejo memory | `~/.claude/projects/-home-admin-ai-lab/memory/project_forgejo_nullstone.md` |
| veilor-os strategy | `~/ai-lab/_github/veilor-os/docs/STRATEGY.md` |
| veilor-os roadmap | `~/ai-lab/_github/veilor-os/docs/ROADMAP.md` |
| veilor-os threat model | `~/ai-lab/_github/veilor-os/docs/THREAT-MODEL.md` |