infra/runbooks/MIGRATION-nullstone-to-cobblestone.md


<!--
Migration runbook: nullstone → cobblestone
Audience: P M (operator), nullstone Runtime Owner.
Status: DRAFT — pre-cutover. Read sections 1-3 first; sections 4-7 are
executed only on cutover day.
Source-of-truth audits referenced:
- ~/ai-lab/SYSTEM.md
- ~/ai-lab/nullstone-server/audit-report-2026-05-05.md
- ~/ai-lab/nullstone-server/forgejo/deploy-runbook.md
Last updated: 2026-05-06
-->
# Migration runbook — nullstone → cobblestone
Goal: relocate the Docker stack (~28 containers, ~227 GiB state) from
**nullstone** (Debian 13, 192.168.0.100, AMD Ryzen 5 2600X / 32 GiB /
477 GiB NVMe, no LUKS) to **cobblestone** (Debian, fresh, LAN, hardware
TBD by operator), and close audit regression **F4 (no LUKS at rest)**
in the same window.
This runbook is read-only on both hosts until cutover (section 4).
Sections 1-3 are inventory + planning; section 4 is the destructive
cutover; sections 5-7 are follow-through.
## Things we don't know about cobblestone yet — operator to fill in
| Question | Why it matters | Default if unset |
|---|---|---|
| CPU model / cores / threads | Sizing for parallel postgres + Ollama + MC | Assume ≥ Ryzen 5 2600X parity |
| RAM | nullstone's 32 GiB peaks at ~50 % utilization; anything less means trimming MC + Ollama | Require ≥ 32 GiB |
| Storage layout (LVM? ZFS? plain?) | Decides LUKS strategy in 3a | Assume single NVMe, plain ext4 |
| GPU present (any) | Ollama / vLLM / Misskey thumb GPU helpers | Assume none, leave Ollama on friend RTX 4080 |
| LUKS already enabled at install? | If no → reinstall window or LUKS-on-file fallback | Assume **no** (act accordingly) |
| Static IP allocated? | Cutover plan needs a parking IP | Assume DHCP, target `.101` for cutover |
| DE installed? | Strip vs keep debate | Confirmed installed; default = strip |
| User account name + uid | Bind-mount permissions on /home/docker | Assume `user`, uid 1000 (mirror nullstone) |
Update this table before running section 3.
---
## 1 — Pre-migration audit (run on nullstone)
All commands read-only. SSH as `user@192.168.0.100`
(per `feedback_nullstone_ssh_user.md`; `admin@` is rejected).
### 1.1 Container inventory
```bash
ssh user@192.168.0.100 'docker ps -a --format "{{json .}}"' \
> nullstone-containers-$(date +%F).jsonl
ssh user@192.168.0.100 'docker inspect $(docker ps -aq)' \
> nullstone-inspect-$(date +%F).json
```
Parse for `Names`, `Image`, `Mounts[].Source`, `NetworkSettings.Networks`,
`HostConfig.RestartPolicy`, `Config.Labels` (Traefik routers).
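A jq sketch of that extraction, shown against a stub record shaped like one element of the inspect dump (the real `nullstone-inspect-<date>.json` has ~28 entries; the stub values here are illustrative, not live data):

```bash
# Stub standing in for nullstone-inspect-<date>.json - one container record.
cat > /tmp/inspect-stub.json <<'EOF'
[{"Name": "/traefik",
  "Config": {"Image": "traefik:v3.0", "Labels": {"traefik.enable": "true"}},
  "HostConfig": {"RestartPolicy": {"Name": "unless-stopped"}},
  "Mounts": [{"Source": "/opt/docker/traefik/data"}],
  "NetworkSettings": {"Networks": {"proxy": {}}}}]
EOF
# One TSV row per container: name, image, restart policy, mounts, networks.
jq -r '.[] | [ .Name,
               .Config.Image,
               (.HostConfig.RestartPolicy.Name // "none"),
               ([.Mounts[].Source] | join(",")),
               (.NetworkSettings.Networks | keys | join(",")) ] | @tsv' \
  /tmp/inspect-stub.json
```

Run against the real dump, this gives one greppable row per container to sanity-check against the section 3d table.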
### 1.2 Volumes (size estimate)
```bash
ssh user@192.168.0.100 'docker volume ls --format "{{.Name}}"' \
| xargs -I {} ssh user@192.168.0.100 \
"docker run --rm -v {}:/v alpine du -sh /v 2>/dev/null | sed 's|/v|{}|'"
```
Cross-reference with `/home/user/docker-data/100000.100000/volumes/`
(userns-remapped path) for per-volume bytes.
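Feeding the loop's output through GNU `sort -h` gives a largest-first transfer order for section 4; a stub list stands in for the real `du -sh` output here:

```bash
# Stub volume-size list in du -sh style "SIZE<TAB>name" form (illustrative).
printf '1.5M\tpihole_etc\n12G\tminecraft_data\n850K\ttraefik_acme\n' \
  > /tmp/vol-sizes.txt
# -h understands K/M/G suffixes; -r puts the biggest volumes first.
sort -rh /tmp/vol-sizes.txt
```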
### 1.3 Network
```bash
ssh user@192.168.0.100 'docker network ls; \
ss -tlnp 2>/dev/null | grep LISTEN; \
iptables-save 2>/dev/null; nft list ruleset 2>/dev/null'
```
Capture Traefik vhosts:
```bash
ssh user@192.168.0.100 'cd /opt/docker/traefik && \
ls dynamic/; cat dynamic/*.yml | grep -E "rule:|sourceRange:"'
```
### 1.4 Cron + scheduled tasks
```bash
ssh user@192.168.0.100 'sudo cat /etc/crontab /etc/cron.d/* 2>/dev/null; \
for u in $(cut -d: -f1 /etc/passwd); do \
crontab -u $u -l 2>/dev/null && echo "(user $u)"; done'
```
Known: `/etc/cron.d/docker-backup` runs `/opt/docker/backup.sh` daily at
02:00 — **broken** (F-backup-1, fix in section 5).
### 1.5 Systemd
```bash
ssh user@192.168.0.100 'systemctl list-unit-files \
--state=enabled --type=service --no-pager'
```
Watch for: `docker.service`, `tailscaled.service`, `ollama.service`
(Ollama runs on host, not in Docker), `chrony.service`, `ssh.service`.
### 1.6 Disk + memory + cpu baseline
```bash
ssh user@192.168.0.100 'df -hT; \
sudo du -sh /home/docker/* /opt/docker/* /opt/backups 2>/dev/null; \
free -h; lscpu | head -20; nproc'
```
Reference (2026-05-06 spot check):
`/` 30 G (37 %) · `/var` 12 G (17 %) · `/home` 399 G (60 %, 226 G used).
Most state is on `/home`.
### 1.7 Daemon config
```bash
ssh user@192.168.0.100 'cat /etc/docker/daemon.json /etc/subuid /etc/subgid; \
sudo cat /etc/systemd/system/docker.service.d/override.conf 2>/dev/null'
```
Known good (carry forward except possibly userns-remap, see 3c):
```json
{
"log-driver": "json-file",
"log-opts": {"max-size": "10m", "max-file": "3"},
"live-restore": true,
"icc": false,
"userns-remap": "default",
"default-address-pools": [{"base": "172.20.0.0/16", "size": 24}],
"storage-driver": "overlay2",
"no-new-privileges": true
}
```
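Whichever variant lands on cobblestone (see 3c), syntax-check it before restarting dockerd — a malformed `daemon.json` keeps the daemon from starting at all. A minimal gate, shown on a scratch copy (`userns-remap` omitted per the Path 1 default in 3c):

```bash
# Scratch copy of the candidate config (real target: /etc/docker/daemon.json).
cat > /tmp/daemon.json <<'EOF'
{
  "log-driver": "json-file",
  "log-opts": {"max-size": "10m", "max-file": "3"},
  "live-restore": true,
  "icc": false,
  "default-address-pools": [{"base": "172.20.0.0/16", "size": 24}],
  "storage-driver": "overlay2",
  "no-new-privileges": true
}
EOF
# jq -e exits non-zero on a parse error or a false result, so this doubles
# as a syntax check and a spot check of the options that matter most.
jq -e '."live-restore" == true and .icc == false' /tmp/daemon.json >/dev/null \
  && echo "daemon.json OK"
```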
---
## 2 — Secret + state catalog
Anything in this table that is **lost** or **corrupted** during transfer
forces re-issuance / re-pinning / re-handshake. Group by criticality.
### Tier 0 — irreplaceable (lose this and external systems break)
| Path | Bytes (est.) | Restore cost if lost |
|---|---|---|
| `/opt/docker/step-ca/data/secrets/` + `/opt/docker/step-ca/.env` | < 1 MiB | Re-issue every internal cert; reinstall `veilor-root.crt` on every device that uses `*.veilor` / internal-CA chains. Hard. |
| `/opt/docker/traefik/data/acme.json` (LE prod) | < 1 MiB | Hits LE rate-limit (5 dupe certs/wk per FQDN, 50 certs/wk per registered domain). Could lock cert issuance for a full week. |
| `/opt/docker/traefik/data/acme-internal.json` (step-ca chain) | < 1 MiB | Step-ca re-issues fast, but every leaf reissue invalidates pinned trust anchors. |
| `/opt/docker/headscale/config/private.key` + `/opt/docker/headscale/data/db.sqlite` | < 50 MiB | Loss = every node re-enrolls; preauthkeys, routes, ACLs reset. Friend GPU node identity churn. |
| `/etc/ssh/ssh_host_*` | < 1 MiB | Either copy → TOFU pinning stays intact, OR rotate → all clients hit a "key changed" warning (acceptable but noisy). |
### Tier 1 — application secrets (loss → password reset cascade)
| Path | Bytes (est.) | Notes |
|---|---|---|
| `/opt/docker/forgejo/data/gitea/conf/app.ini` (note: file is `app.ini` under `gitea/conf/` even on Forgejo) | ~10 KiB | `SECRET_KEY`, `INTERNAL_TOKEN`, `JWT_SECRET`, `LFS_JWT_SECRET`, OAuth client secrets. |
| `/opt/docker/authentik/.env` + authentik PG dump | tens of MiB | `AUTHENTIK_SECRET_KEY`, `PG_PASS`. Any service trusting Authentik OIDC needs `client_secret` re-handover. |
| `/opt/docker/misskey/.env` + misskey PG dump | < 1 MiB env | `id`, `db.user/pass`, `redis.pass`, master key. |
| `/opt/docker/n8n/.env` + n8n PG dump | < 1 MiB env | Encryption key for credentials at rest; **lose this and stored creds inside n8n flows are unrecoverable**. |
| `/opt/docker/rocketchat/.env` + Mongo dump (currently stopped — see 4.1) | < 1 MiB env | First-admin still unclaimed (audit risk item). |
| `/opt/docker/tuwunel*/etc/tuwunel.toml` | < 1 MiB | Server signing key seed; lose = federation re-onboard from zero. |
| `/opt/docker/livekit/livekit.yaml` | < 1 KiB | `keys:` map (api-key → secret); JWT minter (`lk-jwt-service`) shares this. |
| `/opt/docker/pihole/etc-pihole/` | ~50 MiB | Adlists + custom DNS; rebuildable in 30 min if lost. |
| Gandi PAT (`GANDIV5_PERSONAL_ACCESS_TOKEN` in `/opt/docker/traefik/.env`) | <1 KiB | Re-issuable from Gandi UI; LiveDNS-only scope (per `reference_gandi_api.md`). |
| Tailscale auth keys (Headscale) | n/a | OK to lose; regenerate via `headscale preauthkeys create`. |
### Tier 2 — bulk data (large, but reproducible OR low-stakes)
| Path | Bytes (est.) | Notes |
|---|---|---|
| Misskey `/files/` (S3-style local) | tens of GiB | User uploads — irreplaceable to users. Dedup-friendly. |
| Forgejo `/home/docker/forgejo/data/git/` | ~5 GiB now | Git repos; also mirrored to GH per `project_forgejo_nullstone.md`, so partial DR exists. |
| `dl-veilor` static files | ~1 GiB | Public ISO downloads; rebuildable from veilor-os pipeline. |
| n8n flows (in `n8n_n8n_data`) | < 1 GiB | Encrypted with key from Tier 1; export JSON via UI as belt-and-braces. |
| Minecraft world (`/home/docker/minecraft/data/`) | ~10-30 GiB | Players will riot if lost. |
| Ollama models (`/home/user/models/ollama/`) | ~17 GiB | Re-downloadable from registry; not blocking. |
| Postgres dumps (authentik, misskey-db, n8n-postgres) | covered by `pg_dumpall` in 4.1 | |
| MongoDB dump (rocketchat-mongodb) | covered by `mongodump` in 4.1 | Container is **stopped** today — start, dump, stop. |
### Tier 3 — config-as-code (safely re-deployable from `~/ai-lab/_github/`)
- All `/opt/docker/*/docker-compose.yml` — committed under
`~/ai-lab/_github/infra/repos/` and `~/ai-lab/nullstone-server/`.
- Traefik `dynamic/*.yml` middleware files.
- Treat as authoritative in repo; copy from repo to cobblestone, not
from nullstone. Diff old-compose vs repo-compose during section 3d to
catch any uncommitted drift.
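The drift diff in that last bullet can follow this shape — shown here on stub files; on the day, the "live" side comes from `ssh user@192.168.0.100 cat /opt/docker/<svc>/docker-compose.yml` and the "repo" side from the checkout named above:

```bash
mkdir -p /tmp/drift
# Stub "live" and "repo" copies of one service's compose file (illustrative).
printf 'services:\n  app:\n    image: demo:1\n' > /tmp/drift/live.yml
printf 'services:\n  app:\n    image: demo:2\n' > /tmp/drift/repo.yml
# Repo copy is authoritative, so diff repo (expected) against live (actual);
# any output is uncommitted drift to reconcile before redeploying from repo.
if diff -u /tmp/drift/repo.yml /tmp/drift/live.yml; then
  echo "clean"
else
  echo "DRIFT - reconcile before section 3d"
fi
```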
---
## 3 — Cobblestone install plan
### 3a — OS layer
Verify base:
```bash
ssh user@cobblestone 'cat /etc/debian_version; uname -r; lsb_release -a'
```
**LUKS2 (mandatory — closes F4):**
- **Path A (preferred):** reinstall with full-disk LUKS2 from the
Debian installer (`/`, `/home`, swap all on encrypted PVs). Set up
TPM2 unattended unlock post-install:
```bash
systemd-cryptenroll --tpm2-device=auto --tpm2-pcrs=0+7 /dev/nvmeXnYpZ
```
PCR 0+7 binds to firmware + secure-boot state; a firmware or
secure-boot change breaks unattended unlock → fall back to passphrase.
- **Path B (fallback if reinstall blocked):** LUKS-on-file loopback
for the high-value subset only:
- `/opt/docker/step-ca/`
- `/opt/docker/traefik/data/acme*.json`
- `/opt/docker/headscale/`
- postgres data dirs
- Mongo keyfile volume
This is **strictly worse** than Path A (rest of disk still
cleartext, including misskey uploads and forgejo repos), but it
closes the highest-value subset. Document as accepted risk.
Hostname + base packages:
```bash
sudo hostnamectl set-hostname cobblestone
sudo apt update && sudo apt install -y \
curl ca-certificates gnupg jq ufw fail2ban chrony \
rsync restic tmux htop iotop ncdu
```
**DE strip vs keep — recommendation: STRIP.**
Cost of keeping: ~500 MiB RAM, ~5 GiB disk, larger attack surface
(CUPS, avahi, polkit, GUI daemons on localhost). Benefit: local
browser for vhost testing, on-keyboard recovery if SSH wedges.
- **Default (strip):** `sudo apt purge '*-desktop' '*xorg*' lightdm
sddm gdm3 'plymouth*' 'libreoffice-*' && sudo apt autoremove --purge`.
Install Cockpit for web admin behind Traefik + `no-guest@file`.
- **Keep:** lock SDDM/GDM local-only via PAM, disable XDMCP, mask
`cups-browsed`. No auto-login.
Operator picks; document choice in SYSTEM.md.
### 3b — Network
**IP allocation during cutover** — use `192.168.0.101` for
cobblestone while nullstone stays on `.100`. Flip DNS / port-forwards
last (section 4.6). Avoids ARP collisions and keeps rollback trivial.
**nftables ruleset** (mirror nullstone pattern — read live ruleset off
nullstone in 1.3, replay on cobblestone):
```bash
sudo systemctl enable --now nftables
# Drop in /etc/nftables.conf with:
# - default policy drop on input
# - accept established/related
# - accept lo
# - accept 22 (SSH) from LAN + tailnet
# - accept 80/443 (Traefik) from anywhere
# - accept 222 (Forgejo SSH) from LAN + tailnet
# - accept 25565 (Minecraft) from anywhere
# - log+drop everything else
```
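A minimal `/etc/nftables.conf` matching those comments — a sketch only: the LAN/tailnet CIDRs are assumptions, and the live ruleset captured in 1.3 remains the authority:

```
#!/usr/sbin/nft -f
flush ruleset
table inet filter {
  chain input {
    type filter hook input priority 0; policy drop;
    ct state established,related accept
    iif "lo" accept
    # SSH (22) + Forgejo SSH (222) from LAN and tailnet only (CIDRs assumed)
    tcp dport { 22, 222 } ip saddr { 192.168.0.0/24, 100.64.0.0/10 } accept
    # Traefik (80/443) + Minecraft (25565) from anywhere
    tcp dport { 80, 443, 25565 } accept
    log prefix "input-drop: " drop
  }
}
```

Note 22 and 222 are folded into one rule since they share a source set; keep them split if per-port counters are wanted.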
**IPv6:** audit reports nullstone has `net.ipv4.ip_forward=1` (F30).
That was an *unintended carryover* from a Tailscale subnet-router
experiment. **Do NOT** copy `/etc/sysctl.d/` from nullstone wholesale.
Instead, set explicitly:
```bash
sudo tee /etc/sysctl.d/99-cobblestone.conf <<'EOF'
net.ipv4.ip_forward = 0
net.ipv6.conf.all.forwarding = 0
net.ipv4.conf.all.rp_filter = 1
net.ipv4.conf.all.send_redirects = 0
net.ipv4.conf.all.accept_redirects = 0
EOF
sudo sysctl --system
```
If Headscale or Tailscale subnet-router is wired later, re-enable
`ip_forward` with explicit comment + audit note.
**Tailscale + Headscale node identity:**
- Cleanest path: re-enroll cobblestone from scratch. New node, new
node-key, list `cobblestone` separately from `nullstone` in
Headscale during cutover week.
- Alternative: copy `/var/lib/tailscale/` from nullstone → cobblestone
to inherit the existing identity. Saves one ACL update but
conflates audit history. Not recommended.
### 3c — Docker
Install via official repo:
```bash
curl -fsSL https://download.docker.com/linux/debian/gpg | \
sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] \
https://download.docker.com/linux/debian $(lsb_release -cs) stable" | \
sudo tee /etc/apt/sources.list.d/docker.list
sudo apt update && sudo apt install -y \
docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
```
**`/etc/docker/daemon.json` — userns-remap decision.**
Two paths; operator decides. Document choice in SYSTEM.md.
**Path 1 — DROP userns-remap (recommended):** same JSON as nullstone
minus the `userns-remap` line.
- Pros: no more `chown 101000` dance; nsenter trick
(`feedback_docker_sudo_bypass.md`) drops the `--userns=host` flag;
Mongo keyfile pattern from `project_nullstone_docker_userns.md`
becomes unnecessary; `docker exec` UIDs match host 1:1.
- Cons: container root → host uid 0. Compensated by
`no-new-privileges`, `icc=false`, per-compose CAP drops, read-only
root FS where compatible. Net: small regression in defense-in-depth,
large workflow simplification.
**Path 2 — KEEP userns-remap:** carry `/etc/subuid` + `/etc/subgid`
identically (`user:100000:65536`). Existing on-disk ownership at uid
`101000` transfers without rechown. Cost: persisting the daily
friction the operator has been hitting for months.
**Default: Path 1.** If chosen, after rsync:
```bash
sudo chown -R user:user /home/docker /opt/docker
# Then per-service to the container uid (forgejo 1000, postgres 999,
# mongo 999, traefik 0).
```
Networks (must exist before Traefik comes up):
```bash
docker network create proxy
docker network create socket-proxy-net
docker network create misskey-frontend
```
### 3d — Service redeploy order
Topological. Each step depends only on its predecessors. Verification
command and rollback at each stage.
| # | Stack | Depends on | Verify | Rollback |
|---|---|---|---|---|
| 1 | networks (`proxy`, `socket-proxy-net`, `misskey-frontend`) | docker daemon | `docker network ls` | `docker network rm` |
| 2 | `socket-proxy` | network `socket-proxy-net` | `docker logs socket-proxy` shows API filter active | down compose |
| 3 | `traefik` | socket-proxy + acme.json/acme-internal.json carryover + Gandi PAT in .env | `curl -k https://sys.s8n.ru` returns dashboard auth challenge; `docker logs traefik` shows resolver init OK; cert files repopulate without LE call (acme.json reuse) | down compose; acme.json restore from backup |
| 4 | `step-ca` | traefik (for ACME-back) | `docker exec step-ca step ca health`; Traefik internal-CA resolver issues a cert against `https://step-ca:9000/acme/acme/directory` | down compose; revert traefik resolver config |
| 5 | `headscale` | traefik | `curl https://hs.s8n.ru/health`; `docker exec headscale headscale nodes list` shows existing nodes (db.sqlite carryover) | down compose; restore db.sqlite snapshot |
| 6 | authentik (`postgres → redis → server → worker`) | traefik | `curl https://auth.s8n.ru/-/health/ready/`; OIDC discovery doc loads | per-component down |
| 7 | `forgejo` | traefik (+ optional authentik, currently unwired) | `curl https://git.s8n.ru/api/v1/version`; `git clone ssh://git@cobblestone:222/...` | down compose; data dir tar-revert |
| 8 | misskey (`db → redis → misskey → x-source`) | traefik, network `misskey-frontend` | `curl https://x.veilor/api/meta` returns JSON; signup page renders | down compose; pg dump restore |
| 9 | `tuwunel` + `tuwunel-txt` | traefik | `curl https://matrix.veilor.uk/_matrix/federation/v1/version` and `https://mx.s8n.ru/_matrix/federation/v1/version` | down compose; data tar-revert |
| 10 | `cinny-txt` + `commet-web` + `signup-page` + `signup-txt` | tuwunel reachable, traefik | `curl -I https://txt.s8n.ru` 200; static assets 200 | down compose |
| 11 | `livekit-server` + `lk-jwt-service` | traefik (TURN over HTTPS) | `wscat -c wss://livekit.veilor.uk`; jwt service `/healthz` | down compose |
| 12 | n8n (`postgres → n8n`) | traefik, restored encryption key | `curl https://n8n.s8n.ru/healthz`; UI loads with existing flows | pg dump restore |
| 13 | `pihole` | traefik | `dig example.com @cobblestone +short` resolves; admin UI auth | down compose |
| 14 | `forgejo-runner` | forgejo (#7) reachable on internal name | `docker logs forgejo-runner` shows `Runner registered successfully` | down compose; regenerate token via `forgejo actions generate-runner-token` |
| 15 | `minecraft-mc` | traefik (only for filebrowser-mc), router port-forward 25565 | `mcstatus mc.racked.ru status` (or `nc -zv cobblestone 25565`) | down compose; world tar-revert |
| 16 | `dl-veilor` + `filebrowser-mc` | traefik | `curl https://dl.veilor.org/v0.2.0/veilor-root.crt` | down compose |
| 17 | `anythingllm` | traefik **with `no-guest@file` middleware applied** OR LAN-only bind — must NOT bring up like nullstone (port 3001 publicly exposed, audit F-anythingllm-1) | `curl -I -H 'Host: ai.s8n.ru' https://cobblestone` from off-LAN must 403 | down compose |
| 18 | RocketChat (`mongodb → rocketchat`) | **operator decision** — currently stopped on nullstone; if not retired, restore from mongodump produced in 4.1 | `curl https://rc.s8n.ru/api/info`; first-admin claim if still pending | leave stopped (matches today's state) |
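The per-row "verify before proceeding" discipline (see 4.3) can be wrapped in a tiny gate helper — a sketch; the real verify commands come straight from the table rows:

```bash
# gate NAME CMD... - run the row's verify command; on failure, stop and
# ask before continuing (one bad startup cascades down the table).
gate() {
  local name=$1; shift
  if "$@" >/dev/null 2>&1; then
    echo "PASS  $name"
  else
    read -rp "FAIL  $name - continue anyway? [y/N] " ans
    [ "$ans" = y ] || exit 1
  fi
}
# Demo invocation with a stand-in command; on the day this is e.g.
#   gate networks docker network ls
gate "networks" true
```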
---
## 4 — Cutover sequence
### 4.1 — Snapshot state on nullstone
```bash
NS=user@192.168.0.100
TS=$(date +%F-%H%M)
DEST=/opt/snap/$TS
ssh $NS "sudo mkdir -p $DEST && sudo chown user:user $DEST"
# Postgres dumps
for pg in authentik-postgres misskey-db n8n-postgres-1; do
ssh $NS "docker exec $pg pg_dumpall -U postgres" \
| gzip > $DEST/$pg.sql.gz
done
# Mongo (start, dump, stop again — currently stopped per audit)
ssh $NS 'cd /opt/docker/rocketchat && docker compose up -d rocketchat-mongodb && sleep 15'
ssh $NS 'docker exec rocketchat-mongodb mongodump \
--username root \
--password "$(grep MONGO_INITDB_ROOT_PASSWORD /opt/docker/rocketchat/.env | cut -d= -f2)" \
--authenticationDatabase admin --archive' \
| gzip > $DEST/rocketchat.archive.gz
ssh $NS 'cd /opt/docker/rocketchat && docker compose stop rocketchat-mongodb'
# Forgejo full dump (covers DB + repos + LFS + attachments)
ssh $NS 'docker exec -u 1000 forgejo \
forgejo dump --type tar.zst --file /tmp/forgejo-dump.tar.zst'
# docker cp to stdout emits a tar stream; tar -xO unwraps the single file
ssh $NS 'docker cp forgejo:/tmp/forgejo-dump.tar.zst - | tar -xO' \
  > $DEST/forgejo-dump.tar.zst
# Stop everything before tar (consistency)
ssh $NS 'for d in /opt/docker/*/; do \
[ -f "$d/docker-compose.yml" ] && \
(cd "$d" && docker compose down) ; \
done'
# Bulk state tar
ssh $NS "sudo tar --acls --xattrs -cpf - /opt/docker /home/docker /opt/backups" \
| zstd -T0 -19 > $DEST.tar.zst
# Manifest
ssh $NS "find /opt/docker /home/docker -type f -print0 \
| xargs -0 sha256sum" > $DEST.sha256
```
Hold the tarball plus dumps in two places: cobblestone target host
and an offline USB. `acme.json` and step-ca secrets get an
*additional* armored copy to the password manager.
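Before tearing anything down, also gate on artifact sizes — an empty `pg_dumpall` still gzips to a valid ~20-byte file, so a bare "file exists" check is not enough. The pattern, demonstrated on stubs:

```bash
mkdir -p /tmp/snapgate
# What a failed dump looks like vs. a real one (stub contents).
printf '' | gzip > /tmp/snapgate/empty-db.sql.gz
seq 1 1000  | gzip > /tmp/snapgate/good-db.sql.gz
# Anything under 100 bytes is almost certainly an empty/failed dump.
if find /tmp/snapgate -name '*.gz' -size -100c | grep -q .; then
  echo "SUSPICIOUS: undersized dump present - investigate before compose down"
else
  echo "all dumps look sane"
fi
```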
### 4.2 — rsync to cobblestone
After the tarball lands, repopulate cobblestone:
```bash
COBB=user@192.168.0.101
scp $DEST.tar.zst $COBB:/tmp/snap.tar.zst
ssh $COBB 'sudo mkdir -p /opt/docker /home/docker /opt/backups && \
  sudo zstd -d /tmp/snap.tar.zst -o /tmp/snap.tar && \
  sudo tar --acls --xattrs -xpf /tmp/snap.tar -C /'
# If userns-remap dropped (Path 1 in 3c):
ssh $COBB 'sudo chown -R user:user /opt/docker /home/docker'
```
### 4.3 — Bring up services on cobblestone
Walk section 3d table top to bottom. **Stop and verify** at each row
before the next. Don't batch — one bad startup cascades.
For services that store internal hostnames (Tuwunel `server_name`,
Headscale `server_url`, Forgejo `ROOT_URL`), the values stay the same
because public DNS still resolves to the WAN IP — only the internal LAN
target changes. No app config edits needed for cutover.
### 4.4 — Verify per vhost
```bash
for host in sys.s8n.ru git.s8n.ru auth.s8n.ru pihole.s8n.ru \
signup.txt.s8n.ru hs.s8n.ru rc.s8n.ru n8n.s8n.ru \
txt.s8n.ru mx.s8n.ru x.veilor matrix.veilor.uk \
chat.veilor.uk livekit.veilor.uk signup.veilor.uk \
dl.veilor.org; do
echo -n "$host: "
curl --resolve $host:443:192.168.0.101 -sI https://$host | head -1
done
```
Then push key flows:
- `git push nullstone-remote` (alias still works because DNS is
unchanged) — Forgejo CI runs.
- Matrix federation: `curl https://federationtester.matrix.org/api/report?server_name=veilor.uk`.
- Misskey signup: hit invite-gated form, complete signup, federation
test post.
### 4.5 — Cutover network
Two paths; operator picks based on appetite.
**Path A — DNS swing** (lower risk, slower propagation):
1. Lower `*.s8n.ru` and `*.veilor*` A-record TTLs to 60 s **a week
before** cutover (Gandi UI; can't be done via API per
`reference_gandi_api.md`).
2. Day-of: change A records from `82.31.156.86` (assumed unchanged
public IP) only if the WAN NAT target has changed (e.g. router
port-forwards now point at `.101`). If WAN IP and port-forwards
stay the same and you swap LAN IPs (`.100` → `.101`), no public
DNS edit needed — only edit `/etc/hosts` on internal clients (per
`feedback_s8n_hosts_override.md`).
**Path B — IP takeover** (faster, higher rollback friction):
- Bring nullstone down on `.100`, change cobblestone from `.101` →
  `.100`, restart networking. Public DNS + router port-forwards
  unchanged. Rollback = swap IPs back.
Update onyx `/etc/hosts` long pin line **last**:
```
192.168.0.<new> rc.s8n.ru n8n.s8n.ru pihole.s8n.ru sys.s8n.ru \
mx.s8n.ru txt.s8n.ru signup.txt.s8n.ru git.s8n.ru x.veilor \
dl.veilor.org
```
### 4.6 — Update memory + ai-lab docs
- `~/ai-lab/CLAUDE.md` — Device Registry: add `cobblestone` row, mark
`nullstone` as `decom 2026-MM-DD`.
- `~/ai-lab/SYSTEM.md` — replace nullstone hardware/network blocks
with cobblestone equivalents; keep nullstone as "cold spare" until
wipe.
- `~/ai-lab/README.md` — device table one-liner.
- `~/ai-lab/security/` — create `cobblestone-server/` folder; first
audit due within 7 days of cutover.
- Memory files to update: `project_nullstone_docker_userns.md`
(mark **superseded** if userns-remap dropped),
`project_forgejo_nullstone.md`,
`project_rocketchat_nullstone.md`, `project_tailscale_mesh.md`,
`feedback_nullstone_ssh_user.md`, `feedback_s8n_hosts_override.md`
(new IP).
### 4.7 — Cold spare + wipe
- Hold nullstone powered-off but cabled, 7 days minimum.
- If no rollback triggered, wipe: full LUKS reformat (or `nvme
format -s1` for crypto-erase if drive supports it), then either
donate or repurpose as cobblestone backup target (Restic
destination — closes audit recommendation #6).
---
## 5 — Post-migration immediate fixes
Carried over from `nullstone-server/audit-report-2026-05-05.md`:
- **F-backup-1 — fix `/opt/docker/backup.sh`:** remove dead
`matrix-postgres` block (Synapse retired); correct
`rocketchat-mongodb` container name; replace literal
`CHANGE_ME_MONGO_ADMIN_PASSWORD` with read from
`/opt/docker/rocketchat/.env`. Verify next 02:00 run produces
non-zero RC + Mongo dumps.
- **no-guest@file ACL:** populate `sourceRange` to cover LAN
(`192.168.0.0/24`) + tailnet (`100.64.0.0/10`) + IPv6 equivalents.
Verify XFF chain restores client IP at the entryPoint level
(`forwardedHeaders.trustedIPs`).
- **anythingllm:** front via Traefik with `no-guest@file` OR bind
LAN-only. Must not repeat the 0.0.0.0:3001 nullstone state.
- **LUKS:** done at install (3a). Verify via `cryptsetup status` +
`systemd-cryptenroll --tpm2-device=list` post-cutover.
- **Restic + autorestic** to B2/Wasabi or to nullstone-as-spare,
with restore drill scheduled.
- **Vaultwarden** to centralize the secrets currently sprayed across
`.env` files.
- **Gatus** with cert-expiry checks + ntfy/Matrix alerts.
- **CrowdSec** with bouncer plugin at Traefik for the public
HTTP attack surface.
- **Beszel** for one-pane host metrics.
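The `no-guest@file` bullet above, as a Traefik file-provider fragment — a sketch: the CIDRs are assumptions, and the middleware name is the Traefik v3 spelling (`ipAllowList`; v2 spells it `ipWhiteList`). Reconcile with the live `dynamic/*.yml` before use:

```yaml
# dynamic/no-guest.yml (sketch - CIDRs are assumptions, not the live values)
http:
  middlewares:
    no-guest:
      ipAllowList:
        sourceRange:
          - "192.168.0.0/24"   # LAN
          - "100.64.0.0/10"    # tailnet (CGNAT range Headscale hands out)
```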
---
## 6 — Open questions (operator decisions)
| Question | Default if undecided |
|---|---|
| Strip DE on cobblestone? | **Strip + Cockpit.** Easier to defend; remote admin via web UI through Traefik + no-guest@file. |
| userns-remap on cobblestone? | **Off (Path 1 in 3c).** Operator pain outweighs the marginal isolation. Document tradeoff. |
| Move Headscale + step-ca to a $4 VPS? | **Defer (phase 2).** Keep on cobblestone for now; revisit once Restic + Gatus are running. SPOF mitigation is real but adds attack surface; do it once monitoring is in place. |
| RocketChat: bring back up or retire? | **Retire if not used in 30 days.** Currently stopped; first-admin still unclaimed. Mongo dump captured in 4.1, then drop the stack from cobblestone redeploy. Keeps `rc.s8n.ru` DNS for future revival. |
| Tailscale identity copy vs re-enroll for cobblestone? | **Re-enroll** (cleaner audit trail; Headscale ACLs need a one-line edit). |
| SSH host keys copy vs rotate? | **Copy.** TOFU pinning intact; one less "is this MITM?" prompt for clients. Add rotation to a follow-up cron. |
| Authentik wiring during cutover or after? | **After.** Authentik is currently mostly unwired (audit). Cutover is not the time to add new auth dependencies. |
---
## 7 — Risks (severity-tagged)
- 🔴 **acme.json mishandling = LE rate-limit.** Mitigation: copy
`acme.json` + `acme-internal.json` BEFORE bringing up Traefik on
cobblestone. Never let cobblestone Traefik issue a fresh batch of
certs. Hold a backup of both files in two locations.
- 🔴 **step-ca root key loss = full re-issuance.** Mitigation:
triple-copy `/opt/docker/step-ca/.env` + `data/secrets/`
(cobblestone, USB, password manager). Test that the encrypted root
key decrypts on cobblestone before tearing down nullstone.
- 🔴 **anythingllm reintroduces public 0.0.0.0:3001.** Mitigation: do
NOT bring it up before middleware is in place. Test from off-LAN
IP.
- 🟠 **PostgreSQL major-version skew.** Mitigation: pin same major on
cobblestone (`postgres:16-alpine` already pinned; do NOT use
`:latest`). If a major upgrade is desired, do it as a separate
step *after* cutover settles, with a fresh pg_dumpall as safety
net.
- 🟠 **Headscale node identity churn** if `db.sqlite` not copied. All
nodes (onyx, friend RTX 4080 PC, office) re-enroll. Mitigation:
copy `db.sqlite` + `private.key`; verify `headscale nodes list`
matches pre-cutover before flipping DNS.
- 🟡 **chrony NTS peers** may need re-trust on new host (NTS-KE binds
to hostname). Mitigation: chrony config copy verbatim; first
`chronyc tracking` should show stratum within 5 minutes.
- 🟡 **Authentik OIDC `client_secret`s.** Today: mostly unwired
(audit). Risk small. If Forgejo/RC/n8n were wired through
Authentik, each `client_secret` would need re-handover. Defer
Authentik wiring until post-cutover.
- 🟡 **Misskey AGPL §13 source endpoint** (`x-source`). Per
`project_x_misskey_fork.md`, the AGPL link must keep serving
source — and per the same memo, mute is acceptable for short
windows. Cutover downtime budget: **≤ 2 h**. If exceeded, post a
banner on `x.veilor` linking to `https://git.s8n.ru/s8n-ru/x` for
the duration.
- 🟡 **Backup script broken on copy.** Audit F-backup-1 still applies
if you copy `/opt/docker/backup.sh` verbatim. Fix during section 5,
not before — but do not let it run on cobblestone before fix
(disable the cron entry until corrected).
---
## Appendix — quick reference
- nullstone: `user@192.168.0.100`, Debian 13, 32 GiB / 477 GiB, ~28
containers, no LUKS (F4).
- cobblestone: `user@192.168.0.101` during cutover, swing to `.100`
post-validation.
- LE wildcard `*.s8n.ru` + `*.veilor.uk` via Gandi DNS-01. Internal CA
via step-ca, Traefik resolver `internal-ca`.
- Out of scope: office workstation install, friend GPU re-enrollment,
veilor-os ISO build pipeline.
---
**Path:** `/home/admin/ai-lab/_github/infra/runbooks/MIGRATION-nullstone-to-cobblestone.md`
Two-line summary: pre-migration audit + secret catalog + cobblestone
install plan (LUKS2, optional userns-remap drop, 18-step topological
service redeploy) + cutover script + post-migration fixes carried over
from the 2026-05-05 audit. Operator must fill the "things we don't know
about cobblestone" table and decide on userns-remap / DE / RC retirement
before section 3 runs.