# 24 — Storage / Disk-I/O / Filesystem Audit (Read-Only)

> Status: **read-only audit**, executed 2026-05-08 against `nullstone`
> (192.168.0.100). Scope: storage stack underneath Jellyfin on
> `arrflix.s8n.ru`. Sibling audits cover color/HDR, server runtime, and
> edge/network — this file owns LVM, disks, ext4, mount opts, image cache,
> transcode cache, and the RO bind-mount overhead.
>
> **No writes. No mount changes. No fstrim execution. No cache flushes.
> No SMART self-tests.**

---

## Executive summary

**Storage is not the bottleneck. CPU is.** Disk I/O across every metric came
back fast and healthy. The "loads kinda slow" symptom is almost certainly
playback stall caused by a CPU-only host running 5 concurrent ffmpeg
transcodes of the same file at load average 42 — not disk. The storage layer
is in the bottom third of the suspect list.

Top three storage-side observations (severity, then quick-win order):

1. **Single PV / single LV / single NVMe — no isolation between media reads,
   transcode writes, OS, and Docker overlay churn.** Severity **Y**. Every
   workload hits `/dev/nvme0n1` and the ext4 journal at `keystone--vg-home`.
   Today the SSD shrugs it off (2.1 GB/s direct, 1.2 GB/s through the
   container RO mount), but transcode-write contention with library-scan
   reads is real — and the box is currently doing 5 concurrent ffmpegs.
   **Quick win: nothing today; investment: split media onto a second LV (or
   a second device) so transcode-write churn does not share an ext4 journal
   with library-scan reads.**
2. **Read-ahead is 128 KB on the LV (`dm-4`).** Severity **Y**. Fine for
   sequential 1080p MKV streams; higher-bitrate or scanning workloads would
   benefit from **512 KB–1 MB**. Tiny win, costs 30 seconds. **Quick win.**
3. **`relatime` on `/home` updates atime on the RO library (the bind mount
   is RO from the container's view, but the underlying ext4 is RW from the
   host).** Severity **G→Y**.
   `relatime` is the kernel default and only writes ~1 atime update per 24 h
   per file, so the write cost on a 201-file library is rounding noise.
   Documented for completeness; **not worth fixing**.

Ruled out as not-a-problem: rotating disk (it's NVMe), low free space (62 %
used, 146 GiB free — was 90 % at the prior audit, materially better), inode
pressure (6 % used), stale transcodes (zero >60 min old), image-cache GC
thrash (oldest cached image is 16 h old, no churn), bind-mount overhead
(40 % vs raw — but absolute throughput is still ~190× what a 4K HEVC stream
needs), SSD wear (8 % used, 100 % spare, zero media errors), and
`data=ordered` journal write barriers (NVMe-class device, irrelevant).

---

## 1. Disk + LVM topology

### Hardware

| Layer | Detail |
|---|---|
| Device | `/dev/nvme0n1`, **Intel SSDPEKKF512G8 NVMe**, 476.9 GiB, non-rotational, internal |
| Bus | NVMe |
| Loops (irrelevant) | `loop0..loop3`, 256 M each (snap remnants — empty) |

Single physical drive. **No HDDs. No external storage. No NAS mounts.** The
"media on rotating media" hypothesis (a) is **ruled out** — everything is on
this NVMe.

SMART (NVMe Log 0x02):

| Field | Value |
|---|---|
| Critical Warning | `0x00` |
| Temperature | 43 °C |
| Available Spare | 100 % |
| Percentage Used | **8 %** |
| Power-On Hours | 18 597 |
| Power Cycles | 3 729 |
| Unsafe Shutdowns | 774 |
| Media + Data Integrity Errors | **0** |
| Error Log Entries | 0 |
| Data Units Read | 25.7 TB |
| Data Units Written | 25.9 TB |

Drive is healthy, mid-life. No remediation.

### Partitions and LVM

```
nvme0n1 (476.9 GiB, NVMe SSD)
├─ nvme0n1p1   976 M  vfat  /boot/efi
├─ nvme0n1p2   977 M  ext4  /boot
└─ nvme0n1p3   475 G  LVM2 PV → keystone-vg
   ├─ keystone--vg-root     30.4 G  ext4  /
   ├─ keystone--vg-var      11.4 G  ext4  /var
   ├─ keystone--vg-swap_1   24.3 G  swap  [SWAP]
   ├─ keystone--vg-tmp       2.8 G  ext4  /tmp
   └─ keystone--vg-home    406.2 G  ext4  /home  ← media + jellyfin live here
```

Single-PV VG, **VFree = 0**. Cannot grow `home` without adding another PV.
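As a quick arithmetic cross-check (not one of the audit commands): the LV
sizes in the tree above should account for essentially the whole 475 G PV,
which is consistent with `VFree = 0`. A minimal sketch using the figures
quoted above:

```python
# Cross-check: sum of LV sizes vs. PV size (figures from the tree above).
lvs_gib = {
    "root": 30.4,
    "var": 11.4,
    "swap_1": 24.3,
    "tmp": 2.8,
    "home": 406.2,
}
pv_gib = 475.0  # nvme0n1p3

allocated = sum(lvs_gib.values())
print(f"allocated: {allocated:.1f} G of {pv_gib:.0f} G PV")
# LVs consume the PV to within rounding of the reported sizes — hence VFree = 0.
assert abs(allocated - pv_gib) < 1.0
```

The ~0.1 G discrepancy is rounding in the reported sizes, not actual free
extents.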
Note swap is **on the same PV** as `home`; under memory pressure (the prior
audit caught 6.8 GiB of swap in use) swap traffic contends with media reads
on the same NVMe queue.

### Mount table (relevant entries only)

| Source | Mountpoint | FS | Options |
|---|---|---|---|
| `keystone--vg-root` | `/` | ext4 | `rw,relatime,errors=remount-ro` |
| `keystone--vg-var` | `/var` | ext4 | `rw,nosuid,nodev,relatime` |
| `keystone--vg-tmp` | `/tmp` | ext4 | `rw,nosuid,nodev,noexec,relatime` |
| `keystone--vg-home` | `/home` | ext4 | `rw,nosuid,nodev,**relatime**` |
| `nvme0n1p2` | `/boot` | ext4 | `rw,relatime` |
| `nvme0n1p1` | `/boot/efi` | vfat | `rw,relatime,fmask=0077,dmask=0077` |

`relatime` is the kernel default; **strict `atime` is not in use** (good —
full `atime`, which writes on every read, is the actual horror). `noatime`
would shave ~1 atime write per 24 h per file accessed; on a 201-file library
that's sub-noise. **Not a remediation candidate.** No `discard` flag (good —
online discard hurts performance; the weekly `fstrim.timer` is the right
pattern, see §8).

### Container bind mounts (Jellyfin)

| Host path | Container path | RW |
|---|---|---|
| `/home/docker/jellyfin/config` | `/config` | RW |
| `/home/docker/jellyfin/cache` | `/cache` | RW |
| `/home/user/media` | `/media` | **RO** |
| `/opt/docker/jellyfin/web-overrides/index.html` | `/jellyfin/jellyfin-web/index.html` | RO |

All bind mounts hit the same `keystone--vg-home` LV — config, transcode
cache, image cache, and media library all share one ext4 journal and one
queue.
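To put a number on "sub-noise": a back-of-envelope upper bound on the daily
write cost of `relatime` on the library. The one-update-per-file-per-day
figure is the `relatime` worst case, and the assumption that each atime
update dirties one journaled 4 KiB metadata block (the filesystem block size,
see §1) is a deliberately pessimistic simplification — real cost is lower
because inode updates batch.

```python
# Pessimistic bound on relatime write traffic for the media library (sketch).
files = 201          # library size from this audit
block_bytes = 4096   # ext4 block size on keystone--vg-home
updates_per_day = 1  # relatime: at most ~1 atime update per file per 24 h

daily_bytes = files * updates_per_day * block_bytes
print(f"{daily_bytes / 1024:.0f} KiB/day")  # prints "804 KiB/day"
# ~0.8 MiB/day worst case — rounding noise against a drive at 8 % wear
# after 25.9 TB written.
```

Even the pessimistic bound supports the "not worth fixing" call on S02.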
### ext4 features (`/dev/keystone--vg-home`)

```
Filesystem features:   has_journal ext_attr resize_inode dir_index orphan_file
                       filetype extent 64bit flex_bg metadata_csum_seed
                       sparse_super large_file huge_file dir_nlink extra_isize
                       metadata_csum orphan_present
Default mount options: user_xattr acl
Total journal size:    1024 M (1 GiB — chunky but standard for 400 GiB)
Journal features:      journal_incompat_revoke journal_64bit journal_checksum_v3
Filesystem state:      clean
Last mount time:       Sun May  3 23:42:28 2026
Mount count:           8
Block size:            4096
Inode count:           26 624 000
```

Journal mode is the ext4 default `data=ordered` (no override in mountopts).
On NVMe with `metadata_csum` and `journal_checksum_v3`, this is **fine** — it
would only matter on slow rotational storage. Hypothesis (b) "ext4 journal in
`data=ordered` starves reads" is **ruled out**: the device is NVMe-class and
not the bottleneck.

---

## 2. Read throughput (1 large file, raw)

Test file: `Rick and Morty (2013) - S01E04 - M. Night Shaym-Aliens.mkv`
(1.5 GB, host path `/home/user/media/tv/...`).

| Test | Bytes | Wall | Throughput |
|---|---|---|---|
| `dd … bs=1M count=512 iflag=direct` (host, bypasses cache) | 537 MB | 0.258 s | **2.1 GB/s** |
| `dd … bs=1M count=512` (host, page-cache eligible) | 537 MB | 0.536 s | 1.0 GB/s (still warming) |
| `dd … bs=1M count=256 iflag=direct` (inside `jellyfin`, RO bind) | 268 MB | 0.233 s | **1.2 GB/s** |

**Bind-mount overhead ≈ 40 %** (2.1 → 1.2 GB/s). That is higher than the
"bind mounts are free" folklore, but absolute throughput still crushes any
practical media bitrate (4K HDR HEVC tops out around 50 Mbit/s = 6.25 MB/s;
1.2 GB/s is **~190× headroom**). **Not a bottleneck. Not a remediation
candidate.**

---

## 3. Random-read latency

`ioping` not installed on host or in container. Skipped.
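With `ioping` unavailable, the raw `dd` timings from §2 are the next-best
check. A minimal sketch re-deriving the throughput, overhead, and headroom
figures from the unrounded numbers (the text rounds to 2.1/1.2 GB/s, ~40 %,
and ~190×; the exact values come out slightly higher on overhead and lower
on headroom):

```python
# Re-derive §2's throughput/overhead/headroom from the raw dd numbers.
host_direct = 537 / 0.258   # MB/s: host read with iflag=direct
container_ro = 268 / 0.233  # MB/s: same file through the RO bind mount

overhead = 1 - container_ro / host_direct
hevc_4k = 6.25              # MB/s (≈ 50 Mbit/s 4K HDR HEVC ceiling)
headroom = container_ro / hevc_4k

print(f"host direct : {host_direct:.0f} MB/s")
print(f"container RO: {container_ro:.0f} MB/s")
print(f"overhead    : {overhead:.0%}")
print(f"headroom    : {headroom:.0f}x over a 4K HEVC stream")
```

Either way the conclusion holds: the overhead is real but the absolute
number is far beyond any media bitrate.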
Indirect signal: NVMe device-queue stats from `/proc/diskstats` for `dm-4`
(the home LV):

```
reads:   15 003 996   read_sectors:  2 600 976 283   read_ms:    3 384 240
writes:  41 153 214   write_sectors: 1 997 023 232   write_ms: 145 844 732
in-flight: 0          io_ms:         5 153 616
```

Average per-read service ≈ **0.226 ms**, average per-write ≈ **3.5 ms**
(consistent with NVMe + ext4 journal flush). No queue stalls observed.

---

## 4. Cache size breakdown

| Path | Bytes | Notes |
|---|---|---|
| `/cache` (total) | **84 MB** | Entire jellyfin cache fits in one MP3 album |
| `/cache/transcodes` | 39–61 MB | Live during audit; **5 concurrent ffmpegs** (see §6) |
| `/cache/images` | 39 MB | 412 files in 16 hash-prefixed dirs |
| `/cache/images/resized-images` | 39 MB | dirs `0`, `1`, …, `f` (16 buckets, 18–30 files each) |
| `/cache/omdb` | 84 KB | Plugin response cache |
| `/cache/fontconfig` | 36 KB | |
| `/cache/attachments` | 12 KB | Subtitle/font extracts |
| `/cache/imagesbyname` | 4 KB | Empty |

Total cache = 84 MB on a 400 GB filesystem. **There is no cache pressure.**
The "cache being garbage-collected mid-page-load" hypothesis (c) is **ruled
out** (oldest cached image timestamp = 2026-05-08 01:12 BST, newest =
17:42 BST — **16.5 h retention with no eviction**).

---

## 5. Image cache miss-vs-hit timing

Public asset latency from onyx → `https://arrflix.s8n.ru`:

| URL | Attempt 1 (cold) | Attempt 2 (warm) |
|---|---|---|
| `/web/assets/img/icon-transparent.png` | 0.227 s | 0.047 s |
| `/web/serviceworker.js` | 0.059 s | 0.059 s |
| `/web/main.jellyfin.bundle.js` | 0.092 s | 0.052 s |

5-sample steady state on `/web/main.jellyfin.bundle.js` = **44–68 ms, median
49 ms**. The Traefik + Jellyfin static-asset path is fast.

Direct poster URLs (`/Items/{id}/Images/Primary`) require an auth token and
could not be probed without a fresh `X-Emby-Token`. Inferred from on-disk
evidence: the `resized-images` cache contains 412 WebPs, all under 200 KB,
with no eviction in the last 16 h.
**Image cache serves all current items from disk on the warm path.**
Hypothesis (c) is **ruled out**.

---

## 6. Stale-transcode detection

```
/cache/transcodes:
  total bytes:        39 MB  (was 61 MB earlier in audit, churn = active stream)
  total files:        26
  files >60 min old:  0
  bytes >60 min old:  0 MB
```

The `Clean Transcode Directory` task last ran `2026-05-08T02:13` (per the
audit 13 task list). **Currently zero stale transcode segments.** Hypothesis
(d) is **ruled out** — no accumulation.

However, **5 concurrent ffmpeg processes are transcoding the same file**
right now:

```
PID      CPU   file
1685478  246%  Rick and Morty S01E01 - Pilot.mkv
1686665  203%  Rick and Morty S01E01 - Pilot.mkv  (same file)
1686651  198%  Rick and Morty S01E01 - Pilot.mkv  (same file)
1689000  125%  Rick and Morty S01E01 - Pilot.mkv  (same file)
1689109  120%  Rick and Morty S01E01 - Pilot.mkv  (same file)
```

This is a **CPU-side** issue (no ffmpeg de-dup, no segment throttling — see
audit 13, finding 03). It causes:

- Load average **42.62 / 22.84 / 12.32** (12-core box).
- Swap usage 7.8 GiB / 24 GiB.
- I/O wait, however, is **0 %** in `vmstat` (`wa=0`).

The host CPU is saturated, not the disk. **The storage layer is not this
user's bottleneck.**

---

## 7. Inode + free-space stats

| Filesystem | 1K-blocks | Used | Available | Use % | Inodes | IUsed | IUse % |
|---|---|---|---|---|---|---|---|
| `keystone--vg-home` (`/home`) | 418 106 320 | 244 025 392 | 152 768 828 | **62 %** | 26 624 000 | 1 489 612 | **6 %** |
| `keystone--vg-root` (`/`) | — | — | — | — | — | — | — |
| `keystone--vg-var` (`/var`) | 12 G | 2.0 G | 8.6 G | **19 %** | n/a | n/a | n/a |

**Free space went from 40 GiB at audit 13 (90 % full) to 146 GiB now
(62 %).** Material improvement; the prior "low free space" hypothesis (e) is
**ruled out**. Inode pressure: also ruled out.

(Note: `/home` houses `/home/user/docker-data/100000.100000/...`, which
contains all userns-remapped Docker overlay2 trees. The 233 G used figure
includes container layers, not just media.
Library itself is 201 files.)

---

## 8. fstrim status

```
fstrim.timer       Loaded, enabled, active (waiting)
Last triggered:    Sun 2026-05-03 23:42:29 BST
Next trigger:      Mon 2026-05-11 01:12:58 BST
fstrim --dry-run /home  →  /home: 0 B (dry run) trimmed
```

Weekly trim is configured and last ran five days before this audit. **The
dry run reports 0 B of candidate space** → there is no untrimmed free space
on `/home`. SSD performance degradation from untrimmed blocks is **not** a
factor. No `discard` mount option (correct — async batched trim via the
timer is preferred over inline discard).

---

## 9. Read-ahead and queue settings

| Block device | `read_ahead_kb` | scheduler | `nr_requests` |
|---|---|---|---|
| `nvme0n1` (physical) | **128 KB** | `[none] mq-deadline` | 1023 |
| `dm-4` (`keystone--vg-home`, the LV) | **128 KB** | n/a | n/a |

(`/sys/block/dm-4` exposes no scheduler/`nr_requests`; dm devices inherit.)

128 KB read-ahead is the kernel default. For sequential MKV streams this is
OK; for library-scan workloads (`stat` + open + read first chunk per file)
it is also OK. Bumping to 512 KB or 1024 KB would help **scan throughput**
during a Jellyfin library refresh — a minor win, ~30 s of work. NVMe is
using the `none` scheduler (correct for NVMe — multiqueue, no elevator).

---

## 10. RO bind-mount overhead — confirmed

(From §2.) Host direct = 2.1 GB/s. Container RO bind = 1.2 GB/s. Overhead
≈ 40 %, which is higher than expected — likely a side effect of:

- userns remap (`100000.100000` shifts uids)
- the `nosuid,nodev` flags on `/home` propagating into the bind
- the container's `read_ahead_kb` not being configurable through the bind
  (inherits 128 KB)

**Not actionable today.** Both numbers are 100×+ of any media bitrate.
Documented to rule out hypothesis (f).

`atime` cost on the RO bind: the bind mount inherits the host's `relatime`
semantics — at most one atime write per file per 24 h. On 201 files that is
≤ 201 atime writes/day = **rounding noise**. Hypothesis (f) **ruled out**.

---

## 11. Concrete remediation list — ranked

Severity legend: **R** = red (acute, fix this week), **Y** = yellow
(deferred, document risk), **G** = green (audited, healthy, no action).
Effort: **S** ≤ 30 min, **M** half-day, **L** > 1 day.

| # | Severity | Effort | Bucket | Action | Why |
|---|:-:|:-:|---|---|---|
| S01 | Y | S | Quick-win | Bump `read_ahead_kb` on `/dev/nvme0n1` to **512 KB** (sysfs or udev rule) | Helps library scans and large-MKV streams. Tiny risk; reverts on reboot if set live. |
| S02 | Y | M | Quick-win | Add `noatime` (replacing `relatime`) to the `/home` mount in `/etc/fstab` | Eliminates the residual `relatime` writes; cosmetic but cheap. Requires a remount; do during a window with no playback. |
| S03 | Y | M | Investment | Carve a separate **`media` LV** (or attach a second NVMe) for `/home/user/media` and bind-mount it RO into Jellyfin | Isolates library reads from transcode-write churn and Docker overlay churn on the same ext4 journal. Today it is fine; at scale it will not be. |
| S04 | Y | M | Investment | Move `keystone--vg-swap_1` off `keystone-vg` (or onto a separate device) | Swap is currently 7.8 GiB used and shares the NVMe queue with media reads. CPU saturation is the proximate cause, but cleanly isolating swap helps once CPU is fixed (GPU re-enable, see audit 13 #02). |
| S05 | Y | M | Investment | Add a second PV to `keystone-vg` so the VG has free space | `vgs` shows **VFree = 0**. Any future `lvextend` will fail until a PV is added. Latent ops trap. |
| S06 | G | — | — | Keep weekly `fstrim.timer` as-is | Healthy, current. |
| S07 | G | — | — | Keep image cache untouched | 84 MB total cache, 16 h retention, no GC pressure. |
| S08 | G | — | — | No change to `data=ordered` ext4 journal | NVMe; mode is fine. |

**The single biggest "loads kinda slow" win lives in audit 13 (finding 03 —
enable transcode throttling + segment deletion). Storage is not where this
gets fixed.**

---

## 12. Quick-win vs investment

### Quick-win (≤30 min total, today)

- **S01** — `echo 1024 > /sys/block/nvme0n1/queue/read_ahead_kb` (or 512).
  Reverts on reboot; persist via a udev rule under
  `/etc/udev/rules.d/60-readahead.rules`. Marginal but free.
- **S02** — flip `relatime` → `noatime` in `/etc/fstab` for `/home`.
  Cosmetic but cheap. **Skip if even half-loaded** — a bad fstab + reboot is
  an outage; only do this during a planned window.

### Investment (half-day to multi-day, plan)

- **S03** — separate `media` LV. Requires `lvcreate`, `mkfs`, an rsync of
  the library, and swapping the bind mount in compose. ~Half a day. Pays
  back when (a) the library grows past the current 201 files, and (b) GPU
  transcode is re-enabled (audit 13 #02) and many concurrent reads start
  happening.
- **S04** — relocate swap. Only meaningful after the GPU re-enable closes
  the CPU-saturation root cause.
- **S05** — second PV. Trivial mechanically (`pvcreate`, `vgextend`), but
  blocked on having a second device. Defer until needed.

### No-op (audited and healthy)

- SMART status (8 % wear, no errors)
- ext4 features and journal mode
- Inode usage (6 %)
- Free space (62 %, 146 GiB headroom)
- Cache size (84 MB total)
- Stale transcodes (zero)
- `fstrim.timer` (working, candidate bytes = 0)
- Bind-mount throughput (1.2 GB/s, ~190× any 4K stream)

---

## 13. Sign-off

- Audit: 2026-05-08, read-only, ~15 min wall.
- No fixes applied. No state mutated. No container restart. No SMART
  self-test. No fstrim execution. No mount changes.
- **Top storage culprit: none.** The storage stack is healthy. The "loads
  kinda slow" symptom is CPU-side (5 concurrent ffmpegs at load 42, audit 13
  #02 + #03).
- **Top quick-win: S01 — bump `read_ahead_kb` to 512 KB on `nvme0n1`** for a
  marginal scan/stream gain. The real fix lives in audit 13.
- Next audit due: **2026-08-08** (quarterly, with audit 13).