
24 — Storage / Disk-I/O / Filesystem Audit (Read-Only)

Status: read-only audit, executed 2026-05-08 against nullstone (192.168.0.100). Scope: storage stack underneath Jellyfin on arrflix.s8n.ru. Sibling audits cover color/HDR, server runtime, and edge/network — this file owns LVM, disks, ext4, mount opts, image cache, transcode cache, and the RO bind-mount overhead.

No writes. No mount changes. No fstrim execution. No cache flushes. No SMART self-tests.


Executive summary

Storage is not the bottleneck. CPU is. Disk I/O across every metric came back fast and healthy. The "loads kinda slow" symptom is almost certainly playback-stall caused by a CPU-only host running 5 concurrent ffmpeg transcodes of the same file at load average 42 — not disk. The storage layer is in the bottom third of the suspect list.

Top three storage-side observations (severity, then quick-win order):

  1. Single PV / single LV / single NVMe — no isolation between media reads, transcode writes, OS, and Docker overlay churn. Severity Y. Every workload hits /dev/nvme0n1 and the ext4 journal at keystone--vg-home. Today the SSD shrugs it off (2.1 GB/s direct, 1.2 GB/s through the container RO mount), but transcode-write contention with library-scan reads is real — and the box is currently doing 5 concurrent ffmpegs. Quick win: nothing today; investment: split media onto a second LV (or second device) so transcode-write churn does not share an ext4 journal with library-scan reads.
  2. Read-ahead is 128 KB on the LV (dm-4). Severity Y. The default is adequate for sequential 1080p MKV streams; higher-bitrate or scanning workloads would benefit from 512 KB–1 MB. Tiny win, costs 30 seconds. Quick win.
  3. relatime on /home updates atime on the RO library (the bind mount is RO from the container's view but the underlying ext4 is RW from the host). Severity G→Y. relatime is the kernel default and only writes ~1 atime update per 24 h per file, so the write cost on a 201-file library is rounding noise. Documented for completeness; not worth fixing.

Ruled out as not-a-problem: rotating disk (it's NVMe), low free space (62 % used, 146 GiB free — was 90 % at the prior audit, materially better), inode pressure (6 % used), stale transcodes (zero >60 min old), image-cache GC thrash (oldest cached image is 16 h old, no churn), bind-mount overhead (40 % vs raw — but absolute throughput is still 190× what a 4K HEVC stream needs), SSD wear (8 % used, 100 % spare, zero media errors), and data=ordered journal write barriers (NVMe-class device, irrelevant).


1. Disk + LVM topology

Hardware

| Layer | Detail |
|---|---|
| Device | /dev/nvme0n1, Intel SSDPEKKF512G8 NVMe, 476.9 GiB, non-rotational, internal |
| Bus | NVMe |
| Loops (irrelevant) | loop0..loop3, 256 M each (snap remnants — empty) |

Single physical drive. No HDDs. No external storage. No NAS mounts. The "media on rotating media" hypothesis (a) is ruled out — everything is on this NVMe.
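
The rotational check is a one-liner, for future passes:

```bash
# Read-only: ROTA=0 confirms non-rotational; TYPE separates the disk from snap loop devices
lsblk -d -o NAME,TYPE,ROTA,SIZE,MODEL
```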

SMART (NVMe Log 0x02):

| Field | Value |
|---|---|
| Critical Warning | 0x00 |
| Temperature | 43 °C |
| Available Spare | 100 % |
| Percentage Used | 8 % |
| Power-On Hours | 18 597 |
| Power Cycles | 3 729 |
| Unsafe Shutdowns | 774 |
| Media + Data Integrity Errors | 0 |
| Error Log Entries | 0 |
| Data Units Read | 25.7 TB |
| Data Units Written | 25.9 TB |

Drive is healthy, mid-life. No remediation.
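
For reproducibility: the table corresponds to NVMe log page 0x02, readable with either of the following (assuming nvme-cli or smartmontools is installed on the host):

```bash
# NVMe SMART / Health Information log (page 0x02), read-only
sudo nvme smart-log /dev/nvme0n1
# Equivalent view via smartmontools
sudo smartctl -a /dev/nvme0n1
```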

Partitions and LVM

```
nvme0n1 (476.9 GiB, NVMe SSD)
├─ nvme0n1p1   976 M  vfat   /boot/efi
├─ nvme0n1p2   977 M  ext4   /boot
└─ nvme0n1p3   475 G  LVM2 PV → keystone-vg
   ├─ keystone--vg-root    30.4 G  ext4   /
   ├─ keystone--vg-var     11.4 G  ext4   /var
   ├─ keystone--vg-swap_1  24.3 G  swap   [SWAP]
   ├─ keystone--vg-tmp      2.8 G  ext4   /tmp
   └─ keystone--vg-home   406.2 G  ext4   /home  ← media + jellyfin live here
```

Single-PV VG, VFree = 0. Cannot grow home without adding another PV. Note swap is on the same PV as home; under memory pressure (the prior audit caught 6.8 GiB swap in use) swap traffic contends with media reads on the same NVMe queue.
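
The single-PV / VFree = 0 claim is re-checkable with the standard read-only LVM reports:

```bash
# Read-only LVM reports: one PV, VFree 0, five LVs
sudo pvs -o pv_name,vg_name,pv_size,pv_free
sudo vgs -o vg_name,pv_count,lv_count,vg_size,vg_free
sudo lvs -o lv_name,vg_name,lv_size
```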

Mount table (relevant entries only)

| Source | Mountpoint | FS | Options |
|---|---|---|---|
| keystone--vg-root | / | ext4 | rw,relatime,errors=remount-ro |
| keystone--vg-var | /var | ext4 | rw,nosuid,nodev,relatime |
| keystone--vg-tmp | /tmp | ext4 | rw,nosuid,nodev,noexec,relatime |
| keystone--vg-home | /home | ext4 | rw,nosuid,nodev,**relatime** |
| nvme0n1p2 | /boot | ext4 | rw,relatime |
| nvme0n1p1 | /boot/efi | vfat | rw,relatime,fmask=0077,dmask=0077 |

relatime is the kernel default; atime was not used (good — pure atime is the actual horror). noatime would shave ~1 atime write per 24 h per file accessed; on a 201-file library that's sub-noise. Not a remediation candidate. No discard flag (good — online discard hurts performance; the weekly fstrim.timer is the right pattern, see §8).

Container bind mounts (Jellyfin)

| Host path | Container path | RW |
|---|---|---|
| /home/docker/jellyfin/config | /config | RW |
| /home/docker/jellyfin/cache | /cache | RW |
| /home/user/media | /media | RO |
| /opt/docker/jellyfin/web-overrides/index.html | /jellyfin/jellyfin-web/index.html | RO |

All bind mounts hit the same keystone--vg-home LV — config, transcode cache, image cache, and media library all share one ext4 journal and one queue.

ext4 features (/dev/keystone--vg-home)

```
Filesystem features:      has_journal ext_attr resize_inode dir_index orphan_file
                          filetype extent 64bit flex_bg metadata_csum_seed
                          sparse_super large_file huge_file dir_nlink extra_isize
                          metadata_csum orphan_present
Default mount options:    user_xattr acl
Total journal size:       1024 M  (1 GiB — chunky but standard for 400 GiB)
Journal features:         journal_incompat_revoke journal_64bit journal_checksum_v3
Filesystem state:         clean
Last mount time:          Sun May  3 23:42:28 2026
Mount count:              8
Block size:               4096
Inode count:              26 624 000
```

Journal mode is the ext4 default data=ordered (no override in mountopts). On NVMe with metadata_csum and journal_checksum_v3, this is fine — would only matter on slow rotational. Hypothesis (b) "ext4 journal in data=ordered starves reads" is ruled out: the device is NVMe-class and not the bottleneck.
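
The superblock dump above can be reproduced read-only; the device path is assumed to be the mapper node for the home LV:

```bash
# Full superblock: features, journal size, mount count, state
sudo tune2fs -l /dev/mapper/keystone--vg-home
# Header-only dump with journal feature detail
sudo dumpe2fs -h /dev/mapper/keystone--vg-home
```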


2. Read throughput (1 large file, raw)

Test file: Rick and Morty (2013) - S01E04 - M. Night Shaym-Aliens.mkv (1.5 GB, host path /home/user/media/tv/...).

| Test | Bytes | Wall | Throughput |
|---|---|---|---|
| dd … bs=1M count=512 iflag=direct (host, bypasses cache) | 537 MB | 0.258 s | 2.1 GB/s |
| dd … bs=1M count=512 (host, page-cache eligible) | 537 MB | 0.536 s | 1.0 GB/s (still warming) |
| dd … bs=1M count=256 iflag=direct (inside jellyfin, RO bind) | 268 MB | 0.233 s | 1.2 GB/s |

Bind-mount overhead = ~40 % (2.1 → 1.2 GB/s). That's higher than the "bind mounts are free" folklore but absolute throughput still crushes any practical media bitrate (4K HDR HEVC tops out around 50 Mbit/s = 6.25 MB/s; 1.2 GB/s is 190× headroom). Not a bottleneck. Not a remediation candidate.
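
The dd invocations in the table are truncated; a reconstruction of the shape of the test, with the episode path left elided and the container name assumed to be jellyfin:

```bash
MKV='/home/user/media/tv/.../S01E04.mkv'   # placeholder: real path elided above

# Host, O_DIRECT: bypasses the page cache, measures device + LVM + ext4
dd if="$MKV" of=/dev/null bs=1M count=512 iflag=direct

# Host, buffered: same read, page-cache eligible
dd if="$MKV" of=/dev/null bs=1M count=512

# Inside the container, through the RO bind (assumes dd exists in the image)
docker exec jellyfin dd if=/media/tv/.../S01E04.mkv of=/dev/null bs=1M count=256 iflag=direct
```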


3. Random-read latency

ioping not installed on host or in container. Skipped.

Indirect signal: NVMe device-queue stats from /proc/diskstats for dm-4 (home LV):

```
reads: 15 003 996  read_sectors: 2 600 976 283  read_ms: 3 384 240
writes: 41 153 214  write_sectors: 1 997 023 232  write_ms: 145 844 732
in-flight: 0   io_ms: 5 153 616
```

Average per-read service ≈ 0.226 ms, average per-write ≈ 3.5 ms (consistent with NVMe + ext4 journal flush). No queue stalls observed.
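
The averages are straight field arithmetic over /proc/diskstats (field 4 = reads completed, 7 = ms reading, 8 = writes completed, 11 = ms writing); a one-liner to reproduce:

```bash
# Average service time per read and per write for the home LV (dm-4)
awk '$3 == "dm-4" { printf "avg read %.3f ms, avg write %.3f ms\n", $7/$4, $11/$8 }' /proc/diskstats
```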


4. Cache size breakdown

| Path | Bytes | Notes |
|---|---|---|
| /cache (total) | 84 MB | Entire jellyfin cache fits in one MP3 album |
| /cache/transcodes | 39–61 MB | Live during audit; 5 concurrent ffmpegs (see §6) |
| /cache/images | 39 MB | 412 files in 16 hash-prefixed dirs |
| /cache/images/resized-images | 39 MB | 0 dir, 1 dir, …, f dir (16 buckets, 18–30 files each) |
| /cache/omdb | 84 KB | Plugin response cache |
| /cache/fontconfig | 36 KB | |
| /cache/attachments | 12 KB | Subtitle/font extracts |
| /cache/imagesbyname | 4 KB | Empty |

Total cache = 84 MB on a 400 GB filesystem. There is no cache pressure. The "cache being garbage-collected mid-page-load" hypothesis (c) is ruled out (oldest cached image timestamp = 2026-05-08 01:12 BST, newest = 17:42 BST = 16.5 h retention with no eviction).
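
The size column is a plain per-directory sum, reproducible read-only (assuming the container is named jellyfin and coreutils is present in the image):

```bash
# Per-directory cache sizes inside the container
docker exec jellyfin du -sh /cache /cache/*
```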


5. Image cache miss-vs-hit timing

Public asset latency from onyx → https://arrflix.s8n.ru:

| URL | Attempt 1 (cold) | Attempt 2 (warm) |
|---|---|---|
| /web/assets/img/icon-transparent.png | 0.227 s | 0.047 s |
| /web/serviceworker.js | 0.059 s | 0.059 s |
| /web/main.jellyfin.bundle.js | 0.092 s | 0.052 s |

5-sample steady state on /web/main.jellyfin.bundle.js = 44–68 ms, median 49 ms. Traefik + Jellyfin static-asset path is fast.
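
The timings are standard curl write-out probes; a sketch of the measurement, run from onyx:

```bash
# One cold fetch, then repeat for the warm number; -s silences progress output
curl -s -o /dev/null \
  -w 'total %{time_total}s  ttfb %{time_starttransfer}s\n' \
  https://arrflix.s8n.ru/web/main.jellyfin.bundle.js
```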

Direct poster URLs (/Items/{id}/Images/Primary) require an auth token; could not be probed without a fresh X-Emby-Token. Inferred from on-disk evidence: the resized-images cache contains 412 WebPs, all under 200 KB, no eviction in the last 16 h. Image cache serves all current items from disk on warm path.

Hypothesis (c) is ruled out.


6. Stale-transcode detection

```
/cache/transcodes:
  total bytes:   39 MB  (was 61 MB earlier in audit, churn = active stream)
  total files:   26
  files >60 min old:   0
  bytes >60 min old:   0 MB
```

Clean Transcode Directory task last ran 2026-05-08T02:13 (per audit 13 task list). Currently zero stale transcode segments. Hypothesis (d) is ruled out — no accumulation.
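
The stale check is an mtime filter over the transcode directory; a read-only sketch, assuming the container is named jellyfin and the image ships GNU findutils:

```bash
# Count and total the transcode segments older than 60 minutes
docker exec jellyfin find /cache/transcodes -type f -mmin +60 | wc -l
docker exec jellyfin find /cache/transcodes -type f -mmin +60 -printf '%s\n' \
  | awk '{ s += $1 } END { printf "%.0f MB stale\n", s / 1e6 }'
```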

However, 5 concurrent ffmpeg processes are transcoding the same file right now:

```
PID      CPU   file
1685478  246%  Rick and Morty S01E01 - Pilot.mkv
1686665  203%  Rick and Morty S01E01 - Pilot.mkv  (same file)
1686651  198%  Rick and Morty S01E01 - Pilot.mkv  (same file)
1689000  125%  Rick and Morty S01E01 - Pilot.mkv  (same file)
1689109  120%  Rick and Morty S01E01 - Pilot.mkv  (same file)
```

This is a CPU-side issue (no ffmpeg de-dup, no segment throttling — see audit 13 finding 03). It causes:

  • Load average 42.62 / 22.84 / 12.32 (12-core box).
  • Swap usage 7.8 GiB / 24 GiB.
  • I/O wait, however, is 0 % in vmstat (wa=0).

The host CPU is saturated, not the disk. Storage layer is not this user's bottleneck.


7. Inode + free-space stats

| Filesystem | 1K-blocks | Used | Available | Use % | Inodes | IUsed | IUse % |
|---|---|---|---|---|---|---|---|
| keystone--vg-home (/home) | 418 106 320 | 244 025 392 | 152 768 828 | 62 % | 26 624 000 | 1 489 612 | 6 % |
| keystone--vg-root (/) | | | | | | | |
| keystone--vg-var (/var) | 12 G | 2.0 G | 8.6 G | 19 % | n/a | n/a | n/a |

Free space went from 40 GiB at audit 13 (90 % full) to 146 GiB now (62 %). Material improvement; the prior "low free space" hypothesis (e) is ruled out. Inode pressure ruled out.

(Note: /home houses /home/user/docker-data/100000.100000/... which contains all userns-remapped Docker overlay2 trees. The 233 G used number includes container layers, not just media. Library itself is 201 files.)


8. fstrim status

```
fstrim.timer    Loaded, enabled, active (waiting)
                Last triggered: Sun 2026-05-03 23:42:29 BST
                Next trigger:   Mon 2026-05-11 01:12:58 BST
fstrim --dry-run /home    →  /home: 0 B (dry run) trimmed
```

Weekly trim is configured and recently ran (last trigger five days before this audit). Dry-run reports 0 B candidate → there is no untrimmed free space on /home. SSD performance degradation from un-TRIMmed blocks is not a factor. No discard mount option (correct — async batched trim via the timer is preferred over inline discard).


9. Read-ahead and queue settings

| Block device | read_ahead_kb | scheduler | nr_requests |
|---|---|---|---|
| nvme0n1 (physical) | 128 KB | [none] mq-deadline | 1023 |
| dm-4 (keystone--vg-home, the LV) | 128 KB | n/a | n/a |

(/sys/block/dm-4 lacks scheduler/nr_requests — dm devices inherit from the underlying device.)

128 KB read-ahead is the kernel default. For sequential MKV streams this is OK; for library-scan workloads (stat + open + read first chunk per file) it's also OK. Bumping to 512 KB or 1024 KB would help scan throughput during a Jellyfin library refresh — minor win, ~30 s of work.

NVMe is using none scheduler (correct for NVMe — multiqueue + no elevator).
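
The current values are plain sysfs reads; checking them (read-only) and the S01 live bump, which was not executed during this audit, look like:

```bash
# Read-only: current read-ahead and scheduler
cat /sys/block/nvme0n1/queue/read_ahead_kb   # 128
cat /sys/block/dm-4/queue/read_ahead_kb      # 128 (LV inherits)
cat /sys/block/nvme0n1/queue/scheduler       # [none] mq-deadline

# S01 live bump — NOT run here; non-persistent, reverts on reboot
echo 512 | sudo tee /sys/block/nvme0n1/queue/read_ahead_kb
```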


10. RO bind-mount overhead — confirmed

(From §2.) Host direct = 2.1 GB/s. Container RO bind = 1.2 GB/s. Overhead ≈ 40 %, which is higher than expected — likely a side-effect of:

  • userns remap (100000.100000 shifts uids)
  • the nosuid,nodev flags on /home propagating into the bind
  • container's read_ahead_kb is not configurable through bind (inherits 128 KB)

Not actionable today. Both numbers are more than 100× any practical media bitrate. Documented to rule out hypothesis (f).

atime cost on the RO bind: the bind mount inherits the host's relatime semantics — at most one atime write per file per 24 h. On 201 files that's ≤ 201 atime writes/day = rounding noise.


11. Concrete remediation list — ranked

Severity legend: R = red (acute, fix this week), Y = yellow (deferred, document risk), G = green (audited, healthy, no action). Effort: S ≤ 30 min, M half-day, L > 1 day.

| # | Severity | Effort | Bucket | Action | Why |
|---|---|---|---|---|---|
| S01 | Y | S | Quick-win | Bump read_ahead_kb on /dev/nvme0n1 to 512 KB (sysfs or udev rule) | Helps library-scan and large-MKV streams. Tiny risk; reverts on reboot if set live. |
| S02 | Y | M | Quick-win | Add noatime (replacing relatime) to the /home mount in /etc/fstab | Eliminates the residual relatime writes; cosmetic but cheap. Requires a remount; do during a window with no playback. |
| S03 | Y | M | Investment | Carve a separate media LV (or attach a second NVMe) for /home/user/media and bind-mount it RO into Jellyfin | Isolates library reads from transcode-write churn and Docker overlay churn on the same ext4 journal. Today it is fine; at scale it will not be. |
| S04 | Y | M | Investment | Move keystone--vg-swap_1 off keystone-vg (or onto a separate device) | Swap is currently 7.8 GiB used and shares the NVMe queue with media reads. CPU saturation is the proximate cause, but cleanly isolating swap helps when CPU finally gets fixed (GPU re-enable, see audit 13 #02). |
| S05 | Y | M | Investment | Add a second PV to keystone-vg so the VG has free space | vgs shows VFree=0. Any future lvextend will fail until a PV is added. Latent ops trap. |
| S06 | G | — | — | Keep weekly fstrim.timer as-is | Healthy, current. |
| S07 | G | — | — | Keep image cache untouched | 84 MB total cache, 16 h retention, no GC pressure. |
| S08 | G | — | — | No change to data=ordered ext4 journal | NVMe; mode is fine. |

The single biggest "loads kinda slow" win lives in audit 13 (finding 03 — enable transcode throttling + segment deletion). Storage is not where this is fixed.


12. Quick-win vs investment

Quick-win (≤30 min total, today)

  • S01 — echo 1024 > /sys/block/nvme0n1/queue/read_ahead_kb (or 512). Reverts on reboot; persist via a udev rule under /etc/udev/rules.d/60-readahead.rules (sketched after this list). Marginal but free.
  • S02 — flip relatime → noatime in /etc/fstab for /home (also sketched below). Cosmetic but cheap. Skip it if the box is under any load — a bad fstab edit + reboot is an outage; only do this during a planned window.
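
A minimal persistence sketch for both quick wins — the rule filename comes from S01 above, and the fstab line is an assumption to be verified against the live file before editing:

```bash
# /etc/udev/rules.d/60-readahead.rules — persists S01 across reboots
ACTION=="add|change", KERNEL=="nvme0n1", ATTR{queue/read_ahead_kb}="1024"

# /etc/fstab — S02: the /home line with relatime replaced by noatime
# (assumption: device path and remaining options copied from the current fstab)
/dev/mapper/keystone--vg-home  /home  ext4  nosuid,nodev,noatime  0  2
```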

Investment (half-day to multi-day, plan)

  • S03 — separate media LV. Requires lvcreate, mkfs, rsync the library, swap the bind-mount in compose (see the sketch after this list). ~half-day. Pays back when (a) the library grows past the current 201 files, (b) GPU transcode is re-enabled (audit 13 #02) and many concurrent reads start happening.
  • S04 — relocate swap. Only meaningful after GPU re-enable closes the CPU-saturation root cause.
  • S05 — second PV. Trivial mechanically (pvcreate, vgextend), blocked on having a second device. Defer until needed.
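
For S03 + S05 together, a command-level sketch — explicitly hypothetical, since it assumes a second device at /dev/nvme1n1 that does not exist on nullstone today (and with VFree=0 the VG cannot host a new LV without it):

```bash
# Hypothetical second NVMe at /dev/nvme1n1 — NOT runnable today
sudo pvcreate /dev/nvme1n1                                 # S05: new PV
sudo vgextend keystone-vg /dev/nvme1n1                     # S05: VG gains free extents
sudo lvcreate -L 300G -n media keystone-vg /dev/nvme1n1    # S03: media LV pinned to the new PV
sudo mkfs.ext4 -L media /dev/keystone-vg/media
sudo mkdir -p /mnt/media-new
sudo mount /dev/keystone-vg/media /mnt/media-new
sudo rsync -aHAX /home/user/media/ /mnt/media-new/
# then: add an fstab entry, repoint the compose RO bind to the new mountpoint,
# and restart the jellyfin container during a no-playback window
```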

No-op (audited and healthy)

  • SMART status (8 % wear, no errors)
  • ext4 features and journal mode
  • Inode usage (6 %)
  • Free space (62 %, 146 GiB headroom)
  • Cache size (84 MB total)
  • Stale transcodes (zero)
  • fstrim.timer (working, candidate-bytes = 0)
  • Bind-mount throughput (1.2 GB/s, 190× any 4K stream)

13. Sign-off

  • Audit: 2026-05-08, read-only, ~15 min wall.
  • No fixes applied. No state mutated. No container restart. No SMART self-test. No fstrim execution. No mount changes.
  • Top storage culprit: none. Storage stack is healthy. The "loads kinda slow" symptom is CPU-side (5 concurrent ffmpegs at load 42, audit 13 #02 + #03).
  • Top quick-win: S01 — bump read_ahead_kb to 512 KB on nvme0n1 for marginal scan/stream gain. Real fix lives in audit 13.
  • Next audit due: 2026-08-08 (quarterly, with audit 13).