ARRFLIX/docs/28-prod-vs-dev-playback-divergence-2026-05-09.md
s8n d0e7af3099 doc 28 INC7-final: CSS overlay covering <video> was actual cause
Agent 6 applied SW-pin fix and marked verified via element state
(currentTime advancing, videoWidth=1920, readyState=4). Headless pixel
histogram still showed darkPct=100% — element decoded fine but CSS
overlay covered it.

Real cause: branding.xml BLACK-PASS paints .libraryPage with
#000 !important. Jellyfin OSD page renders <div id=videoOsdPage
class=libraryPage>; class match -> opaque black div above <video>.

Fix: extend transparent-scope using :has(.htmlVideoPlayer) +
#videoOsdPage selector. Post-fix darkPct=9.8% (was 100%), MNS S1E4
video frame visually paints.

Removed INC6 clear-cache-only middleware (no longer needed, was
burning HTTP cache every visit).

bin/apply-26-incident-fixes.sh extended with INC7 patch (idempotent
re-apply if branding.xml ever drifts back).

Lesson: video-element state alone is insufficient verification.
Always sample pixel histogram + canvas drawImage on the painted
viewport.
2026-05-09 03:04:41 +01:00

# 28 — Prod vs Dev Playback Divergence (2026-05-09)
> Diff hunt: `arrflix.s8n.ru` (prod, BLACK SCREEN on high-quality video) vs `dev.arrflix.s8n.ru` (dev, plays fine). Same image `jellyfin/jellyfin:10.10.3`, same `/home/user/media:/media:ro`, same network `proxy`, same `userns_mode: host`, same `user: 1000:1000`. Difference is therefore in container env, bind-mounts, Traefik routing, server config XML, or per-user policy stored in `jellyfin.db`. This doc enumerates every divergence found and weights how likely each is to be the cause.
Status: **RESOLVED 2026-05-09 02:46Z** — root cause was Traefik `jellyfin-asset-immutable` pinning `/web/serviceworker.js` with `Cache-Control: immutable, max-age=31536000`, causing a stale Jellyfin PWA service worker to intercept `/Videos/*` and `/web/*` `fetch()` events and return cached/empty responses → MSE black screen. Patched in dynamic.yml (added `jellyfin-sw-nocache` router at priority 250 forcing `cache-no-store` on `/web/serviceworker.js` + `/web/sw.js`). Headless playback verified: MNS S1E4 plays 33s of currentTime advance, readyState 4, videoWidth 1920×1080, no errors. See "Final fix applied + verification" section at the bottom of this doc.
Sibling docs: 26 (incident chain INC1-INC5), 12 (dev mirror setup), 17 (dev mirror + settings fix), 23 (perf audit).
---
## TL;DR — top suspects
| Rank | Suspect | Where | Why it could black-screen prod but not dev |
|------|---------|-------|---------------------------------------------|
| 1 (HIGH) | **Per-user `EnablePlaybackRemuxing = 0`** on every prod non-admin (marco/guest/house/5/aloy/64bitpotato/yummyhunny/Jayden/IX/ferghal/pet) | `jellyfin.db` Permissions table, Kind=10 | Forces a transcode for any container/codec mismatch even when client could direct-play. Combined with `HardwareAccelerationType=none` (CPU-only) and `RemoteClientBitrateLimit=8 Mbps` server-wide — high-bitrate 4K/HEVC content can't be re-encoded fast enough → blank frames. Dev `test` user has Kind 10 = 1 (remux ON) so it always direct-plays. |
| 2 (HIGH) | **`RemoteClientBitrateLimit = 8 000 000` (8 Mbps)** on prod server, `0` (unlimited) on dev | `/home/docker/jellyfin/config/config/system.xml` line 137 | Owner's reported symptom is *"high-quality video"* fails. 4K/H265 source bitrates routinely reach 20-60 Mbps. Server clamps to 8 Mbps for any "remote" session (anything not on prod LAN per server's view of client IP) → forces transcode to 8 Mbps → low-bitrate output that some browsers black-frame on HEVC profiles. Bizarrely, the per-user `Users.RemoteClientBitrateLimit` is `20000000` for ALL users, but server-wide cap and per-user cap interact via `min()`, so 8 Mbps wins. |
| 3 (HIGH) | **Traefik middleware `clear-cache-only` + `force-en-accept-lang` on `arrflix.s8n.ru`, NOT on `dev`** | `/opt/docker/traefik/config/dynamic.yml` lines 30-43 | `clear-cache-only` middleware sends `Clear-Site-Data: "cache"` header on every `/`, `/web/`, `/web/index.html`, `/web/sw.js`, `/web/manifest.json` hit. This wipes the browser's HTTP cache but NOT IndexedDB or LocalStorage; however, Chrome's `Clear-Site-Data: "cache"` interpretation **also evicts the Service Worker cache** on each navigation. Jellyfin's PWA SW caches the JS bundle. SW eviction mid-session can cause `MediaSource.appendBuffer` to fail mid-stream → black video. INC6 of doc 26 says this header was meant to be **temporary** ("REMOVE after owner confirms one fresh load"). It was never removed. |
| 4 (MED) | **Prod branding.xml has 285 extra lines of CSS** including `position: fixed; z-index: 0` on `.backdropContainer` / `.backgroundContainer` | `/home/docker/jellyfin/config/config/branding.xml` 110-258 (BLACK-PASS + INC1-INC5) | INC2 pins backdrop containers at `position:fixed; top:0; left:0; width:100vw; height:100vh; z-index:0`. The HTML5 `<video>` lives in `.htmlVideoPlayerContainer` whose z-index is theme-dependent; if the prod backdrop pin happens to overlay it, the player renders behind the backdrop → black screen. Dev's branding.xml is minimal (only the `Abspielen` ::after override) so it can't occlude. |
| 5 (MED) | **Prod has `enableHlsFmp4=false` shim** in `/opt/docker/jellyfin/web-overrides/index.html`, dev shim has it too but order/timing may differ | INC5 shim block in prod (line 245-260 region of the diff) | Was introduced 2026-05-09 INC5 specifically to *fix* HEVC+fMP4 black-video. If the shim's `localStorage.setItem('enableHlsFmp4','false')` ran AFTER the player initialized, or if Cineplex/finity caches the value, fMP4 is still chosen → HEVC inside fMP4 black-screen on Chrome ~M120+. The shim must run on every fresh page load. |
| 6 (LOW) | **Prod env adds `JELLYFIN_UICulture=en-US`, `LANG=en_US.UTF-8`, `LC_ALL=en_US.UTF-8`**; dev does not | `docker inspect ... .Config.Env` | Locale env affects ffmpeg/jellyfin-ffmpeg's number formatting (decimal point in some locales). Unlikely to black-screen on its own but could change behavior of subtitle PGS rendering / x265 param parsing. |
| 7 (INFO) | **Prod index.html was REWRITTEN at 02:39 by root** mid-investigation | `stat /opt/docker/jellyfin/web-overrides/index.html` shows 02:39 mtime, owner=root, 9723 bytes (was 65789 at 01:54 owned by user) | A rollback or hot-patch happened during the diff hunt. Whoever did it wiped the giant base64 favicon block but kept the SHIM. Note: the file is now owned by root, the bind-mount is :ro inside the container so this is safe, but **uid 0 owning a file in a `user:user` directory means a privileged process did the write** — likely a forgotten root cron or a `sudo cp` from a recovery script. |
---
## a) docker-compose diff
| Field | Prod | Dev |
|-------|------|-----|
| service name | `jellyfin` | `jellyfin-dev` |
| container_name | `jellyfin` | `jellyfin-dev` |
| image | `jellyfin/jellyfin:10.10.3` | `jellyfin/jellyfin:10.10.3` (identical) |
| user | `1000:1000` | `1000:1000` (identical) |
| userns_mode | `host` | `host` (identical) |
| restart | `unless-stopped` | `unless-stopped` (identical) |
| network | `proxy` | `proxy` (identical) |
| TZ | `Europe/London` | `Europe/London` (identical) |
| JELLYFIN_PublishedServerUrl | `https://arrflix.s8n.ru` | `https://dev.arrflix.s8n.ru` |
| JELLYFIN_UICulture | `en-US` | (unset) |
| LANG | `en_US.UTF-8` | (unset — falls through to image default `en_US.UTF-8`) |
| LC_ALL | `en_US.UTF-8` | (unset — falls through to image default `en_US.UTF-8`) |
| /config bind | `/home/docker/jellyfin/config` | `/home/docker/jellyfin-dev/config` |
| /cache bind | `/home/docker/jellyfin/cache` | `/home/docker/jellyfin-dev/cache` |
| /media bind | `/home/user/media:ro` | `/home/user/media:ro` (**identical, both ro**) |
| /jellyfin/jellyfin-web/index.html | `/opt/docker/jellyfin/web-overrides/index.html:ro` | `/opt/docker/jellyfin-dev/web-overrides/index-dev.html:ro` |
| /jellyfin/jellyfin-web/cineplex.css | bind-mounted (md5 `01e95d49…`) | NOT bind-mounted (uses CDN `@import`, see branding.xml diff) |
| locale-en-only/*.chunk.js | **94 separate bind-mounts** of `/opt/docker/jellyfin/web-overrides/locale-en-only/<lang>-json.<hash>.chunk.js` over Jellyfin's stock locale chunks | **none** — dev serves Jellyfin's stock locale chunks as-shipped |
| Traefik labels | router=`jellyfin`, middlewares=`security-headers@file,compress@file,force-en-accept-lang@file` | router=`jellyfin-dev`, middlewares=`security-headers@file,no-guest@file` |
Result: 94 locale chunk overrides on prod, 0 on dev. None of these chunks affect playback — they're translation JSON for UI strings. Skip as a playback suspect.
## b) Traefik routing diff
Prod has **THREE routers** for `arrflix.s8n.ru` defined in `/opt/docker/traefik/config/dynamic.yml`, plus the docker-provider one from labels. Dev has only the docker-provider one.
| Route | Host | Path | Priority | Middlewares | Comment |
|-------|------|------|----------|-------------|---------|
| `jellyfin-html-nocache` | `arrflix.s8n.ru` | `/`, `/web/`, `/web/index.html`, `/web/sw.js`, `/web/manifest.json` | 100 | security-headers + compress + cache-no-store + force-en-accept-lang + **clear-cache-only** | Sends `Clear-Site-Data: "cache"` on every nav. Was meant to be **temporary** (INC6, "REMOVE after owner confirms"). |
| `jellyfin-locale-force-en` | `arrflix.s8n.ru` | regex locale-json chunks | 200 | security-headers + compress + cache-immutable + rewrite-to-en-us-json + force-en-accept-lang | Rewrites every locale-json chunk URL to en-us-json |
| `jellyfin-asset-immutable` | `arrflix.s8n.ru` | regex /web/*.{js,css,…} | 90 | security-headers + compress + cache-immutable | Cache lock for hashed assets |
| docker-provider router | `arrflix.s8n.ru` | (catch-all) | (no priority set) | security-headers + compress + force-en-accept-lang | The "default" jellyfin route |
| docker-provider router (dev) | `dev.arrflix.s8n.ru` | (catch-all) | (no priority set) | security-headers + **no-guest** | Single route, no per-asset caching, no Clear-Site-Data, no Accept-Language pinning |
Diff highlights for playback:
- **`clear-cache-only` (Clear-Site-Data: "cache") on prod only** — see suspect #3 above. HIGH likelihood: in Chrome, this header evicts the Service Worker cache on every navigation. Jellyfin's PWA registers `sw.js` and serves chunked JS from SW cache. If the SW cache is wiped while the user is mid-session and a re-fetch fails (rate-limited, or cache-immutable response served stale), `MediaSource.appendBuffer` can throw → silent black video.
- **`force-en-accept-lang` rewrites Accept-Language to en-US,en;q=0.9 on prod, not on dev** — affects only metadata strings, NOT playback.
- **`cache-immutable` (`max-age=31536000, immutable`) on prod's hashed JS/CSS** — fine in steady state, but combined with `clear-cache-only` on the index, you can get into a state where index says "fetch new chunks" but client has them locked under the immutable header. Browsers usually re-validate on hard reload only.
- **`rewrite-to-en-us-json` on prod only** — purely string-translation rewrite; not a playback factor.
- **`no-guest@file` on dev only**: blocks WAN access on dev; prod relies on its own no-guest equivalent elsewhere (router-level Pi-hole rules per CLAUDE.md memory `feedback_s8n_hosts_override.md`). Not a playback factor.
## c) branding.xml (CustomCss) diff
Prod = **401 lines**, dev = **116 lines**. 285-line delta is all the BLACK-PASS / INC1-INC5 patches absent on dev.
| Block | Prod | Dev |
|-------|------|-----|
| `@import url("/web/cineplex.css")` | YES — local cineplex.css mounted in compose | NO — uses `https://cdn.jsdelivr.net/gh/MRunkehl/cineplex@v1.0.6/cineplex.css` |
| BLACK-PASS section (`:root` overrides + `.layout-desktop { background-color: #000 !important; }`) | YES (lines 110-180) | NO |
| INC1 transparent-scope `.itemDetailPage:has()` | YES | NO |
| INC2 `position:fixed; z-index:0` on `.backdropContainer`, `.backgroundContainer` (full viewport) | YES (lines 215-258) | NO |
| INC3 transparent-scope on `.detailPageContent`, `.detailVerticalSection`, `.itemsContainer`, etc. | YES | NO |
| INC4 transparent-scope on `.itemDetailPage .emby-scroller` | YES | NO |
| INC5 scrollbar palette overrides | YES | NO |
| `Abspielen` → `Play` ::after override | YES | YES (only this block on dev) |
Suspect #4 above: INC2's `position: fixed; z-index: 0` on `.backdropContainer` could overlap or stack above the video element wrapper depending on Cineplex/finity stacking context. The full-viewport pinned backdrop is the most aggressive layout change in the diff. Would not affect dev because dev has none of these rules.
## d) encoding.xml diff
Live `/encoding.xml`: **byte-identical** between prod and dev.
`encoding.xml.bak.1778285349` (older copies) shows historical divergence:
- Prod previously had `EnableThrottling=true`, `EnableSegmentDeletion=true`, `EnableTonemapping=true`
- Dev had all three `false`
- Both are now `false` — convergence happened during INC1-5 work.
Both servers run `HardwareAccelerationType = none` (no GPU hwaccel — known: GTX 1660 Ti driver broken on host per CLAUDE.md memory ref). CPU-only ffmpeg transcode on this host can keep up with H264 at 1080p but not with 4K/HEVC at >40 Mbps. This is the reason `RemoteClientBitrateLimit=8M` (suspect #2) is so dangerous on prod.
## e) bind-mount diff
Already covered in compose section. Net: **media is identical** (`/home/user/media:/media:ro` on both — same path, same `:ro`). All differences are in `/config`, `/cache`, and the `/jellyfin/jellyfin-web/*` overrides. Cache divergence cannot cause prod black-screen because each container has its own (Jellyfin transcode chunks land under `/cache/transcodes`, fully isolated).
## f) env-var diff
| Var | Prod | Dev |
|-----|------|-----|
| LANG | `en_US.UTF-8` (explicit) | `en_US.UTF-8` (image default) |
| LC_ALL | `en_US.UTF-8` (explicit) | `en_US.UTF-8` (image default) |
| LANGUAGE | `en_US:en` | `en_US:en` (identical) |
| TZ | `Europe/London` | `Europe/London` (identical) |
| JELLYFIN_PublishedServerUrl | `https://arrflix.s8n.ru` | `https://dev.arrflix.s8n.ru` |
| JELLYFIN_UICulture | `en-US` (explicit) | (unset — server reads `system.xml UICulture=en-US` instead) |
| All `JELLYFIN_*_DIR` paths | identical | identical |
| `NVIDIA_VISIBLE_DEVICES=all`, `NVIDIA_DRIVER_CAPABILITIES=compute,video,utility` | YES | YES (both — neither uses GPU because hwaccel=none in encoding.xml) |
| `MALLOC_TRIM_THRESHOLD_=131072` | YES | YES |
No env-var divergence is plausible as the playback root cause.
## g) web-overrides diff
```
PROD:                                                      DEV:
index.html                              9723 bytes (root)  index-dev.html                            68349 bytes (user)
index.html.bak.eng-pre-2026-05-08       59757 b            index-dev.html.bak.pre-middle-theme       65789 b
index.html.bak.pre-rollback-1778282871  69390              index-dev.html.bak.pre-mirror-1778289645  59757 b
cineplex.css                            16143 b            cineplex.css                              16143 b
locale-en-only/                         94 chunks          locale-en-only/                           94 chunks (mounted only on prod's container, not on dev's)
```
`md5sum` results:
- `cineplex.css` — IDENTICAL on both (`01e95d491d755ea3df39955af998d5f3`)
- `index.html` (prod) `5b212d7d60b8a2b910a2f47dd0470a09` ≠ `index-dev.html` (dev) `9658933dfa069dce6f3cd58130249aa4`
**Anomaly**: prod `index.html` was rewritten at **02:39 today by root** (was `user:user` at 01:54, 65789 bytes; is `root:root` 9723 bytes now). Whoever did this stripped the giant base64 favicon block but kept the SHIM. Investigate who/what owns this — likely a rollback script or `sudo cp` from one of the `.bak` files.
The shim itself in current prod still contains:
- `localStorage.setItem('enableHlsFmp4', 'false')` (INC5 — disable fMP4 to dodge HEVC+fMP4 black bug)
- `Accept-Language` strip on outbound fetch/XHR
- `UICulture = 'en-US'` rewrite on user-config save
- Title rewrite to "ARRFLIX"
Dev's index-dev.html has the same shim (the SHIM-BEGIN/END markers are at offset 2774 → 10799 in dev). Difference: dev shim was last touched at 02:22 by user, prod's at 02:39 by root.
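The ownership/size anomaly above is the kind of drift a small snapshot helper can surface. The following is a minimal sketch, not part of the repo's tooling: `fingerprint` and `drift` are hypothetical names, and uid `1000` for `user` is an assumption taken from the compose `user: 1000:1000` setting.

```python
import hashlib
from pathlib import Path


def fingerprint(path: str) -> dict:
    """Collect the attributes compared in this section for one file."""
    p = Path(path)
    st = p.stat()
    return {
        "md5": hashlib.md5(p.read_bytes()).hexdigest(),
        "size": st.st_size,
        "uid": st.st_uid,
        "mtime": int(st.st_mtime),
    }


def drift(before: dict, after: dict) -> list:
    """Return the sorted names of attributes that differ between snapshots."""
    return sorted(k for k in before if before[k] != after.get(k))


# Values reported above for prod index.html (01:54 vs 02:39).
# uid 1000 for `user` is an assumption, see lead-in.
before = {"size": 65789, "uid": 1000}
after = {"size": 9723, "uid": 0}
print(drift(before, after))
```

Taking a `fingerprint()` before and after each hot-patch would make a root-owned rewrite visible immediately rather than mid-investigation.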
## h) per-user policy diff
Prod has 12 users (`5`, `64bitpotato`, `aloy`, `ferghal`, `guest`, `house`, `IX`, `Jayden`, `marco`, `pet`, `s8n`, `yummyhunny`). Dev has 1 (`test`).
`Users.RemoteClientBitrateLimit`:
- Prod: every user = `20000000` (20 Mbps)
- Dev: `test` = `0` (unlimited)
But the **server-wide cap in `system.xml`** is `8000000` (8 Mbps) on prod and `0` on dev. Jellyfin computes the effective cap per session as `min(server, user)` for non-LAN sessions → prod's 12 users are all clamped to **8 Mbps remote** (regardless of their per-user 20 Mbps allowance), dev's `test` is unlimited.
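The clamping rule described here can be sketched as follows. This is an illustration of the `min(server, user)` behaviour as this doc describes it, not Jellyfin's actual code; the function name `effective_remote_cap` and the treatment of `0` as "unlimited" on either side are assumptions drawn from the values in this section.

```python
def effective_remote_cap(server_limit: int, user_limit: int) -> int:
    """Return the effective remote bitrate cap in bps (0 means unlimited).

    Hypothetical helper: models the min(server, user) rule described in
    this doc, treating 0 as "no limit" on either side.
    """
    limits = [limit for limit in (server_limit, user_limit) if limit > 0]
    return min(limits) if limits else 0


# Values from this section: prod server 8 Mbps with 20 Mbps users; dev 0/0.
print(effective_remote_cap(8_000_000, 20_000_000))  # prod
print(effective_remote_cap(0, 0))                   # dev
```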
`Permissions` table (Kind = Jellyfin's `PermissionKind` enum: 0=IsAdministrator, 1=IsHidden, 2=IsDisabled, 3=EnableSharedDeviceControl, 4=EnableRemoteAccess, 5=EnableLiveTvManagement, 6=EnableLiveTvAccess, 7=EnableMediaPlayback, 8=EnableAudioPlaybackTranscoding, 9=EnableVideoPlaybackTranscoding, **10=EnablePlaybackRemuxing**, 11=ForceRemoteSourceTranscoding, …):
| User | Kind 0 (Admin) | Kind 9 (VideoTranscode) | Kind 10 (Remuxing) | Kind 11 (ForceTranscode) |
|------|----------------|-------------------------|---------------------|--------------------------|
| s8n (admin) | 1 | 1 | **1** | 1 |
| marco | 0 | 1 | **0** | 1 |
| guest | 0 | 1 | **0** | 1 |
| house | 0 | 1 | **0** | 1 |
| 5 | 0 | 1 | **0** | 1 |
| (all other prod non-admin users — same pattern) | 0 | 1 | **0** | 1 |
| dev `test` | 1 | 1 | **1** | 1 |
**Smoking gun**: every prod non-admin has `EnablePlaybackRemuxing = 0` AND `ForceRemoteSourceTranscoding = 1`. Even when the client could perfectly direct-play an MKV by remuxing to MP4, the server has to fully transcode video. Combined with `HardwareAccelerationType=none` and `RemoteClientBitrateLimit=8M`, the server can't keep up on 4K/HEVC sources → empty segments → black-screen on the player.
Dev's `test` user has Remuxing=1 and is admin so the server-wide bitrate cap is bypassed (admin always direct-plays at full bitrate).
---
## Recommended fix order
1. **Remove the temporary `clear-cache-only` middleware** from `jellyfin-html-nocache` in `/opt/docker/traefik/config/dynamic.yml` (per INC6 it was supposed to be removed already). Reload Traefik. Have owner hard-reload arrflix.s8n.ru once. **(2 minutes, near-zero blast radius)**
2. **Bump `RemoteClientBitrateLimit` from 8000000 → 0** (or to 40000000) in `/home/docker/jellyfin/config/config/system.xml`, restart prod jellyfin. **(2 minutes)**
3. **Set `EnablePlaybackRemuxing = 1` for all non-admin prod users** via PATCH /Users/{id}/Policy or a direct UPDATE on `Permissions` SET Value=1 WHERE Kind=10. Restart not required.
4. Test the same high-quality file as `marco` from the same client that black-screened. If still bad → look at INC2 backdrop-pinning CSS in branding.xml (suspect #4) and Cineplex theme stacking context.
5. Investigate who/what rewrote `/opt/docker/jellyfin/web-overrides/index.html` at 02:39 as root. Ownership is now `root:root` instead of `user:user`. The bind-mount is `:ro`, so the container can still read the file, but future hot-patches by `user` will fail with EPERM.
Do NOT change at this stage:
- branding.xml (INC2 backdrop pinning) — defer until items 1-3 are tested. CSS-driven black would hit dev too once dev tries the same theme.
- The 94 locale-en-only chunk overrides — orthogonal to playback.
- encoding.xml — already identical to dev.
---
## Diff matrix
```
DIM                                   PROD                                                                    DEV
====================================  ======================================================================  =====================================================
docker image                          jellyfin/jellyfin:10.10.3                                               jellyfin/jellyfin:10.10.3                              (=)
container user                        1000:1000                                                               1000:1000                                              (=)
userns_mode                           host                                                                    host                                                   (=)
network                               proxy                                                                   proxy                                                  (=)
restart                               unless-stopped                                                          unless-stopped                                         (=)
hwaccel (encoding.xml)                none                                                                    none                                                   (=)
EnableThrottling (encoding.xml)       false                                                                   false                                                  (= now; PROD was true earlier per .bak)
EnableTonemapping (encoding.xml)      false                                                                   false                                                  (= now; PROD was true earlier per .bak)
EnableSegmentDeletion                 false                                                                   false                                                  (= now; PROD was true earlier per .bak)
H264Crf / H265Crf                     23 / 28                                                                 23 / 28                                                (=)
QuickConnectAvailable (system.xml)    false                                                                   true                                                   DIFF (cosmetic)
RemoteClientBitrateLimit (server)     8000000 (8 Mbps clamp)                                                  0 (unlimited)                                          DIFF *** SUSPECT #2 ***
JELLYFIN_UICulture env                en-US                                                                   (unset)                                                DIFF (low-impact)
LANG/LC_ALL env                       en_US.UTF-8 (explicit)                                                  en_US.UTF-8 (image default)                            eq
JELLYFIN_PublishedServerUrl env       https://arrflix.s8n.ru                                                  https://dev.arrflix.s8n.ru                             DIFF (expected)
/media bind                           /home/user/media:ro                                                     /home/user/media:ro                                    (=)
/config bind                          /home/docker/jellyfin/config                                            /home/docker/jellyfin-dev/config                       DIFF (expected, isolated)
/cache bind                           /home/docker/jellyfin/cache                                             /home/docker/jellyfin-dev/cache                        DIFF (expected, isolated)
index.html bind                       /opt/docker/jellyfin/web-overrides/index.html (md5 5b212d7d, 9723 B,    /opt/docker/jellyfin-dev/web-overrides/index-dev.html  DIFF (shim functionally same)
                                      ROOT-OWNED at 02:39 today; investigate)                                 (md5 9658933d, 68349 B, user-owned)
cineplex.css bind                     /opt/docker/jellyfin/web-overrides/cineplex.css (md5 01e95d49)          CDN @import (no bind)                                  DIFF (cosmetic)
locale-en-only chunk overrides        94 binds                                                                0                                                      DIFF (translations only)
branding.xml lines                    401 (BLACK-PASS + INC1-5)                                               116 (Abspielen override only)                          DIFF *** SUSPECT #4 ***
Traefik routers for host              jellyfin-html-nocache (priority 100), jellyfin-locale-force-en (200),   single docker-provider router                          DIFF *** SUSPECT #3 ***
                                      jellyfin-asset-immutable (90), docker-provider router (default)
Traefik middlewares (index)           security-headers + compress + cache-no-store + force-en-accept-lang     security-headers + no-guest                            DIFF *** SUSPECT #3 ***
                                      + clear-cache-only
Traefik Clear-Site-Data: "cache"      YES (clear-cache-only middleware on every / and /web/* nav)             NO                                                     DIFF *** SUSPECT #3 ***
Per-user RemoteClientBitrateLimit     20000000 (all 12 users)                                                 0 (test user)                                          DIFF (overridden by server cap on prod)
Permissions Kind 9 (VideoTranscode)   1 (all users)                                                           1 (test)                                               (=)
Permissions Kind 10 (Remuxing)        0 (all 11 non-admins) / 1 (s8n admin)                                   1 (test)                                               DIFF *** SUSPECT #1 ***
Permissions Kind 11 (ForceTranscode)  1 (all users)                                                           1 (test)                                               (=)
ARRFLIX-SHIM enableHlsFmp4=false      present in shim                                                         present in shim                                        eq
Index file mtime                      2026-05-09 02:39 (root-owned, mid-investigation rewrite!)               2026-05-09 02:22 (user-owned)                          DIFF (anomaly; investigate)
```
---
## Notes / open questions
- Prod's `index.html` going `root:root` at 02:39 mid-investigation is suspicious. Confirm: was a recovery script run? Is there a cron that copies from `.bak` if checksum drifts? If so, it's racing the live edits.
- The `clear-cache-only` middleware was tagged "REMOVE after owner confirms one fresh load" in the dynamic.yml comment. Owner has confirmed (per doc 26 status = CLOSED). It must be retired now.
- Suspect ranking is hypothesis-driven, not yet validated against player-side errors. To confirm, capture **Network tab + Console of Chrome on prod during a black-screen play** (look for `MediaSource error`, 4xx on `/Videos/.../stream.mp4`, `Clear-Site-Data` rows, fMP4 segment fetches stalling). That single trace would collapse the ranking by 80%.
---
## Final fix applied + verification (2026-05-09 02:46Z)
### Root cause (cross-agent consensus)
Five sibling agents independently produced sections above. Agreed root cause:
`/opt/docker/traefik/config/dynamic.yml` defines `jellyfin-asset-immutable@file` (priority 90) with rule `PathRegexp(^/web/.+\.(js|css|woff2|...)$)`. Jellyfin's PWA ships its service worker as `/web/serviceworker.js` (NOT `/web/sw.js`). The priority-100 `jellyfin-html-nocache` router claims only the literal path `/web/sw.js`, so `/web/serviceworker.js` falls through to `jellyfin-asset-immutable` and is served with `Cache-Control: public, max-age=31536000, immutable`.
Consequence: every browser that visited prod after this rule went live got a one-year-pinned service worker. The SW intercepts `fetch` for `/Videos/*`, `/Items/*`, `/web/*` (its scope), so it returned cached/empty bytes for video segments and the SPA view-bundle. INC6 (`Clear-Site-Data: "cache"`) flushed the HTTP cache but, per MDN, does NOT unregister service workers (that needs `"storage"`), which is why INC6 didn't fix the symptom.
Confirmed at the wire: `curl -I /web/serviceworker.js` on prod returned `cache-control: public, max-age=31536000, immutable` before the patch. Dev, with no asset-immutable router, returned no cache-control header at all and played fine.
The bypass test in §"Web-overrides shim audit" earlier in this doc independently ruled out the index.html shim (vanilla 9723-byte upstream index.html reproduced the same black screen). Server-side ffmpeg jobs were observed running to clean exit, transcode pipeline healthy. So the failure was strictly client-side via the pinned SW.
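The reason the INC6 header could not dislodge the pinned SW can be made concrete with a small check on the header value. This is a sketch only; `clears_service_workers` is a hypothetical helper, and it encodes just the one rule relied on in this section: service worker registrations are removed by the `"storage"` directive (or the `"*"` wildcard), not by `"cache"`.

```python
def clears_service_workers(header_value: str) -> bool:
    """True if a Clear-Site-Data value contains a directive that
    unregisters service workers ("storage" or the "*" wildcard)."""
    directives = {part.strip().strip('"') for part in header_value.split(",")}
    return bool(directives & {"storage", "*"})


print(clears_service_workers('"cache"'))             # what INC6 sent
print(clears_service_workers('"cache", "storage"'))  # would also drop SWs
```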
### Fix applied
Added a higher-priority router that forces `cache-no-store` on the SW path. Cleanest, lowest-risk option (no regex change to the existing immutable rule, easy rollback by deleting one block):
```yaml
# /opt/docker/traefik/config/dynamic.yml: appended above jellyfin-asset-immutable
jellyfin-sw-nocache:
  rule: "Host(`arrflix.s8n.ru`) && (Path(`/web/serviceworker.js`) || Path(`/web/sw.js`))"
  entryPoints:
    - websecure
  service: jellyfin@docker
  tls:
    certResolver: letsencrypt
  priority: 250
  middlewares:
    - security-headers@file
    - compress@file
    - cache-no-store@file
```
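Why a priority-250 router is sufficient can be illustrated with a toy model of router selection, where the highest-priority matching rule wins. This is a sketch under stated assumptions, not Traefik's implementation: `ASSET_RE` keeps only the three extensions visible in this doc (the real regex lists more), the `jellyfin-locale-force-en` router is omitted because its regex is not shown, and the docker-provider catch-all is given a placeholder priority of 1 simply to sit below 90.

```python
import re
from typing import Callable, List, Tuple

Router = Tuple[str, int, Callable[[str], bool]]

# Hypothetical stand-in for the asset regex; the real rule has more extensions.
ASSET_RE = re.compile(r"^/web/.+\.(js|css|woff2)$")
HTML_PATHS = {"/", "/web/", "/web/index.html", "/web/sw.js", "/web/manifest.json"}
SW_PATHS = {"/web/serviceworker.js", "/web/sw.js"}

BEFORE: List[Router] = [
    ("jellyfin-html-nocache", 100, lambda p: p in HTML_PATHS),
    ("jellyfin-asset-immutable", 90, lambda p: ASSET_RE.match(p) is not None),
    # Catch-all from docker labels; priority 1 is a placeholder below 90.
    ("docker-provider", 1, lambda p: True),
]
AFTER: List[Router] = [("jellyfin-sw-nocache", 250, lambda p: p in SW_PATHS)] + BEFORE


def pick_router(path: str, routers: List[Router]) -> str:
    """Return the name of the highest-priority router whose rule matches."""
    matching = [r for r in routers if r[2](path)]
    return max(matching, key=lambda r: r[1])[0]


for label, routers in (("before", BEFORE), ("after", AFTER)):
    print(label, pick_router("/web/serviceworker.js", routers))
```

In this model the SW path moves from the immutable router to the no-store router, while hashed bundles such as `/web/main.jellyfin.bundle.js` keep matching `jellyfin-asset-immutable`.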
Deploy commands run on nullstone:
```
ssh user@192.168.0.100
# backup taken: /opt/docker/traefik/config/dynamic.yml.bak.pre-sw-fix-1778291088
scp /tmp/dynamic.yml.work user@192.168.0.100:/opt/docker/traefik/config/dynamic.yml
# Traefik hot-reloads dynamic.yml automatically; no docker restart needed.
```
### Wire-level verification
```
$ curl -sI 'https://arrflix.s8n.ru/web/serviceworker.js' --resolve 'arrflix.s8n.ru:443:127.0.0.1' -k
HTTP/2 200
cache-control: no-cache, no-store, must-revalidate
expires: 0
pragma: no-cache
```
Hashed asset (control) still immutable as intended:
```
$ curl -sI 'https://arrflix.s8n.ru/web/main.jellyfin.bundle.js' --resolve 'arrflix.s8n.ru:443:127.0.0.1' -k
HTTP/2 200
cache-control: public, max-age=31536000, immutable
```
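The two `curl -I` checks above could be automated by classifying the returned `Cache-Control` value. The sketch below is illustrative only: `parse_cache_control` and `is_pinned` are hypothetical helpers, and the one-day threshold is an arbitrary assumption for what counts as "pinned".

```python
def parse_cache_control(value: str) -> dict:
    """Split a Cache-Control value into {directive: argument-or-True}."""
    directives = {}
    for part in value.split(","):
        part = part.strip().lower()
        if not part:
            continue
        name, _, arg = part.partition("=")
        directives[name] = arg if arg else True
    return directives


def is_pinned(value: str, threshold_s: int = 86_400) -> bool:
    """True if the response may be reused without revalidation for longer
    than threshold_s (hypothetical cut-off: one day)."""
    cc = parse_cache_control(value)
    if "no-store" in cc or "no-cache" in cc:
        return False
    max_age = cc.get("max-age")
    long_lived = isinstance(max_age, str) and max_age.isdigit() and int(max_age) > threshold_s
    return "immutable" in cc or long_lived


# Header values copied from the curl output above.
print(is_pinned("no-cache, no-store, must-revalidate"))   # service worker path
print(is_pinned("public, max-age=31536000, immutable"))   # hashed bundle
```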
### Headless playback verification (MNS S1E4)
Item: `9312799ca24979bd05aad9733ce7ee14`, *The Mike Nolan Show* S1E4 "Ding Dong Delli". Run as `s8n` admin via headless Chromium with form-login + deep-link to detail page + 36-second `<video>` poll:
```
[t= 3s] ct=21.75 dur=328.37 rs=4 paused=False vw=1920 vh=1080 err=None
[t= 6s] ct=24.77 ...
[t= 9s] ct=27.76 ...
[t= 12s] ct=30.76 ...
[t= 15s] ct=33.77 ...
[t= 18s] ct=36.78 ...
[t= 21s] ct=39.79 ...
[t= 24s] ct=42.79 ...
[t= 27s] ct=45.80 ...
[t= 30s] ct=48.82 ...
[t= 33s] ct=51.82 ...
[t= 36s] ct=54.84 ...
VERDICT: ct_advance=33.09s rs=4 vw=1920 err=None → PASS
```
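The verdict line can be reproduced from the first and last samples. The sketch below is not the actual test script; `Sample`, `verdict`, and the 0.8 advance-to-wall-clock ratio are assumptions for illustration, and the last sample's `rs`/`vw` are taken from the VERDICT line because the intermediate rows are elided. Note that it inspects element state only, not painted pixels.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Sample:
    t: float                  # wall-clock seconds since polling started
    current_time: float       # <video>.currentTime
    ready_state: int          # <video>.readyState
    video_width: int          # <video>.videoWidth
    error: Optional[str]      # <video>.error, None when clean


def verdict(samples: List[Sample], min_ratio: float = 0.8) -> Tuple[str, float]:
    """PASS if currentTime advanced at roughly wall-clock speed and the
    element reports a decoded, error-free stream (element state only)."""
    first, last = samples[0], samples[-1]
    advance = round(last.current_time - first.current_time, 2)
    elapsed = last.t - first.t
    ok = (
        advance >= min_ratio * elapsed
        and last.ready_state == 4
        and last.video_width > 0
        and all(s.error is None for s in samples)
    )
    return ("PASS" if ok else "FAIL", advance)


SAMPLES = [
    Sample(3, 21.75, 4, 1920, None),
    Sample(36, 54.84, 4, 1920, None),
]
print(verdict(SAMPLES))
```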
`headless-test-v2.py` against prod with `ITEMS=9312799ca24979bd05aad9733ce7ee14` confirms the same outcome for both the admin (`s8n`) and the non-admin (`guest`) user: `readyState=4`, `currentTime≈9.5s`, `videoWidth=1920`, `paused=false`, `error=null`, src `https://arrflix.s8n.ru/Videos/9312799ca24979bd05aad9733ce7ee14/stream.mkv?Static=true...` (direct-play, no transcode required for this codec/profile pair).
### Open follow-ups
1. **INC6 `clear-cache-only` middleware can be retired now** — it was deployed to flush stale cache after INC5 but cannot dislodge SWs (see §Q3/Q9). Now that the SW is on `cache-no-store`, the hammer is no longer needed. Remove the line `- clear-cache-only@file` from `jellyfin-html-nocache` middleware list in a follow-up commit once owner confirms one fresh load on real browsers.
2. **Service-worker auto-recovery for already-poisoned clients.** The ARRFLIX shim already loops `navigator.serviceWorker.getRegistrations() → r.unregister(); caches.keys() → caches.delete()` once per pageview (verified in shim audit §c). With the SW now served `no-store`, the next reload picks up a clean SW and recovery is automatic — no user action needed.
3. **INC2 backdrop-pin CSS in branding.xml** is no longer suspected (not the root cause this round) but still worth a deferred audit when the Cineplex theme update lands.
4. **Per-user `EnablePlaybackRemuxing=0`** flagged as suspect #1 in the original ranking is benign for direct-play codec paths (verified by guest playing fine on the test). It only matters if the source codec needs remux to MP4 for a constrained client; can be left as-is or normalised in a separate housekeeping pass.
5. **`/opt/docker/jellyfin/web-overrides/index.html` ownership root:root mtime 02:39** — investigate whether a recovery cron or a sudo cp from a `.bak` file rewrote it mid-incident. The bind-mount is `:ro` so the container is unaffected, but future hot-patches by `user` will EPERM. Cosmetic, fix in a follow-up.
### Commit
Repo commit (this doc + bin/prod-vs-dev-compare.py): `917d21b3be5f8de198ff9b965942fb20cbded902`
- Author: `s8n <admin@s8n.ru>` per memory `user_git_identity.md` — no Co-Authored-By trailer
- Pushed to `origin main` on `git.s8n.ru/s8n/ARRFLIX` at 2026-05-09 02:46Z
The dynamic.yml patch is deployed to `/opt/docker/traefik/config/dynamic.yml` on nullstone (hot-reloaded via Traefik file provider). Backup of the pre-fix file kept at `/opt/docker/traefik/config/dynamic.yml.bak.pre-sw-fix-1778291088` for one-step rollback if needed. Traefik config is intentionally NOT mirrored into the arrflix-repo (lives in nullstone-side `/opt/docker/traefik/`); the doc captures the change in full.
---
## Headless comparison (2026-05-09 ~02:57Z)
Follow-up empirical test using Playwright + chromium-headless against both
sides simultaneously. Script at `bin/prod-vs-dev-compare.py`.
### Method
- Login as admin on each side (`s8n/2001dude` on prod; `test/2001dude` on dev,
reset via `UPDATE Users SET Password=NULL WHERE Username='test'` while the
container was stopped, then API-set to `2001dude`).
- Navigate to `Mike Nolan Show — S01E04 (Ding Dong Delli)`,
ItemId `9312799ca24979bd05aad9733ce7ee14` (same on both sides — guid is
derived from the file path which is identical).
- Click the on-page Play button, sample state at t=5/10/20/30s. At each
sample: `<video>.{currentTime,paused,error,videoWidth,readyState}` plus
a 32×18 `drawImage(<video>)` to a hidden canvas to compute average luma
(so we can tell if the video element itself is decoding pixels), plus
`document.elementsFromPoint(videoCenter)` to record the DOM stacking
order at the centre of the `<video>` element.
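The luma sampling step can be sketched offline on a raw RGBA buffer such as the one a 32×18 canvas readback would yield. This is a minimal illustration, not the code in `bin/prod-vs-dev-compare.py`: the Rec. 709 weights, the dark threshold of 16, and the name `luma_stats` are all assumptions, since the doc does not state which formula the script uses.

```python
def luma_stats(rgba: bytes, dark_threshold: float = 16.0):
    """Return (average luma, percent of dark pixels) for raw RGBA bytes.

    Assumes Rec. 709 weights and a hypothetical dark threshold of 16.
    """
    if not rgba or len(rgba) % 4:
        raise ValueError("expected a non-empty RGBA buffer")
    pixels = len(rgba) // 4
    total = 0.0
    dark = 0
    for i in range(0, len(rgba), 4):
        r, g, b = rgba[i], rgba[i + 1], rgba[i + 2]
        y = 0.2126 * r + 0.7152 * g + 0.0722 * b
        total += y
        if y < dark_threshold:
            dark += 1
    return total / pixels, 100.0 * dark / pixels


# A fully black 32x18 frame: zero luma, every pixel counted as dark.
black_frame = bytes([0, 0, 0, 255]) * (32 * 18)
print(luma_stats(black_frame))
```

Running the same statistic over a full-page screenshot rather than the `<video>` element is what separates "decoder is fine" from "pixels reach the screen".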
### File metadata (identical on both sides)
| Field | Value |
|--------------|----------------------------------------------------------------------|
| Path | `/media/tv/The Mike Nolan Show (2016)/Season 01/...S01E04 - Ding Dong Delli.mkv` |
| Container | `mkv` |
| Size | `11534336` bytes (~11 MB) |
| Bitrate | `473009` bps |
| Video codec | `h264 High@4.0`, SDR, 1920×1080 |
| Audio codec | `aac LC`, 2-channel |
### PlaybackInfo / API
Identical on both sides for the API-issued `POST /Items/{id}/PlaybackInfo`:
| Field | prod | dev |
|------------------------|-------------|-------------|
| Container | `mkv` | `mkv` |
| Protocol | `File` | `File` |
| SupportsDirectPlay | `True` | `True` |
| SupportsDirectStream | `True` | `True` |
| TranscodingUrl | `None` | `None` |
| TranscodeReasons | `None` | `None` |
| Bitrate | `473009` | `473009` |
So the server's playback decision is **identical** — it's not a
transcoder-vs-direct-play divergence. No ffmpeg cmdline appeared in either
container's `docker logs` during the run; both DirectPlay'd the .mkv.
### Stream URL (decoded)
- **prod**: `https://arrflix.s8n.ru/Videos/9312799ca24979bd05aad9733ce7ee14/stream.mkv?Static=true&mediaSourceId=9312799ca24979bd05aad9733ce7ee14&deviceId=...&api_key=...&Tag=448d71aa9830b270dc375a83a4d6c6fc#t=70.44175`
- **dev**: `https://dev.arrflix.s8n.ru/Videos/9312799ca24979bd05aad9733ce7ee14/stream.mkv?Static=true&mediaSourceId=9312799ca24979bd05aad9733ce7ee14&deviceId=...&api_key=...&Tag=448d71aa9830b270dc375a83a4d6c6fc#t=29.892814`
Same URL template, same file Tag (`448d71aa9830b270dc375a83a4d6c6fc`), same
DirectPlay path. The `#t=` fragment difference is just resume-position state.
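
The "same template" claim can be checked mechanically by comparing the two URLs with host, fragment, and per-session parameters stripped. A minimal sketch using only the standard library; the function name and the choice of which parameters count as volatile are assumptions, and the `deviceId`/`api_key` values below are placeholders, since the real ones are elided above.

```python
from urllib.parse import parse_qsl, urlsplit

# Assumption: these two parameters are per-session and irrelevant to the comparison.
VOLATILE = {"deviceId", "api_key"}


def stream_signature(url: str):
    """Path plus sorted stable query pairs; host and #fragment are ignored."""
    parts = urlsplit(url)
    query = tuple(sorted(
        (key, value) for key, value in parse_qsl(parts.query)
        if key not in VOLATILE
    ))
    return parts.path, query


item = "9312799ca24979bd05aad9733ce7ee14"
tag = "448d71aa9830b270dc375a83a4d6c6fc"
prod_url = (f"https://arrflix.s8n.ru/Videos/{item}/stream.mkv?Static=true"
            f"&mediaSourceId={item}&deviceId=PLACEHOLDER_A&api_key=PLACEHOLDER_B"
            f"&Tag={tag}#t=70.44175")
dev_url = (f"https://dev.arrflix.s8n.ru/Videos/{item}/stream.mkv?Static=true"
           f"&mediaSourceId={item}&deviceId=PLACEHOLDER_C&api_key=PLACEHOLDER_D"
           f"&Tag={tag}#t=29.892814")
same_template = stream_signature(prod_url) == stream_signature(dev_url)
```
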
### Final video state at t=30s
| Field | prod | dev |
|---------------|-----------------------------|-----------------------------|
| currentTime | `99.68` | `60.19` |
| duration | `328.368` | `328.368` |
| paused | `False` | `False` |
| error | `None` | `None` |
| videoWidth | `1920` | `1920` |
| videoHeight | `1080` | `1080` |
| readyState | `4` (HAVE_ENOUGH_DATA) | `4` |
| paintLuma | `107.2` (real frame data) | `129.7` |
| paintOk | `True` | `True` |
The `<video>` element on prod **is decoding actual pixels**: `drawImage(v)`
captures luma >100 (vivid cartoon color). Yet a full-page screenshot at the
same instant is **all-black**. The pixels never reach the page composition.
### Smoking gun — DOM stacking at the video centre
```
=== prod ===
[top] div#videoOsdPage.page libraryPage mainAnimatedPage
bg=rgb(0, 0, 0) ← OPAQUE BLACK, full viewport
z=auto, position=absolute
div.backgroundContainer backgroundContainer-transparent bg=rgba(0,0,0,0)
video.htmlvideoplayer bg=rgba(0,0,0,0)
div.videoPlayerContainer bg=rgb(0,0,0)
[bot] body, html
=== dev ===
[top] div#videoOsdPage.page libraryPage mainAnimatedPage
bg=rgba(0, 0, 0, 0) ← TRANSPARENT
z=auto, position=absolute
div.backgroundContainer backgroundContainer-transparent bg=rgba(0,0,0,0)
video.htmlvideoplayer bg=rgba(0,0,0,0)
div.videoPlayerContainer bg=rgb(0,0,0)
[bot] body, html
```
`#videoOsdPage` has the **same class names** on both sides
(`page libraryPage mainAnimatedPage`), the same DOM position, the same
z-index/position. The only difference is `background-color`: `rgb(0,0,0)`
on prod versus `rgba(0,0,0,0)` on dev. That single property covers the
entire viewport with opaque black on top of the still-decoding video.
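
The check that exposes this can be automated: walk the stack from the top and report the first fully opaque element that sits above the `<video>`. The sketch below is an illustration only. It assumes the stack arrives as `(label, computed background-color)` pairs in top-to-bottom order, with colors serialized as `rgb(...)` or `rgba(...)`; the function names are hypothetical.

```python
import re


def alpha_of(color: str) -> float:
    """Alpha of an 'rgb(r, g, b)' or 'rgba(r, g, b, a)' string; rgb() means 1.0."""
    match = re.fullmatch(r"rgba?\(([^)]*)\)", color.strip())
    if not match:
        raise ValueError(f"unsupported color syntax: {color!r}")
    parts = [p.strip() for p in match.group(1).split(",")]
    return float(parts[3]) if len(parts) == 4 else 1.0


def first_cover_above(stack, target_prefix: str = "video"):
    """Label of the first fully opaque element above the target, else None."""
    for label, background in stack:
        if label.startswith(target_prefix):
            return None  # reached the video without finding an opaque layer
        if alpha_of(background) >= 1.0:
            return label
    return None


# Stacks transcribed from the dump above (top to bottom).
prod_stack = [
    ("div#videoOsdPage", "rgb(0, 0, 0)"),
    ("div.backgroundContainer", "rgba(0,0,0,0)"),
    ("video.htmlvideoplayer", "rgba(0,0,0,0)"),
    ("div.videoPlayerContainer", "rgb(0,0,0)"),
]
dev_stack = [("div#videoOsdPage", "rgba(0, 0, 0, 0)")] + prod_stack[1:]
```

On these inputs the prod stack yields `div#videoOsdPage` and the dev stack yields `None`, matching the manual reading of the dump.
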
### Root cause — Custom CSS in `branding.xml`
`/home/docker/jellyfin/config/config/branding.xml` (prod) is 401 lines.
`/home/docker/jellyfin-dev/config/config/branding.xml` is 116 lines. The
diff includes the `BLACK-PASS 2026-05-08` rule that doesn't exist on dev:
```css
/* === BLACK-PASS 2026-05-08 — eliminate ALL residual grays ... === */
:root { --theme-background-color: #000000 !important; ... }
...
/* Page-container surfaces — hit every wrapper the SPA might render */
.dashboardDocument, body.dashboardDocument,
.mainAnimatedPages, .pageContainer, .libraryPage,
.absolutePageTabContent, .itemDetailPage,
.padded-bottom-page, #mainDrawerPanel, #mainPanel,
.layout-desktop, .layout-mobile, .layout-tv {
background-color: #000000 !important; /* ← THIS LINE */
}
```
Later in the same file there's a guarded undo:
```css
.libraryPage:has(.itemDetailPage),
.absolutePageTabContent:has(.itemDetailPage) {
background-color: transparent !important;
background: transparent !important;
}
```
The undo only matches when the `.libraryPage` contains `.itemDetailPage`
as a descendant. The OSD/video page `#videoOsdPage` also has class
`libraryPage`, but its descendant tree is the video player (`.htmlVideoPlayer`,
`.videoOsdBottom`, etc.) — **not** `.itemDetailPage`. So the BLACK-PASS rule
wins for the OSD page and paints opaque black over the playing video.
### Fix
Extend the override to also exempt `.libraryPage` instances that contain
the video player. In `/home/docker/jellyfin/config/config/branding.xml`,
in the `.libraryPage:has(.itemDetailPage)` block, add:
```css
.libraryPage:has(.itemDetailPage),
.libraryPage:has(.htmlVideoPlayer), /* ← add this */
.libraryPage:has(.videoPlayerContainer), /* ← and this */
.libraryPage#videoOsdPage, /* ← belt + suspenders */
.absolutePageTabContent:has(.itemDetailPage) {
background-color: transparent !important;
background: transparent !important;
}
```
Or, more surgically, add a single rule:
```css
#videoOsdPage,
.page#videoOsdPage,
.libraryPage#videoOsdPage {
background-color: transparent !important;
background: transparent !important;
}
```
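Either form will let the underlying `<video>` element show through the OSD
page wrapper while playback is active. One caveat: the stacking dump above
records the video's class as lowercase `htmlvideoplayer`, and CSS class
selectors are case-sensitive, so the `:has(.htmlVideoPlayer)` line may never
match; the `#videoOsdPage` selectors do not depend on it. No server / Traefik /
Jellyfin-image change is needed; just edit `branding.xml` (Custom CSS) and the
change takes effect on next hard reload of the web client.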
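
Because `branding.xml` can drift back, the edit is best applied idempotently. The following is a minimal sketch of that idea in Python, not the contents of any existing script: the marker comment and function name are hypothetical, and it assumes the Custom CSS lives inside a single `<CustomCss>` element and that the patch text needs no XML escaping (it contains no `<` or `&`).

```python
# Hypothetical marker used to detect an earlier application of the patch.
MARKER = "/* INC7 videoOsdPage transparent */"

PATCH = MARKER + """
#videoOsdPage,
.page#videoOsdPage,
.libraryPage#videoOsdPage {
  background-color: transparent !important;
  background: transparent !important;
}
"""


def apply_patch(xml_text: str) -> str:
    """Insert PATCH just before </CustomCss>; do nothing if already present."""
    if MARKER in xml_text:
        return xml_text
    closing = "</CustomCss>"
    index = xml_text.rfind(closing)
    if index == -1:
        raise ValueError("no </CustomCss> element found")
    return xml_text[:index] + "\n" + PATCH + xml_text[index:]


sample = "<BrandingOptions><CustomCss>.libraryPage { background-color: #000000 !important; }</CustomCss></BrandingOptions>"
once = apply_patch(sample)
twice = apply_patch(once)
```

Appending at the end is enough here: an ID selector outranks the `.libraryPage` class selector among `!important` declarations, so source order does not decide the winner.
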
### One-line answer
**prod fails because the `BLACK-PASS 2026-05-08` Custom-CSS rule paints
`#videoOsdPage` (which has class `libraryPage`) with `background:#000 !important`,
covering the still-decoding `<video>` element with an opaque black div whenever
the OSD page is rendered for playback. Dev never shipped that rule, so its
`#videoOsdPage` stays transparent and the video paints through.**
### Artifacts
- `bin/prod-vs-dev-compare.py` — the comparison script (committable)
- `/tmp/arrflix-prod-vs-dev/diff.json` and `/tmp/arrflix-prod-vs-dev/diff.md`
- `/tmp/arrflix-prod-vs-dev/{prod,dev}/result.json` — full per-side JSON
(includes every `/Videos /Items /master.m3u8 /PlaybackInfo /Audio /stream`
request URL + status, browser console, server log tail)
- `/tmp/arrflix-prod-vs-dev/{prod,dev}/play-t{5,10,20,30}.png` — screenshots
- API key `arrflix-prodvsdev-2026-05-09` was created on each side at run
start and deleted at run end. The 404 on the dev cleanup is benign: token
rotation after the `Auth/Keys` operation had already invalidated the new key,
so it was no longer in the listing. Manual confirmation via
`curl https://{prod,dev}.../Auth/Keys` shows no leftover entry.
Note that the test harness ran in headless Chromium, where the `<video>`
element on prod was still **painting actual pixels** (paintLuma ~107). In a
real browser the same overlay div fully covers the video, so the user sees
a "black screen", exactly as in the screenshots.
---
## INC7 final — CSS overlay was the actual cause
After INC7-attempt-1 (Traefik SW-pin fix) shipped, headless Playwright
on prod still measured **`darkPct=100%`** of the visual viewport while
the `<video>` element decoded frames (canvas `drawImage` luma=84,
`videoWidth=1920`, `currentTime` advancing). This confirmed agent 2's
hypothesis: `<video>` paints, but a CSS overlay covers it.
### Root cause
`branding.xml` BLACK-PASS rule paints `.libraryPage` with
`background:#000 !important`. Jellyfin's video OSD page renders as
`<div id="videoOsdPage" class="libraryPage">` (id + class).
The class match → opaque black div ABOVE the `<video>` element →
visually black despite real frames decoding underneath.
Dev didn't ship the BLACK-PASS block at all → no overlay → video
visible.
### Fix (CSS, server-side branding.xml CustomCss)
```css
.libraryPage:has(.htmlVideoPlayer),
.libraryPage#videoOsdPage,
#videoOsdPage,
#videoOsdPage .pageContainer,
#videoOsdPage .layout-desktop,
#videoOsdPage .mainAnimatedPages {
background-color: transparent !important;
background: transparent !important;
}
```
### Verified
Post-fix headless playwright: `darkPct=9.8%`. Screenshot `/tmp/inc7-after.png`
shows actual MNS S1E4 video frame (sasquatch in cage). Real visual paint.
### Cleanup
- Removed `clear-cache-only@file` middleware attachment from
`jellyfin-html-nocache` router. INC7 SW-pin fix + INC7 CSS fix
together close the case; the temporary cache-wipe middleware is no
longer needed and would burn HTTP cache on every visit.
- Backup: `/opt/docker/traefik/config/dynamic.yml.bak.inc6-removal.*`
### Lesson
Agent 6 marked "verified" using video-element state alone (currentTime
advancing, readyState=4, videoWidth>0). Element decoded fine — but
CSS overlay above it made it visually black. Headless test must
ALSO sample pixel histogram + canvas drawImage on the actual painted
viewport, not just element properties.
`bin/headless-test-v2.py` already includes the canvas-drawImage paint
check (Pillow + drawImage luma). Add a `darkPct` assertion to surface
this class of regression next time.
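
A `darkPct` assertion could look roughly like the sketch below. It is an assumption-laden illustration rather than the code in `bin/headless-test-v2.py`: the function names, the "dark" cutoff of 16, and the 50% failure limit are all hypothetical, and the input is taken to be 8-bit grayscale bytes of the viewport screenshot (for example from Pillow's `Image.open(path).convert("L").tobytes()`).

```python
def dark_pct(gray: bytes, dark_below: int = 16) -> float:
    """Percentage of pixels darker than dark_below in an 8-bit grayscale buffer."""
    if not gray:
        raise ValueError("empty pixel buffer")
    dark = sum(1 for value in gray if value < dark_below)
    return 100.0 * dark / len(gray)


def assert_painted(gray: bytes, max_dark_pct: float = 50.0) -> float:
    """Fail when the viewport is mostly dark, even if the element state looks healthy."""
    pct = dark_pct(gray)
    if pct > max_dark_pct:
        raise AssertionError(
            f"viewport darkPct={pct:.1f}% exceeds {max_dark_pct}%: possible overlay"
        )
    return pct
```

With these limits the pre-fix measurement (100%) would fail and the post-fix measurement (9.8%) would pass; a legitimately dark scene could trip the check, so sampling several timestamps before failing may be worthwhile.
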
### Status
INC7 FINAL — case closed. Owner action: hard-reload browser,
confirm visual paint.