From 1a6a697afd6699e138a91c0ccc9bd30d95eb6ea1 Mon Sep 17 00:00:00 2001 From: s8n Date: Fri, 8 May 2026 02:07:11 +0100 Subject: [PATCH] Add pre-import cleanup + filename normalization rulesets - 07-pre-import-cleanup: 1002-line ruleset for stripping non-media junk before files land in /home/user/media/. Catalogs 10 categories (codec promo, group brag, promo images, OS thumb caches, samples, sub leftovers, torrent residue, proof folders, multi-disc cruft, Win executables). NFO discriminator uses 4096-byte head + XML-root regex (covers prologue case the brief 100-byte version misses). 15 auto-delete security categories (.exe/.msi/.bat/.scr/...); threat model = friend clicking 'Download original' then running on Win. Verified extras folders against Jellyfin docs (lowercase 'featurettes', 'behind the scenes', etc.). Includes idempotent dry-run-default cleanup-import.sh that quarantines first, returns staging path on stdout. - 08-filename-normalization: 1853-line normative renaming ruleset. Canonical: 'Show (Year) - SXXEXX - Title.ext' for TV; ' (<Year>).ext' for movies; 'Show - NNNN - Title [Sub|Dub].ext' for absolute-numbered anime. Strips group tags ([YIFY]/[RARBG]/[FS99 Joy]/[GalaxyRG]), resolution (1080p/2160p/4K), codec (x264/x265/HEVC/10bit), source (WEB-DL/BluRay/HDTV), audio (DTS-HD.MA/Atmos/5.1/AAC), release-process (PROPER/REPACK/INTERNAL), trailing -NOGRP/-RARBG/-EVO, URL refs, basename language tokens. Includes stdlib-only normalize.py: dry-run default, --apply commits, --force overwrites, audit log to /var/log/jellyfin-imports/<date>.log, idempotent. Worked Futurama before/after; flags drift on live tree (current 'Futurama/' lacks '(1999)'). 
--- docs/07-pre-import-cleanup.md | 1002 ++++++++++++++++ docs/08-filename-normalization.md | 1853 +++++++++++++++++++++++++++++ 2 files changed, 2855 insertions(+) create mode 100644 docs/07-pre-import-cleanup.md create mode 100644 docs/08-filename-normalization.md diff --git a/docs/07-pre-import-cleanup.md b/docs/07-pre-import-cleanup.md new file mode 100644 index 0000000..321b836 --- /dev/null +++ b/docs/07-pre-import-cleanup.md @@ -0,0 +1,1002 @@ +# 07 — Pre-Import Cleanup Ruleset (tv.s8n.ru) + +Last updated: 2026-05-08 +Server: Jellyfin 10.10.3 on nullstone, container `jellyfin` +Library root inside container: `/media` +Library root on host: `/home/user/media` + +This document defines the **normative pre-import cleanup ruleset** for the +personal Jellyfin deploy. The owner downloads scene/group releases (e.g. +`Futurama Season 1 [1080p AI x265 10bit FS99 Joy]/`) which contain a mixture +of media files and non-media junk (codec readmes, release-group brags, Windows +installer shortcuts, comparison images, OS thumbnail caches, etc.). This junk +must NOT land in `/home/user/media/` because: + +1. It clutters the library and confuses scrapers. +2. Promo PNGs may be mis-identified as artwork. +3. Release-group `.nfo` files break the NFO-override flow (doc 02 § 11). +4. **Windows executables and installer shortcuts (`.exe`, `.msi`, `.website`, + `.url`, `.lnk`, `.scr`, `.bat`, `.ps1`) are a real security vector.** Even + though the Linux server cannot execute them, friends with a Jellyfin + account can download them through the web UI and run them on their PC. + +Cross-linked to: + +- [`01-artwork-and-images.md`](01-artwork-and-images.md) — what counts as a + recognised poster / backdrop on disk. +- [`02-metadata-and-titles.md`](02-metadata-and-titles.md) — NFO sidecar + override flow; what a "real" Jellyfin NFO looks like. +- [`03-subtitles.md`](03-subtitles.md) — which subtitle files to keep. 
+- [`05-file-structure-rules.md`](05-file-structure-rules.md) — canonical + folder layout. § 8 of doc 05 defines the recognised extras subfolders; + this doc enforces them at import time. +- [`08-filename-normalization.md`](08-filename-normalization.md) — the + **next** stage of the pipeline (sibling agent), called after this doc's + `cleanup-import.sh` has produced a clean staging tree. + +Sources of truth: + +- <https://jellyfin.org/docs/general/server/media/movies/> — extras subfolders + and artwork filename patterns. +- <https://jellyfin.org/docs/general/server/media/shows/> — same for series. +- <https://jellyfin.org/docs/general/server/metadata/nfo/> — NFO XML schema; + used here to distinguish a real metadata NFO from a release-group brag. + +--- + +## 0. Top-level cleanup rules + +These are non-negotiable. They wrap the doc 05 top-level rules with one +guarantee: **nothing leaves staging until cleanup has run and been +confirmed.** + +1. **Never clean in-place on the source download.** The download directory + (`/home/admin/Downloads/...`) is treated as a read-only artefact until + the user explicitly approves deletion. The cleanup script copies into a + staging area and operates there. +2. **Quarantine first, delete later.** First run of the cleanup script on a + release moves junk to `~/.jellyfin-quarantine/<YYYY-MM-DD>/<release-name>/` + instead of deleting. The user reviews, then a second pass empties the + quarantine after sign-off. Subsequent runs on the same release are + idempotent. +3. **Two-list policy.** Every file is matched against an `ALLOW` list (KEEP) + or a `DENY` list (DELETE). Anything not on either list is **flagged** and + surfaced in the audit report — a human decides. Never auto-delete on + "unknown". +4. **Never run cleanup as root.** All operations are as the unprivileged + `admin` (onyx) or `user` (nullstone) account. 
The live `/home/user/media/` + tree is touched only by the rename step in doc 08, after cleanup has + produced an intermediate staging copy. +5. **Idempotent.** Running cleanup twice on the same source must produce the + same staging tree byte-for-byte (same `find -printf '%p %s\n' | sort` + output, modulo timestamps). +6. **Dry-run is the default.** The cleanup script with no flags lists what it + *would* do and exits without writing. `--apply` is required to actually + move/quarantine files. + +--- + +## 1. Categorical taxonomy of non-media files in scene/group releases + +Scene and group ("p2p") releases follow loose conventions. The following +categories cover everything observed in the wild plus everything in the +Futurama download set: + +### 1.1 Codec / player promotion + +Text files and Windows shortcut files steering the user toward a specific +codec pack or media player (often K-Lite + MPC-HC). Frequently the file is +an `.url` or `.website` (Internet Shortcut) pointing to a third-party +installer. **Always DELETE.** + +Real-world examples (`/home/admin/Downloads/futrama/`): + +- `How to play HEVC (THIS FILE).txt` — 65 lines of MPC-HC marketing. +- `Ninite K-Lite Codecs Unattended Silent Installer and Updater.website` + — `URL=https://ninite.com/klitecodecs/` Internet Shortcut. + +Patterns: + +- `How to play *.txt`, `Read*Me*.txt`, `INSTALL*.txt`, `PLAY*.txt` +- `*.website`, `*.url`, `*.lnk` +- `K-Lite*`, `MPC-HC*`, `VLC*`, `MX Player*`, `LAV*` + +### 1.2 Release-group brag + +Plain-text or `.nfo` files where the release group identifies itself, +documents encoder settings, or pumps its tracker URL. Distinguishable from a +**Jellyfin-compatible metadata NFO** (XML, root `<movie>` / `<tvshow>` / +`<episodedetails>`) by content — see § 3. + +Real-world examples: + +- `Encoded by JoyBell (UTR).txt` — 41-line manifesto from "Unity Team + Release group" pointing to `UNITEAM.CO`. +- `RARBG.txt`, `WWW.YIFY-TORRENTS.COM.url`, `<group>.nfo` with ASCII art. 
+ +Patterns: + +- `Encoded by *.txt`, `Ripped by *.txt`, `<GROUP>.txt` +- `RARBG.txt`, `RARBG_DO_NOT_MIRROR.exe` (yes, those exist; § 1.10) +- `*-readme.txt`, `release notes.txt` +- `*.nfo` containing only ASCII art (no `<movie>` / `<tvshow>` / + `<episodedetails>` root element) +- `*.diz`, `file_id.diz` — old "BBS description" file, scene leftover + +### 1.3 Promo images that are NOT poster artwork + +Images that LOOK like artwork to a naive globber but are actually before/after +comparisons, group banners, or screenshot proofs. **Delete unless they live +inside a recognised extras folder (§ 4) or match the strict allow-list of +poster/backdrop names from doc 01.** + +Real-world example: + +- `Futurama Compare.png` (1.05 MB) — encoder before/after comparison. + +Patterns to delete: + +- `*Compare*.{png,jpg,jpeg,webp}` +- `*Sample*.{png,jpg,jpeg}` (when not in a `samples/` extras folder) +- `*Screen*.{png,jpg}`, `*Screens/*`, `*Proof/*`, `*Preview/*` +- `*-banner.png` from a group (NOT the same as Jellyfin's `banner.jpg`; + group banners typically have the group name in the filename — heuristic + match `*JoyBell*`, `*UTR*`, `*JoY*`, etc.) +- Stray `*.gif` files (animated previews); Jellyfin doesn't use GIF. + +### 1.4 OS-generated thumbnail caches + +Per-OS file managers (Windows Explorer, macOS Finder, GNOME Files) leave +turds in every directory they browse. **Always DELETE — never useful, never +metadata.** + +Patterns: + +- `Thumbs.db`, `ehthumbs.db`, `ehthumbs_vista.db` +- `.DS_Store`, `._*` (macOS resource forks) +- `Desktop.ini`, `desktop.ini` +- `.directory` (KDE) +- `.fseventsd/`, `.Spotlight-V100/`, `.Trashes/` (macOS) +- `$RECYCLE.BIN/`, `System Volume Information/` (Windows mount) + +### 1.5 Sample files (lower-quality previews) + +Scene releases sometimes ship a 30-second sample file at lower bitrate. 
+Jellyfin treats a `samples/` subfolder as extras (doc 05 § 8.2), but a stray +`Movie.sample.mkv` next to the main file would scrape as "another version". + +**Default: DELETE.** Reasoning: we have the full file; the sample is dead +weight. If the user genuinely wants samples, drop them into a `samples/` +subfolder before running cleanup and the script will preserve the folder. + +Patterns to delete (when at the top level of a release): + +- `sample.{mkv,mp4,avi,m4v}` +- `*-sample.{mkv,mp4,avi,m4v}`, `*.sample.{mkv,mp4,avi,m4v}` +- `*_sample.{mkv,mp4,avi,m4v}` +- `Sample/` directory (rename to `samples/` to preserve as extras, OR delete) + +### 1.6 Subtitle leftovers + +VobSub (DVD/Blu-ray bitmap subs) are shipped as a pair: `en.idx` (index) + +`en.sub` (bitmap stream). Jellyfin can render them, but if a `.srt` exists +with the same language tag the bitmap pair is redundant and slow. + +**Default: KEEP all `.srt` and `.ass`. KEEP `.idx`/`.sub` only if no `.srt` +of the same language exists.** This is a per-file decision — surface to the +user in the audit report rather than auto-pruning. + +Patterns: + +- `*.srt`, `*.ass`, `*.ssa`, `*.vtt` — KEEP (per doc 03). +- `*.sup` (PGS bitmap, Blu-ray) — KEEP (Jellyfin renders). +- `*.idx` + `*.sub` (VobSub) — KEEP if no `.srt` with same lang code; else + flag for human review. +- `*.smi`, `*.rt` — DELETE (obsolete formats Jellyfin doesn't support). + +### 1.7 Torrent residue + +Files left by the torrent client itself. None are useful to Jellyfin. + +Patterns to delete: + +- `*.torrent`, `*.magnet` +- `*.parts`, `*.!ut`, `*.!qB`, `*.bc!` (in-progress fragments) +- `*.meta`, `*.aria2` +- `*.pad`, `padding/`, `__padding_file_*` (mktorrent padding) +- `*.sfv` (checksum manifest; harmless but useless after download) +- `*.md5`, `*.sha1`, `*.sha256` (release-checksum sidecars) + +### 1.8 Test / proof images and folders + +Some groups ship a `Proof/` or `Screens/` folder with screenshots to "prove" +the rip's quality. 
Useless inside a Jellyfin library. + +Patterns to delete (whole folders): + +- `Proof/`, `proof/`, `PROOF/` +- `Screens/`, `screens/`, `Screenshots/`, `Caps/` +- `Preview/`, `Previews/` +- `_screens/`, `screenshots-only/` + +### 1.9 Multi-disc DVD/Blu-ray cruft + +When a release is a straight ISO rip the `VIDEO_TS/` or `BDMV/` directory +sometimes survives next to the encoded file. Jellyfin can play +`VIDEO_TS.IFO` directly, but a partial DVD structure left over from the +encode is just clutter. + +Patterns: + +- `VIDEO_TS/` — KEEP if it contains a complete `VIDEO_TS.VOB` set; + otherwise flag. +- `*.IFO`, `*.BUP`, `*.VOB` — KEEP if inside a complete `VIDEO_TS/`; + DELETE if loose. +- `BDMV/`, `CERTIFICATE/`, `AACS/` — KEEP if complete BD structure; + flag if partial. +- `*.iso` inside a media folder — flag for human review (could be the + intentional rip OR a Windows malware vector — see § 8). + +### 1.10 Outright malicious / suspicious + +Some releases historically shipped Windows executables disguised as +"DO NOT MIRROR" anti-leech files. Even on a Linux server these must be +deleted because the friend with a Jellyfin account can download them via +the web UI ("Download original file" button) and run them locally. + +**Always DELETE, never quarantine, never preserve, no exceptions.** + +Patterns: + +- `*.exe`, `*.msi`, `*.bat`, `*.cmd`, `*.com`, `*.scr`, `*.ps1`, `*.vbs`, + `*.wsf`, `*.hta`, `*.jar` +- `*.app/` (macOS bundle dropped by macOS-using uploader) +- `*.dll`, `*.sys` (rare, but seen) +- Anything with a double extension like `Movie.mkv.exe` + +--- + +## 2. KEEP vs DELETE — exhaustive table + +This table is the **canonical decision matrix** for `cleanup-import.sh`. +Patterns are matched case-insensitively: the script lowercases each basename +before matching, even though `ext4` itself is case-sensitive. `KEEP` means it +goes to the staging tree; `DELETE` means it goes to quarantine on first run, +then recycle-bin on confirm. 
+ +| Pattern | Action | Why | +|---|---|---| +| `*.mkv`, `*.mp4`, `*.avi`, `*.m4v`, `*.ts`, `*.mov`, `*.webm`, `*.wmv`, `*.flv`, `*.mpg`, `*.mpeg` | **KEEP** | Media — the entire point. | +| `*.srt`, `*.ass`, `*.ssa`, `*.vtt`, `*.sup` | **KEEP** | Subtitles (doc 03). | +| `*.idx` + `*.sub` (VobSub pair) | **KEEP** if no `.srt` of same lang exists; else **FLAG** | Bitmap subs; redundant with SRT. | +| `*.smi`, `*.rt` | **DELETE** | Obsolete subtitle formats; Jellyfin can't render. | +| `folder.{jpg,png}`, `poster.{jpg,png}`, `cover.{jpg,png}`, `default.{jpg,png}`, `show.{jpg,png}`, `jacket.{jpg,png}`, `movie.{jpg,png}` | **KEEP** | Jellyfin-recognised primary artwork (doc 01). | +| `backdrop.{jpg,png}`, `fanart.{jpg,png}`, `background.{jpg,png}`, `art.{jpg,png}`, `backdrop[0-9]*.{jpg,png}`, `backdrop-[0-9]*.{jpg,png}` | **KEEP** | Jellyfin-recognised backdrops (doc 01). | +| `logo.{png,jpg}`, `clearlogo.{png,jpg}`, `banner.{jpg,png}`, `landscape.{jpg,png}`, `thumb.{jpg,png}`, `disc.{png,jpg}`, `clearart.{png,jpg}` | **KEEP** | Jellyfin-recognised auxiliary artwork. | +| `season[0-9]*-poster.{jpg,png}`, `season[0-9]*.{jpg,png}`, `season-specials-poster.{jpg,png}` | **KEEP** | Per-season artwork (doc 01 / TV layout). | +| `extrafanart/*.{jpg,png}`, `backdrops/*.{jpg,png,mp4}` | **KEEP** | Multi-backdrop folders (doc 05 § 8). | +| `*.nfo` with XML root `<movie>` / `<tvshow>` / `<episodedetails>` / `<artist>` / `<album>` / `<musicvideo>` | **KEEP** | Jellyfin-compatible metadata sidecar (doc 02 § 11). | +| `*.nfo` without one of the above XML roots | **DELETE** | Release-group ASCII-art brag — pretends to be metadata, isn't. | +| `*Compare*.{png,jpg,jpeg,webp,gif}` | **DELETE** | Encoder before/after — group promo. | +| `*Sample*.{png,jpg,jpeg}` (image, top level) | **DELETE** | Group promo (NOT a Jellyfin sample folder). | +| `*Screen*.{png,jpg}`, `Screens/`, `Screenshots/`, `Caps/` | **DELETE** | Proof shots. 
| +| `Proof/`, `proof/`, `PROOF/` | **DELETE** (whole folder) | Quality-proof shots. | +| `Preview/`, `Previews/` | **DELETE** (whole folder) | Lower-quality teaser. | +| `*.txt` (any) | **DELETE** | Readme / group brag — Jellyfin doesn't read TXT. | +| `*.diz`, `file_id.diz` | **DELETE** | Scene description file — obsolete. | +| `*.website`, `*.url`, `*.lnk` | **DELETE** | Windows Internet Shortcut — points at codec/installer pages. **Security: § 8.** | +| `*.exe`, `*.msi`, `*.bat`, `*.cmd`, `*.com`, `*.scr`, `*.ps1`, `*.vbs`, `*.wsf`, `*.hta`, `*.jar`, `*.dll`, `*.sys` | **DELETE** | Windows executable. **Security: § 8.** | +| `*.app/` | **DELETE** (whole folder) | macOS bundle. | +| `Thumbs.db`, `ehthumbs.db`, `ehthumbs_vista.db` | **DELETE** | Windows Explorer thumbnail cache. | +| `.DS_Store`, `._*` | **DELETE** | macOS Finder. | +| `Desktop.ini`, `desktop.ini` | **DELETE** | Windows folder customisation. | +| `.directory` | **DELETE** | KDE Dolphin. | +| `.fseventsd/`, `.Spotlight-V100/`, `.Trashes/`, `$RECYCLE.BIN/`, `System Volume Information/` | **DELETE** (whole folder) | OS metadata directories. | +| `sample.{mkv,mp4,avi,m4v}` (top level) | **DELETE** | Lower-quality preview (doc 05 § 8.1: full file already present). | +| `*-sample.{mkv,mp4,avi,m4v}`, `*_sample.{mkv,mp4,avi,m4v}`, `*.sample.{mkv,mp4,avi,m4v}` | **DELETE** | Same. | +| `Sample/` (directory, top level) | **DELETE** | Lower-quality preview folder. | +| `samples/` (directory, recognised name) | **KEEP** | Jellyfin extras folder (doc 05 § 8.2). | +| `featurettes/`, `behind the scenes/`, `deleted scenes/`, `interviews/`, `scenes/`, `shorts/`, `clips/`, `trailers/`, `extras/`, `other/`, `theme-music/`, `backdrops/` | **KEEP** (whole folder) | Jellyfin extras (doc 05 § 8.2). | +| `Featurettes/`, `Behind The Scenes/`, etc. (capitalised) | **KEEP** but **rename to lowercase** | Jellyfin matches case-insensitive but lowercase is the documented form. 
| +| Any other folder name | **FLAG** | Surface to human; might be a typo of an extras folder. | +| `*.torrent`, `*.magnet` | **DELETE** | Torrent client residue. | +| `*.parts`, `*.!ut`, `*.!qB`, `*.bc!`, `*.aria2` | **DELETE** | In-progress download fragments (shouldn't be here, but defensive). | +| `*.meta` | **DELETE** | aria2/torrent metadata. | +| `*.pad`, `padding/`, `__padding_file_*`, `_____padding_file_*` | **DELETE** | mktorrent padding files. | +| `*.sfv`, `*.md5`, `*.sha1`, `*.sha256` | **DELETE** | Checksum manifests; harmless but useless after download. | +| `*.rar`, `*.r[0-9][0-9]`, `*.zip`, `*.7z`, `*.tar`, `*.tar.gz` | **FLAG** | Compressed archive in a media folder is suspicious — release should have been extracted before download. | +| `*.iso` inside a media folder | **FLAG** | Could be intentional DVD/BD rip OR Windows-installer disguise. Human review. | +| `VIDEO_TS/` (complete) | **KEEP** | Jellyfin plays DVD structure directly. | +| `*.IFO`, `*.BUP`, `*.VOB` (loose, no `VIDEO_TS/`) | **DELETE** | Orphan DVD remnants. | +| `BDMV/` (complete) | **KEEP** | Jellyfin plays BD structure. | +| `CERTIFICATE/`, `AACS/` (without `BDMV/`) | **DELETE** | Orphan BD remnants. | +| `RARBG*.{txt,exe}`, `WWW.*.url`, `*.YIFY*.url` | **DELETE** | Tracker promo. | +| `RARBG_DO_NOT_MIRROR.exe` and similar | **DELETE** (security: § 8) | Historic anti-leech file; sometimes weaponised. | +| Anything else | **FLAG** | Two-list policy: never auto-delete on "unknown". | + +--- + +## 3. NFO handling — the nuanced case + +`.nfo` is overloaded. Two completely different file kinds share the +extension: + +- **Scene release `.nfo`** — plain text, ASCII art, encoder credits, tracker + URL. Useless to Jellyfin (and at worst gets scraped as garbage metadata + if NFO Saver is enabled). +- **Jellyfin/Kodi/Emby metadata NFO** — XML, root element is one of + `<movie>`, `<tvshow>`, `<episodedetails>`, `<artist>`, `<album>`, + `<musicvideo>`. Documented in doc 02 § 11. 
+ +### 3.1 The discriminator one-liner + +```bash +is_jellyfin_nfo() { + # Returns 0 (KEEP) if the file looks like a Jellyfin/Kodi NFO, + # 1 (DELETE) if it looks like scene-group ASCII-art brag. + # Whitespace is squeezed to single spaces, not deleted, so that \b still + # matches when the root element carries attributes. + head -c 4096 "$1" | tr -s '[:space:]' ' ' \ + | grep -qE '<(movie|tvshow|episodedetails|artist|album|musicvideo|season)\b' +} + +# Usage: +if is_jellyfin_nfo "$f"; then echo "KEEP $f"; else echo "DELETE $f"; fi +``` + +The first 4096 bytes are enough — a real Jellyfin NFO declares its root +within the first kilobyte. `tr -s '[:space:]' ' '` collapses line breaks and +indentation to single spaces rather than deleting them: deleting whitespace +outright would fuse a root element that carries attributes with its first +attribute name, and `\b` would then no longer match. + +### 3.2 Edge cases + +- An NFO with both ASCII art **and** an XML root: KEEP. Jellyfin's parser + ignores leading non-XML noise as long as the XML element parses. +- An NFO with a different XML root (e.g. `<root>`, `<info>`): DELETE. + Jellyfin won't read it; nothing to preserve. +- An NFO with valid XML but **stale TMDB/IMDB IDs** that conflict with a + newer scrape: KEEP, but flag for the user — doc 02 § 11.5 explains how + the NFO Saver overwrites these on next scrape. +- Multiple NFOs in one folder (e.g. `release.nfo` from the group AND + `tvshow.nfo` from a previous Jellyfin write): KEEP `tvshow.nfo`, + DELETE `release.nfo`. Use the discriminator above on each. + +### 3.3 First-100-bytes shortcut + +The task brief proposes this: + +```bash +if head -c 100 file.nfo | grep -qE '<(movie|tvshow|episodedetails)\b'; then echo KEEP; else echo DELETE; fi +``` + +This works for the common case but misses NFOs that start with an XML +declaration (`<?xml version="1.0"?>` plus possibly a comment) before the +root element — that prologue alone can be > 100 bytes. The 4096-byte +version above is safer; we use that in `cleanup-import.sh`. + +--- + +## 4.
Featurettes / Extras / Bonus folders — the canonical list + +Per the Jellyfin docs (movies and shows pages), these subfolder names are +recognised and the contained files are tagged with the matching extra +type. **Folder name match is case-insensitive but lowercase is the +documented canonical form** — `cleanup-import.sh` lowercases on copy to +staging. + +| Folder name | Extra type | Notes | +|---|---|---| +| `behind the scenes` | Behind The Scenes | spaces, not dashes | +| `deleted scenes` | Deleted Scene | | +| `interviews` | Interview | | +| `scenes` | Scene | | +| `samples` | Sample | distinct from a top-level `Sample/` (§ 1.5) | +| `shorts` | Short | | +| `featurettes` | Featurette | | +| `clips` | Clip | | +| `other` | Other | catch-all | +| `extras` | Extra | generic catch-all | +| `trailers` | Trailer | | +| `theme-music` | Theme music | `.mp3` files; doc 05 § 8.3 | +| `backdrops` | Backdrop video | rotating video backgrounds | + +Anything else (e.g. `Bonus Features/`, `BTS/`, `Special Features/`, +`Featurette/` singular, `behind-the-scenes/` with dashes) is **NOT** matched +by Jellyfin and the contents won't surface as extras. Cleanup either +renames to the canonical name (when the mapping is unambiguous) or flags +for human review. 
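
The rename-or-flag rule in the paragraph above can be sketched as a small lookup. This is a minimal illustration, not the logic shipped in `cleanup-import.sh`: the function name `canonical_extras_name` is hypothetical, the name lists are copied from the § 4 and § 4.1 tables, and `Specials/` is deliberately left unmapped here (an assumption of this sketch) because § 4.1 calls it ambiguous for TV series.

```bash
# Hypothetical helper: print the canonical extras folder name for NAME,
# or return 1 when no unambiguous mapping exists (caller should FLAG).
canonical_extras_name() {
  local lower
  # Drop a trailing slash, then compare case-insensitively.
  lower="$(printf '%s' "${1%/}" | tr '[:upper:]' '[:lower:]')"
  case "$lower" in
    # Already one of the names Jellyfin recognises (§ 4 table).
    "behind the scenes"|"deleted scenes"|interviews|scenes|samples|shorts|\
    featurettes|clips|other|extras|trailers|theme-music|backdrops)
      printf '%s\n' "$lower" ;;
    # Unambiguous variants (§ 4.1 table).
    featurette)                        echo "featurettes" ;;
    bts|behind-the-scenes)             echo "behind the scenes" ;;
    deleted_scenes|deleted-scenes)     echo "deleted scenes" ;;
    interview)                         echo "interviews" ;;
    trailer)                           echo "trailers" ;;
    bonus|"bonus features"|"bonus material"|"special features") echo "extras" ;;
    outtakes|bloopers|"gag reel")      echo "extras" ;;
    # Everything else, including "specials", is left for human review.
    *) return 1 ;;
  esac
}

for d in "Featurettes/" "BTS" "Specials"; do
  if target="$(canonical_extras_name "$d")"; then
    printf 'RENAME %s -> %s/\n' "$d" "$target"
  else
    printf 'FLAG   %s\n' "$d"
  fi
done
```

Returning a non-zero status instead of guessing keeps the sketch in line with the two-list policy: an unrecognised folder name is surfaced, never silently renamed.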
+ +### 4.1 Canonical-name mapping (auto-rename) + +| Found | Renamed to | +|---|---| +| `Featurettes/`, `Featurette/`, `FEATURETTES/` | `featurettes/` | +| `Behind The Scenes/`, `BTS/`, `behind-the-scenes/` | `behind the scenes/` | +| `Deleted Scenes/`, `Deleted_Scenes/`, `deleted-scenes/` | `deleted scenes/` | +| `Interviews/`, `Interview/` | `interviews/` | +| `Trailers/`, `Trailer/` | `trailers/` | +| `Bonus/`, `Bonus Features/`, `Bonus Material/`, `Special Features/`, `Specials/` | `extras/` (generic catch-all) | +| `Outtakes/`, `Bloopers/`, `Gag Reel/` | `extras/` (no dedicated folder) | + +The `Specials/` rename to `extras/` is **important** — for a TV series, +`Specials/` looks like a season folder (Season 0 specials), but if the +files inside are featurettes rather than aired specials, putting them in +the wrong folder mis-scrapes them as episodes. When in doubt, flag. + +### 4.2 Real-world example: Futurama download + +The four Futurama season folders all contain a `Featurettes/` subfolder: + +``` +Futurama Season 1 [1080p AI x265 10bit FS99 Joy]/Featurettes/ +├── Episode One Animatic.mkv +└── Welcome to the World of Tomorrow.mkv + +Futurama Season 2 .../Featurettes/ +├── Animatic -Why Must I be a Crustacean in Love.mkv +└── Futurama Game Trailer.mkv + +Futurama Season 3 .../Featurettes/ +├── An X-Mas Message From David X. Cohen.mkv +└── Deleted Scenes.mkv + +Futurama Season 4 .../Featurettes/ +├── Futurama Welcome to the World of Tomorrow (x265 Joy).mkv +├── Outtakes - Kif Gets Knocked Up a Notch [1080p x265 10bit Joy].mkv +└── Panel on Voice Actors [1080p x265 10bit Joy].mkv +``` + +After cleanup these become `featurettes/` (lowercase) inside the season +folder. Doc 08 (filename normalization) then renames the season folder +itself to `Season 01/` and may relocate the season-level featurettes to a +**series-level** `featurettes/` folder if the user prefers extras at the +series root (this is a doc 05 § 8 / doc 08 decision, not this doc's). 
+ +> Note: `Season 3 / Deleted Scenes.mkv` is a single file and should arguably +> be moved into a `deleted scenes/` subfolder rather than left in +> `featurettes/`. That's a manual disambiguation — flagged, not auto-moved. + +--- + +## 5. Audit-then-clean workflow + +Three-stage pipeline. Stage 1 is mandatory; stage 2 runs on user approval; +stage 3 is reversible until the quarantine retention window expires. + +### 5.1 Stage 1 — Dry-run audit + +Lists every file in the source release classified as KEEP / DELETE / FLAG. +Writes nothing. + +```bash +# Dry-run audit on a single release dir. +cleanup-import.sh "/home/admin/Downloads/futrama/Futurama Season 1 [1080p AI x265 10bit FS99 Joy]" +``` + +Output (one line per file): + +``` +KEEP Futurama S01E01 Space Pilot 3000 [1080p x265 10bit Joy].mkv +KEEP folder.jpg +KEEP Featurettes/Episode One Animatic.mkv -> featurettes/Episode One Animatic.mkv +DELETE Encoded by JoyBell (UTR).txt [release-group brag] +DELETE How to play HEVC (THIS FILE).txt [codec promo .txt] +DELETE Ninite K-Lite Codecs Unattended Silent ....website [windows .website -- SECURITY] +DELETE Futurama Compare.png [encoder compare image] +FLAG SomeUnknownFile.bin [unknown extension] +``` + +A **summary** at the bottom: + +``` +KEEP 16 files (5.92 GiB) +DELETE 4 files (1.08 MiB) +FLAG 0 files +Run with --apply to quarantine the DELETE set. +``` + +Quick one-liner equivalents (for ad-hoc spot checks; the script § 9 is +preferred): + +```bash +# What would I delete? 
+find "$SRC" \( \ + -iname '*.txt' -o -iname '*.nfo' -o -iname '*.url' -o -iname '*.website' \ + -o -iname '*.lnk' -o -iname '*.exe' -o -iname '*.msi' -o -iname '*.bat' \ + -o -iname '*.scr' -o -iname '*.ps1' -o -iname '*.cmd' -o -iname '*.com' \ + -o -iname 'Thumbs.db' -o -iname '.DS_Store' -o -iname 'Desktop.ini' \ + -o -iname '*Compare*.png' -o -iname '*Compare*.jpg' \ + -o -iname 'sample.mkv' -o -iname '*.sample.mkv' -o -iname '*-sample.mkv' \ + -o -iname '*.torrent' -o -iname '*.sfv' -o -iname '*.md5' \ +\) -print + +# What looks like a real Jellyfin NFO vs a release-group brag? +# (Whitespace is squeezed, not deleted, so \b still matches a root with attributes.) +find "$SRC" -iname '*.nfo' -print0 | while IFS= read -r -d '' f; do + if head -c 4096 "$f" | tr -s '[:space:]' ' ' \ + | grep -qE '<(movie|tvshow|episodedetails|artist|album|musicvideo|season)\b'; then + printf 'KEEP %s\n' "$f" + else + printf 'DELETE %s\n' "$f" + fi +done +``` + +### 5.2 Stage 2 — Quarantine apply + +```bash +cleanup-import.sh --apply "/home/admin/Downloads/futrama/Futurama Season 1 [...]" +``` + +What it does: + +1. **Copies** the source directory tree to + `/home/admin/.jellyfin-staging/<release-name>/`. The source is never + modified. +2. Inside the staging copy, **moves** every DELETE-classified file to + `/home/admin/.jellyfin-quarantine/<YYYY-MM-DD>/<release-name>/`, + preserving relative paths so a user can `diff -r` to confirm. +3. **Renames** non-canonical extras subfolders to canonical lowercase + (§ 4.1). +4. Writes a manifest at + `/home/admin/.jellyfin-staging/<release-name>/.cleanup-manifest.json` + listing every file action with sha256, source path, action, target + path. This is what stage 3 reads. +5. Returns the staging path on stdout — that's the input to doc 08's + filename normalizer. 
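
Step 2 of the list above (move into quarantine while preserving relative paths) can be sketched as follows. This is an illustration under stated assumptions, not the code in `cleanup-import.sh`: the function name `quarantine_file` and its three-argument form are hypothetical, and the sketch simply refuses to overwrite an existing quarantine entry rather than implementing the `.2/`, `.3/` suffixing described in § 6.

```bash
# Hypothetical helper: move one DELETE-classified file out of the staging
# copy into the quarantine tree, keeping its path relative to the release.
#   $1 = staging directory for the release
#   $2 = path of the file relative to $1
#   $3 = dated quarantine directory for the release
quarantine_file() {
  local staging="$1" rel="$2" qdir="$3"
  local target="$qdir/$rel"
  # Refuse to clobber an earlier quarantine entry.
  if [ -e "$target" ]; then
    printf 'refusing to overwrite %s\n' "$target" >&2
    return 1
  fi
  # Recreate the file's parent directories under the quarantine root.
  mkdir -p -- "$(dirname -- "$target")"
  mv -- "$staging/$rel" "$target"
}

# Demo against throwaway directories (these paths are illustrative only).
demo="$(mktemp -d)"
mkdir -p "$demo/staging/Featurettes"
printf 'x' > "$demo/staging/Featurettes/readme.txt"
quarantine_file "$demo/staging" "Featurettes/readme.txt" "$demo/quarantine"
find "$demo" -type f | sort
```

Because the relative path is kept intact, the quarantine tree mirrors the staging tree, which is what makes the `diff -r` review in step 2 practical.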
+ +### 5.3 Stage 3 — Confirm and recycle + +After the user reviews the quarantine directory and approves: + +```bash +cleanup-import.sh --confirm-quarantine 2026-05-08 +``` + +Moves `/home/admin/.jellyfin-quarantine/2026-05-08/` to the system trash +(via `gio trash`) — still recoverable, but no longer cluttering the +quarantine root. After 30 days a cron sweep empties trash older than that. + +### 5.4 Never delete from source + +The source download (`/home/admin/Downloads/futrama/...`) is **never** +modified by `cleanup-import.sh`. Reasons: + +- The user may want to re-seed the torrent. +- The user may want to re-run cleanup with different rules later. +- Bugs in the cleanup script must never destroy original artefacts. + +Source deletion is a separate manual step the user does AFTER the +import is verified in Jellyfin and the library is happy. There is no +script for it on purpose. + +--- + +## 6. Idempotency, edge cases, and "unknown" handling + +- **Idempotent.** `cleanup-import.sh --apply` on an already-cleaned staging + directory is a no-op (nothing matches DELETE). The script detects this + and exits 0 with `nothing to do`. +- **Re-runnable on source.** Re-running the script on the same source + produces a fresh staging copy, overwriting (after backup) the previous + staging directory. Quarantine is dated, so two runs on the same day for + the same release append rather than overwrite (`<release-name>.2/`, + `.3/`, etc.). +- **Unknown extension** (e.g. `.dat`, `.bin`, `.iso`, `.bin.txt`) — never + auto-deleted. FLAGGED in the audit output, surfaced to the user. The + user adds it to the local override file + `~/.config/jellyfin-cleanup/local-rules.conf` if they want it + classified next time. +- **Hidden dotfiles** (anything starting with `.` other than known OS + caches like `.DS_Store`) — FLAGGED. Don't auto-delete; could be a + legitimate `.subliminal.cache` (subtitles plugin) or similar. +- **Symlinks** — never followed. 
A symlink in a release directory is + always FLAGGED; the script refuses to copy or quarantine it. +- **Permission denied** — script bails with non-zero exit. Never + partially applies. + +--- + +## 7. The `Futurama Compare.png` problem (artwork false-positive) + +`Futurama Compare.png` is a 1.05 MB PNG sitting next to the season's MKV +files. To a naive image-globber it looks like artwork — same extension as +`folder.jpg`, larger than the typical poster, sitting in the right +location. It's actually an encoder comparison shot. + +The rule from doc 01 (artwork) and enforced here: + +> **An image file in the release root is KEPT only if its name is on the +> exact recognised-artwork allow-list.** Anything else is DELETED. + +Recognised artwork allow-list (top-level of an item folder): + +- `folder.{jpg,jpeg,png,webp}` +- `poster.{jpg,jpeg,png,webp}` +- `cover.{jpg,jpeg,png,webp}` +- `default.{jpg,jpeg,png,webp}` +- `show.{jpg,jpeg,png,webp}` (series only) +- `jacket.{jpg,jpeg,png,webp}` (series only) +- `movie.{jpg,jpeg,png,webp}` (movies only) +- `backdrop.{jpg,jpeg,png,webp}` and `backdrop[0-9]*.{jpg,jpeg,png,webp}` +- `fanart.{jpg,jpeg,png,webp}`, `background.{jpg,jpeg,png,webp}`, + `art.{jpg,jpeg,png,webp}` +- `logo.{png,jpg}`, `clearlogo.{png,jpg}` +- `banner.{jpg,png}`, `landscape.{jpg,png}`, `thumb.{jpg,png}`, + `disc.{png,jpg}`, `clearart.{png,jpg}` +- `season[0-9]*-poster.{jpg,png}`, `season[0-9]*.{jpg,png}`, + `season-specials-poster.{jpg,png}` +- `extrafanart/` and `backdrops/` directories (any contents OK) + +Exception: images **inside** a recognised extras folder (`extras/`, +`featurettes/`, etc.) are KEPT regardless of name — they're presumed to be +intentional content of that extra. + +`Futurama Compare.png` matches none of these allow-list patterns and is +not inside an extras folder, so it's DELETED. + +--- + +## 8. 
Security rules + +The single most important rule in this document: + +> **Windows-executable extensions and Internet Shortcut formats are +> auto-deleted, never quarantined for "review", because the threat model +> isn't the Linux server, it's the Jellyfin user who downloads them.** + +Jellyfin has a "Download original file" button for every item. If a +release contains `Codec Installer.exe`, Jellyfin will happily serve it to +any user with library access — including the friend on Windows who might +not understand that downloading and running an `.exe` from a media library +is a terrible idea. We don't trust the upload chain (the release group), +so we strip these on the server side. + +Exhaustive auto-delete list (security override — these bypass the +"FLAG unknown" rule): + +| Pattern | Risk | +|---|---| +| `*.exe` | Windows executable. Direct code execution on download+run. | +| `*.msi` | Windows Installer package. Silent install possible. | +| `*.bat`, `*.cmd` | Windows batch script. Runs in `cmd.exe`. | +| `*.com` | Old DOS-style executable. Still runs on modern Windows. | +| `*.scr` | Windows screensaver = .exe in disguise. Classic malware vector. | +| `*.ps1` | PowerShell script. Common modern malware delivery. | +| `*.vbs`, `*.wsf`, `*.hta`, `*.js` (Windows Script Host) | Active scripting. | +| `*.jar` | Java archive — runs as `java -jar` on systems with JRE. | +| `*.dll`, `*.sys` | Windows libraries / drivers. Side-load attacks. | +| `*.url`, `*.website`, `*.lnk` | Internet Shortcut / Windows Shortcut. Points at attacker-controlled URL. | +| `*.iso`, `*.img` (in a media folder, not at the library root) | Mountable disk image. Can carry Windows installers. **FLAG, not auto-delete** — could legitimately be a DVD rip. | +| `*.app/` | macOS application bundle. Auto-deleted. | +| `Autorun.inf` | Windows autorun config. 
**AUTO-DELETE.** | + +Total auto-delete categories that are **purely** security-driven (not +Jellyfin-irrelevance-driven): **15** — `.exe`, `.msi`, `.bat`, `.cmd`, +`.com`, `.scr`, `.ps1`, `.vbs`, `.wsf`, `.hta`, `.jar`, `.dll`, `.sys`, +`.url`/`.website`/`.lnk`, `Autorun.inf`. Plus 1 flagged for human review: +`.iso`/`.img`. + +### 8.1 Why `.url` is in the security list + +`.url` is a plain-text Internet Shortcut. On Windows, double-clicking it +opens the target in the default browser. The "target" is whatever the +release group put in the `URL=` line. Historically this was used to push +codec-pack download pages with bundled adware. There is no benign reason +for a `.url` to ship in a media release. + +The Futurama release contains exactly this pattern: + +``` +[InternetShortcut] +URL=https://ninite.com/klitecodecs/ +``` + +Ninite itself is reputable — but the principle is "do not ship clickable +URLs to third-party installers in a media library, ever". + +### 8.2 The `RARBG_DO_NOT_MIRROR.exe` historic case + +Some releases historically contained a file named +`RARBG_DO_NOT_MIRROR.exe`, ostensibly to discourage mirror sites from +re-uploading. In several documented cases this file was actually adware +or a cryptominer. Auto-delete, no questions asked. + +--- + +## 9. Prepared cleanup script — `cleanup-import.sh` + +Idempotent. Dry-run by default. Quarantine-first. Source-immutable. +Returns the staging path on stdout for piping to doc 08's normalizer. + +Save to `bin/cleanup-import.sh` in the `jellyfin-stack` repo. 
+ +```bash +#!/usr/bin/env bash +# cleanup-import.sh — Pre-import cleanup for tv.s8n.ru +# Version 1.0 (2026-05-08) — see docs/07-pre-import-cleanup.md +# +# Usage: +# cleanup-import.sh SRC # dry-run +# cleanup-import.sh --apply SRC # quarantine +# cleanup-import.sh --confirm-quarantine YYYY-MM-DD # recycle +# +# Exit codes: +# 0 success / nothing to do +# 1 user error (bad args, source not found) +# 2 internal error (permission, partial state) +# 3 flagged files present — user must review before --apply +set -euo pipefail + +STAGING_ROOT="${JELLYFIN_STAGING_ROOT:-$HOME/.jellyfin-staging}" +QUARANTINE_ROOT="${JELLYFIN_QUARANTINE_ROOT:-$HOME/.jellyfin-quarantine}" +TODAY="$(date +%Y-%m-%d)" + +# ----- classification ----- +# Returns one of: KEEP DELETE FLAG +classify() { + local path="$1" + local base + base="$(basename "$path")" + local lower + lower="$(printf '%s' "$base" | tr '[:upper:]' '[:lower:]')" + + # Security overrides — bypass everything else + case "$lower" in + *.exe|*.msi|*.bat|*.cmd|*.com|*.scr|*.ps1|*.vbs|*.wsf|*.hta|*.jar|*.dll|*.sys) echo DELETE; return ;; + *.url|*.website|*.lnk) echo DELETE; return ;; + autorun.inf) echo DELETE; return ;; + esac + + # OS junk + case "$lower" in + thumbs.db|ehthumbs.db|ehthumbs_vista.db|.ds_store|desktop.ini|.directory) echo DELETE; return ;; + ._*) echo DELETE; return ;; + esac + + # Sample clips: tested before the media rule, whose *.mkv / *.mp4 globs would otherwise KEEP them first + case "$lower" in + sample.mkv|sample.mp4|sample.avi|sample.m4v) echo DELETE; return ;; + *-sample.mkv|*-sample.mp4|*.sample.mkv|*.sample.mp4|*_sample.mkv|*_sample.mp4) echo DELETE; return ;; + esac + + # Media — KEEP + case "$lower" in + *.mkv|*.mp4|*.avi|*.m4v|*.ts|*.mov|*.webm|*.wmv|*.flv|*.mpg|*.mpeg) echo KEEP; return ;; + *.srt|*.ass|*.ssa|*.vtt|*.sup|*.idx|*.sub) echo KEEP; return ;; + *.mp3|*.flac|*.ogg|*.opus|*.m4a|*.wav) echo KEEP; return ;; + esac + + # Recognised artwork at item root + case "$lower" in + folder.jpg|folder.jpeg|folder.png|folder.webp) echo KEEP; return ;; + poster.jpg|poster.jpeg|poster.png|poster.webp) echo KEEP; return ;; + cover.jpg|cover.jpeg|cover.png|cover.webp) echo KEEP; return ;; + default.jpg|default.png|show.jpg|show.png|jacket.jpg|jacket.png|movie.jpg|movie.png) echo KEEP; return ;; + 
backdrop.jpg|backdrop.png|backdrop[0-9]*.jpg|backdrop[0-9]*.png) echo KEEP; return ;; + fanart.jpg|fanart.png|background.jpg|background.png|art.jpg|art.png) echo KEEP; return ;; + logo.png|logo.jpg|clearlogo.png|clearlogo.jpg|banner.jpg|banner.png) echo KEEP; return ;; + landscape.jpg|landscape.png|thumb.jpg|thumb.png|disc.png|disc.jpg|clearart.png|clearart.jpg) echo KEEP; return ;; + season[0-9]*-poster.jpg|season[0-9]*-poster.png|season[0-9]*.jpg|season[0-9]*.png) echo KEEP; return ;; + season-specials-poster.jpg|season-specials-poster.png) echo KEEP; return ;; + esac + + # Promo images masquerading as art + case "$lower" in + *compare*.png|*compare*.jpg|*compare*.jpeg|*compare*.webp|*compare*.gif) echo DELETE; return ;; + *sample*.png|*sample*.jpg|*sample*.jpeg) echo DELETE; return ;; + *screen*.png|*screen*.jpg|*preview*.png|*preview*.jpg) echo DELETE; return ;; + esac + + # Text-flavoured junk + case "$lower" in + *.txt|*.diz|file_id.diz) echo DELETE; return ;; + esac + + # Sample files + case "$lower" in + sample.mkv|sample.mp4|sample.avi|sample.m4v) echo DELETE; return ;; + *-sample.mkv|*-sample.mp4|*.sample.mkv|*.sample.mp4|*_sample.mkv|*_sample.mp4) echo DELETE; return ;; + esac + + # Torrent residue + case "$lower" in + *.torrent|*.magnet|*.parts|*.aria2|*.meta) echo DELETE; return ;; + *.pad|__padding_file_*|_____padding_file_*) echo DELETE; return ;; + *.sfv|*.md5|*.sha1|*.sha256) echo DELETE; return ;; + esac + + # NFO discriminator — KEEP if Jellyfin-compatible XML, else DELETE + case "$lower" in + *.nfo) + if head -c 4096 "$path" | tr -d '[:space:]' \ + | grep -qE '<(movie|tvshow|episodedetails|artist|album|musicvideo|season)\b'; then + echo KEEP + else + echo DELETE + fi + return + ;; + esac + + # Suspicious archives in a media folder + case "$lower" in + *.rar|*.r[0-9][0-9]|*.zip|*.7z|*.tar|*.tar.gz|*.iso|*.img) echo FLAG; return ;; + esac + + echo FLAG +} + +# ----- folder classification ----- +# Returns one of: KEEP_AS-IS RENAME:<target> DELETE 
FLAG +classify_dir() { + local d="$1" + local lower + lower="$(basename "$d" | tr '[:upper:]' '[:lower:]')" + case "$lower" in + behind\ the\ scenes|deleted\ scenes|interviews|scenes|samples|shorts|featurettes|clips|other|extras|trailers|theme-music|backdrops) + echo "RENAME:$lower"; return ;; + bts|behind-the-scenes) echo "RENAME:behind the scenes"; return ;; + deleted-scenes|deleted_scenes) echo "RENAME:deleted scenes"; return ;; + bonus|bonus\ features|bonus\ material|special\ features|outtakes|bloopers|gag\ reel) echo "RENAME:extras"; return ;; + proof|screens|screenshots|caps|preview|previews) echo DELETE; return ;; + sample) echo DELETE; return ;; + .fseventsd|.spotlight-v100|.trashes|\$recycle.bin|system\ volume\ information) echo DELETE; return ;; + extrafanart) echo "RENAME:extrafanart"; return ;; # case stays, recognised + *) echo FLAG; return ;; + esac +} + +# ----- main ----- +APPLY=0 +CONFIRM_DATE="" +SRC="" + +while [[ $# -gt 0 ]]; do + case "$1" in + --apply) APPLY=1; shift ;; + --confirm-quarantine) CONFIRM_DATE="$2"; shift 2 ;; + -h|--help) sed -n '2,12p' "$0"; exit 0 ;; + -*) echo "unknown flag: $1" >&2; exit 1 ;; + *) SRC="$1"; shift ;; + esac +done + +if [[ -n "$CONFIRM_DATE" ]]; then + if [[ -d "$QUARANTINE_ROOT/$CONFIRM_DATE" ]]; then + gio trash "$QUARANTINE_ROOT/$CONFIRM_DATE" + echo "Recycled $QUARANTINE_ROOT/$CONFIRM_DATE" + else + echo "No quarantine for $CONFIRM_DATE" >&2; exit 1 + fi + exit 0 +fi + +[[ -n "$SRC" && -d "$SRC" ]] || { echo "usage: $0 [--apply] SRC" >&2; exit 1; } + +RELEASE="$(basename "$SRC")" +STAGE="$STAGING_ROOT/$RELEASE" +QUAR="$QUARANTINE_ROOT/$TODAY/$RELEASE" + +declare -i KEEP_N=0 DEL_N=0 FLAG_N=0 + +# Walk source, classify each entry +while IFS= read -r -d '' f; do + rel="${f#$SRC/}" + if [[ -d "$f" ]]; then + case "$(classify_dir "$f")" in + KEEP_AS-IS|RENAME:*) ;; + DELETE) printf 'DELETE %s/ [junk dir]\n' "$rel"; DEL_N+=1 ;; + FLAG) printf 'FLAG %s/ [unknown dir name]\n' "$rel"; FLAG_N+=1 ;; + esac + continue 
+ fi + case "$(classify "$f")" in + KEEP) printf 'KEEP %s\n' "$rel"; KEEP_N+=1 ;; + DELETE) printf 'DELETE %s\n' "$rel"; DEL_N+=1 ;; + FLAG) printf 'FLAG %s\n' "$rel"; FLAG_N+=1 ;; + esac +done < <(find "$SRC" -mindepth 1 -print0) + +echo "---" +echo "KEEP $KEEP_N" +echo "DELETE $DEL_N" +echo "FLAG $FLAG_N" + +if (( FLAG_N > 0 )); then + echo "FLAG count > 0; review before re-running with --apply." >&2 + (( APPLY == 0 )) || exit 3 +fi + +if (( APPLY == 0 )); then + echo "Dry run only. Re-run with --apply to quarantine." + exit 0 +fi + +# --- APPLY path: copy to staging, move DELETE to quarantine --- +mkdir -p "$STAGE" "$QUAR" +# rsync -a preserves perms and is idempotent +rsync -a --delete "$SRC/" "$STAGE/" + +while IFS= read -r -d '' f; do + rel="${f#$STAGE/}" + if [[ -d "$f" ]]; then + res="$(classify_dir "$f")" + case "$res" in + RENAME:*) + target="${res#RENAME:}" + parent="$(dirname "$f")" + [[ "$(basename "$f")" == "$target" ]] || mv "$f" "$parent/$target" + ;; + DELETE) + mkdir -p "$QUAR/$(dirname "$rel")" + mv "$f" "$QUAR/$rel" + ;; + esac + continue + fi + case "$(classify "$f")" in + DELETE) + mkdir -p "$QUAR/$(dirname "$rel")" + mv "$f" "$QUAR/$rel" + ;; + esac +done < <(find "$STAGE" -mindepth 1 -print0) + +# Manifest +{ + echo "{" + echo " \"release\": \"$RELEASE\"," + echo " \"date\": \"$TODAY\"," + echo " \"source\": \"$SRC\"," + echo " \"staging\": \"$STAGE\"," + echo " \"quarantine\": \"$QUAR\"" + echo "}" +} > "$STAGE/.cleanup-manifest.json" + +# Stdout: the staging path, for piping to doc 08's normalizer +echo "$STAGE" +``` + +### 9.1 Pipeline integration + +```bash +# Full pre-import flow: +SRC="/home/admin/Downloads/futrama/Futurama Season 1 [1080p AI x265 10bit FS99 Joy]" +# The script prints its KEEP/DELETE/FLAG report on stdout before the staging +# path, so keep only the last line of output. +STAGING="$(cleanup-import.sh --apply "$SRC" | tail -n 1)" +# STAGING is now ~/.jellyfin-staging/Futurama Season 1.../ with junk gone. 
+# Hand off to doc 08: +normalize-filenames.sh "$STAGING" +# Then move to live media tree (manual; doc 05 confirms layout): +mv "$STAGING" "/home/user/media/tv/Futurama (1999)/Season 01" +``` + +The `mv` to the live tree is **deliberately manual**. Cleanup and rename +are reproducible from source; the move into `/home/user/media/` is the +point of no return and the user runs it consciously. + +--- + +## 10. What this doc explicitly does NOT do + +- **Filename normalization** — that's doc 08. This doc only deletes; doc 08 + renames `Futurama S01E01 Space Pilot 3000 [1080p x265 10bit Joy].mkv` + into the canonical `Futurama (1999) - S01E01 - Space Pilot 3000.mkv`. +- **Subtitle reconciliation** — doc 03 covers per-language naming; this + doc only deletes obsolete formats (`.smi`, `.rt`). +- **Library refresh** — after files land in `/home/user/media/`, run + `POST /Library/Refresh` on the Jellyfin API (doc 02 § 2). Cleanup never + touches the running container. +- **NFO writing** — doc 02 § 11 covers writing override NFOs. This doc + only filters incoming NFOs. +- **Source deletion** — never. The source download is read-only to this + pipeline; the user removes it manually post-import. + +--- + +## 11. 
TL;DR + +| Step | What | Where | +|---|---|---| +| 1 | Audit (dry-run) | `cleanup-import.sh "$SRC"` | +| 2 | Apply (quarantine) | `cleanup-import.sh --apply "$SRC"` → prints staging path | +| 3 | Review quarantine | `ls ~/.jellyfin-quarantine/$(date +%F)/` | +| 4 | Normalize filenames | doc 08, takes staging path as input | +| 5 | Move to live tree | manual `mv "$STAGING" /home/user/media/...` | +| 6 | Refresh library | `POST /Library/Refresh` (doc 02) | +| 7 | Confirm quarantine | `cleanup-import.sh --confirm-quarantine YYYY-MM-DD` | +| 8 | Delete source | manual, only after Jellyfin shows the item correctly | + +The hard rule, repeated: **the source download is never modified, the live +media tree is never written by cleanup, and Windows executables never +reach a Jellyfin user's browser.** diff --git a/docs/08-filename-normalization.md b/docs/08-filename-normalization.md new file mode 100644 index 0000000..cf62291 --- /dev/null +++ b/docs/08-filename-normalization.md @@ -0,0 +1,1853 @@ +# 08 — Filename & Folder Normalization Ruleset (tv.s8n.ru) + +Last updated: 2026-05-08 +Server: Jellyfin 10.10.3 on nullstone, container `jellyfin` +Library root inside container: `/media` +Library root on host: `/home/user/media` + +This document is the **normative ruleset** for renaming downloaded media into a +canonical, predictable, group-tag-free shape before it lands in the live +library tree. It is the layer between "torrent dump" and "file ready for the +scanner". + +Cross-links: + +- [`05-file-structure-rules.md`](05-file-structure-rules.md) — what Jellyfin's + parser accepts; this doc picks one of the accepted forms and locks it in. +- [`07-pre-import-cleanup.md`](07-pre-import-cleanup.md) — the operational + pipeline (move, dedupe, garbage collect) that consumes this ruleset. Doc 08 + defines *what* canonical looks like; doc 07 defines *how* to apply it. 
+- [`02-metadata-and-titles.md`](02-metadata-and-titles.md) — what Jellyfin + does after the rename (parse, scrape, lock). +- [`03-subtitles.md`](03-subtitles.md) — sidecar `.srt` / `.ass` naming + (referenced from § 5.6 below). + +> **Status of this doc:** specification + reference implementation. The +> `normalize.py` script in § 11 is canonical. Anything not codified by the +> script is documentation only — when the doc and the script disagree, the +> script wins, and the doc gets fixed. + +--- + +## 0. Why a normalization ruleset (and why now) + +Doc 05 establishes that Jellyfin's parser is permissive: dots, dashes, +underscores, and spaces are interchangeable; `S01E01`, `s01e01`, `1x01`, and +`Season 1 Episode 1` all parse to the same thing. That permissiveness is great +for *getting Jellyfin to scrape a torrent dump*, but it is a disaster for +**operating a library at scale**: + +1. **Search becomes noisy.** SMB / Syncthing / Dolphin search across mixed + patterns surfaces irrelevant matches (`S01E01` vs `1x01` vs `s01.e01`). +2. **Diff / audit / dedupe scripts** get harder. Every regex needs to handle + N forms. The cleanup pass (doc 07) is dramatically cheaper if every file + in the tree obeys one shape. +3. **Visual scan in `ls`** becomes unreadable when half the filenames have + `[1080p AI x265 10bit FS99 Joy]` glued on and the other half don't. +4. **Future migrations** (Plex, Kodi, mobile sync to a Win/Mac client) all + have stricter parsers than Jellyfin. The strictest sane shape that + Jellyfin accepts is also the most portable. Pay the cost once. +5. **Cross-platform safety.** This deploy is Linux-only today, but the + workspace's Syncthing setup (see ai-lab `SYSTEM.md`) implies future + sync to Win/Mac clients. Choose Windows-safe filenames now and never + touch this again. + +The cost of the ruleset is one Python script and discipline at import time. +Both are bounded. The cost of *not* having one compounds with every new +release. + +--- + +## 1. 
Canonical formats — what the tree must look like + +This is the lock-in. **One shape per category. No alternatives. No "but my +release group did it differently".** + +### 1.1 Movies + +``` +Movies/<Title> (<Year>)/<Title> (<Year>).<ext> +Movies/<Title> (<Year>)/<Title> (<Year>) - <Edition>.<ext> (when edition matters) +Movies/<Title> (<Year>) [<provider-id>]/<Title> (<Year>) [<provider-id>].<ext> (when ambiguous) +``` + +- `<Title>` — smart title case (§ 5.1), forbidden chars stripped (§ 5.5). +- `<Year>` — first theatrical-release year, in parens, single space before `(`. + Mandatory in this deploy (doc 05 § 0 rule 5), even when the title is unique. +- `<Edition>` — when present, exactly one of: + `Director's Cut`, `Extended`, `Theatrical`, `IMAX`, `Unrated`, `Final Cut`, + `Remastered`. Anything else (e.g. `Snyder Cut`, `Workprint`, `4K + Remaster`) is admissible only with a written justification in the import + log; otherwise normalize to the closest of the seven canonical labels + above. +- `<provider-id>` — `imdbid-tt0123456` / `tmdbid-12345` / `tvdbid-12345` + in square brackets. Optional unless year-based disambiguation isn't + enough (§ 6.2). +- `<ext>` — lowercase: `mkv`, `mp4`, `webm`, `avi`. (`mkv` is the rip + default; `mp4` is the streaming-original default.) Never uppercase + `.MKV`, `.MP4`. + +**Forbidden in the filename**: resolution tags (`1080p`, `2160p`, `720p`, +`4K`), codec tags (`x264`, `x265`, `h264`, `h265`, `HEVC`, `AVC`), source +tags (`WEB`, `WEB-DL`, `BluRay`, `BRRip`, `HDTV`, `DVDRip`, `WEBRip`), +audio tags (`AAC`, `AC3`, `DTS`, `DTS-HD.MA`, `5.1`, `7.1`, `Atmos`, +`Opus`), bitness/HDR tags (`10bit`, `8bit`, `HDR`, `DV`, `SDR`), release +tags (`PROPER`, `REPACK`, `INTERNAL`, `LIMITED`, `RERIP`), language tags +(`MULTi`, `DUBBED`, `SUBBED`), group tags +(`[YIFY]`, `[RARBG]`, `[FS99 Joy]`, `-NOGRP`, `-EVO`, `-SPARKS`), +and website refs (`WWW.YIFY-TORRENTS.COM`, `RARBG.txt`-derived names). 
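
The § 1.1 shape is strict enough to check mechanically. The following is a minimal sketch, not part of `normalize.py`: the names `CANONICAL_MOVIE` and `is_canonical_movie` are hypothetical, and the pattern only checks the overall shape (it does not verify that the ` - <Label>` part is one of the seven canonical editions).

```python
import re

# Hypothetical validator (an assumption, not taken from normalize.py).
# Accepts:  <Title> (<Year>)[ [<provider-id>]][ - <Label>].<ext>
# The label is not checked against the edition list in this sketch.
CANONICAL_MOVIE = re.compile(
    r"^(?P<title>[^\[\]()]+) \((?P<year>\d{4})\)"
    r"(?: \[(?:imdbid-tt\d+|tmdbid-\d+|tvdbid-\d+)\])?"
    r"(?: - (?P<label>[^\[\]()]+))?"
    r"\.(?P<ext>mkv|mp4|webm|avi)$"
)


def is_canonical_movie(filename: str) -> bool:
    """Return True when the basename matches the section 1.1 shape."""
    return CANONICAL_MOVIE.match(filename) is not None
```

A check like this could run as a post-rename assertion: any file that fails it after normalization points at a rule the pipeline missed.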
+ +**Justification — why no resolution/codec tag:** + +Jellyfin reads stream attributes (resolution, codec, bit-depth, HDR, audio +codec) directly from the file via `ffprobe` on every scan. The web UI +displays them. The mobile clients display them. The transcoder picks +based on them. The filename contributes **zero new information**. +Including those tags pollutes search results, breaks the byte-exact +folder-vs-file match required for multi-version movies (doc 05 § 1.2), +and makes humans skim past the title to find the title. The only +exception is `Movie (Year) - 1080p.mkv` AS the multi-version label +when two distinct rips of *the same movie* are kept in the same folder +(e.g. `Blade Runner 2049 (2017) - 2160p.mkv` next to +`Blade Runner 2049 (2017) - 1080p.mkv`). In that exact case, the +resolution IS the disambiguation token. Otherwise, no. + +#### Examples + +``` +Movies/Blade Runner (1982)/Blade Runner (1982).mkv +Movies/Blade Runner (1982)/Blade Runner (1982) - Final Cut.mkv +Movies/Blade Runner (1982)/Blade Runner (1982) - Director's Cut.mkv +Movies/Blade Runner 2049 (2017)/Blade Runner 2049 (2017) - 2160p.mkv +Movies/Blade Runner 2049 (2017)/Blade Runner 2049 (2017) - 1080p.mkv +Movies/Dune (1984) [imdbid-tt0087182]/Dune (1984) [imdbid-tt0087182].mkv +``` + +### 1.2 TV shows + +``` +TV/<Show> (<Year>)/Season <NN>/<Show> (<Year>) - S<NN>E<MM> - <Episode Title>.<ext> +TV/<Show> (<Year>)/Season <NN>/<Show> (<Year>) - S<NN>E<MM>-E<MM2> - <Episode Title>.<ext> +TV/<Show> (<Year>)/Season 00/<Show> (<Year>) - S00E<MM> - <Special Title>.<ext> +``` + +- `<Show>` — smart title case, no provider-id in show folder unless the + scraper picks the wrong show twice in a row (then add `[tvdbid-NNNN]`). +- `<Year>` — series **first-air year**, mandatory even when title is unique + (doc 05 § 0 rule 5; this deploy convention is stricter than upstream + permissive parsing). +- `<NN>` — zero-padded two digits. `Season 01`, not `Season 1`. `S01`, not `S1`. 
+- `<MM>` — zero-padded two digits. Three digits permissible only for shows + that exceed 99 episodes per *season* (rare; e.g. some daily anime). See + doc 05 § 3.1. +- `<Episode Title>` — title from the metadata provider (TVDB/TMDB) with + smart title case. Required for human readability; Jellyfin overwrites it + during scrape but the file basename is what humans see in `ls`. +- Multi-episode files: `S<NN>E<MM>-E<MM2>` — single hyphen, no spaces. + Verified parsing per doc 05 § 2.2 table. + +#### Examples + +``` +TV/Futurama (1999)/Season 01/Futurama (1999) - S01E01 - Space Pilot 3000.mkv +TV/Futurama (1999)/Season 01/Futurama (1999) - S01E03-E04 - I, Roommate / Love's Labours Lost in Space.mkv +TV/Futurama (1999)/Season 00/Futurama (1999) - S00E01 - Bender's Big Score.mkv +TV/The Office (2005)/Season 02/The Office (2005) - S02E01 - The Dundies.mkv +``` + +#### Why this shape (not the slimmer `Show S01E01.mkv`) + +Doc 05 § 2.2 shows three accepted patterns: + +``` +Futurama (1999) S01E01.mkv +Futurama (1999) S01E01 - Space Pilot 3000.mkv +Futurama (1999) - S01E01 - Space Pilot 3000.mkv ← canonical for this deploy +``` + +The third form (with the leading ` - ` before `S01E01` and the title) is +chosen because: + +1. The leading dash visually separates the series-name block from the + episode-id block. Important when the show's title contains spaces and + numbers (`Star Trek The Next Generation S01E01`) — without the dash, the + eye trips over `Generation S01E01`. +2. Symmetric with the Movies multi-version pattern (`Title (Year) - <Label>`). + One mental model for the whole library. +3. Identical to the Sonarr default rename pattern (`{Series Title} - + S{season:00}E{episode:00} - {Episode Title}`), which means the naming + pattern is well-trodden and tooling friendly. + +### 1.3 Anime — seasonal numbering (TVDB-style) + +Same shape as TV (§ 1.2). Mandatory year. Mandatory `Season NN`. No +absolute numbers. 
+ +``` +Anime/<Show> (<Year>)/Season <NN>/<Show> (<Year>) - S<NN>E<MM> - <Episode Title>.<ext> +``` + +#### Examples + +``` +Anime/Cowboy Bebop (1998)/Season 01/Cowboy Bebop (1998) - S01E01 - Asteroid Blues.mkv +Anime/Mushishi (2005)/Season 02/Mushishi (2005) - S02E01 - The Sleeping Mountain.mkv +Anime/Steins;Gate (2011) [tvdbid-244061]/Season 01/Steins;Gate (2011) [tvdbid-244061] - S01E01 - Turning Point.mkv +``` + +(`;` is legal on `ext4` but flagged in § 5.5 as risky for portability — +prefer `Steins-Gate` if portability matters.) + +### 1.4 Anime — absolute numbering + +Used **only** for shows >99 episodes that don't fit the seasonal model +(One Piece, Naruto, Detective Conan, Bleach). For those shows, the +canonical shape is: + +``` +Anime/<Show>/<Show> - <NNNN> - <Episode Title> [<Sub|Dub>].<ext> +``` + +- No `(<Year>)` on the show folder — absolute-numbering shows are usually + unique by name; if not, fall back to a provider ID + (`Doraemon (1979) [tvdbid-71603]`, then revert to seasonal Pattern 1.3). +- `<NNNN>` — **zero-padded four digits** (deterministic; all known + long-runners stay below 9999). Three-digit padding (`099`) is wrong; + four-digit (`0099`) is right and matches the upper bound of the longest + running show. +- `[<Sub|Dub>]` — exactly one of `[Sub]` or `[Dub]`. Required for any + release where both audio tracks are not embedded in one mkv. If the + release contains both audio tracks in one container, omit the + bracket. +- No `Season NN` folder. Absolute numbering puts every episode in the + show root. + +#### Deterministic absolute-numbering rule + +Absolute number = the episode's position in the **broadcast order** as +listed by AniDB's "main" episode list for that show. NOT the dub broadcast +order, NOT a re-cut/remaster renumbering. For shows with discrepancies +between AniDB and TVDB absolute numbering (rare), AniDB wins — that's the +provider that absolute-numbering plugins (and Shoko) use. 
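
The padding and `[Sub|Dub]` rules of § 1.4 can be sketched as a small formatter. This is an illustration under assumptions: `absolute_basename` is a hypothetical helper, not a function from `normalize.py`, and it takes the absolute number as already resolved (looking it up against AniDB is out of scope here).

```python
def absolute_basename(show, number, episode_title, subdub=None, ext="mkv"):
    """Build '<Show> - <NNNN> - <Episode Title> [<Sub|Dub>].<ext>'.

    Hypothetical helper for illustration; the caller supplies the
    absolute number in broadcast order.
    """
    if not 1 <= number <= 9999:
        raise ValueError(f"absolute number out of range: {number}")
    if subdub not in (None, "Sub", "Dub"):
        raise ValueError(f"subdub must be 'Sub', 'Dub' or None, got {subdub!r}")
    # Omit the bracket entirely when both audio tracks live in one container.
    suffix = f" [{subdub}]" if subdub else ""
    # :04d gives the fixed four-digit zero padding (99 becomes 0099).
    return f"{show} - {number:04d} - {episode_title}{suffix}.{ext}"
```

Rejecting numbers above 9999 keeps the four-digit width an invariant instead of a convention.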
+ +#### Examples + +``` +Anime/One Piece/One Piece - 0001 - I'm Luffy! The Man Who's Gonna Be King of the Pirates! [Sub].mkv +Anime/One Piece/One Piece - 0001 - I'm Luffy! The Man Who's Gonna Be King of the Pirates! [Dub].mkv +Anime/Naruto/Naruto - 0001 - Enter Naruto Uzumaki [Sub].mkv +Anime/Detective Conan/Detective Conan - 1099 - The Detective's Vacation [Sub].mkv +``` + +#### Caveat + +Naive Jellyfin without Shoko will mis-handle episodes >99 (doc 05 § 3.3). +This is a known issue; pick **one** of: + +- Run Shoko (doc 05 § 3.2). Filenames don't matter for Shoko — but obey + this ruleset anyway, for human readability and for the day Shoko goes + away. +- Re-bucket by TVDB seasons. Most long-runners have a TVDB season split + (One Piece S01-S22). Use § 1.3 with the seasons. + +This deploy currently does NOT run Shoko; it currently does NOT host any +absolute-numbered anime. The shape in § 1.4 is reserved for the day +Shoko gets installed. Leave it documented. + +### 1.5 Music videos + +``` +MusicVideos/<Artist>/<Year> - <Track Title>.<ext> +MusicVideos/<Artist>/<Year> - <Track Title> [<Variant>].<ext> (when multiple cuts exist) +``` + +- `<Artist>` — smart title case, comma-separated for collabs + (`Daft Punk, Pharrell Williams`). +- `<Year>` — release year of the *video*, not the song. Songs older than + their videos are common (a 2024 acoustic cover gets the 2024 year). +- `<Track Title>` — smart title case. +- `<Variant>` — optional, `[Live]`, `[Acoustic]`, `[Remix]`, `[Alternate]`, + `[Lyric Video]`. Forbidden: `[1080p]`, `[Official]`, `[HD]`. + +Music videos do not use `(<Year>)` parens because the library is +`musicvideos` `CollectionType`, which has no scraper (doc 05 § 5.3) and the +year is purely cosmetic. 
+ +#### Examples + +``` +MusicVideos/Daft Punk/2013 - Get Lucky.mp4 +MusicVideos/Daft Punk/2013 - Get Lucky [Lyric Video].mp4 +MusicVideos/Pink Floyd/1995 - Comfortably Numb [Live].mkv +MusicVideos/Daft Punk, Pharrell Williams/2013 - Get Lucky.mp4 +``` + +For full **live concerts** (>20 min, multi-song), file under Movies +instead, per doc 05 § 5.4. + +### 1.6 Stand-up specials (Movies-typed) + +Stand-up lives in the Movies library (doc 05 § 4). Folder + filename are +prefixed with the performer name; treat the whole `<Performer> - <Title>` +as the canonical "movie title" for parser purposes. + +``` +Movies/<Performer> - <Title> (<Year>)/<Performer> - <Title> (<Year>).<ext> +``` + +#### Examples + +``` +Movies/Bo Burnham - Inside (2021)/Bo Burnham - Inside (2021).mkv +Movies/Hannah Gadsby - Nanette (2018) [imdbid-tt8465676]/Hannah Gadsby - Nanette (2018) [imdbid-tt8465676].mkv +Movies/Norm Macdonald - Nothing Special (2022)/Norm Macdonald - Nothing Special (2022).mkv +``` + +The `<Performer> - ` prefix is **mandatory** for stand-up. Without it, the +title alone (`Inside (2021)`) ambiguously matches the 2007 horror film +*Inside*, the 2023 thriller *Inside*, or the 2017 documentary *Inside*. +The prefix gives TMDB enough disambiguation to land on the correct +record without a provider-id override. + +--- + +## 2. What to STRIP from a source filename — exhaustive list + +This is the substring inventory. The script in § 11 implements all of +these. The list grew from sampling ~200 distinct release-group filenames +across `[YIFY]`, `[RARBG]`, `[ettv]`, `[GalaxyRG]`, `[FS99 Joy]`, +`[NOGRP]`, `[FitGirl]`, and the Futurama corpus on disk. + +### 2.1 Group tags (square / round brackets) + +Match anything inside `[...]` or `(...)` *that does not look like a year*. +Year detection: 4 digits, 1900 ≤ N ≤ current year + 2. 
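
The year test in § 2.1 can be sketched as follows. This is a literal reading of the rule, with hypothetical names (`looks_like_year`, `strip_bracket_tags`) that are not taken from `normalize.py`. One consequence of the literal reading is worth noting: a provider-id bracket such as `[imdbid-tt0087182]` would also be removed, so a real implementation would presumably need an extra exemption for those.

```python
import datetime
import re

# One bracketed or parenthesised group, no nesting.
BRACKETED = re.compile(r"\[[^\[\]]*\]|\([^()]*\)")


def looks_like_year(text, today=None):
    """True for a bare 4-digit number N with 1900 <= N <= current year + 2."""
    if not re.fullmatch(r"\d{4}", text):
        return False
    current_year = (today or datetime.date.today()).year
    return 1900 <= int(text) <= current_year + 2


def strip_bracket_tags(name, today=None):
    """Drop every [...] / (...) group whose content is not a plausible year."""
    def keep_or_drop(match):
        inner = match.group(0)[1:-1]
        return match.group(0) if looks_like_year(inner, today) else " "

    stripped = BRACKETED.sub(keep_or_drop, name)
    # Removal leaves gaps behind; collapse them so the result is stable.
    return re.sub(r"\s{2,}", " ", stripped).strip()
```

The optional `today` argument exists only to make the "current year + 2" bound testable without depending on the wall clock.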
+ +Exemplar substrings (case-insensitive): + +``` +[1080p AI x265 10bit FS99 Joy] +[YIFY] +[YTS] +[YTS.MX] +[YTS.AG] +[YTS.AM] +[RARBG] +[ettv] +[eztv] +[GalaxyRG] +[GalaxyRG265] +[FitGirl] +[FitGirl Repack] +[NOGRP] +[QxR] +[FreetheFish] +[psa] +[PSA] +[CMRG] +[d3g] +[STRiFE] +[Pahe.in] +[FoV] +[NTb] +[YOLO] +[KOGi] +[playWEB] +[REQ] +[XBET] +[FLUX] +[NOSiVID] +[BGT] +[SVA] +[CRiMSON] +[ION10] +[ION265] +[BluPanda] +[H4S5S] +[5.1] +(YIFY) +(RARBG) +(NOGRP) +``` + +### 2.2 Trailing release-group dashes + +Pattern: `-<UPPERCASE_TOKEN>` at the very end of the basename +(before extension). Matches: + +``` +-NOGRP +-EVO +-RARBG +-SPARKS +-CMRG +-NTb +-FLUX +-AMZN +-NF +-DSNP +-ATVP +-MA +-WEB +-AAC2 +-FoV +-KOGi +-PLAYWEB +-FRDS +-ZQ +-PHOENiX +-EZTV +-NTG +-iON +-ION10 +-ION265 +-CtrlHD +-d3g +-PSA +-QxR +-RZeroX +-PMP +-BTN +-DEFLATE +-BAE +-MZABI +-TURG +``` + +The pattern `-[A-Z][A-Z0-9]{1,15}$` (after stripping bracket tags and +quality tags) captures most of these. The script in § 11 uses an +allow-list approach instead of a pattern, because release groups +sometimes exceed 15 chars and sometimes use mixed case. 
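
As a sketch of the allow-list approach described above (as opposed to the `-[A-Z][A-Z0-9]{1,15}$` pattern): the set below is a small subset of the § 2.2 list, and both `KNOWN_GROUPS` and `strip_trailing_group` are hypothetical names, not the script's own. Comparing case-insensitively is an assumption made here to cope with mixed-case group names.

```python
# Subset of the section 2.2 list, lowercased for case-insensitive lookup.
KNOWN_GROUPS = frozenset(
    g.lower()
    for g in ("NOGRP", "EVO", "RARBG", "SPARKS", "CMRG", "NTb", "FLUX", "PHOENiX", "RZeroX")
)


def strip_trailing_group(basename):
    """Remove a trailing '-GROUP' only when GROUP is on the allow-list.

    `basename` is expected without its extension.
    """
    head, dash, tail = basename.rpartition("-")
    if dash and head and tail.lower() in KNOWN_GROUPS:
        return head.rstrip(" ._-")
    return basename
```

The allow-list trades recall for safety: an unknown group survives and has to be added by hand, but a hyphenated title such as `Spider-Man` is never truncated.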
+ +### 2.3 Quality / codec / source / audio tags + +Strip all of these as standalone tokens (whitespace-, dot-, dash-, or +underscore-bounded), case-insensitive: + +**Resolution / aspect:** +``` +2160p 1080p 720p 480p 360p 4K 4k UHD HD SD FHD QHD +``` + +**Source:** +``` +WEB-DL WEBDL WEB.DL WEB WEBRip WEB-Rip BluRay BLURAY Bluray BDRip +BRRip BR-Rip BDR HDTV HDTVRip PDTV DSR DVDRip DVD DVDR DVD9 DVD5 +HDDVD HDDVDRip HDRip CAMRip CAM TS HDTS TC TELESYNC TELECINE R5 +SCREENER SCR WORKPRINT WP PPV PPVRip +``` + +**Codec / container hints (in name):** +``` +x264 x265 H.264 H264 H.265 H265 HEVC AVC VP9 AV1 XviD DivX +10bit 10-bit 8bit 8-bit HDR HDR10 HDR10+ DV DolbyVision Dolby.Vision +SDR HFR HQ +``` + +**Audio:** +``` +DD5.1 DDP5.1 DD7.1 DDP7.1 DD2.0 DD+5.1 DD+7.1 DTS DTS-HD DTS-HD.MA +DTS-X DTSX TrueHD Atmos AAC AAC2.0 AAC5.1 AC3 AC-3 EAC3 E-AC3 +MP3 MP2 Opus FLAC PCM LPCM 5.1 7.1 2.0 Mono Stereo Multi +``` + +**Release-process tags:** +``` +PROPER REPACK iNTERNAL INTERNAL LIMITED EXTENDED.CUT UNCUT THEATRiCAL +RERIP REAL READNFO RETAiL RETAIL STV DC COMPLETE REMUX REMASTERED +SUBBED DUBBED MULTi MULTI SUB DUB ENG ENGLISH POL POLISH iNT iNTERNAL +``` + +> Note: `EXTENDED.CUT`, `THEATRiCAL`, `UNRATED`, `IMAX`, `DIRECTORS.CUT`, +> `FINAL.CUT`, `REMASTERED`, `UNCUT`, `DC` (= Director's Cut shorthand), +> `EE` (= Extended Edition shorthand) are kept *as edition tokens* — see +> § 3.6. Strip them from the noise pool, then re-emit them as +> ` - <Edition>` if present. + +### 2.4 Source-specific cruft + +Common compound suffixes that are not single tokens: + +``` +WEB.h264-NiXON[rartv] +WEB-DL.DDP5.1.x264-NTb +BDRip.x265.10bit-RZeroX +HDTV.x264-PHOENiX +1080p.WEB.h264-NiXON +2160p.UHD.BluRay.REMUX.HDR.HEVC.DTS-HD.MA.5.1 +``` + +These are ad-hoc concatenations; once the standalone tokens above are +stripped, what remains is the title plus stray separators. The pipeline +in § 4 collapses separators last, so order matters. 
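
The "standalone token" rule of § 2.3 can be sketched with lookarounds. This is an illustration, not the script's implementation: `NOISE_TOKENS` is a short sample of the lists above, and `strip_noise_tokens` is a hypothetical name. Separators are deliberately left in place, since the separator collapse is a later pass.

```python
import re

# Sample of the section 2.3 token lists (not exhaustive).
NOISE_TOKENS = (
    "2160p", "1080p", "720p",
    "WEB-DL", "BluRay", "HDTV",
    "x264", "x265", "HEVC", "10bit",
    "DTS-HD.MA", "AAC", "5.1",
    "PROPER", "REPACK",
)

# Longest first, so a compound like "DTS-HD.MA" wins over any shorter token.
_ALTERNATION = "|".join(
    re.escape(tok) for tok in sorted(NOISE_TOKENS, key=len, reverse=True)
)

# A token counts only when bounded on both sides by start/end of string or
# by a separator (whitespace, dot, underscore, dash).
RE_NOISE = re.compile(
    rf"(?<![^\s._-])(?:{_ALTERNATION})(?![^\s._-])", re.IGNORECASE
)


def strip_noise_tokens(name):
    """Remove standalone noise tokens; leftover separators are kept."""
    return RE_NOISE.sub("", name)
```

The boundary check is what stops `AAC` from eating the start of a word such as `Aachen`. It does not protect an ordinary title word that happens to equal a token (for example a title containing the standalone word "Proper"), which is a limitation of the case-insensitive rule itself.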
+ +### 2.5 Whitespace / punctuation cleanup + +After substring removal, run these passes: + +| Pass | From | To | +|---|---|---| +| Collapse runs of spaces | `Show   Title  S01E01` | `Show Title S01E01` | +| Trim leading/trailing whitespace | `  Show.mkv  ` | `Show.mkv` | +| Collapse double-underscore | `Show__Title` | `Show Title` | +| Replace dot-separators with space (basename only) | `Show.Title.S01E01` | `Show Title S01E01` | +| Drop stray punctuation runs | `Show --- Title` | `Show - Title` | +| Strip trailing dashes/dots before ext | `Show -.mkv` | `Show.mkv` | + +The dot-to-space substitution is **only applied if the dot is between +alphanumeric tokens**. Channel counts such as `5.1` never reach this pass, +because § 2.3 has already removed them. Title dots are not preserved: a +source `Mr.Robot` has its dot replaced by a space, giving `Mr Robot`, and +the canonical form has no dot. + +### 2.6 URL / website refs + +Match and remove: + +``` +WWW.YIFY-TORRENTS.COM +WWW.YTS.MX +WWW.RARBG.TO +RARBG.txt +www.yify-torrents.com +``` + +These appear as bracket prefixes (`[WWW.YIFY-TORRENTS.COM] Movie...`), +suffixes (`Movie - WWW.YIFY-TORRENTS.COM.mkv`), or as `RARBG.txt`-style +sidecar files (which doc 07 garbage-collects, not us). + +Pattern (case-insensitive): `(?:^|[\s\[\(\.\-_])(WWW\.[A-Z0-9\-]+\.[A-Z]{2,4})(?:[\s\]\)\.\-_]|$)` → strip whole match. + +### 2.7 Language indicators in the BASE name + +`.pl`, `.eng`, `.en`, `.pol`, `.de`, `.fr`, `.es`, `.it`, `.ja`, `.jp`, +`.ru`, `.ko`, `.zh` appearing in the **video** filename (basename, not +extension). These belong on **subtitle sidecars only**, per doc 03. + +``` +Futurama.s01e01.pl.mkv ← BAD (`.pl` in video basename) +Futurama (1999) - S01E01.mkv ← GOOD (audio language is a stream attribute) +Futurama (1999) - S01E01.pl.srt ← GOOD (subtitle sidecar with lang) +Futurama (1999) - S01E01.eng.srt ← GOOD +``` + +Detection: 2- or 3-letter ISO-639 code as a token between dots / dashes / +underscores in the basename. 
If found, drop it from the basename. If a +sidecar `.srt` exists with the same lang token, **leave the sidecar +alone** — it's already correctly named. + +If the source file is a `.srt` / `.ass` / `.vtt` / `.sub`, the lang +token is part of the canonical sidecar form and must NOT be stripped. +The script's `--type subtitle` mode handles this branch. + +--- + +## 3. The normalization pipeline (regex / sed / python) + +Conceptual order — each step's output feeds the next. + +### 3.1 Step 0 — Determine target schema + +Caller-supplied: `--type {movie|tv|anime-seasonal|anime-absolute|musicvideo|standup|extra}`. The +script does not guess. Doc 07's import wrapper picks the type based on +which library tree the file is being moved into. + +### 3.2 Step 1 — Split off extension + +```python +import os + +basename, ext = os.path.splitext(source_filename) +ext = ext.lower().lstrip(".") # canonical lowercase, no leading dot +``` + +Validate: `ext in {"mkv", "mp4", "avi", "webm", "m4v", "srt", "ass", "ssa", "vtt", "sub", "idx"}`. +Anything else → reject with an error; doc 07 quarantines it. + +### 3.3 Step 2 — Extract S<NN>E<MM> (TV / anime-seasonal only) + +```python +import re + +RE_SEASON_EPISODE = re.compile(r"[Ss](\d{1,2})[Ee](\d{1,3})(?:-[Ee]?(\d{1,3}))?") +RE_NXNN = re.compile(r"(?<![\dA-Za-z])(\d{1,2})x(\d{1,3})(?:-(\d{1,3}))?") +RE_LONG_FORM = re.compile(r"Season\s*(\d{1,2})\s*Episode\s*(\d{1,3})", re.I) + +# try the accepted forms in order; the first one that matches wins +m = (RE_SEASON_EPISODE.search(basename) +     or RE_NXNN.search(basename) +     or RE_LONG_FORM.search(basename)) +if m is None: +    raise ValueError(f"no season/episode marker in {basename!r}") +season = f"{int(m.group(1)):02d}" +episode = f"{int(m.group(2)):02d}" +# the long form has no range group, so read group 3 only when the pattern defines it +ep_end = m.group(3) if m.re.groups >= 3 else None +episode_end = f"{int(ep_end):02d}" if ep_end else None +``` + +If no S/E found and `--type tv|anime-seasonal`, error out — the file can +only be normalized if season/episode are recoverable. + +### 3.4 Step 3 — Extract episode title + +After step 2, the matched span is the boundary. 
Episode title is the text +**between** the SxxExx end and the **first** of: `[`, `(`, end-of-string, +group-tag delimiter, end-of-line. + +```python +after_se = basename[m.end():] +# strip any leading separators +title_part = re.split(r"[\[\(]|\s-\s[A-Z][A-Z0-9]+$", after_se, maxsplit=1)[0] +title_part = title_part.strip(" -._") +``` + +If the title-part is empty after strip, leave it empty (script emits no +trailing title — `Show S01E01.mkv` is still canonical when no title is +known). + +### 3.5 Step 4 — Extract series / movie title (from parent folder) + +The **parent folder name** is the source of truth for series/movie title, +not the filename, because torrents commonly have inconsistent +filename-prefixes within the same folder (`Show.S01E01.x264.mkv` vs +`Show Title - S01E02.mkv`). + +```python +parent = os.path.basename(os.path.dirname(source_path)) +# strip group tags and quality from the parent folder too +clean_parent = strip_noise(parent) +# extract year if present +year_match = re.search(r"\((\d{4})\)", clean_parent) +year = year_match.group(1) if year_match else None +title = re.sub(r"\s*\(\d{4}\).*$", "", clean_parent).strip() +``` + +Edge case: parent folder is `Season 01` (TV) — recurse one more level up +to the show folder. The script handles N levels of `Season \d+` parents. 
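
The parent-folder walk described in § 3.5 can be illustrated with a small sketch. It is not the script itself: `show_folder_for`, `title_and_year`, the example path, and the exact season-folder pattern are hypothetical choices made for illustration. It only manipulates path strings and never touches the filesystem.

```python
import re
from pathlib import Path
from typing import Optional, Tuple

# Assumed definition of a season directory: "Season 1", "Season.01", "season_2", ...
SEASON_DIR = re.compile(r"season[\s._]*\d+", re.IGNORECASE)
YEAR_IN_PARENS = re.compile(r"\((\d{4})\)")


def show_folder_for(source_path: Path) -> Path:
    """Walk up from the file until the folder name is not a season directory."""
    folder = source_path.parent
    # The second condition stops the loop at the filesystem root.
    while SEASON_DIR.fullmatch(folder.name) and folder != folder.parent:
        folder = folder.parent
    return folder


def title_and_year(folder_name: str) -> Tuple[str, Optional[str]]:
    """Split 'Title (YYYY) ...' into (title, year); year is None when absent."""
    match = YEAR_IN_PARENS.search(folder_name)
    year = match.group(1) if match else None
    title = re.sub(r"\s*\(\d{4}\).*$", "", folder_name).strip()
    return title, year


if __name__ == "__main__":
    episode = Path("/media/tv/Futurama (1999)/Season 01/Futurama S01E01.mkv")
    folder = show_folder_for(episode)
    print(folder.name, title_and_year(folder.name))
```

Using `fullmatch` rather than a prefix match is a deliberate choice in this sketch: a show whose title merely begins with the word "Season" is then not mistaken for a season directory.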
+ +### 3.6 Step 5 — Detect edition tokens (Movies only) + +After § 2.3 strips edition tags from the noise pool, scan the **original** +basename for canonical edition keywords: + +```python +EDITIONS = { + r"director'?s?[\.\s_-]*cut": "Director's Cut", + r"extended[\.\s_-]*(?:cut|edition)?": "Extended", + r"theatrical(?:[\.\s_-]*cut)?": "Theatrical", + r"final[\.\s_-]*cut": "Final Cut", + r"imax": "IMAX", + r"unrated": "Unrated", + r"remastered?": "Remastered", + r"\bDC\b": "Director's Cut", # DC shorthand + r"\bEE\b": "Extended", # EE shorthand +} +``` + +Match the first one found, in priority order (Director's Cut > Final Cut +> Extended > Theatrical > IMAX > Unrated > Remastered). Emit as +` - <Edition>` between title-year block and extension. + +### 3.7 Step 6 — Collapse, trim, re-emit canonical + +```python +def emit_canonical(schema, parts): + if schema == "movie": + if parts.edition: + return f"{parts.title} ({parts.year}) - {parts.edition}.{parts.ext}" + return f"{parts.title} ({parts.year}).{parts.ext}" + if schema == "tv" or schema == "anime-seasonal": + ep_range = f"S{parts.season}E{parts.episode}" + if parts.episode_end: + ep_range += f"-E{parts.episode_end}" + if parts.episode_title: + return f"{parts.title} ({parts.year}) - {ep_range} - {parts.episode_title}.{parts.ext}" + return f"{parts.title} ({parts.year}) - {ep_range}.{parts.ext}" + if schema == "anime-absolute": + suffix = f" [{parts.subdub}]" if parts.subdub else "" + return f"{parts.title} - {parts.absolute_number} - {parts.episode_title}{suffix}.{parts.ext}" + if schema == "musicvideo": + variant = f" [{parts.variant}]" if parts.variant else "" + return f"{parts.year} - {parts.track_title}{variant}.{parts.ext}" + if schema == "standup": + return f"{parts.performer} - {parts.title} ({parts.year}).{parts.ext}" +``` + +After emission, run § 5.5 forbidden-character substitution, then § 5.6 +double-space collapse, one final time. + +--- + +## 4. 
Folder normalization + +The same rules as filenames, applied to directory names, with a few +schema-specific adjustments. + +### 4.1 Show folder — `<Show> (<Year>)` + +``` +Futurama Season 1 [1080p AI x265 10bit FS99 Joy]/ → Futurama (1999)/ +The Office US S01-S09 1080p WEB-DL/ → The Office (2005)/ +[YIFY] Inception 2010 1080p BRRip x264/ → Inception (2010)/ ← but this is movies +Cowboy.Bebop.1998.Complete.BluRay.x265.10bit/ → Cowboy Bebop (1998)/ +``` + +Year: derived from the metadata provider (TVDB/TMDB) on first scrape, or +from the user-supplied `--year` flag. If neither is available, +`normalize.py --type tv` errors out and asks for `--year`. Year guessing +from parent-folder-numbers is unsafe (`Star Trek 2009` is the movie, not +the series). + +### 4.2 Season folder — `Season <NN>` + +``` +Season 1/ → Season 01/ +Season1/ → Season 01/ +Season.01/ → Season 01/ +S01/ → Season 01/ +SEASON 1 [1080p WEB Joy]/ → Season 01/ +Season 01 - Pilot Season/ → Season 01/ ← drop subtitle suffixes +Season 01 [BluRay]/ → Season 01/ +Specials/ → Season 00/ +Season 0/ → Season 00/ +Extras/ → Season 00/ ← only if treated-as-specials +``` + +Doc 05 § 2.3 is explicit: `Specials/`, `Season 0/`, `Season Specials/` do +not match the parser. `Season 00` is the only correct form. + +### 4.3 Movie folder — `<Title> (<Year>)` + +Same rules as the filename without the extension. The folder name MUST +byte-for-byte match the filename prefix when multi-version files are +present (doc 05 § 1.2 — Jellyfin requires this). 
+ +``` +[YIFY] Blade Runner 1982 1080p BRRip x264 AAC-RARBG/ → Blade Runner (1982)/ +Blade.Runner.2049.2017.2160p.UHD.BluRay.x265.10bit.HDR.DV.DTS-HD.MA.7.1-FreetheFish/ + → Blade Runner 2049 (2017)/ +``` + +### 4.4 Music-video artist folder — `<Artist>` + +``` +Daft.Punk/ → Daft Punk/ +[Daft Punk]/ → Daft Punk/ +DAFT PUNK Discography/ → Daft Punk/ ← note: "Discography" is dropped; this is video lib not music +``` + +### 4.5 Special-features subfolders + +Inside an item folder, only these subfolder names are recognised by +Jellyfin (doc 05 § 8.2). The normalizer must rename source folders to +the canonical lowercase form: + +``` +BTS/ → behind the scenes/ +Behind-the-Scenes/ → behind the scenes/ +behind_the_scenes/ → behind the scenes/ +Featurettes/ → featurettes/ +DELETED SCENES [Joy]/ → deleted scenes/ +Trailers/ → trailers/ +Interviews/ → interviews/ +Bonus Content/ → extras/ ← catch-all +Bonus_Features/ → extras/ +``` + +**Files inside featurettes/ etc.** keep human-readable titles but get +their group tags stripped: + +``` +Featurettes/Welcome to the World of Tomorrow [1080p Joy].mkv + → featurettes/Welcome to the World of Tomorrow.mkv +``` + +Casing inside the special-features file *itself* uses smart title case +(§ 5.1). + +--- + +## 5. Case + character handling + +### 5.1 Smart title case + +Capitalize every word EXCEPT these "small words" (when not the first or +last word of the title): + +``` +a, an, and, as, at, but, by, for, from, in, into, nor, of, on, or, the, +to, up, vs, vs., via, with, yet +``` + +Words that look like acronyms (`I.B.M.`, `C.I.A.`, `T.M.N.T.`) are +preserved as-is. Roman numerals (`II`, `III`, `IV`, `IX`) are uppercased. 
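
One caveat on the Roman-numeral rule: the `ROMAN_NUMERAL` pattern in § 11 accepts any run of the letters `ivxlcdm`, so ordinary words such as `mix`, `did` or `civil` would be uppercased too. A possible tightening is sketched below. It assumes sequel numbering never exceeds 39 (I to XXXIX); both that bound and the name `is_sequel_numeral` are illustrative and not part of the script.

```python
import re

# Assumption: only I..XXXIX count as sequel numerals. Restricting the alphabet
# to I, V and X keeps words like "mix", "did" and "civil" out.
SEQUEL_NUMERAL = re.compile(r"^(?=[IVX])X{0,3}(?:IX|IV|V?I{0,3})$", re.IGNORECASE)


def is_sequel_numeral(word: str) -> bool:
    """Return True when the word is a well-formed Roman numeral from 1 to 39."""
    return bool(SEQUEL_NUMERAL.match(word))


if __name__ == "__main__":
    for word in ["ii", "IV", "ix", "xxxix", "mix", "did", "civil", "xl"]:
        print(word, is_sequel_numeral(word))
```

Short real words that are also valid numerals in this range (`vi`, `xi`) remain ambiguous; a word list or a provider lookup would be needed to settle those.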
+ +#### Examples + +``` +the lord of the rings the two towers → The Lord of the Rings the Two Towers ← BAD +the lord of the rings: the two towers → The Lord of the Rings - The Two Towers ← GOOD (`:` → ` - `, the second `the` is at start of subtitle, capitalize) +return of the king → Return of the King +star trek ii: the wrath of khan → Star Trek II - The Wrath of Khan +``` + +The subtitle-after-colon special case is important: when a `: ` is +substituted with ` - `, the word after the dash is a new "first word" for +title-casing purposes. The script handles this by re-running the +title-caser on each ` - ` separated chunk. + +Jellyfin's parser is case-insensitive — this is purely for human readers. + +### 5.2 Hyphen / dash normalization + +| Char | Code | Used for | +|---|---|---| +| `-` | U+002D HYPHEN-MINUS | ASCII hyphen, the only canonical form for filenames | +| `–` | U+2013 EN DASH | Forbidden in filenames; replace with `-` | +| `—` | U+2014 EM DASH | Forbidden; replace with `-` | +| `−` | U+2212 MINUS SIGN | Forbidden; replace with `-` | + +Unicode dashes appear from copy-paste of articles (Wikipedia loves the en +dash). They're invisible-ish in `ls`, but they break grep, shell +completion, and SMB transfers. + +``` +Spider–Man (2002).mkv → Spider-Man (2002).mkv +Spider — Man (2002).mkv → Spider - Man (2002).mkv +``` + +### 5.3 Apostrophes / quotes + +| Char | Code | Status | +|---|---|---| +| `'` | U+0027 APOSTROPHE | Canonical; ASCII straight quote | +| `’` | U+2019 RIGHT SINGLE QUOTATION MARK | Forbidden in filenames; replace with `'` | +| `‘` | U+2018 LEFT SINGLE QUOTATION MARK | Forbidden; replace with `'` | +| `"` | U+0022 QUOTATION MARK | Forbidden in filenames (Windows-illegal); strip entirely | +| `“` | U+201C LEFT DOUBLE QUOTATION MARK | Forbidden; strip | +| `”` | U+201D RIGHT DOUBLE QUOTATION MARK | Forbidden; strip | + +Curly quotes break SMB shares (Windows clients see `?` and refuse to open +the file) and break shell escaping in scripts. 
+ +``` +Don't Stop Believin'.mkv ← GOOD +Don’t Stop Believin’.mkv ← BAD (curly), normalize to straight +"It's a Wonderful Life" (1946).mkv ← BAD (double quotes), strip them entirely: +It's a Wonderful Life (1946).mkv ← GOOD +``` + +### 5.4 Diacritics / non-ASCII + +`ext4` is UTF-8 native; Jellyfin's parser is UTF-8 native; the HTTP API +serves UTF-8 happily. **Keep diacritics** when the title's accepted +spelling uses them. + +``` +Amélie (2001)/Amélie (2001).mkv ← GOOD +Pokémon (1997)/Season 01/Pokémon (1997) - S01E01 - Pokémon - I Choose You!.mkv ← GOOD +Léon - The Professional (1994)/Léon - The Professional (1994).mkv ← GOOD +``` + +Doc 05 § 0 rule 4 advises caution: prefer the ASCII title when "well +known" (e.g. `Amelie (2001)` over `Amélie (2001)`). For this deploy with +LAN-only HTTP and `ext4`, full Unicode is safe — but the rule of thumb +remains: if Wikipedia's English page uses the accent, keep it; if not, +drop it. + +**Tested:** Jellyfin's filename matching, `Items?searchTerm=`, and NFO +`<title>` round-trip correctly with `é`, `ñ`, `ü`, `ß`, `ø`, `ł`, `ż`, +`日`, `한` on this deploy. Verified against the Futurama Polish-dubbed +corpus. + +### 5.5 Forbidden-char substitution table + +Windows-illegal: `< > : " / \ | ? *`. Linux itself forbids only `/` and +NUL. Substitute as follows: + +| Char | Substitute | Rationale | +|---|---|---| +| `:` | ` - ` (space-hyphen-space) | Most common in titles (`Star Trek II: The Wrath of Khan`); ` - ` is a clean replacement that title-casing handles | +| `/` | ` and ` | Rare in movie titles; mostly appears in episode-title lists for two-part eps. Avoid if both halves stand on their own. 
| +| `\` | omit | No legitimate use in titles | +| `<` | `(` | Rare; `<` in titles is parenthetical | +| `>` | `)` | Same | +| `\|` | omit (or `-`) | Rare; sometimes in `Tom \| Jerry` style logo-text | +| `?` | omit | Common in `Who Killed the Robber?` — drop the question mark, keep meaning | +| `*` | omit | Rare; usually censored profanity | +| `"` | omit | Per § 5.3 | +| `\0` (NUL) | error | Filesystem hard-block; surface to user | + +#### Examples + +``` +Star Trek II: The Wrath of Khan (1982) → Star Trek II - The Wrath of Khan (1982) +Mr. & Mrs. Smith (2005) → Mr. & Mrs. Smith (2005) (no change; & is fine) +Who Killed the Robber? (1987) → Who Killed the Robber (1987) +Tom & Jerry: The Movie (1992) → Tom & Jerry - The Movie (1992) +``` + +### 5.6 Whitespace canonicalization + +After all substitutions: + +1. Collapse runs of `\s+` to a single space. +2. `strip()` leading/trailing whitespace. +3. Collapse double-`-` (which can result from `Title -- Subtitle`) to + single `-`. +4. Trim trailing punctuation before extension: `Title -.mkv` → `Title.mkv`. + +--- + +## 6. Year disambiguation — concrete examples + +Jellyfin's TMDB/TVDB scrape uses the year in `(YYYY)` to filter +candidates. With multiple titles of the same name, the year is the *only* +disambiguator before falling back to provider IDs. + +### 6.1 Without year — what goes wrong + +Filename: `Cinderella.mkv` (no year, no folder year). + +Jellyfin sends "Cinderella" to TMDB. TMDB returns 12+ matches: +- Cinderella (1950) — Disney animated +- Cinderella (2015) — Disney live action +- Cinderella (2021) — Camila Cabello musical +- Cinderella (1965) — TV special +- Cinderella (1899) — Méliès short + +Jellyfin picks the one with the highest popularity score, which is the +2015 live-action remake. If you wanted 1950, you have to manually edit. + +### 6.2 With year — clean match + +Filename: `Cinderella (1950).mkv` in folder `Cinderella (1950)/`. + +Jellyfin sends `(title=Cinderella, year=1950)` to TMDB. 
TMDB returns the +1950 animated film as the top match with high confidence. Scrape +succeeds first try. + +``` +Movies/Cinderella (1950)/Cinderella (1950).mkv ← TMDB ID 11224 (animated) +Movies/Cinderella (2015)/Cinderella (2015).mkv ← TMDB ID 150689 (live action) +Movies/Cinderella (2021)/Cinderella (2021).mkv ← TMDB ID 587996 (musical) +``` + +### 6.3 Same year — provider ID required + +Filename: `Bad Movie (1980).mkv`. Two films named "Bad Movie" released in +1980 (hypothetical). Year doesn't disambiguate. Add provider ID: + +``` +Movies/Bad Movie (1980) [imdbid-tt0080000]/Bad Movie (1980) [imdbid-tt0080000].mkv +Movies/Bad Movie (1980) [imdbid-tt0080001]/Bad Movie (1980) [imdbid-tt0080001].mkv +``` + +### 6.4 Year on TV shows + +The same logic applies to series: + +``` +TV/The Office (2001)/... ← UK original, BBC +TV/The Office (2005)/... ← US remake, NBC +``` + +Without year, Jellyfin picks one (usually the US one, higher TMDB +popularity). With year, both work side-by-side. + +--- + +## 7. Multi-version handling + +When a single movie has multiple legitimate cuts (Director's Cut, Theatrical, +Extended), or multiple resolutions (2160p HDR + 1080p SDR), Jellyfin groups +them under one item with a "Version" picker in the UI. + +### 7.1 Edition variants + +``` +Movies/Blade Runner (1982)/ +├── Blade Runner (1982).mkv ← default (whichever is "the" version) +├── Blade Runner (1982) - Director's Cut.mkv +├── Blade Runner (1982) - Final Cut.mkv +└── Blade Runner (1982) - Theatrical.mkv +``` + +Jellyfin reads all four files, hashes them, and creates one library item +"Blade Runner (1982)" with four selectable versions. The unlabelled one +shows as "Default". + +### 7.2 Resolution variants + +``` +Movies/Blade Runner 2049 (2017)/ +├── Blade Runner 2049 (2017) - 2160p.mkv +├── Blade Runner 2049 (2017) - 1080p.mkv +└── Blade Runner 2049 (2017) - 720p.mkv +``` + +Resolution labels ending in `p` or `i` sort descending by quality, so the +2160p version is offered first. 
This is the *only* exception to "no +resolution tags in filenames" (§ 1.1). + +### 7.3 Mixed (edition × resolution) + +``` +Movies/Blade Runner 2049 (2017)/ +├── Blade Runner 2049 (2017) - Theatrical 2160p.mkv +├── Blade Runner 2049 (2017) - Theatrical 1080p.mkv +├── Blade Runner 2049 (2017) - Director's Cut 2160p.mkv +└── Blade Runner 2049 (2017) - Director's Cut 1080p.mkv +``` + +This works in Jellyfin 10.10 — all four are grouped, the picker is a +flat list with all four labels visible. Slight UX ugliness but parses +cleanly. Avoid unless you genuinely have both axes of variation. + +### 7.4 What does NOT work + +- Sub-folders for variants: + ``` + Movies/Blade Runner 2049 (2017)/Theatrical/Blade Runner 2049 (2017).mkv ← BREAKS + ``` + Jellyfin treats `Theatrical/` as an unknown extras subfolder and the + inner mkv as nothing. +- Different folder per cut: + ``` + Movies/Blade Runner 2049 (2017) Theatrical/Blade Runner 2049 (2017).mkv + Movies/Blade Runner 2049 (2017) Director's Cut/Blade Runner 2049 (2017).mkv + ``` + This makes them two separate library items, not grouped versions. +- Suffix without space-hyphen-space: + ``` + Blade Runner 2049 (2017).Theatrical.mkv ← BREAKS (no ` - ` separator) + Blade Runner 2049 (2017)-Theatrical.mkv ← BREAKS (no spaces around `-`) + ``` + +--- + +## 8. Special-features filename rules + +Files inside the recognised subfolders (`featurettes/`, `behind the +scenes/`, `deleted scenes/`, `interviews/`, `trailers/`, etc.) follow +these rules: + +1. **Strip group tags** as in § 2.1. +2. **Strip quality / codec / source / audio tags** as in § 2.3. +3. **Smart title case** as in § 5.1. +4. **Forbidden chars substituted** as in § 5.5. +5. **Filename = the human-readable feature title.** No `(year)`, no + `S01E01`. The parent folder type (e.g. `featurettes/`) is the type + marker. +6. Optional: append `-featurette` (or `-trailer`, `-behindthescenes`, + etc.) suffix to be defensive about scraper edge cases. 
Doc 05 § 8.1 + shows this works AND § 8.2 shows the folder method works — using both + is belt-and-braces. + +#### Example + +``` +Featurettes/Welcome to the World of Tomorrow [1080p Joy].mkv + → +featurettes/Welcome to the World of Tomorrow.mkv +``` + +Or, if you want belt-and-braces: + +``` +featurettes/Welcome to the World of Tomorrow-featurette.mkv +``` + +Both parse. Pick **one** style per library and keep it consistent. + +--- + +## 9. Worked example — the live Futurama import + +This is the example the owner asked for. Verified against the live media +tree on nullstone (`/home/user/media/tv/Futurama/Season 01,02,03/`). + +### 9.1 BEFORE (representative source dump) + +``` +/home/admin/Downloads/futrama/ +└── Futurama Season 1 [1080p AI x265 10bit FS99 Joy]/ + ├── Futurama S01E01 Space Pilot 3000 [1080p x265 10bit Joy].mkv + ├── Futurama S01E02 The Series Has Landed [1080p x265 10bit Joy].mkv + ├── Futurama S01E03 I, Roommate [1080p x265 10bit Joy].mkv + ├── Futurama S01E04 Love's Labours Lost in Space [1080p x265 10bit Joy].mkv + ├── Futurama S01E05 Fear of a Bot Planet [1080p x265 10bit Joy].mkv + ├── Futurama S01E06 A Fishful of Dollars [1080p x265 10bit Joy].mkv + ├── Futurama S01E07 My Three Suns [1080p x265 10bit Joy].mkv + ├── Futurama S01E08 A Big Piece of Garbage [1080p x265 10bit Joy].mkv + ├── Futurama S01E09 Hell Is Other Robots [1080p x265 10bit Joy].mkv + └── Featurettes/ + └── Welcome to the World of Tomorrow [1080p Joy].mkv +``` + +Note: doubled-space is real (`Futurama S01E01 Space Pilot 3000 [1080p`). +The rip is from a release group called "Joy" using "FS99" (FastSub +99); "AI" likely means AI-upscaled. None of that is library-relevant. 
+ +### 9.2 AFTER (canonical layout) + +``` +/home/user/media/tv/ +└── Futurama (1999)/ + ├── Season 01/ + │ ├── Futurama (1999) - S01E01 - Space Pilot 3000.mkv + │ ├── Futurama (1999) - S01E02 - The Series Has Landed.mkv + │ ├── Futurama (1999) - S01E03 - I, Roommate.mkv + │ ├── Futurama (1999) - S01E04 - Love's Labours Lost in Space.mkv + │ ├── Futurama (1999) - S01E05 - Fear of a Bot Planet.mkv + │ ├── Futurama (1999) - S01E06 - A Fishful of Dollars.mkv + │ ├── Futurama (1999) - S01E07 - My Three Suns.mkv + │ ├── Futurama (1999) - S01E08 - A Big Piece of Garbage.mkv + │ └── Futurama (1999) - S01E09 - Hell Is Other Robots.mkv + └── featurettes/ + └── Welcome to the World of Tomorrow.mkv +``` + +### 9.3 Per-file rename mapping + +| Before | After | +|---|---| +| `Futurama Season 1 [1080p AI x265 10bit FS99 Joy]/` | `Futurama (1999)/Season 01/` | +| `Futurama S01E01 Space Pilot 3000 [1080p x265 10bit Joy].mkv` | `Futurama (1999) - S01E01 - Space Pilot 3000.mkv` | +| `Futurama S01E02 The Series Has Landed [1080p x265 10bit Joy].mkv` | `Futurama (1999) - S01E02 - The Series Has Landed.mkv` | +| `Futurama S01E04 Love's Labours Lost in Space [1080p x265 10bit Joy].mkv` | `Futurama (1999) - S01E04 - Love's Labours Lost in Space.mkv` | +| `Featurettes/Welcome to the World of Tomorrow [1080p Joy].mkv` | `featurettes/Welcome to the World of Tomorrow.mkv` | + +Notes on specific titles: + +- `I, Roommate` keeps the comma. Comma is legal on `ext4`, on Windows, + and on every modern SMB client. No need to substitute. +- `Love's Labours Lost in Space` keeps the straight ASCII apostrophe. + If the source had a curly `'`, § 5.3 normalizes it. +- `Hell Is Other Robots` — `Is` is capitalized (it's not in the small-words + list — the small-words list excludes `is`/`be`/`am`/`are`). 
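
The per-file mapping in § 9.3 can be reproduced with a few lines. This is a minimal sketch that is independent of `normalize.py`: the function name `canonical_episode_name` is hypothetical, the show title and year are assumed to be supplied by the caller, and no title-casing is applied, so the episode title is kept exactly as the source spells it.

```python
import re

RE_SE = re.compile(r"[Ss](\d{1,2})[Ee](\d{1,3})")


def canonical_episode_name(source_name: str, show: str, year: int) -> str:
    """Map a release-style episode filename to 'Show (Year) - SxxExx - Title.ext'."""
    stem, _, ext = source_name.rpartition(".")
    match = RE_SE.search(stem)
    if match is None:
        raise ValueError(f"no SxxExx token in {source_name!r}")
    season, episode = int(match.group(1)), int(match.group(2))
    # Episode title: text after SxxExx, cut at the first bracketed tag.
    title = re.split(r"[\[\(]", stem[match.end():], maxsplit=1)[0]
    # Collapse runs of whitespace (the doubled space in the source) and trim separators.
    title = re.sub(r"\s+", " ", title).strip(" -._")
    base = f"{show} ({year}) - S{season:02d}E{episode:02d}"
    if title:
        base += f" - {title}"
    return f"{base}.{ext.lower()}"


if __name__ == "__main__":
    source = "Futurama S01E03 I, Roommate [1080p x265 10bit Joy].mkv"
    print(canonical_episode_name(source, "Futurama", 1999))
```

The sketch covers only the `SxxExx` form used in this dump; the `NxM` and `Season N Episode M` fallbacks from § 3.3 are left out to keep it short.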
+ +### 9.4 What the live tree currently has + +Verified via `ssh user@192.168.0.100 'ls /home/user/media/tv/Futurama/'`: + +``` +Season 01 +Season 02 +Season 03 +``` + +The current live deploy uses folder name `Futurama/` (no year) — that's +non-canonical per this doc. The canonical is `Futurama (1999)/`. This is +covered in doc 07's migration plan (rename the folder, then `POST +/Library/Refresh`). Mentioned here as a known drift; not fixed in this +doc. + +--- + +## 10. Idempotency and safety + +The `normalize.py` script in § 11 enforces these: + +1. **No-op on already-canonical input.** When the script's emitted + filename equals the source filename byte-for-byte, it does nothing + and returns exit code 0. Re-running the script on an already-imported + library is safe and free. + +2. **No overwrite without `--force`.** When the target path exists and + is not the source path, the script refuses to move and returns exit + code 2. With `--force`, it moves and the target is overwritten. + Without `--force`, the script suggests a numeric suffix + (`Title (Year) (1).mkv`) and asks for confirmation. + +3. **Default to dry-run.** The script prints what it would do to stdout + and does NOT touch the filesystem unless `--apply` is passed. This is + the inverse of the GNU convention (most tools default to apply, + require `--dry-run` to preview) — chosen because the destructive + case (a wrong rename of 100 files) is much worse than the boring + case (one extra flag). + +4. **Audit log** at `/var/log/jellyfin-imports/<YYYY-MM-DD>.log`. Every + `--apply` run appends: + ``` + 2026-05-08T14:23:11Z RENAME /home/admin/.../Futurama S01E01 ...joy].mkv -> /home/user/media/tv/Futurama (1999)/Season 01/Futurama (1999) - S01E01 - Space Pilot 3000.mkv + ``` + Path is created (`mkdir -p /var/log/jellyfin-imports`) on first run if + missing; user must have write permission. + +5. **No deletes.** The script *moves* (`os.rename` on same FS, `shutil.move` + across FS). 
It never `os.unlink`s. Garbage collection of source folders + (after all files moved) is doc 07's job. + +6. **Atomic per-file.** Each file's rename is one syscall on the same FS; + on a different FS, `shutil.move` does copy-then-unlink which has a + brief window where both source and target exist. The audit log records + the operation regardless. + +7. **Unicode-safe.** All paths handled as `pathlib.Path` (UTF-8 native on + `ext4`). Curly-quote → straight-quote substitution happens BEFORE the + target path is computed, so the target path is always ASCII-safe-ish + (still UTF-8 for legitimate accents). + +--- + +## 11. Reference implementation — `normalize.py` + +Drop this at `/opt/docker/jellyfin/scripts/normalize.py` on nullstone. +Run with Python 3.10+. Stdlib only — no external deps. + +```python +#!/usr/bin/env python3 +""" +normalize.py — canonical filename normalizer for tv.s8n.ru + +Per /tmp/jellyfin-stack/docs/08-filename-normalization.md. +Safe by default: dry-run, no overwrite, no delete. 
+""" + +from __future__ import annotations + +import argparse +import datetime as dt +import os +import re +import shutil +import sys +import unicodedata +from dataclasses import dataclass, field +from pathlib import Path +from typing import Optional + +LOG_DIR = Path("/var/log/jellyfin-imports") + +# --- Stripping rules (doc § 2) ------------------------------------------------- + +GROUP_TAG_PATTERNS = [ + re.compile(r"\[[^\[\]]*\b(YIFY|YTS(\.\w+)?|RARBG|ettv|eztv|GalaxyRG\d*|" + r"FitGirl|FitGirl\s*Repack|NOGRP|QxR|FreetheFish|psa|PSA|CMRG|" + r"d3g|STRiFE|Pahe\.in|FoV|NTb|YOLO|KOGi|playWEB|REQ|XBET|FLUX|" + r"NOSiVID|BGT|SVA|CRiMSON|ION10|ION265|BluPanda|H4S5S|Joy|" + r"FS99\s*Joy|FS99|AI\s*x265|x265\s*\d+bit|\d+bit\s*x265)" + r"[^\[\]]*\]", re.I), + re.compile(r"\((YIFY|RARBG|NOGRP)\)", re.I), +] + +QUALITY_TOKENS = re.compile( + r"(?<![A-Za-z0-9])(" + r"2160p|1080p|720p|480p|360p|4[Kk]|UHD|HD|SD|FHD|QHD|" + r"WEB-DL|WEBDL|WEB\.DL|WEB|WEBRip|WEB-Rip|BluRay|BLURAY|Bluray|BDRip|" + r"BRRip|BR-Rip|BDR|HDTV|HDTVRip|PDTV|DSR|DVDRip|DVD|DVDR|DVD9|DVD5|" + r"HDDVD|HDDVDRip|HDRip|CAMRip|CAM|TS|HDTS|TC|TELESYNC|TELECINE|R5|" + r"SCREENER|SCR|WORKPRINT|WP|PPV|PPVRip|" + r"x264|x265|H\.?264|H\.?265|HEVC|AVC|VP9|AV1|XviD|DivX|" + r"10bit|10-bit|8bit|8-bit|HDR10\+?|HDR|DV|Dolby\.?Vision|SDR|HFR|HQ|" + r"DDP?5\.1|DDP?7\.1|DDP?2\.0|DD\+5\.1|DD\+7\.1|DTS-HD\.MA|DTS-HD|DTS-X|" + r"DTSX|DTS|TrueHD|Atmos|AAC2\.0|AAC5\.1|AAC|AC3|AC-3|EAC3|E-AC3|" + r"MP3|MP2|Opus|FLAC|PCM|LPCM|5\.1|7\.1|2\.0|Mono|Stereo|Multi|" + r"PROPER|REPACK|iNTERNAL|INTERNAL|LIMITED|UNCUT|RERIP|REAL|READNFO|" + r"RETAi?L|STV|REMUX|MULTi|MULTI|SUBBED|DUBBED|iNT" + r")(?![A-Za-z0-9])", re.I) + +URL_REF = re.compile( + r"(?:^|[\s\[\(\.\-_])(WWW\.[A-Z0-9\-]+\.[A-Z]{2,4})(?:[\s\]\)\.\-_]|$)", + re.I) + +TRAILING_GROUP = re.compile(r"-(?:NOGRP|EVO|RARBG|SPARKS|CMRG|NTb|FLUX|AMZN|" + r"NF|DSNP|ATVP|MA|WEB|AAC2|FoV|KOGi|PLAYWEB|FRDS|" + r"ZQ|PHOENiX|EZTV|NTG|iON|ION10|ION265|CtrlHD|" + 
r"d3g|PSA|QxR|RZeroX|PMP|BTN|DEFLATE|BAE|MZABI|" + r"TURG|Joy)\b", re.I) + +LANG_TOKEN = re.compile(r"(?<![A-Za-z])\.?(en|eng|pl|pol|de|deu|fr|fra|es|spa|" + r"it|ita|ja|jpn|jp|ru|rus|ko|kor|zh|chi)(?![A-Za-z])", + re.I) + +# Forbidden chars (§ 5.5) +FORBIDDEN_CHARS = { + ":": " - ", + "/": " and ", + "\\": "", + "<": "(", + ">": ")", + "|": "", + "?": "", + "*": "", + '"': "", + "“": "", # left double quotation mark + "”": "", # right double quotation mark +} + +# Apostrophe normalization (§ 5.3) +APOSTROPHES = { + "‘": "'", + "’": "'", +} + +# Dashes (§ 5.2) +DASHES = { + "–": "-", # en dash + "—": "-", # em dash + "−": "-", # minus +} + +# Editions (§ 3.6) +EDITION_PATTERNS = [ + (re.compile(r"director'?s?[\.\s_-]*cut", re.I), "Director's Cut"), + (re.compile(r"final[\.\s_-]*cut", re.I), "Final Cut"), + (re.compile(r"extended[\.\s_-]*(?:cut|edition)?", re.I), "Extended"), + (re.compile(r"theatrical(?:[\.\s_-]*cut)?", re.I), "Theatrical"), + (re.compile(r"\bIMAX\b", re.I), "IMAX"), + (re.compile(r"\bunrated\b", re.I), "Unrated"), + (re.compile(r"remastere?d?", re.I), "Remastered"), + (re.compile(r"(?<![A-Za-z])DC(?![A-Za-z])"), "Director's Cut"), + (re.compile(r"(?<![A-Za-z])EE(?![A-Za-z])"), "Extended"), +] + +# Smart title case (§ 5.1) +SMALL_WORDS = {"a", "an", "and", "as", "at", "but", "by", "for", "from", + "in", "into", "nor", "of", "on", "or", "the", "to", "up", + "vs", "vs.", "via", "with", "yet"} +ROMAN_NUMERAL = re.compile(r"^[ivxlcdmIVXLCDM]+$") + + +def smart_title(s: str) -> str: + """Title-case respecting small-words and roman numerals.""" + if not s: + return s + chunks = re.split(r"(\s-\s)", s) # split on space-dash-space (subtitle) + out_chunks = [] + for chunk in chunks: + if chunk == " - ": + out_chunks.append(chunk) + continue + words = chunk.split(" ") + result = [] + for i, w in enumerate(words): + if not w: + result.append(w) + continue + if ROMAN_NUMERAL.match(w): + result.append(w.upper()) + continue + lower = w.lower() + if 0 < i < 
len(words) - 1 and lower in SMALL_WORDS: + result.append(lower) + else: + # capitalize but preserve internal apostrophes/dots + result.append(w[0].upper() + w[1:].lower() if w else w) + out_chunks.append(" ".join(result)) + return "".join(out_chunks) + + +def strip_noise(s: str) -> str: + """Remove group tags, quality, urls, trailing groups.""" + for pat in GROUP_TAG_PATTERNS: + s = pat.sub("", s) + s = URL_REF.sub(" ", s) + s = QUALITY_TOKENS.sub("", s) + s = TRAILING_GROUP.sub("", s) + return s + + +def normalize_chars(s: str) -> str: + """Apply Unicode/forbidden-char substitutions.""" + for k, v in APOSTROPHES.items(): + s = s.replace(k, v) + for k, v in DASHES.items(): + s = s.replace(k, v) + for k, v in FORBIDDEN_CHARS.items(): + s = s.replace(k, v) + # NFC normalization for diacritics (consistent encoding) + s = unicodedata.normalize("NFC", s) + return s + + +def collapse_whitespace(s: str) -> str: + s = re.sub(r"\s+", " ", s) + s = re.sub(r" - - ", " - ", s) + s = re.sub(r"--+", "-", s) + s = s.strip(" -._") + return s + + +# --- Schema-specific extraction ------------------------------------------------ + +@dataclass +class Parts: + title: str = "" + year: Optional[str] = None + season: Optional[str] = None + episode: Optional[str] = None + episode_end: Optional[str] = None + episode_title: str = "" + edition: Optional[str] = None + provider_id: Optional[str] = None + ext: str = "mkv" + absolute_number: Optional[str] = None + subdub: Optional[str] = None + track_title: str = "" + variant: Optional[str] = None + performer: str = "" + + +RE_SE = re.compile(r"[Ss](\d{1,2})[Ee](\d{1,3})(?:-[Ee]?(\d{1,3}))?") +RE_NXM = re.compile(r"(?<![\dA-Za-z])(\d{1,2})x(\d{1,3})(?:-(\d{1,3}))?") +RE_SEASON_EP = re.compile(r"Season\s*(\d{1,2})\s*Episode\s*(\d{1,3})", re.I) +RE_YEAR_PARENS = re.compile(r"\((\d{4})\)") +RE_PROVIDER_ID = re.compile(r"\[(?:imdbid|tmdbid|tvdbid)-[^\]]+\]") + + +def extract_year(s: str) -> Optional[str]: + m = RE_YEAR_PARENS.search(s) + if m: + y = 
int(m.group(1)) + if 1888 <= y <= dt.date.today().year + 2: + return m.group(1) + return None + + +def extract_provider_id(s: str) -> Optional[str]: + m = RE_PROVIDER_ID.search(s) + return m.group(0) if m else None + + +def extract_se(s: str): + m = RE_SE.search(s) + if m: + end = m.group(3) or None + return (m, m.group(1), m.group(2), end) + m = RE_NXM.search(s) + if m: + return (m, m.group(1), m.group(2), m.group(3)) + m = RE_SEASON_EP.search(s) + if m: + return (m, m.group(1), m.group(2), None) + return (None, None, None, None) + + +def extract_edition(raw_basename: str) -> Optional[str]: + for pat, name in EDITION_PATTERNS: + if pat.search(raw_basename): + return name + return None + + +def parent_show_folder(p: Path) -> Path: + """Walk up past Season XX folders until we find the show folder.""" + cur = p.parent + while re.match(r"(?i)season\s*\d+|specials|extras", cur.name): + cur = cur.parent + return cur + + +# --- Per-schema emit ----------------------------------------------------------- + +def normalize_movie(src: Path, year_hint: Optional[str] = None, + title_hint: Optional[str] = None) -> Path: + raw = src.stem + ext = src.suffix.lower().lstrip(".") or "mkv" + edition = extract_edition(raw) + provider_id = extract_provider_id(raw) or extract_provider_id(src.parent.name) + cleaned = strip_noise(raw) + cleaned = normalize_chars(cleaned) + cleaned = collapse_whitespace(cleaned) + year = year_hint or extract_year(cleaned) or extract_year(src.parent.name) + if year: + cleaned = re.sub(r"\s*\(" + year + r"\)", "", cleaned).strip() + # drop edition tokens from the title body (we re-emit them) + for pat, _ in EDITION_PATTERNS: + cleaned = pat.sub("", cleaned) + cleaned = collapse_whitespace(cleaned) + title = title_hint or smart_title(cleaned) + if not year: + raise ValueError(f"cannot determine year for movie: {src}") + folder_name = f"{title} ({year})" + if provider_id: + folder_name += f" {provider_id}" + file_basename = folder_name + if edition: + 
file_basename += f" - {edition}" + return src.parent.parent / folder_name / f"{file_basename}.{ext}" + + +def normalize_tv(src: Path, year_hint: Optional[str] = None, + title_hint: Optional[str] = None, + schema: str = "tv") -> Path: + raw = src.stem + ext = src.suffix.lower().lstrip(".") or "mkv" + m, season, ep, ep_end = extract_se(raw) + if not season: + raise ValueError(f"no S/E token in TV file: {src}") + season = f"{int(season):02d}" + episode = f"{int(ep):02d}" + episode_end = f"{int(ep_end):02d}" if ep_end else None + # episode title = text after match, before next bracket + after = raw[m.end():] if hasattr(m, "end") else "" + title_part = re.split(r"[\[\(]", after, maxsplit=1)[0] + title_part = strip_noise(title_part) + title_part = normalize_chars(title_part) + title_part = collapse_whitespace(title_part) + title_part = re.sub(r"^[\s\-_\.]+", "", title_part) + episode_title = smart_title(title_part) if title_part else "" + # show title from parent folder + show_folder = parent_show_folder(src) + show_clean = strip_noise(show_folder.name) + show_clean = normalize_chars(show_clean) + show_clean = collapse_whitespace(show_clean) + year = year_hint or extract_year(show_clean) or extract_year(src.parent.name) + if year: + show_clean = re.sub(r"\s*\(" + year + r"\).*$", "", show_clean).strip() + show_clean = re.sub(r"(?i)\s*Season\s*\d+.*$", "", show_clean).strip() + show = title_hint or smart_title(show_clean) + if not year: + raise ValueError(f"cannot determine year for TV show: {show_folder}") + se_str = f"S{season}E{episode}" + if episode_end: + se_str += f"-E{episode_end}" + file_base = f"{show} ({year}) - {se_str}" + if episode_title: + file_base += f" - {episode_title}" + target_root = show_folder.parent # e.g. 
/media/tv + return target_root / f"{show} ({year})" / f"Season {season}" / f"{file_base}.{ext}" + + +def normalize_anime_absolute(src: Path, title_hint: Optional[str], + abs_num: Optional[int], + ep_title: str = "", + subdub: Optional[str] = None) -> Path: + ext = src.suffix.lower().lstrip(".") or "mkv" + show_folder = parent_show_folder(src) + show_clean = strip_noise(show_folder.name) + show_clean = normalize_chars(show_clean) + show = title_hint or smart_title(collapse_whitespace(show_clean)) + if abs_num is None: + raise ValueError(f"absolute number required for {src}") + suffix = f" [{subdub}]" if subdub else "" + title_str = smart_title(ep_title) if ep_title else "" + file_base = f"{show} - {abs_num:04d}" + if title_str: + file_base += f" - {title_str}" + file_base += suffix + return show_folder.parent / show / f"{file_base}.{ext}" + + +def normalize_musicvideo(src: Path, artist_hint: str, year_hint: str, + track_hint: Optional[str] = None, + variant: Optional[str] = None) -> Path: + ext = src.suffix.lower().lstrip(".") or "mp4" + raw = src.stem + cleaned = normalize_chars(strip_noise(raw)) + cleaned = collapse_whitespace(cleaned) + track = track_hint or smart_title(cleaned) + artist = smart_title(artist_hint) + suffix = f" [{variant}]" if variant else "" + return src.parent.parent / artist / f"{year_hint} - {track}{suffix}.{ext}" + + +def normalize_standup(src: Path, performer: str, title: str, year: str) -> Path: + ext = src.suffix.lower().lstrip(".") or "mkv" + folder = f"{performer} - {title} ({year})" + return src.parent.parent / folder / f"{folder}.{ext}" + + +# --- Driver -------------------------------------------------------------------- + +def is_already_canonical(src: Path, target: Path) -> bool: + return src.resolve() == target.resolve() + + +def log_op(action: str, src: Path, target: Path): + LOG_DIR.mkdir(parents=True, exist_ok=True) + log_file = LOG_DIR / f"{dt.date.today().isoformat()}.log" + ts = dt.datetime.utcnow().isoformat() + "Z" + line 
= f"{ts} {action} {src} -> {target}\n" + with log_file.open("a") as f: + f.write(line) + + +def main(): + ap = argparse.ArgumentParser(description="canonical filename normalizer") + ap.add_argument("source", type=Path, help="source file path") + ap.add_argument("--type", required=True, + choices=["movie", "tv", "anime-seasonal", + "anime-absolute", "musicvideo", "standup", + "extra"]) + ap.add_argument("--year") + ap.add_argument("--title") + ap.add_argument("--performer", help="for standup") + ap.add_argument("--artist", help="for musicvideo") + ap.add_argument("--track", help="for musicvideo") + ap.add_argument("--variant", help="for musicvideo") + ap.add_argument("--abs-num", type=int, help="for anime-absolute") + ap.add_argument("--ep-title", help="for anime-absolute") + ap.add_argument("--subdub", choices=["Sub", "Dub"], help="for anime-absolute") + ap.add_argument("--apply", action="store_true", + help="actually move the file (default is dry-run)") + ap.add_argument("--force", action="store_true", + help="overwrite existing target") + args = ap.parse_args() + + src = args.source.resolve() + if not src.exists(): + print(f"ERROR: {src} does not exist", file=sys.stderr) + sys.exit(1) + + try: + if args.type == "movie": + target = normalize_movie(src, args.year, args.title) + elif args.type == "tv": + target = normalize_tv(src, args.year, args.title, schema="tv") + elif args.type == "anime-seasonal": + target = normalize_tv(src, args.year, args.title, schema="anime") + elif args.type == "anime-absolute": + target = normalize_anime_absolute(src, args.title, args.abs_num, + args.ep_title or "", + args.subdub) + elif args.type == "musicvideo": + target = normalize_musicvideo(src, args.artist or "", args.year or "", + args.track, args.variant) + elif args.type == "standup": + target = normalize_standup(src, args.performer or "", + args.title or "", args.year or "") + else: + print(f"ERROR: schema '{args.type}' not implemented", file=sys.stderr) + sys.exit(2) + except 
ValueError as e: + print(f"ERROR: {e}", file=sys.stderr) + sys.exit(2) + + if is_already_canonical(src, target): + print(f"NOOP {src}") + sys.exit(0) + + if target.exists() and not args.force: + print(f"REFUSE {src} -> {target} (target exists; use --force)") + sys.exit(2) + + if args.apply: + target.parent.mkdir(parents=True, exist_ok=True) + shutil.move(str(src), str(target)) + log_op("RENAME", src, target) + print(f"MOVED {src} -> {target}") + else: + print(f"DRY-RUN {src} -> {target}") + + +if __name__ == "__main__": + main() +``` + +### 11.1 Usage examples + +```bash +# Dry-run a single Futurama episode (the source path carries no year, so --year is required) +./normalize.py --type tv --year 1999 \ + "/home/admin/Downloads/futrama/Futurama Season 1 [1080p AI x265 10bit FS99 Joy]/Futurama S01E01 Space Pilot 3000 [1080p x265 10bit Joy].mkv" + +# Output: +# DRY-RUN /home/admin/Downloads/.../Futurama S01E01 Space Pilot 3000 [1080p x265 10bit Joy].mkv +# -> /home/admin/Downloads/futrama/Futurama (1999)/Season 01/Futurama (1999) - S01E01 - Space Pilot 3000.mkv + +# Same with --apply, adding an explicit title hint +./normalize.py --type tv --year 1999 --title "Futurama" --apply \ + "/home/admin/Downloads/futrama/Futurama Season 1 [1080p AI x265 10bit FS99 Joy]/Futurama S01E01 Space Pilot 3000 [1080p x265 10bit Joy].mkv" + +# Movie with edition +./normalize.py --type movie --year 1982 --apply \ + "/home/admin/Downloads/Blade Runner 1982 Final Cut [1080p BluRay x265 RARBG].mkv" + +# Stand-up +./normalize.py --type standup --performer "Bo Burnham" --title "Inside" --year 2021 --apply \ + "/home/admin/Downloads/Bo.Burnham.Inside.2021.1080p.NF.WEB-DL.DDP5.1.x264-NTb.mkv" + +# Music video +./normalize.py --type musicvideo --artist "Daft Punk" --year 2013 \ + --track "Get Lucky" --apply \ + "/home/admin/Downloads/daft.punk.get.lucky.official.video.1080p.mkv" +``` + +### 11.2 Idempotency proof + +Running the script twice on the same input produces the same target. 
The +second run's source = first run's target, so `is_already_canonical()` +returns true, and the script no-ops. Unit tests for this are planned but +not yet written (`/opt/docker/jellyfin/scripts/test_normalize.py`, to be +added in doc 07's implementation phase). + +--- + +## 12. Edge cases catalogue + +### 12.1 Episodes with very long titles + +``` +The Office (2005) - S07E25-E26 - Search Committee.mkv ← multi-ep, short title, fine +Sherlock (2010) - S04E03 - The Final Problem.mkv ← long-ish, fine +Steins;Gate (2011) - S01E22 - Being Meltdown - The Concerto Whose Conductor Has Lost His Baton.mkv +``` + +The third example is about 94 chars before the extension. `ext4` allows +255 bytes per filename component; this fits. Smart title case applied; no +`:` (the title has no colon; the long string is the actual title from +MyAnimeList). If a title has a colon, it becomes ` - ` per § 5.5, which +lengthens the name slightly; no length cap is applied. + +### 12.2 Episodes with `.` in the title + +``` +Mr. Robot (2015) - S01E01 - eps1.0_hellofriend.mov.mkv ← title contains `.mov` +``` + +`.mov` inside the title is technically a substring that *looks* like a +container type. The parser doesn't care (the extension is `.mkv`, parsed +last). Keep as-is. Smart title case leaves the intentional lowercase +formatting intact (it's the title's actual stylization). + +### 12.3 Shows with numeric titles + +``` +1923 (2022) - S01E01 - 1923.mkv ← year-as-title, year-as-disambiguation +24 (2001) - S01E01 - Day 1 - 12-00 AM-1-00 AM.mkv ← `:` from title became ` - ` +``` + +The `24` / `1923` cases would fail year extraction if the show year is +omitted. Year hint via `--year` is mandatory for these. + +### 12.4 Two-part single episodes (multi-part files) + +Doc 05 § 2 mentions `Series A S02E03 Part 1.mkv` / `Part 2.mkv`. 
Canonical: + +``` +TV/Show (Year)/Season 02/Show (Year) - S02E03 - Title - part 1.mkv +TV/Show (Year)/Season 02/Show (Year) - S02E03 - Title - part 2.mkv +``` + +Use lowercase `part` (Jellyfin parser is case-insensitive but lowercase +is more common in docs). + +### 12.5 Source has no episode title + +``` +Source: Show.S01E01.1080p.WEB-DL.x264-NTb.mkv + +Target: Show (Year) - S01E01.mkv +``` + +Empty episode title → omit. The script does this already (§ 11: +`normalize_tv()` appends the title only `if episode_title`). Jellyfin will +backfill the title from TVDB on first scrape. + +### 12.6 Source has WRONG episode title + +If the rip's episode title is different from TVDB's canonical (e.g. a +Polish translation of an English-language show, or a non-canonical +sub-group title), prefer the **TVDB title** (English, official). This +requires manual intervention: `--ep-title "Canonical Title"` is only +honoured for `--type anime-absolute` in the current script, so for TV +and seasonal anime edit the filename after the rename. Not automated. + +### 12.7 Dual-audio (sub+dub in one file) + +If the mkv has both audio tracks, omit the `[Sub]`/`[Dub]` suffix: + +``` +Anime/One Piece/One Piece - 0001 - I'm Luffy.mkv ← dual audio in container +``` + +The user can pick the audio track from the player. The filename only +needs to disambiguate when *separate files* exist. + +### 12.8 Mid-season hiatus / split seasons + +Some shows split a single season into "Part 1" and "Part 2" (Better Call +Saul, Stranger Things). Treat as **one season**: + +``` +TV/Stranger Things (2016)/Season 04/ +├── Stranger Things (2016) - S04E01 - The Hellfire Club.mkv ← Vol 1 +├── ... +├── Stranger Things (2016) - S04E07 - The Massacre at Hawkins Lab.mkv ← Vol 1 finale +├── Stranger Things (2016) - S04E08 - Papa.mkv ← Vol 2 start +└── Stranger Things (2016) - S04E09 - The Piggyback.mkv ← Vol 2 finale +``` + +TVDB lists S04 as one season, episodes 1-9. The hiatus is invisible to +the parser. Don't create `Season 04 Part 1/`. + +--- + +## 13. 
Verification checklist (doc 07 will use this) + +Before declaring a normalized file "imported": + +1. Filename matches the canonical regex for its category (§ 1). +2. No forbidden chars (§ 5.5) in any part of the path. +3. No group tags / quality / codec / source / audio tags in the basename + (§ 2). +4. Folder structure matches § 1.x for the category. +5. Year is in `(YYYY)` and matches the actual release year (movies/TV). +6. `Season NN/` is zero-padded (TV / anime-seasonal). +7. Episode S/E numbers zero-padded to two digits (three for >99). +8. Smart title case applied to all title-bearing components. +9. Apostrophes are ASCII (`'`), dashes are ASCII (`-`). +10. Diacritics in NFC form (UTF-8 encoded canonically). +11. The script's `is_already_canonical()` returns true on the result — + re-running the normalizer leaves the file untouched. +12. Audit log line written to `/var/log/jellyfin-imports/<date>.log`. + +If any check fails, the file is quarantined per doc 07 to a `_pending/` +subtree for manual review. + +--- + +## 14. Quick reference card (for the operator) + +| Category | Canonical shape | Example | +|---|---|---| +| Movie | `Movies/T (Y)/T (Y).mkv` | `Movies/Inception (2010)/Inception (2010).mkv` | +| Movie+edition | `Movies/T (Y)/T (Y) - E.mkv` | `Movies/Blade Runner (1982)/Blade Runner (1982) - Final Cut.mkv` | +| Movie+resolution | `Movies/T (Y)/T (Y) - NNNNp.mkv` | `Movies/Blade Runner 2049 (2017)/Blade Runner 2049 (2017) - 2160p.mkv` | +| TV episode | `TV/S (Y)/Season NN/S (Y) - SXXEYY - Title.mkv` | `TV/Futurama (1999)/Season 01/Futurama (1999) - S01E01 - Space Pilot 3000.mkv` | +| TV multi-ep | `... - SXXEYY-EZZ - Title.mkv` | `Futurama (1999) - S01E03-E04 - I, Roommate / Love's Labours.mkv` | +| TV special | `... /Season 00/... 
- S00EYY - Title.mkv` | `Futurama (1999) - S00E01 - Bender's Big Score.mkv` | +| Anime seasonal | same as TV | `Cowboy Bebop (1998) - S01E01 - Asteroid Blues.mkv` | +| Anime absolute | `Anime/S/S - NNNN - Title [Sub].mkv` | `One Piece - 0001 - I'm Luffy [Sub].mkv` | +| Music video | `MV/A/Y - T.mp4` | `Daft Punk/2013 - Get Lucky.mp4` | +| Stand-up | `Movies/P - T (Y)/P - T (Y).mkv` | `Bo Burnham - Inside (2021)/Bo Burnham - Inside (2021).mkv` | +| Extra (folder) | `<item folder>/<lowercase folder>/Title.mkv` | `featurettes/Welcome to the World of Tomorrow.mkv` | +| Extra (suffix) | `... - Title-featurette.mkv` | `Inception (2010) - Dreams Within Dreams-featurette.mkv` | +| Subtitle | `<basename>.<lang>[.flag].srt` | `Futurama (1999) - S01E01.eng.srt` | + +--- + +## 15. Cross-references + +- Doc 05 § 0 — top-level filename rules (forbidden chars, year-in-parens, + one folder per item). +- Doc 05 § 1.2 — Jellyfin's accepted movie regex. +- Doc 05 § 2.2 — Jellyfin's accepted TV regex (table of patterns). +- Doc 05 § 3.1–3.3 — anime numbering strategies (which we map to § 1.3 + and § 1.4 here). +- Doc 05 § 8 — extras folder names (which we lowercase per § 4.5). +- Doc 03 — sidecar subtitle naming (referenced in § 2.7 and § 14). +- Doc 02 — what the scraper does after the rename, including the + `RemoteSearch/Apply` recipe to fix mis-matches. +- Doc 07 (sibling) — the operational pipeline (move, dedupe, GC) that + consumes this ruleset. When doc 07 lands, link from § 13's + verification checklist into doc 07's quarantine / re-run flow. + +--- + +## 16. Open items / known drift + +- Live `/home/user/media/tv/Futurama/` lacks the year — should be + `Futurama (1999)/`. Migration covered in doc 07. +- The script's TV-title-extraction does not yet handle parent folders + named `Specials` (mapping to `Season 00`). Workaround: rename the + folder first, then run normalize. Codify in v2. +- Edition detection priority list has been chosen by frequency-of-rip, + not by canon. 
If a future Blade Runner gets a "Workprint Edition" + release, the list grows. +- No automated tests for `normalize.py` yet — covered by doc 07 once + that doc lands. + +--- + +End of doc 08. The script in § 11 is the canonical source of truth; this +doc explains it. When in doubt, run `normalize.py --help` and read the +top docstring.
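Appendix-style sketch for checklist item 1 in § 13: the pattern below checks a TV basename against the `S (Y) - SXXEYY[-EZZ] - Title.ext` shape from the § 14 quick reference card. The names `CANONICAL_TV` and `is_canonical_tv` are hypothetical and not part of `normalize.py`, and the accepted extension list is an assumption. It only tests the overall shape, not title casing, forbidden characters, or leftover tags.

```python
import re

# Hypothetical helper, NOT part of normalize.py. Checks only the shape of a
# TV basename as listed in the quick reference card (section 14).
CANONICAL_TV = re.compile(
    r"""^(?P<show>.+?)\ \((?P<year>\d{4})\)      # Show (Year)
        \ -\ S(?P<season>\d{2,})E(?P<ep>\d{2,})  # " - SXXEYY", zero-padded
        (?:-E(?P<ep_end>\d{2,}))?                # optional "-EZZ" for multi-ep
        (?:\ -\ (?P<title>.+))?                  # optional " - Title"
        \.(?P<ext>mkv|mp4|avi)$                  # extension list is an assumption
    """,
    re.VERBOSE,
)


def is_canonical_tv(basename: str) -> bool:
    """Return True if the basename has the canonical TV episode shape."""
    return CANONICAL_TV.match(basename) is not None


if __name__ == "__main__":
    for name in (
        "Futurama (1999) - S01E01 - Space Pilot 3000.mkv",
        "Futurama S01E01 Space Pilot 3000 [1080p x265 10bit Joy].mkv",
    ):
        print(is_canonical_tv(name), name)
```

A check like this could run after `--apply` as a cheap guard before the fuller checklist; it is deliberately permissive about the title text.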