s8n 1a6a697afd Add pre-import cleanup + filename normalization rulesets

- 07-pre-import-cleanup: 1002-line ruleset for stripping non-media junk before
  files land in /home/user/media/. Catalogs 10 categories (codec promo,
  group brag, promo images, OS thumb caches, samples, sub leftovers, torrent
  residue, proof folders, multi-disc cruft, Win executables). NFO discriminator
  uses 4096-byte head + XML-root regex (covers prologue case the brief 100-byte
  version misses). 15 auto-delete security categories (.exe/.msi/.bat/.scr/...);
  threat model = friend clicking 'Download original' then running on Win.
  Verified extras folders against Jellyfin docs (lowercase 'featurettes',
  'behind the scenes', etc.). Includes idempotent dry-run-default
  cleanup-import.sh that quarantines first, returns staging path on stdout.

- 08-filename-normalization: 1853-line normative renaming ruleset.
  Canonical: 'Show (Year) - SXXEXX - Title.ext' for TV; '<Title> (<Year>).ext'
  for movies; 'Show - NNNN - Title [Sub|Dub].ext' for absolute-numbered anime.
  Strips group tags ([YIFY]/[RARBG]/[FS99 Joy]/[GalaxyRG]), resolution
  (1080p/2160p/4K), codec (x264/x265/HEVC/10bit), source (WEB-DL/BluRay/HDTV),
  audio (DTS-HD.MA/Atmos/5.1/AAC), release-process (PROPER/REPACK/INTERNAL),
  trailing -NOGRP/-RARBG/-EVO, URL refs, basename language tokens.
  Includes stdlib-only normalize.py: dry-run default, --apply commits,
  --force overwrites, audit log to /var/log/jellyfin-imports/<date>.log,
  idempotent. Worked Futurama before/after; flags drift on live tree
  (current 'Futurama/' lacks '(1999)').

2026-05-08 02:07:11 +01:00


08 — Filename & Folder Normalization Ruleset (tv.s8n.ru)

Last updated: 2026-05-08
Server: Jellyfin 10.10.3 on nullstone, container jellyfin
Library root inside container: /media
Library root on host: /home/user/media

This document is the normative ruleset for renaming downloaded media into a canonical, predictable, group-tag-free shape before it lands in the live library tree. It is the layer between "torrent dump" and "file ready for the scanner".

Cross-links:

05-file-structure-rules.md — what Jellyfin's parser accepts; this doc picks one of the accepted forms and locks it in.
07-cleanup-and-imports.md — the operational pipeline (move, dedupe, garbage collect) that consumes this ruleset. Doc 08 defines what canonical looks like; doc 07 defines how to apply it.
02-metadata-and-titles.md — what Jellyfin does after the rename (parse, scrape, lock).
03-subtitles.md — sidecar .srt / .ass naming (referenced from § 2.7 below).

Status of this doc: specification + reference implementation. The normalize.py script in § 11 is canonical. Anything not codified by the script is documentation only — when the doc and the script disagree, the script wins, and the doc gets fixed.

0. Why a normalization ruleset (and why now)

Doc 05 establishes that Jellyfin's parser is permissive: dots, dashes, underscores, and spaces are interchangeable; S01E01, s01e01, 1x01, and Season 1 Episode 1 all parse to the same thing. That permissiveness is great for getting Jellyfin to scrape a torrent dump, but it is a disaster for operating a library at scale:

Search becomes noisy. SMB / Syncthing / Dolphin search across mixed patterns surfaces irrelevant matches (S01E01 vs 1x01 vs s01.e01).
Diff / audit / dedupe scripts get harder. Every regex needs to handle N forms. The cleanup pass (doc 07) is dramatically cheaper if every file in the tree obeys one shape.
Visual scan in ls becomes unreadable when half the filenames have [1080p AI x265 10bit FS99 Joy] glued on and the other half don't.
Future migrations (Plex, Kodi, mobile sync to a Win/Mac client) all have stricter parsers than Jellyfin. The strictest sane shape that Jellyfin accepts is also the most portable. Pay the cost once.
Cross-platform safety. This deploy is Linux-only today, but the workspace's Syncthing setup (see ai-lab SYSTEM.md) implies future sync to Win/Mac clients. Choose Windows-safe filenames now and never touch this again.

The cost of the ruleset is one Python script and discipline at import time. Both are bounded. The cost of not having one compounds with every new release.

1. Canonical formats — what the tree must look like

This is the lock-in. One shape per category. No alternatives. No "but my release group did it differently".

1.1 Movies

Movies/<Title> (<Year>)/<Title> (<Year>).<ext>
Movies/<Title> (<Year>)/<Title> (<Year>) - <Edition>.<ext>      (when edition matters)
Movies/<Title> (<Year>) [<provider-id>]/<Title> (<Year>) [<provider-id>].<ext>  (when ambiguous)

<Title> — smart title case (§ 5.1), forbidden chars stripped (§ 5.5).
<Year> — first theatrical-release year, in parens, single space before (. Mandatory in this deploy (doc 05 § 0 rule 5), even when the title is unique.
<Edition> — when present, exactly one of: Director's Cut, Extended, Theatrical, IMAX, Unrated, Final Cut, Remastered. Anything else (e.g. Snyder Cut, Workprint, 4K Remaster) is admissible only with a written justification in the import log; otherwise normalize to the closest of the seven canonical labels above.
<provider-id> — imdbid-tt0123456 / tmdbid-12345 / tvdbid-12345 in square brackets. Optional unless year-based disambiguation isn't enough (§ 6.3).
<ext> — lowercase: mkv, mp4, webm, avi. (mkv is the rip default; mp4 is the streaming-original default.) Never uppercase .MKV, .MP4.

Forbidden in the filename: resolution tags (1080p, 2160p, 720p, 4K), codec tags (x264, x265, h264, h265, HEVC, AVC), source tags (WEB, WEB-DL, BluRay, BRRip, HDTV, DVDRip, WEBRip), audio tags (AAC, AC3, DTS, DTS-HD.MA, 5.1, 7.1, Atmos, Opus), bitness/HDR tags (10bit, 8bit, HDR, DV, SDR), release tags (PROPER, REPACK, INTERNAL, LIMITED, RERIP), language tags (MULTi, DUBBED, SUBBED), group tags ([YIFY], [RARBG], [FS99 Joy], -NOGRP, -EVO, -SPARKS), and website refs (WWW.YIFY-TORRENTS.COM, RARBG.txt-derived names).

Justification — why no resolution/codec tag:

Jellyfin reads stream attributes (resolution, codec, bit-depth, HDR, audio codec) directly from the file via ffprobe on every scan. The web UI displays them. The mobile clients display them. The transcoder picks based on them. The filename contributes zero new information. Including those tags pollutes search results, breaks the byte-exact folder-vs-file match required for multi-version movies (doc 05 § 1.2), and forces readers to skim past noise to reach the title. The only exception is Movie (Year) - 1080p.mkv as the multi-version label when two distinct rips of the same movie are kept in the same folder (e.g. Blade Runner 2049 (2017) - 2160p.mkv next to Blade Runner 2049 (2017) - 1080p.mkv). In that exact case, the resolution IS the disambiguation token. Otherwise, no.

Examples

Movies/Blade Runner (1982)/Blade Runner (1982).mkv
Movies/Blade Runner (1982)/Blade Runner (1982) - Final Cut.mkv
Movies/Blade Runner (1982)/Blade Runner (1982) - Director's Cut.mkv
Movies/Blade Runner 2049 (2017)/Blade Runner 2049 (2017) - 2160p.mkv
Movies/Blade Runner 2049 (2017)/Blade Runner 2049 (2017) - 1080p.mkv
Movies/Dune (1984) [imdbid-tt0087182]/Dune (1984) [imdbid-tt0087182].mkv
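
As a cross-check on the shape above, a path can be validated with one anchored regex. This is a minimal sketch, not taken from normalize.py: the names `CANONICAL_MOVIE` and `is_canonical_movie_path` are hypothetical, and a resolution label is accepted only as the multi-version exception described in the justification.

```python
import re

# Hypothetical checker for the § 1.1 movie shape (not part of normalize.py).
EDITIONS = ("Director's Cut", "Extended", "Theatrical", "IMAX",
            "Unrated", "Final Cut", "Remastered")
# Edition labels plus the multi-version resolution label (e.g. 2160p).
_LABELS = "|".join(re.escape(e) for e in EDITIONS) + r"|\d{3,4}p"
_PROVIDER = r"(?: \[(?:imdbid-tt\d+|tmdbid-\d+|tvdbid-\d+)\])?"

CANONICAL_MOVIE = re.compile(
    r"^Movies/(?P<stem>[^/]+ \(\d{4}\)" + _PROVIDER + r")"
    # The filename must repeat the folder name exactly (backreference).
    r"/(?P=stem)(?: - (?:" + _LABELS + r"))?\.(?:mkv|mp4|webm|avi)$"
)

def is_canonical_movie_path(path: str) -> bool:
    """Return True when a library-relative path obeys the movie shape."""
    return CANONICAL_MOVIE.match(path) is not None
```

The backreference `(?P=stem)` is what enforces the folder-vs-file match; an uppercase extension or a leftover release tag makes the check fail.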

1.2 TV shows

TV/<Show> (<Year>)/Season <NN>/<Show> (<Year>) - S<NN>E<MM> - <Episode Title>.<ext>
TV/<Show> (<Year>)/Season <NN>/<Show> (<Year>) - S<NN>E<MM>-E<MM2> - <Episode Title>.<ext>
TV/<Show> (<Year>)/Season 00/<Show> (<Year>) - S00E<MM> - <Special Title>.<ext>

<Show> — smart title case, no provider-id in show folder unless the scraper picks the wrong show twice in a row (then add [tvdbid-NNNN]).
<Year> — series first-air year, mandatory even when title is unique (doc 05 § 0 rule 5; this deploy convention is stricter than upstream permissive parsing).
<NN> — zero-padded two digits. Season 01, not Season 1. S01, not S1.
<MM> — zero-padded two digits. Three digits permissible only for shows that exceed 99 episodes per season (rare; e.g. some daily anime). See doc 05 § 3.1.
<Episode Title> — title from the metadata provider (TVDB/TMDB) with smart title case. Required for human readability; Jellyfin overwrites it during scrape but the file basename is what humans see in ls.
Multi-episode files: S<NN>E<MM>-E<MM2> — single hyphen, no spaces. Verified parsing per doc 05 § 2.2 table.

Examples

TV/Futurama (1999)/Season 01/Futurama (1999) - S01E01 - Space Pilot 3000.mkv
TV/Futurama (1999)/Season 01/Futurama (1999) - S01E03-E04 - I, Roommate and Love's Labours Lost in Space.mkv
TV/Futurama (1999)/Season 00/Futurama (1999) - S00E01 - Bender's Big Score.mkv
TV/The Office (2005)/Season 02/The Office (2005) - S02E01 - The Dundies.mkv
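
The TV shape can be read back into its parts with a single regex, which is handy for audit scripts. The sketch below is illustrative only; `CANONICAL_TV` and `parse_canonical_tv` are hypothetical names, and the extension list is assumed to match § 1.1.

```python
import re

# Hypothetical parser for the § 1.2 filename shape (basename only, no folders).
CANONICAL_TV = re.compile(
    r"^(?P<show>.+) \((?P<year>\d{4})\) - "
    r"S(?P<season>\d{2})E(?P<ep>\d{2,3})(?:-E(?P<ep_end>\d{2,3}))?"
    r"(?: - (?P<title>.+))?\.(?P<ext>mkv|mp4|webm|avi)$"
)

def parse_canonical_tv(filename: str):
    """Return the named parts as a dict, or None if the name is not canonical."""
    m = CANONICAL_TV.match(filename)
    return m.groupdict() if m else None
```

A name that parses cleanly is already canonical; anything returning `None` is a candidate for the normalizer.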

Why this shape (not the slimmer `Show S01E01.mkv`)

Doc 05 § 2.2 shows three accepted patterns:

Futurama (1999) S01E01.mkv
Futurama (1999) S01E01 - Space Pilot 3000.mkv
Futurama (1999) - S01E01 - Space Pilot 3000.mkv      ← canonical for this deploy

The third form (with the leading - before S01E01 and the title) is chosen because:

The leading dash visually separates the series-name block from the episode-id block. Important when the show's title contains spaces and numbers (Star Trek The Next Generation S01E01) — without the dash, the eye trips over Generation S01E01.
Symmetric with the Movies multi-version pattern (Title (Year) - <Label>). One mental model for the whole library.
Follows the same structure as the Sonarr-style rename pattern ({Series Title} - S{season:00}E{episode:00} - {Episode Title}), which means the naming pattern is well-trodden and tooling-friendly.

1.3 Anime — seasonal numbering (TVDB-style)

Same shape as TV (§ 1.2). Mandatory year. Mandatory Season NN. No absolute numbers.

Anime/<Show> (<Year>)/Season <NN>/<Show> (<Year>) - S<NN>E<MM> - <Episode Title>.<ext>

Examples

Anime/Cowboy Bebop (1998)/Season 01/Cowboy Bebop (1998) - S01E01 - Asteroid Blues.mkv
Anime/Mushishi (2005)/Season 02/Mushishi (2005) - S02E01 - The Sleeping Mountain.mkv
Anime/Steins;Gate (2011) [tvdbid-244061]/Season 01/Steins;Gate (2011) [tvdbid-244061] - S01E01 - Turning Point.mkv

(; is legal on ext4 but flagged in § 5.5 as risky for portability — prefer Steins-Gate if portability matters.)

1.4 Anime — absolute numbering

Used only for shows >99 episodes that don't fit the seasonal model (One Piece, Naruto, Detective Conan, Bleach). For those shows, the canonical shape is:

Anime/<Show>/<Show> - <NNNN> - <Episode Title> [<Sub|Dub>].<ext>

No (<Year>) on the show folder — absolute-numbering shows are usually unique by name; if not, fall back to a provider ID (Doraemon (1979) [tvdbid-71603], then revert to seasonal Pattern 1.3).
<NNNN> — zero-padded four digits (deterministic; all known long-runners stay below 9999). Three-digit padding (099) is wrong; four-digit padding (0099) is right, because it leaves headroom for the longest-running shows.
[<Sub|Dub>] — exactly one of [Sub] or [Dub]. Required for any release where both audio tracks are not embedded in one mkv. If the release contains both audio tracks in one container, omit the bracket.
No Season NN folder. Absolute numbering puts every episode in the show root.

Deterministic absolute-numbering rule

Absolute number = the episode's position in the broadcast order as listed by AniDB's "main" episode list for that show. NOT the dub broadcast order, NOT a re-cut/remaster renumbering. For shows with discrepancies between AniDB and TVDB absolute numbering (rare), AniDB wins — that's the provider that absolute-numbering plugins (and Shoko) use.

Examples

Anime/One Piece/One Piece - 0001 - I'm Luffy! The Man Who's Gonna Be King of the Pirates! [Sub].mkv
Anime/One Piece/One Piece - 0001 - I'm Luffy! The Man Who's Gonna Be King of the Pirates! [Dub].mkv
Anime/Naruto/Naruto - 0001 - Enter Naruto Uzumaki [Sub].mkv
Anime/Detective Conan/Detective Conan - 1099 - The Detective's Vacation [Sub].mkv

Caveat

Naive Jellyfin without Shoko will mis-handle episodes >99 (doc 05 § 3.3). This is a known issue; pick one of:

Run Shoko (doc 05 § 3.2). Filenames don't matter for Shoko — but obey this ruleset anyway, for human readability and for the day Shoko goes away.
Re-bucket by TVDB seasons. Most long-runners have a TVDB season split (One Piece S01-S22). Use § 1.3 with the seasons.

This deploy currently does NOT run Shoko; it currently does NOT host any absolute-numbered anime. The shape in § 1.4 is reserved for the day Shoko gets installed. Leave it documented.

1.5 Music videos

MusicVideos/<Artist>/<Year> - <Track Title>.<ext>
MusicVideos/<Artist>/<Year> - <Track Title> [<Variant>].<ext>     (when multiple cuts exist)

<Artist> — smart title case, comma-separated for collabs (Daft Punk, Pharrell Williams).
<Year> — release year of the video, not the song. Songs older than their videos are common (a 2024 acoustic cover gets the 2024 year).
<Track Title> — smart title case.
<Variant> — optional, [Live], [Acoustic], [Remix], [Alternate], [Lyric Video]. Forbidden: [1080p], [Official], [HD].

Music videos do not use (<Year>) parens because the library is musicvideos CollectionType, which has no scraper (doc 05 § 5.3) and the year is purely cosmetic.

Examples

MusicVideos/Daft Punk/2013 - Get Lucky.mp4
MusicVideos/Daft Punk/2013 - Get Lucky [Lyric Video].mp4
MusicVideos/Pink Floyd/1995 - Comfortably Numb [Live].mkv
MusicVideos/Daft Punk, Pharrell Williams/2013 - Get Lucky.mp4

For full live concerts (>20 min, multi-song), file under Movies instead, per doc 05 § 5.4.

1.6 Stand-up specials (Movies-typed)

Stand-up lives in the Movies library (doc 05 § 4). Folder + filename are prefixed with the performer name; treat the whole <Performer> - <Title> as the canonical "movie title" for parser purposes.

Movies/<Performer> - <Title> (<Year>)/<Performer> - <Title> (<Year>).<ext>

Examples

Movies/Bo Burnham - Inside (2021)/Bo Burnham - Inside (2021).mkv
Movies/Hannah Gadsby - Nanette (2018) [imdbid-tt8465676]/Hannah Gadsby - Nanette (2018) [imdbid-tt8465676].mkv
Movies/Norm Macdonald - Nothing Special (2022)/Norm Macdonald - Nothing Special (2022).mkv

The <Performer> - prefix is mandatory for stand-up. Without it, the title alone (Inside (2021)) ambiguously matches the 2007 horror film Inside, the 2023 thriller Inside, or the 2017 documentary Inside. The prefix gives TMDB enough disambiguation to land on the correct record without a provider-id override.

2. What to STRIP from a source filename — exhaustive list

This is the substring inventory. The script in § 11 implements all of these. The list grew from sampling ~200 distinct release-group filenames across [YIFY], [RARBG], [ettv], [GalaxyRG], [FS99 Joy], [NOGRP], [FitGirl], and the Futurama corpus on disk.

2.1 Group tags (square / round brackets)

Match anything inside [...] or (...) that does not look like a year. Year detection: 4 digits, 1900 ≤ N ≤ current year + 2.

Exemplar substrings (case-insensitive):

[1080p AI x265 10bit FS99 Joy]
[YIFY]
[YTS]
[YTS.MX]
[YTS.AG]
[YTS.AM]
[RARBG]
[ettv]
[eztv]
[GalaxyRG]
[GalaxyRG265]
[FitGirl]
[FitGirl Repack]
[NOGRP]
[QxR]
[FreetheFish]
[psa]
[PSA]
[CMRG]
[d3g]
[STRiFE]
[Pahe.in]
[FoV]
[NTb]
[YOLO]
[KOGi]
[playWEB]
[REQ]
[XBET]
[FLUX]
[NOSiVID]
[BGT]
[SVA]
[CRiMSON]
[ION10]
[ION265]
[BluPanda]
[H4S5S]
[5.1]
(YIFY)
(RARBG)
(NOGRP)

2.2 Trailing release-group dashes

Pattern: -<UPPERCASE_TOKEN> at the very end of the basename (before extension). Matches:

-NOGRP
-EVO
-RARBG
-SPARKS
-CMRG
-NTb
-FLUX
-AMZN
-NF
-DSNP
-ATVP
-MA
-WEB
-AAC2
-FoV
-KOGi
-PLAYWEB
-FRDS
-ZQ
-PHOENiX
-EZTV
-NTG
-iON
-ION10
-ION265
-CtrlHD
-d3g
-PSA
-QxR
-RZeroX
-PMP
-BTN
-DEFLATE
-BAE
-MZABI
-TURG

The pattern -[A-Z][A-Z0-9]{1,15}$ (after stripping bracket tags and quality tags) captures most of these. The script in § 11 uses an allow-list approach instead of a pattern, because release groups sometimes exceed 15 chars and sometimes use mixed case.
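
A list-based check avoids the false positives a bare pattern would produce on hyphenated titles. The sketch below is an assumption about how such a check could look, not the script's actual code; `KNOWN_GROUPS` is abbreviated and `strip_trailing_group` is a hypothetical name.

```python
import re

# Abbreviated sample of the § 2.2 list; the real list is longer.
KNOWN_GROUPS = {"NOGRP", "EVO", "RARBG", "SPARKS", "NTb", "FoV",
                "KOGi", "PHOENiX", "RZeroX", "d3g", "CtrlHD"}
_KNOWN_LOWER = {g.lower() for g in KNOWN_GROUPS}
TRAILING_TOKEN = re.compile(r"-([A-Za-z0-9]+)$")

def strip_trailing_group(stem: str) -> str:
    """Remove a trailing -GROUP only when GROUP is a known release group."""
    m = TRAILING_TOKEN.search(stem)
    if m and m.group(1).lower() in _KNOWN_LOWER:
        return stem[:m.start()].rstrip(" .-_")
    return stem
```

Because the lookup is by name rather than by shape, a title such as `Spider-Man` passes through untouched.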

2.3 Quality / codec / source / audio tags

Strip all of these as standalone tokens (whitespace-, dot-, dash-, or underscore-bounded), case-insensitive:

Resolution / aspect:

2160p  1080p  720p  480p  360p  4K  4k  UHD  HD  SD  FHD  QHD

Source:

WEB-DL  WEBDL  WEB.DL  WEB  WEBRip  WEB-Rip  BluRay  BLURAY  Bluray  BDRip
BRRip  BR-Rip  BDR  HDTV  HDTVRip  PDTV  DSR  DVDRip  DVD  DVDR  DVD9  DVD5
HDDVD  HDDVDRip  HDRip  CAMRip  CAM  TS  HDTS  TC  TELESYNC  TELECINE  R5
SCREENER  SCR  WORKPRINT  WP  PPV  PPVRip

Codec / container hints (in name):

x264  x265  H.264  H264  H.265  H265  HEVC  AVC  VP9  AV1  XviD  DivX
10bit  10-bit  8bit  8-bit  HDR  HDR10  HDR10+  DV  DolbyVision  Dolby.Vision
SDR  HFR  HQ

Audio:

DD5.1  DDP5.1  DD7.1  DDP7.1  DD2.0  DD+5.1  DD+7.1  DTS  DTS-HD  DTS-HD.MA
DTS-X  DTSX  TrueHD  Atmos  AAC  AAC2.0  AAC5.1  AC3  AC-3  EAC3  E-AC3
MP3  MP2  Opus  FLAC  PCM  LPCM  5.1  7.1  2.0  Mono  Stereo  Multi

Release-process tags:

PROPER  REPACK  iNTERNAL  INTERNAL  LIMITED  EXTENDED.CUT  UNCUT  THEATRiCAL
RERIP  REAL  READNFO  RETAiL  RETAIL  STV  DC  COMPLETE  REMUX  REMASTERED
SUBBED  DUBBED  MULTi  MULTI  SUB  DUB  ENG  ENGLISH  POL  POLISH  iNT  iNTERNAL

Note: EXTENDED.CUT, THEATRiCAL, UNRATED, IMAX, DIRECTORS.CUT, FINAL.CUT, REMASTERED, UNCUT, DC (= Director's Cut shorthand), EE (= Extended Edition shorthand) are kept as edition tokens — see § 3.6. Strip them from the noise pool, then re-emit them as - <Edition> if present.
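
Token-bounded, case-insensitive stripping can be sketched with one compiled alternation. This is a minimal illustration with an abbreviated token list; `NOISE_TOKENS` and `strip_noise_tokens` are hypothetical names. Sorting longest-first matters so that `DTS-HD.MA` is tried before `DTS` and `WEB-DL` before `WEB`.

```python
import re

# Abbreviated sample of the § 2.3 inventory.
NOISE_TOKENS = ["2160p", "1080p", "720p", "UHD", "WEB-DL", "WEB", "BluRay",
                "HDTV", "REMUX", "x264", "x265", "H264", "H.264", "HEVC",
                "10bit", "HDR", "DTS-HD.MA", "DTS", "DDP5.1", "5.1", "AAC"]

_alternation = "|".join(
    re.escape(t) for t in sorted(NOISE_TOKENS, key=len, reverse=True)
)
# A token counts only when it is not glued to other letters or digits.
NOISE_RE = re.compile(
    rf"(?<![A-Za-z0-9])(?:{_alternation})(?![A-Za-z0-9])", re.IGNORECASE
)

def strip_noise_tokens(name: str) -> str:
    """Replace each standalone noise token with a space; separators stay."""
    return NOISE_RE.sub(" ", name)
```

The function deliberately leaves stray dots and dashes behind; they are collapsed by the later cleanup passes.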

2.4 Source-specific cruft

Common compound suffixes that are not single tokens:

WEB.h264-NiXON[rartv]
WEB-DL.DDP5.1.x264-NTb
BDRip.x265.10bit-RZeroX
HDTV.x264-PHOENiX
1080p.WEB.h264-NiXON
2160p.UHD.BluRay.REMUX.HDR.HEVC.DTS-HD.MA.5.1

These are ad-hoc concatenations; once the standalone tokens above are stripped, what remains is the title plus stray separators. The pipeline in § 3 collapses separators last, so order matters.

2.5 Whitespace / punctuation cleanup

After substring removal, run these passes:

Pass	From	To
Collapse runs of spaces	`Show   Title  S01E01`	`Show Title S01E01`
Trim leading/trailing whitespace	`  Show.mkv  `	`Show.mkv`
Collapse double-underscore	`Show__Title`	`Show Title`
Replace dot-separators with space (basename only)	`Show.Title.S01E01`	`Show Title S01E01`
Drop stray punctuation runs	`Show --- Title`	`Show - Title`
Strip trailing dashes/dots before ext	`Show -.mkv`	`Show.mkv`

The dot-to-space substitution is applied only when the dot sits between alphanumeric tokens. Audio channel counts such as 5.1 are already removed in § 2.3, so they never reach this pass. A source named Mr.Robot becomes Mr Robot: the dot turns into a space, and the canonical form has no dot.
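
The passes in the table can be chained in one small function. This is a sketch under the assumption that it receives the stem (the basename without its extension); `cleanup_separators` is a hypothetical name and the pass order is one reasonable choice, not necessarily the script's.

```python
import re

def cleanup_separators(stem: str) -> str:
    """Apply the § 2.5 whitespace / punctuation passes to an extension-less stem."""
    # Dot between two alphanumerics becomes a space (Show.Title -> Show Title).
    s = re.sub(r"(?<=[A-Za-z0-9])\.(?=[A-Za-z0-9])", " ", stem)
    # Underscore runs become a single space.
    s = re.sub(r"_+", " ", s)
    # Runs of two or more dashes (with optional spaces) become one spaced dash.
    s = re.sub(r"\s*-[\s-]*-\s*", " - ", s)
    # Collapse whitespace, then trim stray separators at both ends.
    s = re.sub(r"\s+", " ", s)
    return s.strip(" .-_")
```

A single hyphen inside a word (`Spider-Man`) is left alone because the dash pass requires at least two dashes.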

2.6 URL / website refs

Match and remove:

WWW.YIFY-TORRENTS.COM
WWW.YTS.MX
WWW.RARBG.TO
RARBG.txt
www.yify-torrents.com

These appear as bracket prefixes ([WWW.YIFY-TORRENTS.COM] Movie...), suffixes (Movie - WWW.YIFY-TORRENTS.COM.mkv), or as RARBG.txt-style sidecar files (which doc 07 garbage-collects, not us).

Pattern (case-insensitive): (?:^|[\s\[$\.\-_])(WWW\.[A-Z0-9\-]+\.[A-Z]{2,4})(?:[\s\]$\.\-_]|$) → strip whole match.

2.7 Language indicators in the BASE name

.pl, .eng, .en, .pol, .de, .fr, .es, .it, .ja, .jp, .ru, .ko, .zh appearing in the video filename (basename, not extension). These belong on subtitle sidecars only, per doc 03.

Futurama.s01e01.pl.mkv             ← BAD (`.pl` in video basename)
Futurama (1999) - S01E01.mkv       ← GOOD (audio language is a stream attribute)
Futurama (1999) - S01E01.pl.srt    ← GOOD (subtitle sidecar with lang)
Futurama (1999) - S01E01.eng.srt   ← GOOD

Detection: 2- or 3-letter ISO-639 code as a token between dots / dashes / underscores in the basename. If found, drop it from the basename. If a sidecar .srt exists with the same lang token, leave the sidecar alone — it's already correctly named.

If the source file is a .srt / .ass / .vtt / .sub, the lang token is part of the canonical sidecar form and must NOT be stripped. The script's --type subtitle mode handles this branch.
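
The video-versus-sidecar branch can be sketched as below. It is a narrower illustration than the rule above: it only inspects the token immediately before the extension, and `LANG_CODES`, `SUBTITLE_EXTS`, and `drop_lang_token` are hypothetical names.

```python
import re

LANG_CODES = {"pl", "eng", "en", "pol", "de", "fr", "es", "it",
              "ja", "jp", "ru", "ko", "zh"}
SUBTITLE_EXTS = {"srt", "ass", "vtt", "sub"}

def drop_lang_token(filename: str) -> str:
    """Strip a trailing language token from video names; leave sidecars alone."""
    stem, dot, ext = filename.rpartition(".")
    if not dot or ext.lower() in SUBTITLE_EXTS:
        return filename  # sidecar: the lang token is part of the canonical form
    m = re.search(r"[._-]([A-Za-z]{2,3})$", stem)
    if m and m.group(1).lower() in LANG_CODES:
        stem = stem[:m.start()]
    return f"{stem}.{ext}"
```

One known hazard of any code-based detection: a title ending in a word such as `It` after a dot would be treated as a language token, so a dry run is worth reviewing before committing.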

3. The normalization pipeline (regex / sed / python)

Conceptual order — each step's output feeds the next.

3.1 Step 0 — Determine target schema

Caller-supplied: --type {movie|tv|anime-seasonal|anime-absolute|musicvideo|standup|extra}. The script does not guess. Doc 07's import wrapper picks the type based on which library tree the file is being moved into.

3.2 Step 1 — Split off extension

basename, ext = os.path.splitext(source_filename)
ext = ext.lower().lstrip(".")  # canonical lowercase, no leading dot

Validate: ext in {"mkv", "mp4", "avi", "webm", "m4v", "srt", "ass", "ssa", "vtt", "sub", "idx"}. Anything else → reject with an error; doc 07 quarantines it.

3.3 Step 2 — Extract SE (TV / anime-seasonal only)

import re

RE_SEASON_EPISODE = re.compile(r"[Ss](\d{1,2})[Ee](\d{1,3})(?:-[Ee]?(\d{1,3}))?")
RE_NXNN = re.compile(r"(?<![\dA-Za-z])(\d{1,2})x(\d{1,3})(?:-(\d{1,3}))?")
RE_LONG_FORM = re.compile(r"Season\s*(\d{1,2})\s*Episode\s*(\d{1,3})", re.I)

# try the SxxExx form first, then the alternative forms, before giving up
m = None
for pattern in (RE_SEASON_EPISODE, RE_NXNN, RE_LONG_FORM):
    m = pattern.search(basename)
    if m:
        break
if m is None:
    raise ValueError(f"no season/episode marker in {basename!r}")

# the long form has no end-episode group, so pad the tuple before unpacking
season_raw, episode_raw, end_raw = (m.groups() + (None,))[:3]
season = f"{int(season_raw):02d}"
episode = f"{int(episode_raw):02d}"
episode_end = f"{int(end_raw):02d}" if end_raw else None

If no S/E found and --type tv|anime-seasonal, error out — the file can only be normalized if season/episode are recoverable.

3.4 Step 3 — Extract episode title

After step 2, the matched span is the boundary. Episode title is the text between the SxxExx end and the first of: [, (, a trailing group-tag delimiter, or end-of-string.

after_se = basename[m.end():]
# cut at the first bracket / paren, or at a trailing " - GROUP" token
title_part = re.split(r"[\[\(]|\s-\s[A-Z][A-Z0-9]+$", after_se, maxsplit=1)[0]
# strip leading and trailing separators
title_part = title_part.strip(" -._")

If the title-part is empty after strip, leave it empty (script emits no trailing title — Show (Year) - S01E01.mkv is still canonical when no title is known).

3.5 Step 4 — Extract series / movie title (from parent folder)

The parent folder name is the source of truth for series/movie title, not the filename, because torrents commonly have inconsistent filename-prefixes within the same folder (Show.S01E01.x264.mkv vs Show Title - S01E02.mkv).

parent = os.path.basename(os.path.dirname(source_path))
# strip group tags and quality from the parent folder too
clean_parent = strip_noise(parent)
# extract year if present
year_match = re.search(r"\((\d{4})\)", clean_parent)
year = year_match.group(1) if year_match else None
title = re.sub(r"\s*\(\d{4}\).*$", "", clean_parent).strip()

Edge case: parent folder is Season 01 (TV) — recurse one more level up to the show folder. The script handles N levels of Season \d+ parents.
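
The "walk up past Season folders" edge case can be sketched with pathlib. The names `SEASON_DIR` and `show_folder_for` are hypothetical, and the sketch assumes POSIX-style paths as used on the host.

```python
import re
from pathlib import PurePosixPath

# Matches "Season 01", "Season1", "season.01" and similar folder names.
SEASON_DIR = re.compile(r"^season[\s._-]*\d+", re.IGNORECASE)

def show_folder_for(source_path: str) -> str:
    """Return the nearest ancestor folder name that is not a Season folder."""
    folder = PurePosixPath(source_path).parent
    # Climb while the current folder looks like a Season directory.
    while SEASON_DIR.match(folder.name):
        folder = folder.parent
    return folder.name
```

The loop terminates at the filesystem root because the root's name is an empty string, which never matches.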

3.6 Step 5 — Detect edition tokens (Movies only)

After § 2.3 strips edition tags from the noise pool, scan the original basename for canonical edition keywords:

# Insertion order is the match priority (Director's Cut first).
EDITIONS = {
    r"director'?s?[\.\s_-]*cut": "Director's Cut",
    r"\bDC\b": "Director's Cut",      # DC shorthand
    r"final[\.\s_-]*cut": "Final Cut",
    r"extended[\.\s_-]*(?:cut|edition)?": "Extended",
    r"\bEE\b": "Extended",            # EE shorthand
    r"theatrical(?:[\.\s_-]*cut)?": "Theatrical",
    r"imax": "IMAX",
    r"unrated": "Unrated",
    r"remaster(?:ed)?": "Remastered",
}

Match the first one found, in priority order (Director's Cut > Final Cut > Extended > Theatrical > IMAX > Unrated > Remastered). Emit as - <Edition> between title-year block and extension.
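
One way to apply the priority order is to walk an ordered list and return on the first hit. The sketch below is illustrative: `EDITION_PRIORITY` and `detect_edition` are hypothetical names, and treating the DC / EE shorthands as case-sensitive while the long forms are case-insensitive is an assumption, not something the ruleset states.

```python
import re

# Ordered by priority; the first pattern that matches wins.
EDITION_PRIORITY = [
    (re.compile(r"director'?s?[.\s_-]*cut", re.I), "Director's Cut"),
    (re.compile(r"\bDC\b"), "Director's Cut"),   # shorthand, case-sensitive (assumed)
    (re.compile(r"final[.\s_-]*cut", re.I), "Final Cut"),
    (re.compile(r"extended(?:[.\s_-]*(?:cut|edition))?", re.I), "Extended"),
    (re.compile(r"\bEE\b"), "Extended"),         # shorthand, case-sensitive (assumed)
    (re.compile(r"theatrical(?:[.\s_-]*cut)?", re.I), "Theatrical"),
    (re.compile(r"imax", re.I), "IMAX"),
    (re.compile(r"unrated", re.I), "Unrated"),
    (re.compile(r"remaster(?:ed)?", re.I), "Remastered"),
]

def detect_edition(basename: str):
    """Return the canonical edition label for a source basename, or None."""
    for pattern, label in EDITION_PRIORITY:
        if pattern.search(basename):
            return label
    return None
```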

3.7 Step 6 — Collapse, trim, re-emit canonical

def emit_canonical(schema, parts):
    if schema == "movie":
        if parts.edition:
            return f"{parts.title} ({parts.year}) - {parts.edition}.{parts.ext}"
        return f"{parts.title} ({parts.year}).{parts.ext}"
    if schema == "tv" or schema == "anime-seasonal":
        ep_range = f"S{parts.season}E{parts.episode}"
        if parts.episode_end:
            ep_range += f"-E{parts.episode_end}"
        if parts.episode_title:
            return f"{parts.title} ({parts.year}) - {ep_range} - {parts.episode_title}.{parts.ext}"
        return f"{parts.title} ({parts.year}) - {ep_range}.{parts.ext}"
    if schema == "anime-absolute":
        suffix = f" [{parts.subdub}]" if parts.subdub else ""
        return f"{parts.title} - {parts.absolute_number} - {parts.episode_title}{suffix}.{parts.ext}"
    if schema == "musicvideo":
        variant = f" [{parts.variant}]" if parts.variant else ""
        return f"{parts.year} - {parts.track_title}{variant}.{parts.ext}"
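    if schema == "standup":
        return f"{parts.performer} - {parts.title} ({parts.year}).{parts.ext}"
    # any other schema (e.g. "extra") has no emitter here; fail loudly instead of returning None
    raise ValueError(f"no canonical emitter for schema {schema!r}")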

After emission, run § 5.5 forbidden-character substitution, then § 5.6 double-space collapse, one final time.

4. Folder normalization

The same rules as filenames, applied to directory names, with a few schema-specific adjustments.

4.1 Show folder — `<Show> (<Year>)`

Futurama Season 1  [1080p AI x265 10bit FS99 Joy]/   →   Futurama (1999)/
The Office US S01-S09 1080p WEB-DL/                    →   The Office (2005)/
[YIFY] Inception 2010 1080p BRRip x264/                →   Inception (2010)/      ← but this is movies
Cowboy.Bebop.1998.Complete.BluRay.x265.10bit/           →   Cowboy Bebop (1998)/

Year: derived from the metadata provider (TVDB/TMDB) on first scrape, or from the user-supplied --year flag. If neither is available, normalize.py --type tv errors out and asks for --year. Year guessing from parent-folder-numbers is unsafe (Star Trek 2009 is the movie, not the series).

4.2 Season folder — `Season <NN>`

Season 1/                         →   Season 01/
Season1/                          →   Season 01/
Season.01/                        →   Season 01/
S01/                              →   Season 01/
SEASON 1 [1080p WEB Joy]/         →   Season 01/
Season 01 - Pilot Season/         →   Season 01/        ← drop subtitle suffixes
Season 01 [BluRay]/               →   Season 01/
Specials/                         →   Season 00/
Season 0/                         →   Season 00/
Extras/                           →   Season 00/         ← only if treated-as-specials

Doc 05 § 2.3 is explicit: Specials/, Season 0/, Season Specials/ do not match the parser. Season 00 is the only correct form.
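
The season-folder mappings above reduce to "find the number, re-emit it zero-padded". A minimal sketch, with hypothetical names (`SEASON_NUM`, `normalize_season_folder`) and an explicit opt-in flag for the Extras-as-specials case:

```python
import re

# "Season 1", "Season1", "Season.01", "S01", with any trailing tags ignored.
SEASON_NUM = re.compile(r"^(?:season[\s._-]*|s)(\d{1,2})(?![A-Za-z0-9])",
                        re.IGNORECASE)

def normalize_season_folder(name: str, extras_are_specials: bool = False):
    """Return 'Season NN' for a source folder name, or None if unrecognised."""
    cleaned = name.strip()
    lowered = cleaned.lower()
    if lowered == "specials" or (extras_are_specials and lowered == "extras"):
        return "Season 00"
    m = SEASON_NUM.match(cleaned)
    if not m:
        return None
    return f"Season {int(m.group(1)):02d}"
```

Returning `None` rather than guessing lets the caller decide whether an unrecognised folder should be quarantined.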

4.3 Movie folder — `<Title> (<Year>)`

Same rules as the filename without the extension. The folder name MUST byte-for-byte match the filename prefix when multi-version files are present (doc 05 § 1.2 — Jellyfin requires this).

[YIFY] Blade Runner 1982 1080p BRRip x264 AAC-RARBG/   →   Blade Runner (1982)/
Blade.Runner.2049.2017.2160p.UHD.BluRay.x265.10bit.HDR.DV.DTS-HD.MA.7.1-FreetheFish/
                                                       →   Blade Runner 2049 (2017)/

4.4 Music-video artist folder — `<Artist>`

Daft.Punk/             →   Daft Punk/
[Daft Punk]/           →   Daft Punk/
DAFT PUNK Discography/ →   Daft Punk/   ← note: "Discography" is dropped; this is video lib not music

4.5 Special-features subfolders

Inside an item folder, only these subfolder names are recognised by Jellyfin (doc 05 § 8.2). The normalizer must rename source folders to the canonical lowercase form:

BTS/                          →   behind the scenes/
Behind-the-Scenes/            →   behind the scenes/
behind_the_scenes/            →   behind the scenes/
Featurettes/                  →   featurettes/
DELETED SCENES [Joy]/         →   deleted scenes/
Trailers/                     →   trailers/
Interviews/                   →   interviews/
Bonus Content/                →   extras/             ← catch-all
Bonus_Features/               →   extras/

Files inside featurettes/ etc. keep human-readable titles but get their group tags stripped:

Featurettes/Welcome to the World of Tomorrow [1080p Joy].mkv
                  →   featurettes/Welcome to the World of Tomorrow.mkv

Casing inside the special-features file itself uses smart title case (§ 5.1).
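
The folder mappings in this section fit a small alias table. The sketch below is an illustration, not the script's code: `EXTRAS_ALIASES` covers only the names listed above and `normalize_extras_folder` is a hypothetical helper.

```python
import re

# Source spellings (after cleanup) mapped to the canonical lowercase folder.
EXTRAS_ALIASES = {
    "bts": "behind the scenes",
    "behind the scenes": "behind the scenes",
    "featurettes": "featurettes",
    "deleted scenes": "deleted scenes",
    "trailers": "trailers",
    "interviews": "interviews",
    "bonus content": "extras",
    "bonus features": "extras",
    "extras": "extras",
}

def normalize_extras_folder(name: str):
    """Return the canonical extras folder name, or None if unknown."""
    key = re.sub(r"\[[^\]]*\]", " ", name)          # drop [group] tags
    key = re.sub(r"[\s._-]+", " ", key).strip().lower()
    return EXTRAS_ALIASES.get(key)
```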

5. Case + character handling

5.1 Smart title case

Capitalize every word EXCEPT these "small words" (when not the first or last word of the title):

a, an, and, as, at, but, by, for, from, in, into, nor, of, on, or, the,
to, up, vs, vs., via, with, yet

Words that look like acronyms (I.B.M., C.I.A., T.M.N.T.) are preserved as-is. Roman numerals (II, III, IV, IX) are uppercased.

Examples

the lord of the rings the two towers           →   The Lord of the Rings the Two Towers     ← BAD
the lord of the rings: the two towers          →   The Lord of the Rings - The Two Towers    ← GOOD (`:` → ` - `, the second `the` is at start of subtitle, capitalize)
return of the king                              →   Return of the King
star trek ii: the wrath of khan                 →   Star Trek II - The Wrath of Khan

The subtitle-after-colon special case is important: when a : is substituted with -, the word after the dash is a new "first word" for title-casing purposes. The script handles this by re-running the title-caser on each - separated chunk.
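
The per-chunk re-casing can be sketched as follows. This is a simplified illustration with hypothetical names (`smart_title_case`, `_case_chunk`); the Roman-numeral set is a small explicit list, and the sketch lowercases the tail of every ordinary word, so mixed-case names such as McDonald would need an exception list.

```python
import re

SMALL_WORDS = {"a", "an", "and", "as", "at", "but", "by", "for", "from", "in",
               "into", "nor", "of", "on", "or", "the", "to", "up", "vs", "vs.",
               "via", "with", "yet"}
ROMAN_NUMERALS = {"ii", "iii", "iv", "vi", "vii", "viii", "ix"}  # assumed subset

def _case_chunk(chunk: str) -> str:
    words = chunk.split()
    out = []
    for i, word in enumerate(words):
        lower = word.lower()
        if lower in ROMAN_NUMERALS:
            out.append(word.upper())
        elif re.fullmatch(r"(?:[A-Za-z]\.){2,}", word):
            out.append(word)                      # dotted acronym: keep as-is
        elif lower in SMALL_WORDS and 0 < i < len(words) - 1:
            out.append(lower)                     # small word mid-chunk
        else:
            out.append(word[:1].upper() + word[1:].lower())
    return " ".join(out)

def smart_title_case(title: str) -> str:
    """Title-case each ' - ' separated chunk independently (colon -> ' - ')."""
    title = re.sub(r"\s*:\s*", " - ", title)
    return " - ".join(_case_chunk(c) for c in title.split(" - "))
```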

Jellyfin's parser is case-insensitive — this is purely for human readers.

5.2 Hyphen / dash normalization

Char	Code	Used for
`-`	U+002D HYPHEN-MINUS	ASCII hyphen, the only canonical form for filenames
`–`	U+2013 EN DASH	Forbidden in filenames; replace with `-`
`—`	U+2014 EM DASH	Forbidden; replace with `-`
`−`	U+2212 MINUS SIGN	Forbidden; replace with `-`

Unicode dashes appear from copy-paste of articles (Wikipedia loves the en dash). They're invisible-ish in ls, but they break grep, shell completion, and SMB transfers.

Spider–Man (2002).mkv          →   Spider-Man (2002).mkv
Spider — Man (2002).mkv         →   Spider - Man (2002).mkv

5.3 Apostrophes / quotes

Char	Code	Status
`'`	U+0027 APOSTROPHE	Canonical; ASCII straight quote
`’`	U+2019 RIGHT SINGLE QUOTATION MARK	Forbidden in filenames; replace with `'`
`‘`	U+2018 LEFT SINGLE QUOTATION MARK	Forbidden; replace with `'`
`"`	U+0022 QUOTATION MARK	Forbidden in filenames (Windows-illegal); strip entirely
`“`	U+201C LEFT DOUBLE QUOTATION MARK	Forbidden; strip
`”`	U+201D RIGHT DOUBLE QUOTATION MARK	Forbidden; strip

Curly quotes break SMB shares (Windows clients see ? and refuse to open the file) and break shell escaping in scripts.

Don't Stop Believin'.mkv           ←   GOOD
Don’t Stop Believin’.mkv           ←   BAD (curly), normalize to straight
"It's a Wonderful Life" (1946).mkv ←   BAD (double quotes), strip them entirely:
It's a Wonderful Life (1946).mkv   ←   GOOD
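
The dash table in § 5.2 and the quote table above are both plain character mappings, so a single `str.translate` call can cover them. A minimal sketch; `PUNCT_MAP` and `normalize_punctuation` are hypothetical names.

```python
# Code points taken from the § 5.2 and § 5.3 tables.
PUNCT_MAP = str.maketrans({
    "\u2013": "-",    # EN DASH
    "\u2014": "-",    # EM DASH
    "\u2212": "-",    # MINUS SIGN
    "\u2018": "'",    # LEFT SINGLE QUOTATION MARK
    "\u2019": "'",    # RIGHT SINGLE QUOTATION MARK
    '"': None,        # straight double quote: strip
    "\u201c": None,   # LEFT DOUBLE QUOTATION MARK: strip
    "\u201d": None,   # RIGHT DOUBLE QUOTATION MARK: strip
})

def normalize_punctuation(name: str) -> str:
    """Map Unicode dashes and curly quotes to their canonical ASCII forms."""
    return name.translate(PUNCT_MAP)
```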

5.4 Diacritics / non-ASCII

ext4 is UTF-8 native; Jellyfin's parser is UTF-8 native; the HTTP API serves UTF-8 happily. Keep diacritics when the title's accepted spelling uses them.

Amélie (2001)/Amélie (2001).mkv                  ← GOOD
Pokémon (1997)/Season 01/Pokémon (1997) - S01E01 - Pokémon - I Choose You!.mkv  ← GOOD
Léon - The Professional (1994)/Léon - The Professional (1994).mkv ← GOOD

Doc 05 § 0 rule 4 advises caution: prefer the ASCII title when "well known" (e.g. Amelie (2001) over Amélie (2001)). For this deploy with LAN-only HTTP and ext4, full Unicode is safe — but the rule of thumb remains: if Wikipedia's English page uses the accent, keep it; if not, drop it.

Tested: Jellyfin's filename matching, Items?searchTerm=, and NFO <title> round-trip correctly with é, ñ, ü, ß, ø, ł, ż, 日, 한 on this deploy. Verified against the Futurama Polish-dubbed corpus.

5.5 Forbidden-char substitution table

Windows-illegal: < > : " / \ | ? *. Linux forbids only / and NUL. Substitute as follows:

Char	Substitute	Rationale
`:`	` - ` (space-hyphen-space)	Most common in titles (`Star Trek II: The Wrath of Khan`); ` - ` is a clean replacement that title-casing handles
`/`	` and `	Appears in episode-title lists for two-part eps. Titles such as `Mr. & Mrs. Smith` use `&`, which is legal and left alone. Avoid if both halves stand on their own.
`\`	omit	No legitimate use in titles
`<`	`(`	Rare; `<` in titles is parenthetical
`>`	`)`	Same
`|`	omit (or `-`)	Rare; sometimes in `Tom | Jerry` style logo-text
`?`	omit	Common in `Who Killed the Robber?` — drop the question mark, keep meaning
`*`	omit	Rare; usually censored profanity
`"`	omit	Per § 5.3
`\0` (NUL)	error	Filesystem hard-block; surface to user

Examples

Star Trek II: The Wrath of Khan (1982)        →   Star Trek II - The Wrath of Khan (1982)
Mr. & Mrs. Smith (2005)                        →   Mr. & Mrs. Smith (2005)         (no change; & is fine)
Who Killed the Robber? (1987)                  →   Who Killed the Robber (1987)
Tom & Jerry: The Movie (1992)                  →   Tom & Jerry - The Movie (1992)
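The table and examples above can be sketched as a single function. This is an illustration under assumptions, not the code in § 11: the name `substitute_forbidden` is hypothetical, and unlike `normalize_chars()` it also raises on NUL, which the table marks as an error.

```python
# Mapping taken from the § 5.5 table; "|" uses the "omit" option.
SUBSTITUTIONS = {
    ":": " - ", "/": " and ", "\\": "", "<": "(", ">": ")",
    "|": "", "?": "", "*": "", '"': "",
}


def substitute_forbidden(title: str) -> str:
    """Replace Windows-illegal characters; refuse titles containing NUL."""
    if "\0" in title:
        raise ValueError("NUL byte in title; surface to the user")
    for char, replacement in SUBSTITUTIONS.items():
        title = title.replace(char, replacement)
    # ": " becomes " -  " (two spaces), so collapse whitespace runs afterwards.
    return " ".join(title.split())


for raw in ("Star Trek II: The Wrath of Khan (1982)",
            "Who Killed the Robber? (1987)",
            "Tom & Jerry: The Movie (1992)"):
    print(raw, "->", substitute_forbidden(raw))
```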

5.6 Whitespace canonicalization

After all substitutions:

Collapse runs of \s+ to a single space.
strip() leading/trailing whitespace.
Collapse double hyphens (--, which can result from Title -- Subtitle) to a single -.
Trim trailing punctuation before extension: Title -.mkv → Title.mkv.
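The four steps above could look roughly like this when applied to a full filename. The function name `canonicalize_spacing` is hypothetical, and the sketch assumes the input carries a real extension after the last dot.

```python
import re


def canonicalize_spacing(filename: str) -> str:
    """Apply the § 5.6 steps to the stem and leave the extension untouched."""
    stem, dot, ext = filename.rpartition(".")
    if not dot:                                 # no extension present
        stem, ext = filename, ""
    stem = re.sub(r"\s+", " ", stem).strip()    # collapse runs, strip the ends
    stem = re.sub(r"-{2,}", "-", stem)          # "--" becomes "-"
    stem = stem.rstrip(" -._")                  # trailing punctuation before ext
    return f"{stem}.{ext}" if ext else stem


print(canonicalize_spacing("Title -.mkv"))
print(canonicalize_spacing("Title  --  Subtitle.mkv"))
```

Splitting off the extension first keeps the trailing-punctuation trim from eating the dot that separates stem and extension.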

6. Year disambiguation — concrete examples

Jellyfin's TMDB/TVDB scrape uses the year in (YYYY) to filter candidates. With multiple titles of the same name, the year is the only disambiguator before falling back to provider IDs.

6.1 Without year — what goes wrong

Filename: Cinderella.mkv (no year, no folder year).

Jellyfin sends "Cinderella" to TMDB. TMDB returns 12+ matches:

Cinderella (1950) — Disney animated
Cinderella (2015) — Disney live action
Cinderella (2021) — Camila Cabello musical
Cinderella (1965) — TV special
Cinderella (1899) — Méliès short

Jellyfin picks the one with the highest popularity score, which is the 2015 live-action remake. If you wanted 1950, you have to manually edit.

6.2 With year — clean match

Filename: Cinderella (1950).mkv in folder Cinderella (1950)/.

Jellyfin sends (title=Cinderella, year=1950) to TMDB. TMDB returns the 1950 animated film as the top match with high confidence. Scrape succeeds first try.

Movies/Cinderella (1950)/Cinderella (1950).mkv      ← TMDB ID 11224 (animated)
Movies/Cinderella (2015)/Cinderella (2015).mkv      ← TMDB ID 150689 (live action)
Movies/Cinderella (2021)/Cinderella (2021).mkv      ← TMDB ID 587996 (musical)

6.3 Same year — provider ID required

Filename: Bad Movie (1980).mkv. Two films named "Bad Movie" released in 1980 (hypothetical). Year doesn't disambiguate. Add provider ID:

Movies/Bad Movie (1980) [imdbid-tt0080000]/Bad Movie (1980) [imdbid-tt0080000].mkv
Movies/Bad Movie (1980) [imdbid-tt0080001]/Bad Movie (1980) [imdbid-tt0080001].mkv
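As an illustration of what the folder name carries, the sketch below pulls the title, year and optional provider ID out of a canonical folder name. `FOLDER_RE`, `FolderInfo` and `parse_folder` are names invented for this example; Jellyfin's own parser is not shown here and may behave differently.

```python
import re
from typing import NamedTuple, Optional

# "<Title> (<Year>)" with an optional " [imdbid-...]" style suffix.
FOLDER_RE = re.compile(
    r"^(?P<title>.+?) \((?P<year>\d{4})\)"
    r"(?: \[(?P<provider>imdbid|tmdbid|tvdbid)-(?P<id>[^\]]+)\])?$"
)


class FolderInfo(NamedTuple):
    title: str
    year: int
    provider: Optional[str]
    provider_id: Optional[str]


def parse_folder(name: str) -> Optional[FolderInfo]:
    """Return the parsed parts, or None when the name is not canonical."""
    m = FOLDER_RE.match(name)
    if m is None:
        return None
    return FolderInfo(m["title"], int(m["year"]), m["provider"], m["id"])


print(parse_folder("Cinderella (1950)"))
print(parse_folder("Bad Movie (1980) [imdbid-tt0080000]"))
print(parse_folder("Cinderella"))  # no year, so not canonical
```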

6.4 Year on TV shows

The same logic applies to series:

TV/The Office (2001)/...    ← UK original, BBC
TV/The Office (2005)/...    ← US remake, NBC

Without year, Jellyfin picks one (usually the US one, higher TMDB popularity). With year, both work side-by-side.

7. Multi-version handling

When a single movie has multiple legitimate cuts (Director's Cut, Theatrical, Extended), or multiple resolutions (2160p HDR + 1080p SDR), Jellyfin groups them under one item with a "Version" picker in the UI.

7.1 Edition variants

Movies/Blade Runner (1982)/
├── Blade Runner (1982).mkv                       ← default (whichever is "the" version)
├── Blade Runner (1982) - Director's Cut.mkv
├── Blade Runner (1982) - Final Cut.mkv
└── Blade Runner (1982) - Theatrical.mkv

Jellyfin reads all four files, hashes them, and creates one library item "Blade Runner (1982)" with four selectable versions. The unlabelled one shows as "Default".

7.2 Resolution variants

Movies/Blade Runner 2049 (2017)/
├── Blade Runner 2049 (2017) - 2160p.mkv
├── Blade Runner 2049 (2017) - 1080p.mkv
└── Blade Runner 2049 (2017) - 720p.mkv

Resolution labels ending in p or i sort descending by quality, so the 2160p version is offered first. This is the only exception to "no resolution tags in filenames" (§ 1.1).

7.3 Mixed (edition × resolution)

Movies/Blade Runner 2049 (2017)/
├── Blade Runner 2049 (2017) - Theatrical 2160p.mkv
├── Blade Runner 2049 (2017) - Theatrical 1080p.mkv
├── Blade Runner 2049 (2017) - Director's Cut 2160p.mkv
└── Blade Runner 2049 (2017) - Director's Cut 1080p.mkv

This works in Jellyfin 10.10 — all four are grouped, the picker is a flat list with all four labels visible. Slight UX ugliness but parses cleanly. Avoid unless you genuinely have both axes of variation.

7.4 What does NOT work

Sub-folders for variants:
```
Movies/Blade Runner 2049 (2017)/Theatrical/Blade Runner 2049 (2017).mkv  ← BREAKS
```
Jellyfin treats Theatrical/ as an unknown extras subfolder and ignores the inner mkv.

Different folder per cut:

Movies/Blade Runner 2049 (2017) Theatrical/Blade Runner 2049 (2017).mkv
Movies/Blade Runner 2049 (2017) Director's Cut/Blade Runner 2049 (2017).mkv

This makes them two separate library items, not grouped versions.

Suffix without space-hyphen-space:

Blade Runner 2049 (2017).Theatrical.mkv     ← BREAKS (no ` - ` separator)
Blade Runner 2049 (2017)-Theatrical.mkv     ← BREAKS (no spaces around `-`)
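A pre-import check for the separator rule might look like the sketch below. It only mirrors the convention described in this section (folder name, then ` - `, then a label); it is not Jellyfin's parser, and `version_label` is a hypothetical helper.

```python
from pathlib import PurePosixPath
from typing import Optional


def version_label(folder: str, filename: str) -> Optional[str]:
    """Return the label, '' for the unlabelled default, or None when the
    file does not follow the '<folder> - <label>' convention."""
    stem = PurePosixPath(filename).stem
    if stem == folder:
        return ""
    prefix = folder + " - "
    if stem.startswith(prefix) and stem[len(prefix):].strip():
        return stem[len(prefix):]
    return None


folder = "Blade Runner 2049 (2017)"
for name in ("Blade Runner 2049 (2017) - Theatrical.mkv",
             "Blade Runner 2049 (2017).Theatrical.mkv",
             "Blade Runner 2049 (2017)-Theatrical.mkv"):
    print(name, "->", version_label(folder, name))
```

Files that return None are the ones to fix before the scanner sees them.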

8. Special-features filename rules

Files inside the recognised subfolders (featurettes/, behind the scenes/, deleted scenes/, interviews/, trailers/, etc.) follow these rules:

Strip group tags as in § 2.1.
Strip quality / codec / source / audio tags as in § 2.3.
Smart title case as in § 5.1.
Forbidden chars substituted as in § 5.5.
Filename = the human-readable feature title. No (year), no S01E01. The parent folder type (e.g. featurettes/) is the type marker.
Optional: append -featurette (or -trailer, -behindthescenes, etc.) suffix to be defensive about scraper edge cases. Doc 05 § 8.1 shows this works AND § 8.2 shows the folder method works — using both is belt-and-braces.

Example

Featurettes/Welcome to the World of Tomorrow [1080p Joy].mkv
                  →
featurettes/Welcome to the World of Tomorrow.mkv

Or, if you want belt-and-braces:

featurettes/Welcome to the World of Tomorrow-featurette.mkv

Both parse. Pick one style per library and keep it consistent.

9. Worked example — the live Futurama import

This is the example the owner asked for. Verified against the live media tree on nullstone (/home/user/media/tv/Futurama/Season 01,02,03/).

9.1 BEFORE (representative source dump)

/home/admin/Downloads/futrama/
└── Futurama Season 1  [1080p AI x265 10bit FS99 Joy]/
    ├── Futurama S01E01 Space Pilot 3000  [1080p x265 10bit Joy].mkv
    ├── Futurama S01E02 The Series Has Landed  [1080p x265 10bit Joy].mkv
    ├── Futurama S01E03 I, Roommate  [1080p x265 10bit Joy].mkv
    ├── Futurama S01E04 Love's Labours Lost in Space  [1080p x265 10bit Joy].mkv
    ├── Futurama S01E05 Fear of a Bot Planet  [1080p x265 10bit Joy].mkv
    ├── Futurama S01E06 A Fishful of Dollars  [1080p x265 10bit Joy].mkv
    ├── Futurama S01E07 My Three Suns  [1080p x265 10bit Joy].mkv
    ├── Futurama S01E08 A Big Piece of Garbage  [1080p x265 10bit Joy].mkv
    ├── Futurama S01E09 Hell Is Other Robots  [1080p x265 10bit Joy].mkv
    └── Featurettes/
        └── Welcome to the World of Tomorrow [1080p Joy].mkv

Note: the doubled space is real (Futurama S01E01 Space Pilot 3000  [1080p). The rip is from a release group called "Joy" using "FS99" (FastSub 99); "AI" likely means AI-upscaled. None of that is library-relevant.

9.2 AFTER (canonical layout)

/home/user/media/tv/
└── Futurama (1999)/
    ├── Season 01/
    │   ├── Futurama (1999) - S01E01 - Space Pilot 3000.mkv
    │   ├── Futurama (1999) - S01E02 - The Series Has Landed.mkv
    │   ├── Futurama (1999) - S01E03 - I, Roommate.mkv
    │   ├── Futurama (1999) - S01E04 - Love's Labours Lost in Space.mkv
    │   ├── Futurama (1999) - S01E05 - Fear of a Bot Planet.mkv
    │   ├── Futurama (1999) - S01E06 - A Fishful of Dollars.mkv
    │   ├── Futurama (1999) - S01E07 - My Three Suns.mkv
    │   ├── Futurama (1999) - S01E08 - A Big Piece of Garbage.mkv
    │   └── Futurama (1999) - S01E09 - Hell Is Other Robots.mkv
    └── featurettes/
        └── Welcome to the World of Tomorrow.mkv

9.3 Per-file rename mapping

| Before | After |
|---|---|
| `Futurama Season 1 [1080p AI x265 10bit FS99 Joy]/` | `Futurama (1999)/Season 01/` |
| `Futurama S01E01 Space Pilot 3000 [1080p x265 10bit Joy].mkv` | `Futurama (1999) - S01E01 - Space Pilot 3000.mkv` |
| `Futurama S01E02 The Series Has Landed [1080p x265 10bit Joy].mkv` | `Futurama (1999) - S01E02 - The Series Has Landed.mkv` |
| `Futurama S01E04 Love's Labours Lost in Space [1080p x265 10bit Joy].mkv` | `Futurama (1999) - S01E04 - Love's Labours Lost in Space.mkv` |
| `Featurettes/Welcome to the World of Tomorrow [1080p Joy].mkv` | `featurettes/Welcome to the World of Tomorrow.mkv` |

Notes on specific titles:

I, Roommate keeps the comma. Comma is legal on ext4, on Windows, and on every modern SMB client. No need to substitute.
Love's Labours Lost in Space keeps the straight ASCII apostrophe. If the source had a curly ’, § 5.3 normalizes it.
Hell Is Other Robots: Is is capitalized because it is not in the small-words list (which excludes is/be/am/are).

9.4 What the live tree currently has

Verified via ssh user@192.168.0.100 'ls /home/user/media/tv/Futurama/':

Season 01
Season 02
Season 03

The current live deploy uses folder name Futurama/ (no year) — that's non-canonical per this doc. The canonical is Futurama (1999)/. This is covered in doc 07's migration plan (rename the folder, then POST /Library/Refresh). Mentioned here as a known drift; not fixed in this doc.

10. Idempotency and safety

The normalize.py script in § 11 enforces these:

No-op on already-canonical input. When the script's emitted filename equals the source filename byte-for-byte, it does nothing and returns exit code 0. Re-running the script on an already-imported library is safe and free.
No overwrite without --force. When the target path exists and is not the source path, the script refuses to move, prints a REFUSE line, and returns exit code 2. With --force, it moves and the target is overwritten. The reference implementation in § 11 does not suggest a numeric suffix (Title (Year) (1).mkv) or prompt for confirmation; resolve the collision by hand or re-run with --force.
Default to dry-run. The script prints what it would do to stdout and does NOT touch the filesystem unless --apply is passed. This is the inverse of the GNU convention (most tools default to apply, require --dry-run to preview) — chosen because the destructive case (a wrong rename of 100 files) is much worse than the boring case (one extra flag).
Audit log at /var/log/jellyfin-imports/<YYYY-MM-DD>.log. Every --apply run appends:
```
2026-05-08T14:23:11Z RENAME /home/admin/.../Futurama S01E01 ...joy].mkv -> /home/user/media/tv/Futurama (1999)/Season 01/Futurama (1999) - S01E01 - Space Pilot 3000.mkv
```
Path is created (mkdir -p /var/log/jellyfin-imports) on first run if missing; user must have write permission.
No deletes. The script moves (os.rename on same FS, shutil.move across FS). It never os.unlinks. Garbage collection of source folders (after all files moved) is doc 07's job.
Atomic per-file. Each file's rename is one syscall on the same FS; on a different FS, shutil.move does copy-then-unlink which has a brief window where both source and target exist. The audit log records the operation regardless.
Unicode-safe. All paths handled as pathlib.Path (UTF-8 native on ext4). Curly-quote → straight-quote substitution happens BEFORE the target path is computed, so the target path is always ASCII-safe-ish (still UTF-8 for legitimate accents).
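For the collision case, a numeric suffix such as Title (Year) (1).mkv could be derived as in the sketch below. This is a hypothetical helper, not something normalize.py in § 11 contains; `suggest_free_name` and its `exists` parameter are assumptions made for the example, with `exists` injectable so the logic can be tried without touching the filesystem.

```python
from pathlib import Path
from typing import Callable


def suggest_free_name(target: Path,
                      exists: Callable[[Path], bool] = Path.exists,
                      limit: int = 99) -> Path:
    """Return target if unused, else the first 'stem (n).ext' that is free."""
    if not exists(target):
        return target
    for n in range(1, limit + 1):
        candidate = target.with_name(f"{target.stem} ({n}){target.suffix}")
        if not exists(candidate):
            return candidate
    raise FileExistsError(f"no free name within {limit} attempts: {target}")


# Simulate an occupied target with a set instead of real files.
taken = {Path("/media/movies/Title (2001)/Title (2001).mkv")}
print(suggest_free_name(Path("/media/movies/Title (2001)/Title (2001).mkv"),
                        exists=taken.__contains__))
```

The suggestion is only printed; whether to use it stays a manual decision, in keeping with the dry-run default.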

11. Reference implementation — `normalize.py`

Drop this at /opt/docker/jellyfin/scripts/normalize.py on nullstone. Run with Python 3.10+. Stdlib only — no external deps.

#!/usr/bin/env python3
"""
normalize.py — canonical filename normalizer for tv.s8n.ru

Per /tmp/jellyfin-stack/docs/08-filename-normalization.md.
Safe by default: dry-run, no overwrite, no delete.
"""

from __future__ import annotations

import argparse
import datetime as dt
import os
import re
import shutil
import sys
import unicodedata
from dataclasses import dataclass, field
from pathlib import Path
from typing import Optional

LOG_DIR = Path("/var/log/jellyfin-imports")

# --- Stripping rules (doc § 2) -------------------------------------------------

GROUP_TAG_PATTERNS = [
    re.compile(r"\[[^\[\]]*\b(YIFY|YTS(\.\w+)?|RARBG|ettv|eztv|GalaxyRG\d*|"
               r"FitGirl|FitGirl\s*Repack|NOGRP|QxR|FreetheFish|psa|PSA|CMRG|"
               r"d3g|STRiFE|Pahe\.in|FoV|NTb|YOLO|KOGi|playWEB|REQ|XBET|FLUX|"
               r"NOSiVID|BGT|SVA|CRiMSON|ION10|ION265|BluPanda|H4S5S|Joy|"
               r"FS99\s*Joy|FS99|AI\s*x265|x265\s*\d+bit|\d+bit\s*x265)"
               r"[^\[\]]*\]", re.I),
    re.compile(r"\((YIFY|RARBG|NOGRP)\)", re.I),
]

QUALITY_TOKENS = re.compile(
    r"(?<![A-Za-z0-9])("
    r"2160p|1080p|720p|480p|360p|4[Kk]|UHD|HD|SD|FHD|QHD|"
    r"WEB-DL|WEBDL|WEB\.DL|WEB|WEBRip|WEB-Rip|BluRay|BLURAY|Bluray|BDRip|"
    r"BRRip|BR-Rip|BDR|HDTV|HDTVRip|PDTV|DSR|DVDRip|DVD|DVDR|DVD9|DVD5|"
    r"HDDVD|HDDVDRip|HDRip|CAMRip|CAM|TS|HDTS|TC|TELESYNC|TELECINE|R5|"
    r"SCREENER|SCR|WORKPRINT|WP|PPV|PPVRip|"
    r"x264|x265|H\.?264|H\.?265|HEVC|AVC|VP9|AV1|XviD|DivX|"
    r"10bit|10-bit|8bit|8-bit|HDR10\+?|HDR|DV|Dolby\.?Vision|SDR|HFR|HQ|"
    r"DDP?5\.1|DDP?7\.1|DDP?2\.0|DD\+5\.1|DD\+7\.1|DTS-HD\.MA|DTS-HD|DTS-X|"
    r"DTSX|DTS|TrueHD|Atmos|AAC2\.0|AAC5\.1|AAC|AC3|AC-3|EAC3|E-AC3|"
    r"MP3|MP2|Opus|FLAC|PCM|LPCM|5\.1|7\.1|2\.0|Mono|Stereo|Multi|"
    r"PROPER|REPACK|iNTERNAL|INTERNAL|LIMITED|UNCUT|RERIP|REAL|READNFO|"
    r"RETAi?L|STV|REMUX|MULTi|MULTI|SUBBED|DUBBED|iNT"
    r")(?![A-Za-z0-9])", re.I)

URL_REF = re.compile(
    r"(?:^|[\s\[\(\.\-_])(WWW\.[A-Z0-9\-]+\.[A-Z]{2,4})(?:[\s\]\)\.\-_]|$)",
    re.I)

TRAILING_GROUP = re.compile(r"-(?:NOGRP|EVO|RARBG|SPARKS|CMRG|NTb|FLUX|AMZN|"
                            r"NF|DSNP|ATVP|MA|WEB|AAC2|FoV|KOGi|PLAYWEB|FRDS|"
                            r"ZQ|PHOENiX|EZTV|NTG|iON|ION10|ION265|CtrlHD|"
                            r"d3g|PSA|QxR|RZeroX|PMP|BTN|DEFLATE|BAE|MZABI|"
                            r"TURG|Joy)\b", re.I)

LANG_TOKEN = re.compile(r"(?<![A-Za-z])\.?(en|eng|pl|pol|de|deu|fr|fra|es|spa|"
                        r"it|ita|ja|jpn|jp|ru|rus|ko|kor|zh|chi)(?![A-Za-z])",
                        re.I)

# Forbidden chars (§ 5.5)
FORBIDDEN_CHARS = {
    ":": " - ",
    "/": " and ",
    "\\": "",
    "<": "(",
    ">": ")",
    "|": "",
    "?": "",
    "*": "",
    '"': "",
    "“": "",  # left double quotation mark
    "”": "",  # right double quotation mark
}

# Apostrophe normalization (§ 5.3)
APOSTROPHES = {
    "‘": "'",
    "’": "'",
}

# Dashes (§ 5.2)
DASHES = {
    "–": "-",  # en dash
    "—": "-",  # em dash
    "−": "-",  # minus
}

# Editions (§ 3.6)
EDITION_PATTERNS = [
    (re.compile(r"director'?s?[\.\s_-]*cut", re.I), "Director's Cut"),
    (re.compile(r"final[\.\s_-]*cut", re.I), "Final Cut"),
    (re.compile(r"extended[\.\s_-]*(?:cut|edition)?", re.I), "Extended"),
    (re.compile(r"theatrical(?:[\.\s_-]*cut)?", re.I), "Theatrical"),
    (re.compile(r"\bIMAX\b", re.I), "IMAX"),
    (re.compile(r"\bunrated\b", re.I), "Unrated"),
    (re.compile(r"remastere?d?", re.I), "Remastered"),
    (re.compile(r"(?<![A-Za-z])DC(?![A-Za-z])"), "Director's Cut"),
    (re.compile(r"(?<![A-Za-z])EE(?![A-Za-z])"), "Extended"),
]

# Smart title case (§ 5.1)
SMALL_WORDS = {"a", "an", "and", "as", "at", "but", "by", "for", "from",
               "in", "into", "nor", "of", "on", "or", "the", "to", "up",
               "vs", "vs.", "via", "with", "yet"}
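# Strict roman numerals only: a bare character class would also match
# ordinary words such as "did", "mild" or "civic" and upper-case them.
# Words that are valid numerals (e.g. "mix") still match.
ROMAN_NUMERAL = re.compile(
    r"^(?=[MDCLXVI])M{0,3}(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})"
    r"(?:IX|IV|V?I{0,3})$", re.I)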


def smart_title(s: str) -> str:
    """Title-case respecting small-words and roman numerals."""
    if not s:
        return s
    chunks = re.split(r"(\s-\s)", s)  # split on space-dash-space (subtitle)
    out_chunks = []
    for chunk in chunks:
        if chunk == " - ":
            out_chunks.append(chunk)
            continue
        words = chunk.split(" ")
        result = []
        for i, w in enumerate(words):
            if not w:
                result.append(w)
                continue
            if ROMAN_NUMERAL.match(w):
                result.append(w.upper())
                continue
            lower = w.lower()
            if 0 < i < len(words) - 1 and lower in SMALL_WORDS:
                result.append(lower)
            else:
                # capitalize but preserve internal apostrophes/dots
                result.append(w[0].upper() + w[1:].lower() if w else w)
        out_chunks.append(" ".join(result))
    return "".join(out_chunks)


def strip_noise(s: str) -> str:
    """Remove group tags, quality, urls, trailing groups."""
    for pat in GROUP_TAG_PATTERNS:
        s = pat.sub("", s)
    s = URL_REF.sub(" ", s)
    s = QUALITY_TOKENS.sub("", s)
    s = TRAILING_GROUP.sub("", s)
    return s


def normalize_chars(s: str) -> str:
    """Apply Unicode/forbidden-char substitutions."""
    for k, v in APOSTROPHES.items():
        s = s.replace(k, v)
    for k, v in DASHES.items():
        s = s.replace(k, v)
    for k, v in FORBIDDEN_CHARS.items():
        s = s.replace(k, v)
    # NFC normalization for diacritics (consistent encoding)
    s = unicodedata.normalize("NFC", s)
    return s


def collapse_whitespace(s: str) -> str:
    s = re.sub(r"\s+", " ", s)
    s = re.sub(r" - - ", " - ", s)
    s = re.sub(r"--+", "-", s)
    s = s.strip(" -._")
    return s


# --- Schema-specific extraction ------------------------------------------------

@dataclass
class Parts:
    title: str = ""
    year: Optional[str] = None
    season: Optional[str] = None
    episode: Optional[str] = None
    episode_end: Optional[str] = None
    episode_title: str = ""
    edition: Optional[str] = None
    provider_id: Optional[str] = None
    ext: str = "mkv"
    absolute_number: Optional[str] = None
    subdub: Optional[str] = None
    track_title: str = ""
    variant: Optional[str] = None
    performer: str = ""


RE_SE = re.compile(r"[Ss](\d{1,2})[Ee](\d{1,3})(?:-[Ee]?(\d{1,3}))?")
RE_NXM = re.compile(r"(?<![\dA-Za-z])(\d{1,2})x(\d{1,3})(?:-(\d{1,3}))?")
RE_SEASON_EP = re.compile(r"Season\s*(\d{1,2})\s*Episode\s*(\d{1,3})", re.I)
RE_YEAR_PARENS = re.compile(r"\((\d{4})\)")
RE_PROVIDER_ID = re.compile(r"\[(?:imdbid|tmdbid|tvdbid)-[^\]]+\]")


def extract_year(s: str) -> Optional[str]:
    m = RE_YEAR_PARENS.search(s)
    if m:
        y = int(m.group(1))
        if 1888 <= y <= dt.date.today().year + 2:
            return m.group(1)
    return None


def extract_provider_id(s: str) -> Optional[str]:
    m = RE_PROVIDER_ID.search(s)
    return m.group(0) if m else None


def extract_se(s: str):
    m = RE_SE.search(s)
    if m:
        end = m.group(3) or None
        return (m, m.group(1), m.group(2), end)
    m = RE_NXM.search(s)
    if m:
        return (m, m.group(1), m.group(2), m.group(3))
    m = RE_SEASON_EP.search(s)
    if m:
        return (m, m.group(1), m.group(2), None)
    return (None, None, None, None)


def extract_edition(raw_basename: str) -> Optional[str]:
    for pat, name in EDITION_PATTERNS:
        if pat.search(raw_basename):
            return name
    return None


def parent_show_folder(p: Path) -> Path:
    """Walk up past Season XX folders until we find the show folder."""
    cur = p.parent
    while re.match(r"(?i)season\s*\d+|specials|extras", cur.name):
        cur = cur.parent
    return cur


# --- Per-schema emit -----------------------------------------------------------

def normalize_movie(src: Path, year_hint: Optional[str] = None,
                    title_hint: Optional[str] = None) -> Path:
    raw = src.stem
    ext = src.suffix.lower().lstrip(".") or "mkv"
    edition = extract_edition(raw)
    provider_id = extract_provider_id(raw) or extract_provider_id(src.parent.name)
    cleaned = strip_noise(raw)
    cleaned = normalize_chars(cleaned)
    cleaned = collapse_whitespace(cleaned)
    year = year_hint or extract_year(cleaned) or extract_year(src.parent.name)
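    if year:
        paren_year = re.compile(r"\s*\(" + re.escape(year) + r"\)")
        if paren_year.search(cleaned):
            cleaned = paren_year.sub("", cleaned).strip()
        else:
            # No "(YYYY)" in the name: drop the last bare YYYY token instead,
            # so "Blade Runner 1982 Final Cut" does not become
            # "Blade Runner 1982 (1982)". Keep the text if nothing would remain.
            bare = re.sub(r"^(.*)(?<!\d)" + re.escape(year) + r"(?!\d)",
                          r"\1", cleaned).strip()
            cleaned = bare or cleaned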
    # drop edition tokens from the title body (we re-emit them)
    for pat, _ in EDITION_PATTERNS:
        cleaned = pat.sub("", cleaned)
    cleaned = collapse_whitespace(cleaned)
    title = title_hint or smart_title(cleaned)
    if not year:
        raise ValueError(f"cannot determine year for movie: {src}")
    folder_name = f"{title} ({year})"
    if provider_id:
        folder_name += f" {provider_id}"
    file_basename = folder_name
    if edition:
        file_basename += f" - {edition}"
    return src.parent.parent / folder_name / f"{file_basename}.{ext}"


def normalize_tv(src: Path, year_hint: Optional[str] = None,
                 title_hint: Optional[str] = None,
                 schema: str = "tv") -> Path:
    raw = src.stem
    ext = src.suffix.lower().lstrip(".") or "mkv"
    m, season, ep, ep_end = extract_se(raw)
    if not season:
        raise ValueError(f"no S/E token in TV file: {src}")
    season = f"{int(season):02d}"
    episode = f"{int(ep):02d}"
    episode_end = f"{int(ep_end):02d}" if ep_end else None
    # episode title = text after match, before next bracket
    after = raw[m.end():] if hasattr(m, "end") else ""
    title_part = re.split(r"[\[\(]", after, maxsplit=1)[0]
    title_part = strip_noise(title_part)
    title_part = normalize_chars(title_part)
    title_part = collapse_whitespace(title_part)
    title_part = re.sub(r"^[\s\-_\.]+", "", title_part)
    episode_title = smart_title(title_part) if title_part else ""
    # show title from parent folder
    show_folder = parent_show_folder(src)
    show_clean = strip_noise(show_folder.name)
    show_clean = normalize_chars(show_clean)
    show_clean = collapse_whitespace(show_clean)
    year = year_hint or extract_year(show_clean) or extract_year(src.parent.name)
    if year:
        show_clean = re.sub(r"\s*\(" + year + r"\).*$", "", show_clean).strip()
    show_clean = re.sub(r"(?i)\s*Season\s*\d+.*$", "", show_clean).strip()
    show = title_hint or smart_title(show_clean)
    if not year:
        raise ValueError(f"cannot determine year for TV show: {show_folder}")
    se_str = f"S{season}E{episode}"
    if episode_end:
        se_str += f"-E{episode_end}"
    file_base = f"{show} ({year}) - {se_str}"
    if episode_title:
        file_base += f" - {episode_title}"
    target_root = show_folder.parent  # e.g. /media/tv
    return target_root / f"{show} ({year})" / f"Season {season}" / f"{file_base}.{ext}"


def normalize_anime_absolute(src: Path, title_hint: Optional[str],
                             abs_num: Optional[int],
                             ep_title: str = "",
                             subdub: Optional[str] = None) -> Path:
    ext = src.suffix.lower().lstrip(".") or "mkv"
    show_folder = parent_show_folder(src)
    show_clean = strip_noise(show_folder.name)
    show_clean = normalize_chars(show_clean)
    show = title_hint or smart_title(collapse_whitespace(show_clean))
    if abs_num is None:
        raise ValueError(f"absolute number required for {src}")
    suffix = f" [{subdub}]" if subdub else ""
    title_str = smart_title(ep_title) if ep_title else ""
    file_base = f"{show} - {abs_num:04d}"
    if title_str:
        file_base += f" - {title_str}"
    file_base += suffix
    return show_folder.parent / show / f"{file_base}.{ext}"


def normalize_musicvideo(src: Path, artist_hint: str, year_hint: str,
                         track_hint: Optional[str] = None,
                         variant: Optional[str] = None) -> Path:
    ext = src.suffix.lower().lstrip(".") or "mp4"
    raw = src.stem
    cleaned = normalize_chars(strip_noise(raw))
    cleaned = collapse_whitespace(cleaned)
    track = track_hint or smart_title(cleaned)
    artist = smart_title(artist_hint)
    suffix = f" [{variant}]" if variant else ""
    return src.parent.parent / artist / f"{year_hint} - {track}{suffix}.{ext}"


def normalize_standup(src: Path, performer: str, title: str, year: str) -> Path:
    ext = src.suffix.lower().lstrip(".") or "mkv"
    folder = f"{performer} - {title} ({year})"
    return src.parent.parent / folder / f"{folder}.{ext}"


# --- Driver --------------------------------------------------------------------

def is_already_canonical(src: Path, target: Path) -> bool:
    return src.resolve() == target.resolve()


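def log_op(action: str, src: Path, target: Path):
    LOG_DIR.mkdir(parents=True, exist_ok=True)
    log_file = LOG_DIR / f"{dt.date.today().isoformat()}.log"
    # utcnow() is deprecated since Python 3.12; use an aware UTC timestamp
    # with second precision, matching the log format shown in § 10.
    ts = dt.datetime.now(dt.timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    line = f"{ts} {action} {src} -> {target}\n"
    with log_file.open("a", encoding="utf-8") as f:
        f.write(line)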


def main():
    ap = argparse.ArgumentParser(description="canonical filename normalizer")
    ap.add_argument("source", type=Path, help="source file path")
    ap.add_argument("--type", required=True,
                    choices=["movie", "tv", "anime-seasonal",
                             "anime-absolute", "musicvideo", "standup",
                             "extra"])
    ap.add_argument("--year")
    ap.add_argument("--title")
    ap.add_argument("--performer", help="for standup")
    ap.add_argument("--artist", help="for musicvideo")
    ap.add_argument("--track", help="for musicvideo")
    ap.add_argument("--variant", help="for musicvideo")
    ap.add_argument("--abs-num", type=int, help="for anime-absolute")
    ap.add_argument("--ep-title", help="for anime-absolute")
    ap.add_argument("--subdub", choices=["Sub", "Dub"], help="for anime-absolute")
    ap.add_argument("--apply", action="store_true",
                    help="actually move the file (default is dry-run)")
    ap.add_argument("--force", action="store_true",
                    help="overwrite existing target")
    args = ap.parse_args()

    src = args.source.resolve()
    if not src.exists():
        print(f"ERROR: {src} does not exist", file=sys.stderr)
        sys.exit(1)

    try:
        if args.type == "movie":
            target = normalize_movie(src, args.year, args.title)
        elif args.type == "tv":
            target = normalize_tv(src, args.year, args.title, schema="tv")
        elif args.type == "anime-seasonal":
            target = normalize_tv(src, args.year, args.title, schema="anime")
        elif args.type == "anime-absolute":
            target = normalize_anime_absolute(src, args.title, args.abs_num,
                                              args.ep_title or "",
                                              args.subdub)
        elif args.type == "musicvideo":
            target = normalize_musicvideo(src, args.artist or "", args.year or "",
                                          args.track, args.variant)
        elif args.type == "standup":
            target = normalize_standup(src, args.performer or "",
                                       args.title or "", args.year or "")
        else:
            print(f"ERROR: schema '{args.type}' not implemented", file=sys.stderr)
            sys.exit(2)
    except ValueError as e:
        print(f"ERROR: {e}", file=sys.stderr)
        sys.exit(2)

    if is_already_canonical(src, target):
        print(f"NOOP {src}")
        sys.exit(0)

    if target.exists() and not args.force:
        print(f"REFUSE {src} -> {target} (target exists; use --force)")
        sys.exit(2)

    if args.apply:
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.move(str(src), str(target))
        log_op("RENAME", src, target)
        print(f"MOVED {src} -> {target}")
    else:
        print(f"DRY-RUN {src} -> {target}")


if __name__ == "__main__":
    main()

11.1 Usage examples

# Dry-run a single Futurama episode (--year is required: the source folder has no "(YYYY)")
./normalize.py --type tv --year 1999 \
  "/home/admin/Downloads/futrama/Futurama Season 1  [1080p AI x265 10bit FS99 Joy]/Futurama S01E01 Space Pilot 3000  [1080p x265 10bit Joy].mkv"

# Output:
# DRY-RUN /home/admin/Downloads/.../Futurama S01E01 Space Pilot 3000  [1080p x265 10bit Joy].mkv
#         -> /home/admin/Downloads/futrama/Futurama (1999)/Season 01/Futurama (1999) - S01E01 - Space Pilot 3000.mkv

# Same with --apply, adding an explicit title hint
./normalize.py --type tv --year 1999 --title "Futurama" --apply \
  "/home/admin/Downloads/futrama/Futurama Season 1  [1080p AI x265 10bit FS99 Joy]/Futurama S01E01 Space Pilot 3000  [1080p x265 10bit Joy].mkv"

# Movie with edition
./normalize.py --type movie --year 1982 --apply \
  "/home/admin/Downloads/Blade Runner 1982 Final Cut [1080p BluRay x265 RARBG].mkv"

# Stand-up
./normalize.py --type standup --performer "Bo Burnham" --title "Inside" --year 2021 --apply \
  "/home/admin/Downloads/Bo.Burnham.Inside.2021.1080p.NF.WEB-DL.DDP5.1.x264-NTb.mkv"

# Music video
./normalize.py --type musicvideo --artist "Daft Punk" --year 2013 \
               --track "Get Lucky" --apply \
  "/home/admin/Downloads/daft.punk.get.lucky.official.video.1080p.mkv"

11.2 Idempotency proof

Running the script twice on the same input produces the same target. The second run's source = first run's target, so is_already_canonical() returns true, and the script no-ops. To be covered by unit tests (/opt/docker/jellyfin/scripts/test_normalize.py, to be added in doc 07's implementation phase).

12. Edge cases catalogue

12.1 Episodes with very long titles

The Office (2005) - S07E25-E26 - Search Committee.mkv     ← multi-ep, short title, fine
Sherlock (2010) - S04E03 - The Final Problem.mkv           ← long-ish, fine
Steins;Gate (2011) - S01E22 - Being Meltdown - The Concerto Whose Conductor Has Lost His Baton.mkv

The third example is 94 chars before the extension. ext4 allows 255 bytes per filename component; this fits. Smart title case is applied. The title has no colon (the long string is the actual title from MyAnimeList). If a title has a colon, it becomes ` - ` per § 5.5, which slightly extends the length; no truncation is applied.
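Because the ext4 limit is counted in bytes rather than characters, titles with diacritics or CJK text reach it sooner than their character count suggests. A minimal sketch of the check, with `component_too_long` as a hypothetical helper name:

```python
def component_too_long(name: str, limit: int = 255) -> bool:
    """True when one path component exceeds the per-name byte limit."""
    return len(name.encode("utf-8")) > limit


long_ascii = ("Steins;Gate (2011) - S01E22 - Being Meltdown - "
              "The Concerto Whose Conductor Has Lost His Baton.mkv")
accented = "é" * 128 + ".mkv"  # 128 characters, but 2 bytes each in UTF-8

print(component_too_long(long_ascii))  # False: well under 255 bytes
print(component_too_long(accented))    # True: 256 bytes plus the extension
```

The check applies to each path component separately, so folder names and file names are tested one at a time.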

12.2 Episodes with `.` in the title

Mr. Robot (2015) - S01E01 - eps1.0_hellofriend.mov.mkv     ← title contains `.mov`

.mov inside the title is technically a substring that looks like a container type. The parser doesn't care (the extension is .mkv, parsed last). Keep as-is. Smart title case leaves the lowercase intentional formatting (it's the title's actual stylization).

12.3 Shows with numeric titles

1923 (2022) - S01E01 - 1923.mkv               ← year-as-title, year-as-disambiguation
24 (2001) - S01E01 - Day 1 - 12-00 AM-1-00 AM.mkv     ← `:` from title became ` - `

The 24 / 1923 cases would fail year extraction if the show year is omitted. Year hint via --year is mandatory for these.

12.4 Two-part single episodes (multi-part files)

Doc 05 § 2 mentions Series A S02E03 Part 1.mkv / Part 2.mkv. Canonical:

TV/Show (Year)/Season 02/Show (Year) - S02E03 - Title - part 1.mkv
TV/Show (Year)/Season 02/Show (Year) - S02E03 - Title - part 2.mkv

Use lowercase part (Jellyfin parser is case-insensitive but lowercase is more common in docs).

12.5 Source has no episode title

Source: Show.S01E01.1080p.WEB-DL.x264-NTb.mkv

Target: Show (Year) - S01E01.mkv

Empty episode title → omit. The script does this already (normalize_tv() in § 11 checks if episode_title). Jellyfin will backfill the title from TVDB on first scrape.

12.6 Source has WRONG episode title

If the rip's episode title is different from TVDB's canonical (e.g. a Polish translation of an English-language show, or a non-canonical sub-group title), prefer the TVDB title (English, official). This requires manual intervention: edit the filename after the rename. In § 11, --ep-title is only wired up for --type anime-absolute; normalize_tv() always takes the episode title from the source filename. Not automated.

12.7 Dual-audio (sub+dub in one file)

If the mkv has both audio tracks, omit the [Sub]/[Dub] suffix:

Anime/One Piece/One Piece - 0001 - I'm Luffy.mkv      ← dual audio in container

The user can pick the audio track from the player. The filename only needs to disambiguate when separate files exist.

12.8 Mid-season hiatus / split seasons

Some shows split a season into "Part 1" and "Part 2" (Better Call Saul, Stranger Things). Treat as one season:

TV/Stranger Things (2016)/Season 04/
├── Stranger Things (2016) - S04E01 - The Hellfire Club.mkv     ← Vol 1
├── ...
├── Stranger Things (2016) - S04E07 - The Massacre at Hawkins Lab.mkv ← Vol 1 finale
├── Stranger Things (2016) - S04E08 - Papa.mkv                  ← Vol 2 start
└── Stranger Things (2016) - S04E09 - The Piggyback.mkv         ← Vol 2 finale

TVDB lists S04 as one season, episodes 1-9. The hiatus is invisible to the parser. Don't create Season 04 Part 1/.

13. Verification checklist (doc 07 will use this)

Before declaring a normalized file "imported":

Filename matches the canonical regex for its category (§ 1).
No forbidden chars (§ 5.5) in any part of the path.
No group tags / quality / codec / source / audio tags in the basename (§ 2).
Folder structure matches § 1.x for the category.
Year is in (YYYY) form and matches the release year (movies) or first-air year (TV).
Season NN/ is zero-padded (TV / anime-seasonal).
Episode S/E numbers zero-padded to two digits (three for >99).
Smart title case applied to all title-bearing components.
Apostrophes are ASCII ('), dashes are ASCII (-).
Diacritics in NFC form (UTF-8 encoded canonically).
The script's is_already_canonical() returns true on the result — re-running the normalizer leaves the file untouched.
Audit log line written to /var/log/jellyfin-imports/<date>.log.

If any check fails, the file is quarantined per doc 07 to a _pending/ subtree for manual review.
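A few of the mechanical checks above can be sketched as a single function that returns the names of the checks a path fails. This is a partial illustration for the TV category only, under stated assumptions: `TV_RE` is a hypothetical approximation of the § 1 regex, `FORBIDDEN` stands in for the § 5.5 table (the Windows-reserved set is used here), `TAG_RE` covers only a sample of the § 2 tags, and `failed_checks()` is not part of normalize.py.

```python
import re
import unicodedata
from pathlib import PurePosixPath

FORBIDDEN = set('<>:"\\|?*')  # assumption: stand-in for the § 5.5 table
TAG_RE = re.compile(          # assumption: small sample of the § 2 strip list
    r"(?<![a-z0-9])(?:\d{3,4}p|4k|x26[45]|hevc|10bit|web-dl|bluray|hdtv|proper|repack)(?![a-z0-9])",
    re.IGNORECASE,
)
TV_RE = re.compile(           # assumption: approximates the § 1 TV shape
    r"^.+ \(\d{4}\) - S\d{2,}E\d{2,3}(?:-E\d{2,3})?(?: - .+)?\.[a-z0-9]+$"
)
TYPOGRAPHIC = "\u2018\u2019\u2013\u2014"  # curly apostrophes, en dash, em dash

def failed_checks(rel_path):
    """Return the names of the checks that rel_path fails (empty list = pass)."""
    path = PurePosixPath(rel_path)
    name = path.name
    failures = []
    if not TV_RE.match(name):
        failures.append("canonical-regex")
    if any(ch in FORBIDDEN for part in path.parts for ch in part):
        failures.append("forbidden-char")
    if TAG_RE.search(name):
        failures.append("release-tag")
    if any(ch in rel_path for ch in TYPOGRAPHIC):
        failures.append("non-ascii-punctuation")
    if not unicodedata.is_normalized("NFC", rel_path):
        failures.append("not-nfc")
    return failures

good = "TV/Futurama (1999)/Season 01/Futurama (1999) - S01E01 - Space Pilot 3000.mkv"
bad = "TV/Show/Season 1/Show.S01E01.1080p.WEB-DL.x264-NTb.mkv"
print(failed_checks(good))  # []
print(failed_checks(bad))   # ['canonical-regex', 'release-tag']
```

Returning the failed check names rather than a bare boolean gives the quarantine step something concrete to record for manual review.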

14. Quick reference card (for the operator)

| Category | Canonical shape | Example |
|---|---|---|
| Movie | `Movies/T (Y)/T (Y).mkv` | `Movies/Inception (2010)/Inception (2010).mkv` |
| Movie+edition | `Movies/T (Y)/T (Y) - E.mkv` | `Movies/Blade Runner (1982)/Blade Runner (1982) - Final Cut.mkv` |
| Movie+resolution | `Movies/T (Y)/T (Y) - NNNNp.mkv` | `Movies/Blade Runner 2049 (2017)/Blade Runner 2049 (2017) - 2160p.mkv` |
| TV episode | `TV/S (Y)/Season NN/S (Y) - SXXEYY - Title.mkv` | `TV/Futurama (1999)/Season 01/Futurama (1999) - S01E01 - Space Pilot 3000.mkv` |
| TV multi-ep | `... - SXXEYY-EZZ - Title.mkv` | `Futurama (1999) - S01E03-E04 - I, Roommate / Love's Labours.mkv` |
| TV special | `... /Season 00/... - S00EYY - Title.mkv` | `Futurama (1999) - S00E01 - Bender's Big Score.mkv` |
| Anime seasonal | same as TV | `Cowboy Bebop (1998) - S01E01 - Asteroid Blues.mkv` |
| Anime absolute | `Anime/S/S - NNNN - Title [Sub].mkv` | `One Piece - 0001 - I'm Luffy [Sub].mkv` |
| Music video | `MV/A/Y - T.mp4` | `Daft Punk/2013 - Get Lucky.mp4` |
| Stand-up | `Movies/P - T (Y)/P - T (Y).mkv` | `Bo Burnham - Inside (2021)/Bo Burnham - Inside (2021).mkv` |
| Extra (folder) | `<item folder>/<lowercase folder>/Title.mkv` | `featurettes/Welcome to the World of Tomorrow.mkv` |
| Extra (suffix) | `... - Title-featurette.mkv` | `Inception (2010) - Dreams Within Dreams-featurette.mkv` |
| Subtitle | `<basename>.<lang>[.flag].srt` | `Futurama (1999) - S01E01.eng.srt` |

15. Cross-references

Doc 05 § 0 — top-level filename rules (forbidden chars, year-in-parens, one folder per item).
Doc 05 § 1.2 — Jellyfin's accepted movie regex.
Doc 05 § 2.2 — Jellyfin's accepted TV regex (table of patterns).
Doc 05 § 3.1–3.3 — anime numbering strategies (which we map to § 1.3 and § 1.4 here).
Doc 05 § 8 — extras folder names (which we lowercase per § 4.5).
Doc 03 — sidecar subtitle naming (referenced in § 2.7 and § 14).
Doc 02 — what the scraper does after the rename, including the RemoteSearch/Apply recipe to fix mis-matches.
Doc 07 (sibling) — the operational pipeline (move, dedupe, GC) that consumes this ruleset. When doc 07 lands, link from § 13's verification checklist into doc 07's quarantine / re-run flow.

16. Open items / known drift

Live /home/user/media/tv/Futurama/ lacks the year — should be Futurama (1999)/. Migration covered in doc 07.
The script's TV-title-extraction does not yet handle parent folders named Specials (mapping to Season 00). Workaround: rename the folder first, then run normalize. Codify in v2.
Edition detection priority list has been chosen by frequency-of-rip, not by canon. If a future Blade Runner gets a "Workprint Edition" release, the list grows.
No automated tests for normalize.py yet — covered by doc 07 once that doc lands.
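Until the Specials mapping is codified, the folder-rename workaround could be driven by a small helper like the sketch below. `canonical_season_folder()` and `SEASON_RE` are hypothetical names, and the accepted input spellings (`Specials`, `season 1`, `S02`) are assumptions. The helper only computes the target name; it does not touch the filesystem.

```python
import re

# Hypothetical pattern for loosely named season folders: "Season 1", "season_02", "S3".
SEASON_RE = re.compile(r"^(?:season|s)[ ._-]?(\d{1,3})$", re.IGNORECASE)

def canonical_season_folder(name):
    """Map a source season-folder name to 'Season NN', or None if unrecognized."""
    stripped = name.strip()
    if stripped.lower() == "specials":
        return "Season 00"  # specials live in Season 00
    m = SEASON_RE.match(stripped)
    if m:
        return f"Season {int(m.group(1)):02d}"
    return None  # e.g. 'Season 04 Part 1' or 'Extras': leave for manual review

print(canonical_season_folder("Specials"))          # Season 00
print(canonical_season_folder("season 1"))          # Season 01
print(canonical_season_folder("Season 04 Part 1"))  # None
```

Returning `None` for anything unrecognized keeps split-season folders such as `Season 04 Part 1` out of the automatic path, in line with § 12.8.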

End of doc 08. The script in § 11 is the canonical source of truth; this doc explains it. When in doubt, run normalize.py --help and read the top docstring.
