# DMHY Template Labeling Requirements This document records the current labeling contract for the DMHY template metadata workflow. It is intentionally stricter than the old weak-label export: precision is preferred over coverage, especially for `TITLE`, `EPISODE`, and `SEASON`. ## Source And Pipeline - Source snapshot: `datasets/AnimeName/dmhy_list.jsonl`. - Optional original source: `D:\WorkSpace\Python\dmhy-parser\dmhy_anime.db`. - Template recipe generation and application live in `tools/rust_dmhy_template_apply`. - Generated training JSONL rows must contain at least `filename`, `tokens`, `labels`, `template_id`, and `template`. - Reports and intermediate audits belong under `reports/`; they are diagnostic artifacts, not authoritative dataset files. ## Critical Label Semantics - `TITLE` is the anime/work title. It must be one contiguous span whenever a single title is being emitted. - `EPISODE` is the episode number or explicit episode marker span. - `SEASON` is the season/cour/part marker when the filename explicitly encodes season structure, such as `S2`, `2nd Season`, `Second Season`, `第2季`, or `Part 5` in series-part naming. - `GROUP` is a release group or subtitle group, not the title. - `SOURCE` covers media/source/codec/language/platform-ish release metadata such as `BDRip`, `WEB-DL`, `HEVC`, `AAC`, `Baha`, `CR`, `CHS`, `CHT`, `GB`, and `BIG5`. - `RESOLUTION` covers explicit resolution values such as `720P`, `1080p`, and `1920x1080`. - `SPECIAL` covers non-episode extras such as `NCOP`, `NCED`, `PV`, `CM`, `Menu`, `Trailer`, `Creditless ED`, and movie/special numbering when it is not an episodic number. - Hash-like suffixes are retained as text in source filenames when useful, but they must not become entity labels in generated training data. ## Title Rules - Avoid duplicate titles. If the leaf filename already carries a complete title, season, and episode structure, drop redundant parent directory titles. - If precision is uncertain, prefer skipping the row/template over producing a duplicated or discontinuous `TITLE`. - A title may contain punctuation or symbols. Internal title joiners must stay inside the title span, including common ASCII separators and known Unicode title punctuation such as `‐`, `–`, `—`, `$`, `∽`, `꞉`, and `♥`. - Multiple title candidates in one filename should be handled explicitly: bilingual title aliases and special-program titles are allowed in rich review metadata, but the final weak training row should not emit arbitrary non-contiguous titles unless that structure has been reviewed. - Generic prefixes such as `TV`, `TVアニメ`, or `アニメ` are not title when a real title follows. ## Episode And Season Rules - `TITLE`, `EPISODE`, and `SEASON` are the highest-risk labels; errors here have higher training cost than dropping a row. - `SxxExx` means season plus episode. `S` identifies season and `E` identifies episode. If the tokenizer keeps `S01E02` as one compact token, project it to season and episode components during normalization; if split into marker and number tokens, the numeric value must carry `SEASON`/`EPISODE` and the marker may remain structural `O`. - `01v2` means episode `01` version `2`; the episode value must not be treated as title. - Episode ranges such as `01-13`, `#1-3`, and CJK forms like `第10話` should remain episode spans. - Decimal episode-like values such as `14.5` may be valid recap or midpoint episodes and should not be discarded only because they contain a decimal point. - Title-internal numbers stay in `TITLE` when they are part of the work name, such as `Eien no 831`, `Zom 100`, or movie titles like `Movie 27 The Million-Dollar Pentagram`. ## Path And Noise Rules - BDMV expanded paths such as `BDMV/STREAM/00006` are not useful training filenames and should be skipped. - Non-anime or abstract path data, including obvious `MTV` paths and tourism / railway program dumps, should be skipped. - Mojibake and encoding-noise rows should be skipped unless explicitly kept for diagnosis. - Jellyfin-like paths (`Title/Season 1/E07 - Full Title ...`) are valid, but the output should avoid duplicate title spans. - Parent directory context is allowed only when the leaf filename is too weak to identify the title; otherwise the leaf filename should dominate. ## Review Strategy - High-frequency templates affect training most and must be sampled more heavily. - Low-frequency templates are gated conservatively; ambiguous cases are sent to review instead of generated training data. - Middle-frequency templates should be audited by sampling a few examples from every template class, then grouping failures by rule rather than patching single examples blindly. - A template can enter the generated training set only when its `TITLE`, `EPISODE`, and `SEASON` behavior is defensible across sampled rows. ## Character Dataset Projection - Regex-token JSONL is converted to character JSONL by projecting BIO labels: first character keeps `B-X`, later characters become `I-X`; `O` remains `O`. - Punctuation tokens must remain independently represented before character projection so the model can learn filename structure boundaries.