DMHY Template Labeling Requirements

This document records the current labeling contract for the DMHY template metadata workflow. It is intentionally stricter than the old weak-label export: precision is preferred over coverage, especially for TITLE, EPISODE, and SEASON.

Source And Pipeline

Source snapshot: datasets/AnimeName/dmhy_list.jsonl.
Optional original source: D:\WorkSpace\Python\dmhy-parser\dmhy_anime.db.
Template recipe generation and application live in tools/rust_dmhy_template_apply.
Generated training JSONL rows must contain at least filename, tokens, labels, template_id, and template.
Reports and intermediate audits belong under reports/; they are diagnostic artifacts, not authoritative dataset files.
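
The required-field rule above can be checked mechanically. The following is a minimal sketch, not the pipeline's actual validator: the function names are hypothetical, and it assumes tokens and labels are parallel lists with one label per token, which the contract does not state explicitly.

```python
import json

# Field names come from the contract above; everything else is illustrative.
REQUIRED_KEYS = ("filename", "tokens", "labels", "template_id", "template")


def missing_keys(row: dict) -> list[str]:
    """Return the required keys that are absent from a parsed JSONL row."""
    return [key for key in REQUIRED_KEYS if key not in row]


def check_row(line: str) -> list[str]:
    """Parse one JSONL line and list contract problems (empty list means OK)."""
    row = json.loads(line)
    problems = [f"missing key: {key}" for key in missing_keys(row)]
    tokens, labels = row.get("tokens"), row.get("labels")
    # Assumption: tokens and labels are parallel lists, one label per token.
    if isinstance(tokens, list) and isinstance(labels, list):
        if len(tokens) != len(labels):
            problems.append("tokens and labels differ in length")
    return problems
```

A row that returns a non-empty list would be rejected or routed to review rather than written to the training set.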

Critical Label Semantics

TITLE is the anime/work title. It must be one contiguous span whenever a single title is being emitted.
EPISODE is the episode number or explicit episode marker span.
SEASON is the season/cour/part marker when the filename explicitly encodes season structure, such as S2, 2nd Season, Second Season, 第2季, or Part 5 in series-part naming.
GROUP is a release group or subtitle group, not the title.
SOURCE covers release metadata describing the media source, codec, language, or platform, such as BDRip, WEB-DL, HEVC, AAC, Baha, CR, CHS, CHT, GB, and BIG5.
RESOLUTION covers explicit resolution values such as 720P, 1080p, and 1920x1080.
SPECIAL covers non-episode extras such as NCOP, NCED, PV, CM, Menu, Trailer, Creditless ED, and movie/special numbering when it is not an episodic number.
Hash-like suffixes are retained as text in source filenames when useful, but they must not become entity labels in generated training data.

Title Rules

Avoid duplicate titles. If the leaf filename already carries a complete title, season, and episode structure, drop redundant parent directory titles.
If precision is uncertain, prefer skipping the row/template over producing a duplicated or discontinuous TITLE.
A title may contain punctuation or symbols. Internal title joiners must stay inside the title span, including common ASCII separators and known Unicode title punctuation such as ‐, –, —, ＄, ∽, ꞉, and ♥.
Multiple title candidates in one filename should be handled explicitly: bilingual title aliases and special-program titles are allowed in rich review metadata, but the final weak training row should not emit arbitrary non-contiguous titles unless that structure has been reviewed.
Generic prefixes such as TV, TVアニメ, or アニメ are not part of the title when a real title follows.
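
The contiguity requirement can be expressed as a check over a row's labels. This is a sketch under assumptions: it assumes labels use the B-TITLE / I-TITLE naming, that title-internal joiner tokens are labeled I-TITLE, and the helper names are hypothetical.

```python
def title_spans(labels: list[str]) -> list[tuple[int, int]]:
    """Return (start, end) token index pairs, end exclusive, for each TITLE span."""
    spans = []
    start = None
    for i, label in enumerate(labels):
        if label == "B-TITLE":
            # A new B- tag closes any span that is still open.
            if start is not None:
                spans.append((start, i))
            start = i
        elif label == "I-TITLE":
            # A stray I- tag without a preceding B- is treated as a new span.
            if start is None:
                start = i
        else:
            if start is not None:
                spans.append((start, i))
                start = None
    if start is not None:
        spans.append((start, len(labels)))
    return spans


def title_is_contiguous(labels: list[str]) -> bool:
    """True when the row emits at most one TITLE span."""
    return len(title_spans(labels)) <= 1
```

Rows that fail this check are candidates for skipping or review, in line with the preference for precision over coverage.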

Episode And Season Rules

TITLE, EPISODE, and SEASON are the highest-risk labels; a wrong label here costs more in training than dropping the row.
SxxExx means season plus episode. S identifies season and E identifies episode. If the tokenizer keeps S01E02 as one compact token, project it to season and episode components during normalization; if split into marker and number tokens, the numeric value must carry SEASON/EPISODE and the marker may remain structural O.
01v2 means episode 01 version 2; the episode value must not be treated as title.
Episode ranges such as 01-13, #1-3, and CJK forms like 第10話 should remain episode spans.
Decimal episode-like values such as 14.5 may be valid recap or midpoint episodes and should not be discarded only because they contain a decimal point.
Title-internal numbers stay in TITLE when they are part of the work name, such as Eien no 831, Zom 100, or movie titles like Movie 27 The Million-Dollar Pentagram.
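
The compact-token cases above (S01E02 and 01v2) can be sketched as a small projection step. The regular expressions and the function name are illustrative assumptions, not the normalizer's actual implementation, and leaving the version suffix as O is also an assumption.

```python
import re

# Illustrative patterns only; the real tokenizer and normalizer may differ.
SXXEXX = re.compile(r"^[Ss](\d{1,2})[Ee](\d{1,4})$")
EP_VERSION = re.compile(r"^(\d{1,4})[vV](\d+)$")


def project_compact_token(token: str):
    """Split a compact token into (text, label) parts, or return None."""
    m = SXXEXX.match(token)
    if m:
        # Markers stay structural O; the numeric values carry the labels.
        season_marker = token[0]
        episode_marker = token[m.start(2) - 1]
        return [
            (season_marker, "O"),
            (m.group(1), "B-SEASON"),
            (episode_marker, "O"),
            (m.group(2), "B-EPISODE"),
        ]
    m = EP_VERSION.match(token)
    if m:
        # Assumption: the version suffix (e.g. "v2") is left as O.
        return [(m.group(1), "B-EPISODE"), (token[m.end(1):], "O")]
    # Plain numbers such as the 831 in "Eien no 831" are not touched here.
    return None
```

Returning None for anything that does not match keeps title-internal numbers out of this step, so they can stay inside TITLE.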

Path And Noise Rules

BDMV expanded paths such as BDMV/STREAM/00006 are not useful training filenames and should be skipped.
Non-anime or abstract path data, including obvious MTV paths and tourism / railway program dumps, should be skipped.
Mojibake and encoding-noise rows should be skipped unless explicitly kept for diagnosis.
Jellyfin-like paths (Title/Season 1/E07 - Full Title ...) are valid, but the output should avoid duplicate title spans.
Parent directory context is allowed only when the leaf filename is too weak to identify the title; otherwise the leaf filename should dominate.
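
Two of the skip rules above lend themselves to a simple predicate. This is a rough sketch: the function name is hypothetical, the U+FFFD test is only an assumed cheap signal for encoding noise, and non-anime detection is not covered.

```python
import re

# Matches BDMV/STREAM/<digits> with either slash style.
BDMV_STREAM = re.compile(r"(^|[\\/])BDMV[\\/]STREAM[\\/]\d+", re.IGNORECASE)


def should_skip_path(path: str) -> bool:
    """Return True for paths that should not become training filenames."""
    if BDMV_STREAM.search(path):
        return True
    # Assumption: the Unicode replacement character marks a mojibake row.
    if "\ufffd" in path:
        return True
    return False
```

Jellyfin-like paths pass this predicate; deduplicating their title spans is a separate step.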

Review Strategy

High-frequency templates affect training most and must be sampled more heavily.
Low-frequency templates are gated conservatively; ambiguous cases are sent to review instead of being emitted as generated training data.
Middle-frequency templates should be audited by sampling a few examples from every template class, then grouping failures by rule rather than patching single examples blindly.
A template can enter the generated training set only when its TITLE, EPISODE, and SEASON behavior is defensible across sampled rows.

Character Dataset Projection

Regex-token JSONL is converted to character JSONL by projecting BIO labels onto characters: for a B-X token, the first character keeps B-X and later characters become I-X; every character of an I-X token is I-X; O remains O.
Punctuation tokens must remain independently represented before character projection so the model can learn filename structure boundaries.
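
The projection can be sketched in a few lines. This assumes tokens and labels are parallel lists and that every character of the filename, including punctuation and whitespace, belongs to some token; the function name is hypothetical.

```python
def project_to_chars(tokens: list[str], labels: list[str]):
    """Expand token-level BIO labels to one label per character."""
    chars, char_labels = [], []
    for token, label in zip(tokens, labels):
        for offset, ch in enumerate(token):
            chars.append(ch)
            if label.startswith("B-") and offset > 0:
                # Only the first character of a B-X token keeps B-X.
                char_labels.append("I-" + label[2:])
            else:
                # I-X and O tokens pass their label to every character.
                char_labels.append(label)
    return chars, char_labels
```

For example, the tokens `["[", "Sub", "]", "Zom", " ", "100"]` with labels `["O", "B-GROUP", "O", "B-TITLE", "I-TITLE", "I-TITLE"]` yield twelve characters, where the brackets stay O and only the `S` and `Z` carry B- tags.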