File size: 5,280 Bytes

d92b315

# DMHY Template Labeling Requirements

This document records the current labeling contract for the DMHY template
metadata workflow. It is intentionally stricter than the old weak-label export:
precision is preferred over coverage, especially for `TITLE`, `EPISODE`, and
`SEASON`.

## Source And Pipeline

- Source snapshot: `datasets/AnimeName/dmhy_list.jsonl`.
- Optional original source: `D:\WorkSpace\Python\dmhy-parser\dmhy_anime.db`.
- Template recipe generation and application live in
  `tools/rust_dmhy_template_apply`.
- Generated training JSONL rows must contain at least `filename`, `tokens`,
  `labels`, `template_id`, and `template`.
- Reports and intermediate audits belong under `reports/`; they are diagnostic
  artifacts, not authoritative dataset files.

## Critical Label Semantics

- `TITLE` is the anime/work title. It must be one contiguous span whenever a
  single title is being emitted.
- `EPISODE` is the episode number or explicit episode marker span.
- `SEASON` is the season/cour/part marker when the filename explicitly encodes
  season structure, such as `S2`, `2nd Season`, `Second Season`, `第2季`, or
  `Part 5` in series-part naming.
- `GROUP` is a release group or subtitle group, not the title.
- `SOURCE` covers media/source/codec/language/platform-ish release metadata
  such as `BDRip`, `WEB-DL`, `HEVC`, `AAC`, `Baha`, `CR`, `CHS`, `CHT`, `GB`,
  and `BIG5`.
- `RESOLUTION` covers explicit resolution values such as `720P`, `1080p`, and
  `1920x1080`.
- `SPECIAL` covers non-episode extras such as `NCOP`, `NCED`, `PV`, `CM`,
  `Menu`, `Trailer`, `Creditless ED`, and movie/special numbering when it is not
  an episodic number.
- Hash-like suffixes are retained as text in source filenames when useful, but
  they must not become entity labels in generated training data.

## Title Rules

- Avoid duplicate titles. If the leaf filename already carries a complete title,
  season, and episode structure, drop redundant parent directory titles.
- If precision is uncertain, prefer skipping the row/template over producing a
  duplicated or discontinuous `TITLE`.
- A title may contain punctuation or symbols. Internal title joiners must stay
  inside the title span, including common ASCII separators and known Unicode
  title punctuation such as `‐`, `–`, `—`, `＄`, `∽`, `꞉`, and `♥`.
- Multiple title candidates in one filename should be handled explicitly:
  bilingual title aliases and special-program titles are allowed in rich review
  metadata, but the final weak training row should not emit arbitrary
  non-contiguous titles unless that structure has been reviewed.
- Generic prefixes such as `TV`, `TVアニメ`, or `アニメ` are not title when a real
  title follows.

## Episode And Season Rules

- `TITLE`, `EPISODE`, and `SEASON` are the highest-risk labels; errors here have
  higher training cost than dropping a row.
- `SxxExx` means season plus episode. `S` identifies season and `E` identifies
  episode. If the tokenizer keeps `S01E02` as one compact token, project it to
  season and episode components during normalization; if split into marker and
  number tokens, the numeric value must carry `SEASON`/`EPISODE` and the marker
  may remain structural `O`.
- `01v2` means episode `01` version `2`; the episode value must not be treated
  as title.
- Episode ranges such as `01-13`, `#1-3`, and CJK forms like `第10話` should
  remain episode spans.
- Decimal episode-like values such as `14.5` may be valid recap or midpoint
  episodes and should not be discarded only because they contain a decimal point.
- Title-internal numbers stay in `TITLE` when they are part of the work name,
  such as `Eien no 831`, `Zom 100`, or movie titles like `Movie 27 The
  Million-Dollar Pentagram`.

## Path And Noise Rules

- BDMV expanded paths such as `BDMV/STREAM/00006` are not useful training
  filenames and should be skipped.
- Non-anime or abstract path data, including obvious `MTV` paths and tourism /
  railway program dumps, should be skipped.
- Mojibake and encoding-noise rows should be skipped unless explicitly kept for
  diagnosis.
- Jellyfin-like paths (`Title/Season 1/E07 - Full Title ...`) are valid, but
  the output should avoid duplicate title spans.
- Parent directory context is allowed only when the leaf filename is too weak to
  identify the title; otherwise the leaf filename should dominate.

## Review Strategy

- High-frequency templates affect training most and must be sampled more heavily.
- Low-frequency templates are gated conservatively; ambiguous cases are sent to
  review instead of generated training data.
- Middle-frequency templates should be audited by sampling a few examples from
  every template class, then grouping failures by rule rather than patching
  single examples blindly.
- A template can enter the generated training set only when its `TITLE`,
  `EPISODE`, and `SEASON` behavior is defensible across sampled rows.

## Character Dataset Projection

- Regex-token JSONL is converted to character JSONL by projecting BIO labels:
  first character keeps `B-X`, later characters become `I-X`; `O` remains `O`.
- Punctuation tokens must remain independently represented before character
  projection so the model can learn filename structure boundaries.