AniFileBERT / docs /dmhy_template_labeling.md
ModerRAS's picture
Document and tighten DMHY template labels
d92b315
|
raw
history blame
5.28 kB
# DMHY Template Labeling Requirements
This document records the current labeling contract for the DMHY template
metadata workflow. It is intentionally stricter than the old weak-label export:
precision is preferred over coverage, especially for `TITLE`, `EPISODE`, and
`SEASON`.
## Source And Pipeline
- Source snapshot: `datasets/AnimeName/dmhy_list.jsonl`.
- Optional original source: `D:\WorkSpace\Python\dmhy-parser\dmhy_anime.db`.
- Template recipe generation and application live in
`tools/rust_dmhy_template_apply`.
- Generated training JSONL rows must contain at least `filename`, `tokens`,
`labels`, `template_id`, and `template`.
- Reports and intermediate audits belong under `reports/`; they are diagnostic
artifacts, not authoritative dataset files.
## Critical Label Semantics
- `TITLE` is the anime/work title. It must be one contiguous span whenever a
single title is being emitted.
- `EPISODE` is the episode number or explicit episode marker span.
- `SEASON` is the season/cour/part marker when the filename explicitly encodes
season structure, such as `S2`, `2nd Season`, `Second Season`, `第2季`, or
`Part 5` in series-part naming.
- `GROUP` is a release group or subtitle group, not the title.
- `SOURCE` covers media/source/codec/language/platform-ish release metadata
such as `BDRip`, `WEB-DL`, `HEVC`, `AAC`, `Baha`, `CR`, `CHS`, `CHT`, `GB`,
and `BIG5`.
- `RESOLUTION` covers explicit resolution values such as `720P`, `1080p`, and
`1920x1080`.
- `SPECIAL` covers non-episode extras such as `NCOP`, `NCED`, `PV`, `CM`,
`Menu`, `Trailer`, `Creditless ED`, and movie/special numbering when it is not
an episodic number.
- Hash-like suffixes are retained as text in source filenames when useful, but
they must not become entity labels in generated training data.
## Title Rules
- Avoid duplicate titles. If the leaf filename already carries a complete title,
season, and episode structure, drop redundant parent directory titles.
- If precision is uncertain, prefer skipping the row/template over producing a
duplicated or discontinuous `TITLE`.
- A title may contain punctuation or symbols. Internal title joiners must stay
inside the title span, including common ASCII separators and known Unicode
title punctuation such as `‐`, `–`, `—`, `$`, `∽`, `꞉`, and `♥`.
- Multiple title candidates in one filename should be handled explicitly:
bilingual title aliases and special-program titles are allowed in rich review
metadata, but the final weak training row should not emit arbitrary
non-contiguous titles unless that structure has been reviewed.
- Generic prefixes such as `TV`, `TVアニメ`, or `アニメ` are not title when a real
title follows.
## Episode And Season Rules
- `TITLE`, `EPISODE`, and `SEASON` are the highest-risk labels; errors here have
higher training cost than dropping a row.
- `SxxExx` means season plus episode. `S` identifies season and `E` identifies
episode. If the tokenizer keeps `S01E02` as one compact token, project it to
season and episode components during normalization; if split into marker and
number tokens, the numeric value must carry `SEASON`/`EPISODE` and the marker
may remain structural `O`.
- `01v2` means episode `01` version `2`; the episode value must not be treated
as title.
- Episode ranges such as `01-13`, `#1-3`, and CJK forms like `第10話` should
remain episode spans.
- Decimal episode-like values such as `14.5` may be valid recap or midpoint
episodes and should not be discarded only because they contain a decimal point.
- Title-internal numbers stay in `TITLE` when they are part of the work name,
such as `Eien no 831`, `Zom 100`, or movie titles like `Movie 27 The
Million-Dollar Pentagram`.
## Path And Noise Rules
- BDMV expanded paths such as `BDMV/STREAM/00006` are not useful training
filenames and should be skipped.
- Non-anime or abstract path data, including obvious `MTV` paths and tourism /
railway program dumps, should be skipped.
- Mojibake and encoding-noise rows should be skipped unless explicitly kept for
diagnosis.
- Jellyfin-like paths (`Title/Season 1/E07 - Full Title ...`) are valid, but
the output should avoid duplicate title spans.
- Parent directory context is allowed only when the leaf filename is too weak to
identify the title; otherwise the leaf filename should dominate.
## Review Strategy
- High-frequency templates affect training most and must be sampled more heavily.
- Low-frequency templates are gated conservatively; ambiguous cases are sent to
review instead of generated training data.
- Middle-frequency templates should be audited by sampling a few examples from
every template class, then grouping failures by rule rather than patching
single examples blindly.
- A template can enter the generated training set only when its `TITLE`,
`EPISODE`, and `SEASON` behavior is defensible across sampled rows.
## Character Dataset Projection
- Regex-token JSONL is converted to character JSONL by projecting BIO labels:
first character keeps `B-X`, later characters become `I-X`; `O` remains `O`.
- Punctuation tokens must remain independently represented before character
projection so the model can learn filename structure boundaries.