AniFileBERT / docs /dmhy_template_labeling.md

Document and tighten DMHY template labels

d92b315 8 days ago

5.28 kB

	# DMHY Template Labeling Requirements

	This document records the current labeling contract for the DMHY template
	metadata workflow. It is intentionally stricter than the old weak-label export:
	precision is preferred over coverage, especially for `TITLE`, `EPISODE`, and
	`SEASON`.

	## Source And Pipeline

	- Source snapshot: `datasets/AnimeName/dmhy_list.jsonl`.
	- Optional original source: `D:\WorkSpace\Python\dmhy-parser\dmhy_anime.db`.
	- Template recipe generation and application live in
	`tools/rust_dmhy_template_apply`.
	- Generated training JSONL rows must contain at least `filename`, `tokens`,
	`labels`, `template_id`, and `template`.
	- Reports and intermediate audits belong under `reports/`; they are diagnostic
	artifacts, not authoritative dataset files.

	## Critical Label Semantics

	- `TITLE` is the anime/work title. It must be one contiguous span whenever a
	single title is being emitted.
	- `EPISODE` is the episode number or explicit episode marker span.
	- `SEASON` is the season/cour/part marker when the filename explicitly encodes
	season structure, such as `S2`, `2nd Season`, `Second Season`, `第2季`, or
	`Part 5` in series-part naming.
	- `GROUP` is a release group or subtitle group, not the title.
	- `SOURCE` covers media/source/codec/language/platform-ish release metadata
	such as `BDRip`, `WEB-DL`, `HEVC`, `AAC`, `Baha`, `CR`, `CHS`, `CHT`, `GB`,
	and `BIG5`.
	- `RESOLUTION` covers explicit resolution values such as `720P`, `1080p`, and
	`1920x1080`.
	- `SPECIAL` covers non-episode extras such as `NCOP`, `NCED`, `PV`, `CM`,
	`Menu`, `Trailer`, `Creditless ED`, and movie/special numbering when it is not
	an episodic number.
	- Hash-like suffixes are retained as text in source filenames when useful, but
	they must not become entity labels in generated training data.

	## Title Rules

	- Avoid duplicate titles. If the leaf filename already carries a complete title,
	season, and episode structure, drop redundant parent directory titles.
	- If precision is uncertain, prefer skipping the row/template over producing a
	duplicated or discontinuous `TITLE`.
	- A title may contain punctuation or symbols. Internal title joiners must stay
	inside the title span, including common ASCII separators and known Unicode
	title punctuation such as `‐`, `–`, `—`, `＄`, `∽`, `꞉`, and `♥`.
	- Multiple title candidates in one filename should be handled explicitly:
	bilingual title aliases and special-program titles are allowed in rich review
	metadata, but the final weak training row should not emit arbitrary
	non-contiguous titles unless that structure has been reviewed.
	- Generic prefixes such as `TV`, `TVアニメ`, or `アニメ` are not title when a real
	title follows.

	## Episode And Season Rules

	- `TITLE`, `EPISODE`, and `SEASON` are the highest-risk labels; errors here have
	higher training cost than dropping a row.
	- `SxxExx` means season plus episode. `S` identifies season and `E` identifies
	episode. If the tokenizer keeps `S01E02` as one compact token, project it to
	season and episode components during normalization; if split into marker and
	number tokens, the numeric value must carry `SEASON`/`EPISODE` and the marker
	may remain structural `O`.
	- `01v2` means episode `01` version `2`; the episode value must not be treated
	as title.
	- Episode ranges such as `01-13`, `#1-3`, and CJK forms like `第10話` should
	remain episode spans.
	- Decimal episode-like values such as `14.5` may be valid recap or midpoint
	episodes and should not be discarded only because they contain a decimal point.
	- Title-internal numbers stay in `TITLE` when they are part of the work name,
	such as `Eien no 831`, `Zom 100`, or movie titles like `Movie 27 The
	Million-Dollar Pentagram`.

	## Path And Noise Rules

	- BDMV expanded paths such as `BDMV/STREAM/00006` are not useful training
	filenames and should be skipped.
	- Non-anime or abstract path data, including obvious `MTV` paths and tourism /
	railway program dumps, should be skipped.
	- Mojibake and encoding-noise rows should be skipped unless explicitly kept for
	diagnosis.
	- Jellyfin-like paths (`Title/Season 1/E07 - Full Title ...`) are valid, but
	the output should avoid duplicate title spans.
	- Parent directory context is allowed only when the leaf filename is too weak to
	identify the title; otherwise the leaf filename should dominate.

	## Review Strategy

	- High-frequency templates affect training most and must be sampled more heavily.
	- Low-frequency templates are gated conservatively; ambiguous cases are sent to
	review instead of generated training data.
	- Middle-frequency templates should be audited by sampling a few examples from
	every template class, then grouping failures by rule rather than patching
	single examples blindly.
	- A template can enter the generated training set only when its `TITLE`,
	`EPISODE`, and `SEASON` behavior is defensible across sampled rows.

	## Character Dataset Projection

	- Regex-token JSONL is converted to character JSONL by projecting BIO labels:
	first character keeps `B-X`, later characters become `I-X`; `O` remains `O`.
	- Punctuation tokens must remain independently represented before character
	projection so the model can learn filename structure boundaries.