DMHY Template Labeling Requirements

This document records the current labeling contract for the DMHY template metadata workflow. It is intentionally stricter than the old weak-label export: precision is preferred over coverage, especially for TITLE, EPISODE, and SEASON.

Source And Pipeline

  • Source snapshot: datasets/AnimeName/dmhy_list.jsonl.
  • Optional original source: D:\WorkSpace\Python\dmhy-parser\dmhy_anime.db.
  • Template recipe generation and application live in tools/rust_dmhy_template_apply.
  • Generated training JSONL rows must contain at least filename, tokens, labels, template_id, and template.
  • Reports and intermediate audits belong under reports/; they are diagnostic artifacts, not authoritative dataset files.
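
The required-field rule above can be checked mechanically. The following is a minimal sketch, not the project's actual validator: the field names come from the list above, while the helper names and the assumption that tokens and labels are equal-length lists are illustrative.

```python
import json

# Field names are taken from the requirement above.
REQUIRED_FIELDS = ("filename", "tokens", "labels", "template_id", "template")


def missing_fields(row: dict) -> list:
    """Return the required fields that are absent from a parsed JSONL row."""
    return [name for name in REQUIRED_FIELDS if name not in row]


def is_valid_row(line: str) -> bool:
    """Check one JSONL line: every required field is present and,
    as an assumption of this sketch, tokens and labels have equal length."""
    row = json.loads(line)
    if missing_fields(row):
        return False
    return len(row["tokens"]) == len(row["labels"])
```

A check like this would run on generated training rows only; files under reports/ are diagnostic and need not satisfy it.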

Critical Label Semantics

  • TITLE is the anime/work title. It must be one contiguous span whenever a single title is being emitted.
  • EPISODE is the episode number or explicit episode marker span.
  • SEASON is the season/cour/part marker when the filename explicitly encodes season structure, such as S2, 2nd Season, Second Season, 第2季, or Part 5 in series-part naming.
  • GROUP is a release group or subtitle group, not the title.
  • SOURCE covers media/source/codec/language/platform-ish release metadata such as BDRip, WEB-DL, HEVC, AAC, Baha, CR, CHS, CHT, GB, and BIG5.
  • RESOLUTION covers explicit resolution values such as 720P, 1080p, and 1920x1080.
  • SPECIAL covers non-episode extras such as NCOP, NCED, PV, CM, Menu, Trailer, Creditless ED, and movie/special numbering when it is not an episodic number.
  • Hash-like suffixes are retained as text in source filenames when useful, but they must not become entity labels in generated training data.
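
As an illustration of how narrow a label definition can be kept, the RESOLUTION examples above (720P, 1080p, 1920x1080) fit a small pattern. This is a hedged sketch: the regex and the function name are assumptions made for illustration, and the real recipes may accept other forms.

```python
import re

# Matches forms like 720P / 1080p and 1920x1080; other forms are out of scope here.
RESOLUTION = re.compile(r"^(?:\d{3,4}[pP]|\d{3,4}[xX]\d{3,4})$")


def looks_like_resolution(token: str) -> bool:
    """Return True when a single token has one of the explicit resolution shapes."""
    return RESOLUTION.match(token) is not None
```

A bare number such as 831 does not match, which keeps title-internal numbers from being pulled into RESOLUTION.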

Title Rules

  • Avoid duplicate titles. If the leaf filename already carries a complete title, season, and episode structure, drop redundant parent directory titles.
  • If precision is uncertain, prefer skipping the row/template over producing a duplicated or discontinuous TITLE.
  • A title may contain punctuation or symbols. Internal title joiners must stay inside the title span, including common ASCII separators and known Unicode title punctuation.
  • Multiple title candidates in one filename should be handled explicitly: bilingual title aliases and special-program titles are allowed in rich review metadata, but the final weak training row should not emit arbitrary non-contiguous titles unless that structure has been reviewed.
  • Generic prefixes such as TV, TVアニメ, or アニメ are not part of the title when a real title follows.
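
The contiguity requirement can be audited directly on a label sequence. The sketch below assumes token-level labels are BIO-encoded as B-TITLE / I-TITLE; the function names are hypothetical.

```python
def title_spans(labels: list) -> list:
    """Return (start, end_exclusive) index pairs, one per TITLE span."""
    spans = []
    start = None
    for i, label in enumerate(labels):
        is_title = label in ("B-TITLE", "I-TITLE")
        # A span opens on B-TITLE, or on I-TITLE when no span is open.
        if is_title and (label == "B-TITLE" or start is None):
            if start is not None:
                spans.append((start, i))
            start = i
        elif not is_title and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(labels)))
    return spans


def has_single_contiguous_title(labels: list) -> bool:
    """True when the row emits exactly one TITLE span."""
    return len(title_spans(labels)) == 1
```

A row with two or more spans is a candidate for skipping or review rather than for training output.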

Episode And Season Rules

  • TITLE, EPISODE, and SEASON are the highest-risk labels; errors here have higher training cost than dropping a row.
  • SxxExx means season plus episode. S identifies season and E identifies episode. If the tokenizer keeps S01E02 as one compact token, project it to season and episode components during normalization; if split into marker and number tokens, the numeric value must carry SEASON/EPISODE and the marker may remain structural O.
  • 01v2 means episode 01 version 2; the episode value must not be treated as title.
  • Episode ranges such as 01-13, #1-3, and CJK forms like 第10話 should remain episode spans.
  • Decimal episode-like values such as 14.5 may be valid recap or midpoint episodes and should not be discarded only because they contain a decimal point.
  • Title-internal numbers stay in TITLE when they are part of the work name, such as Eien no 831, Zom 100, or movie titles like Movie 27 The Million-Dollar Pentagram.
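
The SxxExx and 01v2 rules can be sketched as small token-level helpers. This is one possible reading of the normalization step, not the tool's implementation: the regexes, the digit limits, and the choice to emit markers as O are assumptions.

```python
import re

SXXEXX = re.compile(r"^([Ss])(\d{1,2})([Ee])(\d{1,4})$")
EP_VERSION = re.compile(r"^(\d{1,4})[vV](\d+)$")


def split_compact_season_episode(token: str):
    """Split a compact token such as 'S01E02' into (text, label) pieces.

    Numbers carry SEASON/EPISODE; the S and E markers stay structural O.
    Returns None when the token is not in SxxExx form.
    """
    m = SXXEXX.match(token)
    if m is None:
        return None
    return [
        (m.group(1), "O"),
        (m.group(2), "SEASON"),
        (m.group(3), "O"),
        (m.group(4), "EPISODE"),
    ]


def split_episode_version(token: str):
    """Split '01v2' into (episode_text, version_number), or return None."""
    m = EP_VERSION.match(token)
    if m is None:
        return None
    return m.group(1), int(m.group(2))
```

Neither helper decides whether a bare number belongs to TITLE; cases such as Zom 100 still need the template context.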

Path And Noise Rules

  • BDMV expanded paths such as BDMV/STREAM/00006 are not useful training filenames and should be skipped.
  • Non-anime or abstract path data, including obvious MTV paths and tourism / railway program dumps, should be skipped.
  • Mojibake and encoding-noise rows should be skipped unless explicitly kept for diagnosis.
  • Jellyfin-like paths (Title/Season 1/E07 - Full Title ...) are valid, but the output should avoid duplicate title spans.
  • Parent directory context is allowed only when the leaf filename is too weak to identify the title; otherwise the leaf filename should dominate.
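
Two of the path rules lend themselves to a short sketch: skipping expanded BDMV paths and isolating the leaf filename so it can dominate. The pattern and the function names below are assumptions; the other skip rules (non-anime dumps, mojibake) need heuristics that are not shown here.

```python
import re

# Matches a BDMV/STREAM/ directory pair anywhere in the path, with / or \ separators.
BDMV_STREAM = re.compile(r"(^|[\\/])BDMV[\\/]STREAM[\\/]", re.IGNORECASE)


def is_bdmv_expanded_path(path: str) -> bool:
    """True for expanded disc paths such as '.../BDMV/STREAM/00006'."""
    return BDMV_STREAM.search(path) is not None


def leaf_filename(path: str) -> str:
    """Return the last path component, splitting on both / and \\."""
    return re.split(r"[\\/]", path)[-1]
```

For a Jellyfin-like path, leaf_filename yields the "E07 - Full Title ..." part, and the parent directory title is consulted only if that leaf is too weak.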

Review Strategy

  • High-frequency templates affect training most and must be sampled more heavily.
  • Low-frequency templates are gated conservatively; ambiguous cases are sent to review instead of into the generated training data.
  • Middle-frequency templates should be audited by sampling a few examples from every template class, then grouping failures by rule rather than patching single examples blindly.
  • A template can enter the generated training set only when its TITLE, EPISODE, and SEASON behavior is defensible across sampled rows.
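
The audit loop for middle-frequency templates (sample a few rows per template, then group failures by rule) could look roughly like this. It is a sketch under assumptions: the per-template sample size, the seed, and the (rule_name, filename) failure format are illustrative and not part of the workflow as documented.

```python
import random
from collections import defaultdict


def sample_per_template(rows, per_template=3, seed=0):
    """Pick up to `per_template` rows from every template_id, reproducibly."""
    by_template = defaultdict(list)
    for row in rows:
        by_template[row["template_id"]].append(row)
    rng = random.Random(seed)
    return {
        tid: rng.sample(group, min(per_template, len(group)))
        for tid, group in by_template.items()
    }


def group_failures_by_rule(failures):
    """Group (rule_name, filename) pairs so fixes target rules, not single rows."""
    grouped = defaultdict(list)
    for rule_name, filename in failures:
        grouped[rule_name].append(filename)
    return dict(grouped)
```

Sampling more heavily for high-frequency templates would mean raising per_template for them; how much is left open here.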

Character Dataset Projection

  • Regex-token JSONL is converted to character JSONL by projecting BIO labels: first character keeps B-X, later characters become I-X; O remains O.
  • Punctuation tokens must remain independently represented before character projection so the model can learn filename structure boundaries.
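
The projection rule can be written in a few lines. This is a minimal sketch: the function name is hypothetical, and treating every character of an I-X token as I-X is an assumption that follows from standard BIO conventions rather than from the text above.

```python
def project_to_chars(tokens, labels):
    """Project token-level BIO labels onto characters.

    For a B-X token the first character keeps B-X and the rest become I-X.
    For an I-X token every character is I-X. O stays O.
    """
    chars, char_labels = [], []
    for token, label in zip(tokens, labels):
        for i, ch in enumerate(token):
            chars.append(ch)
            if label == "O":
                char_labels.append("O")
            elif i == 0 and label.startswith("B-"):
                char_labels.append(label)
            else:
                char_labels.append("I-" + label[2:])
    return chars, char_labels
```

For example, tokens ["Zom", "-", "01"] with labels ["B-TITLE", "O", "B-EPISODE"] project to B-TITLE, I-TITLE, I-TITLE, O, B-EPISODE, I-EPISODE. The punctuation token keeps its own O character, which is why punctuation must stay a separate token before projection.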