Token Classification
Transformers
ONNX
Safetensors
English
Japanese
Chinese
bert
anime
filename-parsing
Eval Results (legacy)
Instructions to use ModerRAS/AniFileBERT with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use ModerRAS/AniFileBERT with Transformers:
# Use a pipeline as a high-level helper from transformers import pipeline pipe = pipeline("token-classification", model="ModerRAS/AniFileBERT")# Load model directly from transformers import AutoTokenizer, AutoModelForTokenClassification tokenizer = AutoTokenizer.from_pretrained("ModerRAS/AniFileBERT") model = AutoModelForTokenClassification.from_pretrained("ModerRAS/AniFileBERT") - Notebooks
- Google Colab
- Kaggle
DMHY Template Labeling Requirements
This document records the current labeling contract for the DMHY template
metadata workflow. It is intentionally stricter than the old weak-label export:
precision is preferred over coverage, especially for TITLE, EPISODE, and
SEASON.
Source And Pipeline
- Source snapshot:
datasets/AnimeName/dmhy_list.jsonl. - Optional original source:
D:\WorkSpace\Python\dmhy-parser\dmhy_anime.db. - Template recipe generation and application live in
tools/rust_dmhy_template_apply. - Generated training JSONL rows must contain at least
filename,tokens,labels,template_id, andtemplate. - Reports and intermediate audits belong under
reports/; they are diagnostic artifacts, not authoritative dataset files.
Critical Label Semantics
TITLEis the anime/work title. It must be one contiguous span whenever a single title is being emitted.EPISODEis the episode number or explicit episode marker span.SEASONis the season/cour/part marker when the filename explicitly encodes season structure, such asS2,2nd Season,Second Season,第2季, orPart 5in series-part naming.GROUPis a release group or subtitle group, not the title.SOURCEcovers media/source/codec/language/platform-ish release metadata such asBDRip,WEB-DL,HEVC,AAC,Baha,CR,CHS,CHT,GB, andBIG5.RESOLUTIONcovers explicit resolution values such as720P,1080p, and1920x1080.SPECIALcovers non-episode extras such asNCOP,NCED,PV,CM,Menu,Trailer,Creditless ED, and movie/special numbering when it is not an episodic number.- Hash-like suffixes are retained as text in source filenames when useful, but they must not become entity labels in generated training data.
Title Rules
- Avoid duplicate titles. If the leaf filename already carries a complete title, season, and episode structure, drop redundant parent directory titles.
- If precision is uncertain, prefer skipping the row/template over producing a
duplicated or discontinuous
TITLE. - A title may contain punctuation or symbols. Internal title joiners must stay
inside the title span, including common ASCII separators and known Unicode
title punctuation such as
‐,–,—,$,∽,꞉, and♥. - Multiple title candidates in one filename should be handled explicitly: bilingual title aliases and special-program titles are allowed in rich review metadata, but the final weak training row should not emit arbitrary non-contiguous titles unless that structure has been reviewed.
- Generic prefixes such as
TV,TVアニメ, orアニメare not title when a real title follows.
Episode And Season Rules
TITLE,EPISODE, andSEASONare the highest-risk labels; errors here have higher training cost than dropping a row.SxxExxmeans season plus episode.Sidentifies season andEidentifies episode. If the tokenizer keepsS01E02as one compact token, project it to season and episode components during normalization; if split into marker and number tokens, the numeric value must carrySEASON/EPISODEand the marker may remain structuralO.01v2means episode01version2; the episode value must not be treated as title.- Episode ranges such as
01-13,#1-3, and CJK forms like第10話should remain episode spans. - Decimal episode-like values such as
14.5may be valid recap or midpoint episodes and should not be discarded only because they contain a decimal point. - Title-internal numbers stay in
TITLEwhen they are part of the work name, such asEien no 831,Zom 100, or movie titles likeMovie 27 The Million-Dollar Pentagram.
Path And Noise Rules
- BDMV expanded paths such as
BDMV/STREAM/00006are not useful training filenames and should be skipped. - Non-anime or abstract path data, including obvious
MTVpaths and tourism / railway program dumps, should be skipped. - Mojibake and encoding-noise rows should be skipped unless explicitly kept for diagnosis.
- Jellyfin-like paths (
Title/Season 1/E07 - Full Title ...) are valid, but the output should avoid duplicate title spans. - Parent directory context is allowed only when the leaf filename is too weak to identify the title; otherwise the leaf filename should dominate.
Review Strategy
- High-frequency templates affect training most and must be sampled more heavily.
- Low-frequency templates are gated conservatively; ambiguous cases are sent to review instead of generated training data.
- Middle-frequency templates should be audited by sampling a few examples from every template class, then grouping failures by rule rather than patching single examples blindly.
- A template can enter the generated training set only when its
TITLE,EPISODE, andSEASONbehavior is defensible across sampled rows.
Character Dataset Projection
- Regex-token JSONL is converted to character JSONL by projecting BIO labels:
first character keeps
B-X, later characters becomeI-X;OremainsO. - Punctuation tokens must remain independently represented before character projection so the model can learn filename structure boundaries.