Token Classification
Transformers
ONNX
Safetensors
English
Japanese
Chinese
bert
anime
filename-parsing
Eval Results (legacy)
Instructions to use ModerRAS/AniFileBERT with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use ModerRAS/AniFileBERT with Transformers:
# Use a pipeline as a high-level helper from transformers import pipeline pipe = pipeline("token-classification", model="ModerRAS/AniFileBERT")# Load model directly from transformers import AutoTokenizer, AutoModelForTokenClassification tokenizer = AutoTokenizer.from_pretrained("ModerRAS/AniFileBERT") model = AutoModelForTokenClassification.from_pretrained("ModerRAS/AniFileBERT") - Notebooks
- Google Colab
- Kaggle
| # DMHY Template Labeling Requirements | |
| This document records the current labeling contract for the DMHY template | |
| metadata workflow. It is intentionally stricter than the old weak-label export: | |
| precision is preferred over coverage, especially for `TITLE`, `EPISODE`, and | |
| `SEASON`. | |
| ## Source And Pipeline | |
| - Source snapshot: `datasets/AnimeName/dmhy_list.jsonl`. | |
| - Optional original source: `D:\WorkSpace\Python\dmhy-parser\dmhy_anime.db`. | |
| - Template recipe generation and application live in | |
| `tools/rust_dmhy_template_apply`. | |
| - Generated training JSONL rows must contain at least `filename`, `tokens`, | |
| `labels`, `template_id`, and `template`. | |
| - Reports and intermediate audits belong under `reports/`; they are diagnostic | |
| artifacts, not authoritative dataset files. | |
| ## Critical Label Semantics | |
| - `TITLE` is the anime/work title. It must be one contiguous span whenever a | |
| single title is being emitted. | |
| - `EPISODE` is the episode number or explicit episode marker span. | |
| - `SEASON` is the season/cour/part marker when the filename explicitly encodes | |
| season structure, such as `S2`, `2nd Season`, `Second Season`, `第2季`, or | |
| `Part 5` in series-part naming. | |
| - `GROUP` is a release group or subtitle group, not the title. | |
| - `SOURCE` covers media/source/codec/language/platform-ish release metadata | |
| such as `BDRip`, `WEB-DL`, `HEVC`, `AAC`, `Baha`, `CR`, `CHS`, `CHT`, `GB`, | |
| and `BIG5`. | |
| - `RESOLUTION` covers explicit resolution values such as `720P`, `1080p`, and | |
| `1920x1080`. | |
| - `SPECIAL` covers non-episode extras such as `NCOP`, `NCED`, `PV`, `CM`, | |
| `Menu`, `Trailer`, `Creditless ED`, and movie/special numbering when it is not | |
| an episodic number. | |
| - Hash-like suffixes are retained as text in source filenames when useful, but | |
| they must not become entity labels in generated training data. | |
| ## Title Rules | |
| - Avoid duplicate titles. If the leaf filename already carries a complete title, | |
| season, and episode structure, drop redundant parent directory titles. | |
| - If precision is uncertain, prefer skipping the row/template over producing a | |
| duplicated or discontinuous `TITLE`. | |
| - A title may contain punctuation or symbols. Internal title joiners must stay | |
| inside the title span, including common ASCII separators and known Unicode | |
| title punctuation such as `‐`, `–`, `—`, `$`, `∽`, `꞉`, and `♥`. | |
| - Multiple title candidates in one filename should be handled explicitly: | |
| bilingual title aliases and special-program titles are allowed in rich review | |
| metadata, but the final weak training row should not emit arbitrary | |
| non-contiguous titles unless that structure has been reviewed. | |
| - Generic prefixes such as `TV`, `TVアニメ`, or `アニメ` are not title when a real | |
| title follows. | |
| ## Episode And Season Rules | |
| - `TITLE`, `EPISODE`, and `SEASON` are the highest-risk labels; errors here have | |
| higher training cost than dropping a row. | |
| - `SxxExx` means season plus episode. `S` identifies season and `E` identifies | |
| episode. If the tokenizer keeps `S01E02` as one compact token, project it to | |
| season and episode components during normalization; if split into marker and | |
| number tokens, the numeric value must carry `SEASON`/`EPISODE` and the marker | |
| may remain structural `O`. | |
| - `01v2` means episode `01` version `2`; the episode value must not be treated | |
| as title. | |
| - Episode ranges such as `01-13`, `#1-3`, and CJK forms like `第10話` should | |
| remain episode spans. | |
| - Decimal episode-like values such as `14.5` may be valid recap or midpoint | |
| episodes and should not be discarded only because they contain a decimal point. | |
| - Title-internal numbers stay in `TITLE` when they are part of the work name, | |
| such as `Eien no 831`, `Zom 100`, or movie titles like `Movie 27 The | |
| Million-Dollar Pentagram`. | |
| ## Path And Noise Rules | |
| - BDMV expanded paths such as `BDMV/STREAM/00006` are not useful training | |
| filenames and should be skipped. | |
| - Non-anime or abstract path data, including obvious `MTV` paths and tourism / | |
| railway program dumps, should be skipped. | |
| - Mojibake and encoding-noise rows should be skipped unless explicitly kept for | |
| diagnosis. | |
| - Jellyfin-like paths (`Title/Season 1/E07 - Full Title ...`) are valid, but | |
| the output should avoid duplicate title spans. | |
| - Parent directory context is allowed only when the leaf filename is too weak to | |
| identify the title; otherwise the leaf filename should dominate. | |
| ## Review Strategy | |
| - High-frequency templates affect training most and must be sampled more heavily. | |
| - Low-frequency templates are gated conservatively; ambiguous cases are sent to | |
| review instead of generated training data. | |
| - Middle-frequency templates should be audited by sampling a few examples from | |
| every template class, then grouping failures by rule rather than patching | |
| single examples blindly. | |
| - A template can enter the generated training set only when its `TITLE`, | |
| `EPISODE`, and `SEASON` behavior is defensible across sampled rows. | |
| ## Character Dataset Projection | |
| - Regex-token JSONL is converted to character JSONL by projecting BIO labels: | |
| first character keeps `B-X`, later characters become `I-X`; `O` remains `O`. | |
| - Punctuation tokens must remain independently represented before character | |
| projection so the model can learn filename structure boundaries. | |