license: apache-2.0
library_name: transformers
pipeline_tag: token-classification
tags:
- anime
- filename-parsing
- bert
- token-classification
datasets:
- ModerRAS/AnimeName
language:
- en
- ja
- zh
AniFileBERT
AniFileBERT is a tiny BERT token-classification model for parsing anime release filenames into structured fields such as release group, title, season, episode, resolution, source, and special tags.
The checkpoint in this repository is the full-relabel DMHY character-token model used by MiruPlay.
Model
- Architecture: BertForTokenClassification
- Hidden size: 256
- Layers: 4
- Attention heads: 8
- Labels: BIO token labels for TITLE, SEASON, EPISODE, GROUP, RESOLUTION, SOURCE, and SPECIAL
- Tokenizer: custom character tokenizer implemented in tokenizer.py
- Max sequence length: 128
- Parameters: 4,783,631
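The parameter count can be reproduced from the listed dimensions under a few assumptions that this card does not state: a feed-forward (intermediate) size of 1024, 128 learned position embeddings, 2 token-type embeddings, and 15 labels (O plus B-/I- for the seven fields). The 6199-entry vocabulary size comes from the Vocabulary section. This is a rough arithmetic check, not a dump of config.json:

```python
# Back-of-the-envelope parameter count for a BERT token classifier.
# intermediate, max_positions, type_vocab and num_labels are assumptions,
# chosen because together they reproduce the published total.
hidden, layers = 256, 4
vocab_size = 6199          # from the Vocabulary section
intermediate = 1024        # assumed: 4 x hidden
max_positions = 128        # assumed: equals the max sequence length
type_vocab = 2             # assumed: BERT default
num_labels = 1 + 2 * 7     # assumed: O plus B-/I- for seven fields

layer_norm = 2 * hidden    # weight + bias

embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
attention = 4 * (hidden * hidden + hidden) + layer_norm      # Q, K, V, output
feed_forward = ((hidden * intermediate + intermediate)
                + (intermediate * hidden + hidden) + layer_norm)
classifier = hidden * num_labels + num_labels

total = embeddings + layers * (attention + feed_forward) + classifier
print(f"{total:,}")  # 4,783,631
```

The number of attention heads does not change the count, since the heads only partition the same 256-wide projections.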
The model files are stored at the repository root so BertForTokenClassification.from_pretrained("ModerRAS/AniFileBERT") can load the weights. Use inference.py for end-to-end parsing because the tokenizer is custom rather than a standard WordPiece tokenizer.
Dataset
Training data snapshots are published separately in ModerRAS/AnimeName, and this repository includes it as a nested git submodule at datasets/AnimeName.
Current DMHY export waterline (from datasets/AnimeName):
- Last exported files.id: 1675184
- Next incremental export: --min-id 1675185
- Weak-labeled samples: 632002
- Mixed training samples: 732002
Vocabulary
The published checkpoint uses a character vocabulary. vocab.json at the
repository root is the deployed tokenizer vocabulary, and vocab.char.json is
kept as an explicit mirror copy for training and data maintenance. The full
DMHY weak dataset contains 6195 unique characters, so the complete character
vocabulary has only 6199 entries (6195 characters plus 4 special tokens) and
reaches 100% token coverage.
The regex vocabulary is still maintained in datasets/AnimeName/vocab.json for
dataset relabeling and diagnostics, but the root checkpoint loads with the
character tokenizer.
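As an illustration of why a character vocabulary stays small and reaches full coverage, here is a minimal sketch of building one. The helper names `build_char_vocab` and `coverage` and the sample filenames are hypothetical, and sorting characters by code point is an assumption; the real vocab.json may order entries differently.

```python
SPECIAL_TOKENS = ["[PAD]", "[UNK]", "[CLS]", "[SEP]"]

def build_char_vocab(filenames):
    """Map the special tokens, then every distinct character, to integer ids."""
    chars = sorted({ch for name in filenames for ch in name})
    return {tok: i for i, tok in enumerate(SPECIAL_TOKENS + chars)}

def coverage(vocab, filenames):
    """Fraction of characters that have their own vocab entry."""
    total = sum(len(name) for name in filenames)
    known = sum(ch in vocab for name in filenames for ch in name)
    return known / total if total else 1.0

# Hypothetical sample filenames, not taken from the dataset.
samples = ["[Group] Title - 01 [1080p].mkv", "【字幕组】作品名 第02話"]
vocab = build_char_vocab(samples)
print(len(vocab), coverage(vocab, samples))
```

Because every character seen in the corpus gets an entry, coverage on that corpus is 1.0 by construction, and the vocabulary size is the number of distinct characters plus the four special tokens.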
Evaluation
Final full-relabel char training (632002 DMHY rows, 2 epochs, batch size 256,
seed 48):
| Metric | Value |
|---|---|
| Eval loss | 0.0163 |
| Entity precision | 0.9800 |
| Entity recall | 0.9867 |
| Entity F1 | 0.9833 |
| Token accuracy | 0.9943 |
| Held-out parse full match | 2008/2048 (0.9805) |
| Fixed regression full match | 21/21 (1.0000) |
The fixed regression set includes second-season aliases such as Ni,
Ni no Sara, 貳, and 弐ノ章, plus long-running episode IDs and dense meta
blocks.
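The table values are mutually consistent, which can be checked with a few lines of arithmetic. This uses the rounded precision and recall from the table, so it only confirms agreement to four decimal places:

```python
precision, recall = 0.9800, 0.9867   # entity-level values from the table

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
# Held-out parse full match: 2008 of 2048 filenames parsed exactly.
full_match = 2008 / 2048

print(round(f1, 4), round(full_match, 4))  # 0.9833 0.9805
```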
Usage
Install dependencies:
uv sync
Parse a filename with this repository cloned locally:
python inference.py --model-dir . "Witch.Hat.Atelier.S01E07.1080p.NF.WEB-DL.JPN.AAC2.0.H.264.MSubs-ToonsHub"
Load only the model weights from the Hub:
from transformers import BertForTokenClassification
model = BertForTokenClassification.from_pretrained("ModerRAS/AniFileBERT")
For full parsing, clone this repo and use load_tokenizer from tokenizer.py or the CLI in inference.py.
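Full parsing also involves turning per-character BIO labels into fields. The sketch below shows one common way to do that; `decode_bio` is a hypothetical helper, not the logic in inference.py, and the tags are written by hand rather than produced by the model.

```python
def decode_bio(chars, tags):
    """Group per-character BIO tags into (field, text) spans."""
    spans, field, buf = [], None, []
    for ch, tag in zip(chars, tags):
        prefix, _, name = tag.partition("-")
        if prefix == "I" and name == field:
            buf.append(ch)               # continue the open span
            continue
        if field is not None:            # close the open span
            spans.append((field, "".join(buf)))
            field, buf = None, []
        if prefix in ("B", "I"):         # start a span (a stray I- is treated as B-)
            field, buf = name, [ch]
    if field is not None:
        spans.append((field, "".join(buf)))
    return spans

text = "[Grp] Show - 07"
tags = (["O"] + ["B-GROUP"] + ["I-GROUP"] * 2 + ["O"] * 2
        + ["B-TITLE"] + ["I-TITLE"] * 3 + ["O"] * 3
        + ["B-EPISODE", "I-EPISODE"])
print(decode_bio(text, tags))
# [('GROUP', 'Grp'), ('TITLE', 'Show'), ('EPISODE', '07')]
```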
Clone with Dataset Submodule
git clone --recursive https://huggingface.co/ModerRAS/AniFileBERT
# or, after a normal clone:
git submodule update --init --recursive
Training
Character-token DMHY training
uv run python convert_to_char_dataset.py \
--input datasets/AnimeName/dmhy_weak.jsonl \
--output datasets/AnimeName/dmhy_weak_char.jsonl \
--vocab-output datasets/AnimeName/vocab.char.json \
--manifest-output datasets/AnimeName/dmhy_weak_char.manifest.json
uv run python train.py --tokenizer char \
--data-file datasets/AnimeName/dmhy_weak_char.jsonl \
--vocab-file datasets/AnimeName/vocab.char.json \
--save-dir checkpoints/dmhy-char-full-relabel \
--init-model-dir . \
--epochs 2 --batch-size 256 \
--learning-rate 0.00008 --warmup-steps 300 \
--checkpoint-steps 1000 --save-total-limit 3 \
--parse-eval-limit 2048 \
--max-seq-length 128 --seed 48
The converter keeps source metadata and adds tokenizer_variant, source token
count, and character token count fields to each record. The char dataset's
p99 length is 107 characters, so --max-seq-length 128 covers almost all rows
while leaving room for [CLS] and [SEP].
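To make the length budget concrete: a 107-character filename plus [CLS] and [SEP] needs 109 positions, which fits within 128, and only filenames longer than 126 characters are truncated. The sketch below shows one plausible way to project token-level BIO labels onto characters and apply that budget. Both function names are hypothetical, and the choice to label special tokens as O is an assumption; convert_to_char_dataset.py and train.py may differ.

```python
MAX_SEQ_LENGTH = 128

def project_to_chars(tokens, labels):
    """Expand token-level BIO labels to one label per character."""
    chars, char_labels = [], []
    for token, label in zip(tokens, labels):
        for i, ch in enumerate(token):
            chars.append(ch)
            if label.startswith("B-") and i > 0:
                char_labels.append("I-" + label[2:])   # only the first char keeps B-
            else:
                char_labels.append(label)
    return chars, char_labels

def add_special_tokens(chars, char_labels, max_len=MAX_SEQ_LENGTH):
    """Truncate to leave two slots, then wrap with [CLS] and [SEP]."""
    budget = max_len - 2
    return (["[CLS]"] + chars[:budget] + ["[SEP]"],
            ["O"] + char_labels[:budget] + ["O"])   # assumed: specials labeled O

tokens = ["[", "Grp", "]", "Show", "07"]
labels = ["O", "B-GROUP", "O", "B-TITLE", "B-EPISODE"]
chars, char_labels = project_to_chars(tokens, labels)
seq, seq_labels = add_special_tokens(chars, char_labels)
print(len(seq), seq_labels)
```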
Relabel the full dataset
uv run python relabel_dataset_from_filenames.py \
--input datasets/AnimeName/dmhy_weak.jsonl \
--output datasets/AnimeName/dmhy_weak.relabel.jsonl \
--manifest-output datasets/AnimeName/dmhy_weak.relabel.manifest.json \
--vocab-output datasets/AnimeName/vocab.relabel.json \
--base-vocab datasets/AnimeName/vocab.json \
--max-vocab-size 8000
# PowerShell: replace the originals with the relabeled outputs
Move-Item datasets/AnimeName/dmhy_weak.relabel.jsonl datasets/AnimeName/dmhy_weak.jsonl -Force
Move-Item datasets/AnimeName/vocab.relabel.json datasets/AnimeName/vocab.json -Force
Copy-Item datasets/AnimeName/dmhy_weak.relabel.manifest.json datasets/AnimeName/dmhy_weak.manifest.json -Force
Remove-Item datasets/AnimeName/dmhy_weak.relabel.manifest.json -Force
Rebuild vocabulary (if needed)
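python -c "
import json, collections
tokens = collections.Counter()
with open('datasets/AnimeName/dmhy_weak.jsonl', encoding='utf-8') as f:
    for line in f:
        item = json.loads(line) if line.strip() else None
        if item:
            tokens.update(item['tokens'])
# Four special tokens plus the 7996 most frequent tokens gives at most 8000 entries.
specials = ['[PAD]', '[UNK]', '[CLS]', '[SEP]']
vocab = {t: i for i, t in enumerate(specials + [t for t, _ in tokens.most_common(7996)])}
with open('vocab.json', 'w', encoding='utf-8') as f:
    json.dump(vocab, f, ensure_ascii=False, indent=2)
"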
Export ONNX for MiruPlay Android
uv run python export_onnx.py --model-dir . --output exports/anime_filename_parser.onnx --max-length 128
Google Colab Training
For Codex-controlled short Colab sessions, see colab/README.md.
Free Colab still has to be started manually, but once colab_worker.py is
running, Codex can submit jobs through colab_client.py, tail logs, and inspect
status. Checkpoints live on Google Drive, and the default profiles resume from
the latest checkpoint automatically.
Manual one-shot runs are also supported:
python colab_train.py --profile dmhy_regex_finetune
Repository Layout
- model.safetensors, config.json, vocab.json: default published model
- train.py, dataset.py, tokenizer.py, model.py: training pipeline
- dmhy_dataset.py, mix_datasets.py: weak-label export and dataset mixing
- convert_to_char_dataset.py: full character-token projection for weak labels
- inference.py: end-to-end filename parser CLI
- export_onnx.py: ONNX export for Android integration
- exports/: exported ONNX model and metadata
- datasets/AnimeName/: nested dataset submodule
Maintenance Notes
MiruPlay tracks this repository as tools/anime_parser, and this repository
tracks ModerRAS/AnimeName as datasets/AnimeName. After updating either
repo, remember to commit the submodule pointer in the parent repo.
For the full maintenance workflow, see MiruPlay's
docs/anifilebert-maintenance.md.