license: apache-2.0
library_name: transformers
pipeline_tag: token-classification
tags:
- anime
- filename-parsing
- bert
- token-classification
datasets:
- ModerRAS/AnimeName
language:
- en
- ja
- zh
AniFileBERT
AniFileBERT is a tiny BERT token-classification model for parsing anime release filenames into structured fields such as release group, title, season, episode, resolution, source, and special tags.
The checkpoint in this repository is the DMHY weak-label fine-tuned regex-tokenizer model used by MiruPlay.
Model
- Architecture: BertForTokenClassification
- Hidden size: 256
- Layers: 4
- Attention heads: 8
- Labels: BIO token labels for TITLE, SEASON, EPISODE, GROUP, RESOLUTION, SOURCE, and SPECIAL
- Tokenizer: custom regex/structure tokenizer implemented in tokenizer.py
- Max sequence length: 64
- Parameters: about 5M
The model files are stored at the repository root so BertForTokenClassification.from_pretrained("ModerRAS/AniFileBERT") can load the weights. Use inference.py for end-to-end parsing because the tokenizer is custom rather than a standard WordPiece tokenizer.
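To illustrate how BIO token labels become structured fields, here is a minimal sketch. The label strings (B-TITLE, I-TITLE, O) and the helper name bio_to_fields are assumptions for illustration; inference.py may decode differently, and joining tokens with an empty string is a simplification.

```python
from collections import defaultdict


def bio_to_fields(tokens, labels):
    """Group BIO-labeled tokens into {field: [span_text, ...]}.

    Assumes labels look like "B-TITLE", "I-TITLE", or "O" (hypothetical
    naming; check config.json for the real label set).
    """
    fields = defaultdict(list)
    current_field, current_tokens = None, []

    def flush():
        # Store the span collected so far, if any.
        if current_field is not None and current_tokens:
            fields[current_field].append("".join(current_tokens))

    for token, label in zip(tokens, labels):
        prefix, _, field = label.partition("-")
        if prefix == "B":
            flush()
            current_field, current_tokens = field, [token]
        elif prefix == "I" and field == current_field:
            current_tokens.append(token)
        else:
            # "O" or an inconsistent I- tag closes the open span.
            flush()
            current_field, current_tokens = None, []
    flush()
    return dict(fields)


# Hand-labeled toy example, not model output.
tokens = ["[", "LoliHouse", "]", "Sousou", " ", "no", " ", "Frieren",
          "-", "07", "[", "1080p", "]"]
labels = ["O", "B-GROUP", "O", "B-TITLE", "I-TITLE", "I-TITLE", "I-TITLE",
          "I-TITLE", "O", "B-EPISODE", "O", "B-RESOLUTION", "O"]
parsed = bio_to_fields(tokens, labels)
print(parsed)
```

Each field maps to a list because a filename can, in principle, contain more than one span with the same label.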
Dataset
Training data snapshots are published separately in ModerRAS/AnimeName, and this repository includes it as a nested git submodule at datasets/AnimeName.
Current DMHY export waterline (from datasets/AnimeName):
- Last exported files.id: 1675184
- Next incremental export: --min-id 1675185
- Weak-labeled samples: 632002
- Mixed training samples: 732002
Vocabulary
The default vocab.json contains 8000 tokens (up from 3000) built from a
frequency analysis of the full 632K DMHY weak-label dataset. Tokens not in the
vocabulary become [UNK], so a larger vocabulary directly improves coverage:
| Vocab size | Coverage | Model params |
|---|---|---|
| 3000 (old) | 90.4% | ~4.0M |
| 8000 (current) | 96.2% | ~5.3M |
Common fansub group names (Snow, LoliHouse, DMG, KTXP, Sakurato, etc.)
and individual bracket characters ([, ], (, )) are included in the new
vocabulary.
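Coverage here is the share of token occurrences that have a vocabulary entry rather than falling back to [UNK]. A minimal sketch of that calculation follows; the helper name token_coverage and the toy counts are illustrative assumptions, not the script used to produce the table above.

```python
from collections import Counter


def token_coverage(token_counts, vocab):
    """Return the fraction of token occurrences that are in `vocab`."""
    total = sum(token_counts.values())
    if total == 0:
        return 0.0
    covered = sum(n for tok, n in token_counts.items() if tok in vocab)
    return covered / total


# Toy counts, not real dataset statistics.
counts = Counter({"[": 40, "]": 40, "LoliHouse": 15, "RareGroup": 5})
vocab = {"[PAD]": 0, "[UNK]": 1, "[": 2, "]": 3, "LoliHouse": 4}
print(f"{token_coverage(counts, vocab):.1%}")  # 95.0%
```

Weighting by occurrences rather than by distinct tokens means a few frequent tokens, such as brackets, contribute far more to coverage than many rare ones.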
For character-token training, vocab.char.json is mirrored at the repository
root for plain git pull users and also lives at
datasets/AnimeName/vocab.char.json beside the dataset. It is built from the
full dmhy_weak_char.jsonl export. The full DMHY weak dataset has 6195
unique characters, so the complete character vocab is only 6199 entries
including special tokens and reaches 100% token coverage.
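The entry count follows from adding the special tokens to the set of distinct characters (6195 + 4 = 6199). A minimal sketch of such a vocabulary builder is shown below; it assumes the four special tokens used for the regex vocabulary and a sorted id order, and build_char_vocab is a hypothetical name, not the logic in convert_to_char_dataset.py.

```python
# Assumed special tokens; ids 0-3 are reserved for them.
SPECIAL_TOKENS = ["[PAD]", "[UNK]", "[CLS]", "[SEP]"]


def build_char_vocab(filenames):
    """Map special tokens plus every distinct character to an integer id."""
    chars = sorted({ch for name in filenames for ch in name})
    return {tok: i for i, tok in enumerate(SPECIAL_TOKENS + chars)}


sample = ["[Snow] 葬送のフリーレン - 07", "[KTXP] Frieren 07"]
char_vocab = build_char_vocab(sample)
distinct = {ch for name in sample for ch in name}
print(len(char_vocab) == len(distinct) + len(SPECIAL_TOKENS))  # True
```

Because every character seen in the export gets its own id, no character from that export maps to [UNK], which is why coverage reaches 100% on it.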
Evaluation
Balanced mixed-data A/B run (50K synthetic + 50K DMHY weak labels, 1 epoch, batch size 128, seed 42):
| Variant | Max length | Vocab | Params | Eval F1 | Accuracy | Train runtime |
|---|---|---|---|---|---|---|
| regex | 64 | 3000 | 3.96M | 0.9911 | 0.9951 | 827s |
| char | 128 | 2654 | 3.88M | 0.8142 | 0.9637 | 1983s |
Field-level F1 on the same validation split:
| Field | regex | char |
|---|---|---|
| GROUP | 0.9962 | 0.9516 |
| TITLE | 0.9761 | 0.7983 |
| SEASON | 0.9880 | 0.6290 |
| EPISODE | 0.9950 | 0.8082 |
The regex tokenizer remains the default. Both variants can parse simple S01E07-style names, but the character tokenizer was weaker on season/episode boundaries and long title spans.
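The sketch below contrasts the two tokenization granularities on a toy filename. The regular expression is a hypothetical stand-in chosen for illustration and is not the pattern in tokenizer.py; it only shows why a character sequence is much longer than a regex-token sequence for the same name, which may be part of why the char variant was run with a larger max length.

```python
import re

# Hypothetical pattern for illustration only; tokenizer.py defines the real rules.
TOKEN_PATTERN = re.compile(r"[A-Za-z]+|\d+|\s+|.")


def regex_tokenize(name):
    """Split into letter runs, digit runs, whitespace runs, and single symbols."""
    return TOKEN_PATTERN.findall(name)


def char_tokenize(name):
    """Split into individual characters."""
    return list(name)


name = "[LoliHouse] Frieren - 07 [1080p]"
print(len(regex_tokenize(name)), len(char_tokenize(name)))  # 14 32
print(regex_tokenize("S01E07"))  # ['S', '01', 'E', '07']
```

With coarser tokens, a season or episode number arrives as a single unit, whereas a character model has to learn where each multi-character span begins and ends.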
Usage
Install dependencies:
pip install -r requirements.txt
Parse a filename with this repository cloned locally:
python inference.py --model-dir . "Witch.Hat.Atelier.S01E07.1080p.NF.WEB-DL.JPN.AAC2.0.H.264.MSubs-ToonsHub"
Load only the model weights from the Hub:
from transformers import BertForTokenClassification
model = BertForTokenClassification.from_pretrained("ModerRAS/AniFileBERT")
For full parsing, clone this repo and use load_tokenizer from tokenizer.py or the CLI in inference.py.
Clone with Dataset Submodule
git clone --recursive https://huggingface.co/ModerRAS/AniFileBERT
# or, after a normal clone:
git submodule update --init --recursive
Training
Prerequisites (Windows / Local GPU)
PyTorch 2.11+ with CUDA 12.6 is required for GPU training:
pip install torch --index-url https://download.pytorch.org/whl/cu126
pip install -r requirements.txt
Fine-tune with rebuilt vocabulary
python train.py --data-file datasets/AnimeName/dmhy_weak.jsonl \
--vocab-file datasets/AnimeName/vocab.json \
--save-dir checkpoints/dmhy-finetune \
--init-model-dir . \
--epochs 10 --batch-size 128 \
--learning-rate 0.0003 --warmup-steps 300 --seed 42
Training starts from the old 3000-token checkpoint: resize_token_embeddings()
adds 5000 new, randomly initialized embedding slots for the new vocabulary, and
fine-tuning then trains the full model. About 96% of token occurrences are now
covered (vs 90% with the old 3000-token vocabulary).
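As a quick sanity check on the parameter counts, the word-embedding matrix has one row of hidden-size width per vocabulary entry, so the added slots account for most of the growth from about 4.0M to about 5.3M parameters. The arithmetic below uses only figures stated in this README; the token-classification head does not depend on vocabulary size.

```python
HIDDEN_SIZE = 256  # from the Model section
OLD_VOCAB, NEW_VOCAB = 3000, 8000

# Each new vocabulary entry adds one embedding row of HIDDEN_SIZE floats.
added_rows = NEW_VOCAB - OLD_VOCAB
added_params = added_rows * HIDDEN_SIZE
print(added_rows, added_params)  # 5000 1280000
```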
Character-token DMHY training
python convert_to_char_dataset.py \
--input datasets/AnimeName/dmhy_weak.jsonl \
--output datasets/AnimeName/dmhy_weak_char.jsonl \
--vocab-output vocab.char.json \
--manifest-output datasets/AnimeName/dmhy_weak_char.manifest.json
python train.py --tokenizer char \
--data-file datasets/AnimeName/dmhy_weak_char.jsonl \
--vocab-file vocab.char.json \
--save-dir checkpoints_char/dmhy-weak-char \
--epochs 1 --batch-size 64 \
--learning-rate 0.0003 --warmup-steps 300 \
--max-seq-length 128 --seed 42
The converter keeps source metadata and adds tokenizer_variant, source token
count, and character token count fields to each record. The char dataset's
p99 length is 107 characters, so --max-seq-length 128 covers almost all rows
while leaving room for [CLS] and [SEP].
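The length budget can be checked with a short sketch. It uses a nearest-rank percentile, which is an assumption (the manifest may compute p99 differently), and the helper names are hypothetical.

```python
import math


def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def fits(char_length, max_seq_length, num_special=2):
    """True if the characters plus [CLS] and [SEP] fit without truncation."""
    return char_length + num_special <= max_seq_length


# The p99 length reported above, checked against --max-seq-length 128.
print(fits(107, 128))  # True
```

Rows longer than 126 characters would still be truncated under this budget, which is consistent with the setting covering almost all rows rather than every row.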
Regenerate datasets from source
python data_generator.py --num-samples 100000
python dmhy_dataset.py --db D:/WorkSpace/Python/dmhy-parser/dmhy_anime.db --output data/dmhy/dmhy_weak.jsonl
python mix_datasets.py --synthetic data/synthetic.jsonl --dmhy data/dmhy/dmhy_weak.jsonl --output data/dmhy/mixed_train.jsonl
Rebuild vocabulary (if needed)
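python -c "
import collections, json
# Count token frequencies across the weak-label dataset.
tokens = collections.Counter()
with open('datasets/AnimeName/dmhy_weak.jsonl', encoding='utf-8') as f:
    for line in f:
        if line.strip():
            tokens.update(json.loads(line)['tokens'])
# Reserve ids 0-3 for special tokens, then keep the 7996 most frequent tokens.
specials = ['[PAD]', '[UNK]', '[CLS]', '[SEP]']
vocab = {t: i for i, t in enumerate(specials + [t for t, _ in tokens.most_common(7996)])}
with open('vocab.json', 'w', encoding='utf-8') as f:
    json.dump(vocab, f, ensure_ascii=False, indent=2)
"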
Export ONNX for MiruPlay Android
python export_onnx.py --model-dir checkpoints/dmhy-finetune/final --output exports/anime_filename_parser.onnx
Google Colab Training
For Codex-controlled short Colab sessions, see colab/README.md.
Free Colab still has to be started manually, but once colab_worker.py is
running, Codex can submit jobs through colab_client.py, tail logs, and inspect
status. Checkpoints live on Google Drive, and default profiles resume from the
latest checkpoint automatically.
Manual one-shot runs are also supported:
python colab_train.py --profile dmhy_regex_finetune
Repository Layout
- model.safetensors, config.json, vocab.json: default fine-tuned model
- train.py, dataset.py, tokenizer.py, model.py: training pipeline
- dmhy_dataset.py, mix_datasets.py: weak-label export and dataset mixing
- convert_to_char_dataset.py: full character-token projection for weak labels
- inference.py: end-to-end filename parser CLI
- export_onnx.py: ONNX export for Android integration
- exports/: exported ONNX model and metadata
- data/dmhy/*.manifest.json: dataset waterlines and counts
- datasets/AnimeName/: nested dataset submodule
Maintenance Notes
MiruPlay tracks this repository as tools/anime_parser, and this repository
tracks ModerRAS/AnimeName as datasets/AnimeName. After updating either
repo, remember to commit the submodule pointer in the parent repo.
For the full maintenance workflow, see MiruPlay's
docs/anifilebert-maintenance.md.