---
license: apache-2.0
library_name: transformers
pipeline_tag: token-classification
tags:
- anime
- filename-parsing
- bert
- token-classification
datasets:
- ModerRAS/AnimeName
language:
- en
- ja
- zh
---
# AniFileBERT

AniFileBERT is a tiny BERT token-classification model for parsing anime release filenames into structured fields such as release group, title, season, episode, resolution, source, and special tags.

The checkpoint in this repository is the DMHY weak-label fine-tuned regex-tokenizer model used by MiruPlay.

## Model

- Architecture: `BertForTokenClassification`
- Hidden size: 256
- Layers: 4
- Attention heads: 8
- Labels: BIO token labels for `TITLE`, `SEASON`, `EPISODE`, `GROUP`, `RESOLUTION`, `SOURCE`, and `SPECIAL`
- Tokenizer: custom regex/structure tokenizer implemented in `tokenizer.py`
- Max sequence length: 64
- Parameters: about 5M

The model files are stored at the repository root so `BertForTokenClassification.from_pretrained("ModerRAS/AniFileBERT")` can load the weights. Use `inference.py` for end-to-end parsing because the tokenizer is custom rather than a standard WordPiece tokenizer.
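
To make the BIO label scheme concrete, here is a minimal sketch of how per-token tags could be collapsed into structured fields. It is not the logic used by `inference.py`: the helper name `decode_bio`, the `sep` parameter, and the hand-written tokens and tags are all illustrative assumptions.

```python
from collections import defaultdict


def decode_bio(tokens, tags, sep=" "):
    """Collapse parallel token/tag lists into {FIELD: [span_text, ...]}.

    Hypothetical helper: a span starts at B-X and continues through I-X tags
    of the same field. Any other tag (O, or an I- tag for a different field)
    closes the current span and is itself ignored.
    """
    fields = defaultdict(list)
    current_field, current_tokens = None, []

    def flush():
        # Store the span collected so far, if there is one.
        if current_field is not None:
            fields[current_field].append(sep.join(current_tokens))

    for token, tag in zip(tokens, tags):
        prefix, _, field = tag.partition("-")
        if prefix == "B":
            flush()
            current_field, current_tokens = field, [token]
        elif prefix == "I" and field == current_field:
            current_tokens.append(token)
        else:
            flush()
            current_field, current_tokens = None, []
    flush()
    return dict(fields)


# Hand-written example; real tags would come from the model's predictions.
tokens = ["[", "LoliHouse", "]", "Witch", "Hat", "Atelier", "-", "07", "1080p"]
tags = ["O", "B-GROUP", "O", "B-TITLE", "I-TITLE", "I-TITLE", "O", "B-EPISODE", "B-RESOLUTION"]
parsed = decode_bio(tokens, tags)
print(parsed)
```

Joining tokens with a space suits Latin-script titles; for Japanese or Chinese titles an empty separator (`sep=""`) is probably the better choice.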
## Dataset

Training data snapshots are published separately in [`ModerRAS/AnimeName`](https://huggingface.co/datasets/ModerRAS/AnimeName), and this repository includes it as a nested git submodule at `datasets/AnimeName`.

Current DMHY export waterline (from `datasets/AnimeName`):

- Last exported `files.id`: `1675184`
- Next incremental export: `--min-id 1675185`
- Weak-labeled samples: `632002`
- Mixed training samples: `732002`
## Vocabulary

The default `vocab.json` contains **8000 tokens** (up from 3000) built from frequency
analysis of the full 632K DMHY weak-label dataset. Tokens not in the vocabulary
become `[UNK]`, so a larger vocabulary directly improves coverage:

| Vocab size | Coverage | Model params |
|------------|----------|--------------|
| 3000 (old) | 90.4% | ~4.0M |
| 8000 (current) | 96.2% | ~5.3M |

Common fansub group names (`Snow`, `LoliHouse`, `DMG`, `KTXP`, `Sakurato`, etc.)
and individual bracket characters (`[`, `]`, `(`, `)`) are included in the new
vocabulary.
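
Coverage here is measured over token occurrences rather than unique tokens. As a rough sketch of that calculation (the function name `token_coverage` and the toy counts are assumptions, not values from the DMHY dataset):

```python
from collections import Counter


def token_coverage(token_counts, vocab):
    """Fraction of token occurrences (not unique tokens) found in vocab."""
    total = sum(token_counts.values())
    if total == 0:
        return 0.0
    covered = sum(count for token, count in token_counts.items() if token in vocab)
    return covered / total


# Toy frequencies for illustration only.
counts = Counter({"[": 40, "]": 40, "1080p": 10, "LoliHouse": 6, "RareGroup": 4})
small_vocab = {"[", "]", "1080p"}
large_vocab = small_vocab | {"LoliHouse"}

print(token_coverage(counts, small_vocab))  # 0.9
print(token_coverage(counts, large_vocab))  # 0.96
```

Because frequent tokens dominate the count, adding one moderately common group name raises coverage more than adding many rare tokens would.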

For character-token training, `vocab.char.json` is mirrored at the repository
root for plain `git pull` users and also lives at
`datasets/AnimeName/vocab.char.json` beside the dataset. It is built from the
full `dmhy_weak_char.jsonl` export. The full DMHY weak dataset has **6195
unique characters**, so the complete character vocab is only **6199** entries
including special tokens and reaches 100% token coverage.
## Evaluation

Balanced mixed-data A/B run (`50K` synthetic + `50K` DMHY weak labels, 1 epoch, batch size 128, seed 42):

| Variant | Max length | Vocab | Params | Eval F1 | Accuracy | Train runtime |
|---------|------------|-------|--------|---------|----------|---------------|
| regex | 64 | 3000 | 3.96M | 0.9911 | 0.9951 | 827s |
| char | 128 | 2654 | 3.88M | 0.8142 | 0.9637 | 1983s |

Field-level F1 on the same validation split:

| Field | regex | char |
|-------|-------|------|
| GROUP | 0.9962 | 0.9516 |
| TITLE | 0.9761 | 0.7983 |
| SEASON | 0.9880 | 0.6290 |
| EPISODE | 0.9950 | 0.8082 |

The regex tokenizer remains the default. Both variants can parse simple `S01E07` patterns, but the character tokenizer was weaker on season/episode boundaries and long title spans.
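
The two variants use different max lengths (64 vs 128) because character sequences are much longer than regex-token sequences. The sketch below contrasts a toy regex splitter with plain character splitting. `TOY_PATTERN` is an illustrative assumption and is not the pattern implemented in `tokenizer.py`.

```python
import re

# Hypothetical pattern: runs of ASCII letters, runs of digits, or any single
# other non-space character (punctuation, brackets, CJK characters).
TOY_PATTERN = re.compile(r"[A-Za-z]+|[0-9]+|\S")


def toy_regex_tokens(name):
    """Split a filename into coarse tokens using the toy pattern."""
    return TOY_PATTERN.findall(name)


def char_tokens(name):
    """Split a filename into single characters, dropping whitespace."""
    return [ch for ch in name if not ch.isspace()]


name = "Witch.Hat.Atelier.S01E07.1080p.NF.WEB-DL"
regex_seq = toy_regex_tokens(name)
char_seq = char_tokens(name)

print(regex_seq)
print(len(regex_seq), len(char_seq))
```

With this toy pattern `S01E07` becomes four tokens instead of six characters, so the model sees far fewer positions between the season and episode markers.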
## Usage

Install dependencies:

```bash
pip install -r requirements.txt
```

Parse a filename with this repository cloned locally:

```bash
python inference.py --model-dir . "Witch.Hat.Atelier.S01E07.1080p.NF.WEB-DL.JPN.AAC2.0.H.264.MSubs-ToonsHub"
```

Load only the model weights from the Hub:

```python
from transformers import BertForTokenClassification

model = BertForTokenClassification.from_pretrained("ModerRAS/AniFileBERT")
```

For full parsing, clone this repo and use `load_tokenizer` from `tokenizer.py` or the CLI in `inference.py`.
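
Since the weights alone are not enough, it can help to see what the model expects as input. The following is a simplified sketch of mapping tokens to ids with `[CLS]`/`[SEP]`, the `[UNK]` fallback, and padding. The `encode` helper and the toy vocabulary are assumptions; the real behavior is defined by `tokenizer.py` and `vocab.json`.

```python
MAX_SEQ_LENGTH = 64  # max sequence length listed in the Model section


def encode(tokens, vocab, max_length=MAX_SEQ_LENGTH):
    """Map tokens to ids with [CLS]/[SEP], [UNK] fallback, truncation, padding."""
    body = tokens[: max_length - 2]  # leave room for [CLS] and [SEP]
    unk_id = vocab["[UNK]"]
    ids = [vocab["[CLS]"]] + [vocab.get(t, unk_id) for t in body] + [vocab["[SEP]"]]
    attention_mask = [1] * len(ids)
    padding = max_length - len(ids)
    ids += [vocab["[PAD]"]] * padding
    attention_mask += [0] * padding
    return ids, attention_mask


# Toy vocabulary; the special-token order matches the vocabulary rebuild
# snippet in this README, the remaining entries are made up.
vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
         "[": 4, "]": 5, "LoliHouse": 6, "1080p": 7}

ids, mask = encode(["[", "LoliHouse", "]", "SomeNewShow", "1080p"], vocab, max_length=10)
print(ids)   # [2, 4, 6, 5, 1, 7, 3, 0, 0, 0]
print(mask)  # [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
```

The unknown token `SomeNewShow` maps to id 1 (`[UNK]`). Out-of-vocabulary tokens carry no identity into the model, which is why vocabulary coverage matters.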
## Clone with Dataset Submodule

```bash
git clone --recursive https://huggingface.co/ModerRAS/AniFileBERT
# or, after a normal clone:
git submodule update --init --recursive
```
## Training

### Prerequisites (Windows / Local GPU)

PyTorch 2.11+ with CUDA 12.6 is required for GPU training:

```bash
pip install torch --index-url https://download.pytorch.org/whl/cu126
pip install -r requirements.txt
```
### Fine-tune with rebuilt vocabulary

```bash
python train.py --data-file datasets/AnimeName/dmhy_weak.jsonl \
    --vocab-file datasets/AnimeName/vocab.json \
    --save-dir checkpoints/dmhy-finetune \
    --init-model-dir . \
    --epochs 10 --batch-size 128 \
    --learning-rate 0.0003 --warmup-steps 300 --seed 42
```

Training loads the old 3000-token checkpoint, `resize_token_embeddings()` adds
5000 new randomly initialized slots for the new vocabulary, and fine-tuning
then trains the full model. About 96% of token occurrences are now covered (vs
90% with the old 3000-token vocabulary).
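
As a simplified illustration of what growing the embedding table means, the sketch below keeps the existing rows by index and appends newly sampled rows. It is a pure-Python stand-in, not the Transformers implementation. The helper name `resize_embedding_rows` and the `std=0.02` default are assumptions, and the actual initialization of new rows depends on the installed Transformers version.

```python
import random


def resize_embedding_rows(old_rows, new_vocab_size, std=0.02, seed=42):
    """Return a grown embedding table: old rows kept by index, new rows sampled.

    Simplified stand-in for embedding resizing; std=0.02 is an assumed
    initializer range, not a value read from this repository.
    """
    if new_vocab_size < len(old_rows):
        raise ValueError("this sketch only grows the table")
    rng = random.Random(seed)
    hidden = len(old_rows[0])
    extra = new_vocab_size - len(old_rows)
    new_rows = [[rng.gauss(0.0, std) for _ in range(hidden)] for _ in range(extra)]
    return old_rows + new_rows


# Tiny demo sizes so the example runs instantly.
old = [[float(i)] * 4 for i in range(3)]
grown = resize_embedding_rows(old, new_vocab_size=8)
print(len(grown), grown[:3] == old)  # 8 True

# Sizes from this README: 3000 -> 8000 tokens at hidden size 256.
added_params = (8000 - 3000) * 256
print(added_params)  # 1280000
```

The 1.28M added embedding parameters account for the growth from roughly 4.0M to 5.3M parameters. Rows are matched by index, so learned embeddings carry over only for tokens whose ids are the same in both vocabularies.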
### Character-token DMHY training

```bash
python convert_to_char_dataset.py \
    --input datasets/AnimeName/dmhy_weak.jsonl \
    --output datasets/AnimeName/dmhy_weak_char.jsonl \
    --vocab-output vocab.char.json \
    --manifest-output datasets/AnimeName/dmhy_weak_char.manifest.json

python train.py --tokenizer char \
    --data-file datasets/AnimeName/dmhy_weak_char.jsonl \
    --vocab-file vocab.char.json \
    --save-dir checkpoints_char/dmhy-weak-char \
    --epochs 1 --batch-size 64 \
    --learning-rate 0.0003 --warmup-steps 300 \
    --max-seq-length 128 --seed 42
```

The converter keeps source metadata and adds `tokenizer_variant`, source token
count, and character token count fields to each record. The char dataset's
p99 length is 107 characters, so `--max-seq-length 128` covers almost all rows
while leaving room for `[CLS]` and `[SEP]`.
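
The length budget can be checked with a few lines of arithmetic. This sketch uses a nearest-rank percentile, which may differ from how the p99 figure above was computed. The helper names and the toy length list are assumptions.

```python
import math


def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def fits(length, max_seq_length, num_special=2):
    """True if a row of `length` tokens fits alongside [CLS] and [SEP]."""
    return length + num_special <= max_seq_length


# Toy lengths, not the DMHY distribution.
lengths = list(range(1, 101))
p99 = percentile_nearest_rank(lengths, 99)
print(p99)             # 99

print(fits(107, 128))  # True: 107 + 2 = 109 <= 128
print(fits(107, 64))   # False: the regex model's limit is too short for char rows
```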
### Regenerate datasets from source

```bash
python data_generator.py --num-samples 100000
python dmhy_dataset.py --db D:/WorkSpace/Python/dmhy-parser/dmhy_anime.db --output data/dmhy/dmhy_weak.jsonl
python mix_datasets.py --synthetic data/synthetic.jsonl --dmhy data/dmhy/dmhy_weak.jsonl --output data/dmhy/mixed_train.jsonl
```
### Rebuild vocabulary (if needed)

```bash
python -c "
import json, collections
# Count token frequencies across the weak-label dataset (one JSON object per line).
tokens = collections.Counter()
with open('datasets/AnimeName/dmhy_weak.jsonl', encoding='utf-8') as f:
    for line in f:
        if line.strip():
            tokens.update(json.loads(line).get('tokens', []))
# Four special tokens first, then the 7996 most frequent tokens (8000 total).
specials = ['[PAD]', '[UNK]', '[CLS]', '[SEP]']
vocab = {t: i for i, t in enumerate(specials + [t for t, _ in tokens.most_common(7996)])}
with open('vocab.json', 'w', encoding='utf-8') as f:
    json.dump(vocab, f, ensure_ascii=False, indent=2)
"
```
### Export ONNX for MiruPlay Android

```bash
python export_onnx.py --model-dir checkpoints/dmhy-finetune/final --output exports/anime_filename_parser.onnx
```

---
## Google Colab Training

For Codex-controlled short Colab sessions, see [`colab/README.md`](colab/README.md).
Free Colab still has to be started manually, but once `colab_worker.py` is
running, Codex can submit jobs through `colab_client.py`, tail logs, and inspect
status. Checkpoints live on Google Drive, and default profiles resume from the
latest checkpoint automatically.

Manual one-shot runs are also supported:

```bash
python colab_train.py --profile dmhy_regex_finetune
```
## Repository Layout

- `model.safetensors`, `config.json`, `vocab.json`: default fine-tuned model
- `train.py`, `dataset.py`, `tokenizer.py`, `model.py`: training pipeline
- `dmhy_dataset.py`, `mix_datasets.py`: weak-label export and dataset mixing
- `convert_to_char_dataset.py`: full character-token projection for weak labels
- `inference.py`: end-to-end filename parser CLI
- `export_onnx.py`: ONNX export for Android integration
- `exports/`: exported ONNX model and metadata
- `data/dmhy/*.manifest.json`: dataset waterlines and counts
- `datasets/AnimeName/`: nested dataset submodule
## Maintenance Notes

MiruPlay tracks this repository as `tools/anime_parser`, and this repository
tracks `ModerRAS/AnimeName` as `datasets/AnimeName`. After updating either
repo, remember to commit the submodule pointer in the parent repo.

For the full maintenance workflow, see MiruPlay's
`docs/anifilebert-maintenance.md`.