---
license: apache-2.0
library_name: transformers
pipeline_tag: token-classification
tags:
- anime
- filename-parsing
- bert
- token-classification
datasets:
- ModerRAS/AnimeName
language:
- en
- ja
- zh
---
# AniFileBERT
AniFileBERT is a tiny BERT token-classification model for parsing anime release filenames into structured fields such as release group, title, season, episode, resolution, source, and special tags.
The checkpoint in this repository is the regex-tokenizer variant, fine-tuned on DMHY weak labels, that MiruPlay uses.
## Model
- Architecture: `BertForTokenClassification`
- Hidden size: 256
- Layers: 4
- Attention heads: 8
- Labels: BIO token labels for `TITLE`, `SEASON`, `EPISODE`, `GROUP`, `RESOLUTION`, `SOURCE`, and `SPECIAL`
- Tokenizer: custom regex/structure tokenizer implemented in `tokenizer.py`
- Max sequence length: 64
- Parameters: about 5M
The model files are stored at the repository root so `BertForTokenClassification.from_pretrained("ModerRAS/AniFileBERT")` can load the weights. Use `inference.py` for end-to-end parsing because the tokenizer is custom rather than a standard WordPiece tokenizer.
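The BIO labels listed above can be turned into structured fields by grouping consecutive `B-`/`I-` tokens of the same type. The following is a minimal sketch of that grouping, not the logic in `inference.py`: the helper name `bio_to_fields`, the example token split, and the choice to join tokens with a space are all assumptions made for illustration.

```python
from typing import Dict, List, Optional, Tuple


def bio_to_fields(tokens: List[str], labels: List[str]) -> Dict[str, List[str]]:
    """Group BIO-tagged tokens into field values (hypothetical helper)."""
    spans: List[Tuple[str, List[str]]] = []
    open_field: Optional[str] = None  # field type of the span currently being extended

    for token, label in zip(tokens, labels):
        prefix, _, field = label.partition("-")
        if prefix == "I" and field == open_field:
            # Continue the current span.
            spans[-1][1].append(token)
        elif prefix in ("B", "I"):
            # "B-" starts a span; a stray "I-" is treated as a start as well.
            spans.append((field, [token]))
            open_field = field
        else:
            # "O" closes any open span.
            open_field = None

    fields: Dict[str, List[str]] = {}
    for field, parts in spans:
        # Assumption: tokens inside one span are joined with a single space.
        fields.setdefault(field, []).append(" ".join(parts))
    return fields


# Hypothetical token split and labels, for illustration only.
tokens = ["[", "LoliHouse", "]", "Sousou", "no", "Frieren", "-", "07", "[", "1080p", "]"]
labels = ["O", "B-GROUP", "O", "B-TITLE", "I-TITLE", "I-TITLE", "O", "B-EPISODE",
          "O", "B-RESOLUTION", "O"]
fields = bio_to_fields(tokens, labels)
```

Treating a stray `I-` tag as the start of a new span is a lenient choice; a stricter decoder could drop such tokens instead.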
## Dataset
Training data snapshots are published separately in [`ModerRAS/AnimeName`](https://huggingface.co/datasets/ModerRAS/AnimeName), and this repository includes it as a nested git submodule at `datasets/AnimeName`.
Current DMHY export waterline (from `datasets/AnimeName`):
- Last exported `files.id`: `1675184`
- Next incremental export: `--min-id 1675185`
- Weak-labeled samples: `632002`
- Mixed training samples: `732002`
## Vocabulary
The default `vocab.json` contains **8000 tokens** (up from 3000), built from a
frequency analysis of the full 632K-sample DMHY weak-label dataset. Tokens not
in the vocabulary become `[UNK]`, so a larger vocabulary directly improves coverage:
| Vocab size | Coverage | Model params |
|------------|----------|-------------|
| 3000 (old) | 90.4% | ~4.0M |
| 8000 (current) | 96.2% | ~5.3M |
Common fansub group names (`Snow`, `LoliHouse`, `DMG`, `KTXP`, `Sakurato`, etc.)
and individual bracket characters (`[`, `]`, `(`, `)`) are included in the new
vocabulary.
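Coverage here is measured over token occurrences, as noted in the Training section. A minimal sketch of that measurement follows, assuming each sample is already a list of tokens; the function name `token_coverage` and the toy data are hypothetical and only illustrate the calculation.

```python
from collections import Counter
from typing import Iterable, List, Set


def token_coverage(samples: Iterable[List[str]], vocab: Set[str]) -> float:
    """Fraction of token occurrences that are present in the vocabulary."""
    counts = Counter(tok for sample in samples for tok in sample)
    total = sum(counts.values())
    covered = sum(n for tok, n in counts.items() if tok in vocab)
    return covered / total if total else 0.0


# Toy data: "Snow" and "08" are out of vocabulary and would map to [UNK].
samples = [
    ["[", "LoliHouse", "]", "Frieren", "07"],
    ["[", "Snow", "]", "Frieren", "08"],
]
vocab = {"[", "]", "LoliHouse", "Frieren", "07"}
coverage = token_coverage(samples, vocab)
print(f"{coverage:.1%}")
```

Counting occurrences rather than distinct tokens means frequent tokens such as brackets dominate the figure, which is why a frequency-ranked vocabulary raises coverage quickly.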
## Evaluation
Balanced mixed-data A/B run (`50K` synthetic + `50K` DMHY weak labels, 1 epoch, batch size 128, seed 42):
| Variant | Max length | Vocab | Params | Eval F1 | Accuracy | Train runtime |
|---------|------------|-------|--------|---------|----------|---------------|
| regex | 64 | 3000 | 3.96M | 0.9911 | 0.9951 | 827s |
| char | 128 | 2654 | 3.88M | 0.8142 | 0.9637 | 1983s |
Field-level F1 on the same validation split:
| Field | regex | char |
|-------|-------|------|
| GROUP | 0.9962 | 0.9516 |
| TITLE | 0.9761 | 0.7983 |
| SEASON | 0.9880 | 0.6290 |
| EPISODE | 0.9950 | 0.8082 |
The regex tokenizer remains the default. Both variants can parse simple `S01E07` patterns, but the character tokenizer was weaker on season/episode boundaries and long title spans.
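The tables above report F1 alongside token accuracy, and the two can diverge sharply (0.8142 versus 0.9637 for the char variant). This card does not state how F1 was computed; the sketch below assumes a common span-level convention, where a predicted span counts only if its field type and both boundaries match exactly. The names `extract_spans` and `span_f1` are hypothetical.

```python
from typing import List, Set, Tuple

Span = Tuple[str, int, int]  # (field, start, end_exclusive)


def extract_spans(labels: List[str]) -> Set[Span]:
    """Collect (field, start, end) spans from a BIO label sequence."""
    spans: Set[Span] = set()
    start, field = None, None
    for i, label in enumerate(labels + ["O"]):  # sentinel "O" closes the last span
        prefix, _, name = label.partition("-")
        continues = prefix == "I" and name == field
        if not continues:
            if field is not None:
                spans.add((field, start, i))
            start, field = (i, name) if prefix in ("B", "I") else (None, None)
    return spans


def span_f1(gold: List[str], pred: List[str]) -> float:
    """Exact-match span F1 (assumed metric, not confirmed by the source)."""
    g, p = extract_spans(gold), extract_spans(pred)
    tp = len(g & p)
    precision = tp / len(p) if p else 0.0
    recall = tp / len(g) if g else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


gold = ["B-TITLE", "I-TITLE", "O", "B-EPISODE"]
pred = ["B-TITLE", "O", "O", "B-EPISODE"]  # title span cut one token short
f1 = span_f1(gold, pred)
accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
```

In this toy case one wrong token leaves accuracy at 0.75 but drops span F1 to 0.5, because the truncated title counts as both a missed span and a spurious one.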
## Usage
Install dependencies:
```bash
pip install -r requirements.txt
```
Parse a filename with this repository cloned locally:
```bash
python inference.py --model-dir . "Witch.Hat.Atelier.S01E07.1080p.NF.WEB-DL.JPN.AAC2.0.H.264.MSubs-ToonsHub"
```
Load only the model weights from the Hub:
```python
from transformers import BertForTokenClassification
model = BertForTokenClassification.from_pretrained("ModerRAS/AniFileBERT")
```
For full parsing, clone this repo and use `load_tokenizer` from `tokenizer.py` or the CLI in `inference.py`.
## Clone with Dataset Submodule
```bash
git clone --recursive https://huggingface.co/ModerRAS/AniFileBERT
# or, after a normal clone:
git submodule update --init --recursive
```
## Training
### Prerequisites (Windows / Local GPU)
PyTorch 2.11+ with CUDA 12.6 is required for GPU training:
```bash
pip install torch --index-url https://download.pytorch.org/whl/cu126
pip install -r requirements.txt
```
### Fine-tune with rebuilt vocabulary
```bash
python train.py --data-file datasets/AnimeName/dmhy_weak.jsonl \
--vocab-file datasets/AnimeName/vocab.json \
--save-dir checkpoints/dmhy-finetune \
--init-model-dir . \
--epochs 10 --batch-size 128 \
--learning-rate 0.0003 --warmup-steps 300 --seed 42
```
Training starts from the old 3000-token checkpoint. `resize_token_embeddings()`
then adds 5000 randomly initialized embedding rows for the new vocabulary, and
fine-tuning updates the full model. About 96% of token occurrences are now
covered (vs 90% with the old 3000-token vocabulary).
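As a rough check on the parameter counts quoted in the Vocabulary section, the arithmetic below estimates how many weights the resize adds. It assumes the word-embedding matrix is the only vocabulary-dependent weight, which holds for a token-classification head whose output size depends on the label count rather than the vocabulary.

```python
# Values taken from the Model and Vocabulary sections of this card.
OLD_VOCAB = 3000
NEW_VOCAB = 8000
HIDDEN_SIZE = 256

# Each new vocabulary entry adds one embedding row of HIDDEN_SIZE floats.
added_rows = NEW_VOCAB - OLD_VOCAB
added_params = added_rows * HIDDEN_SIZE

print(added_rows)    # 5000
print(added_params)  # 1280000
```

Adding about 1.28M parameters to the ~4.0M of the old model gives roughly 5.3M, in line with the table in the Vocabulary section.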
### Regenerate datasets from source
```bash
python data_generator.py --num-samples 100000
python dmhy_dataset.py --db D:/WorkSpace/Python/dmhy-parser/dmhy_anime.db --output data/dmhy/dmhy_weak.jsonl
python mix_datasets.py --synthetic data/synthetic.jsonl --dmhy data/dmhy/dmhy_weak.jsonl --output data/dmhy/mixed_train.jsonl
```
### Rebuild vocabulary (if needed)
```bash
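python -c "
import collections, json

# Count token frequencies across the weak-label dataset (one JSON object per line).
tokens = collections.Counter()
with open('datasets/AnimeName/dmhy_weak.jsonl', encoding='utf-8') as f:
    for line in f:
        if line.strip():
            tokens.update(json.loads(line)['tokens'])

# Reserve ids 0-3 for special tokens, then keep the 7996 most frequent tokens (8000 total).
specials = ['[PAD]', '[UNK]', '[CLS]', '[SEP]']
vocab = {t: i for i, t in enumerate(specials + [t for t, _ in tokens.most_common(7996)])}
with open('vocab.json', 'w', encoding='utf-8') as f:
    json.dump(vocab, f, ensure_ascii=False, indent=2)
"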
```
### Export ONNX for MiruPlay Android
```bash
python export_onnx.py --model-dir checkpoints/dmhy-finetune/final --output exports/anime_filename_parser.onnx
```
---
## Google Colab Training
Upload and run [`colab_train.py`](colab_train.py) in a Colab GPU runtime.
It will mount Google Drive, clone both repos, install dependencies, and run
the full training pipeline. Checkpoints are saved to your Drive automatically.
## Repository Layout
- `model.safetensors`, `config.json`, `vocab.json`: default fine-tuned model
- `train.py`, `dataset.py`, `tokenizer.py`, `model.py`: training pipeline
- `dmhy_dataset.py`, `mix_datasets.py`: weak-label export and dataset mixing
- `inference.py`: end-to-end filename parser CLI
- `export_onnx.py`: ONNX export for Android integration
- `exports/`: exported ONNX model and metadata
- `data/dmhy/*.manifest.json`: dataset waterlines and counts
- `datasets/AnimeName/`: nested dataset submodule
## Maintenance Notes
MiruPlay tracks this repository as `tools/anime_parser`, and this repository
tracks `ModerRAS/AnimeName` as `datasets/AnimeName`. After updating either
repo, remember to commit the submodule pointer in the parent repo.
For the full maintenance workflow, see MiruPlay's
`docs/anifilebert-maintenance.md`.