---
license: apache-2.0
library_name: transformers
pipeline_tag: token-classification
tags:
- anime
- filename-parsing
- bert
- token-classification
datasets:
- ModerRAS/AnimeName
language:
- en
- ja
- zh
---
# AniFileBERT
AniFileBERT is a tiny BERT token-classification model for parsing anime release filenames into structured fields such as release group, title, season, episode, resolution, source, and special tags.
The checkpoint in this repository is the full-relabel DMHY character-token model used by MiruPlay.
## Model
- Architecture: `BertForTokenClassification`
- Hidden size: 256
- Layers: 4
- Attention heads: 8
- Labels: BIO token labels for `TITLE`, `SEASON`, `EPISODE`, `GROUP`, `RESOLUTION`, `SOURCE`, and `SPECIAL`
- Tokenizer: custom character tokenizer implemented in `tokenizer.py`
- Max sequence length: 128
- Parameters: 4,783,631
The model files are stored at the repository root so `BertForTokenClassification.from_pretrained("ModerRAS/AniFileBERT")` can load the weights. Use `inference.py` for end-to-end parsing because the tokenizer is custom rather than a standard WordPiece tokenizer.
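The stated parameter count can be reproduced with simple arithmetic. The sketch below is an illustration, not code from this repository: the feed-forward width (4x the hidden size), the position-embedding table size (128), and the token-type vocabulary (2) are assumptions that are not stated in this card, while the vocabulary size comes from the Vocabulary section and the label count from the seven BIO entity types plus `O`.

```python
# Back-of-the-envelope parameter count for a BERT token classifier.
# Values marked ASSUMED are not stated in this model card.
hidden = 256                # from the model card
layers = 4                  # from the model card
vocab_size = 6199           # character vocabulary size (Vocabulary section)
num_labels = 2 * 7 + 1      # B-/I- tags for 7 entity types, plus "O"
intermediate = 4 * hidden   # ASSUMED: the usual 4x feed-forward width
max_positions = 128         # ASSUMED: equal to the max sequence length
type_vocab = 2              # ASSUMED: BERT default


def linear(n_in, n_out):
    """Weights plus bias of a dense layer."""
    return n_in * n_out + n_out


layer_norm = 2 * hidden  # scale and shift vectors

# Word, position, and token-type embeddings share one LayerNorm.
embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm

encoder_layer = (
    4 * linear(hidden, hidden)      # query, key, value, attention output
    + layer_norm
    + linear(hidden, intermediate)  # feed-forward up-projection
    + linear(intermediate, hidden)  # feed-forward down-projection
    + layer_norm
)

classifier = linear(hidden, num_labels)

total = embeddings + layers * encoder_layer + classifier
print(f"{total:,}")  # 4,783,631
```

Under these assumptions the total matches the figure listed above. The number of attention heads does not change the count, because the heads only partition the same projection matrices.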
## Dataset
Training data snapshots are published separately in [`ModerRAS/AnimeName`](https://huggingface.co/datasets/ModerRAS/AnimeName), which this repository includes as a nested Git submodule at `datasets/AnimeName`.
Current DMHY export waterline (from `datasets/AnimeName`):
- Last exported `files.id`: `1675184`
- Next incremental export: `--min-id 1675185`
- Weak-labeled samples: `632002`
- Mixed training samples: `732002`
## Vocabulary
The published checkpoint uses a character vocabulary. `vocab.json` at the
repository root is the deployed tokenizer vocabulary, and `vocab.char.json` is
kept as an explicit mirror copy for training and data maintenance. The full DMHY
weak dataset contains **6195 unique characters**, so the complete character
vocabulary has only **6199** entries including the four special tokens, and it
reaches 100% token coverage on that dataset.
The regex vocabulary is still maintained in `datasets/AnimeName/vocab.json` for
dataset relabeling and diagnostics, but the root checkpoint loads as `char`.
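To make the character-vocabulary idea concrete, here is a minimal sketch of character-level encoding. It is not the implementation in `tokenizer.py`: the function names `build_char_vocab` and `encode`, the ordering of ids, and the padding behavior are all assumptions chosen for illustration.

```python
# Hypothetical character-level encoder; the repository's tokenizer.py may differ.
SPECIALS = ["[PAD]", "[UNK]", "[CLS]", "[SEP]"]


def build_char_vocab(filenames):
    """Assign ids to the special tokens first, then to each distinct character."""
    chars = sorted({ch for name in filenames for ch in name})
    return {tok: i for i, tok in enumerate(SPECIALS + chars)}


def encode(text, vocab, max_length=128):
    """Return (input_ids, attention_mask), truncated and padded to max_length."""
    body = list(text)[: max_length - 2]  # leave room for [CLS] and [SEP]
    tokens = ["[CLS]"] + body + ["[SEP]"]
    ids = [vocab.get(tok, vocab["[UNK]"]) for tok in tokens]
    mask = [1] * len(ids)
    pad = max_length - len(ids)
    return ids + [vocab["[PAD]"]] * pad, mask + [0] * pad


# Toy vocabulary built from a single made-up filename.
vocab = build_char_vocab(["[Group] Title - 01 [1080p].mkv"])
# Characters missing from the toy vocabulary (here the kanji and kana) map to [UNK].
ids, mask = encode("[Group] 弐ノ章 - 02", vocab, max_length=32)
print(len(vocab), len(ids), sum(mask))
```

With one id per distinct character plus the special tokens, the vocabulary size is simply the number of unique characters plus four, which is why full coverage of the dataset is cheap to reach.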
## Evaluation
Final full-relabel char training (`632002` DMHY rows, 2 epochs, batch size 256,
seed 48):
| Metric | Value |
|--------|-------|
| Eval loss | 0.0163 |
| Entity precision | 0.9800 |
| Entity recall | 0.9867 |
| Entity F1 | 0.9833 |
| Token accuracy | 0.9943 |
| Held-out parse full match | 2008/2048 (0.9805) |
| Fixed regression full match | 21/21 (1.0000) |
The fixed regression set includes second-season aliases such as `Ni`,
`Ni no Sara`, `貳`, and `弐ノ章`, plus long-running episode IDs and dense meta
blocks.
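As a quick consistency check on the table, the F1 score and the held-out full-match ratio can be recomputed from the other reported numbers. This only re-derives values already shown above and does not rerun any evaluation.

```python
# Recompute derived metrics from the reported precision, recall, and counts.
precision, recall = 0.9800, 0.9867

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

# Full-match ratio on the held-out parse set.
held_out = 2008 / 2048

print(round(f1, 4), round(held_out, 4))  # 0.9833 0.9805
```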
## Usage
Install dependencies:
```bash
uv sync
```
Parse a filename with this repository cloned locally:
```bash
python inference.py --model-dir . "Witch.Hat.Atelier.S01E07.1080p.NF.WEB-DL.JPN.AAC2.0.H.264.MSubs-ToonsHub"
```
Load only the model weights from the Hub:
```python
from transformers import BertForTokenClassification
model = BertForTokenClassification.from_pretrained("ModerRAS/AniFileBERT")
```
For full parsing, clone this repo and use `load_tokenizer` from `tokenizer.py` or the CLI in `inference.py`.
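Once per-character labels are predicted, they still have to be grouped into fields. The following is a minimal sketch of that step, assuming one BIO tag per character; `decode_bio` is a hypothetical helper written for this example and is not the logic used by `inference.py`.

```python
def decode_bio(chars, tags):
    """Group per-character BIO tags into {field: [text, ...]}."""
    fields = {}
    label, span = None, []
    # A trailing "O" sentinel flushes whatever span is still open at the end.
    for ch, tag in list(zip(chars, tags)) + [("", "O")]:
        prefix, _, name = tag.partition("-")
        if prefix == "I" and name == label:
            span.append(ch)  # continue the current span
            continue
        if label is not None:
            fields.setdefault(label, []).append("".join(span))
        # Start a new span on B-, or on a stray I- whose label differs.
        label, span = (name, [ch]) if prefix in ("B", "I") else (None, [])
    return fields


# Made-up filename and hand-written tags, for illustration only.
chars = list("[Sub] Show - 07")
tags = (
    ["O", "B-GROUP", "I-GROUP", "I-GROUP", "O", "O"]
    + ["B-TITLE"] + ["I-TITLE"] * 3
    + ["O"] * 3
    + ["B-EPISODE", "I-EPISODE"]
)
print(decode_bio(chars, tags))
# {'GROUP': ['Sub'], 'TITLE': ['Show'], 'EPISODE': ['07']}
```

Returning a list per field keeps repeated entities (for example, several `SPECIAL` tags) instead of silently overwriting them.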
## Clone with Dataset Submodule
```bash
git clone --recursive https://huggingface.co/ModerRAS/AniFileBERT
# or, after a normal clone:
git submodule update --init --recursive
```
## Training
### Character-token DMHY training
```bash
uv run python convert_to_char_dataset.py \
--input datasets/AnimeName/dmhy_weak.jsonl \
--output datasets/AnimeName/dmhy_weak_char.jsonl \
--vocab-output datasets/AnimeName/vocab.char.json \
--manifest-output datasets/AnimeName/dmhy_weak_char.manifest.json
uv run python train.py --tokenizer char \
--data-file datasets/AnimeName/dmhy_weak_char.jsonl \
--vocab-file datasets/AnimeName/vocab.char.json \
--save-dir checkpoints/dmhy-char-full-relabel \
--init-model-dir . \
--epochs 2 --batch-size 256 \
--learning-rate 0.00008 --warmup-steps 300 \
--checkpoint-steps 1000 --save-total-limit 3 \
--parse-eval-limit 2048 \
--max-seq-length 128 --seed 48
```
The converter keeps source metadata and adds `tokenizer_variant`, source token
count, and character token count fields to each record. The char dataset's
p99 length is 107 characters, so `--max-seq-length 128` covers almost all rows
while leaving room for `[CLS]` and `[SEP]`.
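The length budget can be checked with a few lines of arithmetic. This sketch assumes exactly two special tokens per sequence and uses a nearest-rank percentile; the helper names and the sample lengths are made up for illustration and are not taken from the training scripts.

```python
import math


def percentile_nearest_rank(values, pct):
    """Smallest value such that at least pct percent of the data is at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def fits(char_count, max_seq_length=128, special_tokens=2):
    """True if the characters plus [CLS] and [SEP] fit without truncation."""
    return char_count + special_tokens <= max_seq_length


# The reported p99 length of 107 characters needs 109 positions.
print(fits(107))  # True
# The longest filename that fits untruncated is 126 characters.
print(fits(126), fits(127))  # True False

# Toy data: lengths 1..100, whose 99th percentile is 99.
print(percentile_nearest_rank(range(1, 101), 99))  # 99
```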
### Relabel the full dataset
```bash
uv run python relabel_dataset_from_filenames.py \
--input datasets/AnimeName/dmhy_weak.jsonl \
--output datasets/AnimeName/dmhy_weak.relabel.jsonl \
--manifest-output datasets/AnimeName/dmhy_weak.relabel.manifest.json \
--vocab-output datasets/AnimeName/vocab.relabel.json \
--base-vocab datasets/AnimeName/vocab.json \
--max-vocab-size 8000
# Promote the relabeled outputs over the originals.
mv -f datasets/AnimeName/dmhy_weak.relabel.jsonl datasets/AnimeName/dmhy_weak.jsonl
mv -f datasets/AnimeName/vocab.relabel.json datasets/AnimeName/vocab.json
mv -f datasets/AnimeName/dmhy_weak.relabel.manifest.json datasets/AnimeName/dmhy_weak.manifest.json
```
### Rebuild vocabulary (if needed)
```bash
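python -c "
import json, collections
# Count token frequencies across the weak-labeled dataset.
tokens = collections.Counter()
with open('datasets/AnimeName/dmhy_weak.jsonl', encoding='utf-8') as f:
    for line in f:
        if line.strip():
            tokens.update(json.loads(line)['tokens'])
# Four special tokens plus the 7996 most frequent tokens (8000 entries at most).
specials = ['[PAD]', '[UNK]', '[CLS]', '[SEP]']
vocab = {t: i for i, t in enumerate(specials + [t for t, _ in tokens.most_common(7996)])}
with open('vocab.json', 'w', encoding='utf-8') as f:
    json.dump(vocab, f, ensure_ascii=False, indent=2)
"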
```
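After rebuilding a vocabulary, it can be useful to measure how many token occurrences it covers. The sketch below assumes each record carries a `tokens` list, as in the snippet above; `token_coverage` is a hypothetical helper, and the sample rows are made up.

```python
def token_coverage(vocab, rows):
    """Fraction of token occurrences in rows that have an id in vocab."""
    total = covered = 0
    for row in rows:
        for tok in row["tokens"]:
            total += 1
            covered += tok in vocab  # bool counts as 0 or 1
    return covered / total if total else 1.0


# Toy example: "c" is missing from the vocabulary, so 3 of 4 tokens are covered.
toy_vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "a": 4, "b": 5}
toy_rows = [{"tokens": ["a", "b"]}, {"tokens": ["c", "a"]}]
print(token_coverage(toy_vocab, toy_rows))  # 0.75
```

Counting occurrences rather than distinct tokens weights frequent tokens more heavily, which is usually what matters for how often `[UNK]` appears at inference time.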
### Export ONNX for MiruPlay Android
```bash
uv run python export_onnx.py --model-dir . --output exports/anime_filename_parser.onnx --max-length 128
```
---
## Google Colab Training
For Codex-controlled short Colab sessions, see [`colab/README.md`](colab/README.md).
Free Colab still has to be started manually, but once `colab_worker.py` is
running, Codex can submit jobs through `colab_client.py`, tail logs, and inspect
status. Checkpoints live on Google Drive, and default profiles resume from the
latest checkpoint automatically.
Manual one-shot runs are also supported:
```bash
python colab_train.py --profile dmhy_regex_finetune
```
## Repository Layout
- `model.safetensors`, `config.json`, `vocab.json`: default published model
- `train.py`, `dataset.py`, `tokenizer.py`, `model.py`: training pipeline
- `dmhy_dataset.py`, `mix_datasets.py`: weak-label export and dataset mixing
- `convert_to_char_dataset.py`: full character-token projection for weak labels
- `inference.py`: end-to-end filename parser CLI
- `export_onnx.py`: ONNX export for Android integration
- `exports/`: exported ONNX model and metadata
- `datasets/AnimeName/`: nested dataset submodule
## Maintenance Notes
MiruPlay tracks this repository as `tools/anime_parser`, and this repository
tracks `ModerRAS/AnimeName` as `datasets/AnimeName`. After updating either
repo, remember to commit the submodule pointer in the parent repo.
For the full maintenance workflow, see MiruPlay's
`docs/anifilebert-maintenance.md`.