---
license: apache-2.0
library_name: transformers
pipeline_tag: token-classification
tags:
- anime
- filename-parsing
- bert
- token-classification
datasets:
- ModerRAS/AnimeName
language:
- en
- ja
- zh
---

# AniFileBERT

AniFileBERT is a tiny BERT token-classification model for parsing anime release filenames into structured fields such as release group, title, season, episode, resolution, source, and special tags.

The checkpoint in this repository is the full-relabel DMHY character-token model used by MiruPlay.

## Model

- Architecture: `BertForTokenClassification`
- Hidden size: 256
- Layers: 4
- Attention heads: 8
- Labels: BIO token labels for `TITLE`, `SEASON`, `EPISODE`, `GROUP`, `RESOLUTION`, `SOURCE`, and `SPECIAL`
- Tokenizer: custom character tokenizer implemented in `tokenizer.py`
- Max sequence length: 128
- Parameters: 4,783,631

The model files are stored at the repository root so `BertForTokenClassification.from_pretrained("ModerRAS/AniFileBERT")` can load the weights. Use `inference.py` for end-to-end parsing because the tokenizer is custom rather than a standard WordPiece tokenizer.
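
To make the label scheme concrete, here is a minimal sketch of how per-character BIO predictions could be merged into fields. The function name `bio_to_fields` and the example labels are illustrative assumptions; `inference.py` handles decoding end to end and may differ.

```python
def bio_to_fields(chars, labels):
    """Merge per-character BIO labels into (field, text) spans."""
    spans = []
    current_field, current_text = None, []
    for ch, label in zip(chars, labels):
        prefix, _, field = label.partition("-")
        if prefix == "B" or (prefix == "I" and field != current_field):
            # Start a new span; a stray I- tag is treated like B-.
            if current_field is not None:
                spans.append((current_field, "".join(current_text)))
            current_field, current_text = field, [ch]
        elif prefix == "I":
            current_text.append(ch)
        else:
            # "O" closes any open span.
            if current_field is not None:
                spans.append((current_field, "".join(current_text)))
            current_field, current_text = None, []
    if current_field is not None:
        spans.append((current_field, "".join(current_text)))
    return spans


# Hypothetical labels for "[Sub] Show - 07", one label per character.
text = "[Sub] Show - 07"
labels = (
    ["O"] + ["B-GROUP"] + ["I-GROUP"] * 2 + ["O"] * 2
    + ["B-TITLE"] + ["I-TITLE"] * 3 + ["O"] * 3
    + ["B-EPISODE", "I-EPISODE"]
)
fields = bio_to_fields(list(text), labels)
print(fields)
```

Treating a stray `I-` tag as the start of a new span is one possible recovery rule; a stricter decoder could drop such characters instead.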

## Dataset

Training data snapshots are published separately in [`ModerRAS/AnimeName`](https://huggingface.co/datasets/ModerRAS/AnimeName); this repository includes that dataset as a nested Git submodule at `datasets/AnimeName`.

Current DMHY export waterline (from `datasets/AnimeName`):

- Last exported `files.id`: `1675184`
- Next incremental export: `--min-id 1675185`
- Weak-labeled samples: `632002`
- Mixed training samples: `732002`

## Vocabulary

The published checkpoint uses a character vocabulary. `vocab.json` at the
repository root is the deployed tokenizer vocab, and `vocab.char.json` is kept
as a mirrored explicit copy for training/data maintenance. The full DMHY weak
dataset has **6195 unique characters**, so the complete character vocab has only
**6199** entries (6195 characters plus 4 special tokens) and reaches 100% token
coverage.

The regex vocabulary is still maintained in `datasets/AnimeName/vocab.json` for
dataset relabeling and diagnostics, but the root checkpoint loads as `char`.
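
As a rough illustration of character-level encoding, the sketch below maps each character to an ID, wraps the sequence in `[CLS]`/`[SEP]`, and pads to a fixed length. The toy vocabulary, the `encode_chars` name, and the truncation rule are assumptions for illustration; the actual behavior is defined by `tokenizer.py`.

```python
def encode_chars(text, vocab, max_length=128):
    """Encode text one character per token, with [CLS]/[SEP] and padding."""
    unk = vocab["[UNK]"]
    # Reserve two positions for [CLS] and [SEP]; truncate the rest.
    body = [vocab.get(ch, unk) for ch in text[: max_length - 2]]
    ids = [vocab["[CLS]"]] + body + [vocab["[SEP]"]]
    attention_mask = [1] * len(ids)
    pad = max_length - len(ids)
    return ids + [vocab["[PAD]"]] * pad, attention_mask + [0] * pad


# Toy vocabulary (hypothetical): four special tokens, then the characters of a sample.
specials = ["[PAD]", "[UNK]", "[CLS]", "[SEP]"]
chars = sorted(set("Show 07"))
vocab = {tok: i for i, tok in enumerate(specials + chars)}

# "8" is not in the toy vocabulary, so it maps to [UNK].
ids, mask = encode_chars("Show 08", vocab, max_length=12)
print(ids)
print(mask)
```

With a vocabulary that covers every character in the dataset, the `[UNK]` fallback only matters for characters first seen at inference time.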

## Evaluation

Final full-relabel char training (`632002` DMHY rows, 2 epochs, batch size 256,
seed 48):

| Metric | Value |
|--------|-------|
| Eval loss | 0.0163 |
| Entity precision | 0.9800 |
| Entity recall | 0.9867 |
| Entity F1 | 0.9833 |
| Token accuracy | 0.9943 |
| Held-out parse full match | 2008/2048 (0.9805) |
| Fixed regression full match | 21/21 (1.0000) |

The fixed regression set includes second-season aliases such as `Ni`,
`Ni no Sara`, `貳`, and `弐ノ章`, plus long-running episode IDs and dense meta
blocks.
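
The F1 and full-match ratios in the table follow from the other reported numbers, and the short check below recomputes them. It uses only the rounded values from the table, so it is a consistency check rather than a reproduction of the evaluation.

```python
precision, recall = 0.9800, 0.9867

# F1 is the harmonic mean of entity precision and recall.
f1 = 2 * precision * recall / (precision + recall)

# Full-match ratios as reported in the table (matches / total).
held_out = 2008 / 2048
regression = 21 / 21

print(round(f1, 4), round(held_out, 4), round(regression, 4))
```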

## Usage

Install dependencies:

```bash
uv sync
```

Parse a filename with this repository cloned locally:

```bash
python inference.py --model-dir . "Witch.Hat.Atelier.S01E07.1080p.NF.WEB-DL.JPN.AAC2.0.H.264.MSubs-ToonsHub"
```

Load only the model weights from the Hub:

```python
from transformers import BertForTokenClassification

model = BertForTokenClassification.from_pretrained("ModerRAS/AniFileBERT")
```

For full parsing, clone this repo and use `load_tokenizer` from `tokenizer.py` or the CLI in `inference.py`.

## Clone with Dataset Submodule

```bash
git clone --recursive https://huggingface.co/ModerRAS/AniFileBERT
# or, after a normal clone:
git submodule update --init --recursive
```

## Training

### Character-token DMHY training

```bash
uv run python convert_to_char_dataset.py \
  --input datasets/AnimeName/dmhy_weak.jsonl \
  --output datasets/AnimeName/dmhy_weak_char.jsonl \
  --vocab-output datasets/AnimeName/vocab.char.json \
  --manifest-output datasets/AnimeName/dmhy_weak_char.manifest.json

uv run python train.py --tokenizer char \
  --data-file datasets/AnimeName/dmhy_weak_char.jsonl \
  --vocab-file datasets/AnimeName/vocab.char.json \
  --save-dir checkpoints/dmhy-char-full-relabel \
  --init-model-dir . \
  --epochs 2 --batch-size 256 \
  --learning-rate 0.00008 --warmup-steps 300 \
  --checkpoint-steps 1000 --save-total-limit 3 \
  --parse-eval-limit 2048 \
  --max-seq-length 128 --seed 48
```

The converter keeps source metadata and adds `tokenizer_variant`, source token
count, and character token count fields to each record. The char dataset's
p99 length is 107 characters, so `--max-seq-length 128` covers almost all rows
while leaving room for `[CLS]` and `[SEP]`.
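
The length budget can be checked with a small sketch like the following. It assumes one token per Python character and two reserved positions for `[CLS]` and `[SEP]`; the helper names and the second sample filename are illustrative, not part of the repository.

```python
def nearest_rank_percentile(values, pct):
    """Return the pct-th percentile (integer pct) using the nearest-rank method."""
    ordered = sorted(values)
    # Integer ceiling of pct * n / 100 avoids floating-point rounding surprises.
    rank = max(1, -(-pct * len(ordered) // 100))
    return ordered[rank - 1]


def fits_without_truncation(filename, max_seq_length=128, reserved=2):
    """True if every character fits alongside the reserved special tokens."""
    return len(filename) <= max_seq_length - reserved


samples = [
    "Witch.Hat.Atelier.S01E07.1080p.NF.WEB-DL.JPN.AAC2.0.H.264.MSubs-ToonsHub",
    "[Group] Title - 01 [1080p].mkv",  # hypothetical filename
]
lengths = [len(name) for name in samples]
print(nearest_rank_percentile(lengths, 99), all(map(fits_without_truncation, samples)))
```

Under these assumptions a 128-token window holds up to 126 characters, which leaves headroom above the reported p99 of 107.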

### Relabel the full dataset

```bash
uv run python relabel_dataset_from_filenames.py \
  --input datasets/AnimeName/dmhy_weak.jsonl \
  --output datasets/AnimeName/dmhy_weak.relabel.jsonl \
  --manifest-output datasets/AnimeName/dmhy_weak.relabel.manifest.json \
  --vocab-output datasets/AnimeName/vocab.relabel.json \
  --base-vocab datasets/AnimeName/vocab.json \
  --max-vocab-size 8000

# Promote the relabeled outputs over the originals.
mv -f datasets/AnimeName/dmhy_weak.relabel.jsonl datasets/AnimeName/dmhy_weak.jsonl
mv -f datasets/AnimeName/vocab.relabel.json datasets/AnimeName/vocab.json
mv -f datasets/AnimeName/dmhy_weak.relabel.manifest.json datasets/AnimeName/dmhy_weak.manifest.json
```

### Rebuild vocabulary (if needed)

```bash
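python -c "
import json, collections
# Count token frequencies, skipping blank lines.
tokens = collections.Counter()
with open('datasets/AnimeName/dmhy_weak.jsonl', encoding='utf-8') as f:
    for line in f:
        if line.strip():
            tokens.update(json.loads(line)['tokens'])
# Special tokens first, then the 7996 most frequent tokens (at most 8000 entries).
specials = ['[PAD]', '[UNK]', '[CLS]', '[SEP]']
vocab = {t: i for i, t in enumerate(specials + [t for t, _ in tokens.most_common(7996)])}
with open('vocab.json', 'w', encoding='utf-8') as f:
    json.dump(vocab, f, ensure_ascii=False, indent=2)
"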
```

### Export ONNX for MiruPlay Android

```bash
uv run python export_onnx.py --model-dir . --output exports/anime_filename_parser.onnx --max-length 128
```

---

## Google Colab Training

For Codex-controlled short Colab sessions, see [`colab/README.md`](colab/README.md).
Free Colab still has to be started manually, but once `colab_worker.py` is
running, Codex can submit jobs through `colab_client.py`, tail logs, and inspect
status. Checkpoints live on Google Drive, and default profiles resume from the
latest checkpoint automatically.

Manual one-shot runs are also supported:

```bash
python colab_train.py --profile dmhy_regex_finetune
```

## Repository Layout

- `model.safetensors`, `config.json`, `vocab.json`: default published model
- `train.py`, `dataset.py`, `tokenizer.py`, `model.py`: training pipeline
- `dmhy_dataset.py`, `mix_datasets.py`: weak-label export and dataset mixing
- `convert_to_char_dataset.py`: full character-token projection for weak labels
- `inference.py`: end-to-end filename parser CLI
- `export_onnx.py`: ONNX export for Android integration
- `exports/`: exported ONNX model and metadata
- `datasets/AnimeName/`: nested dataset submodule

## Maintenance Notes

MiruPlay tracks this repository as `tools/anime_parser`, and this repository
tracks `ModerRAS/AnimeName` as `datasets/AnimeName`. After updating either
repo, remember to commit the submodule pointer in the parent repo.

For the full maintenance workflow, see MiruPlay's
`docs/anifilebert-maintenance.md`.