	---
	license: apache-2.0
	library_name: transformers
	pipeline_tag: token-classification
	tags:
	- anime
	- filename-parsing
	- bert
	- token-classification
	datasets:
	- ModerRAS/AnimeName
	language:
	- en
	- ja
	- zh
	---

	# AniFileBERT

	AniFileBERT is a tiny BERT token-classification model for parsing anime release filenames into structured fields such as release group, title, season, episode, resolution, source, and special tags.

	The checkpoint in this repository is the full-relabel DMHY character-token model used by MiruPlay.

	## Model

	- Architecture: `BertForTokenClassification`
	- Hidden size: 256
	- Layers: 4
	- Attention heads: 8
	- Labels: BIO token labels for `TITLE`, `SEASON`, `EPISODE`, `GROUP`, `RESOLUTION`, `SOURCE`, and `SPECIAL`
	- Tokenizer: custom character tokenizer implemented in `tokenizer.py`
	- Max sequence length: 128
	- Parameters: 4,783,631
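
The parameter count can be cross-checked from the dimensions listed above. The sketch below is only an arithmetic check under assumptions that this card does not state: a feed-forward (intermediate) size of 1024, 128 learned position embeddings, 2 token-type embeddings, and 15 labels (`B-`/`I-` for each of the 7 fields plus `O`). The vocabulary size of 6199 comes from the Vocabulary section, and `BertForTokenClassification` builds its encoder without a pooler layer, so none is counted.

```python
def bert_token_classifier_params(vocab, hidden, layers, intermediate,
                                 max_positions, type_vocab, num_labels):
    """Count weights and biases of a BERT encoder plus a linear tagging head."""
    layer_norm = 2 * hidden  # scale and bias
    embeddings = (vocab + max_positions + type_vocab) * hidden + layer_norm
    # Q, K, V and attention-output projections, each hidden x hidden plus bias.
    attention = 4 * (hidden * hidden + hidden) + layer_norm
    feed_forward = (hidden * intermediate + intermediate  # up-projection
                    + intermediate * hidden + hidden      # down-projection
                    + layer_norm)
    classifier = hidden * num_labels + num_labels
    return embeddings + layers * (attention + feed_forward) + classifier


total = bert_token_classifier_params(
    vocab=6199, hidden=256, layers=4,
    intermediate=1024,   # assumption: not stated in this card
    max_positions=128,   # assumption: equal to the max sequence length
    type_vocab=2,        # assumption: BERT default
    num_labels=15,       # assumption: B-/I- for 7 fields plus O
)
print(f"{total:,}")  # 4,783,631
```

The number of attention heads does not appear in the formula because splitting the hidden dimension into heads does not change the projection sizes.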

	The model files are stored at the repository root so `BertForTokenClassification.from_pretrained("ModerRAS/AniFileBERT")` can load the weights. Use `inference.py` for end-to-end parsing because the tokenizer is custom rather than a standard WordPiece tokenizer.

	## Dataset

	Training data snapshots are published separately in [`ModerRAS/AnimeName`](https://huggingface.co/datasets/ModerRAS/AnimeName), which this repository includes as a nested git submodule at `datasets/AnimeName`.

	Current DMHY export waterline (from `datasets/AnimeName`):

	- Last exported `files.id`: `1675184`
	- Next incremental export: `--min-id 1675185`
	- Weak-labeled samples: `632002`
	- Mixed training samples: `732002`

	## Vocabulary

	The published checkpoint uses a character vocabulary. `vocab.json` at the
	repository root is the deployed tokenizer vocab, and `vocab.char.json` is kept
	as a mirrored explicit copy for training/data maintenance. The full DMHY weak
	dataset has 6195 unique characters, so the complete character vocab has only
	6199 entries (6195 characters plus 4 special tokens) and reaches 100% token
	coverage.
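
As a rough illustration of why a character vocabulary stays small and still covers every token, here is a minimal sketch that builds one from a list of filenames and measures coverage. The special-token order, the sorted character order, and the sample filenames are assumptions made for this example; the actual construction lives in `convert_to_char_dataset.py` and may differ.

```python
SPECIAL_TOKENS = ["[PAD]", "[UNK]", "[CLS]", "[SEP]"]  # assumed order


def build_char_vocab(filenames):
    """Map special tokens, then every distinct character, to integer ids."""
    chars = sorted({ch for name in filenames for ch in name})
    return {tok: i for i, tok in enumerate(SPECIAL_TOKENS + chars)}


def coverage(filenames, vocab):
    """Fraction of characters that have their own vocabulary entry."""
    total = sum(len(name) for name in filenames)
    known = sum(ch in vocab for name in filenames for ch in name)
    return known / total if total else 1.0


# Hypothetical sample filenames, not taken from the dataset.
sample = ["[Group] Title - 01 [1080p].mkv", "[字幕组] 作品名 第02話 [720p].mp4"]
vocab = build_char_vocab(sample)
print(len(vocab) - len(SPECIAL_TOKENS), "characters, coverage", coverage(sample, vocab))
```

Because every character seen in the data gets an entry, coverage on that data is 100% by construction; `[UNK]` is only needed for characters that first appear at inference time.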

	The regex vocabulary is still maintained in `datasets/AnimeName/vocab.json` for
	dataset relabeling and diagnostics, but the root checkpoint loads as `char`.

	## Evaluation

	Final full-relabel char training (`632002` DMHY rows, 2 epochs, batch size 256,
	seed 48):

	| Metric | Value |
	|--------|-------|
	| Eval loss | 0.0163 |
	| Entity precision | 0.9800 |
	| Entity recall | 0.9867 |
	| Entity F1 | 0.9833 |
	| Token accuracy | 0.9943 |
	| Held-out parse full match | 2008/2048 (0.9805) |
	| Fixed regression full match | 21/21 (1.0000) |

	The fixed regression set includes second-season aliases such as `Ni`,
	`Ni no Sara`, `貳`, and `弐ノ章`, plus long-running episode IDs and dense meta
	blocks.
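
To make the difference between the entity-level and token-level rows concrete, the following sketch computes entity precision, recall, and F1 from BIO label sequences. It assumes micro-averaging over exact span matches, which is a common convention but is not confirmed by this card; the toy sequences are hypothetical.

```python
def bio_spans(labels):
    """Return {(field, start, end)} for each B-/I- run; end is exclusive."""
    spans, start, field = set(), None, None
    for i, label in enumerate(list(labels) + ["O"]):  # sentinel flushes the last span
        continues = label.startswith("I-") and label[2:] == field
        if field is not None and not continues:
            spans.add((field, start, i))
            start, field = None, None
        if label.startswith("B-") or (label.startswith("I-") and field is None):
            start, field = i, label[2:]
    return spans


def entity_prf(gold_seqs, pred_seqs):
    """Micro-averaged precision, recall, and F1 over exact span matches."""
    tp = n_pred = n_gold = 0
    for gold, pred in zip(gold_seqs, pred_seqs):
        g, p = bio_spans(gold), bio_spans(pred)
        tp += len(g & p)
        n_pred += len(p)
        n_gold += len(g)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1


gold = [["B-GROUP", "I-GROUP", "O", "B-TITLE", "I-TITLE", "O", "B-EPISODE"]]
pred = [["B-GROUP", "I-GROUP", "O", "B-TITLE", "O", "O", "B-EPISODE"]]
print(entity_prf(gold, pred))  # two of three spans match exactly
```

In this toy case a single wrong character label leaves token accuracy at 6/7 but drops entity precision and recall to 2/3, which is why the entity rows are the stricter measure. The reported F1 is also consistent with the reported precision and recall: 2 × 0.9800 × 0.9867 / (0.9800 + 0.9867) ≈ 0.9833.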

	## Usage

	Install dependencies:

	```bash
	uv sync
	```

	Parse a filename with this repository cloned locally:

	```bash
	python inference.py --model-dir . "Witch.Hat.Atelier.S01E07.1080p.NF.WEB-DL.JPN.AAC2.0.H.264.MSubs-ToonsHub"
	```

	Load only the model weights from the Hub:

	```python
	from transformers import BertForTokenClassification

	model = BertForTokenClassification.from_pretrained("ModerRAS/AniFileBERT")
	```

	For full parsing, clone this repo and use `load_tokenizer` from `tokenizer.py` or the CLI in `inference.py`.
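
For orientation, the post-processing step of full parsing can be sketched without the model: given one BIO label per character, group the characters into fields. This is a simplified illustration, not the logic in `inference.py`; the function name `decode_fields`, the output shape, and the example filename are hypothetical.

```python
from collections import defaultdict


def decode_fields(text, labels):
    """Group characters into fields by following B-/I- label runs."""
    fields = defaultdict(list)
    current = None  # field name of the span being built, if any
    for ch, label in zip(text, labels):
        prefix, _, name = label.partition("-")
        if prefix == "B" or (prefix == "I" and name != current):
            fields[name].append(ch)   # start a new span
            current = name
        elif prefix == "I":
            fields[name][-1] += ch    # extend the open span
        else:
            current = None            # "O" closes any open span
    return dict(fields)


# Hand-written labels standing in for model predictions.
text = "[Sub] Show - 07 [1080p]"
labels = ["O"] * len(text)
for name, start, end in [("GROUP", 1, 4), ("TITLE", 6, 10),
                         ("EPISODE", 13, 15), ("RESOLUTION", 17, 22)]:
    labels[start] = f"B-{name}"
    labels[start + 1:end] = [f"I-{name}"] * (end - start - 1)

print(decode_fields(text, labels))
# {'GROUP': ['Sub'], 'TITLE': ['Show'], 'EPISODE': ['07'], 'RESOLUTION': ['1080p']}
```

Each field maps to a list because a filename can contain several spans of the same type, for example multiple `SPECIAL` tags.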

	## Clone with Dataset Submodule

	```bash
	git clone --recursive https://huggingface.co/ModerRAS/AniFileBERT
	# or, after a normal clone:
	git submodule update --init --recursive
	```

	## Training

	### Character-token DMHY training

	```bash
	uv run python convert_to_char_dataset.py \
	  --input datasets/AnimeName/dmhy_weak.jsonl \
	  --output datasets/AnimeName/dmhy_weak_char.jsonl \
	  --vocab-output datasets/AnimeName/vocab.char.json \
	  --manifest-output datasets/AnimeName/dmhy_weak_char.manifest.json

	uv run python train.py --tokenizer char \
	  --data-file datasets/AnimeName/dmhy_weak_char.jsonl \
	  --vocab-file datasets/AnimeName/vocab.char.json \
	  --save-dir checkpoints/dmhy-char-full-relabel \
	  --init-model-dir . \
	  --epochs 2 --batch-size 256 \
	  --learning-rate 0.00008 --warmup-steps 300 \
	  --checkpoint-steps 1000 --save-total-limit 3 \
	  --parse-eval-limit 2048 \
	  --max-seq-length 128 --seed 48
	```

	The converter keeps source metadata and adds `tokenizer_variant`, source token
	count, and character token count fields to each record. The char dataset's
	p99 length is 107 characters, so `--max-seq-length 128` covers almost all rows
	while leaving room for `[CLS]` and `[SEP]`.
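
The length budget works out as follows: two of the 128 positions go to `[CLS]` and `[SEP]`, leaving 126 for characters, so a 107-character filename fits with room to spare. Below is a minimal encoding sketch, assuming right-truncation and right-padding; the real behavior is defined in `tokenizer.py` and may differ, and the toy vocabulary is hypothetical.

```python
def encode(filename, vocab, max_seq_length=128):
    """Wrap character ids in [CLS]/[SEP], truncate, and pad to a fixed length."""
    budget = max_seq_length - 2  # slots left for characters
    ids = [vocab.get(ch, vocab["[UNK]"]) for ch in filename[:budget]]
    ids = [vocab["[CLS]"]] + ids + [vocab["[SEP]"]]
    mask = [1] * len(ids)
    pad = max_seq_length - len(ids)
    return ids + [vocab["[PAD]"]] * pad, mask + [0] * pad


# Toy vocabulary for illustration; the real one is loaded from vocab.json.
vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "a": 4, "b": 5}

ids, mask = encode("ab" * 100, vocab)        # 200 characters, truncated to 126
print(len(ids), sum(mask))                   # 128 128

short_ids, short_mask = encode("abz", vocab)  # "z" is out of vocabulary
print(short_ids[:6], sum(short_mask))         # [2, 4, 5, 1, 3, 0] 5
```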

	### Relabel the full dataset

	```bash
	uv run python relabel_dataset_from_filenames.py \
	  --input datasets/AnimeName/dmhy_weak.jsonl \
	  --output datasets/AnimeName/dmhy_weak.relabel.jsonl \
	  --manifest-output datasets/AnimeName/dmhy_weak.relabel.manifest.json \
	  --vocab-output datasets/AnimeName/vocab.relabel.json \
	  --base-vocab datasets/AnimeName/vocab.json \
	  --max-vocab-size 8000

	# Promote the relabeled outputs over the previous files.
	mv -f datasets/AnimeName/dmhy_weak.relabel.jsonl datasets/AnimeName/dmhy_weak.jsonl
	mv -f datasets/AnimeName/vocab.relabel.json datasets/AnimeName/vocab.json
	mv -f datasets/AnimeName/dmhy_weak.relabel.manifest.json datasets/AnimeName/dmhy_weak.manifest.json
	```

	### Rebuild vocabulary (if needed)

	```bash
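	python -c "
	import collections, json
	tokens = collections.Counter()
	# Stream the dataset line by line and close the file when done.
	with open('datasets/AnimeName/dmhy_weak.jsonl', encoding='utf-8') as f:
	    for line in f:
	        item = json.loads(line) if line.strip() else None
	        if item:
	            tokens.update(item['tokens'])
	# 4 special tokens + the 7996 most frequent tokens = at most 8000 entries.
	vocab = {t: i for i, t in enumerate(['[PAD]', '[UNK]', '[CLS]', '[SEP]'] + [t for t, _ in tokens.most_common(7996)])}
	with open('vocab.json', 'w', encoding='utf-8') as f:
	    json.dump(vocab, f, ensure_ascii=False, indent=2)
	"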
	```

	### Export ONNX for MiruPlay Android

	```bash
	uv run python export_onnx.py --model-dir . --output exports/anime_filename_parser.onnx --max-length 128
	```

	---

	## Google Colab Training

	For Codex-controlled short Colab sessions, see [`colab/README.md`](colab/README.md).
	Free Colab still has to be started manually, but once `colab_worker.py` is
	running, Codex can submit jobs through `colab_client.py`, tail logs, and
	inspect status. Checkpoints live on Google Drive, and the default profiles
	resume from the latest checkpoint automatically.

	Manual one-shot runs are also supported:

	```bash
	python colab_train.py --profile dmhy_regex_finetune
	```

	## Repository Layout

	- `model.safetensors`, `config.json`, `vocab.json`: default published model
	- `train.py`, `dataset.py`, `tokenizer.py`, `model.py`: training pipeline
	- `dmhy_dataset.py`, `mix_datasets.py`: weak-label export and dataset mixing
	- `convert_to_char_dataset.py`: full character-token projection for weak labels
	- `inference.py`: end-to-end filename parser CLI
	- `export_onnx.py`: ONNX export for Android integration
	- `exports/`: exported ONNX model and metadata
	- `datasets/AnimeName/`: nested dataset submodule

	## Maintenance Notes

	MiruPlay tracks this repository as `tools/anime_parser`, and this repository
	tracks `ModerRAS/AnimeName` as `datasets/AnimeName`. After updating either
	repo, remember to commit the submodule pointer in the parent repo.

	For the full maintenance workflow, see MiruPlay's
	`docs/anifilebert-maintenance.md`.