	---
	license: apache-2.0
	library_name: transformers
	pipeline_tag: token-classification
	tags:
	- anime
	- filename-parsing
	- bert
	- token-classification
	datasets:
	- ModerRAS/AnimeName
	language:
	- en
	- ja
	- zh
	---

	# AniFileBERT

	AniFileBERT is a tiny BERT token-classification model for parsing anime release filenames into structured fields such as release group, title, season, episode, resolution, source, and special tags.

	The checkpoint in this repository is the DMHY weak-label fine-tuned regex-tokenizer model used by MiruPlay.

	## Model

	- Architecture: `BertForTokenClassification`
	- Hidden size: 256
	- Layers: 4
	- Attention heads: 8
	- Labels: BIO token labels for `TITLE`, `SEASON`, `EPISODE`, `GROUP`, `RESOLUTION`, `SOURCE`, and `SPECIAL`
	- Tokenizer: custom regex/structure tokenizer implemented in `tokenizer.py`
	- Max sequence length: 64
	- Parameters: about 5M
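
As an illustration of the BIO scheme listed above, the following minimal sketch builds a label set from the seven fields and groups tagged tokens into spans. The label order, the `decode_bio` helper, and the example tokenization are assumptions made for illustration; the real label mapping lives in `config.json` and the real token boundaries come from `tokenizer.py`.

```python
from typing import List, Optional, Tuple

# Field names come from the model card; the label order is an assumption.
FIELDS = ["TITLE", "SEASON", "EPISODE", "GROUP", "RESOLUTION", "SOURCE", "SPECIAL"]
LABELS = ["O"] + [f"{prefix}-{field}" for field in FIELDS for prefix in ("B", "I")]


def decode_bio(tokens: List[str], labels: List[str]) -> List[Tuple[str, List[str]]]:
    """Group tokens into (field, tokens) spans by following BIO tags."""
    spans: List[Tuple[str, List[str]]] = []
    current_field: Optional[str] = None
    for token, label in zip(tokens, labels):
        if label == "O":
            current_field = None
            continue
        prefix, field = label.split("-", 1)
        if prefix == "B" or field != current_field:
            # A B- tag, or an I- tag that does not continue the open span,
            # starts a new span.
            spans.append((field, [token]))
            current_field = field
        else:
            spans[-1][1].append(token)
    return spans


# Hypothetical tokenization of a release name, not real tokenizer output.
tokens = ["[", "LoliHouse", "]", "Title", "Name", "-", "07", "[", "1080p", "]"]
labels = ["O", "B-GROUP", "O", "B-TITLE", "I-TITLE", "O", "B-EPISODE", "O", "B-RESOLUTION", "O"]
print(len(LABELS))
print(decode_bio(tokens, labels))
```

With seven fields, a `B-` and an `I-` tag per field plus `O` gives 15 labels under this scheme.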

	The model files are stored at the repository root so `BertForTokenClassification.from_pretrained("ModerRAS/AniFileBERT")` can load the weights. Use `inference.py` for end-to-end parsing because the tokenizer is custom rather than a standard WordPiece tokenizer.

	## Dataset

	Training data snapshots are published separately in [`ModerRAS/AnimeName`](https://huggingface.co/datasets/ModerRAS/AnimeName), which this repository includes as a nested Git submodule at `datasets/AnimeName`.

	Current DMHY export waterline (from `datasets/AnimeName`):

	- Last exported `files.id`: `1675184`
	- Next incremental export: `--min-id 1675185`
	- Weak-labeled samples: `632002`
	- Mixed training samples: `732002`

	## Vocabulary

	The default `vocab.json` contains 8000 tokens (up from 3000), built from a
	frequency analysis of the full 632K-sample DMHY weak-label dataset. Tokens
	outside the vocabulary become `[UNK]`, so a larger vocabulary directly improves coverage:

	| Vocab size | Coverage | Model params |
	|------------|----------|--------------|
	| 3000 (old) | 90.4% | ~4.0M |
	| 8000 (current) | 96.2% | ~5.3M |
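
Coverage here is read as the share of token occurrences that map to a vocabulary entry rather than to `[UNK]`. A minimal sketch of that measurement follows; the `token_coverage` helper and the toy samples are hypothetical and are not taken from the repository or the DMHY export.

```python
from collections import Counter
from typing import Iterable, List, Set


def token_coverage(samples: Iterable[List[str]], vocab: Set[str]) -> float:
    """Return the fraction of token occurrences found in the vocabulary."""
    counts: Counter = Counter()
    for tokens in samples:
        counts.update(tokens)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    covered = sum(n for token, n in counts.items() if token in vocab)
    return covered / total


# Toy data: "RareGroup" is the only token missing from the vocabulary.
samples = [["[", "LoliHouse", "]", "1080p"], ["[", "RareGroup", "]", "1080p"]]
vocab = {"[", "]", "1080p", "LoliHouse"}
print(f"{token_coverage(samples, vocab):.3f}")  # 7 of 8 occurrences covered: 0.875
```

Because the count is per occurrence, adding one frequent token (a popular group name, a bracket) moves coverage more than adding many rare ones.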

	Common fansub group names (`Snow`, `LoliHouse`, `DMG`, `KTXP`, `Sakurato`, etc.)
	and individual bracket characters (`[`, `]`, `(`, `)`) are included in the new
	vocabulary.

	For character-token training, `vocab.char.json` is mirrored at the repository
	root for plain `git pull` users and also lives at
	`datasets/AnimeName/vocab.char.json` beside the dataset. It is built from the
	full `dmhy_weak_char.jsonl` export. The full DMHY weak dataset has 6195
	unique characters, so the complete character vocabulary has only 6199 entries
	including the four special tokens, and it reaches 100% token coverage.
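
The entry count follows from adding the special tokens to the set of distinct characters. A rough sketch of how such a vocabulary could be built is shown below; `build_char_vocab`, the sorted ordering, and the sample filenames are assumptions for illustration and may not match what `convert_to_char_dataset.py` does.

```python
from typing import Dict, Iterable

# Assumed to match the special tokens used by the regex vocabulary.
SPECIAL_TOKENS = ["[PAD]", "[UNK]", "[CLS]", "[SEP]"]


def build_char_vocab(filenames: Iterable[str]) -> Dict[str, int]:
    """Map special tokens, then every distinct character, to consecutive ids."""
    chars = sorted({ch for name in filenames for ch in name})
    return {token: idx for idx, token in enumerate(SPECIAL_TOKENS + chars)}


# Hypothetical filenames mixing ASCII and Japanese characters.
names = ["[Snow] Title - 01.mkv", "[DMG] タイトル 02.mp4"]
vocab = build_char_vocab(names)
print(len(vocab), vocab["[UNK]"])
```

Applied to the numbers above, 6195 distinct characters plus 4 special tokens gives the 6199 entries reported for `vocab.char.json`.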

	## Evaluation

	Balanced mixed-data A/B run (`50K` synthetic + `50K` DMHY weak labels, 1 epoch, batch size 128, seed 42):

	| Variant | Max length | Vocab | Params | Eval F1 | Accuracy | Train runtime |
	|---------|------------|-------|--------|---------|----------|---------------|
	| regex | 64 | 3000 | 3.96M | 0.9911 | 0.9951 | 827s |
	| char | 128 | 2654 | 3.88M | 0.8142 | 0.9637 | 1983s |

	Field-level F1 on the same validation split:

	| Field | regex | char |
	|-------|-------|------|
	| GROUP | 0.9962 | 0.9516 |
	| TITLE | 0.9761 | 0.7983 |
	| SEASON | 0.9880 | 0.6290 |
	| EPISODE | 0.9950 | 0.8082 |

	The regex tokenizer remains the default. Both variants can parse simple `S01E07` patterns, but the character tokenizer was weaker on season/episode boundaries and long title spans.
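
To show why boundary errors weigh so heavily on field-level scores, here is a minimal sketch of per-field F1 under exact-match span scoring. This scoring rule is an assumption; the repository's evaluation code may compute F1 differently, and the `field_f1` helper and the spans are hypothetical.

```python
from typing import Dict, Set, Tuple

# (field, start token index, end token index exclusive)
Span = Tuple[str, int, int]


def field_f1(gold: Set[Span], predicted: Set[Span]) -> Dict[str, float]:
    """Compute exact-match span F1 separately for each field."""
    scores: Dict[str, float] = {}
    for field in sorted({span[0] for span in gold | predicted}):
        gold_f = {s for s in gold if s[0] == field}
        pred_f = {s for s in predicted if s[0] == field}
        true_pos = len(gold_f & pred_f)
        precision = true_pos / len(pred_f) if pred_f else 0.0
        recall = true_pos / len(gold_f) if gold_f else 0.0
        total = precision + recall
        scores[field] = 2 * precision * recall / total if total else 0.0
    return scores


gold = {("GROUP", 1, 2), ("TITLE", 3, 6), ("EPISODE", 7, 8)}
# The predicted title stops one token early, so the whole span counts as wrong.
predicted = {("GROUP", 1, 2), ("TITLE", 3, 5), ("EPISODE", 7, 8)}
print(field_f1(gold, predicted))
```

Under this rule a title that is cut one token short scores zero for that span, which is one plausible reason long spans are harder for the character variant.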

	## Usage

	Install dependencies:

	```bash
	pip install -r requirements.txt
	```

	Parse a filename with this repository cloned locally:

	```bash
	python inference.py --model-dir . "Witch.Hat.Atelier.S01E07.1080p.NF.WEB-DL.JPN.AAC2.0.H.264.MSubs-ToonsHub"
	```

	Load only the model weights from the Hub:

	```python
	from transformers import BertForTokenClassification

	model = BertForTokenClassification.from_pretrained("ModerRAS/AniFileBERT")
	```

	For full parsing, clone this repo and use `load_tokenizer` from `tokenizer.py` or the CLI in `inference.py`.

	## Clone with Dataset Submodule

	```bash
	git clone --recursive https://huggingface.co/ModerRAS/AniFileBERT
	# or, after a normal clone:
	git submodule update --init --recursive
	```

	## Training

	### Prerequisites (Windows / Local GPU)

	PyTorch 2.11+ with CUDA 12.6 is required for GPU training:

	```bash
	pip install torch --index-url https://download.pytorch.org/whl/cu126
	pip install -r requirements.txt
	```

	### Fine-tune with rebuilt vocabulary

	```bash
	python train.py --data-file datasets/AnimeName/dmhy_weak.jsonl \
	--vocab-file datasets/AnimeName/vocab.json \
	--save-dir checkpoints/dmhy-finetune \
	--init-model-dir . \
	--epochs 10 --batch-size 128 \
	--learning-rate 0.0003 --warmup-steps 300 --seed 42
	```

	Fine-tuning starts from the old 3000-token checkpoint. `resize_token_embeddings()`
	adds 5000 randomly initialized embedding rows for the new vocabulary entries, and
	training then updates the full model. About 96% of token occurrences are now
	covered (vs 90% with the old 3000-token vocabulary).
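
The size of the change can be checked with simple arithmetic: each vocabulary entry owns one row of the token embedding matrix, and the token-classification head does not depend on vocabulary size. The sketch below uses the hidden size from the Model section; `embedding_params` is a hypothetical helper written only for this calculation.

```python
HIDDEN_SIZE = 256  # from the Model section
OLD_VOCAB, NEW_VOCAB = 3000, 8000


def embedding_params(vocab_size: int, hidden_size: int = HIDDEN_SIZE) -> int:
    """Number of weights in a vocab_size x hidden_size token embedding matrix."""
    return vocab_size * hidden_size


added_rows = NEW_VOCAB - OLD_VOCAB
added_params = embedding_params(NEW_VOCAB) - embedding_params(OLD_VOCAB)
print(added_rows, added_params)  # 5000 1280000
```

The roughly 1.28M extra weights are consistent with the growth from about 4.0M to about 5.3M parameters reported in the Vocabulary section.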

	### Character-token DMHY training

	```bash
	python convert_to_char_dataset.py \
	--input datasets/AnimeName/dmhy_weak.jsonl \
	--output datasets/AnimeName/dmhy_weak_char.jsonl \
	--vocab-output vocab.char.json \
	--manifest-output datasets/AnimeName/dmhy_weak_char.manifest.json

	python train.py --tokenizer char \
	--data-file datasets/AnimeName/dmhy_weak_char.jsonl \
	--vocab-file vocab.char.json \
	--save-dir checkpoints_char/dmhy-weak-char \
	--epochs 1 --batch-size 64 \
	--learning-rate 0.0003 --warmup-steps 300 \
	--max-seq-length 128 --seed 42
	```

	The converter keeps source metadata and adds `tokenizer_variant`, source token
	count, and character token count fields to each record. The char dataset's
	p99 length is 107 characters, so `--max-seq-length 128` covers almost all rows
	while leaving room for `[CLS]` and `[SEP]`.
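
The length budget can be checked with a short calculation like the one below. It is a sketch only: the nearest-rank percentile, the `fits` helper, and the toy lengths are assumptions, and the repository may measure p99 differently.

```python
import math
from typing import List, Sequence


def percentile_nearest_rank(values: Sequence[int], pct: float) -> int:
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def fits(length: int, max_seq_length: int = 128, special_tokens: int = 2) -> bool:
    """True when `length` characters plus [CLS] and [SEP] fit without truncation."""
    return length + special_tokens <= max_seq_length


lengths: List[int] = [42, 58, 61, 73, 88, 95, 107, 140]  # toy lengths, not real data
print(percentile_nearest_rank(lengths, 50), fits(107), fits(127))
```

With `--max-seq-length 128` and two special tokens, rows of up to 126 characters fit untruncated, so a p99 of 107 leaves some margin.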

	### Regenerate datasets from source

	```bash
	python data_generator.py --num-samples 100000
	python dmhy_dataset.py --db D:/WorkSpace/Python/dmhy-parser/dmhy_anime.db --output data/dmhy/dmhy_weak.jsonl
	python mix_datasets.py --synthetic data/synthetic.jsonl --dmhy data/dmhy/dmhy_weak.jsonl --output data/dmhy/mixed_train.jsonl
	```

	### Rebuild vocabulary (if needed)

	```bash
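	python -c "
	import collections, json
	# Count token occurrences across the weak-label export, one JSON record per line.
	tokens = collections.Counter()
	with open('datasets/AnimeName/dmhy_weak.jsonl', encoding='utf-8') as f:
	    for line in f:
	        if line.strip():
	            tokens.update(json.loads(line)['tokens'])
	# Four special tokens plus the 7996 most frequent tokens gives 8000 entries.
	specials = ['[PAD]', '[UNK]', '[CLS]', '[SEP]']
	vocab = {t: i for i, t in enumerate(specials + [t for t, _ in tokens.most_common(7996)])}
	with open('vocab.json', 'w', encoding='utf-8') as f:
	    json.dump(vocab, f, ensure_ascii=False, indent=2)
	"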
	```

	### Export ONNX for MiruPlay Android

	```bash
	python export_onnx.py --model-dir checkpoints/dmhy-finetune/final --output exports/anime_filename_parser.onnx
	```

	---

	## Google Colab Training

	For short, Codex-controlled Colab sessions, see [`colab/README.md`](colab/README.md).
	Free Colab still has to be started manually, but once `colab_worker.py` is
	running, Codex can submit jobs through `colab_client.py`, tail logs, and inspect
	status. Checkpoints live on Google Drive, and the default profiles resume from
	the latest checkpoint automatically.

	Manual one-shot runs are also supported:

	```bash
	python colab_train.py --profile dmhy_regex_finetune
	```

	## Repository Layout

	- `model.safetensors`, `config.json`, `vocab.json`: default fine-tuned model
	- `train.py`, `dataset.py`, `tokenizer.py`, `model.py`: training pipeline
	- `dmhy_dataset.py`, `mix_datasets.py`: weak-label export and dataset mixing
	- `convert_to_char_dataset.py`: full character-token projection for weak labels
	- `inference.py`: end-to-end filename parser CLI
	- `export_onnx.py`: ONNX export for Android integration
	- `exports/`: exported ONNX model and metadata
	- `data/dmhy/*.manifest.json`: dataset waterlines and counts
	- `datasets/AnimeName/`: nested dataset submodule

	## Maintenance Notes

	MiruPlay tracks this repository as `tools/anime_parser`, and this repository
	tracks `ModerRAS/AnimeName` as `datasets/AnimeName`. After updating either
	repo, remember to commit the submodule pointer in the parent repo.

	For the full maintenance workflow, see MiruPlay's
	`docs/anifilebert-maintenance.md`.