license: apache-2.0
library_name: transformers
pipeline_tag: token-classification
tags:
  - anime
  - filename-parsing
  - bert
  - token-classification
datasets:
  - ModerRAS/AnimeName
language:
  - en
  - ja
  - zh

AniFileBERT

AniFileBERT is a tiny BERT token-classification model for parsing anime release filenames into structured fields such as release group, title, season, episode, resolution, source, and special tags.

The checkpoint in this repository is the regex-tokenizer variant, fine-tuned on DMHY weak labels, that MiruPlay uses.

Model

Architecture: BertForTokenClassification
Hidden size: 256
Layers: 4
Attention heads: 8
Labels: BIO token labels for TITLE, SEASON, EPISODE, GROUP, RESOLUTION, SOURCE, and SPECIAL
Tokenizer: custom regex/structure tokenizer implemented in tokenizer.py
Max sequence length: 64
Parameters: about 5M
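
As a rough illustration of the label space described above, a plain BIO scheme over the seven listed fields gives 15 labels (one O label plus a B-/I- pair per field). The ordering and the names ENTITY_TYPES, build_bio_labels, ID2LABEL, and LABEL2ID are assumptions made for this sketch; the authoritative mapping is whatever config.json in this repository defines.

```python
# Entity types as listed in this model card; the ordering is an assumption.
ENTITY_TYPES = ["TITLE", "SEASON", "EPISODE", "GROUP", "RESOLUTION", "SOURCE", "SPECIAL"]


def build_bio_labels(entity_types):
    """Return 'O' followed by a B-/I- pair for each entity type."""
    labels = ["O"]
    for name in entity_types:
        labels.append(f"B-{name}")
        labels.append(f"I-{name}")
    return labels


LABELS = build_bio_labels(ENTITY_TYPES)
ID2LABEL = dict(enumerate(LABELS))
LABEL2ID = {label: idx for idx, label in ID2LABEL.items()}

print(len(LABELS))  # 15
```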

The model files are stored at the repository root so BertForTokenClassification.from_pretrained("ModerRAS/AniFileBERT") can load the weights. Use inference.py for end-to-end parsing because the tokenizer is custom rather than a standard WordPiece tokenizer.
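
To show why a structure-oriented tokenizer differs from WordPiece, here is a minimal sketch of a regex tokenizer for release filenames. The pattern and the function name sketch_tokenize are hypothetical and are not taken from tokenizer.py; the real tokenizer may split differently.

```python
import re

# Hypothetical pattern, NOT the one in tokenizer.py: runs of ASCII letters,
# runs of digits, and any other single non-space character (brackets, dots,
# dashes, CJK characters) each become one token. Whitespace is dropped.
TOKEN_PATTERN = re.compile(r"[A-Za-z]+|\d+|[^\sA-Za-z\d]")


def sketch_tokenize(filename):
    """Split a release filename into coarse structural tokens."""
    return TOKEN_PATTERN.findall(filename)


print(sketch_tokenize("[LoliHouse] Title - 07 [1080p].mkv"))
# ['[', 'LoliHouse', ']', 'Title', '-', '07', '[', '1080', 'p', ']', '.', 'mkv']
```

Because each token is a whole word, number, or delimiter, the sequence stays short enough for the 64-token limit, at the cost of needing a vocabulary that contains the frequent words.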

Dataset

Training data snapshots are published separately in ModerRAS/AnimeName. This repository includes that dataset repository as a nested git submodule at datasets/AnimeName.

Current DMHY export waterline (from datasets/AnimeName):

Last exported files.id: 1675184
Next incremental export: --min-id 1675185
Weak-labeled samples: 632002
Mixed training samples: 732002

Vocabulary

The default vocab.json contains 8000 tokens (up from 3000), built from a frequency analysis of the full 632K-sample DMHY weak-label dataset. Tokens not in the vocabulary become [UNK], so a larger vocabulary directly improves coverage:

Vocab size	Coverage	Model params
3000 (old)	90.4%	~4.0M
8000 (current)	96.2%	~5.3M

Common fansub group names (Snow, LoliHouse, DMG, KTXP, Sakurato, etc.) and individual bracket characters ([, ], (, )) are included in the new vocabulary.
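
The coverage figures above are measured over token occurrences. A minimal sketch of that calculation follows; the function name token_coverage and the toy counts are illustrative assumptions, not values from the real dataset.

```python
from collections import Counter


def token_coverage(token_counts, vocab):
    """Fraction of token occurrences whose token is present in vocab."""
    total = sum(token_counts.values())
    if total == 0:
        return 0.0
    covered = sum(n for tok, n in token_counts.items() if tok in vocab)
    return covered / total


# Toy counts for illustration only; not taken from the real dataset.
counts = Counter({"[": 40, "]": 40, "1080": 10, "LoliHouse": 8, "RareGroup": 2})
top4 = {tok for tok, _ in counts.most_common(4)}

print(token_coverage(counts, top4))  # 0.98
```

In this toy case the one token left out of the vocabulary accounts for 2 of 100 occurrences, so those two positions would be mapped to [UNK].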

Evaluation

Balanced mixed-data A/B run (50K synthetic + 50K DMHY weak labels, 1 epoch, batch size 128, seed 42):

Variant	Max length	Vocab	Params	Eval F1	Accuracy	Train runtime
regex	64	3000	3.96M	0.9911	0.9951	827s
char	128	2654	3.88M	0.8142	0.9637	1983s

Field-level F1 on the same validation split:

Field	regex	char
GROUP	0.9962	0.9516
TITLE	0.9761	0.7983
SEASON	0.9880	0.6290
EPISODE	0.9950	0.8082

The regex tokenizer remains the default. Both variants can parse simple S01E07 patterns, but the character tokenizer was weaker on season/episode boundaries and long title spans. Note that this comparison used the earlier 3000-token regex vocabulary.
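
This model card does not state how field-level F1 is computed. One common convention, assumed here, is exact-match span F1 over BIO tags: a predicted span counts only if its field, start, and end all match the reference. The helper names bio_spans and field_f1 and the example tags are hypothetical.

```python
def bio_spans(tags):
    """Return a set of (field, start, end) spans from a BIO tag sequence."""
    spans, start, field = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes the last span
        continues = tag.startswith("I-") and field == tag[2:]
        if field is not None and not continues:
            spans.add((field, start, i))
            start, field = None, None
        if tag.startswith("B-"):
            start, field = i, tag[2:]
    return spans


def field_f1(gold_tags, pred_tags, field):
    """Exact-match span F1 for one field; 0.0 when nothing matches."""
    gold = {s for s in bio_spans(gold_tags) if s[0] == field}
    pred = {s for s in bio_spans(pred_tags) if s[0] == field}
    true_positives = len(gold & pred)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hand-written tags for illustration; the prediction cuts the title short.
gold = ["B-GROUP", "O", "B-TITLE", "I-TITLE", "O", "B-EPISODE"]
pred = ["B-GROUP", "O", "B-TITLE", "O", "O", "B-EPISODE"]

for name in ("GROUP", "TITLE", "EPISODE"):
    print(name, field_f1(gold, pred, name))
```

Under this convention a truncated title scores zero for that span, which is one way long title spans can pull TITLE F1 down much more than token accuracy.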

Usage

Install dependencies:

pip install -r requirements.txt

Parse a filename with this repository cloned locally:

python inference.py --model-dir . "Witch.Hat.Atelier.S01E07.1080p.NF.WEB-DL.JPN.AAC2.0.H.264.MSubs-ToonsHub"

Load only the model weights from the Hub:

from transformers import BertForTokenClassification

model = BertForTokenClassification.from_pretrained("ModerRAS/AniFileBERT")

For full parsing, clone this repo and use load_tokenizer from tokenizer.py or the CLI in inference.py.
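
Once tokens and per-token BIO labels are available, they still need to be grouped into fields. The sketch below shows one way to do that; it is not the logic in inference.py. The function name group_fields, joining tokens with a space, and the hand-written tags (not real model output) are all assumptions.

```python
from collections import defaultdict


def group_fields(tokens, tags):
    """Collect tokens into {field: [span_text, ...]} using BIO tags."""
    fields = defaultdict(list)
    current = None  # field of the span currently being extended
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            current = tag[2:]
            fields[current].append(token)
        elif tag.startswith("I-") and current == tag[2:]:
            # Joining with a space is an assumption made for readability.
            fields[current][-1] += " " + token
        else:
            current = None
    return dict(fields)


# Hand-written example labels, not produced by the model.
tokens = ["[", "LoliHouse", "]", "Witch", "Hat", "Atelier", "-", "07"]
tags = ["O", "B-GROUP", "O", "B-TITLE", "I-TITLE", "I-TITLE", "O", "B-EPISODE"]

print(group_fields(tokens, tags))
# {'GROUP': ['LoliHouse'], 'TITLE': ['Witch Hat Atelier'], 'EPISODE': ['07']}
```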

Clone with Dataset Submodule

git clone --recursive https://huggingface.co/ModerRAS/AniFileBERT
# or, after a normal clone:
git submodule update --init --recursive

Training

Prerequisites (Windows / Local GPU)

PyTorch 2.11+ with CUDA 12.6 is required for GPU training:

pip install torch --index-url https://download.pytorch.org/whl/cu126
pip install -r requirements.txt

Fine-tune with rebuilt vocabulary

python train.py --data-file datasets/AnimeName/dmhy_weak.jsonl \
  --vocab-file datasets/AnimeName/vocab.json \
  --save-dir checkpoints/dmhy-finetune \
  --init-model-dir . \
  --epochs 10 --batch-size 128 \
  --learning-rate 0.0003 --warmup-steps 300 --seed 42

Training starts from the old 3000-token checkpoint. resize_token_embeddings() appends 5000 randomly initialized embedding rows for the new vocabulary, and fine-tuning then updates the full model. The 8000-token vocabulary covers about 96% of token occurrences, up from about 90% with the old 3000-token vocabulary.
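
The effect of growing the embedding table can be sketched without any framework. The snippet below is a conceptual illustration, not the transformers implementation: the function name resize_embeddings and the 0.02 standard deviation are assumptions, and the real initialization depends on the library version and model config.

```python
import random


def resize_embeddings(old_rows, new_vocab_size, hidden_size, seed=42, std=0.02):
    """Keep existing rows by index and append randomly initialized rows.

    Only growing is handled; a smaller new_vocab_size returns the rows as-is.
    """
    rng = random.Random(seed)
    extra = new_vocab_size - len(old_rows)
    new_rows = [[rng.gauss(0.0, std) for _ in range(hidden_size)] for _ in range(extra)]
    return old_rows + new_rows


# Sizes from this model card: 3000 -> 8000 tokens at hidden size 256.
OLD_VOCAB, NEW_VOCAB, HIDDEN = 3000, 8000, 256
added_params = (NEW_VOCAB - OLD_VOCAB) * HIDDEN
print(added_params)  # 1280000

# Toy-sized demonstration so the example runs instantly.
old = [[0.1] * 4 for _ in range(3)]
resized = resize_embeddings(old, new_vocab_size=8, hidden_size=4)
print(len(resized), resized[:3] == old)  # 8 True
```

The 1.28M added embedding parameters account for most of the growth from roughly 4.0M to 5.3M parameters. Because rows are kept by index, the carried-over weights are only meaningful for ids that refer to the same token in both vocabularies.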

Regenerate datasets from source

python data_generator.py --num-samples 100000
python dmhy_dataset.py --db D:/WorkSpace/Python/dmhy-parser/dmhy_anime.db --output data/dmhy/dmhy_weak.jsonl
python mix_datasets.py --synthetic data/synthetic.jsonl --dmhy data/dmhy/dmhy_weak.jsonl --output data/dmhy/mixed_train.jsonl

Rebuild vocabulary (if needed)

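python -c "
import collections, json
# Count token frequencies, reading the JSONL file line by line.
tokens = collections.Counter()
with open('datasets/AnimeName/dmhy_weak.jsonl', encoding='utf-8') as f:
    for line in f:
        if line.strip():
            tokens.update(json.loads(line)['tokens'])
# Four special tokens plus the 7996 most frequent tokens give 8000 entries.
specials = ['[PAD]', '[UNK]', '[CLS]', '[SEP]']
vocab = {t: i for i, t in enumerate(specials + [t for t, _ in tokens.most_common(7996)])}
with open('vocab.json', 'w', encoding='utf-8') as f:
    json.dump(vocab, f, ensure_ascii=False, indent=2)
"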

Export ONNX for MiruPlay Android

python export_onnx.py --model-dir checkpoints/dmhy-finetune/final --output exports/anime_filename_parser.onnx

Google Colab Training

Upload and run colab_train.py in a Colab GPU runtime. It will mount Google Drive, clone both repos, install dependencies, and run the full training pipeline. Checkpoints are saved to your Drive automatically.

Repository Layout

model.safetensors, config.json, vocab.json: default fine-tuned model
train.py, dataset.py, tokenizer.py, model.py: training pipeline
dmhy_dataset.py, mix_datasets.py: weak-label export and dataset mixing
inference.py: end-to-end filename parser CLI
export_onnx.py: ONNX export for Android integration
exports/: exported ONNX model and metadata
data/dmhy/*.manifest.json: dataset waterlines and counts
datasets/AnimeName/: nested dataset submodule

Maintenance Notes

MiruPlay tracks this repository as tools/anime_parser, and this repository tracks ModerRAS/AnimeName as datasets/AnimeName. After updating either repo, remember to commit the submodule pointer in the parent repo.

For the full maintenance workflow, see MiruPlay's docs/anifilebert-maintenance.md.