license: apache-2.0
library_name: transformers
pipeline_tag: token-classification
tags:
  - anime
  - filename-parsing
  - bert
  - token-classification
datasets:
  - ModerRAS/AnimeName
language:
  - en
  - ja
  - zh

AniFileBERT

AniFileBERT is a tiny BERT token-classification model for parsing anime release filenames into structured fields such as release group, title, season, episode, resolution, source, and special tags.

The checkpoint in this repository is the regex-tokenizer variant, fine-tuned on DMHY weak labels. It is the model used by MiruPlay.

Model

  • Architecture: BertForTokenClassification
  • Hidden size: 256
  • Layers: 4
  • Attention heads: 8
  • Labels: BIO token labels for TITLE, SEASON, EPISODE, GROUP, RESOLUTION, SOURCE, and SPECIAL
  • Tokenizer: custom regex/structure tokenizer implemented in tokenizer.py
  • Max sequence length: 64
  • Parameters: about 4M
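
The "about 4M" figure can be sanity-checked from the sizes listed above. The sketch below counts the weights of a standard BERT encoder with a linear token-classification head. Three inputs are assumptions rather than values read from config.json: the feed-forward width (1024, four times the hidden size), the vocabulary size (3000, taken from the regex row of the evaluation table), and the label count (15, i.e. B- and I- tags for seven fields plus O).

```python
def bert_token_classifier_params(
    vocab_size: int,
    hidden: int,
    layers: int,
    intermediate: int,
    max_positions: int,
    num_labels: int,
    type_vocab_size: int = 2,
) -> int:
    """Count weights and biases of a standard BERT encoder with a linear tagging head."""
    # Word, position, and token-type embeddings, plus one LayerNorm (scale and shift).
    embeddings = (vocab_size + max_positions + type_vocab_size) * hidden + 2 * hidden
    # Query, key, value, and output projections, each hidden x hidden with a bias.
    attention = 4 * (hidden * hidden + hidden)
    # Two dense layers: hidden -> intermediate -> hidden, with biases.
    feed_forward = hidden * intermediate + intermediate + intermediate * hidden + hidden
    # Each layer has two LayerNorms: one after attention, one after the feed-forward block.
    per_layer = attention + feed_forward + 2 * (2 * hidden)
    # Linear classifier applied to the hidden state of every token.
    head = hidden * num_labels + num_labels
    return embeddings + layers * per_layer + head


total = bert_token_classifier_params(
    vocab_size=3000,      # from the evaluation table (regex variant)
    hidden=256,
    layers=4,
    intermediate=1024,    # ASSUMED: 4 x hidden size
    max_positions=64,
    num_labels=15,        # ASSUMED: 2 x 7 fields + "O"
)
print(f"{total / 1e6:.2f}M parameters")
```

Under these assumptions the count comes to roughly 3.95M, in line with the reported figure. The authoritative values are the ones in config.json.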

The model files are stored at the repository root so BertForTokenClassification.from_pretrained("ModerRAS/AniFileBERT") can load the weights. Use inference.py for end-to-end parsing because the tokenizer is custom rather than a standard WordPiece tokenizer.
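
To illustrate what the end-to-end step has to do after the model has predicted one BIO label per token, here is a minimal decoding sketch. It is not the code in inference.py. The function name `decode_bio`, the choice to join tokens with a space, and the hand-written labels in the example are all assumptions for illustration.

```python
from typing import Dict, List


def decode_bio(tokens: List[str], labels: List[str]) -> Dict[str, List[str]]:
    """Group BIO-tagged tokens into field values, one string per span."""
    spans: Dict[str, List[List[str]]] = {}
    open_field = None  # field of the span currently being extended, if any
    for token, label in zip(tokens, labels):
        prefix, _, field = label.partition("-")
        if prefix == "I" and field == open_field:
            # Continuation of the span that is currently open.
            spans[field][-1].append(token)
        elif prefix in ("B", "I"):
            # "B-" opens a new span; a stray "I-" is treated the same way.
            spans.setdefault(field, []).append([token])
            open_field = field
        else:
            # "O" closes whatever span was open.
            open_field = None
    return {name: [" ".join(parts) for parts in values] for name, values in spans.items()}


# Hand-written labels for illustration, not actual model output.
tokens = ["Witch", "Hat", "Atelier", "S01", "E07", "1080p", "ToonsHub"]
labels = ["B-TITLE", "I-TITLE", "I-TITLE", "B-SEASON", "B-EPISODE", "B-RESOLUTION", "B-GROUP"]
parsed = decode_bio(tokens, labels)
print(parsed)
```

The real tokenizer may split filenames differently and the repository may reassemble spans in another way, so treat this only as a picture of the BIO convention.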

Dataset

Training data snapshots are published separately in ModerRAS/AnimeName. This repository includes that dataset repository as a nested git submodule at datasets/AnimeName.

Current DMHY export waterline:

  • Last exported files.id: 689304
  • Next incremental export: --min-id 689305
  • Weak-labeled samples: 263042
  • Mixed training samples: 363042
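
The waterline works like a high-water mark on the database's row id: each export records the largest id it saw, and the next run starts one above it. A minimal sketch of that idea against a toy SQLite table follows. The table name `files` and column `id` come from the waterline above; the `name` column, the function `export_since`, and the query itself are assumptions and may not match dmhy_dataset.py.

```python
import sqlite3


def export_since(conn: sqlite3.Connection, min_id: int):
    """Return (rows, new_waterline) for all files with id >= min_id."""
    rows = conn.execute(
        "SELECT id, name FROM files WHERE id >= ? ORDER BY id", (min_id,)
    ).fetchall()
    # If nothing new was found, the waterline stays just below min_id.
    last_id = rows[-1][0] if rows else min_id - 1
    return rows, last_id


# Toy in-memory database standing in for the real DMHY database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE files (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO files VALUES (?, ?)",
    [(689304, "already exported"), (689305, "new release A"), (689310, "new release B")],
)

rows, last_id = export_since(conn, min_id=689305)
print(len(rows), "new rows; next run uses --min-id", last_id + 1)
```

Because ids need not be contiguous, the next `--min-id` is derived from the largest id actually exported rather than from a row count.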

Evaluation

Balanced mixed-data A/B run (50K synthetic + 50K DMHY weak labels, 1 epoch, batch size 128, seed 42):

| Variant | Max length | Vocab | Params | Eval F1 | Accuracy | Train runtime |
|---|---|---|---|---|---|---|
| regex | 64 | 3000 | 3.96M | 0.9911 | 0.9951 | 827s |
| char | 128 | 2654 | 3.88M | 0.8142 | 0.9637 | 1983s |

Field-level F1 on the same validation split:

| Field | regex | char |
|---|---|---|
| GROUP | 0.9962 | 0.9516 |
| TITLE | 0.9761 | 0.7983 |
| SEASON | 0.9880 | 0.6290 |
| EPISODE | 0.9950 | 0.8082 |

The regex tokenizer remains the default. Both variants can parse simple S01E07-style markers, but the character tokenizer was weaker on season/episode boundaries and long title spans.
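
The gap between F1 and accuracy for the char variant is easier to read with a toy example. This README does not say how Eval F1 is computed; the sketch below assumes span-level exact-match F1, which is a common convention for token classification, and compares it with per-token accuracy. The helper names and the label sequences are made up for illustration.

```python
def bio_spans(labels):
    """Return the set of (field, start, end) spans in a BIO label sequence."""
    spans, start, field = set(), None, None
    for i, label in enumerate(list(labels) + ["O"]):  # sentinel closes the last span
        prefix, _, name = label.partition("-")
        continues = prefix == "I" and name == field
        if not continues:
            if field is not None:
                spans.add((field, start, i))
            start, field = (i, name) if prefix in ("B", "I") else (None, None)
    return spans


def span_f1(gold, pred):
    """Exact-match F1 over spans: a span counts only if field and both ends agree."""
    g, p = bio_spans(gold), bio_spans(pred)
    return 2 * len(g & p) / (len(g) + len(p)) if (g or p) else 1.0


def token_accuracy(gold, pred):
    """Fraction of positions where the predicted label equals the gold label."""
    return sum(a == b for a, b in zip(gold, pred)) / len(gold)


gold = ["B-TITLE", "I-TITLE", "I-TITLE", "B-SEASON", "B-EPISODE", "O"]
pred = ["B-TITLE", "I-TITLE", "B-TITLE", "B-SEASON", "B-EPISODE", "O"]  # one boundary error

accuracy = token_accuracy(gold, pred)
f1 = span_f1(gold, pred)
print(f"token accuracy {accuracy:.3f}, span F1 {f1:.3f}")
```

A single wrong boundary tag costs one token in accuracy but breaks the whole title span and adds a spurious one, so span F1 drops much further. If the repository computes F1 differently, the numbers in the tables should be read accordingly.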

Usage

Install dependencies:

pip install -r requirements.txt

Parse a filename with this repository cloned locally:

python inference.py --model-dir . "Witch.Hat.Atelier.S01E07.1080p.NF.WEB-DL.JPN.AAC2.0.H.264.MSubs-ToonsHub"

Load only the model weights from the Hub:

from transformers import BertForTokenClassification

model = BertForTokenClassification.from_pretrained("ModerRAS/AniFileBERT")

For full parsing, clone this repo and use load_tokenizer from tokenizer.py or the CLI in inference.py.

Clone with Dataset Submodule

git clone --recursive https://huggingface.co/ModerRAS/AniFileBERT
# or, after a normal clone:
git submodule update --init --recursive

Training

Regenerate or export datasets:

python data_generator.py --num-samples 100000
python dmhy_dataset.py --db D:/WorkSpace/Python/dmhy-parser/dmhy_anime.db --output data/dmhy/dmhy_weak.jsonl
python mix_datasets.py --synthetic data/synthetic.jsonl --dmhy data/dmhy/dmhy_weak.jsonl --output data/dmhy/mixed_train.jsonl
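
As a rough picture of what a mixing step can look like, the sketch below concatenates two JSONL files, optionally caps each source at the same size (as in the balanced 50K + 50K run), and shuffles with a fixed seed. It is not the code in mix_datasets.py: the function `mix_jsonl`, its `per_source` parameter, and the toy record fields are all hypothetical.

```python
import random
import tempfile
from pathlib import Path


def mix_jsonl(synthetic_path, dmhy_path, output_path, per_source=None, seed=42):
    """Concatenate two JSONL files, optionally capping each source, then shuffle."""
    rng = random.Random(seed)
    mixed = []
    for path in (synthetic_path, dmhy_path):
        text = Path(path).read_text(encoding="utf-8")
        lines = [line for line in text.splitlines() if line.strip()]
        if per_source is not None and len(lines) > per_source:
            # Draw an equal-sized random subset from the larger source.
            lines = rng.sample(lines, per_source)
        mixed.extend(lines)
    rng.shuffle(mixed)
    Path(output_path).write_text("\n".join(mixed) + "\n", encoding="utf-8")
    return len(mixed)


# Demo on throwaway files; each line stands in for one JSON record.
with tempfile.TemporaryDirectory() as tmp:
    synthetic = Path(tmp, "synthetic.jsonl")
    dmhy = Path(tmp, "dmhy_weak.jsonl")
    mixed_path = Path(tmp, "mixed_train.jsonl")
    synthetic.write_text(
        "".join(f'{{"source": "synthetic", "n": {i}}}\n' for i in range(5)), encoding="utf-8"
    )
    dmhy.write_text(
        "".join(f'{{"source": "dmhy", "n": {i}}}\n' for i in range(3)), encoding="utf-8"
    )
    count = mix_jsonl(synthetic, dmhy, mixed_path, per_source=2)
    mixed_lines = mixed_path.read_text(encoding="utf-8").splitlines()

print(count, "records written")
```

Fixing the seed keeps the mixed file reproducible between runs, which matters when comparing tokenizer variants on the same data.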

Fine-tune from the synthetic checkpoint (the command below loads it through --init-model-dir), or train from scratch:

python train.py --data-file data/dmhy/mixed_train.jsonl --save-dir checkpoints/dmhy-finetune --init-model-dir checkpoints/final --epochs 1 --batch-size 128 --learning-rate 0.0003 --warmup-steps 300 --seed 42
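
To put the hyperparameters above in proportion, the sketch below estimates the number of optimizer steps in one epoch and evaluates a linear warmup followed by linear decay. The schedule shape is an assumption (train.py may use a different scheduler), as is training on all 363042 mixed samples with no held-out split.

```python
import math


def linear_warmup_decay(step: int, peak_lr: float, warmup_steps: int, total_steps: int) -> float:
    """Learning rate that rises linearly to peak_lr, then falls linearly to zero."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


# ASSUMED: one epoch over every mixed sample, with no validation split held out.
total_steps = math.ceil(363042 / 128)
warmup_share = 300 / total_steps

print(f"{total_steps} steps per epoch, warmup covers {warmup_share:.1%}")
for step in (0, 150, 300, total_steps):
    lr = linear_warmup_decay(step, peak_lr=0.0003, warmup_steps=300, total_steps=total_steps)
    print(f"step {step:>4}: lr = {lr:.6f}")
```

With these numbers the 300 warmup steps cover roughly a tenth of a one-epoch run, so the peak learning rate is reached early and most of training happens on the decaying part of the schedule.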

Export ONNX for MiruPlay Android assets:

python export_onnx.py --model-dir checkpoints/dmhy-finetune/final --output exports/anime_filename_parser.onnx

Repository Layout

  • model.safetensors, config.json, vocab.json: default fine-tuned model
  • train.py, dataset.py, tokenizer.py, model.py: training pipeline
  • dmhy_dataset.py, mix_datasets.py: weak-label export and dataset mixing
  • inference.py: end-to-end filename parser CLI
  • export_onnx.py: ONNX export for Android integration
  • exports/: exported ONNX model and metadata
  • data/dmhy/*.manifest.json: dataset waterlines and counts
  • datasets/AnimeName/: nested dataset submodule

Maintenance Notes

MiruPlay tracks this repository as the submodule tools/anime_parser, and this repository tracks ModerRAS/AnimeName as the submodule datasets/AnimeName. After updating either submodule, commit the updated submodule pointer in its parent repository.

For the full maintenance workflow, see MiruPlay's docs/anifilebert-maintenance.md.