---
license: apache-2.0
library_name: transformers
pipeline_tag: token-classification
tags:
- anime
- filename-parsing
- bert
- token-classification
datasets:
- ModerRAS/AnimeName
language:
- en
- ja
- zh
---

# AniFileBERT

AniFileBERT is a tiny BERT token-classification model for parsing anime release filenames into structured fields such as release group, title, season, episode, resolution, source, and special tags.

The checkpoint in this repository is the DMHY weak-label fine-tuned regex-tokenizer model used by MiruPlay.

## Model

- Architecture: `BertForTokenClassification`
- Hidden size: 256
- Layers: 4
- Attention heads: 8
- Labels: BIO token labels for `TITLE`, `SEASON`, `EPISODE`, `GROUP`, `RESOLUTION`, `SOURCE`, and `SPECIAL`
- Tokenizer: custom regex/structure tokenizer implemented in `tokenizer.py`
- Max sequence length: 64
- Parameters: about 4M

The model files are stored at the repository root so `BertForTokenClassification.from_pretrained("ModerRAS/AniFileBERT")` can load the weights. Use `inference.py` for end-to-end parsing because the tokenizer is custom rather than a standard WordPiece tokenizer.

## Dataset

Training data snapshots are published separately in [`ModerRAS/AnimeName`](https://huggingface.co/datasets/ModerRAS/AnimeName), and this repository includes it as a nested git submodule at `datasets/AnimeName`.

Current DMHY export waterline:

- Last exported `files.id`: `689304`
- Next incremental export: `--min-id 689305`
- Weak-labeled samples: `263042`
- Mixed training samples: `363042`

## Evaluation

Balanced mixed-data A/B run (`50K` synthetic + `50K` DMHY weak labels, 1 epoch, batch size 128, seed 42):

| Variant | Max length | Vocab | Params | Eval F1 | Accuracy | Train runtime |
|---------|------------|-------|--------|---------|----------|---------------|
| regex | 64 | 3000 | 3.96M | 0.9911 | 0.9951 | 827s |
| char | 128 | 2654 | 3.88M | 0.8142 | 0.9637 | 1983s |

Field-level F1 on the same validation split:

| Field | regex | char |
|-------|-------|------|
| GROUP | 0.9962 | 0.9516 |
| TITLE | 0.9761 | 0.7983 |
| SEASON | 0.9880 | 0.6290 |
| EPISODE | 0.9950 | 0.8082 |

The regex tokenizer remains the default. Both variants can parse simple `S01E07`, but the character tokenizer was weaker on season/episode boundaries and long title spans.

## Usage

Install dependencies:

```bash
pip install -r requirements.txt
```

Parse a filename with this repository cloned locally:

```bash
python inference.py --model-dir . "Witch.Hat.Atelier.S01E07.1080p.NF.WEB-DL.JPN.AAC2.0.H.264.MSubs-ToonsHub"
```

Load only the model weights from the Hub:

```python
from transformers import BertForTokenClassification

model = BertForTokenClassification.from_pretrained("ModerRAS/AniFileBERT")
```

For full parsing, clone this repo and use `load_tokenizer` from `tokenizer.py` or the CLI in `inference.py`.

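Because the model emits one BIO label per token, some post-processing is needed to turn label sequences into fields. The following is a minimal sketch of that step, not the logic in `inference.py`, which may differ. `merge_bio` is a hypothetical helper, and the tokens and labels in the example are hand-written for illustration rather than real tokenizer or model output.

```python
from typing import Dict, List, Optional


def merge_bio(tokens: List[str], labels: List[str]) -> Dict[str, List[str]]:
    """Group BIO-labelled tokens into a mapping of field -> list of spans.

    Hypothetical helper for illustration; not part of this repository.
    Tokens in a span are joined with no separator, which assumes that
    delimiters such as '.' are emitted as tokens of their own.
    """
    fields: Dict[str, List[str]] = {}
    current_field: Optional[str] = None
    current_tokens: List[str] = []

    def flush() -> None:
        # Store the span collected so far, if any.
        if current_field is not None and current_tokens:
            fields.setdefault(current_field, []).append("".join(current_tokens))

    for token, label in zip(tokens, labels):
        prefix, _, field = label.partition("-")
        if prefix == "B" and field:
            # A B- tag always closes the previous span and opens a new one.
            flush()
            current_field, current_tokens = field, [token]
        elif prefix == "I" and field == current_field:
            # An I- tag extends the span only if it matches the open field.
            current_tokens.append(token)
        else:
            # "O" or a stray I- tag: close the open span and skip the token.
            flush()
            current_field, current_tokens = None, []
    flush()
    return fields


# Hand-written example input (assumed tokenization, not model output).
example_tokens = ["Witch", ".", "Hat", ".", "Atelier", ".", "S01", "E07"]
example_labels = [
    "B-TITLE", "I-TITLE", "I-TITLE", "I-TITLE", "I-TITLE",
    "O", "B-SEASON", "B-EPISODE",
]
print(merge_bio(example_tokens, example_labels))
```

Each field maps to a list because a filename can contain more than one span of the same type, for example several `SPECIAL` tags.
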
## Clone with Dataset Submodule

```bash
git clone --recursive https://huggingface.co/ModerRAS/AniFileBERT
# or, after a normal clone:
git submodule update --init --recursive
```

## Training

Regenerate or export datasets:

```bash
python data_generator.py --num-samples 100000
python dmhy_dataset.py --db D:/WorkSpace/Python/dmhy-parser/dmhy_anime.db --output data/dmhy/dmhy_weak.jsonl
python mix_datasets.py --synthetic data/synthetic.jsonl --dmhy data/dmhy/dmhy_weak.jsonl --output data/dmhy/mixed_train.jsonl
```

Fine-tune from the synthetic checkpoint or train from scratch:

```bash
python train.py --data-file data/dmhy/mixed_train.jsonl --save-dir checkpoints/dmhy-finetune --init-model-dir checkpoints/final --epochs 1 --batch-size 128 --learning-rate 0.0003 --warmup-steps 300 --seed 42
```

Export ONNX for MiruPlay Android assets:

```bash
python export_onnx.py --model-dir checkpoints/dmhy-finetune/final --output exports/anime_filename_parser.onnx
```

## Repository Layout

- `model.safetensors`, `config.json`, `vocab.json`: default fine-tuned model
- `train.py`, `dataset.py`, `tokenizer.py`, `model.py`: training pipeline
- `dmhy_dataset.py`, `mix_datasets.py`: weak-label export and dataset mixing
- `inference.py`: end-to-end filename parser CLI
- `export_onnx.py`: ONNX export for Android integration
- `exports/`: exported ONNX model and metadata
- `data/dmhy/*.manifest.json`: dataset waterlines and counts
- `datasets/AnimeName/`: nested dataset submodule

## Maintenance Notes

MiruPlay tracks this repository as `tools/anime_parser`, and this repository tracks `ModerRAS/AnimeName` as `datasets/AnimeName`. After updating either repo, remember to commit the submodule pointer in the parent repo.

For the full maintenance workflow, see MiruPlay's `docs/anifilebert-maintenance.md`.