---
license: apache-2.0
library_name: transformers
pipeline_tag: token-classification
tags:
- anime
- filename-parsing
- bert
- token-classification
datasets:
- ModerRAS/AnimeName
language:
- en
- ja
- zh
---

# AniFileBERT

AniFileBERT is a tiny BERT token-classification model for parsing anime release filenames into structured fields such as release group, title, season, episode, resolution, source, and special tags.

The checkpoint in this repository is the full-relabel DMHY character-token model used by MiruPlay.

## Model

- Architecture: `BertForTokenClassification`
- Hidden size: 256
- Layers: 4
- Attention heads: 8
- Labels: BIO token labels for `TITLE`, `SEASON`, `EPISODE`, `GROUP`, `RESOLUTION`, `SOURCE`, and `SPECIAL`
- Tokenizer: custom character tokenizer implemented in `tokenizer.py`
- Max sequence length: 128
- Parameters: 4,783,631

The model files are stored at the repository root so `BertForTokenClassification.from_pretrained("ModerRAS/AniFileBERT")` can load the weights. Use `inference.py` for end-to-end parsing because the tokenizer is custom rather than a standard WordPiece tokenizer.

## Dataset

Training data snapshots are published separately in [`ModerRAS/AnimeName`](https://huggingface.co/datasets/ModerRAS/AnimeName), and this repository includes it as a nested git submodule at `datasets/AnimeName`.

Current DMHY export waterline (from `datasets/AnimeName`):

- Last exported `files.id`: `1675184`
- Next incremental export: `--min-id 1675185`
- Weak-labeled samples: `632002`
- Mixed training samples: `732002`

## Vocabulary

The published checkpoint uses a character vocabulary. `vocab.json` at the repository root is the deployed tokenizer vocab, and `vocab.char.json` is kept as a mirrored explicit copy for training/data maintenance.

The full DMHY weak dataset has **6195 unique characters**, so the complete character vocab is only **6199** entries including special tokens and reaches 100% token coverage.
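
A minimal sketch of how a character vocabulary of this kind can be built and checked for coverage. It is an illustration only, not the code in `tokenizer.py` or `convert_to_char_dataset.py`: the function names `build_char_vocab` and `token_coverage` are hypothetical, the four special tokens are assumed from the gap between 6195 characters and 6199 entries, and treating every character (including spaces) as a token is also an assumption.

```python
from collections import Counter

# Assumed special tokens: four of them matches 6199 - 6195 = 4 in the text above.
SPECIAL_TOKENS = ["[PAD]", "[UNK]", "[CLS]", "[SEP]"]


def build_char_vocab(filenames):
    """Map each distinct character to an id, placed after the special tokens."""
    counts = Counter(ch for name in filenames for ch in name)
    # Most frequent characters get the lowest ids; ties are broken by code point
    # so the result is deterministic.
    chars = sorted(counts, key=lambda ch: (-counts[ch], ch))
    return {tok: i for i, tok in enumerate(SPECIAL_TOKENS + chars)}


def token_coverage(vocab, filenames):
    """Fraction of characters that have their own vocab entry (would not map to [UNK])."""
    total = sum(len(name) for name in filenames)
    known = sum(1 for name in filenames for ch in name if ch in vocab)
    return known / total if total else 1.0


# Made-up filenames, used only to exercise the two helpers.
sample = ["[Group] Title - 01 [1080p].mkv", "葬送のフリーレン 第02話"]
vocab = build_char_vocab(sample)
print(len(vocab) - len(SPECIAL_TOKENS), "unique characters")
print("coverage:", token_coverage(vocab, sample))
```

Because the vocabulary is built from the same text it is later applied to, coverage on that text is complete by construction; characters that appear only in new filenames would fall back to `[UNK]`.
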
The regex vocabulary is still maintained in `datasets/AnimeName/vocab.json` for dataset relabeling and diagnostics, but the root checkpoint loads as `char`.

## Evaluation

Final full-relabel char training (`632002` DMHY rows, 2 epochs, batch size 256, seed 48):

| Metric | Value |
|--------|-------|
| Eval loss | 0.0163 |
| Entity precision | 0.9800 |
| Entity recall | 0.9867 |
| Entity F1 | 0.9833 |
| Token accuracy | 0.9943 |
| Held-out parse full match | 2008/2048 (0.9805) |
| Fixed regression full match | 21/21 (1.0000) |

The fixed regression set includes second-season aliases such as `Ni`, `Ni no Sara`, `貳`, and `弐ノ章`, plus long-running episode IDs and dense meta blocks.

## Usage

Install dependencies:

```bash
uv sync
```

Parse a filename with this repository cloned locally:

```bash
python inference.py --model-dir . "Witch.Hat.Atelier.S01E07.1080p.NF.WEB-DL.JPN.AAC2.0.H.264.MSubs-ToonsHub"
```

Load only the model weights from the Hub:

```python
from transformers import BertForTokenClassification

model = BertForTokenClassification.from_pretrained("ModerRAS/AniFileBERT")
```

For full parsing, clone this repo and use `load_tokenizer` from `tokenizer.py` or the CLI in `inference.py`.

## Clone with Dataset Submodule

```bash
git clone --recursive https://huggingface.co/ModerRAS/AniFileBERT
# or, after a normal clone:
git submodule update --init --recursive
```

## Training

### Character-token DMHY training

```bash
uv run python convert_to_char_dataset.py \
  --input datasets/AnimeName/dmhy_weak.jsonl \
  --output datasets/AnimeName/dmhy_weak_char.jsonl \
  --vocab-output datasets/AnimeName/vocab.char.json \
  --manifest-output datasets/AnimeName/dmhy_weak_char.manifest.json

uv run python train.py --tokenizer char \
  --data-file datasets/AnimeName/dmhy_weak_char.jsonl \
  --vocab-file datasets/AnimeName/vocab.char.json \
  --save-dir checkpoints/dmhy-char-full-relabel \
  --init-model-dir . \
  --epochs 2 --batch-size 256 \
  --learning-rate 0.00008 --warmup-steps 300 \
  --checkpoint-steps 1000 --save-total-limit 3 \
  --parse-eval-limit 2048 \
  --max-seq-length 128 --seed 48
```

The converter keeps source metadata and adds `tokenizer_variant`, source token count, and character token count fields to each record. The char dataset's p99 length is 107 characters, so `--max-seq-length 128` covers almost all rows while leaving room for `[CLS]` and `[SEP]`.

### Relabel the full dataset

```bash
uv run python relabel_dataset_from_filenames.py \
  --input datasets/AnimeName/dmhy_weak.jsonl \
  --output datasets/AnimeName/dmhy_weak.relabel.jsonl \
  --manifest-output datasets/AnimeName/dmhy_weak.relabel.manifest.json \
  --vocab-output datasets/AnimeName/vocab.relabel.json \
  --base-vocab datasets/AnimeName/vocab.json \
  --max-vocab-size 8000

# Promote the relabeled files over the originals.
mv -f datasets/AnimeName/dmhy_weak.relabel.jsonl datasets/AnimeName/dmhy_weak.jsonl
mv -f datasets/AnimeName/vocab.relabel.json datasets/AnimeName/vocab.json
cp -f datasets/AnimeName/dmhy_weak.relabel.manifest.json datasets/AnimeName/dmhy_weak.manifest.json
rm -f datasets/AnimeName/dmhy_weak.relabel.manifest.json
```

### Rebuild vocabulary (if needed)

```bash
python -c "
import collections, json

# Count token frequencies, then keep 4 special tokens + the 7996 most common (8000 max).
tokens = collections.Counter()
with open('datasets/AnimeName/dmhy_weak.jsonl', encoding='utf-8') as f:
    for item in map(json.loads, f):
        if item:
            tokens.update(item['tokens'])
vocab = {t: i for i, t in enumerate(['[PAD]', '[UNK]', '[CLS]', '[SEP]'] + [t for t, _ in tokens.most_common(7996)])}
with open('vocab.json', 'w', encoding='utf-8') as f:
    json.dump(vocab, f, ensure_ascii=False, indent=2)
"
```

### Export ONNX for MiruPlay Android

```bash
uv run python export_onnx.py --model-dir . --output exports/anime_filename_parser.onnx --max-length 128
```

---

## Google Colab Training

For Codex-controlled short Colab sessions, see [`colab/README.md`](colab/README.md).
Free Colab still has to be started manually, but once `colab_worker.py` is running, Codex can submit jobs through `colab_client.py`, tail logs, and inspect status. Checkpoints live on Google Drive, and default profiles resume from the latest checkpoint automatically.

Manual one-shot runs are also supported:

```bash
python colab_train.py --profile dmhy_regex_finetune
```

## Repository Layout

- `model.safetensors`, `config.json`, `vocab.json`: default published model
- `train.py`, `dataset.py`, `tokenizer.py`, `model.py`: training pipeline
- `dmhy_dataset.py`, `mix_datasets.py`: weak-label export and dataset mixing
- `convert_to_char_dataset.py`: full character-token projection for weak labels
- `inference.py`: end-to-end filename parser CLI
- `export_onnx.py`: ONNX export for Android integration
- `exports/`: exported ONNX model and metadata
- `datasets/AnimeName/`: nested dataset submodule

## Maintenance Notes

MiruPlay tracks this repository as `tools/anime_parser`, and this repository tracks `ModerRAS/AnimeName` as `datasets/AnimeName`. After updating either repo, remember to commit the submodule pointer in the parent repo.

For the full maintenance workflow, see MiruPlay's `docs/anifilebert-maintenance.md`.