metadata
license: apache-2.0
library_name: transformers
pipeline_tag: token-classification
tags:
  - anime
  - filename-parsing
  - bert
  - token-classification
datasets:
  - ModerRAS/AnimeName
language:
  - en
  - ja
  - zh

AniFileBERT

AniFileBERT is a tiny BERT token-classification model for parsing anime release filenames into structured fields such as release group, title, season, episode, resolution, source, and special tags.

The checkpoint in this repository is the full-relabel DMHY character-token model used by MiruPlay.

Model

  • Architecture: BertForTokenClassification
  • Hidden size: 256
  • Layers: 4
  • Attention heads: 8
  • Labels: BIO token labels for TITLE, SEASON, EPISODE, GROUP, RESOLUTION, SOURCE, and SPECIAL
  • Tokenizer: custom character tokenizer implemented in tokenizer.py
  • Max sequence length: 128
  • Parameters: 4,783,631
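The parameter count can be sanity-checked with plain arithmetic. The sketch below is not taken from the repository. The vocabulary size comes from the Vocabulary section, while the position-embedding size, token-type vocabulary, feed-forward width, and the 15-label BIO scheme are assumptions that this README does not state. Under those assumptions the total matches the published figure, which suggests (but does not prove) that the checkpoint uses them; config.json remains authoritative.

```python
# Back-of-the-envelope parameter count for a BERT token-classification model.
# Values marked ASSUMED are not stated in this README.
HIDDEN = 256          # from the README
LAYERS = 4            # from the README
VOCAB_SIZE = 6199     # character vocab size quoted in the Vocabulary section
MAX_POSITIONS = 128   # ASSUMED equal to the max sequence length
TYPE_VOCAB = 2        # ASSUMED (BERT default)
INTERMEDIATE = 1024   # ASSUMED (4 x hidden, the usual BERT ratio)
ENTITIES = ["TITLE", "SEASON", "EPISODE", "GROUP", "RESOLUTION", "SOURCE", "SPECIAL"]
NUM_LABELS = 1 + 2 * len(ENTITIES)  # ASSUMED: "O" plus B-/I- per entity = 15


def linear(n_in, n_out):
    """Weights plus bias of a dense layer."""
    return n_in * n_out + n_out


def layer_norm(size):
    """Scale plus shift vectors."""
    return 2 * size


embeddings = (VOCAB_SIZE + MAX_POSITIONS + TYPE_VOCAB) * HIDDEN + layer_norm(HIDDEN)
per_layer = (
    4 * linear(HIDDEN, HIDDEN)      # query, key, value, attention output
    + layer_norm(HIDDEN)
    + linear(HIDDEN, INTERMEDIATE)  # feed-forward up-projection
    + linear(INTERMEDIATE, HIDDEN)  # feed-forward down-projection
    + layer_norm(HIDDEN)
)
classifier = linear(HIDDEN, NUM_LABELS)
# The token-classification head has no pooler, so none is counted here.
total = embeddings + LAYERS * per_layer + classifier

print(f"{total:,}")  # 4,783,631
```

The number of attention heads does not appear in the sum because splitting the hidden dimension across heads does not change the size of the projection matrices.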

The model files are stored at the repository root so BertForTokenClassification.from_pretrained("ModerRAS/AniFileBERT") can load the weights. Use inference.py for end-to-end parsing because the tokenizer is custom rather than a standard WordPiece tokenizer.

Dataset

Training data snapshots are published separately in ModerRAS/AnimeName, and this repository includes that dataset repository as a nested git submodule at datasets/AnimeName.

Current DMHY export waterline (from datasets/AnimeName):

  • Last exported files.id: 1675184
  • Next incremental export: --min-id 1675185
  • Weak-labeled samples: 632002
  • Mixed training samples: 732002

Vocabulary

The published checkpoint uses a character vocabulary. vocab.json at the repository root is the deployed tokenizer vocab, and vocab.char.json is kept as a mirrored explicit copy for training and data maintenance. The full DMHY weak dataset has 6195 unique characters, so the complete character vocab needs only 6199 entries (6195 characters plus 4 special tokens) and reaches 100% token coverage on that dataset.
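As an illustration of how a character vocabulary and its coverage figure relate, here is a minimal sketch. It is not the repository's convert_to_char_dataset.py; the function names are hypothetical, and the four special tokens are assumed to be the same ones used by the vocabulary rebuild snippet later in this README.

```python
from collections import Counter

# ASSUMED special tokens; check vocab.json for the deployed set and order.
SPECIAL_TOKENS = ["[PAD]", "[UNK]", "[CLS]", "[SEP]"]


def build_char_vocab(filenames):
    """Map special tokens, then every distinct character (most frequent first), to ids."""
    counts = Counter(ch for name in filenames for ch in name)
    tokens = SPECIAL_TOKENS + [ch for ch, _ in counts.most_common()]
    return {tok: idx for idx, tok in enumerate(tokens)}


def token_coverage(filenames, vocab):
    """Fraction of characters that have their own vocab entry instead of [UNK]."""
    total = sum(len(name) for name in filenames)
    known = sum(ch in vocab for name in filenames for ch in name)
    return known / total if total else 1.0


samples = ["[Group] Title - 01 [1080p].mkv", "進撃の巨人 S2 - 05.mp4"]
vocab = build_char_vocab(samples)
coverage = token_coverage(samples, vocab)
unseen_coverage = token_coverage(["新"], vocab)  # character absent from the samples
```

Because every character seen in the data gets its own entry, coverage on that same data is 100% by construction; only characters that first appear at inference time fall back to [UNK].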

The regex vocabulary is still maintained in datasets/AnimeName/vocab.json for dataset relabeling and diagnostics, but the root checkpoint loads with the character tokenizer.

Evaluation

Final full-relabel char training (632002 DMHY rows, 2 epochs, batch size 256, seed 48):

| Metric | Value |
|---|---|
| Eval loss | 0.0163 |
| Entity precision | 0.9800 |
| Entity recall | 0.9867 |
| Entity F1 | 0.9833 |
| Token accuracy | 0.9943 |
| Held-out parse full match | 2008/2048 (0.9805) |
| Fixed regression full match | 21/21 (1.0000) |
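The reported figures are internally consistent, which a short check makes visible. This sketch only redoes the arithmetic from the rounded values above; it assumes entity F1 is the usual harmonic mean of precision and recall and does not reproduce the evaluation itself.

```python
# Rounded values copied from the evaluation results above.
precision, recall = 0.9800, 0.9867

# Harmonic mean of precision and recall (ASSUMED definition of "Entity F1").
f1 = 2 * precision * recall / (precision + recall)

# Held-out parse full match: filenames where every field was parsed correctly.
full_match = 2008 / 2048

print(f"F1 = {f1:.4f}, held-out full match = {full_match:.4f}")
# F1 = 0.9833, held-out full match = 0.9805
```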

The fixed regression set includes second-season aliases such as Ni, Ni no Sara, and 弐ノ章, plus long-running episode IDs and dense meta blocks.

Usage

Install dependencies:

uv sync

Parse a filename with this repository cloned locally:

python inference.py --model-dir . "Witch.Hat.Atelier.S01E07.1080p.NF.WEB-DL.JPN.AAC2.0.H.264.MSubs-ToonsHub"

Load only the model weights from the Hub:

from transformers import BertForTokenClassification

model = BertForTokenClassification.from_pretrained("ModerRAS/AniFileBERT")

For full parsing, clone this repo and use load_tokenizer from tokenizer.py or the CLI in inference.py.
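To show what "full parsing" adds on top of the raw model output, here is a minimal sketch of turning per-character BIO labels into fields. It is not the code in inference.py; the decode_bio name and the "B-TITLE"-style label strings are assumptions, and the labels in the example are written by hand rather than predicted by the model.

```python
def decode_bio(chars, labels):
    """Group per-character BIO labels into (field, text) spans, in order of appearance."""
    spans = []
    field, buffer = None, []
    for ch, label in zip(chars, labels):
        prefix, _, name = label.partition("-")
        # A stray "I-X" without a preceding "B-X" is treated as the start of a span.
        starts_new = prefix == "B" or (prefix == "I" and name != field)
        if prefix == "O" or starts_new:
            if field is not None:
                spans.append((field, "".join(buffer)))
            field, buffer = (name, [ch]) if starts_new else (None, [])
        else:  # "I-" continuing the current field
            buffer.append(ch)
    if field is not None:
        spans.append((field, "".join(buffer)))
    return spans


text = "[Grp] Show - 07"
labels = (
    ["O", "B-GROUP", "I-GROUP", "I-GROUP", "O", "O"]
    + ["B-TITLE", "I-TITLE", "I-TITLE", "I-TITLE", "O", "O", "O"]
    + ["B-EPISODE", "I-EPISODE"]
)
fields = decode_bio(list(text), labels)
print(fields)  # [('GROUP', 'Grp'), ('TITLE', 'Show'), ('EPISODE', '07')]
```

With a character tokenizer each label lines up with exactly one character, so no sub-word offset mapping is needed when rebuilding the field text.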

Clone with Dataset Submodule

git clone --recursive https://huggingface.co/ModerRAS/AniFileBERT
# or, after a normal clone:
git submodule update --init --recursive

Training

Character-token DMHY training

uv run python convert_to_char_dataset.py \
  --input datasets/AnimeName/dmhy_weak.jsonl \
  --output datasets/AnimeName/dmhy_weak_char.jsonl \
  --vocab-output datasets/AnimeName/vocab.char.json \
  --manifest-output datasets/AnimeName/dmhy_weak_char.manifest.json

uv run python train.py --tokenizer char \
  --data-file datasets/AnimeName/dmhy_weak_char.jsonl \
  --vocab-file datasets/AnimeName/vocab.char.json \
  --save-dir checkpoints/dmhy-char-full-relabel \
  --init-model-dir . \
  --epochs 2 --batch-size 256 \
  --learning-rate 0.00008 --warmup-steps 300 \
  --checkpoint-steps 1000 --save-total-limit 3 \
  --parse-eval-limit 2048 \
  --max-seq-length 128 --seed 48

The converter keeps source metadata and adds tokenizer_variant, source token count, and character token count fields to each record. The char dataset's p99 length is 107 characters, so --max-seq-length 128 covers almost all rows while leaving room for [CLS] and [SEP].
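The length budget works out as 128 positions minus 2 special tokens, leaving 126 characters, which is comfortably above the p99 length of 107. The sketch below illustrates that budget with a hypothetical encode_chars helper; the real tokenizer.py may differ in padding side, special-token names, or truncation policy.

```python
def encode_chars(text, vocab, max_seq_length=128):
    """Return (input_ids, attention_mask), both padded to max_seq_length."""
    budget = max_seq_length - 2  # reserve two positions for [CLS] and [SEP]
    unk = vocab["[UNK]"]
    body = [vocab.get(ch, unk) for ch in text[:budget]]  # truncate, then map characters
    ids = [vocab["[CLS]"]] + body + [vocab["[SEP]"]]
    mask = [1] * len(ids)
    pad = max_seq_length - len(ids)
    return ids + [vocab["[PAD]"]] * pad, mask + [0] * pad


# Toy vocabulary for illustration only (ASSUMED special-token names and ids).
toy_vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "a": 4}

short_ids, short_mask = encode_chars("ab", toy_vocab, max_seq_length=8)
long_ids, long_mask = encode_chars("a" * 200, toy_vocab)  # longer than the 126-character budget
```

Rows longer than the budget lose their tail under this policy, which at p99 = 107 would affect fewer than 1% of the dataset.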

Relabel the full dataset

uv run python relabel_dataset_from_filenames.py \
  --input datasets/AnimeName/dmhy_weak.jsonl \
  --output datasets/AnimeName/dmhy_weak.relabel.jsonl \
  --manifest-output datasets/AnimeName/dmhy_weak.relabel.manifest.json \
  --vocab-output datasets/AnimeName/vocab.relabel.json \
  --base-vocab datasets/AnimeName/vocab.json \
  --max-vocab-size 8000

# PowerShell: promote the relabeled files over the originals
Move-Item datasets/AnimeName/dmhy_weak.relabel.jsonl datasets/AnimeName/dmhy_weak.jsonl -Force
Move-Item datasets/AnimeName/vocab.relabel.json datasets/AnimeName/vocab.json -Force
Move-Item datasets/AnimeName/dmhy_weak.relabel.manifest.json datasets/AnimeName/dmhy_weak.manifest.json -Force

Rebuild vocabulary (if needed)

python -c "
import json, collections
tokens = collections.Counter()
with open('datasets/AnimeName/dmhy_weak.jsonl', encoding='utf-8') as f:
    for line in f:
        if line.strip():  # skip blank lines
            tokens.update(json.loads(line)['tokens'])
# Four special tokens plus the 7996 most frequent tokens gives at most 8000 entries.
specials = ['[PAD]', '[UNK]', '[CLS]', '[SEP]']
vocab = {t: i for i, t in enumerate(specials + [t for t, _ in tokens.most_common(7996)])}
with open('vocab.json', 'w', encoding='utf-8') as f:
    json.dump(vocab, f, ensure_ascii=False, indent=2)
"

Export ONNX for MiruPlay Android

uv run python export_onnx.py --model-dir . --output exports/anime_filename_parser.onnx --max-length 128

Google Colab Training

For Codex-controlled short Colab sessions, see colab/README.md. A free Colab runtime still has to be started manually, but once colab_worker.py is running, Codex can submit jobs through colab_client.py, tail logs, and inspect status. Checkpoints live on Google Drive, and the default profiles resume from the latest checkpoint automatically.

Manual one-shot runs are also supported:

python colab_train.py --profile dmhy_regex_finetune

Repository Layout

  • model.safetensors, config.json, vocab.json: default published model
  • train.py, dataset.py, tokenizer.py, model.py: training pipeline
  • dmhy_dataset.py, mix_datasets.py: weak-label export and dataset mixing
  • convert_to_char_dataset.py: full character-token projection for weak labels
  • inference.py: end-to-end filename parser CLI
  • export_onnx.py: ONNX export for Android integration
  • exports/: exported ONNX model and metadata
  • datasets/AnimeName/: nested dataset submodule

Maintenance Notes

MiruPlay tracks this repository as tools/anime_parser, and this repository tracks ModerRAS/AnimeName as datasets/AnimeName. After updating either repo, remember to commit the submodule pointer in the parent repo.

For the full maintenance workflow, see MiruPlay's docs/anifilebert-maintenance.md.