How to use ModerRAS/AniFileBERT with Transformers:
# Use a pipeline as a high-level helper
from transformers import pipeline
pipe = pipeline("token-classification", model="ModerRAS/AniFileBERT")
# Load model directly
from transformers import AutoTokenizer, AutoModelForTokenClassification
tokenizer = AutoTokenizer.from_pretrained("ModerRAS/AniFileBERT")
model = AutoModelForTokenClassification.from_pretrained("ModerRAS/AniFileBERT")
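A token-classification model only emits labels, so the caller still has to turn them into filename fields. The sketch below shows one way to do that, assuming one BIO label per character (the training command uses `--tokenizer char`). The label names `GROUP`, `TITLE`, and `EPISODE` and the helper `spans_from_bio` are hypothetical and are not taken from the model's config.

```python
from typing import Dict, List, Optional


def spans_from_bio(text: str, labels: List[str]) -> Dict[str, List[str]]:
    """Group per-character BIO labels into {field: [substring, ...]}."""
    if len(text) != len(labels):
        raise ValueError("expected exactly one label per character")

    fields: Dict[str, List[str]] = {}
    current_type: Optional[str] = None  # field of the span being built
    start = 0                           # index where that span began

    # A trailing "O" sentinel flushes a span that runs to the end of the text.
    for i, label in enumerate(list(labels) + ["O"]):
        prefix, _, field = label.partition("-")
        if prefix == "I" and field == current_type:
            continue  # still inside the current span
        if current_type is not None:
            fields.setdefault(current_type, []).append(text[start:i])
        if prefix in ("B", "I") and field:
            # Lenient choice: a stray I- label opens a new span.
            current_type, start = field, i
        else:
            current_type = None
    return fields


# Hypothetical filename and labels, for illustration only.
example_text = "[Sub] Show - 01"
example_labels = (
    ["O", "B-GROUP", "I-GROUP", "I-GROUP", "O", "O"]
    + ["B-TITLE", "I-TITLE", "I-TITLE", "I-TITLE"]
    + ["O", "O", "O", "B-EPISODE", "I-EPISODE"]
)
example_fields = spans_from_bio(example_text, example_labels)
print(example_fields)
```

In practice the labels would come from the model's `id2label` mapping applied to the argmax of its logits; that wiring is left out here because it depends on the tokenizer in use.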
AniFileBERT Maintenance
This repository is the standalone Hugging Face model repository that MiruPlay
consumes as the tools/anime_parser submodule.
Related Repositories
| Repository | URL | Purpose |
|---|---|---|
| AniFileBERT | https://huggingface.co/ModerRAS/AniFileBERT | Model, training scripts, ONNX export |
| AnimeName | https://huggingface.co/datasets/ModerRAS/AnimeName | Training datasets and manifests |
| MiruPlay | https://github.com/ModerRAS/MiruPlay | Android app and runtime integration |
Nested structure:
AniFileBERT
  datasets/AnimeName -> ModerRAS/AnimeName
Clone
git clone --recursive https://huggingface.co/ModerRAS/AniFileBERT
If you cloned without --recursive, initialize the submodules afterwards:
git submodule update --init --recursive
Dataset Waterline
Current DMHY snapshot:
labeled_samples: 632002
char_vocab_size: 6199
strict_bio_violations: 0
The authoritative dataset files live in datasets/AnimeName.
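The `strict_bio_violations` figure can be illustrated with a small checker. This is a sketch under an assumed definition, not the repository's own validation code: an `I-X` label counts as a violation unless the label immediately before it is `B-X` or `I-X`. The function name and the example label names are hypothetical.

```python
from typing import List


def count_bio_violations(labels: List[str]) -> int:
    """Count I- labels that do not continue a span of the same field."""
    violations = 0
    previous = "O"  # treat the position before the sequence as outside any span
    for label in labels:
        if label.startswith("I-"):
            field = label[2:]
            if previous not in (f"B-{field}", f"I-{field}"):
                violations += 1
        previous = label
    return violations


print(count_bio_violations(["B-TITLE", "I-TITLE", "O", "B-EPISODE"]))
print(count_bio_violations(["O", "I-TITLE"]))
```

Summing this count over every sample's label sequence would give a dataset-level total comparable in spirit to the figure above, provided the assumed definition matches the one used to produce it.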
Train
uv sync
uv run python train.py \
--tokenizer char \
--data-file datasets/AnimeName/dmhy_weak_char.jsonl \
--vocab-file datasets/AnimeName/vocab.char.json \
--save-dir checkpoints/dmhy-char-full-relabel \
--init-model-dir . \
--epochs 2 \
--batch-size 256 \
--learning-rate 0.00008 \
--warmup-steps 300 \
--max-seq-length 128 \
--checkpoint-steps 1000 \
--parse-eval-limit 2048 \
--seed 48
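To get a feel for how these flags interact, the arithmetic below estimates the length of the run. It is an upper-bound sketch that assumes every labeled sample in the snapshot is used for training (no evaluation holdout), one optimizer step per batch, no gradient accumulation, and a single device; train.py may behave differently.

```python
import math

LABELED_SAMPLES = 632002   # labeled_samples from the dataset snapshot
BATCH_SIZE = 256           # --batch-size
EPOCHS = 2                 # --epochs
WARMUP_STEPS = 300         # --warmup-steps
CHECKPOINT_STEPS = 1000    # --checkpoint-steps

# The last, partially filled batch still costs one step.
steps_per_epoch = math.ceil(LABELED_SAMPLES / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS
warmup_fraction = WARMUP_STEPS / total_steps
intermediate_checkpoints = total_steps // CHECKPOINT_STEPS

print(steps_per_epoch)            # 2469
print(total_steps)                # 4938
print(f"{warmup_fraction:.1%}")   # 6.1%
print(intermediate_checkpoints)   # 4
```

Under these assumptions the warmup covers roughly the first 6% of training, and four intermediate checkpoints are written before the final one.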
Publish a New Checkpoint
Copy the final checkpoint files to the repository root (PowerShell):
Copy-Item checkpoints/dmhy-char-full-relabel/final/config.json . -Force
Copy-Item checkpoints/dmhy-char-full-relabel/final/model.safetensors . -Force
Copy-Item checkpoints/dmhy-char-full-relabel/final/tokenizer_config.json . -Force
Copy-Item checkpoints/dmhy-char-full-relabel/final/training_args.bin . -Force
Copy-Item checkpoints/dmhy-char-full-relabel/final/vocab.json . -Force
Copy-Item datasets/AnimeName/vocab.char.json .\vocab.char.json -Force
Copy-Item checkpoints/dmhy-char-full-relabel/final/run_metadata.json . -Force
Copy-Item checkpoints/dmhy-char-full-relabel/final/trainer_eval_metrics.json . -Force
Copy-Item checkpoints/dmhy-char-full-relabel/final/parse_eval_metrics.json . -Force
The repository does not track a duplicate model/ directory. The checkpoint in
the repository root is what gets published; the checkpoints/ directories are
git-ignored training artifacts.
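The Copy-Item commands are PowerShell-specific. On other platforms, the same copy step could be sketched in Python as follows. The file list mirrors the commands above; the function name `publish_checkpoint` is hypothetical and is not a script shipped in this repository.

```python
import shutil
from pathlib import Path
from typing import List

# Files copied from the final checkpoint directory, as listed above.
CHECKPOINT_FILES = [
    "config.json",
    "model.safetensors",
    "tokenizer_config.json",
    "training_args.bin",
    "vocab.json",
    "run_metadata.json",
    "trainer_eval_metrics.json",
    "parse_eval_metrics.json",
]


def publish_checkpoint(final_dir: Path, vocab_char: Path, repo_root: Path) -> List[Path]:
    """Copy checkpoint files and vocab.char.json into repo_root, overwriting."""
    # Fail before copying anything if a source file is missing.
    missing = [n for n in CHECKPOINT_FILES if not (final_dir / n).is_file()]
    if not vocab_char.is_file():
        missing.append(str(vocab_char))
    if missing:
        raise FileNotFoundError(f"missing files: {missing}")

    copied = []
    for name in CHECKPOINT_FILES:
        # shutil.copy2 overwrites an existing destination, like -Force.
        copied.append(Path(shutil.copy2(final_dir / name, repo_root / name)))
    copied.append(Path(shutil.copy2(vocab_char, repo_root / "vocab.char.json")))
    return copied


# Example call from the repository root:
# publish_checkpoint(
#     Path("checkpoints/dmhy-char-full-relabel/final"),
#     Path("datasets/AnimeName/vocab.char.json"),
#     Path("."),
# )
```

Checking for missing files up front avoids leaving the repository root with a mix of old and new checkpoint files.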
Then commit and push:
git add .
git commit -m "Update AniFileBERT checkpoint"
git push origin main
Update the Dataset Submodule
After pushing new files to ModerRAS/AnimeName, update the submodule pointer in this repository:
git submodule update --remote datasets/AnimeName
git add datasets/AnimeName
git commit -m "Update AnimeName dataset pointer"
git push origin main
Update MiruPlay
From the MiruPlay root:
git submodule update --remote --recursive tools/anime_parser
git add tools/anime_parser
git commit -m "Update AniFileBERT submodule"
git push origin master
If a new ONNX export changed Android runtime assets, also stage:
scraper/src/main/assets/anime_parser/anime_filename_parser.onnx
scraper/src/main/assets/anime_parser/config.json
scraper/src/main/assets/anime_parser/vocab.json