How to use ModerRAS/AniFileBERT with Transformers:
# Use a pipeline as a high-level helper
from transformers import pipeline
pipe = pipeline("token-classification", model="ModerRAS/AniFileBERT")
# Load model directly
from transformers import AutoTokenizer, AutoModelForTokenClassification
tokenizer = AutoTokenizer.from_pretrained("ModerRAS/AniFileBERT")
model = AutoModelForTokenClassification.from_pretrained("ModerRAS/AniFileBERT")
Repository Guidelines
This repository is AniFileBERT, the Python model, dataset, training, inference,
and ONNX export workspace used by MiruPlay as tools/anime_parser.
Project Shape
- Root model artifacts (config.json, model.safetensors, vocab.json, tokenizer_config.json, training_args.bin) are the published default checkpoint.
- Core code lives in train.py, dataset.py, tokenizer.py, model.py, inference.py, and export_onnx.py.
- Dataset generation and labeling helpers live in data_generator.py, dmhy_dataset.py, mix_datasets.py, llm_labeler.py, semantic_labeler.py, and convert_to_char_dataset.py.
- datasets/AnimeName is a nested dataset submodule and should be treated as the authoritative dataset snapshot when present. Use either dmhy_weak.jsonl for the regex tokenizer or dmhy_weak_char.jsonl for the character tokenizer; the other dataset files are legacy snapshots.
- exports/ contains Android-facing ONNX artifacts. Keep it in sync when changing export behavior or the published checkpoint.
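Before running inference against the repository root, it can help to confirm that the published checkpoint files listed above are actually present (for example, that Git LFS pulled them). The following is a minimal sketch, not part of the repository: the helper name missing_root_artifacts is hypothetical, and only the file names come from the list above.

```python
from pathlib import Path

# File names taken from the "Project Shape" list; everything else is illustrative.
ROOT_ARTIFACTS = (
    "config.json",
    "model.safetensors",
    "vocab.json",
    "tokenizer_config.json",
    "training_args.bin",
)


def missing_root_artifacts(repo_root: str = ".") -> list[str]:
    """Return the default-checkpoint files that are absent from repo_root."""
    root = Path(repo_root)
    return [name for name in ROOT_ARTIFACTS if not (root / name).is_file()]
```

Calling missing_root_artifacts(".") from the repository root returns an empty list when every artifact is in place. It only checks that the files exist, not that their contents are valid.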
Setup
python -m pip install -r requirements.txt
For local GPU training, install a CUDA-compatible PyTorch build first, then install the remaining requirements.
If the dataset submodule is missing, initialize it:
git submodule update --init --recursive
Common Commands
Run a parser smoke check:
python inference.py --model-dir . "Witch.Hat.Atelier.S01E07.1080p.NF.WEB-DL.JPN.AAC2.0.H.264.MSubs-ToonsHub"
Run the lightweight training pipeline check:
python test_train_small.py --limit-samples 5000 --epochs 2
Train the default regex tokenizer from the dataset submodule:
python train.py --data-file datasets/AnimeName/dmhy_weak.jsonl --vocab-file datasets/AnimeName/vocab.json --save-dir checkpoints/dmhy-finetune --init-model-dir . --epochs 1 --batch-size 128 --learning-rate 0.0003 --warmup-steps 300 --seed 42
Train the character tokenizer only when that variant is intentional:
python train.py --tokenizer char --data-file datasets/AnimeName/dmhy_weak_char.jsonl --vocab-file datasets/AnimeName/vocab.char.json --save-dir checkpoints/dmhy-weak-char --epochs 1 --batch-size 64 --learning-rate 0.0003 --warmup-steps 300 --max-seq-length 128 --seed 42
Export for Android:
python export_onnx.py --model-dir checkpoints/dmhy-finetune/final --android-assets-dir ../../scraper/src/main/assets/anime_parser
Codex-Controlled Colab Training
Free Colab cannot be treated as an always-on remote machine. Use it as a short-lived GPU worker only after the user manually opens a Colab runtime and starts the worker cell. Do not assume Codex can wake Colab by itself.
Before relying on the Colab flow, make sure the Colab helper files have been
pushed to the Hugging Face model repo, or the user has uploaded them manually:
colab_worker.py, colab_client.py, colab_train.py, and colab/.
Ask the user to start a Colab GPU runtime with:
from google.colab import drive
drive.mount("/content/drive")
!git clone --recursive https://huggingface.co/ModerRAS/AniFileBERT /content/AniFileBERT || true
%cd /content/AniFileBERT
!git pull --ff-only || true
!git submodule update --init --recursive
!python colab_worker.py
The worker prints COLAB_WORKER_URL=... and COLAB_WORKER_TOKEN=.... After
the user provides those values, set them for local commands (PowerShell syntax shown):
$env:ANIFILEBERT_COLAB_URL="https://...trycloudflare.com"
$env:ANIFILEBERT_COLAB_TOKEN="..."
python colab_client.py health
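If a local script needs the same two settings, they can be read from the environment and validated before any request is made. This is a sketch under assumptions: read_colab_settings is a hypothetical helper, it does not describe how colab_client.py actually loads its configuration, and the https:// check is inferred only from the example URL above.

```python
import os

URL_VAR = "ANIFILEBERT_COLAB_URL"
TOKEN_VAR = "ANIFILEBERT_COLAB_TOKEN"


def read_colab_settings(env=None) -> tuple[str, str]:
    """Return (url, token), raising RuntimeError if either value is unusable."""
    env = os.environ if env is None else env
    url = env.get(URL_VAR, "").strip().rstrip("/")
    token = env.get(TOKEN_VAR, "").strip()
    missing = [name for name, value in ((URL_VAR, url), (TOKEN_VAR, token)) if not value]
    if missing:
        raise RuntimeError("unset environment variables: " + ", ".join(missing))
    # Assumption: the worker URL is always served over HTTPS, as in the example.
    if not url.startswith("https://"):
        raise RuntimeError(f"{URL_VAR} should be an https:// URL")
    return url, token
```

Failing early with a clear message is useful here because a free Colab worker URL changes on every restart, so stale values are a likely cause of connection errors.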
Submit the default regex fine-tune:
python colab_client.py submit --profile dmhy_regex_finetune --wait
Submit the character tokenizer run only when intentional:
python colab_client.py submit --profile dmhy_char_train --wait
Useful follow-up commands:
python colab_client.py jobs
python colab_client.py status <job-id>
python colab_client.py logs <job-id> --tail 200
python colab_client.py manifest <job-id>
python colab_client.py cancel <job-id>
The default Colab profiles save checkpoints to Google Drive every 1000 steps
and resume with resume_from_checkpoint: "auto", so if free Colab disconnects,
ask the user to restart the worker and submit the same profile again. Artifacts
land under MyDrive/AniFileBERT/checkpoints/<profile-name>/, and worker logs
land under MyDrive/AniFileBERT/worker/jobs/<job-id>/.
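To illustrate what an "auto" resume typically has to do, the sketch below picks the newest checkpoint in a save directory. It assumes checkpoints are stored as checkpoint-<step> subdirectories, which is the Hugging Face Trainer convention; the guidelines do not state the layout this repository uses, and latest_checkpoint is a hypothetical helper rather than the repository's code.

```python
import re
from pathlib import Path
from typing import Optional

# Assumption: checkpoint folders are named "checkpoint-<step>".
_CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")


def latest_checkpoint(save_dir: str) -> Optional[Path]:
    """Return the checkpoint-<step> subdirectory with the highest step, or None."""
    root = Path(save_dir)
    if not root.is_dir():
        return None
    best_step, best_path = -1, None
    for child in root.iterdir():
        match = _CHECKPOINT_RE.match(child.name)
        # Compare steps numerically so checkpoint-10000 beats checkpoint-9000.
        if child.is_dir() and match and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), child
    return best_path
```

A return value of None means there is nothing to resume from, so training would start from the initial weights.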
Validation Expectations
- For parser or tokenizer changes, run python inference.py --model-dir . ... with at least one realistic filename.
- For dataset alignment, tokenizer, model, or training-loop changes, run python test_train_small.py --limit-samples 5000 --epochs 2 when practical.
- For export changes, run python export_onnx.py ... and confirm the exporter reports a small PyTorch/ONNX logits difference.
- Full training is expensive; do not start long multi-epoch runs unless the task explicitly requires it.
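The "small logits difference" check for exports can be pictured as a maximum absolute difference between the PyTorch and ONNX outputs for the same input. The sketch below works on plain nested lists (for example, the result of calling .tolist() on each output) so it has no dependencies. The function names and the 1e-4 tolerance are assumptions; the guidelines only say the difference should be small, and export_onnx.py may compute it differently.

```python
def _flatten(values):
    """Yield every number in an arbitrarily nested list or tuple as a float."""
    for value in values:
        if isinstance(value, (list, tuple)):
            yield from _flatten(value)
        else:
            yield float(value)


def max_abs_diff(reference, candidate) -> float:
    """Largest element-wise absolute difference between two logit arrays."""
    ref = list(_flatten(reference))
    cand = list(_flatten(candidate))
    if len(ref) != len(cand):
        raise ValueError("logit arrays differ in size")
    return max((abs(r - c) for r, c in zip(ref, cand)), default=0.0)


def logits_match(reference, candidate, tolerance: float = 1e-4) -> bool:
    """True when the two outputs agree within tolerance (an assumed threshold)."""
    return max_abs_diff(reference, candidate) <= tolerance
```

A maximum rather than a mean is used so that a single badly diverging token position is not hidden by many matching ones.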
Data And Artifact Rules
- Avoid committing generated checkpoint directories such as checkpoints/, test_checkpoints*/, and ab_checkpoints*/.
- Most data/**/*.jsonl files are generated and ignored. The small checked-in fixtures are data/synthetic_small.jsonl and data/test_smoke.jsonl.
- For real training, choose exactly one current dataset: datasets/AnimeName/dmhy_weak.jsonl for regex tokenization or datasets/AnimeName/dmhy_weak_char.jsonl for character tokenization. Treat mixed_train.jsonl, ab_mix_100k.jsonl, and other alternate JSONL files as legacy unless a task explicitly asks to inspect them.
- Large binary artifacts are tracked through Git LFS by .gitattributes. Preserve LFS handling for .safetensors, .onnx, .bin, and related model files.
- When publishing a new checkpoint, copy the final checkpoint files to the repository root as described in MAINTENANCE.md.
- When updating datasets/AnimeName, commit the submodule pointer in this repo and then update the parent MiruPlay submodule pointer.
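The "exactly one current dataset" rule can be made explicit with a small lookup that pairs each tokenizer with its data and vocabulary files. The file names below come from the training commands in this document; the helper training_files and the key "regex" are illustrative assumptions (the commands only show --tokenizer char as an explicit value).

```python
from pathlib import Path

DATASET_DIR = Path("datasets/AnimeName")

# tokenizer name -> (data file, vocab file), as used in the training commands.
CURRENT_DATASETS = {
    "regex": ("dmhy_weak.jsonl", "vocab.json"),
    "char": ("dmhy_weak_char.jsonl", "vocab.char.json"),
}


def training_files(tokenizer: str) -> tuple[Path, Path]:
    """Return (data_file, vocab_file) for a tokenizer, rejecting legacy choices."""
    try:
        data_name, vocab_name = CURRENT_DATASETS[tokenizer]
    except KeyError:
        known = ", ".join(sorted(CURRENT_DATASETS))
        raise ValueError(f"unknown tokenizer {tokenizer!r}; expected one of: {known}") from None
    return DATASET_DIR / data_name, DATASET_DIR / vocab_name
```

Keeping the pairing in one place avoids accidentally training the character tokenizer against the regex vocabulary, or the reverse.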
Coding Notes
- Keep the custom tokenizer contract stable: Android runtime tokenization must continue to match the exported vocabulary and model metadata.
- Preserve label names and BIO behavior unless a task explicitly changes the model schema; Android expects the current fields for title, season, episode, group, resolution, source, and special tags.
- Prefer deterministic dataset and training changes. Keep seed handling intact.
- Use UTF-8 for files that contain Japanese, Chinese, or release-name examples.
- Keep command examples Windows-friendly where paths reference MiruPlay.
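For readers unfamiliar with the BIO behavior mentioned above, the sketch below shows how per-token BIO labels are commonly grouped into field spans. It is a generic illustration, not the repository's or Android's decoder: the label strings (B-TITLE, I-TITLE, and so on) and the token split in the usage note are hypothetical, since the guidelines name the fields but not the exact label spelling.

```python
def bio_to_spans(tokens, labels):
    """Group BIO-labelled tokens into a list of (field, [tokens]) spans."""
    spans = []
    current_field, current_tokens = None, []

    def close_span():
        # Emit the span collected so far, if any.
        if current_tokens:
            spans.append((current_field, current_tokens))

    for token, label in zip(tokens, labels):
        prefix, _, field = label.partition("-")
        if prefix == "B" or (prefix == "I" and field != current_field):
            # "B-" starts a span; a stray "I-" with a new field is treated the same way.
            close_span()
            current_field, current_tokens = field, [token]
        elif prefix == "I":
            current_tokens.append(token)
        else:
            # "O" (or anything unrecognized) ends the current span.
            close_span()
            current_field, current_tokens = None, []
    close_span()
    return spans
```

With the hypothetical tokens ["Witch", ".", "Hat", ".", "Atelier", ".", "S01", "E07"] and labels ["B-TITLE", "I-TITLE", "I-TITLE", "I-TITLE", "I-TITLE", "O", "B-SEASON", "B-EPISODE"], the function returns one TITLE span of five tokens followed by single-token SEASON and EPISODE spans. Renaming a label or changing the B/I convention would change this grouping, which is why the schema should stay stable unless a task explicitly changes it.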