How to use ModerRAS/AniFileBERT with Transformers:

```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("token-classification", model="ModerRAS/AniFileBERT")
```

```python
# Load model directly
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("ModerRAS/AniFileBERT")
model = AutoModelForTokenClassification.from_pretrained("ModerRAS/AniFileBERT")
```
# Repository Guidelines

This repository is `AniFileBERT`, the Python model, dataset, training, inference,
and ONNX export workspace used by MiruPlay as `tools/anime_parser`.

## Project Shape

- Root model artifacts (`config.json`, `model.safetensors`, `vocab.json`,
  `tokenizer_config.json`, `training_args.bin`) are the published default
  checkpoint.
- Core parser/training code lives in `anifilebert/`.
- Command-line tools live in `tools/`, including ONNX export, fixed-case
  evaluation, benchmarks, dataset relabeling, dataset generation, and Colab
  helpers.
- `datasets/AnimeName` is a nested dataset submodule and should be treated as
  the authoritative dataset snapshot when present. Use either
  `dmhy_weak.jsonl` for the regex tokenizer or `dmhy_weak_char.jsonl` for the
  character tokenizer; the other dataset files are legacy snapshots.
- `exports/` contains Android-facing ONNX artifacts. Keep it in sync when
  changing export behavior or the published checkpoint.
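
As a quick sanity check on the first bullet above, the following minimal sketch reports which of the listed root artifacts are absent from a directory. The file names come from the list; the helper name `missing_root_artifacts` is hypothetical and not part of the repository.

```python
from pathlib import Path

# File names taken from the Project Shape list in this document.
ROOT_ARTIFACTS = (
    "config.json",
    "model.safetensors",
    "vocab.json",
    "tokenizer_config.json",
    "training_args.bin",
)


def missing_root_artifacts(model_dir: str) -> list[str]:
    """Return the expected checkpoint files that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in ROOT_ARTIFACTS if not (root / name).is_file()]
```

Calling `missing_root_artifacts(".")` from the repository root should return an empty list when the published checkpoint is complete.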
## Setup

```bash
uv sync
```

Use `uv run`, `uv add`, and `uv sync` for environment operations. Do not use
global `pip` for repository work.

If the dataset submodule is missing, initialize it:

```bash
git submodule update --init --recursive
```
## Common Commands

Run a parser smoke check:

```bash
uv run python -m anifilebert.inference --model-dir . "Witch.Hat.Atelier.S01E07.1080p.NF.WEB-DL.JPN.AAC2.0.H.264.MSubs-ToonsHub"
```

Run the fixed real-world parser regression cases:

```bash
uv run python -m tools.evaluate_parser_cases --model-dir . --case-file data/parser_regression_cases.json --output reports/case_metrics.json
```

Benchmark PyTorch and ONNX Runtime inference:

```bash
uv run python -m tools.benchmark_inference --model-dir . --onnx exports/anime_filename_parser.onnx --case-file data/parser_regression_cases.json --repeat 20 --warmup 20 --torch-threads 1 --ort-threads 1 --output reports/benchmark_results.json
```
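
The `--warmup` and `--repeat` flags above suggest the usual pattern of untimed warmup calls followed by timed repetitions. The sketch below illustrates that pattern in isolation; it is not the implementation of `tools.benchmark_inference`, and the function name `benchmark` and the returned keys are assumptions made for illustration. Thread pinning (`--torch-threads`, `--ort-threads`) is not shown.

```python
import statistics
import time
from typing import Callable


def benchmark(fn: Callable[[], object], repeat: int = 20, warmup: int = 20) -> dict[str, float]:
    """Time fn() `repeat` times after `warmup` untimed calls; results in milliseconds."""
    for _ in range(warmup):
        fn()  # untimed: lets caches and lazy initialisation settle
    samples = []
    for _ in range(repeat):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return {
        "mean_ms": statistics.fmean(samples),
        "median_ms": statistics.median(samples),
        "min_ms": min(samples),
        "max_ms": max(samples),
    }
```

Passing a zero-argument callable that runs one inference keeps the harness independent of the backend, so the same function can time both a PyTorch forward pass and an ONNX Runtime session call.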
Train with the current default character tokenizer:

```bash
uv run python -m anifilebert.train --tokenizer char --data-file datasets/AnimeName/dmhy_weak_char.jsonl --vocab-file datasets/AnimeName/vocab.char.json --save-dir checkpoints/dmhy-char-full --init-model-dir . --epochs 2 --batch-size 256 --learning-rate 0.00008 --warmup-steps 300 --max-seq-length 128 --train-split 0.98 --num-workers 4 --checkpoint-steps 1000 --save-total-limit 3 --parse-eval-limit 2048 --case-eval-file data/parser_regression_cases.json --seed 52 --experiment-name dmhy-char-full
```

For large generated or hard-focus JSONL files, pre-encode train/eval shards
with Rust before training to avoid the slow Python startup encode path:

```powershell
cargo run --release --manifest-path tools\encoded_dataset_cache\Cargo.toml -- `
  --input data\schema_v2_hard_focus_char_seed63.jsonl `
  --vocab-file datasets\AnimeName\vocab.char.json `
  --label-schema-file label_schema.json `
  --output-dir data\encoded_cache\schema_v2_hard_focus_char_seed63 `
  --max-length 128 `
  --train-split 0.95 `
  --seed 63 `
  --shard-size 25000 `
  --threads 16
```

Then pass the generated cache to training with the same data file, vocab file,
max length, split ratio, and seed:

```powershell
.\.venv\Scripts\python.exe -m anifilebert.train --tokenizer char `
  --data-file data\schema_v2_hard_focus_char_seed63.jsonl `
  --vocab-file datasets\AnimeName\vocab.char.json `
  --encoded-cache-dir data\encoded_cache\schema_v2_hard_focus_char_seed63 `
  --max-seq-length 128 --train-split 0.95 --seed 63
```

Do not combine `--encoded-cache-dir` with `--extra-data-file`,
`--limit-samples`, `--rebuild-vocab`, training-time augmentation, or
`--apply-label-repairs`. Regenerate the cache after changing the JSONL, vocab,
label schema, max length, split ratio, or seed.
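
The two rules above (which flags conflict with the cache, and which inputs invalidate it) can be checked mechanically. The following is a minimal sketch under stated assumptions: the helper names are hypothetical, the repository may track cache validity differently or not at all, and training-time augmentation is left out because the document gives no flag name for it.

```python
import hashlib
import json

# Flags this document says must not be combined with --encoded-cache-dir.
CACHE_INCOMPATIBLE_FLAGS = (
    "--extra-data-file",
    "--limit-samples",
    "--rebuild-vocab",
    "--apply-label-repairs",
)


def conflicting_flags(argv):
    """Return incompatible flags that appear alongside --encoded-cache-dir."""
    names = [arg.split("=", 1)[0] for arg in argv]  # accept --flag=value too
    if "--encoded-cache-dir" not in names:
        return []
    return [flag for flag in CACHE_INCOMPATIBLE_FLAGS if flag in names]


def file_sha256(path):
    """Hash a file in 1 MiB chunks so large JSONL files stay out of memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cache_fingerprint(file_digests, max_length, train_split, seed):
    """Fold every cache-invalidating input into one hex digest.

    file_digests maps a label (for example "data", "vocab", "label_schema")
    to the SHA-256 of that file, as produced by file_sha256.
    """
    payload = {
        "files": dict(file_digests),
        "max_length": int(max_length),
        "train_split": float(train_split),
        "seed": int(seed),
    }
    # sort_keys makes the serialisation independent of dict insertion order
    encoded = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()
```

Storing the fingerprint next to the cache and comparing it before training would turn a silent mismatch into an explicit "regenerate the cache" signal.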
Generate schema v2 synthetic augmentation with Rust. This is an independent
augmentation chain and must not rewrite the authoritative DMHY dataset or the
main DMHY template-application flow:

```powershell
cargo run --release --manifest-path tools\schema_v2_synthetic_augment\Cargo.toml --bin schema_v2_synthetic_augment -- `
  --recipes reports\dmhy_template_recipes.full_top5000.seed.jsonl `
  --label-schema-file label_schema.json `
  --numeric-title-seeds data\synthetic_numeric_titles.txt `
  --path-prefix-seeds data\synthetic_path_prefixes.txt `
  --limit-templates 3000 `
  --max-rows 50000 `
  --output data\schema_v2_synthetic_aug.jsonl `
  --manifest-output data\schema_v2_synthetic_aug.manifest.json
```

Validate the generated augmentation with the Rust validator:

```powershell
cargo run --release --manifest-path tools\schema_v2_synthetic_augment\Cargo.toml --bin validate_synthetic_aug_jsonl -- `
  --input data\schema_v2_synthetic_aug.jsonl `
  --manifest data\schema_v2_synthetic_aug.manifest.json
```
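
Alongside the Rust validator, the manifest can be inspected by hand for the counters this document asks reviewers to confirm. The sketch below assumes the manifest is a JSON object with those counters as top-level keys, which the source does not state; the helper names are hypothetical.

```python
import json

# Counter names listed under Validation Expectations in this document.
REQUIRED_COUNTERS = (
    "path_series_rows",
    "path_movie_rows",
    "path_special_rows",
    "path_confuser_rows",
    "dropped_media_kind_mismatch",
)


def load_manifest(path: str) -> dict:
    """Read the manifest JSON written by the augmentation generator."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def missing_counters(manifest: dict) -> list[str]:
    """Return required counter names the manifest does not report.

    Assumption: counters are top-level keys of the manifest object.
    """
    return [name for name in REQUIRED_COUNTERS if name not in manifest]
```

For example, `missing_counters(load_manifest("data/schema_v2_synthetic_aug.manifest.json"))` should be empty if the assumption about the manifest layout holds.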
Export for Android:

```bash
uv run python -m tools.export_onnx --model-dir . --max-length 128 --android-assets-dir ../../scraper/src/main/assets/anime_parser
```
## Codex-Controlled Colab Training

Free Colab cannot be treated as an always-on remote machine. Use it as a
short-lived GPU worker only after the user manually opens a Colab runtime and
starts the worker cell. Do not assume Codex can wake Colab by itself.

Before relying on the Colab flow, make sure the Colab helper files have been
pushed to the Hugging Face model repo, or the user has uploaded them manually:
`tools/colab_worker.py`, `tools/colab_client.py`, `tools/colab_train.py`, and `colab/`.

Ask the user to start a Colab GPU runtime with:

```python
from google.colab import drive
drive.mount("/content/drive")
!git clone --recursive https://huggingface.co/ModerRAS/AniFileBERT /content/AniFileBERT || true
%cd /content/AniFileBERT
!git pull --ff-only || true
!git submodule update --init --recursive
!python -m tools.colab_worker
```

The worker prints `COLAB_WORKER_URL=...` and `COLAB_WORKER_TOKEN=...`. After
the user provides those values, set them for local commands:

```powershell
$env:ANIFILEBERT_COLAB_URL="https://...trycloudflare.com"
$env:ANIFILEBERT_COLAB_TOKEN="..."
python -m tools.colab_client health
```
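
When the user pastes the worker output as a block of text, the two values can be extracted instead of copied by hand. This is a minimal sketch assuming each value is printed on its own line in `NAME=value` form, as the description above suggests; the function name `parse_worker_output` is hypothetical and not part of `tools/colab_client.py`.

```python
import re

_WORKER_LINE = re.compile(r"^(COLAB_WORKER_URL|COLAB_WORKER_TOKEN)=(.+)$")

# Worker output name -> environment variable read by the local commands.
_ENV_NAMES = {
    "COLAB_WORKER_URL": "ANIFILEBERT_COLAB_URL",
    "COLAB_WORKER_TOKEN": "ANIFILEBERT_COLAB_TOKEN",
}


def parse_worker_output(text: str) -> dict[str, str]:
    """Map pasted worker output to the environment variables to set locally."""
    found = {}
    for line in text.splitlines():
        match = _WORKER_LINE.match(line.strip())
        if match:
            found[_ENV_NAMES[match.group(1)]] = match.group(2).strip()
    missing = sorted(set(_ENV_NAMES.values()) - set(found))
    if missing:
        raise ValueError(f"worker output is missing: {', '.join(missing)}")
    return found
```

Applying the result with `os.environ.update(...)` only affects the current Python process and its children; the PowerShell session still needs the `$env:` assignments shown above.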
Submit the default regex fine-tune:

```powershell
python -m tools.colab_client submit --profile dmhy_regex_finetune --wait
```

Submit the character tokenizer run only when intentional:

```powershell
python -m tools.colab_client submit --profile dmhy_char_train --wait
```

Useful follow-up commands:

```powershell
python -m tools.colab_client jobs
python -m tools.colab_client status <job-id>
python -m tools.colab_client logs <job-id> --tail 200
python -m tools.colab_client manifest <job-id>
python -m tools.colab_client cancel <job-id>
```

The default Colab profiles save checkpoints to Google Drive every 1000 steps
and resume with `resume_from_checkpoint: "auto"`. If free Colab disconnects,
ask the user to restart the worker and submit the same profile again. Artifacts
land under `MyDrive/AniFileBERT/checkpoints/<profile-name>/`, and worker logs
land under `MyDrive/AniFileBERT/worker/jobs/<job-id>/`.
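
To see which checkpoint a resumed run would most plausibly pick up, the saved directories can be listed and the highest step selected. The sketch below assumes checkpoints are saved as `checkpoint-<step>` directories, which is the Hugging Face `Trainer` convention; the source does not say how `"auto"` is resolved, and both helper names are hypothetical.

```python
import re
from pathlib import Path
from typing import Iterable, Optional

# Assumed naming convention: checkpoint-<step>
_CHECKPOINT_NAME = re.compile(r"^checkpoint-(\d+)$")


def latest_checkpoint_name(names: Iterable[str]) -> Optional[str]:
    """Pick the name with the highest numeric step, or None if none match."""
    best_name, best_step = None, -1
    for name in names:
        match = _CHECKPOINT_NAME.match(name)
        if match and int(match.group(1)) > best_step:
            best_name, best_step = name, int(match.group(1))
    return best_name


def latest_checkpoint(save_dir: str) -> Optional[Path]:
    """Return the newest checkpoint directory under save_dir, if any."""
    root = Path(save_dir)
    if not root.is_dir():
        return None
    name = latest_checkpoint_name(p.name for p in root.iterdir() if p.is_dir())
    return root / name if name else None
```

Steps are compared numerically rather than as strings, so `checkpoint-1000` correctly sorts after `checkpoint-900`.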
## Validation Expectations

- For parser or tokenizer changes, run `python -m anifilebert.inference --model-dir . ...`
  with at least one realistic filename.
- Run `uv run python -m tools.evaluate_parser_cases --model-dir . --case-file data/parser_regression_cases.json --output reports/case_metrics.json`
  before publishing parser changes.
- For dataset alignment, tokenizer, model, or training-loop changes, run
  `python -m tools.test_train_small --limit-samples 5000 --epochs 2` when practical.
- For Rust encoded-cache changes, run `cargo check --manifest-path tools\encoded_dataset_cache\Cargo.toml`,
  generate a small cache with `--limit-rows`, and verify `python -m anifilebert.train`
  can start with `--encoded-cache-dir`.
- For schema v2 synthetic augmentation changes, prefer Rust tools over Python:
  run `cargo test --manifest-path tools\schema_v2_synthetic_augment\Cargo.toml`,
  generate a small smoke JSONL, and validate it with the Rust
  `validate_synthetic_aug_jsonl` binary. Confirm the manifest reports separate
  `path_series_rows`, `path_movie_rows`, `path_special_rows`,
  `path_confuser_rows`, and `dropped_media_kind_mismatch`.
- For export changes, run `python -m tools.export_onnx ...` and confirm the exporter
  reports a small PyTorch/ONNX logits difference.
- For performance-sensitive inference changes, run `uv run python -m tools.benchmark_inference ...`
  and update `reports/benchmark_results.json` plus the README performance table.
- Full training is expensive; do not start long multi-epoch runs unless the
  task explicitly requires it.
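
The export check in the list above comes down to comparing two logits arrays element by element. A minimal dependency-free sketch follows; it works on nested lists such as those returned by `tensor.tolist()` or `ndarray.tolist()`. The tolerance `1e-4` is an illustrative assumption, since the source only says the difference should be small, and the exporter's own report remains the reference.

```python
def max_abs_diff(a, b) -> float:
    """Largest element-wise absolute difference between two nested sequences."""
    a_is_num = isinstance(a, (int, float))
    b_is_num = isinstance(b, (int, float))
    if a_is_num and b_is_num:
        return abs(float(a) - float(b))
    if a_is_num or b_is_num or len(a) != len(b):
        raise ValueError("logits shapes do not match")
    # default=0.0 covers empty sequences
    return max((max_abs_diff(x, y) for x, y in zip(a, b)), default=0.0)


def logits_match(torch_logits, onnx_logits, atol: float = 1e-4) -> bool:
    """True when the two outputs agree within atol (assumed tolerance)."""
    return max_abs_diff(torch_logits, onnx_logits) <= atol
```

A shape mismatch raises instead of returning a number, because differing shapes usually point to a `--max-length` or padding disagreement rather than numerical drift.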
## Data And Artifact Rules

- Avoid committing generated checkpoint directories such as `checkpoints/`,
  `test_checkpoints*/`, and `ab_checkpoints*/`.
- Most `data/**/*.jsonl` files are generated and ignored. The small checked-in
  fixtures are `data/synthetic_small.jsonl` and `data/test_smoke.jsonl`.
- Rust encoded dataset caches under `data/encoded_cache/` are generated
  artifacts and should not be committed.
- For real training, choose exactly one current dataset:
  `datasets/AnimeName/dmhy_weak.jsonl` for regex tokenization or
  `datasets/AnimeName/dmhy_weak_char.jsonl` for character tokenization.
  Synthetic augmentation JSONL such as `data/schema_v2_synthetic_aug.jsonl`
  should be mixed in as an independent augmentation source, not treated as a
  replacement for the authoritative dataset. Treat `mixed_train.jsonl`,
  `ab_mix_100k.jsonl`, and other alternate JSONL files as legacy unless a task
  explicitly asks to inspect them.
- The published default checkpoint is the character tokenizer variant with
  `max_seq_length=128`. Keep `vocab.json`, `vocab.char.json`, `config.json`,
  ONNX export, Android assets, and docs synchronized.
- Large binary artifacts are tracked through Git LFS by `.gitattributes`.
  Preserve LFS handling for `.safetensors`, `.onnx`, `.bin`, and related model
  files.
- When publishing a new checkpoint, copy the final checkpoint files to the
  repository root and reports as described in `docs/maintenance.md`.
- When updating `datasets/AnimeName`, commit the submodule pointer in this repo
  and then update the parent MiruPlay submodule pointer.
- Push LFS objects before pushing Git commits when model or ONNX artifacts
  changed: `git lfs push origin main --all`, then `git push origin main`.
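
The "exactly one current dataset" rule in the list above can be expressed as a small lookup plus a guard. This is a sketch only: the helper `training_sources` is hypothetical, `"char"` matches the `--tokenizer char` value used in this document, and `"regex"` as a tokenizer name is an assumption.

```python
# Authoritative dataset per tokenizer, as named in this document.
DATASET_BY_TOKENIZER = {
    "regex": "datasets/AnimeName/dmhy_weak.jsonl",  # "regex" key is assumed
    "char": "datasets/AnimeName/dmhy_weak_char.jsonl",
}


def training_sources(tokenizer: str, augmentation_files=()) -> list[str]:
    """Return the single authoritative dataset followed by augmentation files."""
    if tokenizer not in DATASET_BY_TOKENIZER:
        raise ValueError(f"unknown tokenizer: {tokenizer!r}")
    primary = DATASET_BY_TOKENIZER[tokenizer]
    authoritative = set(DATASET_BY_TOKENIZER.values())
    for path in augmentation_files:
        # A second authoritative file would break the exactly-one rule.
        if path in authoritative:
            raise ValueError(f"{path} is an authoritative dataset, not augmentation")
    return [primary, *augmentation_files]
```

Keeping the authoritative file first in the returned list makes it clear that augmentation is mixed in on top of it rather than replacing it.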
## Coding Notes

- Keep the custom tokenizer contract stable: Android runtime tokenization must
  continue to match the exported vocabulary and model metadata.
- Preserve label names and BIO behavior unless a task explicitly changes the
  model schema; Android expects the current fields for title, season, episode,
  group, resolution, source, and special tags.
- Prefer deterministic dataset and training changes. Keep seed handling intact.
- Use UTF-8 for files that contain Japanese, Chinese, or release-name examples.
- Keep command examples Windows-friendly where paths reference MiruPlay.
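
To make the BIO behavior mentioned above concrete, here is a minimal sketch of grouping per-token labels into field spans. It is not the decoder in `anifilebert/`; the label names used in the example (`GROUP`, `TITLE`, `EPISODE`) are hypothetical stand-ins for the real schema, and treating a stray `I-` tag as the start of a new span is one common lenient choice, not a documented rule of this project.

```python
def decode_bio(tokens, labels, joiner=""):
    """Group tokens into (field, text) spans from BIO labels.

    joiner is "" for character tokens; pass " " for word-like tokens.
    """
    if len(tokens) != len(labels):
        raise ValueError("tokens and labels must have the same length")
    spans = []
    field, parts = None, []
    for token, label in zip(tokens, labels):
        prefix, _, name = label.partition("-")
        if prefix == "I" and field is not None and name == field:
            parts.append(token)  # continues the open span
            continue
        if field is not None:
            spans.append((field, joiner.join(parts)))  # close the open span
        if prefix in ("B", "I") and name:
            field, parts = name, [token]  # B- or stray I- opens a new span
        else:
            field, parts = None, []  # "O" or malformed label
    if field is not None:
        spans.append((field, joiner.join(parts)))
    return spans
```

For example, with character tokens `list("[AB] X 01")` and labels `["O", "B-GROUP", "I-GROUP", "O", "O", "B-TITLE", "O", "B-EPISODE", "I-EPISODE"]`, the function returns `[("GROUP", "AB"), ("TITLE", "X"), ("EPISODE", "01")]`.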