Add agent repository guidelines
AGENTS.md
ADDED
@@ -0,0 +1,109 @@
# Repository Guidelines

This repository is `AniFileBERT`, the Python model, dataset, training, inference,
and ONNX export workspace used by MiruPlay as `tools/anime_parser`.

## Project Shape

- Root model artifacts (`config.json`, `model.safetensors`, `vocab.json`,
  `tokenizer_config.json`, `training_args.bin`) are the published default
  checkpoint.
- Core code lives in `train.py`, `dataset.py`, `tokenizer.py`, `model.py`,
  `inference.py`, and `export_onnx.py`.
- Dataset generation and labeling helpers live in `data_generator.py`,
  `dmhy_dataset.py`, `mix_datasets.py`, `llm_labeler.py`,
  `semantic_labeler.py`, and `convert_to_char_dataset.py`.
- `datasets/AnimeName` is a nested dataset submodule and should be treated as
  the authoritative dataset snapshot when present. Use either
  `dmhy_weak.jsonl` for the regex tokenizer or `dmhy_weak_char.jsonl` for the
  character tokenizer; the other dataset files are legacy snapshots.
- `exports/` contains Android-facing ONNX artifacts. Keep it in sync when
  changing export behavior or the published checkpoint.

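As a quick way to confirm that the published default checkpoint is complete, the root artifact list above can be checked programmatically. This is a minimal sketch: the file names come from the list above, while the helper `missing_root_artifacts` is hypothetical and not part of the repository.

```python
from pathlib import Path

# File names are taken from the Project Shape list; the helper itself is
# hypothetical and does not exist in the repository.
ROOT_ARTIFACTS = (
    "config.json",
    "model.safetensors",
    "vocab.json",
    "tokenizer_config.json",
    "training_args.bin",
)


def missing_root_artifacts(repo_root: str = ".") -> list[str]:
    """Return the published-checkpoint files that are absent from repo_root."""
    root = Path(repo_root)
    return [name for name in ROOT_ARTIFACTS if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_root_artifacts()
    print("missing:", missing if missing else "none")
```

Running it from the repository root should report `none` when all five files are present. It only checks that the files exist, not that they are valid.
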
## Setup

```bash
python -m pip install -r requirements.txt
```

For local GPU training, install a CUDA-compatible PyTorch build first, then
install the remaining requirements.

If the dataset submodule is missing, initialize it:

```bash
git submodule update --init --recursive
```

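Whether the dataset submodule still needs initializing can be guessed from the state of its directory. The sketch below assumes that an uninitialized submodule shows up as a missing or empty `datasets/AnimeName` directory. That is a rough heuristic, and the helper name `submodule_is_initialized` is hypothetical.

```python
from pathlib import Path


def submodule_is_initialized(path: str = "datasets/AnimeName") -> bool:
    """Return True when the submodule directory exists and is non-empty.

    An uninitialized submodule is usually an empty directory, so this is a
    heuristic rather than a definitive check of Git's submodule state.
    """
    directory = Path(path)
    return directory.is_dir() and any(directory.iterdir())


if __name__ == "__main__":
    if submodule_is_initialized():
        print("datasets/AnimeName looks initialized")
    else:
        print("run: git submodule update --init --recursive")
```
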
## Common Commands

Run a parser smoke check:

```bash
python inference.py --model-dir . "Witch.Hat.Atelier.S01E07.1080p.NF.WEB-DL.JPN.AAC2.0.H.264.MSubs-ToonsHub"
```

Run the lightweight training pipeline check:

```bash
python test_train_small.py --limit-samples 5000 --epochs 2
```

Train the default regex tokenizer from the dataset submodule:

```bash
python train.py --data-file datasets/AnimeName/dmhy_weak.jsonl --vocab-file datasets/AnimeName/vocab.json --save-dir checkpoints/dmhy-finetune --init-model-dir . --epochs 1 --batch-size 128 --learning-rate 0.0003 --warmup-steps 300 --seed 42
```

Train the character tokenizer only when that variant is intentional:

```bash
python train.py --tokenizer char --data-file datasets/AnimeName/dmhy_weak_char.jsonl --vocab-file datasets/AnimeName/vocab.char.json --save-dir checkpoints/dmhy-weak-char --epochs 1 --batch-size 64 --learning-rate 0.0003 --warmup-steps 300 --max-seq-length 128 --seed 42
```

Export for Android:

```bash
python export_onnx.py --model-dir checkpoints/dmhy-finetune/final --android-assets-dir ../../scraper/src/main/assets/anime_parser
```

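To run the parser smoke check over several filenames, the `inference.py` invocation shown above can be wrapped in a small script. This is a sketch under assumptions: only the `--model-dir` flag and the positional filename come from the commands above, the helper names are hypothetical, and it is assumed (not confirmed by the source) that `inference.py` exits with a non-zero code on failure.

```python
import subprocess
import sys


def build_smoke_command(filename: str, model_dir: str = ".") -> list[str]:
    """Build the inference.py invocation shown above for one filename."""
    return [sys.executable, "inference.py", "--model-dir", model_dir, filename]


def run_smoke_checks(filenames: list[str], model_dir: str = ".") -> dict[str, int]:
    """Run each smoke check and map filename -> process exit code."""
    results = {}
    for name in filenames:
        # Passing an argument list avoids shell quoting issues, which also
        # keeps the call portable between Windows and POSIX shells.
        completed = subprocess.run(build_smoke_command(name, model_dir), check=False)
        results[name] = completed.returncode
    return results
```

Calling `run_smoke_checks([...])` from the repository root runs one process per filename, so it is best kept to a handful of realistic release names.
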
## Validation Expectations

- For parser or tokenizer changes, run `python inference.py --model-dir . ...`
  with at least one realistic filename.
- For dataset alignment, tokenizer, model, or training-loop changes, run
  `python test_train_small.py --limit-samples 5000 --epochs 2` when practical.
- For export changes, run `python export_onnx.py ...` and confirm the exporter
  reports a small PyTorch/ONNX logits difference.
- Full training is expensive; do not start long multi-epoch runs unless the
  task explicitly requires it.

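The "small PyTorch/ONNX logits difference" mentioned above is typically a maximum absolute element-wise difference between the two outputs. The sketch below illustrates that metric on plain nested lists so that it runs without dependencies. It is not the exporter's own code, and the `TOLERANCE` value is a hypothetical threshold, since the source does not state one.

```python
def flatten(values) -> list[float]:
    """Flatten arbitrarily nested lists/tuples of numbers into one list."""
    flat = []
    for item in values:
        if isinstance(item, (list, tuple)):
            flat.extend(flatten(item))
        else:
            flat.append(float(item))
    return flat


def max_abs_diff(reference, candidate) -> float:
    """Largest element-wise absolute difference between two logits arrays."""
    ref, cand = flatten(reference), flatten(candidate)
    if len(ref) != len(cand):
        raise ValueError("logits have different sizes")
    return max((abs(r - c) for r, c in zip(ref, cand)), default=0.0)


# Hypothetical tolerance; the source does not specify a threshold.
TOLERANCE = 1e-4

if __name__ == "__main__":
    pytorch_logits = [[0.12, -1.50], [2.00, 0.33]]
    onnx_logits = [[0.12, -1.50], [2.00, 0.33]]
    print(max_abs_diff(pytorch_logits, onnx_logits) <= TOLERANCE)
```
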
## Data And Artifact Rules

- Avoid committing generated checkpoint directories such as `checkpoints/`,
  `test_checkpoints*/`, and `ab_checkpoints*/`.
- Most `data/**/*.jsonl` files are generated and ignored. The small checked-in
  fixtures are `data/synthetic_small.jsonl` and `data/test_smoke.jsonl`.
- For real training, choose exactly one current dataset:
  `datasets/AnimeName/dmhy_weak.jsonl` for regex tokenization or
  `datasets/AnimeName/dmhy_weak_char.jsonl` for character tokenization.
  Treat `mixed_train.jsonl`, `ab_mix_100k.jsonl`, and other alternate JSONL
  files as legacy unless a task explicitly asks to inspect them.
- Large binary artifacts are tracked through Git LFS by `.gitattributes`.
  Preserve LFS handling for `.safetensors`, `.onnx`, `.bin`, and related model
  files.
- When publishing a new checkpoint, copy the final checkpoint files to the
  repository root as described in `MAINTENANCE.md`.
- When updating `datasets/AnimeName`, commit the submodule pointer in this repo
  and then update the parent MiruPlay submodule pointer.

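One way to confirm that LFS handling is still in place is to read the patterns marked `filter=lfs` in `.gitattributes`. The sketch below parses such a file from a string. The `SAMPLE` content is illustrative only and may not match the repository's real `.gitattributes`, and the `fnmatch`-based matching only approximates Git's own pattern rules.

```python
from fnmatch import fnmatch


def lfs_patterns(gitattributes_text: str) -> list[str]:
    """Return the path patterns whose attributes include filter=lfs."""
    patterns = []
    for line in gitattributes_text.splitlines():
        parts = line.split()
        if not parts or parts[0].startswith("#"):
            continue  # skip blank lines and comments
        if "filter=lfs" in parts[1:]:
            patterns.append(parts[0])
    return patterns


def is_lfs_tracked(filename: str, patterns: list[str]) -> bool:
    """Approximate match using fnmatch; Git's matching rules are richer."""
    return any(fnmatch(filename, pattern) for pattern in patterns)


# Illustrative input only; the repository's real .gitattributes may differ.
SAMPLE = """\
# model artifacts
*.safetensors filter=lfs diff=lfs merge=lfs -text
*.onnx filter=lfs diff=lfs merge=lfs -text
*.bin filter=lfs diff=lfs merge=lfs -text
"""
```

In practice the text would come from reading `.gitattributes` at the repository root, followed by a check that `model.safetensors`, the ONNX export, and `training_args.bin` all match a pattern.
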
## Coding Notes

- Keep the custom tokenizer contract stable: Android runtime tokenization must
  continue to match the exported vocabulary and model metadata.
- Preserve label names and BIO behavior unless a task explicitly changes the
  model schema; Android expects the current fields for title, season, episode,
  group, resolution, source, and special tags.
- Prefer deterministic dataset and training changes. Keep seed handling intact.
- Use UTF-8 for files that contain Japanese, Chinese, or release-name examples.
- Keep command examples Windows-friendly where paths reference MiruPlay.

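For readers unfamiliar with the BIO behavior referred to above, the sketch below shows how BIO tags are commonly grouped into labeled spans. It is a generic illustration, not the repository's decoder: the label names (`TITLE`, `SEASON`, `EPISODE`) are hypothetical stand-ins for the model's real label set, and joining tokens without separators is an assumption about the tokenizer.

```python
def decode_bio(tokens: list[str], tags: list[str]) -> list[tuple[str, str]]:
    """Group BIO-tagged tokens into (label, text) spans.

    Tokens are concatenated without separators, which is an assumption
    about how the tokenizer splits filenames.
    """
    spans = []
    label, pieces = None, []
    for token, tag in zip(tokens, tags):
        prefix, _, name = tag.partition("-")
        # A span starts on B-X, or on an I-X that does not continue label X.
        starts_new = prefix == "B" or (prefix == "I" and name != label)
        if prefix == "O" or starts_new:
            if label is not None:
                spans.append((label, "".join(pieces)))
            label, pieces = (name, [token]) if starts_new else (None, [])
        else:  # I-X continuing the current span
            pieces.append(token)
    if label is not None:
        spans.append((label, "".join(pieces)))
    return spans


if __name__ == "__main__":
    tokens = ["Witch", ".", "Hat", ".", "S01", "E07"]
    tags = ["B-TITLE", "I-TITLE", "I-TITLE", "O", "B-SEASON", "B-EPISODE"]
    print(decode_bio(tokens, tags))
    # [('TITLE', 'Witch.Hat'), ('SEASON', 'S01'), ('EPISODE', 'E07')]
```

Because a consumer such as the Android runtime maps spans back to fields by label name, renaming a label or changing how `B-`/`I-` boundaries are emitted changes the decoded output even when the model weights stay the same.
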