+# Repository Guidelines
+This repository is `AniFileBERT`, the Python workspace for the model, dataset,
+training, inference, and ONNX export. MiruPlay includes it as the
+`tools/anime_parser` submodule.
+## Project Shape
+- Root model artifacts (`config.json`, `model.safetensors`, `vocab.json`,
+  `tokenizer_config.json`, `training_args.bin`) are the published default
+  checkpoint.
+- Core code lives in `train.py`, `dataset.py`, `tokenizer.py`, `model.py`,
+  `inference.py`, and `export_onnx.py`.
+- Dataset generation and labeling helpers live in `data_generator.py`,
+  `dmhy_dataset.py`, `mix_datasets.py`, `llm_labeler.py`,
+  `semantic_labeler.py`, and `convert_to_char_dataset.py`.
+- `datasets/AnimeName` is a nested dataset submodule and should be treated as
+  the authoritative dataset snapshot when present. Use either
+  `dmhy_weak.jsonl` for the regex tokenizer or `dmhy_weak_char.jsonl` for the
+  character tokenizer; the other dataset files are legacy snapshots.
+- `exports/` contains Android-facing ONNX artifacts. Keep it in sync when
+  changing export behavior or the published checkpoint.
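The root-artifact list above can be checked mechanically before publishing or running inference. The following is a minimal sketch, not part of the repository: the file names come from the list above, while the helper name `missing_root_artifacts` is hypothetical.

```python
from pathlib import Path

# File names are taken from the "Project Shape" list; the helper is hypothetical.
ROOT_ARTIFACTS = (
    "config.json",
    "model.safetensors",
    "vocab.json",
    "tokenizer_config.json",
    "training_args.bin",
)


def missing_root_artifacts(repo_root="."):
    """Return the published-checkpoint files that are absent from repo_root."""
    root = Path(repo_root)
    return [name for name in ROOT_ARTIFACTS if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_root_artifacts(".")
    print("missing:", missing if missing else "none")
```

An empty result only means the files exist; it does not verify that they are a consistent checkpoint.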
+## Setup
+```bash
+python -m pip install -r requirements.txt
+```
+For local GPU training, install a CUDA-compatible PyTorch build first, then
+install the remaining requirements.
+If the dataset submodule is missing, initialize it:
+```bash
+git submodule update --init --recursive
+```
+## Common Commands
+Run a parser smoke check:
+```bash
+python inference.py --model-dir . "Witch.Hat.Atelier.S01E07.1080p.NF.WEB-DL.JPN.AAC2.0.H.264.MSubs-ToonsHub"
+```
+Run the lightweight training pipeline check:
+```bash
+python test_train_small.py --limit-samples 5000 --epochs 2
+```
+Fine-tune the default regex-tokenizer model from the published root checkpoint, using the dataset submodule:
+```bash
+python train.py --data-file datasets/AnimeName/dmhy_weak.jsonl --vocab-file datasets/AnimeName/vocab.json --save-dir checkpoints/dmhy-finetune --init-model-dir . --epochs 1 --batch-size 128 --learning-rate 0.0003 --warmup-steps 300 --seed 42
+```
+Train the character-tokenizer variant only when that variant is intended:
+```bash
+python train.py --tokenizer char --data-file datasets/AnimeName/dmhy_weak_char.jsonl --vocab-file datasets/AnimeName/vocab.char.json --save-dir checkpoints/dmhy-weak-char --epochs 1 --batch-size 64 --learning-rate 0.0003 --warmup-steps 300 --max-seq-length 128 --seed 42
+```
+Export for Android:
+```bash
+python export_onnx.py --model-dir checkpoints/dmhy-finetune/final --android-assets-dir ../../scraper/src/main/assets/anime_parser
+```
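When these commands are driven from a script, passing the arguments as a list avoids shell quoting differences between Windows and POSIX shells. The sketch below assumes only the `inference.py --model-dir <dir> <filename>` form shown above; the helper names `smoke_check_command` and `run_smoke_check` are hypothetical.

```python
import subprocess
import sys


def smoke_check_command(filename, model_dir="."):
    """Build the argv list for the parser smoke check shown above."""
    # sys.executable keeps the call inside the active Python environment.
    return [sys.executable, "inference.py", "--model-dir", model_dir, filename]


def run_smoke_check(filename, repo_root=".", model_dir="."):
    """Run the smoke check from repo_root; raises CalledProcessError on failure."""
    command = smoke_check_command(filename, model_dir)
    return subprocess.run(command, cwd=repo_root, check=True)
```

Because no shell is involved, a release name containing spaces or brackets is passed to `inference.py` as a single argument without extra quoting.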
+## Validation Expectations
+- For parser or tokenizer changes, run `python inference.py --model-dir . ...`
+  with at least one realistic filename.
+- For dataset alignment, tokenizer, model, or training-loop changes, run
+  `python test_train_small.py --limit-samples 5000 --epochs 2` when practical.
+- For export changes, run `python export_onnx.py ...` and confirm that the
+  exporter reports only a small numerical difference between the PyTorch and
+  ONNX logits.
+- Full training is expensive; do not start long multi-epoch runs unless the
+  task explicitly requires it.
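The export check above comes down to comparing two logits tensors element by element. The sketch below shows that comparison on plain nested lists. It is not the exporter's own code, and the `1e-4` tolerance is an assumed value for illustration, not a threshold documented by this repository.

```python
def max_abs_diff(reference, candidate):
    """Largest element-wise absolute difference between equal-shaped nested lists."""
    if isinstance(reference, (list, tuple)):
        if not isinstance(candidate, (list, tuple)) or len(reference) != len(candidate):
            raise ValueError("shape mismatch between logits")
        return max(
            (max_abs_diff(r, c) for r, c in zip(reference, candidate)),
            default=0.0,
        )
    if isinstance(candidate, (list, tuple)):
        raise ValueError("shape mismatch between logits")
    return abs(float(reference) - float(candidate))


# Assumed tolerance for illustration only; use whatever the exporter reports against.
TOLERANCE = 1e-4


def logits_match(pytorch_logits, onnx_logits, tolerance=TOLERANCE):
    """True when the two logits tensors agree within the given tolerance."""
    return max_abs_diff(pytorch_logits, onnx_logits) <= tolerance
```

A large difference usually points at a mismatch between the exported graph and the checkpoint, so it is worth investigating before copying artifacts into `exports/`.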
+## Data And Artifact Rules
+- Avoid committing generated checkpoint directories such as `checkpoints/`,
+  `test_checkpoints*/`, and `ab_checkpoints*/`.
+- Most `data/**/*.jsonl` files are generated and ignored. The small checked-in
+  fixtures are `data/synthetic_small.jsonl` and `data/test_smoke.jsonl`.
+- For real training, choose exactly one current dataset:
+  `datasets/AnimeName/dmhy_weak.jsonl` for regex tokenization or
+  `datasets/AnimeName/dmhy_weak_char.jsonl` for character tokenization.
+  Treat `mixed_train.jsonl`, `ab_mix_100k.jsonl`, and other alternate JSONL
+  files as legacy unless a task explicitly asks to inspect them.
+- Large binary artifacts are tracked through Git LFS by `.gitattributes`.
+  Preserve LFS handling for `.safetensors`, `.onnx`, `.bin`, and related model
+  files.
+- When publishing a new checkpoint, copy the final checkpoint files to the
+  repository root as described in `MAINTENANCE.md`.
+- When updating `datasets/AnimeName`, commit the submodule pointer in this repo
+  and then update the parent MiruPlay submodule pointer.
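Two of the rules above lend themselves to small guards: picking exactly one current dataset per tokenizer, and noticing when an LFS-tracked file is still an unfetched pointer stub. The sketch below is illustrative only. The dataset and vocabulary pairings are copied from the training commands in this document, the keys `"regex"` and `"char"` are labels chosen for this example, and both helper names are hypothetical. The pointer prefix is the first line of a standard Git LFS pointer file.

```python
from pathlib import Path

DATASET_DIR = Path("datasets") / "AnimeName"

# Pairings copied from the training commands; the dictionary keys are
# illustrative labels, not necessarily valid values for train.py flags.
CURRENT_DATASETS = {
    "regex": ("dmhy_weak.jsonl", "vocab.json"),
    "char": ("dmhy_weak_char.jsonl", "vocab.char.json"),
}

# First line of a Git LFS pointer file.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def current_dataset(tokenizer):
    """Return (data_file, vocab_file) for the given tokenizer variant."""
    try:
        data_name, vocab_name = CURRENT_DATASETS[tokenizer]
    except KeyError:
        raise ValueError(f"unknown tokenizer variant: {tokenizer!r}") from None
    return DATASET_DIR / data_name, DATASET_DIR / vocab_name


def is_lfs_pointer(path):
    """True if path holds a Git LFS pointer stub instead of the real payload."""
    with open(path, "rb") as handle:
        return handle.read(len(LFS_POINTER_PREFIX)) == LFS_POINTER_PREFIX
```

If `is_lfs_pointer` returns `True` for a model file, the real content has not been downloaded yet; `git lfs pull` fetches it.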
+## Coding Notes
+- Keep the custom tokenizer contract stable: Android runtime tokenization must
+  continue to match the exported vocabulary and model metadata.
+- Preserve label names and BIO behavior unless a task explicitly changes the
+  model schema; Android expects the current fields for title, season, episode,
+  group, resolution, source, and special tags.
+- Prefer deterministic dataset and training changes. Keep seed handling intact.
+- Use UTF-8 for files that contain Japanese, Chinese, or release-name examples.
+- Keep command examples Windows-friendly where paths reference MiruPlay.
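To illustrate the BIO behavior the Android side depends on, the sketch below groups per-token BIO labels into field spans. It is a generic BIO decoder written for this guide, not the code in `inference.py`, and the tag spellings such as `B-TITLE` are hypothetical; the real label names are defined by the model schema.

```python
def bio_to_spans(tokens, labels):
    """Group BIO-labeled tokens into (field, [tokens]) spans, in order."""
    if len(tokens) != len(labels):
        raise ValueError("tokens and labels must have the same length")

    spans = []
    current_field = None
    current_tokens = []

    def flush():
        # Close the open span, if any.
        if current_tokens:
            spans.append((current_field, list(current_tokens)))

    for token, label in zip(tokens, labels):
        prefix, _, field = label.partition("-")
        if prefix in ("B", "I") and field:
            if prefix == "B" or field != current_field:
                # "B-" starts a span; a stray "I-" is treated as a span start too.
                flush()
                current_field, current_tokens = field, [token]
            else:
                current_tokens.append(token)
        else:
            # "O" or an unrecognized label closes the open span.
            flush()
            current_field, current_tokens = None, []
    flush()
    return spans


if __name__ == "__main__":
    # Hypothetical tokens and tag names, for illustration only.
    example_tokens = ["Witch", "Hat", "Atelier", "S01", "E07", "1080p"]
    example_labels = ["B-TITLE", "I-TITLE", "I-TITLE", "B-SEASON", "B-EPISODE", "B-RESOLUTION"]
    for field, parts in bio_to_spans(example_tokens, example_labels):
        print(field, parts)
```

Renaming a label or changing how `B-`/`I-` tags are emitted changes the spans a decoder like this produces, which is why such changes count as schema changes.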