# Repository Guidelines

This repository is `AniFileBERT`, the Python model, dataset, training, inference, and ONNX export workspace used by MiruPlay as `tools/anime_parser`.

## Project Shape

- Root model artifacts (`config.json`, `model.safetensors`, `vocab.json`, `tokenizer_config.json`, `training_args.bin`) are the published default checkpoint.
- Core parser/training code lives in `anifilebert/`.
- Command-line tools live in `tools/`, including ONNX export, fixed-case evaluation, benchmarks, dataset relabeling, dataset generation, and Colab helpers.
- `datasets/AnimeName` is a nested dataset submodule and should be treated as the authoritative dataset snapshot when present. Use either `dmhy_weak.jsonl` for the regex tokenizer or `dmhy_weak_char.jsonl` for the character tokenizer; the other dataset files are legacy snapshots.
- `exports/` contains Android-facing ONNX artifacts. Keep it in sync when changing export behavior or the published checkpoint.

## Setup

```bash
uv sync
```

Use `uv run`, `uv add`, and `uv sync` for environment operations. Do not use global `pip` for repository work.

If the dataset submodule is missing, initialize it:

```bash
git submodule update --init --recursive
```

## Common Commands

Run a parser smoke check:

```bash
uv run python -m anifilebert.inference --model-dir . "Witch.Hat.Atelier.S01E07.1080p.NF.WEB-DL.JPN.AAC2.0.H.264.MSubs-ToonsHub"
```

Run fixed real-world parser regression:

```bash
uv run python -m tools.evaluate_parser_cases --model-dir . --case-file data/parser_regression_cases.json --output reports/case_metrics.json
```

Benchmark PyTorch and ONNX Runtime inference:

```bash
uv run python -m tools.benchmark_inference --model-dir . --onnx exports/anime_filename_parser.onnx --case-file data/parser_regression_cases.json --repeat 20 --warmup 20 --torch-threads 1 --ort-threads 1 --output reports/benchmark_results.json
```

Train the current default character tokenizer:

```bash
uv run python -m anifilebert.train --tokenizer char --data-file datasets/AnimeName/dmhy_weak_char.jsonl --vocab-file datasets/AnimeName/vocab.char.json --save-dir checkpoints/dmhy-char-full --init-model-dir . --epochs 2 --batch-size 256 --learning-rate 0.00008 --warmup-steps 300 --max-seq-length 128 --train-split 0.98 --num-workers 4 --checkpoint-steps 1000 --save-total-limit 3 --parse-eval-limit 2048 --case-eval-file data/parser_regression_cases.json --seed 52 --experiment-name dmhy-char-full
```

Export for Android:

```bash
uv run python -m tools.export_onnx --model-dir . --max-length 128 --android-assets-dir ../../scraper/src/main/assets/anime_parser
```

## Codex-Controlled Colab Training

Free Colab cannot be treated as an always-on remote machine. Use it as a short-lived GPU worker only after the user manually opens a Colab runtime and starts the worker cell. Do not assume Codex can wake Colab by itself.

Before relying on the Colab flow, make sure the Colab helper files have been pushed to the Hugging Face model repo, or the user has uploaded them manually: `tools/colab_worker.py`, `tools/colab_client.py`, `tools/colab_train.py`, and `colab/`.

Ask the user to start a Colab GPU runtime with:

```python
from google.colab import drive
drive.mount("/content/drive")
!git clone --recursive https://huggingface.co/ModerRAS/AniFileBERT /content/AniFileBERT || true
%cd /content/AniFileBERT
!git pull --ff-only || true
!git submodule update --init --recursive
!python -m tools.colab_worker
```

The worker prints `COLAB_WORKER_URL=...` and `COLAB_WORKER_TOKEN=...`. After the user provides those values, set them for local commands:

```powershell
$env:ANIFILEBERT_COLAB_URL="https://...trycloudflare.com"
$env:ANIFILEBERT_COLAB_TOKEN="..."
python -m tools.colab_client health
```

Submit the default regex fine-tune:

```powershell
python -m tools.colab_client submit --profile dmhy_regex_finetune --wait
```

Submit the character tokenizer run only when intentional:

```powershell
python -m tools.colab_client submit --profile dmhy_char_train --wait
```

Useful follow-up commands:

```powershell
python -m tools.colab_client jobs
python -m tools.colab_client status
python -m tools.colab_client logs --tail 200
python -m tools.colab_client manifest
python -m tools.colab_client cancel
```

The default Colab profiles save checkpoints to Google Drive every 1000 steps and resume with `resume_from_checkpoint: "auto"`, so if free Colab disconnects, ask the user to restart the worker and submit the same profile again. Artifacts land under `MyDrive/AniFileBERT/checkpoints//`, and worker logs land under `MyDrive/AniFileBERT/worker/jobs//`.

## Validation Expectations

- For parser or tokenizer changes, run `python -m anifilebert.inference --model-dir . ...` with at least one realistic filename.
- Run `uv run python -m tools.evaluate_parser_cases --model-dir . --case-file data/parser_regression_cases.json --output reports/case_metrics.json` before publishing parser changes.
- For dataset alignment, tokenizer, model, or training-loop changes, run `python -m tools.test_train_small --limit-samples 5000 --epochs 2` when practical.
- For export changes, run `python -m tools.export_onnx ...` and confirm the exporter reports a small PyTorch/ONNX logits difference.
- For performance-sensitive inference changes, run `uv run python -m tools.benchmark_inference ...` and update `reports/benchmark_results.json` plus the README performance table.
- Full training is expensive; do not start long multi-epoch runs unless the task explicitly requires it.

## Data And Artifact Rules

- Avoid committing generated checkpoint directories such as `checkpoints/`, `test_checkpoints*/`, and `ab_checkpoints*/`.
- Most `data/**/*.jsonl` files are generated and ignored. The small checked-in fixtures are `data/synthetic_small.jsonl` and `data/test_smoke.jsonl`.
- For real training, choose exactly one current dataset: `datasets/AnimeName/dmhy_weak.jsonl` for regex tokenization or `datasets/AnimeName/dmhy_weak_char.jsonl` for character tokenization. Treat `mixed_train.jsonl`, `ab_mix_100k.jsonl`, and other alternate JSONL files as legacy unless a task explicitly asks to inspect them.
- The published default checkpoint is the character tokenizer variant with `max_seq_length=128`. Keep `vocab.json`, `vocab.char.json`, `config.json`, ONNX export, Android assets, and docs synchronized.
- Large binary artifacts are tracked through Git LFS by `.gitattributes`. Preserve LFS handling for `.safetensors`, `.onnx`, `.bin`, and related model files.
- When publishing a new checkpoint, copy the final checkpoint files to the repository root and reports as described in `docs/maintenance.md`.
- When updating `datasets/AnimeName`, commit the submodule pointer in this repo and then update the parent MiruPlay submodule pointer.
- Push LFS objects before pushing Git commits when model or ONNX artifacts changed: `git lfs push origin main --all`, then `git push origin main`.

## Coding Notes

- Keep the custom tokenizer contract stable: Android runtime tokenization must continue to match the exported vocabulary and model metadata.
- Preserve label names and BIO behavior unless a task explicitly changes the model schema; Android expects the current fields for title, season, episode, group, resolution, source, and special tags.
- Prefer deterministic dataset and training changes. Keep seed handling intact.
- Use UTF-8 for files that contain Japanese, Chinese, or release-name examples.
- Keep command examples Windows-friendly where paths reference MiruPlay.
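
To make the BIO behavior mentioned in the coding notes concrete, here is a minimal sketch of how BIO tags are typically grouped into labeled spans. It is not the repository's decoder: the function name `decode_bio`, the label strings (`TITLE`, `SEASON`, `EPISODE`, `RESOLUTION`), and the example tokenization are all hypothetical, and the real label names should be read from the checkpoint's `config.json`.

```python
from typing import List, Optional, Tuple


def decode_bio(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group tokens into (label, text) spans using generic BIO rules.

    Hypothetical helper for illustration; label names are assumptions.
    """
    spans: List[Tuple[str, str]] = []
    label: Optional[str] = None  # label of the span currently open, if any
    buf: List[str] = []          # tokens collected for the open span

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new span, closing any open one.
            if label is not None:
                spans.append((label, "".join(buf)))
            label, buf = tag[2:], [token]
        elif tag.startswith("I-") and label == tag[2:]:
            # An I- tag extends the open span only if the label matches.
            buf.append(token)
        else:
            # "O", or an I- tag that does not continue the open span,
            # closes the open span; the token itself is dropped.
            if label is not None:
                spans.append((label, "".join(buf)))
            label, buf = None, []

    if label is not None:
        spans.append((label, "".join(buf)))
    return spans


# Assumed tokenization and tags for a shortened release name.
example_tokens = ["Witch", ".", "Hat", ".", "S", "01", "E", "07", ".", "1080p"]
example_tags = [
    "B-TITLE", "I-TITLE", "I-TITLE", "O", "O",
    "B-SEASON", "O", "B-EPISODE", "O", "B-RESOLUTION",
]
example_spans = decode_bio(example_tokens, example_tags)
print(example_spans)
```

Because the Android runtime consumes the same fields, any change to how spans open or close (for example, treating a stray `I-` tag as the start of a span) would count as a schema change under the rule above.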