
Repository Guidelines

This repository is AniFileBERT, the Python model, dataset, training, inference, and ONNX export workspace used by MiruPlay as tools/anime_parser.

Project Shape

  • Root model artifacts (config.json, model.safetensors, vocab.json, tokenizer_config.json, training_args.bin) are the published default checkpoint.
  • Core parser/training code lives in anifilebert/.
  • Command-line tools live in tools/, including ONNX export, fixed-case evaluation, benchmarks, dataset relabeling, dataset generation, and Colab helpers.
  • datasets/AnimeName is a nested dataset submodule and should be treated as the authoritative dataset snapshot when present. Use either dmhy_weak.jsonl for the regex tokenizer or dmhy_weak_char.jsonl for the character tokenizer; the other dataset files are legacy snapshots.
  • exports/ contains Android-facing ONNX artifacts. Keep it in sync when changing export behavior or the published checkpoint.
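
A quick way to confirm that the published default checkpoint is complete is to check the repository root for the artifact files listed above. The sketch below is illustrative only: the helper name missing_root_artifacts is hypothetical and is not part of anifilebert/ or tools/.

```python
from __future__ import annotations

from pathlib import Path

# File names taken from the "Project Shape" list above.
ROOT_ARTIFACTS = (
    "config.json",
    "model.safetensors",
    "vocab.json",
    "tokenizer_config.json",
    "training_args.bin",
)


def missing_root_artifacts(repo_root: str | Path = ".") -> list[str]:
    """Return the checkpoint files that are absent from repo_root (hypothetical helper)."""
    root = Path(repo_root)
    return [name for name in ROOT_ARTIFACTS if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_root_artifacts(".")
    print("missing:", ", ".join(missing) if missing else "none")
```

Running it from the repository root prints the missing file names, or "none" when all five artifacts are present. It does not check whether the LFS-tracked files are real objects or unresolved pointer files.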

Setup

uv sync

Use uv run, uv add, and uv sync for environment operations. Do not use global pip for repository work.

If the dataset submodule is missing, initialize it:

git submodule update --init --recursive

Common Commands

Run a parser smoke check:

uv run python -m anifilebert.inference --model-dir . "Witch.Hat.Atelier.S01E07.1080p.NF.WEB-DL.JPN.AAC2.0.H.264.MSubs-ToonsHub"

Run fixed real-world parser regression:

uv run python -m tools.evaluate_parser_cases --model-dir . --case-file data/parser_regression_cases.json --output reports/case_metrics.json

Benchmark PyTorch and ONNX Runtime inference:

uv run python -m tools.benchmark_inference --model-dir . --onnx exports/anime_filename_parser.onnx --case-file data/parser_regression_cases.json --repeat 20 --warmup 20 --torch-threads 1 --ort-threads 1 --output reports/benchmark_results.json

Train the current default character tokenizer:

uv run python -m anifilebert.train --tokenizer char --data-file datasets/AnimeName/dmhy_weak_char.jsonl --vocab-file datasets/AnimeName/vocab.char.json --save-dir checkpoints/dmhy-char-full --init-model-dir . --epochs 2 --batch-size 256 --learning-rate 0.00008 --warmup-steps 300 --max-seq-length 128 --train-split 0.98 --num-workers 4 --checkpoint-steps 1000 --save-total-limit 3 --parse-eval-limit 2048 --case-eval-file data/parser_regression_cases.json --seed 52 --experiment-name dmhy-char-full

For large generated or hard-focus JSONL files, pre-encode train/eval shards with Rust before training to avoid the slow Python startup encode path:

cargo run --release --manifest-path tools\encoded_dataset_cache\Cargo.toml -- `
  --input data\schema_v2_hard_focus_char_seed63.jsonl `
  --vocab-file datasets\AnimeName\vocab.char.json `
  --label-schema-file label_schema.json `
  --output-dir data\encoded_cache\schema_v2_hard_focus_char_seed63 `
  --max-length 128 `
  --train-split 0.95 `
  --seed 63 `
  --shard-size 25000 `
  --threads 16

Then pass the generated cache to training, using the same data file, vocab file, max length, split ratio, and seed that produced it:

.\.venv\Scripts\python.exe -m anifilebert.train --tokenizer char `
  --data-file data\schema_v2_hard_focus_char_seed63.jsonl `
  --vocab-file datasets\AnimeName\vocab.char.json `
  --encoded-cache-dir data\encoded_cache\schema_v2_hard_focus_char_seed63 `
  --max-seq-length 128 --train-split 0.95 --seed 63

Do not combine --encoded-cache-dir with --extra-data-file, --limit-samples, --rebuild-vocab, training-time augmentation, or --apply-label-repairs. Regenerate the cache after changing the JSONL, vocab, label schema, max length, split ratio, or seed.
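
One way to notice a stale cache is to fingerprint the inputs named in the regeneration rule above and compare the result with a value stored next to the cache. This is a minimal sketch under stated assumptions: the functions and the fingerprint.txt file are hypothetical, and nothing here describes what the Rust encoder or anifilebert.train actually records or checks.

```python
from __future__ import annotations

import hashlib
import json
from pathlib import Path


def file_digest(path: Path) -> str:
    """SHA-256 of a file, read in 1 MiB chunks so large JSONL files stay cheap."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cache_fingerprint(data_file, vocab_file, label_schema_file,
                      max_length: int, train_split: float, seed: int) -> str:
    """Combine every input that invalidates the cache into one hash (hypothetical)."""
    payload = {
        "data": file_digest(Path(data_file)),
        "vocab": file_digest(Path(vocab_file)),
        "label_schema": file_digest(Path(label_schema_file)),
        "max_length": max_length,
        "train_split": train_split,
        "seed": seed,
    }
    blob = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()


def cache_is_stale(cache_dir, fingerprint: str) -> bool:
    """True when the stored fingerprint is missing or differs.

    Assumption: a fingerprint.txt file was written into cache_dir by hand or by
    a wrapper script; the Rust tool is not known to create it.
    """
    marker = Path(cache_dir) / "fingerprint.txt"
    if not marker.is_file():
        return True
    return marker.read_text(encoding="utf-8").strip() != fingerprint
```

With the commands above, the inputs would be data/schema_v2_hard_focus_char_seed63.jsonl, datasets/AnimeName/vocab.char.json, label_schema.json, 128, 0.95, and 63. Any change to one of them yields a different fingerprint, which signals that the cache should be regenerated.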

Generate schema v2 synthetic augmentation with Rust. This is an independent augmentation chain and must not rewrite the authoritative DMHY dataset or the main DMHY template-application flow:

cargo run --release --manifest-path tools\schema_v2_synthetic_augment\Cargo.toml --bin schema_v2_synthetic_augment -- `
  --recipes reports\dmhy_template_recipes.full_top5000.seed.jsonl `
  --label-schema-file label_schema.json `
  --numeric-title-seeds data\synthetic_numeric_titles.txt `
  --path-prefix-seeds data\synthetic_path_prefixes.txt `
  --limit-templates 3000 `
  --max-rows 50000 `
  --output data\schema_v2_synthetic_aug.jsonl `
  --manifest-output data\schema_v2_synthetic_aug.manifest.json

Validate the generated augmentation with the Rust validator:

cargo run --release --manifest-path tools\schema_v2_synthetic_augment\Cargo.toml --bin validate_synthetic_aug_jsonl -- `
  --input data\schema_v2_synthetic_aug.jsonl `
  --manifest data\schema_v2_synthetic_aug.manifest.json

Export for Android:

uv run python -m tools.export_onnx --model-dir . --max-length 128 --android-assets-dir ../../scraper/src/main/assets/anime_parser

Codex-Controlled Colab Training

Free Colab cannot be treated as an always-on remote machine. Use it as a short-lived GPU worker only after the user manually opens a Colab runtime and starts the worker cell. Do not assume Codex can wake Colab by itself.

Before relying on the Colab flow, make sure the Colab helper files have been pushed to the Hugging Face model repo, or the user has uploaded them manually: tools/colab_worker.py, tools/colab_client.py, tools/colab_train.py, and colab/.

Ask the user to start a Colab GPU runtime with:

from google.colab import drive
drive.mount("/content/drive")

!git clone --recursive https://huggingface.co/ModerRAS/AniFileBERT /content/AniFileBERT || true
%cd /content/AniFileBERT
!git pull --ff-only || true
!git submodule update --init --recursive
!python -m tools.colab_worker

The worker prints COLAB_WORKER_URL=... and COLAB_WORKER_TOKEN=.... After the user provides those values, set them for local commands:

$env:ANIFILEBERT_COLAB_URL="https://...trycloudflare.com"
$env:ANIFILEBERT_COLAB_TOKEN="..."
python -m tools.colab_client health

Submit the default regex fine-tune:

python -m tools.colab_client submit --profile dmhy_regex_finetune --wait

Submit the character tokenizer run only when intentional:

python -m tools.colab_client submit --profile dmhy_char_train --wait

Useful follow-up commands:

python -m tools.colab_client jobs
python -m tools.colab_client status <job-id>
python -m tools.colab_client logs <job-id> --tail 200
python -m tools.colab_client manifest <job-id>
python -m tools.colab_client cancel <job-id>

The default Colab profiles save checkpoints to Google Drive every 1000 steps and resume with resume_from_checkpoint: "auto". If free Colab disconnects, ask the user to restart the worker, then submit the same profile again. Artifacts land under MyDrive/AniFileBERT/checkpoints/<profile-name>/, and worker logs land under MyDrive/AniFileBERT/worker/jobs/<job-id>/.
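
To illustrate what an "auto" resume typically has to do, the sketch below picks the highest-numbered checkpoint directory under a save directory. It assumes directories are named checkpoint-<step>, which is the Hugging Face Trainer convention; whether the Colab profiles use that naming is an assumption, and latest_checkpoint is a hypothetical helper, not the worker's actual code.

```python
from __future__ import annotations

import re
from pathlib import Path

# Assumed naming: checkpoint-1000, checkpoint-2000, ...
_CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")


def latest_checkpoint(save_dir: str | Path) -> Path | None:
    """Return the checkpoint directory with the largest step, or None if there is none."""
    root = Path(save_dir)
    if not root.is_dir():
        return None
    best: Path | None = None
    best_step = -1
    for child in root.iterdir():
        match = _CHECKPOINT_RE.match(child.name)
        if match and child.is_dir() and int(match.group(1)) > best_step:
            best_step = int(match.group(1))
            best = child
    return best
```

Steps are compared numerically, so checkpoint-10000 ranks above checkpoint-9000, which a plain string sort would get wrong.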

Validation Expectations

  • For parser or tokenizer changes, run uv run python -m anifilebert.inference --model-dir . ... with at least one realistic filename.
  • Run uv run python -m tools.evaluate_parser_cases --model-dir . --case-file data/parser_regression_cases.json --output reports/case_metrics.json before publishing parser changes.
  • For dataset alignment, tokenizer, model, or training-loop changes, run uv run python -m tools.test_train_small --limit-samples 5000 --epochs 2 when practical.
  • For Rust encoded-cache changes, run cargo check --manifest-path tools\encoded_dataset_cache\Cargo.toml, generate a small cache with --limit-rows, and verify python -m anifilebert.train can start with --encoded-cache-dir.
  • For schema v2 synthetic augmentation changes, prefer Rust tools over Python: run cargo test --manifest-path tools\schema_v2_synthetic_augment\Cargo.toml, generate a small smoke JSONL, and validate it with the Rust validate_synthetic_aug_jsonl binary. Confirm the manifest reports separate path_series_rows, path_movie_rows, path_special_rows, path_confuser_rows, and dropped_media_kind_mismatch.
  • For export changes, run uv run python -m tools.export_onnx ... and confirm the exporter reports a small PyTorch/ONNX logits difference.
  • For performance-sensitive inference changes, run uv run python -m tools.benchmark_inference ... and update reports/benchmark_results.json plus the README performance table.
  • Full training is expensive; do not start long multi-epoch runs unless the task explicitly requires it.
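
The manifest check for schema v2 synthetic augmentation can be scripted as a quick supplement to the Rust validator. The sketch below assumes the manifest is a JSON object with the five counters as top-level keys; the real layout may nest them differently, and the helper names are hypothetical.

```python
from __future__ import annotations

import json
from pathlib import Path

# Counter names taken from the validation bullet above.
EXPECTED_COUNTERS = (
    "path_series_rows",
    "path_movie_rows",
    "path_special_rows",
    "path_confuser_rows",
    "dropped_media_kind_mismatch",
)


def load_manifest(path: str | Path) -> dict:
    """Read a manifest JSON file as UTF-8."""
    with Path(path).open(encoding="utf-8") as fh:
        return json.load(fh)


def missing_manifest_counters(manifest: dict) -> list[str]:
    """Return the expected counters absent from the manifest.

    Assumption: counters are top-level keys of the manifest object.
    """
    return [key for key in EXPECTED_COUNTERS if key not in manifest]


if __name__ == "__main__":
    manifest = load_manifest("data/schema_v2_synthetic_aug.manifest.json")
    print("missing counters:", missing_manifest_counters(manifest) or "none")
```

An empty result only shows that the counters are reported separately. It does not replace validate_synthetic_aug_jsonl, which checks the JSONL rows themselves.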

Data And Artifact Rules

  • Avoid committing generated checkpoint directories such as checkpoints/, test_checkpoints*/, and ab_checkpoints*/.
  • Most data/**/*.jsonl files are generated and ignored. The small checked-in fixtures are data/synthetic_small.jsonl and data/test_smoke.jsonl.
  • Rust encoded dataset caches under data/encoded_cache/ are generated artifacts and should not be committed.
  • For real training, choose exactly one current dataset: datasets/AnimeName/dmhy_weak.jsonl for regex tokenization or datasets/AnimeName/dmhy_weak_char.jsonl for character tokenization. Synthetic augmentation JSONL such as data/schema_v2_synthetic_aug.jsonl should be mixed in as an independent augmentation source, not treated as a replacement for the authoritative dataset. Treat mixed_train.jsonl, ab_mix_100k.jsonl, and other alternate JSONL files as legacy unless a task explicitly asks to inspect them.
  • The published default checkpoint is the character tokenizer variant with max_seq_length=128. Keep vocab.json, vocab.char.json, config.json, ONNX export, Android assets, and docs synchronized.
  • Large binary artifacts are tracked through Git LFS by .gitattributes. Preserve LFS handling for .safetensors, .onnx, .bin, and related model files.
  • When publishing a new checkpoint, copy the final checkpoint files to the repository root and reports as described in docs/maintenance.md.
  • When updating datasets/AnimeName, commit the submodule pointer in this repo and then update the parent MiruPlay submodule pointer.
  • Push LFS objects before pushing Git commits when model or ONNX artifacts changed: git lfs push origin main --all, then git push origin main.
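
The "exactly one current dataset" rule can be encoded as a small lookup so scripts cannot silently fall back to a legacy JSONL file. This is a sketch only: primary_dataset is a hypothetical helper, and the key "regex" is an assumed tokenizer name (the commands above confirm only --tokenizer char).

```python
# Paths taken from the dataset rule above; the keys are tokenizer names.
DATASET_BY_TOKENIZER = {
    "regex": "datasets/AnimeName/dmhy_weak.jsonl",      # "regex" is an assumed name
    "char": "datasets/AnimeName/dmhy_weak_char.jsonl",
}


def primary_dataset(tokenizer: str) -> str:
    """Return the single authoritative dataset path for a tokenizer (hypothetical helper)."""
    try:
        return DATASET_BY_TOKENIZER[tokenizer]
    except KeyError:
        known = ", ".join(sorted(DATASET_BY_TOKENIZER))
        raise ValueError(f"unknown tokenizer {tokenizer!r}; expected one of: {known}") from None


if __name__ == "__main__":
    print(primary_dataset("char"))
```

Raising on an unknown tokenizer, rather than returning a default, keeps legacy files such as mixed_train.jsonl out of real training unless a task names them explicitly.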

Coding Notes

  • Keep the custom tokenizer contract stable: Android runtime tokenization must continue to match the exported vocabulary and model metadata.
  • Preserve label names and BIO behavior unless a task explicitly changes the model schema; Android expects the current fields for title, season, episode, group, resolution, source, and special tags.
  • Prefer deterministic dataset and training changes. Keep seed handling intact.
  • Use UTF-8 for files that contain Japanese, Chinese, or release-name examples.
  • Keep command examples Windows-friendly where paths reference MiruPlay.
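
For readers unfamiliar with the BIO behavior mentioned above, the sketch below shows generic BIO decoding: a B- tag opens a span, matching I- tags extend it, and anything else closes it. The label names TITLE, SEASON, and EPISODE are illustrative placeholders; the real names come from label_schema.json, and this is not the decoder used by anifilebert or the Android runtime.

```python
from __future__ import annotations


def bio_to_spans(tokens: list[str], tags: list[str]) -> list[tuple[str, str]]:
    """Group tokens into (label, text) spans using standard BIO rules."""
    spans: list[tuple[str, str]] = []
    label: str | None = None
    parts: list[str] = []

    def close() -> None:
        # Emit the open span, if any, and reset the accumulator.
        nonlocal label, parts
        if label is not None:
            spans.append((label, "".join(parts)))
        label, parts = None, []

    for token, tag in zip(tokens, tags):
        prefix, _, name = tag.partition("-")
        if prefix == "B" and name:
            close()                      # a new B- tag always starts a fresh span
            label, parts = name, [token]
        elif prefix == "I" and name and name == label:
            parts.append(token)          # continue the current span
        else:
            close()                      # "O" or a stray I- tag ends the span
    close()
    return spans


if __name__ == "__main__":
    tokens = ["Witch", ".", "Hat", ".", "S01", "E07"]
    tags = ["B-TITLE", "I-TITLE", "I-TITLE", "O", "B-SEASON", "B-EPISODE"]
    print(bio_to_spans(tokens, tags))
```

In this sketch a stray I- tag with no open span is dropped. Changing that rule would change which fields are emitted, which is one reason the note above asks to preserve BIO behavior unless the schema change is intentional.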