# Repository Guidelines

This repository is `AniFileBERT`, the Python model, dataset, training, inference,
and ONNX export workspace used by MiruPlay as `tools/anime_parser`.

## Project Shape

- Root model artifacts (`config.json`, `model.safetensors`, `vocab.json`,
  `tokenizer_config.json`, `training_args.bin`) are the published default
  checkpoint.
- Core parser/training code lives in `anifilebert/`.
- Command-line tools live in `tools/`, including ONNX export, fixed-case
  evaluation, benchmarks, dataset relabeling, dataset generation, and Colab
  helpers.
- `datasets/AnimeName` is a nested dataset submodule and should be treated as
  the authoritative dataset snapshot when present. Use either
  `dmhy_weak.jsonl` for the regex tokenizer or `dmhy_weak_char.jsonl` for the
  character tokenizer; the other dataset files are legacy snapshots.
- `exports/` contains Android-facing ONNX artifacts. Keep it in sync when
  changing export behavior or the published checkpoint.

## Setup

```bash
uv sync
```

Use `uv run`, `uv add`, and `uv sync` for environment operations. Do not use
global `pip` for repository work.

If the dataset submodule is missing, initialize it:

```bash
git submodule update --init --recursive
```

## Common Commands

Run a parser smoke check:

```bash
uv run python -m anifilebert.inference --model-dir . "Witch.Hat.Atelier.S01E07.1080p.NF.WEB-DL.JPN.AAC2.0.H.264.MSubs-ToonsHub"
```

Run fixed real-world parser regression:

```bash
uv run python -m tools.evaluate_parser_cases --model-dir . --case-file data/parser_regression_cases.json --output reports/case_metrics.json
```

Benchmark PyTorch and ONNX Runtime inference:

```bash
uv run python -m tools.benchmark_inference --model-dir . --onnx exports/anime_filename_parser.onnx --case-file data/parser_regression_cases.json --repeat 20 --warmup 20 --torch-threads 1 --ort-threads 1 --output reports/benchmark_results.json
```

Train the current default character tokenizer:

```bash
uv run python -m anifilebert.train --tokenizer char --data-file datasets/AnimeName/dmhy_weak_char.jsonl --vocab-file datasets/AnimeName/vocab.char.json --save-dir checkpoints/dmhy-char-full --init-model-dir . --epochs 2 --batch-size 256 --learning-rate 0.00008 --warmup-steps 300 --max-seq-length 128 --train-split 0.98 --num-workers 4 --checkpoint-steps 1000 --save-total-limit 3 --parse-eval-limit 2048 --case-eval-file data/parser_regression_cases.json --seed 52 --experiment-name dmhy-char-full
```

For large generated or hard-focus JSONL files, pre-encode train/eval shards
with Rust before training to avoid the slow Python startup encode path:

```powershell
cargo run --release --manifest-path tools\encoded_dataset_cache\Cargo.toml -- `
  --input data\schema_v2_hard_focus_char_seed63.jsonl `
  --vocab-file datasets\AnimeName\vocab.char.json `
  --label-schema-file label_schema.json `
  --output-dir data\encoded_cache\schema_v2_hard_focus_char_seed63 `
  --max-length 128 `
  --train-split 0.95 `
  --seed 63 `
  --shard-size 25000 `
  --threads 16
```

Then pass the generated cache to training with the same data file, vocab, max
length, split ratio, and seed that were used to build it:

```powershell
.\.venv\Scripts\python.exe -m anifilebert.train --tokenizer char `
  --data-file data\schema_v2_hard_focus_char_seed63.jsonl `
  --vocab-file datasets\AnimeName\vocab.char.json `
  --encoded-cache-dir data\encoded_cache\schema_v2_hard_focus_char_seed63 `
  --max-seq-length 128 --train-split 0.95 --seed 63
```

Do not combine `--encoded-cache-dir` with `--extra-data-file`,
`--limit-samples`, `--rebuild-vocab`, training-time augmentation, or
`--apply-label-repairs`. Regenerate the cache after changing the JSONL, vocab,
label schema, max length, split ratio, or seed.
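
One way to notice a stale cache is to store a fingerprint of those inputs next to the cache and compare it before training. The sketch below is only an illustration under that assumption: `file_digest` and `cache_fingerprint` are hypothetical helpers, not part of `anifilebert.train` or the Rust `encoded_dataset_cache` tool, which may track staleness differently or not at all.

```python
import hashlib
import json
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cache_fingerprint(data_file, vocab_file, label_schema_file,
                      max_length, train_split, seed) -> str:
    """Hash every input whose change should force a cache rebuild.

    Hypothetical helper: the argument list mirrors the inputs named in the
    paragraph above, nothing more.
    """
    payload = {
        "data": file_digest(Path(data_file)),
        "vocab": file_digest(Path(vocab_file)),
        "label_schema": file_digest(Path(label_schema_file)),
        "max_length": max_length,
        "train_split": train_split,
        "seed": seed,
    }
    encoded = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()
```

If the fingerprint stored with a cache differs from one computed just before training, the cache should be regenerated.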

Generate schema v2 synthetic augmentation with Rust. This is an independent
augmentation chain and must not rewrite the authoritative DMHY dataset or the
main DMHY template-application flow:

```powershell
cargo run --release --manifest-path tools\schema_v2_synthetic_augment\Cargo.toml --bin schema_v2_synthetic_augment -- `
  --recipes reports\dmhy_template_recipes.full_top5000.seed.jsonl `
  --label-schema-file label_schema.json `
  --numeric-title-seeds data\synthetic_numeric_titles.txt `
  --path-prefix-seeds data\synthetic_path_prefixes.txt `
  --limit-templates 3000 `
  --max-rows 50000 `
  --output data\schema_v2_synthetic_aug.jsonl `
  --manifest-output data\schema_v2_synthetic_aug.manifest.json
```

Validate the generated augmentation with the Rust validator:

```powershell
cargo run --release --manifest-path tools\schema_v2_synthetic_augment\Cargo.toml --bin validate_synthetic_aug_jsonl -- `
  --input data\schema_v2_synthetic_aug.jsonl `
  --manifest data\schema_v2_synthetic_aug.manifest.json
```
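
In addition to the Rust validator, a quick look at the manifest can confirm that the counters named under Validation Expectations are present. This is a minimal sketch that assumes those counters are top-level keys of the manifest JSON; the real manifest layout may differ, and `missing_counters` and `load_manifest` are hypothetical names.

```python
import json

# Counter names taken from the validation expectations in this document.
# Assumption: they appear as top-level keys of the manifest JSON.
EXPECTED_COUNTERS = (
    "path_series_rows",
    "path_movie_rows",
    "path_special_rows",
    "path_confuser_rows",
    "dropped_media_kind_mismatch",
)


def missing_counters(manifest: dict) -> list[str]:
    """Return the expected counter names that are absent from the manifest."""
    return [name for name in EXPECTED_COUNTERS if name not in manifest]


def load_manifest(path: str) -> dict:
    """Read a manifest file as UTF-8 JSON."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

For example, `missing_counters(load_manifest("data/schema_v2_synthetic_aug.manifest.json"))` should return an empty list when every counter is reported.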

The preferred synthetic follow-up is a second training stage that starts from
the best repaired hard-focus checkpoint; it does not replace hard-focus
training. Keep this path Rust-cache-first: build one combined encoded cache
from the hard-focus JSONL plus the synthetic JSONL, then train from that cache.
Do not pass `--extra-data-file` to `anifilebert.train` together with
`--encoded-cache-dir`.

Use the local wrapper, which calls Rust `tools/encoded_dataset_cache` with
multiple `--input` values and then launches `anifilebert.train` against the
combined cache:

```powershell
.\.venv\Scripts\python.exe -m tools.train_schema_v2_synthetic
```

The wrapper defaults to:

- primary data: `data\schema_v2_hard_focus_char_seed63.jsonl`
- synthetic data: `data\schema_v2_synthetic_aug.jsonl`
- synthetic repeat: `3`
- encoded cache: `data\encoded_cache\schema_v2_hard_focus_seed63_synth_pathleaf_repeat3`
- init checkpoint: `checkpoints\ablation-schema-v2-hardfocus-cache-repaired-from-baseline-seed62-10epoch-rerun\final`
- output checkpoint: `checkpoints\schema-v2-best-hardfocus-synth-pathleaf-cache`

Use `--force-cache` to rebuild the combined cache after changing either JSONL
file, the vocab, label schema, max length, split ratio, seed, or repeat count.

For background local runs, inspect progress and metrics with:

```powershell
.\.venv\Scripts\python.exe -m tools.training_status `
  --name schema_v2_cached_wrapper_train_skipcache `
  --metrics reports\schema_v2_best_hardfocus_synth_pathleaf_cache_case_metrics.json `
  --metrics checkpoints\schema-v2-best-hardfocus-synth-pathleaf-cache\final\parse_eval_metrics.json
```

Export for Android:

```bash
uv run python -m tools.export_onnx --model-dir . --max-length 128 --android-assets-dir ../../scraper/src/main/assets/anime_parser
```

## Codex-Controlled Colab Training

Free Colab cannot be treated as an always-on remote machine. Use it as a
short-lived GPU worker only after the user manually opens a Colab runtime and
starts the worker cell. Do not assume Codex can wake Colab by itself.

Before relying on the Colab flow, make sure the Colab helper files have been
pushed to the Hugging Face model repo, or the user has uploaded them manually:
`tools/colab_worker.py`, `tools/colab_client.py`, `tools/colab_train.py`, and `colab/`.

Ask the user to start a Colab GPU runtime with:

```python
from google.colab import drive
drive.mount("/content/drive")

!git clone --recursive https://huggingface.co/ModerRAS/AniFileBERT /content/AniFileBERT || true
%cd /content/AniFileBERT
!git pull --ff-only || true
!git submodule update --init --recursive
!python -m tools.colab_worker
```

The worker prints `COLAB_WORKER_URL=...` and `COLAB_WORKER_TOKEN=...`. After
the user provides those values, set them for local commands:

```powershell
$env:ANIFILEBERT_COLAB_URL="https://...trycloudflare.com"
$env:ANIFILEBERT_COLAB_TOKEN="..."
python -m tools.colab_client health
```
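
Before calling the client, it can help to confirm that both variables are actually set in the current shell. The following is a small sketch, not part of `tools.colab_client`; `missing_colab_settings` is a hypothetical helper that only inspects the two variable names shown above.

```python
import os

# Variable names as used in the commands above.
REQUIRED_VARS = ("ANIFILEBERT_COLAB_URL", "ANIFILEBERT_COLAB_TOKEN")


def missing_colab_settings(environ=None) -> list[str]:
    """Return the required variable names that are unset or blank."""
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]
```

An empty result means both values are available to local commands.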

Submit the default regex fine-tune:

```powershell
python -m tools.colab_client submit --profile dmhy_regex_finetune --wait
```

Submit the character tokenizer run only when intentional:

```powershell
python -m tools.colab_client submit --profile dmhy_char_train --wait
```

Useful follow-up commands:

```powershell
python -m tools.colab_client jobs
python -m tools.colab_client status <job-id>
python -m tools.colab_client logs <job-id> --tail 200
python -m tools.colab_client manifest <job-id>
python -m tools.colab_client cancel <job-id>
```

The default Colab profiles save checkpoints to Google Drive every 1000 steps
and resume with `resume_from_checkpoint: "auto"`, so if free Colab disconnects,
ask the user to restart the worker and submit the same profile again. Artifacts
land under `MyDrive/AniFileBERT/checkpoints/<profile-name>/`, and worker logs
land under `MyDrive/AniFileBERT/worker/jobs/<job-id>/`.
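
To illustrate what an "auto" resume has to do, the sketch below picks the newest checkpoint in a save directory. It assumes checkpoint directories are named `checkpoint-<step>`, the Hugging Face Trainer convention; this repository's training loop may name or select checkpoints differently, and `latest_checkpoint` is a hypothetical helper.

```python
import re
from pathlib import Path
from typing import Optional

# Assumed naming scheme: checkpoint-<step>, e.g. checkpoint-3000.
CHECKPOINT_PATTERN = re.compile(r"^checkpoint-(\d+)$")


def latest_checkpoint(save_dir: Path) -> Optional[Path]:
    """Return the checkpoint directory with the highest step, or None."""
    if not save_dir.is_dir():
        return None
    best_step = -1
    best_path = None
    for child in save_dir.iterdir():
        match = CHECKPOINT_PATTERN.match(child.name)
        # Compare steps numerically so checkpoint-20000 beats checkpoint-3000.
        if child.is_dir() and match and int(match.group(1)) > best_step:
            best_step = int(match.group(1))
            best_path = child
    return best_path
```

Comparing the step numbers as integers matters because a plain string sort would rank `checkpoint-3000` above `checkpoint-20000`.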

## Validation Expectations

- For parser or tokenizer changes, run `uv run python -m anifilebert.inference --model-dir . ...`
  with at least one realistic filename.
- Run `uv run python -m tools.evaluate_parser_cases --model-dir . --case-file data/parser_regression_cases.json --output reports/case_metrics.json`
  before publishing parser changes.
- For dataset alignment, tokenizer, model, or training-loop changes, run
  `uv run python -m tools.test_train_small --limit-samples 5000 --epochs 2` when practical.
- For Rust encoded-cache changes, run `cargo check --manifest-path tools\encoded_dataset_cache\Cargo.toml`,
  generate a small cache with `--limit-rows`, and verify that `uv run python -m anifilebert.train`
  starts with `--encoded-cache-dir`.
- For schema v2 synthetic augmentation changes, prefer Rust tools over Python:
  run `cargo test --manifest-path tools\schema_v2_synthetic_augment\Cargo.toml`,
  generate a small smoke JSONL, and validate it with the Rust
  `validate_synthetic_aug_jsonl` binary. Confirm the manifest reports separate
  `path_series_rows`, `path_movie_rows`, `path_special_rows`,
  `path_confuser_rows`, and `dropped_media_kind_mismatch`.
- For export changes, run `uv run python -m tools.export_onnx ...` and confirm the exporter
  reports a small PyTorch/ONNX logits difference.
- For performance-sensitive inference changes, run `uv run python -m tools.benchmark_inference ...`
  and update `reports/benchmark_results.json` plus the README performance table.
- Full training is expensive; do not start long multi-epoch runs unless the
  task explicitly requires it.
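
As an illustration of the PyTorch/ONNX logits check mentioned above, the largest absolute element-wise difference between two logit tables can be computed as follows. This is a standard-library sketch, not the exporter's own code; the function names and the `1e-4` tolerance are assumptions chosen for the example, and the exporter may use a different threshold.

```python
def max_abs_diff(reference, candidate) -> float:
    """Largest absolute element-wise difference between two 2-D logit tables."""
    if len(reference) != len(candidate):
        raise ValueError("row count mismatch")
    worst = 0.0
    for ref_row, cand_row in zip(reference, candidate):
        if len(ref_row) != len(cand_row):
            raise ValueError("column count mismatch")
        for ref_value, cand_value in zip(ref_row, cand_row):
            worst = max(worst, abs(ref_value - cand_value))
    return worst


def logits_match(reference, candidate, tolerance: float = 1e-4) -> bool:
    """True when the two tables agree within the tolerance.

    The default tolerance is illustrative, not a value from the exporter.
    """
    return max_abs_diff(reference, candidate) <= tolerance
```

Here `reference` and `candidate` are nested lists shaped `[tokens][labels]`, for example the PyTorch and ONNX Runtime outputs for the same filename converted with `.tolist()`.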

## Data And Artifact Rules

- Avoid committing generated checkpoint directories such as `checkpoints/`,
  `test_checkpoints*/`, and `ab_checkpoints*/`.
- Most `data/**/*.jsonl` files are generated and ignored. The small checked-in
  fixtures are `data/synthetic_small.jsonl` and `data/test_smoke.jsonl`.
- Rust encoded dataset caches under `data/encoded_cache/` are generated
  artifacts and should not be committed.
- For real training, choose exactly one current dataset:
  `datasets/AnimeName/dmhy_weak.jsonl` for regex tokenization or
  `datasets/AnimeName/dmhy_weak_char.jsonl` for character tokenization.
  Synthetic augmentation JSONL such as `data/schema_v2_synthetic_aug.jsonl`
  should be mixed in as an independent augmentation source, not treated as a
  replacement for the authoritative dataset. Treat `mixed_train.jsonl`,
  `ab_mix_100k.jsonl`, and other alternate JSONL files as legacy unless a task
  explicitly asks to inspect them.
- The published default checkpoint is the character tokenizer variant with
  `max_seq_length=128`. Keep `vocab.json`, `vocab.char.json`, `config.json`,
  ONNX export, Android assets, and docs synchronized.
- Large binary artifacts are tracked through Git LFS by `.gitattributes`.
  Preserve LFS handling for `.safetensors`, `.onnx`, `.bin`, and related model
  files.
- When publishing a new checkpoint, copy the final checkpoint files to the
  repository root and reports as described in `docs/maintenance.md`.
- When updating `datasets/AnimeName`, commit the submodule pointer in this repo
  and then update the parent MiruPlay submodule pointer.
- Push LFS objects before pushing Git commits when model or ONNX artifacts
  changed: `git lfs push origin main --all`, then `git push origin main`.
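
The "exactly one current dataset" rule above can be expressed as a small lookup. This is a sketch for illustration only: `dataset_for` is a hypothetical helper, and the key `"regex"` is an assumed tokenizer name (only `char` appears as a `--tokenizer` value in this document).

```python
from pathlib import Path

DATASET_DIR = Path("datasets/AnimeName")

# One authoritative JSONL per tokenizer, as listed in the rules above.
DATASET_BY_TOKENIZER = {
    "regex": DATASET_DIR / "dmhy_weak.jsonl",      # assumed tokenizer name
    "char": DATASET_DIR / "dmhy_weak_char.jsonl",
}


def dataset_for(tokenizer: str) -> Path:
    """Return the single current dataset file for the given tokenizer."""
    try:
        return DATASET_BY_TOKENIZER[tokenizer]
    except KeyError:
        known = ", ".join(sorted(DATASET_BY_TOKENIZER))
        raise ValueError(
            f"unknown tokenizer {tokenizer!r}; expected one of: {known}"
        ) from None
```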

## Coding Notes

- Keep the custom tokenizer contract stable: Android runtime tokenization must
  continue to match the exported vocabulary and model metadata.
- Preserve label names and BIO behavior unless a task explicitly changes the
  model schema; Android expects the current fields for title, season, episode,
  group, resolution, source, and special tags.
- Prefer deterministic dataset and training changes. Keep seed handling intact.
- Use UTF-8 for files that contain Japanese, Chinese, or release-name examples.
- Keep command examples Windows-friendly where paths reference MiruPlay.
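
To make the BIO behavior mentioned above concrete, here is a minimal sketch of how BIO tags are usually grouped into labeled spans. It is not the repository's decoder: `decode_bio` is a hypothetical function, the label names in the usage note are placeholders, and joining tokens without a separator assumes character-level tokens.

```python
def decode_bio(tokens, tags):
    """Group tokens into (label, text) spans following BIO tagging rules."""
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")
    spans = []
    current_label = None
    current_tokens = []
    for token, tag in zip(tokens, tags):
        prefix, _, label = tag.partition("-")
        if prefix == "I" and label == current_label:
            # Continuation of the span that is currently open.
            current_tokens.append(token)
            continue
        if current_label is not None:
            # Any other tag closes the open span.
            spans.append((current_label, "".join(current_tokens)))
        if prefix in ("B", "I") and label:
            # A stray I- tag without a matching B- starts a new span.
            current_label, current_tokens = label, [token]
        else:
            current_label, current_tokens = None, []
    if current_label is not None:
        spans.append((current_label, "".join(current_tokens)))
    return spans
```

With the placeholder labels `TITLE` and `EPISODE`, calling `decode_bio(list("Ab 01"), ["B-TITLE", "I-TITLE", "O", "B-EPISODE", "I-EPISODE"])` yields one `TITLE` span and one `EPISODE` span. Renaming labels or changing how `B-`/`I-` prefixes are emitted would change this grouping, which is why the schema should stay stable for Android.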