ModerRAS committed on
Commit beb8c7e · 1 Parent(s): f408729

Add agent repository guidelines

Files changed (1)
  1. AGENTS.md +109 -0
AGENTS.md ADDED
@@ -0,0 +1,109 @@
+ # Repository Guidelines
+
+ This repository is `AniFileBERT`, the Python model, dataset, training, inference,
+ and ONNX export workspace used by MiruPlay as `tools/anime_parser`.
+
+ ## Project Shape
+
+ - Root model artifacts (`config.json`, `model.safetensors`, `vocab.json`,
+ `tokenizer_config.json`, `training_args.bin`) are the published default
+ checkpoint.
+ - Core code lives in `train.py`, `dataset.py`, `tokenizer.py`, `model.py`,
+ `inference.py`, and `export_onnx.py`.
+ - Dataset generation and labeling helpers live in `data_generator.py`,
+ `dmhy_dataset.py`, `mix_datasets.py`, `llm_labeler.py`,
+ `semantic_labeler.py`, and `convert_to_char_dataset.py`.
+ - `datasets/AnimeName` is a nested dataset submodule and should be treated as
+ the authoritative dataset snapshot when present. Use either
+ `dmhy_weak.jsonl` for the regex tokenizer or `dmhy_weak_char.jsonl` for the
+ character tokenizer; the other dataset files are legacy snapshots.
+ - `exports/` contains Android-facing ONNX artifacts. Keep it in sync when
+ changing export behavior or the published checkpoint.
+
+ ## Setup
+
+ ```bash
+ python -m pip install -r requirements.txt
+ ```
+
+ For local GPU training, install a CUDA-compatible PyTorch build first, then
+ install the remaining requirements.
+
+ If the dataset submodule is missing, initialize it:
+
+ ```bash
+ git submodule update --init --recursive
+ ```
+
+ ## Common Commands
+
+ Run a parser smoke check:
+
+ ```bash
+ python inference.py --model-dir . "Witch.Hat.Atelier.S01E07.1080p.NF.WEB-DL.JPN.AAC2.0.H.264.MSubs-ToonsHub"
+ ```
+
+ Run the lightweight training pipeline check:
+
+ ```bash
+ python test_train_small.py --limit-samples 5000 --epochs 2
+ ```
+
+ Train the default regex tokenizer from the dataset submodule:
+
+ ```bash
+ python train.py --data-file datasets/AnimeName/dmhy_weak.jsonl --vocab-file datasets/AnimeName/vocab.json --save-dir checkpoints/dmhy-finetune --init-model-dir . --epochs 1 --batch-size 128 --learning-rate 0.0003 --warmup-steps 300 --seed 42
+ ```
+
+ Train the character tokenizer only when that variant is intentional:
+
+ ```bash
+ python train.py --tokenizer char --data-file datasets/AnimeName/dmhy_weak_char.jsonl --vocab-file datasets/AnimeName/vocab.char.json --save-dir checkpoints/dmhy-weak-char --epochs 1 --batch-size 64 --learning-rate 0.0003 --warmup-steps 300 --max-seq-length 128 --seed 42
+ ```
+
+ Export for Android:
+
+ ```bash
+ python export_onnx.py --model-dir checkpoints/dmhy-finetune/final --android-assets-dir ../../scraper/src/main/assets/anime_parser
+ ```
+
+ ## Validation Expectations
+
+ - For parser or tokenizer changes, run `python inference.py --model-dir . ...`
+ with at least one realistic filename.
+ - For dataset alignment, tokenizer, model, or training-loop changes, run
+ `python test_train_small.py --limit-samples 5000 --epochs 2` when practical.
+ - For export changes, run `python export_onnx.py ...` and confirm the exporter
+ reports a small PyTorch/ONNX logits difference.
+ - Full training is expensive; do not start long multi-epoch runs unless the
+ task explicitly requires it.
+
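The export check in the list above depends on what counts as a "small" logits difference. The following is a minimal sketch, not the exporter's actual code: it compares two logits tables element-wise and reports the largest absolute gap. The names `max_abs_diff` and `logits_match`, the nested-list input shape, and the `1e-4` tolerance are all assumptions made for illustration; `export_onnx.py` may use a different metric or threshold.

```python
from typing import Sequence

# Hypothetical tolerance; the guidelines do not state a numeric threshold.
ASSUMED_TOLERANCE = 1e-4


def max_abs_diff(torch_logits: Sequence[Sequence[float]],
                 onnx_logits: Sequence[Sequence[float]]) -> float:
    """Return the largest absolute element-wise gap between two
    [tokens x labels] logits tables given as nested sequences."""
    if len(torch_logits) != len(onnx_logits):
        raise ValueError("token counts differ between the two outputs")
    worst = 0.0
    for row_a, row_b in zip(torch_logits, onnx_logits):
        if len(row_a) != len(row_b):
            raise ValueError("label counts differ between the two outputs")
        for a, b in zip(row_a, row_b):
            worst = max(worst, abs(a - b))
    return worst


def logits_match(torch_logits: Sequence[Sequence[float]],
                 onnx_logits: Sequence[Sequence[float]],
                 tolerance: float = ASSUMED_TOLERANCE) -> bool:
    """True when the two outputs agree within the assumed tolerance."""
    return max_abs_diff(torch_logits, onnx_logits) <= tolerance
```

In practice the two inputs would come from running the same tokenized filename through the PyTorch checkpoint and the exported ONNX graph; converting both outputs to nested lists keeps the comparison independent of either runtime.
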
+ ## Data And Artifact Rules
+
+ - Avoid committing generated checkpoint directories such as `checkpoints/`,
+ `test_checkpoints*/`, and `ab_checkpoints*/`.
+ - Most `data/**/*.jsonl` files are generated and ignored. The small checked-in
+ fixtures are `data/synthetic_small.jsonl` and `data/test_smoke.jsonl`.
+ - For real training, choose exactly one current dataset:
+ `datasets/AnimeName/dmhy_weak.jsonl` for regex tokenization or
+ `datasets/AnimeName/dmhy_weak_char.jsonl` for character tokenization.
+ Treat `mixed_train.jsonl`, `ab_mix_100k.jsonl`, and other alternate JSONL
+ files as legacy unless a task explicitly asks to inspect them.
+ - Large binary artifacts are tracked through Git LFS by `.gitattributes`.
+ Preserve LFS handling for `.safetensors`, `.onnx`, `.bin`, and related model
+ files.
+ - When publishing a new checkpoint, copy the final checkpoint files to the
+ repository root as described in `MAINTENANCE.md`.
+ - When updating `datasets/AnimeName`, commit the submodule pointer in this repo
+ and then update the parent MiruPlay submodule pointer.
+
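The "exactly one current dataset" rule can be made mechanical. This is a hedged sketch rather than code from the repository: the helper `train_args` and the `CURRENT_DATASETS` table are hypothetical names, while the file paths and the `--tokenizer`, `--data-file`, and `--vocab-file` flags are copied from the Common Commands section. It assumes the regex variant is selected by omitting `--tokenizer`, as the regex training command there does.

```python
from pathlib import PurePosixPath
from typing import List

DATASET_ROOT = PurePosixPath("datasets/AnimeName")

# Tokenizer variant -> (data file, vocab file), as paired in Common Commands.
CURRENT_DATASETS = {
    "regex": ("dmhy_weak.jsonl", "vocab.json"),
    "char": ("dmhy_weak_char.jsonl", "vocab.char.json"),
}


def train_args(tokenizer: str) -> List[str]:
    """Build the dataset-related part of a train.py command line.

    Anything outside CURRENT_DATASETS (for example a legacy mixed file)
    is rejected instead of being passed through silently.
    """
    if tokenizer not in CURRENT_DATASETS:
        raise ValueError(f"unknown tokenizer variant: {tokenizer!r}")
    data_name, vocab_name = CURRENT_DATASETS[tokenizer]
    args = ["train.py"]
    if tokenizer == "char":
        # Assumption: only the character variant needs an explicit flag.
        args += ["--tokenizer", "char"]
    args += [
        "--data-file", str(DATASET_ROOT / data_name),
        "--vocab-file", str(DATASET_ROOT / vocab_name),
    ]
    return args
```

Keeping the data file and vocabulary paired in one table avoids the easy mistake of training the character tokenizer against the regex vocabulary, or the reverse.
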
+ ## Coding Notes
+
+ - Keep the custom tokenizer contract stable: Android runtime tokenization must
+ continue to match the exported vocabulary and model metadata.
+ - Preserve label names and BIO behavior unless a task explicitly changes the
+ model schema; Android expects the current fields for title, season, episode,
+ group, resolution, source, and special tags.
+ - Prefer deterministic dataset and training changes. Keep seed handling intact.
+ - Use UTF-8 for files that contain Japanese, Chinese, or release-name examples.
+ - Keep command examples Windows-friendly where paths reference MiruPlay.
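
To make the "BIO behavior" note concrete, here is a minimal sketch of how BIO tags are conventionally folded back into fields. It is not the repository's decoder: the function `decode_bio`, the label spellings such as `B-TITLE` and `I-TITLE`, and the choice to join tokens with an empty string are all assumptions, since the actual label schema is not shown in these guidelines.

```python
from typing import Dict, List, Optional, Sequence


def decode_bio(tokens: Sequence[str],
               labels: Sequence[str]) -> Dict[str, List[str]]:
    """Group tokens into spans per field by following BIO tags.

    "B-X" opens a span for field X, "I-X" extends an open X span, and
    "O" (or an I- tag that does not match the open span) closes it.
    """
    if len(tokens) != len(labels):
        raise ValueError("tokens and labels must have the same length")

    fields: Dict[str, List[str]] = {}
    current_field: Optional[str] = None
    current_tokens: List[str] = []

    def flush() -> None:
        # Store the open span, if any, and reset the accumulator.
        nonlocal current_field, current_tokens
        if current_field is not None:
            fields.setdefault(current_field, []).append("".join(current_tokens))
        current_field, current_tokens = None, []

    for token, label in zip(tokens, labels):
        prefix, _, name = label.partition("-")
        if prefix == "B" and name:
            flush()
            current_field, current_tokens = name, [token]
        elif prefix == "I" and name and name == current_field:
            current_tokens.append(token)
        else:
            flush()
    flush()
    return fields
```

Any change to label names or to how stray `I-` tags are handled would alter the output of a decoder like this one, which is why the note above asks for the schema to stay fixed unless a task explicitly changes it.
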