# AniFileBERT encoded dataset cache

Builds split train/eval `.npy` shard caches for `anifilebert.train`.

The tool mirrors the Python char-tokenizer training encoder for JSONL rows with
`filename`, `tokens`, and `labels`, including projection from source tokens to
character labels and the structural media-label repairs used by training.

Example:

```powershell
cargo run --release --manifest-path tools\encoded_dataset_cache\Cargo.toml -- `
  --input data\schema_v2_hard_focus_char_seed63.jsonl `
  --vocab-file datasets\AnimeName\vocab.char.json `
  --label-schema-file label_schema.json `
  --output-dir data\encoded_cache\schema_v2_hard_focus_char_seed63 `
  --max-length 128 `
  --train-split 0.95 `
  --seed 63 `
  --shard-size 25000 `
  --threads 16
```

Multiple JSONL inputs can be encoded into one deterministic train/eval split.
Pass `--input-repeat` once per `--input` when an augmentation source should be
upweighted:

```powershell
cargo run --release --manifest-path tools\encoded_dataset_cache\Cargo.toml -- `
  --input data\schema_v2_hard_focus_char_seed63.jsonl `
  --input data\schema_v2_synthetic_aug.jsonl `
  --input-repeat 1 `
  --input-repeat 3 `
  --vocab-file datasets\AnimeName\vocab.char.json `
  --label-schema-file label_schema.json `
  --output-dir data\encoded_cache\schema_v2_hard_focus_seed63_synth_pathleaf_repeat3 `
  --max-length 128 `
  --train-split 0.995 `
  --seed 63 `
  --shard-size 25000 `
  --threads 16
```

Use the cache in training:

```powershell
.\.venv\Scripts\python.exe -m anifilebert.train `
  --tokenizer char `
  --data-file data\schema_v2_hard_focus_char_seed63.jsonl `
  --vocab-file datasets\AnimeName\vocab.char.json `
  --encoded-cache-dir data\encoded_cache\schema_v2_hard_focus_char_seed63 `
  --max-seq-length 128
```
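
For readers unfamiliar with the token-to-character label projection mentioned at the top of this README, the following is a minimal Python sketch of the general idea, not the tool's actual implementation. It assumes that each source token appears in `filename` in order, that every character of a token inherits that token's label, and that characters not covered by any token receive an outside label `"O"`. The function name `project_to_char_labels` and the label names `TITLE` and `EPISODE` are hypothetical; the real label set is defined in `label_schema.json`, and the structural media-label repairs are not modelled here.

```python
from typing import List


def project_to_char_labels(
    filename: str,
    tokens: List[str],
    labels: List[str],
    outside: str = "O",  # assumed label for characters no token covers
) -> List[str]:
    """Give each character of `filename` the label of the token covering it."""
    if len(tokens) != len(labels):
        raise ValueError("tokens and labels must have the same length")

    char_labels = [outside] * len(filename)
    cursor = 0  # search position; tokens are assumed to occur in order
    for token, label in zip(tokens, labels):
        start = filename.find(token, cursor)
        if start < 0:
            raise ValueError(f"token {token!r} not found after offset {cursor}")
        end = start + len(token)
        # Every character inside the token span inherits the token's label.
        char_labels[start:end] = [label] * len(token)
        cursor = end
    return char_labels


# Hypothetical row: label names are illustrative only.
example = project_to_char_labels(
    "Show - 01",
    ["Show", "-", "01"],
    ["TITLE", "O", "EPISODE"],
)
```

In this sketch the two spaces around the dash are not part of any token, so they keep the outside label; the real encoder may treat separators differently.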