# AniFileBERT encoded dataset cache

Builds split train/eval `.npy` shard caches for `anifilebert.train`.

The tool mirrors the Python char-tokenizer training encoder for JSONL rows with `filename`, `tokens`, and `labels`, including projection from source tokens to character labels and the structural media-label repairs used by training.

Example:

```powershell
cargo run --release --manifest-path tools\encoded_dataset_cache\Cargo.toml -- `
  --input data\schema_v2_hard_focus_char_seed63.jsonl `
  --vocab-file datasets\AnimeName\vocab.char.json `
  --label-schema-file label_schema.json `
  --output-dir data\encoded_cache\schema_v2_hard_focus_char_seed63 `
  --max-length 128 `
  --train-split 0.95 `
  --seed 63 `
  --shard-size 25000 `
  --threads 16
```

Use the cache in training:

```powershell
.\.venv\Scripts\python.exe -m anifilebert.train `
  --tokenizer char `
  --data-file data\schema_v2_hard_focus_char_seed63.jsonl `
  --vocab-file datasets\AnimeName\vocab.char.json `
  --encoded-cache-dir data\encoded_cache\schema_v2_hard_focus_char_seed63 `
  --max-seq-length 128
```
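
The projection from source tokens to character labels mentioned above can be pictured with a small Python sketch. This is an illustration only, not the encoder this tool or `anifilebert.train` actually uses: the function name `project_to_char_labels`, the label names (`GROUP`, `TITLE`, `EPISODE`), the `O` filler label, and the left-to-right substring matching are all assumptions made for the example. It also skips vocabulary lookup, padding to the maximum length, and the structural media-label repairs.

```python
import json


def project_to_char_labels(filename, tokens, labels, outside="O"):
    """Give every character of `filename` the label of the token covering it.

    Hypothetical helper: tokens are located left to right as substrings,
    and characters not covered by any token receive `outside`.
    """
    if len(tokens) != len(labels):
        raise ValueError("tokens and labels must have the same length")

    char_labels = [outside] * len(filename)
    cursor = 0  # search position, so repeated tokens match in order
    for token, label in zip(tokens, labels):
        start = filename.find(token, cursor)
        if start < 0:
            raise ValueError(f"token {token!r} not found after offset {cursor}")
        end = start + len(token)
        char_labels[start:end] = [label] * len(token)
        cursor = end
    return char_labels


# One JSONL row with the three fields the tool reads (values are made up).
row = json.loads(
    '{"filename": "[Sub] Show - 01.mkv",'
    ' "tokens": ["Sub", "Show", "01"],'
    ' "labels": ["GROUP", "TITLE", "EPISODE"]}'
)
char_labels = project_to_char_labels(row["filename"], row["tokens"], row["labels"])

for ch, label in zip(row["filename"], char_labels):
    print(repr(ch), label)
```

In this sketch the brackets, spaces, hyphen, and file extension fall outside every token and keep the filler label, so the output has exactly one label per character of the filename.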