ModerRAS's picture
Add Rust encoded dataset cache
c705a32
|
raw
history blame
1.09 kB
# AniFileBERT encoded dataset cache
Builds split train/eval `.npy` shard caches for `anifilebert.train`.
The tool mirrors the Python char-tokenizer training encoder for JSONL rows with
`filename`, `tokens`, and `labels`, including projection from source tokens to
character labels and the structural media-label repairs used by training.
Example:
```powershell
cargo run --release --manifest-path tools\encoded_dataset_cache\Cargo.toml -- `
--input data\schema_v2_hard_focus_char_seed63.jsonl `
--vocab-file datasets\AnimeName\vocab.char.json `
--label-schema-file label_schema.json `
--output-dir data\encoded_cache\schema_v2_hard_focus_char_seed63 `
--max-length 128 `
--train-split 0.95 `
--seed 63 `
--shard-size 25000 `
--threads 16
```
Use the cache in training:
```powershell
.\.venv\Scripts\python.exe -m anifilebert.train `
--tokenizer char `
--data-file data\schema_v2_hard_focus_char_seed63.jsonl `
--vocab-file datasets\AnimeName\vocab.char.json `
--encoded-cache-dir data\encoded_cache\schema_v2_hard_focus_char_seed63 `
--max-seq-length 128
```