# AniFileBERT encoded dataset cache

Builds sharded `.npy` caches, split into train and eval sets, for `anifilebert.train`.

The tool mirrors the Python char-tokenizer training encoder for JSONL rows with
`filename`, `tokens`, and `labels`, including projection from source tokens to
character labels and the structural media-label repairs used by training.
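
To illustrate what projecting token labels onto characters means, here is a minimal Python sketch. It is not the tool's implementation: it assumes the tokens appear in order inside `filename`, that labels follow a BIO scheme, and that uncovered characters default to `O`. The function name `project_to_char_labels` and the label names in the usage note are hypothetical, and the structural media-label repairs are not shown.

```python
from typing import List


def project_to_char_labels(
    filename: str,
    tokens: List[str],
    labels: List[str],
    fill: str = "O",  # assumed default for characters no token covers
) -> List[str]:
    """Spread one label per source token over the characters it covers."""
    if len(tokens) != len(labels):
        raise ValueError("tokens and labels must have the same length")

    char_labels = [fill] * len(filename)
    cursor = 0
    for token, label in zip(tokens, labels):
        # Locate the token at or after the end of the previous token.
        start = filename.find(token, cursor)
        if start < 0:
            raise ValueError(f"token {token!r} not found at or after offset {cursor}")
        for offset in range(len(token)):
            if label.startswith("B-") and offset > 0:
                # Only the first character keeps the B- prefix (assumed BIO scheme).
                char_labels[start + offset] = "I-" + label[2:]
            else:
                char_labels[start + offset] = label
        cursor = start + len(token)
    return char_labels
```

For example, `project_to_char_labels("Ab 01", ["Ab", " ", "01"], ["B-TITLE", "O", "B-EPISODE"])` yields one label per character, with the second character of each entity switched to its `I-` form.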

Example:

```powershell
cargo run --release --manifest-path tools\encoded_dataset_cache\Cargo.toml -- `
  --input data\schema_v2_hard_focus_char_seed63.jsonl `
  --vocab-file datasets\AnimeName\vocab.char.json `
  --label-schema-file label_schema.json `
  --output-dir data\encoded_cache\schema_v2_hard_focus_char_seed63 `
  --max-length 128 `
  --train-split 0.95 `
  --seed 63 `
  --shard-size 25000 `
  --threads 16
```
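
As a rough mental model of how `--train-split`, `--seed`, and `--shard-size` could interact, the sketch below shuffles row indices with a seed, cuts them into train and eval sets, and chunks each set into shards. This is an assumption about the general technique only. The tool's actual random number generator, rounding, and shard layout are not documented here, so this Python code will not reproduce the same split; `split_and_shard` is a hypothetical name.

```python
import random
from typing import List, Tuple


def split_and_shard(
    num_rows: int,
    train_split: float,
    seed: int,
    shard_size: int,
) -> Tuple[List[List[int]], List[List[int]]]:
    """Return (train_shards, eval_shards) as lists of row-index lists."""
    indices = list(range(num_rows))
    # A seeded shuffle makes the split repeatable for the same inputs.
    random.Random(seed).shuffle(indices)

    # Rounding rule is an assumption; the real tool may truncate instead.
    cut = round(num_rows * train_split)
    train_rows, eval_rows = indices[:cut], indices[cut:]

    def chunk(rows: List[int]) -> List[List[int]]:
        # The last shard may hold fewer than shard_size rows.
        return [rows[i:i + shard_size] for i in range(0, len(rows), shard_size)]

    return chunk(train_rows), chunk(eval_rows)
```

With 100,000 rows and the flag values from the example above, this sketch produces four train shards (the last one partial) and one eval shard.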

Use the cache in training:

```powershell
.\.venv\Scripts\python.exe -m anifilebert.train `
  --tokenizer char `
  --data-file data\schema_v2_hard_focus_char_seed63.jsonl `
  --vocab-file datasets\AnimeName\vocab.char.json `
  --encoded-cache-dir data\encoded_cache\schema_v2_hard_focus_char_seed63 `
  --max-seq-length 128
```