ModerRAS
/

AniFileBERT

Token Classification

filename-parsing

Eval Results (legacy)

Model card Files Files and versions

AniFileBERT / tools /encoded_dataset_cache /README.md

ModerRAS's picture

Add Rust encoded dataset cache

c705a32 6 days ago

|

1.09 kB

	# AniFileBERT encoded dataset cache

	Builds split train/eval `.npy` shard caches for `anifilebert.train`.

	The tool mirrors the Python char-tokenizer training encoder for JSONL rows with
	`filename`, `tokens`, and `labels`, including projection from source tokens to
	character labels and the structural media-label repairs used by training.

	Example:

	```powershell
	cargo run --release --manifest-path tools\encoded_dataset_cache\Cargo.toml -- `
	--input data\schema_v2_hard_focus_char_seed63.jsonl `
	--vocab-file datasets\AnimeName\vocab.char.json `
	--label-schema-file label_schema.json `
	--output-dir data\encoded_cache\schema_v2_hard_focus_char_seed63 `
	--max-length 128 `
	--train-split 0.95 `
	--seed 63 `
	--shard-size 25000 `
	--threads 16
	```

	Use the cache in training:

	```powershell
	.\.venv\Scripts\python.exe -m anifilebert.train `
	--tokenizer char `
	--data-file data\schema_v2_hard_focus_char_seed63.jsonl `
	--vocab-file datasets\AnimeName\vocab.char.json `
	--encoded-cache-dir data\encoded_cache\schema_v2_hard_focus_char_seed63 `
	--max-seq-length 128
	```