SlayerLab
/

slayer-gpt-tokenizer-model

Text Generation

Model card Files Files and versions

slayer-gpt-tokenizer-model / docs /TOKENIZER_NOTES.md

kacperwikiel's picture

Upload Slayer GPT tokenizer model archive

78c54ec verified 7 days ago

|

History Blame Contribute Delete

1.83 kB

	# Tokenizer Notes

	The key teaching point is that the tokenizer is part of the model. It is not an interchangeable preprocessing detail.

	## Tokenizers In This Repo

	\| File \| Vocab \| Use \|
	\| --- \| ---: \| --- \|
	\| `tokenizers/polish_bpe_32k.json` \| 32768 \| Paired with `model/ckpt.pt` \|
	\| `tokenizers/rxlm_polish_bpe_65k.json` \| 65536 \| Separate later custom tokenizer artifact \|

	## 32k Polish BPE

	Properties:

	- byte-level BPE,
	- 32768 vocabulary entries,
	- 32511 merges,
	- no normalizer,
	- `add_prefix_space=False`,
	- `<\|endoftext\|>` is the special document separator token.

	This tokenizer is small enough that token ids fit safely in `uint16`, which is why the training shards can be compact raw binary files.

	## 65k RXLM BPE

	Properties:

	- byte-level BPE,
	- 65536 vocabulary entries,
	- 65283 merges,
	- NFKC normalization,
	- `add_prefix_space=True`,
	- 12 added tokens.

	This is a different tokenizer. It should be taught as a later design variant, not as the tokenizer for `model/ckpt.pt`.

	## Why Custom BPE

	For Polish, a custom tokenizer can reduce awkward fragmentation of common morphemes, diacritics, inflected forms, and domain-specific text. The lesson is not that 32k is always best; the lesson is that tokenizer choice changes:

	- effective context length,
	- training cost,
	- model embedding size,
	- output fluency,
	- evaluation comparability.

	## Quick Inspection Snippet

	```python
	from tokenizers import Tokenizer

	tok = Tokenizer.from_file("tokenizers/polish_bpe_32k.json")
	text = "Zażółć gęślą jaźń. Polska jest częścią Europy."
	enc = tok.encode(text)

	print(enc.ids)
	print(enc.tokens)
	print(tok.decode(enc.ids))
	```

	## Compatibility Rule

	For inference:

	```text
	checkpoint vocab_size == tokenizer vocab size
	```

	For this repo:

	```text
	model/ckpt.pt -> tokenizers/polish_bpe_32k.json
	```