Tokenizer Notes
The key teaching point is that the tokenizer is part of the model. It is not an interchangeable preprocessing detail.
Tokenizers In This Repo
| File | Vocab | Use |
|---|---|---|
tokenizers/polish_bpe_32k.json |
32768 | Paired with model/ckpt.pt |
tokenizers/rxlm_polish_bpe_65k.json |
65536 | Separate later custom tokenizer artifact |
32k Polish BPE
Properties:
- byte-level BPE,
- 32768 vocabulary entries,
- 32511 merges,
- no normalizer,
add_prefix_space=False,<|endoftext|>is the special document separator token.
This tokenizer is small enough that token ids fit safely in uint16, which is why the training shards can be compact raw binary files.
65k RXLM BPE
Properties:
- byte-level BPE,
- 65536 vocabulary entries,
- 65283 merges,
- NFKC normalization,
add_prefix_space=True,- 12 added tokens.
This is a different tokenizer. It should be taught as a later design variant, not as the tokenizer for model/ckpt.pt.
Why Custom BPE
For Polish, a custom tokenizer can reduce awkward fragmentation of common morphemes, diacritics, inflected forms, and domain-specific text. The lesson is not that 32k is always best; the lesson is that tokenizer choice changes:
- effective context length,
- training cost,
- model embedding size,
- output fluency,
- evaluation comparability.
Quick Inspection Snippet
from tokenizers import Tokenizer
tok = Tokenizer.from_file("tokenizers/polish_bpe_32k.json")
text = "Zażółć gęślą jaźń. Polska jest częścią Europy."
enc = tok.encode(text)
print(enc.ids)
print(enc.tokens)
print(tok.decode(enc.ids))
Compatibility Rule
For inference:
checkpoint vocab_size == tokenizer vocab size
For this repo:
model/ckpt.pt -> tokenizers/polish_bpe_32k.json