Tokenizer Notes

The key teaching point is that the tokenizer is part of the model. It is not an interchangeable preprocessing detail.

Tokenizers In This Repo

File	Vocab	Use
`tokenizers/polish_bpe_32k.json`	32768	Paired with `model/ckpt.pt`
`tokenizers/rxlm_polish_bpe_65k.json`	65536	Separate later custom tokenizer artifact

32k Polish BPE

Properties:

byte-level BPE,
32768 vocabulary entries,
32511 merges,
no normalizer,
add_prefix_space=False,
<|endoftext|> is the special document separator token.

This tokenizer is small enough that token ids fit safely in uint16, which is why the training shards can be compact raw binary files.

65k RXLM BPE

Properties:

byte-level BPE,
65536 vocabulary entries,
65283 merges,
NFKC normalization,
add_prefix_space=True,
12 added tokens.

This is a different tokenizer. It should be taught as a later design variant, not as the tokenizer for model/ckpt.pt.

Why Custom BPE

For Polish, a custom tokenizer can reduce awkward fragmentation of common morphemes, diacritics, inflected forms, and domain-specific text. The lesson is not that 32k is always best; the lesson is that tokenizer choice changes:

effective context length,
training cost,
model embedding size,
output fluency,
evaluation comparability.

Quick Inspection Snippet

from tokenizers import Tokenizer

tok = Tokenizer.from_file("tokenizers/polish_bpe_32k.json")
text = "Zażółć gęślą jaźń. Polska jest częścią Europy."
enc = tok.encode(text)

print(enc.ids)
print(enc.tokens)
print(tok.decode(enc.ids))

Compatibility Rule

For inference:

checkpoint vocab_size == tokenizer vocab size

For this repo:

model/ckpt.pt -> tokenizers/polish_bpe_32k.json