# Tokenizer Notes

The key teaching point is that the tokenizer is part of the model. It is not an interchangeable preprocessing detail.

## Tokenizers In This Repo

| File | Vocab | Use |
| --- | ---: | --- |
| `tokenizers/polish_bpe_32k.json` | 32768 | Paired with `model/ckpt.pt` |
| `tokenizers/rxlm_polish_bpe_65k.json` | 65536 | Separate later custom tokenizer artifact |

## 32k Polish BPE

Properties:

- byte-level BPE,
- 32768 vocabulary entries,
- 32511 merges,
- no normalizer,
- `add_prefix_space=False`,
- `<|endoftext|>` is the special document separator token.

This tokenizer is small enough that token ids fit safely in `uint16`, which is why the training shards can be compact raw binary files.

## 65k RXLM BPE

Properties:

- byte-level BPE,
- 65536 vocabulary entries,
- 65283 merges,
- NFKC normalization,
- `add_prefix_space=True`,
- 12 added tokens.

This is a different tokenizer. It should be taught as a later design variant, not as the tokenizer for `model/ckpt.pt`.

## Why Custom BPE

For Polish, a custom tokenizer can reduce awkward fragmentation of common morphemes, diacritics, inflected forms, and domain-specific text.

The lesson is not that 32k is always best; the lesson is that tokenizer choice changes:

- effective context length,
- training cost,
- model embedding size,
- output fluency,
- evaluation comparability.

## Quick Inspection Snippet

```python
from tokenizers import Tokenizer

tok = Tokenizer.from_file("tokenizers/polish_bpe_32k.json")
text = "Zażółć gęślą jaźń. Polska jest częścią Europy."
enc = tok.encode(text)

print(enc.ids)
print(enc.tokens)
print(tok.decode(enc.ids))
```

## Compatibility Rule

For inference:

```text
checkpoint vocab_size == tokenizer vocab size
```

For this repo:

```text
model/ckpt.pt -> tokenizers/polish_bpe_32k.json
```
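
The compatibility rule and the `uint16` note can both be expressed as small checks. The sketch below is illustrative only: `ids_fit_in_uint16` and `check_compatible` are hypothetical helper names, not functions from this repo, and the vocabulary sizes are plain integers copied from the table in these notes rather than values read from `model/ckpt.pt`.

```python
UINT16_MAX = 2**16 - 1  # 65535, the largest value a uint16 can store


def ids_fit_in_uint16(vocab_size: int) -> bool:
    """Token ids run from 0 to vocab_size - 1, so only the largest id matters."""
    return 0 < vocab_size <= UINT16_MAX + 1


def check_compatible(checkpoint_vocab_size: int, tokenizer_vocab_size: int) -> None:
    """Raise if the checkpoint and tokenizer disagree on vocabulary size."""
    if checkpoint_vocab_size != tokenizer_vocab_size:
        raise ValueError(
            f"checkpoint vocab_size {checkpoint_vocab_size} != "
            f"tokenizer vocab size {tokenizer_vocab_size}"
        )


# Vocabulary sizes taken from the table in these notes (assumed, not loaded).
POLISH_32K = 32768
RXLM_65K = 65536

check_compatible(POLISH_32K, POLISH_32K)  # matching pair, passes silently
print(ids_fit_in_uint16(POLISH_32K))

try:
    # Hypothetical mismatch: a 32k checkpoint paired with the 65k tokenizer.
    check_compatible(POLISH_32K, RXLM_65K)
except ValueError as err:
    print(err)
```

In practice the tokenizer side of the comparison could come from `tok.get_vocab_size()` on a loaded `Tokenizer`. How the checkpoint stores its `vocab_size` is not described in these notes, so reading that value is left to the caller.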