# Tokenizer Notes

The key teaching point is that the tokenizer is part of the model, not an interchangeable preprocessing detail: the checkpoint's embedding rows are indexed by the token ids that its tokenizer produces.

## Tokenizers In This Repo

| File | Vocab | Use |
| --- | ---: | --- |
| `tokenizers/polish_bpe_32k.json` | 32768 | Paired with `model/ckpt.pt` |
| `tokenizers/rxlm_polish_bpe_65k.json` | 65536 | Separate later custom tokenizer artifact |

## 32k Polish BPE

Properties:

- byte-level BPE,
- 32768 vocabulary entries,
- 32511 merges,
- no normalizer,
- `add_prefix_space=False`,
- `<|endoftext|>` is the special document separator token.

The largest token id is 32767, well below the `uint16` maximum of 65535, so every id fits safely in two bytes. That is why the training shards can be compact raw binary files.
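
The `uint16` point can be sketched with the standard library alone. This is a minimal illustration, not the repo's shard writer: the helper names `write_shard` and `read_shard`, the little-endian byte order, and the headerless layout are all assumptions made for the example.

```python
import struct
from pathlib import Path

UINT16_MAX = 2**16 - 1  # 65535, the largest id a uint16 can hold


def write_shard(path, token_ids, vocab_size):
    """Write ids as headerless little-endian uint16 (layout is an assumption)."""
    if vocab_size - 1 > UINT16_MAX:
        raise ValueError(f"vocab_size {vocab_size} does not fit in uint16")
    if any(not 0 <= t < vocab_size for t in token_ids):
        raise ValueError("token id outside [0, vocab_size)")
    # "<" = little-endian, "H" = unsigned 16-bit integer, one per token.
    Path(path).write_bytes(struct.pack(f"<{len(token_ids)}H", *token_ids))


def read_shard(path):
    """Read the ids back; each token occupies exactly two bytes."""
    data = Path(path).read_bytes()
    return list(struct.unpack(f"<{len(data) // 2}H", data))
```

A shard written this way takes two bytes per token, half the size of the same ids stored as 32-bit integers. The guard on `vocab_size` is the part worth keeping in any real writer: a vocabulary with ids above 65535 would need a wider integer type.
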

## 65k RXLM BPE

Properties:

- byte-level BPE,
- 65536 vocabulary entries,
- 65283 merges,
- NFKC normalization,
- `add_prefix_space=True`,
- 12 added tokens.

This is a different tokenizer, with its own vocabulary, normalizer, and prefix-space setting. It should be taught as a later design variant, not as the tokenizer for `model/ckpt.pt`.
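
One practical consequence of the NFKC normalizer listed above is that some input text is rewritten before it is tokenized, so decoding may not return the exact original string. The sketch below uses Python's `unicodedata` module as a stand-in for the tokenizer's normalizer; it does not load the tokenizer file, and the helper name `nfkc_changes` and the sample strings are chosen only for illustration.

```python
import unicodedata


def nfkc_changes(text):
    """Return True if NFKC normalization would alter the text."""
    return unicodedata.normalize("NFKC", text) != text


samples = {
    "ligature": "\ufb01nanse",             # starts with U+FB01 LATIN SMALL LIGATURE FI
    "fullwidth": "\uff21\uff22\uff23",      # fullwidth Latin A, B, C
    "decomposed": "z\u0307ycie",            # 'z' followed by COMBINING DOT ABOVE
    "precomposed": "\u017c\u00f3\u0142w",   # "żółw" written with precomposed letters
}

for name, text in samples.items():
    normalized = unicodedata.normalize("NFKC", text)
    print(f"{name:12} changed={nfkc_changes(text)} {text!r} -> {normalized!r}")
```

Ordinary precomposed Polish text passes through unchanged, while ligatures, fullwidth characters, and decomposed sequences are rewritten. The 32k tokenizer has no normalizer, so it sees such input as-is, which is one more reason the two tokenizers are not interchangeable.
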

## Why Custom BPE

For Polish, a custom tokenizer can reduce awkward fragmentation of common morphemes, diacritics, inflected forms, and domain-specific text. The lesson is not that 32k is always best; the lesson is that tokenizer choice changes:

- effective context length,
- training cost,
- model embedding size,
- output fluency,
- evaluation comparability.
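
Two of the items above can be put into rough numbers. The following is a back-of-the-envelope sketch: `embedding_params` and `chars_per_token` are hypothetical helpers, and `D_MODEL = 768` is an illustrative width, not a value read from `model/ckpt.pt`.

```python
def embedding_params(vocab_size, d_model, tied=True):
    """Parameters in the token embedding (plus the output projection if untied)."""
    tables = 1 if tied else 2
    return tables * vocab_size * d_model


def chars_per_token(text, token_ids):
    """Average characters covered by one token; higher means less fragmentation."""
    if not token_ids:
        raise ValueError("token_ids must not be empty")
    return len(text) / len(token_ids)


D_MODEL = 768  # hypothetical model width, for illustration only

for vocab_size in (32768, 65536):
    print(vocab_size, embedding_params(vocab_size, D_MODEL))
```

Doubling the vocabulary doubles the embedding table at any fixed width. To compare effective context length, one could pass the same held-out text and each tokenizer's `enc.ids` to `chars_per_token`: multiplying the result by the context length in tokens gives an estimate of how much text fits in one window.
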

## Quick Inspection Snippet

```python
from tokenizers import Tokenizer

# Load the tokenizer that is paired with model/ckpt.pt.
tok = Tokenizer.from_file("tokenizers/polish_bpe_32k.json")
# Sample text that contains every Polish diacritic.
text = "Zażółć gęślą jaźń. Polska jest częścią Europy."
enc = tok.encode(text)

print(enc.ids)              # integer token ids
print(enc.tokens)           # token strings as stored in the vocabulary
print(tok.decode(enc.ids))  # decode the ids back to text
```

## Compatibility Rule

For inference:

```text
checkpoint vocab_size == tokenizer vocab size
```

For this repo:

```text
model/ckpt.pt -> tokenizers/polish_bpe_32k.json
```
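
The size check above can be automated before loading a model. This is a sketch under stated assumptions: the helper names are hypothetical, the checkpoint's vocabulary size is passed in as a plain integer because these notes do not describe how `model/ckpt.pt` stores it, and the JSON layout (`model.vocab` plus `added_tokens`) is the one the `tokenizers` library normally writes for BPE models, assumed here to apply to the files in this repo.

```python
import json


def tokenizer_vocab_size(tokenizer_path):
    """Count distinct token ids in a tokenizers JSON file (base vocab plus added tokens)."""
    with open(tokenizer_path, encoding="utf-8") as f:
        spec = json.load(f)
    ids = set(spec["model"]["vocab"].values())
    # Added tokens may or may not already appear in the base vocab; the set dedupes them.
    ids.update(tok["id"] for tok in spec.get("added_tokens") or [])
    return len(ids)


def check_compatible(checkpoint_vocab_size, tokenizer_path):
    """Raise ValueError if the checkpoint and tokenizer disagree on vocabulary size."""
    size = tokenizer_vocab_size(tokenizer_path)
    if checkpoint_vocab_size != size:
        raise ValueError(
            f"checkpoint expects {checkpoint_vocab_size} tokens, "
            f"but {tokenizer_path} defines {size}"
        )
    return size
```

For example, `check_compatible(32768, "tokenizers/polish_bpe_32k.json")` would be expected to pass for this repo's pairing. A matching size is necessary but not sufficient: two different tokenizers can share a vocabulary size, so the explicit pairing above still has to be respected.
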