# Tokenizer Notes

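The key teaching point is that the tokenizer is part of the model, not an interchangeable preprocessing detail. A checkpoint's embedding and output layers are indexed by token id, so swapping the tokenizer changes what every id means.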

## Tokenizers In This Repo

| File | Vocab | Use |
| --- | ---: | --- |
| `tokenizers/polish_bpe_32k.json` | 32768 | Paired with `model/ckpt.pt` |
| `tokenizers/rxlm_polish_bpe_65k.json` | 65536 | Later custom tokenizer; not paired with `model/ckpt.pt` |

## 32k Polish BPE

Properties:

- byte-level BPE,
- 32768 vocabulary entries,
- 32511 merges,
- no normalizer,
- `add_prefix_space=False`,
- `<|endoftext|>` is the special document separator token.

With 32768 entries, the largest token id is 32767, well below the `uint16` maximum of 65535. That is why the training shards can be compact raw binary files at two bytes per token.
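As a rough illustration of the `uint16` point, the sketch below writes a few token ids to a raw binary file and reads them back. The shard layout is an assumption made for this example (no header, native byte order, hypothetical file name `shard.bin`); the repo's actual shard format may differ. It uses only the standard library, and `write_shard` and `read_shard` are hypothetical helper names.

```python
import array
import os
import tempfile

UINT16_MAX = 2**16 - 1  # 65535, the largest value a uint16 can hold


def write_shard(path, ids):
    """Write token ids as raw uint16 values (assumed layout: no header)."""
    if any(i < 0 or i > UINT16_MAX for i in ids):
        raise ValueError("token id does not fit in uint16")
    buf = array.array("H", ids)  # "H" = unsigned short
    assert buf.itemsize == 2, "expected 2-byte unsigned short on this platform"
    with open(path, "wb") as f:
        buf.tofile(f)


def read_shard(path):
    """Read a raw uint16 file back into a list of token ids."""
    buf = array.array("H")
    with open(path, "rb") as f:
        buf.frombytes(f.read())
    return buf.tolist()


ids = [0, 17, 32767]  # hypothetical ids; 32767 is the largest id in a 32768-entry vocab

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "shard.bin")
    write_shard(path, ids)
    size_bytes = os.path.getsize(path)  # two bytes per token
    restored = read_shard(path)

print(size_bytes, restored)
```

The explicit range check is the important part: silently truncating an id that does not fit would corrupt the training data without any error.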

## 65k RXLM BPE

Properties:

- byte-level BPE,
- 65536 vocabulary entries,
- 65283 merges,
- NFKC normalization,
- `add_prefix_space=True`,
- 12 added tokens.

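This is a different tokenizer: it differs from the 32k one in vocabulary size, normalization, and prefix-space handling, so its token ids are not interchangeable with those of the 32k tokenizer. It should be taught as a later design variant, not as the tokenizer for `model/ckpt.pt`.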

## Why Custom BPE

For Polish, a custom tokenizer can reduce awkward fragmentation of common morphemes, diacritics, inflected forms, and domain-specific text. The lesson is not that 32k is always best; the lesson is that tokenizer choice changes:

- effective context length,
- training cost,
- model embedding size,
- output fluency,
- evaluation comparability.
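The effect on model embedding size is simple arithmetic: an embedding table has `vocab_size * d_model` parameters. The sketch below compares the two vocabulary sizes from the table above. The hidden size `D_MODEL = 768` and the choice of tied input/output embeddings are assumptions for illustration only; these notes do not state the model's dimensions.

```python
def embedding_params(vocab_size, d_model, tied=True):
    """Parameter count of the token embedding (plus output head if untied)."""
    table = vocab_size * d_model
    return table if tied else 2 * table


D_MODEL = 768  # hypothetical hidden size, not taken from this repo

params_32k = embedding_params(32768, D_MODEL)
params_65k = embedding_params(65536, D_MODEL)

print(f"32k vocab: {params_32k:,} embedding parameters")
print(f"65k vocab: {params_65k:,} embedding parameters")
```

Under these assumptions, doubling the vocabulary doubles the embedding parameters, which matters most for small models where the embedding table is a large share of the total.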

## Quick Inspection Snippet

```python
from tokenizers import Tokenizer

tok = Tokenizer.from_file("tokenizers/polish_bpe_32k.json")
text = "Zażółć gęślą jaźń. Polska jest częścią Europy."
enc = tok.encode(text)

print(enc.ids)
print(enc.tokens)
print(tok.decode(enc.ids))
```
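To turn the inspection into a number, one option is to measure tokens per character, which relates directly to effective context length. The sketch below is a minimal version of that idea. `tokens_per_char` and `utf8_bytes` are hypothetical helper names, and the stand-in encoder emits one token per UTF-8 byte so that the example runs without any tokenizer file.

```python
def tokens_per_char(encode, text):
    """Average number of tokens produced per Unicode character."""
    if not text:
        raise ValueError("text must be non-empty")
    return len(encode(text)) / len(text)


def utf8_bytes(text):
    # Stand-in encoder: one token per UTF-8 byte, which is what a
    # byte-level BPE would emit if it had learned no merges at all.
    return list(text.encode("utf-8"))


text = "Zażółć gęślą jaźń. Polska jest częścią Europy."
baseline = tokens_per_char(utf8_bytes, text)
print(f"byte-level baseline: {baseline:.2f} tokens/char")
```

To measure a real tokenizer, pass `lambda s: tok.encode(s).ids` as `encode`, with `tok` loaded as in the snippet above. A trained tokenizer should land well below the byte-level baseline; comparing the two tokenizers on the same text shows how differently they fragment it.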

## Compatibility Rule

For inference:

```text
checkpoint vocab_size == tokenizer vocab size
```

For this repo:

```text
model/ckpt.pt -> tokenizers/polish_bpe_32k.json
```
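The rule above can be enforced with a small guard before inference. This is a sketch, not the repo's loading code: `assert_compatible` is a hypothetical helper, and how `model/ckpt.pt` stores its `vocab_size` is not described in these notes, so the function takes both sizes as plain integers. On the tokenizer side, the size can be read with `Tokenizer.get_vocab_size()`.

```python
def assert_compatible(checkpoint_vocab_size, tokenizer_vocab_size):
    """Raise if the checkpoint and tokenizer disagree on vocabulary size."""
    if checkpoint_vocab_size != tokenizer_vocab_size:
        raise ValueError(
            f"checkpoint vocab_size={checkpoint_vocab_size} does not match "
            f"tokenizer vocab size={tokenizer_vocab_size}"
        )


# Matching pair: 32k checkpoint with the 32k tokenizer.
assert_compatible(32768, 32768)

# Mismatched pair: 32k checkpoint with the 65k tokenizer.
try:
    assert_compatible(32768, 65536)
except ValueError as err:
    print(err)
```

Equal sizes are necessary but not sufficient: two tokenizers with the same vocabulary size can still assign different ids to the same text, so the pairing listed above remains the source of truth.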