# Tokenizer Notes
The key teaching point is that the tokenizer is part of the model. It is not an interchangeable preprocessing detail.
## Tokenizers In This Repo
| File | Vocab | Use |
| --- | ---: | --- |
| `tokenizers/polish_bpe_32k.json` | 32768 | Paired with `model/ckpt.pt` |
| `tokenizers/rxlm_polish_bpe_65k.json` | 65536 | Separate, later custom tokenizer; not paired with `model/ckpt.pt` |
## 32k Polish BPE
Properties:
- byte-level BPE,
- 32768 vocabulary entries,
- 32511 merges,
- no normalizer,
- `add_prefix_space=False`,
- `<|endoftext|>` is the special document separator token.
With 32768 entries, token ids run from 0 to 32767, well below the `uint16` maximum of 65535. Each id therefore fits in two bytes, which is why the training shards can be compact raw binary files.
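To make the two-bytes-per-token point concrete, here is a minimal sketch of writing and reading a raw `uint16` shard with only the standard library. The helper names (`write_shard`, `read_shard`), the file name, and the sample ids are hypothetical. The actual shard layout used for training (header, byte order) is not described in these notes, so treat this as an illustration of the idea, not the repo's format.

```python
import array
import os
import tempfile

VOCAB_SIZE = 32768       # vocabulary size of the 32k tokenizer
UINT16_MAX = 2**16 - 1   # 65535, the largest value a uint16 can hold


def write_shard(path, ids):
    """Write token ids as raw unsigned 16-bit integers (native byte order)."""
    if any(i < 0 or i > UINT16_MAX for i in ids):
        raise ValueError("token id does not fit in uint16")
    buf = array.array("H", ids)  # "H" = unsigned short
    assert buf.itemsize == 2     # two bytes per id on mainstream platforms
    with open(path, "wb") as f:
        buf.tofile(f)


def read_shard(path):
    """Read a raw uint16 shard back into a list of ints."""
    buf = array.array("H")
    n_ids = os.path.getsize(path) // buf.itemsize
    with open(path, "rb") as f:
        buf.fromfile(f, n_ids)
    return buf.tolist()


# Hypothetical ids, including the largest valid id for a 32768-entry vocabulary.
ids = [0, 17, VOCAB_SIZE - 1]
with tempfile.TemporaryDirectory() as tmp:
    shard_path = os.path.join(tmp, "shard.bin")  # hypothetical file name
    write_shard(shard_path, ids)
    shard_bytes = os.path.getsize(shard_path)    # 2 bytes per token
    restored = read_shard(shard_path)
```

In practice such shards are often read with NumPy (`np.fromfile` or `np.memmap` with `dtype=np.uint16`). The standard-library version above only keeps the sketch dependency-free.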
## 65k RXLM BPE
Properties:
- byte-level BPE,
- 65536 vocabulary entries,
- 65283 merges,
- NFKC normalization,
- `add_prefix_space=True`,
- 12 added tokens.
This is a different tokenizer. It should be taught as a later design variant, not as the tokenizer for `model/ckpt.pt`.
## Why Custom BPE
For Polish, a custom tokenizer can reduce awkward fragmentation of common morphemes, diacritics, inflected forms, and domain-specific text. The lesson is not that 32k is always best; the lesson is that tokenizer choice changes:
- effective context length,
- training cost,
- model embedding size,
- output fluency,
- evaluation comparability.
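Two of these effects are simple arithmetic, sketched below. The embedding width `N_EMBD`, the context window `BLOCK_SIZE`, and the characters-per-token figures are all hypothetical placeholders, not measurements from this repo. Only the two vocabulary sizes come from the table above.

```python
def embedding_params(vocab_size, n_embd):
    """Number of weights in a token embedding table of shape (vocab_size, n_embd)."""
    return vocab_size * n_embd


def chars_covered(block_size, chars_per_token):
    """Approximate characters of text that fit in one context window."""
    return block_size * chars_per_token


N_EMBD = 768       # hypothetical embedding width
BLOCK_SIZE = 1024  # hypothetical context window, in tokens

# Doubling the vocabulary doubles the embedding table.
params_32k = embedding_params(32768, N_EMBD)
params_65k = embedding_params(65536, N_EMBD)

# Hypothetical compression rates: a tokenizer that packs more characters
# into each token makes the same window cover more text.
coverage_a = chars_covered(BLOCK_SIZE, 4.0)
coverage_b = chars_covered(BLOCK_SIZE, 5.0)

print(params_32k, params_65k)
print(coverage_a, coverage_b)
```

The real characters-per-token rate has to be measured on representative Polish text with each tokenizer; the values above only show the direction of the trade-off.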
## Quick Inspection Snippet
```python
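from tokenizers import Tokenizer

# Load the tokenizer that is paired with model/ckpt.pt.
tok = Tokenizer.from_file("tokenizers/polish_bpe_32k.json")

# Polish text with diacritics, which exercises the byte-level merges.
text = "Zażółć gęślą jaźń. Polska jest częścią Europy."
enc = tok.encode(text)

print(enc.ids)              # integer token ids
print(enc.tokens)           # token strings (byte-level symbols, not plain text)
print(tok.decode(enc.ids))  # decoded text; compare it with the original input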
```
## Compatibility Rule
For inference:
```text
checkpoint vocab_size == tokenizer vocab size
```
For this repo:
```text
model/ckpt.pt -> tokenizers/polish_bpe_32k.json
```