slayer-gpt-tokenizer-model / docs /TOKENIZER_NOTES.md
kacperwikiel's picture
Upload Slayer GPT tokenizer model archive
78c54ec verified
|
Raw
History Blame Contribute Delete
1.83 kB

Tokenizer Notes

The key teaching point is that the tokenizer is part of the model. It is not an interchangeable preprocessing detail.

Tokenizers In This Repo

File Vocab Use
tokenizers/polish_bpe_32k.json 32768 Paired with model/ckpt.pt
tokenizers/rxlm_polish_bpe_65k.json 65536 Separate later custom tokenizer artifact

32k Polish BPE

Properties:

  • byte-level BPE,
  • 32768 vocabulary entries,
  • 32511 merges,
  • no normalizer,
  • add_prefix_space=False,
  • <|endoftext|> is the special document separator token.

This tokenizer is small enough that token ids fit safely in uint16, which is why the training shards can be compact raw binary files.

65k RXLM BPE

Properties:

  • byte-level BPE,
  • 65536 vocabulary entries,
  • 65283 merges,
  • NFKC normalization,
  • add_prefix_space=True,
  • 12 added tokens.

This is a different tokenizer. It should be taught as a later design variant, not as the tokenizer for model/ckpt.pt.

Why Custom BPE

For Polish, a custom tokenizer can reduce awkward fragmentation of common morphemes, diacritics, inflected forms, and domain-specific text. The lesson is not that 32k is always best; the lesson is that tokenizer choice changes:

  • effective context length,
  • training cost,
  • model embedding size,
  • output fluency,
  • evaluation comparability.

Quick Inspection Snippet

from tokenizers import Tokenizer

tok = Tokenizer.from_file("tokenizers/polish_bpe_32k.json")
text = "Zażółć gęślą jaźń. Polska jest częścią Europy."
enc = tok.encode(text)

print(enc.ids)
print(enc.tokens)
print(tok.decode(enc.ids))

Compatibility Rule

For inference:

checkpoint vocab_size == tokenizer vocab size

For this repo:

model/ckpt.pt -> tokenizers/polish_bpe_32k.json