---
title: README
emoji: 📊
colorFrom: purple
colorTo: yellow
sdk: static
pinned: false
license: mit
---

**TokSuite** is a collection of models and benchmarks designed to isolate and study the impact of tokenization on language model behavior across five languages (**English, Chinese, Turkish, Italian, and Farsi**) as well as **STEM and mathematical text**. It includes fourteen models that share the same architecture, training data, training budget, and initialization and differ only in their tokenizers, alongside a set of benchmarks that evaluate performance under **real-world perturbations** that affect tokenization. Our code is available at [https://github.com/r-three/Tokenizers](https://github.com/r-three/Tokenizers).
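
To give a feel for why surface perturbations matter for tokenization, here is a minimal, self-contained sketch. It is not one of the TokSuite tokenizers or benchmarks: `greedy_tokenize` and `TOY_VOCAB` are hypothetical names invented for this illustration, and the greedy longest-match rule is a simplifying assumption. It only shows how a single transposition typo can change the way a word is segmented.

```python
def greedy_tokenize(text, vocab):
    """Greedy longest-match segmentation with a single-character fallback.

    Hypothetical helper for illustration only; not a TokSuite tokenizer.
    """
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate substring starting at position i first.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab:
                tokens.append(piece)
                i = j
                break
        else:
            # No vocabulary entry matched: emit one character and move on.
            tokens.append(text[i])
            i += 1
    return tokens


# Hypothetical toy vocabulary, chosen only to make the effect visible.
TOY_VOCAB = {"token", "ization", "izer", "the", " "}

clean = greedy_tokenize("tokenization", TOY_VOCAB)
perturbed = greedy_tokenize("tokenizaiton", TOY_VOCAB)  # transposed "ti"

print(clean)      # ['token', 'ization']
print(perturbed)  # ['token', 'i', 'z', 'a', 'i', 't', 'o', 'n']
```

In this toy setting the typo turns a two-token word into eight tokens. Real tokenizers differ in how gracefully they degrade under such noise, which is the kind of behavior the benchmarks described above are meant to measure.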