TokSuite

AI & ML interests

Tokenization, Robustness, LLMs

TokSuite is a collection of models and benchmarks designed to isolate and study the impact of tokenization on language model behavior across English, Chinese, Turkish, Italian, and Farsi, as well as STEM and mathematical text. It includes fourteen models that share the same architecture, training data, training budget, and initialization and differ only in their tokenizers. It also includes a set of benchmarks that evaluate performance under real-world perturbations that affect tokenization.

Our code is available at https://github.com/r-three/Tokenizers.
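
As a rough illustration of what a perturbation that affects tokenization can look like, the sketch below uses a toy greedy longest-match tokenizer. It is not taken from the TokSuite code base: the vocabulary `TOY_VOCAB`, the function `greedy_tokenize`, and the example strings are all hypothetical, and the tokenizers studied in TokSuite may behave quite differently. The point is only that a small surface change (a dropped letter, or a different Unicode normalization form) can yield a different token sequence for text a reader would consider equivalent.

```python
import unicodedata

# Hypothetical toy vocabulary, not taken from any TokSuite tokenizer.
TOY_VOCAB = {"token", "ization", "ation", "caf", "caf\u00e9"}


def greedy_tokenize(text, vocab=TOY_VOCAB):
    """Greedy longest-match segmentation with single-character fallback."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a vocab entry matches.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # No vocab entry starts here: emit one character as its own token.
            tokens.append(text[i])
            i += 1
    return tokens


clean = "tokenization"
typo = "tokenzation"  # the same word with one letter dropped

# The same visible word in composed (NFC) and decomposed (NFD) Unicode form.
composed = unicodedata.normalize("NFC", "caf\u00e9")
decomposed = unicodedata.normalize("NFD", "caf\u00e9")

print(greedy_tokenize(clean))            # ['token', 'ization']
print(greedy_tokenize(typo))             # ['token', 'z', 'ation']
print(len(greedy_tokenize(composed)))    # 1
print(len(greedy_tokenize(decomposed)))  # 3
```

In this toy setting, the typo turns a two-token word into three tokens, and the decomposed accent splits a single-token word into three pieces. How much such changes matter for a real model depends on the tokenizer and on the model itself.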