---
title: README
emoji: 📊
colorFrom: purple
colorTo: yellow
sdk: static
pinned: false
license: mit
---

<p align="center">
  <img src="./toksuite-pipeline.png" alt="TokSuite Logo"/>
</p>

**TokSuite** is a collection of models and benchmarks designed to isolate and study the impact of tokenization on language model behavior across **English, Chinese, Turkish, Italian, and Farsi**, as well as **STEM and mathematical text**. It includes fourteen models that share the same architecture, training data, training budget, and initialization and differ only in their tokenizers, alongside a set of benchmarks that evaluate performance under **real-world perturbations** that affect tokenization.
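
As a rough illustration of why such perturbations matter, the sketch below shows how a single-character typo can change the token sequence a model sees. It is a toy example only: `greedy_tokenize` and `TOY_VOCAB` are hypothetical names invented here, and the greedy longest-match scheme is a stand-in, not one of the TokSuite tokenizers.

```python
def greedy_tokenize(text, vocab):
    """Greedy longest-match segmentation over a fixed vocabulary.

    Characters that match no vocabulary entry become single-character tokens.
    """
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate substring first, shrinking one character at a time.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab or j == i + 1:
                tokens.append(piece)
                i = j
                break
    return tokens


# Hypothetical toy vocabulary, chosen only for this illustration.
TOY_VOCAB = {"token", "ization", "iz", "at", "on"}

clean = greedy_tokenize("tokenization", TOY_VOCAB)
perturbed = greedy_tokenize("tokenizaton", TOY_VOCAB)  # one dropped letter

print(clean)      # ['token', 'ization']
print(perturbed)  # ['token', 'iz', 'at', 'on']
```

In this toy setting, dropping one letter turns two tokens into four, so the same word reaches the model as a different and longer input sequence.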

Our code is available at [https://github.com/r-three/Tokenizers](https://github.com/r-three/Tokenizers).