---
license: mit
language:
- en
- tr
- it
- fa
- zh
tags:
- toksuite
- tokenization
- gpt2
- bpe
- research
- robustness
pipeline_tag: text-generation
library_name: transformers
datasets:
- toksuite/toksuite_pretraining_data
---
<p align="left">
<img src="./toksuite-logo.png" alt="TokSuite Logo" width="260"/>
</p>
# TokSuite – GPT-2
## Model Summary
**TokSuite–GPT-2** is part of **TokSuite**, a suite of language models designed to study the impact of **tokenizer choice on language model behavior** under controlled conditions.
This model uses the **GPT-2 tokenizer** and is otherwise **identical** to the other TokSuite models in architecture, training data, training budget, and initialization. The TokSuite setup ensures that any observed behavioral characteristics reflect properties of the tokenizer rather than differences in model size, data, or optimization.
---
## Tokenizer
- **Tokenizer:** GPT-2
- **Tokenization method:** BPE
- **Vocabulary size:** 50,257
- **Out-of-vocabulary handling:** Byte-fallback
- **Language coverage:** English-only
- **Pretokenization source:** GPT-2
**Processing details:**
- **Numbers:** Group
- **Contractions:** GPT-2
- **Unicode normalization:** None
- **Whitespace / boundary markers:** Individual
- **Zero-width characters:** Token
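
As a rough illustration of the byte-level handling listed above, the sketch below re-implements the byte-to-symbol table used by the original GPT-2 encoder, written from memory rather than copied from this repository. BPE merges operate on these symbols, so any UTF-8 input can be represented without an unknown token. The helper name `to_byte_symbols` and the sample strings are assumptions made for this example.

```python
def bytes_to_unicode() -> dict[int, str]:
    """Map each of the 256 byte values to a printable Unicode character.

    Printable bytes map to themselves; the remaining bytes are shifted to
    code points starting at 256 so that every byte has a visible symbol.
    """
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    byte_values = printable[:]
    code_points = printable[:]
    shift = 0
    for b in range(256):
        if b not in printable:
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}


BYTE_TO_SYMBOL = bytes_to_unicode()


def to_byte_symbols(text: str) -> str:
    """Render text as the byte-level symbols that BPE merges operate on."""
    return "".join(BYTE_TO_SYMBOL[b] for b in text.encode("utf-8"))


# Sample strings chosen for illustration: English, Turkish, Chinese.
for sample in ["hello world", "ğ", "你好"]:
    n_bytes = len(sample.encode("utf-8"))
    print(f"{sample!r} -> {to_byte_symbols(sample)!r} ({n_bytes} bytes)")
```

Non-Latin scripts expand to several byte symbols per character before any merges apply, which is one reason an English-centric vocabulary tends to produce longer token sequences for the non-English portion of the training data.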
## Why GPT-2?
GPT-2 was included in TokSuite to represent a **canonical English BPE tokenizer** that has been widely adopted in early large-scale language models. As described in the tokenizer selection rationale of the TokSuite paper, GPT-2 provides a well-established reference point for studying subword tokenization without explicit normalization or language-specific preprocessing.
Including GPT-2 enables TokSuite to study tokenizer behavior in settings where:
- tokenization is optimized for English,
- preprocessing and normalization are minimal,
- and whitespace is handled implicitly through token boundaries.
This makes GPT-2 a foundational tokenizer design within the TokSuite collection.
---
## Model Architecture
- **Architecture:** Decoder-only Transformer (Lingua's Llama-3.2-1B configuration)
- **Non-embedding parameters:** ~1B
- **Context length:** 4096 tokens
- **Framework:** Meta Lingua
- **Initialization:** Shared super-vocabulary initialization across TokSuite models
The architecture and hyperparameters are fixed across all TokSuite models; the tokenizer is the only variable.
---
## Training Data
The model was trained on a **multilingual corpus totaling approximately 100B tokens**, consisting of:
- **English:** 40B tokens from *FineWeb-Edu*
- **Multilingual:** 60B tokens evenly distributed across:
  - Chinese (ZH)
  - Turkish (TR)
  - Italian (IT)
  - Farsi (FA)
You can find the pretraining dataset here: [toksuite/toksuite_pretraining_data](https://huggingface.co/datasets/toksuite/toksuite_pretraining_data)
All models in TokSuite are trained using a **fixed token budget**, following common practice in large-scale language model training.
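
The data mix described above implies the per-language split below. This is a small arithmetic sketch based only on the figures stated in this card; the variable names are chosen for this example.

```python
# Token counts as stated in this model card (approximate).
ENGLISH_TOKENS = 40_000_000_000
MULTILINGUAL_TOKENS = 60_000_000_000
MULTILINGUAL_LANGS = ["zh", "tr", "it", "fa"]

# The multilingual portion is split evenly across the four languages.
tokens_per_language = {"en": ENGLISH_TOKENS}
for lang in MULTILINGUAL_LANGS:
    tokens_per_language[lang] = MULTILINGUAL_TOKENS // len(MULTILINGUAL_LANGS)

total_corpus_tokens = sum(tokens_per_language.values())

for lang, count in tokens_per_language.items():
    share = count / total_corpus_tokens
    print(f"{lang}: {count / 1e9:.0f}B tokens ({share:.0%} of the corpus)")
```

Under these figures, each non-English language contributes about 15B tokens, compared with 40B for English.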
---
## Training Procedure
- **Training steps:** 100,000
- **Sequence length:** 4096
- **Batch size:** 256 sequences
- **Optimizer:** AdamW
- **Peak learning rate:** 1e-3
- **Learning rate schedule:** Cosine decay with 2,000 warm-up steps
- **Weight decay:** 0.1
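
The hyperparameters above can be sanity-checked with a short sketch. It computes the number of tokens processed over training and shows one common form of a warm-up plus cosine-decay schedule. The linear warm-up shape and the final learning rate of zero are assumptions, since this card does not state them; the actual Meta Lingua schedule may differ.

```python
import math

# Values taken from the list above.
TRAINING_STEPS = 100_000
SEQUENCE_LENGTH = 4096
BATCH_SIZE = 256
PEAK_LR = 1e-3
WARMUP_STEPS = 2_000

# Assumption: the card does not state a final learning rate.
MIN_LR = 0.0

tokens_per_step = SEQUENCE_LENGTH * BATCH_SIZE
total_tokens = tokens_per_step * TRAINING_STEPS


def learning_rate(step: int) -> float:
    """Linear warm-up (assumed) followed by cosine decay to MIN_LR."""
    if step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, TRAINING_STEPS - WARMUP_STEPS)
    progress = min(progress, 1.0)
    return MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))


print(f"tokens per step: {tokens_per_step:,}")
print(f"total tokens:    {total_tokens:,}")
for step in (0, 1_000, 2_000, 51_000, 100_000):
    print(f"step {step:>7,}: lr = {learning_rate(step):.6f}")
```

The product of steps, batch size, and sequence length comes to roughly 105B tokens, which is consistent with the approximately 100B-token budget stated under Training Data.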
---
## Evaluation
### Canonical Benchmarks
The model was evaluated on standard base language model benchmarks:
- HellaSwag
- ARC
- PIQA
- XNLI
<p align="left">
<img src="./model-performance-comparison.png" alt="Model performance comparison" width="700"/>
</p>
These evaluations verify that the model exhibits reasonable base language modeling behavior at its scale and training budget.
### TokSuite Robustness Benchmark
TokSuite–GPT-2 is evaluated on the **TokSuite robustness benchmark**, which measures sensitivity to real-world text perturbations, including:
- orthographic and spelling variations,
- presence or absence of diacritics,
- keyboard and input-method noise,
- Unicode formatting and homoglyphs,
- OCR and spacing artifacts,
- LaTeX and STEM-style formatting.
**Tokenization Robustness under Multilingual Text Perturbations**
Values represent **relative performance drop**, computed as `(Acc_clean − Acc_perturbed) / Acc_clean`, where **lower values indicate greater robustness**.
- **Input:** non-native keyboard input and romanization
- **Diacr.:** optional diacritics
- **Orth. & Gram.:** orthographic and grammatical errors
- **Morph:** morphological variations including derivations, inflections, and contractions
- **Noise:** homoglyph substitutions, OCR artifacts, typos, and spacing errors
- **LaTeX:** LaTeX-style mathematical formatting
- **STEM:** scientific diagrams and notational conventions
- **Unic.:** Unicode styling characters
**NEN** denotes non-English inputs and **EN** denotes English inputs. The **Avg** column reports the average relative performance drop across all perturbation categories.
| Model | Input (NEN) | Diacr. (NEN) | Orth. & Gram. (EN) | Orth. & Gram. (NEN) | Morph (EN) | Morph (NEN) | Noise (EN) | Noise (NEN) | LaTeX (EN) | STEM (EN) | Unic. (EN) | Avg ↓ |
|-------|-------------|--------------|--------------------|---------------------|------------|-------------|------------|-------------|------------|-----------|------------|-------|
| TokenMonster | **0.23** | **0.33** | 0.08 | **0.01** | 0.23 | **-0.07** | **0.10** | **0.18** | 0.21 | **0.10** | 0.51 | **0.17** |
| XGLM | 0.34 | 0.49 | 0.10 | 0.11 | 0.25 | 0.07 | 0.12 | 0.22 | **0.29** | 0.29 | **0.11** | 0.22 |
| BLOOM | 0.30 | 0.34 | 0.13 | 0.07 | **0.18** | **0.11** | 0.18 | **0.18** | 0.24 | 0.11 | 0.57 | 0.22 |
| ByT5 | 0.30 | 0.44 | **0.04** | 0.06 | 0.27 | 0.04 | 0.14 | **0.18** | 0.17 | 0.29 | 0.53 | 0.22 |
| Comma | 0.28 | 0.43 | 0.05 | 0.07 | **0.18** | 0.00 | 0.11 | 0.20 | 0.23 | 0.29 | 0.61 | 0.22 |
| mBERT | 0.33 | 0.44 | 0.11 | 0.11 | 0.23 | 0.06 | 0.18 | 0.22 | **0.14** | 0.22 | **0.61** | 0.24 |
| GPT-4o | 0.30 | 0.51 | 0.08 | 0.05 | 0.21 | 0.05 | 0.16 | 0.19 | 0.24 | 0.33 | 0.55 | 0.24 |
| GPT-2 | 0.34 | 0.46 | 0.07 | 0.10 | 0.25 | 0.06 | 0.14 | 0.21 | 0.24 | 0.35 | 0.53 | 0.25 |
| Phi-3 | 0.33 | 0.46 | 0.16 | **0.09** | 0.27 | 0.08 | 0.17 | 0.21 | 0.24 | 0.22 | 0.55 | 0.25 |
| Gemma-2 | 0.32 | 0.42 | 0.14 | **0.15** | 0.24 | 0.03 | 0.16 | 0.25 | 0.22 | 0.36 | 0.57 | 0.26 |
| Qwen-3 | **0.36** | 0.42 | 0.14 | 0.11 | 0.25 | 0.06 | 0.16 | 0.23 | 0.26 | 0.29 | 0.57 | 0.26 |
| Llama-3.2 | 0.33 | **0.55** | 0.11 | 0.10 | 0.25 | 0.08 | 0.15 | 0.24 | 0.17 | 0.30 | 0.59 | 0.26 |
| Aya | 0.31 | 0.46 | 0.14 | 0.10 | 0.22 | 0.03 | **0.19** | **0.25** | 0.21 | 0.38 | 0.58 | 0.26 |
| Tekken | 0.33 | 0.47 | **0.18** | 0.03 | **0.31** | 0.10 | 0.14 | 0.21 | 0.27 | **0.43** | 0.54 | **0.27** |
| **Avg** | 0.31 | 0.44 | 0.11 | 0.08 | 0.24 | **0.04** | 0.15 | 0.21 | 0.22 | 0.28 | **0.53** | 0.24 |
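
To make the metric in the table concrete, here is a minimal sketch of the relative performance drop and of the **Avg** column. The accuracies passed to `relative_drop` are hypothetical values for illustration, not results from the paper; the GPT-2 row values are copied from the table above, and the sketch assumes the average is a plain mean over the 11 categories.

```python
def relative_drop(acc_clean: float, acc_perturbed: float) -> float:
    """Relative performance drop: (Acc_clean - Acc_perturbed) / Acc_clean."""
    if acc_clean <= 0:
        raise ValueError("acc_clean must be positive")
    return (acc_clean - acc_perturbed) / acc_clean


# Hypothetical accuracies, for illustration only.
drop = relative_drop(acc_clean=0.60, acc_perturbed=0.45)
gain = relative_drop(acc_clean=0.50, acc_perturbed=0.52)
print(f"accuracy falls from 0.60 to 0.45: drop = {drop:.2f}")
print(f"accuracy rises from 0.50 to 0.52: drop = {gain:.2f}")

# GPT-2 row from the table above (11 perturbation categories).
GPT2_ROW = [0.34, 0.46, 0.07, 0.10, 0.25, 0.06, 0.14, 0.21, 0.24, 0.35, 0.53]
gpt2_avg = sum(GPT2_ROW) / len(GPT2_ROW)
print(f"GPT-2 average relative drop: {gpt2_avg:.2f}")
```

A negative value, such as the -0.07 reported for TokenMonster under Morph (NEN), means accuracy on the perturbed inputs was higher than on the clean inputs.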
---
## Intended Use
This model is intended for:
- research on tokenization and robustness,
- multilingual NLP analysis,
- controlled ablation studies,
- benchmarking tokenizer behavior under noise.
It is **not** instruction-tuned, aligned, or optimized for deployment.
---
## Limitations
- Trained on a limited set of five languages.
- Not optimized for instruction following or dialogue.
- The fixed token budget constrains exposure to raw text, depending on tokenization efficiency.
- Intended strictly for research purposes.
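
The token-budget limitation above can be illustrated with simple arithmetic: at a fixed number of training tokens, a tokenizer that packs more bytes into each token exposes the model to more raw text. The bytes-per-token figures below are hypothetical placeholders, not measurements of any TokSuite tokenizer.

```python
# Approximate token budget stated in this card.
TOKEN_BUDGET = 100_000_000_000

# Hypothetical average UTF-8 bytes per token, for illustration only.
HYPOTHETICAL_BYTES_PER_TOKEN = {
    "more_efficient_tokenizer": 4.0,
    "less_efficient_tokenizer": 2.0,
}


def raw_bytes_seen(token_budget: int, bytes_per_token: float) -> float:
    """Raw text (in bytes) covered by a fixed token budget."""
    return token_budget * bytes_per_token


for name, bytes_per_token in HYPOTHETICAL_BYTES_PER_TOKEN.items():
    gigabytes = raw_bytes_seen(TOKEN_BUDGET, bytes_per_token) / 1e9
    print(f"{name}: ~{gigabytes:.0f} GB of raw text at the same token budget")
```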
---
## Ethical Considerations
TokSuite models are released to support **scientific investigation of tokenization effects**.
They may reflect biases present in large-scale web data and should not be used in high-stakes or user-facing applications without additional safeguards.
---
## Citation
If you use this model, please cite:
```bibtex
@article{toksuite2025,
title={TokSuite: Measuring the Impact of Tokenizer Choice on Language Model Behavior},
author={Altıntaş, Gul Sena and Ehghaghi, Malikeh and Lester, Brian and Liu, Fengyuan and Zhao, Wanru and Ciccone, Marco and Raffel, Colin},
year={2025},
arxiv={https://arxiv.org/abs/2512.20757},
}
```