---
license: mit
language:
- en
- tr
- it
- fa
- zh
tags:
- toksuite
- tokenization
- bloom
- multilingual
- bpe
- robustness
- research
pipeline_tag: text-generation
library_name: transformers
datasets:
- toksuite/toksuite_pretraining_data
---

<p align="left">
  <img src="./toksuite-logo.png" alt="TokSuite Logo" width="260"/>
</p>

# TokSuite – BLOOM

## Model Summary

**TokSuite–BLOOM** is part of **TokSuite**, a suite of language models designed to study the impact of **tokenizer choice on language model behavior** under controlled conditions.

This model uses the **BLOOM tokenizer** and is otherwise **identical** to the other TokSuite models in architecture, training data, training budget, and initialization. The TokSuite setup ensures that any observed behavioral characteristics reflect properties of the tokenizer rather than differences in model scale, data composition, or optimization.
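
The checkpoint can be used with standard `transformers` text-generation APIs. A minimal sketch; the repository ID `toksuite/toksuite-bloom` is an assumption, so substitute this repo's actual ID:

```python
# Minimal generation sketch; the repo ID below is an assumption --
# replace it with this repository's actual ID.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "toksuite/toksuite-bloom"  # hypothetical ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

inputs = tokenizer("Tokenizer choice affects", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```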

---

## Tokenizer

- **Tokenizer:** BLOOM
- **Tokenization method:** BPE
- **Vocabulary size:** 250,680
- **Out-of-vocabulary handling:** Byte-fallback
- **Language coverage:** Multilingual
- **Pretokenization source:** BLOOM

**Processing details:**
- **Numbers:** Learned
- **Contractions:** Learned
- **Unicode normalization:** None
- **Whitespace / boundary markers:** Learned
- **Zero-width characters:** Token
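
These properties can be checked directly from the tokenizer. A small sketch (the repository ID is the same assumption as above):

```python
# Inspect vocabulary size and byte-fallback behavior of the tokenizer.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("toksuite/toksuite-bloom")  # hypothetical ID
print(tokenizer.vocab_size)  # 250,680 per the table above

# Byte-fallback: characters outside the learned merges decompose into
# byte-level tokens instead of collapsing to a single unknown token.
print(tokenizer.tokenize("naïve café 🦙"))
print(tokenizer.tokenize("don't"))  # contractions are learned, not rule-split
```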

## Why BLOOM?

BLOOM was included in TokSuite to represent a **large-vocabulary multilingual BPE tokenizer** trained for broad cross-lingual coverage. As described in the tokenizer selection rationale of the TokSuite paper, BLOOM exemplifies a design choice that prioritizes extensive vocabulary capacity while maintaining subword-based segmentation.

Including BLOOM enables TokSuite to study tokenizer behavior in settings where:
- vocabulary size is large,
- segmentation follows BPE-style merges,
- and multilingual text is handled through a shared tokenizer.

This makes BLOOM a representative example of multilingual BPE tokenization.

---

## Model Architecture

- **Architecture:** Decoder-only Transformer (Lingua's Llama-3.2-1B configuration)
- **Non-embedding parameters:** ~1B
- **Context length:** 4096 tokens
- **Framework:** Meta Lingua
- **Initialization:** Shared super-vocabulary initialization across TokSuite models

The architecture and training setup are identical across all TokSuite models; only the tokenizer differs.
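
Note that the ~1B figure excludes embeddings, so the 250,680-entry multilingual embedding adds substantially to the total parameter count. A sketch of the distinction, using a heuristic name match on embedding/head parameters (repo ID again hypothetical):

```python
# Separate embedding from non-embedding parameter counts.
# The "embed"/"lm_head" name match is a heuristic, not an exact rule.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("toksuite/toksuite-bloom")  # hypothetical ID
total = sum(p.numel() for p in model.parameters())
embed = sum(p.numel() for n, p in model.named_parameters()
            if "embed" in n or "lm_head" in n)
print(f"total={total:,}  embedding/head={embed:,}  non-embedding={total - embed:,}")
```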

---

## Training Data

The model was trained on a **multilingual corpus totaling approximately 100B tokens**, composed of:

- **English:** 40B tokens from *FineWeb-Edu*
- **Multilingual:** 60B tokens evenly distributed across:
  - Chinese (ZH)
  - Turkish (TR)
  - Italian (IT)
  - Farsi (FA)

You can find the pretraining dataset here: [toksuite/toksuite_pretraining_data](https://huggingface.co/datasets/toksuite/toksuite_pretraining_data)

All models in TokSuite are trained using a **fixed token budget**, reflecting common practice in large language model training.
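
The corpus can be streamed without a full download. A sketch, assuming a `train` split and a `text` field (check the dataset card for the actual schema):

```python
# Stream a few examples from the pretraining corpus; split and field
# names are assumptions -- see the dataset card for the real schema.
from datasets import load_dataset

ds = load_dataset("toksuite/toksuite_pretraining_data", split="train", streaming=True)
for example in ds.take(3):
    print(example["text"][:200])
```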

---

## Training Procedure

- **Training steps:** 100,000
- **Sequence length:** 4096
- **Batch size:** 256 sequences
- **Optimizer:** AdamW
- **Peak learning rate:** 1e-3
- **Learning rate schedule:** Cosine decay with 2,000 warm-up steps
- **Weight decay:** 0.1
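
This recipe implies roughly 100,000 steps × 256 sequences × 4,096 tokens ≈ 105B tokens, consistent with the ~100B-token budget above. For reference, the stated schedule is linear warm-up followed by cosine decay; an illustrative PyTorch sketch (not the actual Lingua training code):

```python
# Illustrative sketch of the stated recipe: AdamW, peak LR 1e-3,
# weight decay 0.1, 2,000 warm-up steps, cosine decay over 100,000 steps.
import math
import torch

model = torch.nn.Linear(8, 8)  # stand-in for the 1B-parameter model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.1)

warmup, total = 2_000, 100_000

def lr_lambda(step: int) -> float:
    if step < warmup:
        return step / warmup  # linear warm-up to the peak LR
    progress = (step - warmup) / (total - warmup)
    return 0.5 * (1.0 + math.cos(math.pi * progress))  # cosine decay to 0

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```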

---

## Evaluation

### Canonical Benchmarks

The model was evaluated on standard base language model benchmarks:
- HellaSwag
- ARC
- PIQA
- XNLI

<p align="left">
  <img src="./model-performance-comparison.png" alt="Model performance comparison across TokSuite models" width="700"/>
</p>

These evaluations verify that the model exhibits reasonable base language modeling behavior at its scale and training budget.
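
Comparable numbers can be reproduced with EleutherAI's `lm-evaluation-harness`; a sketch in which the task names and repo ID are assumptions (check the harness task registry):

```python
# Sketch of evaluating with lm-evaluation-harness (pip install lm-eval).
# Task names and the model ID are assumptions, not confirmed by this card.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=toksuite/toksuite-bloom",  # hypothetical ID
    tasks=["hellaswag", "arc_easy", "piqa", "xnli_en"],
)
print(results["results"])
```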

### TokSuite Robustness Benchmark

TokSuite–BLOOM is evaluated on the **TokSuite robustness benchmark**, which measures sensitivity to real-world text perturbations, including:

- orthographic and spelling variations,
- diacritics presence and absence,
- keyboard and input-method noise,
- Unicode formatting and homoglyphs,
- OCR and spacing artifacts,
- LaTeX and STEM-style formatting.

**Tokenization Robustness under Multilingual Text Perturbations**  
Values represent **relative performance drop**, computed as `(Acc_clean − Acc_perturbed) / Acc_clean`, where **lower values indicate greater robustness**.
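
A worked instance of the metric, with made-up accuracies for illustration:

```python
# Relative performance drop: (Acc_clean - Acc_perturbed) / Acc_clean.
def relative_drop(acc_clean: float, acc_perturbed: float) -> float:
    return (acc_clean - acc_perturbed) / acc_clean

# Hypothetical accuracies: 60% clean, 42% perturbed -> 0.30 relative drop.
print(round(relative_drop(0.60, 0.42), 2))  # 0.3
```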

Perturbation types include:  
- **Input:** non-native keyboard input and romanization  
- **Diacr.:** optional diacritics  
- **Orth.& Gram.:** orthographic and grammatical errors  
- **Morph:** morphological variations including derivations, inflections, and contractions  
- **Noise:** homoglyph substitutions, OCR artifacts, typos, and spacing errors  
- **LaTeX:** LaTeX-style mathematical formatting  
- **STEM:** scientific diagrams and notational conventions  
- **Unic.:** Unicode styling characters  

**NEN** denotes non-English inputs and **EN** denotes English inputs. The **Avg** column reports the average relative performance drop across all perturbation categories.

| Model | Input (NEN) | Diacr. (NEN) | Orth. & Gram. (EN) | Orth. & Gram. (NEN) | Morph (EN) | Morph (NEN) | Noise (EN) | Noise (NEN) | LaTeX (EN) | STEM (EN) | Unic. (EN) | Avg ↓ |
|-------|-------------|--------------|--------------------|---------------------|------------|-------------|------------|-------------|------------|-----------|------------|-------|
| TokenMonster | **0.23** | **0.33** | 0.08 | **0.01** | 0.23 | **-0.07** | **0.10** | **0.18** | 0.21 | **0.10** | 0.51 | **0.17** |
| XGLM | 0.34 | 0.49 | 0.10 | 0.11 | 0.25 | 0.07 | 0.12 | 0.22 | **0.29** | 0.29 | **0.11** | 0.22 |
| BLOOM | 0.30 | 0.34 | 0.13 | 0.07 | **0.18** | **0.11** | 0.18 | **0.18** | 0.24 | 0.11 | 0.57 | 0.22 |
| ByT5 | 0.30 | 0.44 | **0.04** | 0.06 | 0.27 | 0.04 | 0.14 | **0.18** | 0.17 | 0.29 | 0.53 | 0.22 |
| Comma | 0.28 | 0.43 | 0.05 | 0.07 | **0.18** | 0.00 | 0.11 | 0.20 | 0.23 | 0.29 | 0.61 | 0.22 |
| mBERT | 0.33 | 0.44 | 0.11 | 0.11 | 0.23 | 0.06 | 0.18 | 0.22 | **0.14** | 0.22 | **0.61** | 0.24 |
| GPT-4o | 0.30 | 0.51 | 0.08 | 0.05 | 0.21 | 0.05 | 0.16 | 0.19 | 0.24 | 0.33 | 0.55 | 0.24 |
| GPT-2 | 0.34 | 0.46 | 0.07 | 0.10 | 0.25 | 0.06 | 0.14 | 0.21 | 0.24 | 0.35 | 0.53 | 0.25 |
| Phi-3 | 0.33 | 0.46 | 0.16 | **0.09** | 0.27 | 0.08 | 0.17 | 0.21 | 0.24 | 0.22 | 0.55 | 0.25 |
| Gemma-2 | 0.32 | 0.42 | 0.14 | **0.15** | 0.24 | 0.03 | 0.16 | 0.25 | 0.22 | 0.36 | 0.57 | 0.26 |
| Qwen-3 | **0.36** | 0.42 | 0.14 | 0.11 | 0.25 | 0.06 | 0.16 | 0.23 | 0.26 | 0.29 | 0.57 | 0.26 |
| Llama-3.2 | 0.33 | **0.55** | 0.11 | 0.10 | 0.25 | 0.08 | 0.15 | 0.24 | 0.17 | 0.30 | 0.59 | 0.26 |
| Aya | 0.31 | 0.46 | 0.14 | 0.10 | 0.22 | 0.03 | **0.19** | **0.25** | 0.21 | 0.38 | 0.58 | 0.26 |
| Tekken | 0.33 | 0.47 | **0.18** | 0.03 | **0.31** | 0.10 | 0.14 | 0.21 | 0.27 | **0.43** | 0.54 | **0.27** |
| **Avg** | 0.31 | 0.44 | 0.11 | 0.08 | 0.24 | **0.04** | 0.15 | 0.21 | 0.22 | 0.28 | **0.53** | 0.24 |

---

## Intended Use

This model is intended for:
- research on tokenization and robustness,
- multilingual NLP analysis,
- controlled ablation studies,
- benchmarking tokenizer behavior under noise.

It is **not** instruction-tuned, aligned, or optimized for deployment.

---

## Limitations

- Trained on a limited set of five languages.
- Not optimized for instruction following or dialogue.
- The fixed token budget constrains how much raw text the model sees, depending on tokenization efficiency.
- Intended strictly for research purposes.

---

## Ethical Considerations

TokSuite models are released to support **scientific analysis of tokenization**.  
They may reflect biases present in large-scale web data and should not be used in high-stakes or user-facing applications without additional alignment and evaluation.

---

## Citation

If you use this model, please cite:

```bibtex
@article{toksuite2025,
  title={TokSuite: Measuring the Impact of Tokenizer Choice on Language Model Behavior},
  author={Altıntaş, Gul Sena and Ehghaghi, Malikeh and Lester, Brian and Liu, Fengyuan and Zhao, Wanru and Ciccone, Marco and Raffel, Colin},
  journal={arXiv preprint arXiv:2512.20757},
  year={2025},
  url={https://arxiv.org/abs/2512.20757},
}
```