Update README.md
- **Language coverage:** Multilingual
- **Pretokenization source:** BLOOM

**Processing details:**
- **Numbers:** Learned
- **Contractions:** Learned
- **Unicode normalization:** None
- **Whitespace / boundary markers:** Learned
- **Zero-width chars:** Token
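Two of these choices are easy to see at the string level. Below is a minimal Python sketch (standard library only, independent of the actual tokenizer files) of why "no Unicode normalization" and keeping zero-width characters as tokens matter:

```python
import unicodedata

# With no normalization, NFC and NFD spellings of the same word reach
# the tokenizer as different code-point sequences.
nfc = unicodedata.normalize("NFC", "café")  # precomposed 'é'
nfd = unicodedata.normalize("NFD", "café")  # 'e' + combining acute accent
assert nfc != nfd
assert len(nfc) == 4 and len(nfd) == 5

# Zero-width characters are kept as tokens rather than silently stripped.
# The zero-width non-joiner (U+200C) is linguistically meaningful in Farsi,
# one of the covered languages.
word = "می\u200cرود"  # Farsi "miravad" ("goes"), joined by a ZWNJ
assert "\u200c" in word
```

A tokenizer without normalization therefore assigns different token sequences to visually identical inputs, which is exactly what Unicode-level perturbations probe.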

## Why BLOOM?
- Italian (IT)
- Farsi (FA)

You can find the pretraining dataset here: [toksuite/toksuite_pretraining_data](https://huggingface.co/datasets/toksuite/toksuite_pretraining_data)

All models in TokSuite are trained using a **fixed token budget**, reflecting common practice in large language model training.
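To make the fixed-budget idea concrete, here is a back-of-the-envelope sketch; the budget, sequence length, and batch size below are illustrative assumptions, not TokSuite's actual training configuration:

```python
# A fixed token budget pins down the number of optimizer steps,
# regardless of how much raw text those tokens cover.
# All hyperparameters below are illustrative assumptions.
token_budget = 100_000_000_000   # assumed total budget in tokens
seq_len = 2048                   # assumed context length
global_batch = 512               # assumed sequences per optimizer step

tokens_per_step = seq_len * global_batch
train_steps = token_budget // tokens_per_step
assert tokens_per_step == 1_048_576
assert train_steps == 95_367
```

Because the budget is counted in tokens, tokenizers that compress text differently see different amounts of raw text under the same budget.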

---
- PIQA
- XNLI

Performance differences across TokSuite models on these benchmarks arise **solely from tokenizer choice**.

<p align="left">
<img src="./model-performance-comparison.png" alt="Model performance comparison across TokSuite tokenizers" width="700"/>
</p>

### TokSuite Robustness Benchmark

TokSuite–BLOOM is evaluated on the **TokSuite robustness benchmark**, which measures sensitivity to real-world text perturbations, including:
Perturbation types include:
- **Input:** non-native keyboard input and romanization
- **Diacr.:** optional diacritics
- **Orth. & Gram.:** orthographic and grammatical errors
- **Morph:** morphological variations including derivations, inflections, and contractions
- **Noise:** homoglyph substitutions, OCR artifacts, typos, and spacing errors
- **LaTeX:** LaTeX-style mathematical formatting
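Two of these perturbation types can be sketched in a few lines of standard-library Python; these are illustrative toy generators, not the benchmark's actual implementation:

```python
import unicodedata

def strip_diacritics(text: str) -> str:
    # Diacr.: drop optional diacritics by decomposing to NFD and
    # removing the combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# Noise: swap Latin letters for visually identical Cyrillic homoglyphs.
HOMOGLYPHS = {"a": "\u0430", "e": "\u0435", "o": "\u043e"}

def homoglyph_noise(text: str) -> str:
    return "".join(HOMOGLYPHS.get(ch, ch) for ch in text)

assert strip_diacritics("naïve café") == "naive cafe"
assert homoglyph_noise("atom") != "atom"  # same glyphs, new code points
```

A robust tokenizer maps such perturbed strings to token sequences close to the clean ones; a brittle one fragments them, which is what the reported drops quantify.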
**NEN** denotes non-English inputs and **EN** denotes English inputs. The **Avg** column reports the average relative performance drop across all perturbation categories.

| Model | Input (NEN) | Diacr. (NEN) | Orth. & Gram. (EN) | Orth. & Gram. (NEN) | Morph (EN) | Morph (NEN) | Noise (EN) | Noise (NEN) | LaTeX (EN) | STEM (EN) | Unic. (EN) | Avg ↓ |
|-------|-------------|--------------|--------------------|---------------------|------------|-------------|------------|-------------|------------|-----------|------------|-------|
| TokenMonster | **0.23** | **0.33** | 0.08 | **0.01** | 0.23 | **-0.07** | **0.10** | **0.18** | 0.21 | **0.10** | 0.51 | **0.17** |
| XGLM | 0.34 | 0.49 | 0.10 | 0.11 | 0.25 | 0.07 | 0.12 | 0.22 | **0.29** | 0.29 | **0.11** | 0.22 |
| BLOOM | 0.30 | 0.34 | 0.13 | 0.07 | **0.18** | **0.11** | 0.18 | **0.18** | 0.24 | 0.11 | 0.57 | 0.22 |
| ByT5 | 0.30 | 0.44 | **0.04** | 0.06 | 0.27 | 0.04 | 0.14 | **0.18** | 0.17 | 0.29 | 0.53 | 0.22 |
| Comma | 0.28 | 0.43 | 0.05 | 0.07 | **0.18** | 0.00 | 0.11 | 0.20 | 0.23 | 0.29 | 0.61 | 0.22 |
| mBERT | 0.33 | 0.44 | 0.11 | 0.11 | 0.23 | 0.06 | 0.18 | 0.22 | **0.14** | 0.22 | **0.61** | 0.24 |
| GPT-4o | 0.30 | 0.51 | 0.08 | 0.05 | 0.21 | 0.05 | 0.16 | 0.19 | 0.24 | 0.33 | 0.55 | 0.24 |
| GPT-2 | 0.34 | 0.46 | 0.07 | 0.10 | 0.25 | 0.06 | 0.14 | 0.21 | 0.24 | 0.35 | 0.53 | 0.25 |
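For reference, a relative performance drop as used in this table can be computed as follows, assuming the usual clean-minus-perturbed definition normalized by clean accuracy:

```python
def relative_drop(acc_clean: float, acc_perturbed: float) -> float:
    # Positive: the perturbation hurt accuracy. Negative (e.g.
    # TokenMonster's -0.07 on Morph NEN): perturbed inputs scored higher.
    return (acc_clean - acc_perturbed) / acc_clean

# A model at 60% clean accuracy dropping to 48% under a perturbation:
assert abs(relative_drop(0.60, 0.48) - 0.20) < 1e-9
```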
|