Update README.md
- **Language coverage:** Multilingual
- **Pretokenization source:** BLOOM

**Processing details:**
- **Numbers:** Learned
- **Contractions:** Learned
- **Unicode normalization:** None
- **Whitespace / boundary markers:** Learned
- **Zero-width chars:** Token
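Two of these choices are easy to see at the string level. Below is a minimal Python sketch (standard library only, independent of the actual tokenizer files) of why "no Unicode normalization" and keeping zero-width characters as tokens matter:

```python
import unicodedata

# With no normalization, NFC and NFD spellings of the same word reach
# the tokenizer as different code-point sequences.
nfc = unicodedata.normalize("NFC", "café")  # precomposed 'é'
nfd = unicodedata.normalize("NFD", "café")  # 'e' + combining acute accent
assert nfc != nfd
assert len(nfc) == 4 and len(nfd) == 5

# Zero-width characters are kept as tokens rather than silently stripped.
# The zero-width non-joiner (U+200C) is linguistically meaningful in Farsi,
# one of the covered languages.
word = "می\u200cرود"  # Farsi "miravad" ("goes"), joined by a ZWNJ
assert "\u200c" in word
```

A tokenizer without normalization therefore assigns different token sequences to visually identical inputs, which is exactly what Unicode-level perturbations probe.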

## Why BLOOM?
- Italian (IT)
- Farsi (FA)

You can find the pretraining dataset here: [toksuite/toksuite_pretraining_data](https://huggingface.co/datasets/toksuite/toksuite_pretraining_data)

All models in TokSuite are trained using a **fixed token budget**, reflecting common practice in large language model training.
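To make the fixed-budget idea concrete, here is a back-of-the-envelope sketch; the budget, sequence length, and batch size below are illustrative assumptions, not TokSuite's actual training configuration:

```python
# A fixed token budget pins down the number of optimizer steps,
# regardless of how much raw text those tokens cover.
# All hyperparameters below are illustrative assumptions.
token_budget = 100_000_000_000   # assumed total budget in tokens
seq_len = 2048                   # assumed context length
global_batch = 512               # assumed sequences per optimizer step

tokens_per_step = seq_len * global_batch
train_steps = token_budget // tokens_per_step
assert tokens_per_step == 1_048_576
assert train_steps == 95_367
```

Because the budget is counted in tokens, tokenizers that compress text differently see different amounts of raw text under the same budget.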

---
- PIQA
- XNLI

Performance differences across TokSuite models on these benchmarks arise **solely from tokenizer choice**.

<p align="left">
<img src="./model-performance-comparison.png" alt="Model performance comparison across TokSuite tokenizers" width="700"/>
</p>

### TokSuite Robustness Benchmark

TokSuite–BLOOM is evaluated on the **TokSuite robustness benchmark**, which measures sensitivity to real-world text perturbations, including:
Perturbation types include:
- **Input:** non-native keyboard input and romanization
- **Diacr.:** optional diacritics
- **Orth. & Gram.:** orthographic and grammatical errors
- **Morph:** morphological variations including derivations, inflections, and contractions
- **Noise:** homoglyph substitutions, OCR artifacts, typos, and spacing errors
- **LaTeX:** LaTeX-style mathematical formatting
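Two of these perturbation types can be sketched in a few lines of standard-library Python; these are illustrative toy generators, not the benchmark's actual implementation:

```python
import unicodedata

def strip_diacritics(text: str) -> str:
    # Diacr.: drop optional diacritics by decomposing to NFD and
    # removing the combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# Noise: swap Latin letters for visually identical Cyrillic homoglyphs.
HOMOGLYPHS = {"a": "\u0430", "e": "\u0435", "o": "\u043e"}

def homoglyph_noise(text: str) -> str:
    return "".join(HOMOGLYPHS.get(ch, ch) for ch in text)

assert strip_diacritics("naïve café") == "naive cafe"
assert homoglyph_noise("atom") != "atom"  # same glyphs, new code points
```

A robust tokenizer maps such perturbed strings to token sequences close to the clean ones; a brittle one fragments them, which is what the reported drops quantify.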
**NEN** denotes non-English inputs and **EN** denotes English inputs. The **Avg** column reports the average relative performance drop across all perturbation categories.

| Model | Input (NEN) | Diacr. (NEN) | Orth. & Gram. (EN) | Orth. & Gram. (NEN) | Morph (EN) | Morph (NEN) | Noise (EN) | Noise (NEN) | LaTeX (EN) | STEM (EN) | Unic. (EN) | Avg ↓ |
|-------|-------------|--------------|--------------------|---------------------|------------|-------------|------------|-------------|------------|-----------|------------|-------|
| TokenMonster | **0.23** | **0.33** | 0.08 | **0.01** | 0.23 | **-0.07** | **0.10** | **0.18** | 0.21 | **0.10** | 0.51 | **0.17** |
| XGLM | 0.34 | 0.49 | 0.10 | 0.11 | 0.25 | 0.07 | 0.12 | 0.22 | **0.29** | 0.29 | **0.11** | 0.22 |
| BLOOM | 0.30 | 0.34 | 0.13 | 0.07 | **0.18** | **0.11** | 0.18 | **0.18** | 0.24 | 0.11 | 0.57 | 0.22 |
| ByT5 | 0.30 | 0.44 | **0.04** | 0.06 | 0.27 | 0.04 | 0.14 | **0.18** | 0.17 | 0.29 | 0.53 | 0.22 |
| Comma | 0.28 | 0.43 | 0.05 | 0.07 | **0.18** | 0.00 | 0.11 | 0.20 | 0.23 | 0.29 | 0.61 | 0.22 |
| mBERT | 0.33 | 0.44 | 0.11 | 0.11 | 0.23 | 0.06 | 0.18 | 0.22 | **0.14** | 0.22 | **0.61** | 0.24 |
| GPT-4o | 0.30 | 0.51 | 0.08 | 0.05 | 0.21 | 0.05 | 0.16 | 0.19 | 0.24 | 0.33 | 0.55 | 0.24 |
| GPT-2 | 0.34 | 0.46 | 0.07 | 0.10 | 0.25 | 0.06 | 0.14 | 0.21 | 0.24 | 0.35 | 0.53 | 0.25 |
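For reference, a relative performance drop as used in this table can be computed as follows, assuming the usual clean-minus-perturbed definition normalized by clean accuracy:

```python
def relative_drop(acc_clean: float, acc_perturbed: float) -> float:
    # Positive: the perturbation hurt accuracy. Negative (e.g.
    # TokenMonster's -0.07 on Morph NEN): perturbed inputs scored higher.
    return (acc_clean - acc_perturbed) / acc_clean

# A model at 60% clean accuracy dropping to 48% under a perturbation:
assert abs(relative_drop(0.60, 0.48) - 0.20) < 1e-9
```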
|