Malikeh1375 committed on
Commit c48258c · verified · 1 Parent(s): 0e25ab6

Update README.md

Files changed (1)
  1. README.md +17 -12
README.md CHANGED
@@ -44,12 +44,12 @@ This model uses the **BLOOM tokenizer** and is otherwise **identical** to the ot
  - **Language coverage:** Multilingual
  - **Pretokenization source:** BLOOM

- **Processing details (Table 3):**
- - **Numbers:** Split into individual digits
+ **Processing details:**
+ - **Numbers:** Learned
  - **Contractions:** Learned
- - **Unicode normalization:** NFKC
- - **Whitespace / boundary markers:** ▁ (SentencePiece-style whitespace marker)
- - **Continuation / subword markers:** BPE continuation tokens
+ - **Unicode normalization:** None
+ - **Whitespace / boundary markers:** Learned
+ - **Zerowidth chars:** Token

  ## Why BLOOM?
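
The processing details updated above (learned number handling, no Unicode normalization, learned whitespace markers) can be inspected directly on the tokenizer. A minimal sketch, not part of this commit; loading the shared BLOOM tokenizer from the `bigscience/bloom` repo is an assumption, and the TokSuite-BLOOM checkpoint may bundle the same tokenizer under its own repo id:

```python
# Minimal sketch (not part of this commit): inspect how the BLOOM tokenizer
# segments numbers and Unicode variants. The "bigscience/bloom" repo id is an
# assumption about where the shared BLOOM tokenizer lives.
import unicodedata
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bigscience/bloom")

# Numbers: per the updated card these are learned rather than force-split into
# digits, so print the segmentation of a multi-digit string to see the merges.
print(tok.tokenize("The year 2023 had 365 days."))

# Unicode normalization: none is applied, so NFC and NFD spellings of the same
# word can map to different token sequences.
word = "café"
print(tok.tokenize(unicodedata.normalize("NFC", word)))
print(tok.tokenize(unicodedata.normalize("NFD", word)))
```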
 
@@ -87,6 +87,8 @@ The model was trained on a **multilingual corpus totaling approximately 100B tok
  - Italian (IT)
  - Farsi (FA)

+ You can find the pretraining dataset here: [toksuite/toksuite_pretraining_data](https://huggingface.co/datasets/toksuite/toksuite_pretraining_data)
+
  All models in TokSuite are trained using a **fixed token budget**, reflecting common practice in large language model training.

  ---
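
The dataset link added above points at a Hub dataset repo. A minimal sketch of pulling a few examples with the `datasets` library; the `train` split name and streaming access are assumptions, so check the dataset card for the actual configs and splits:

```python
# Minimal sketch (not part of this commit): stream a few examples from the
# linked pretraining dataset. The "train" split name is an assumption.
from datasets import load_dataset

ds = load_dataset("toksuite/toksuite_pretraining_data", split="train", streaming=True)
for example in ds.take(3):
    print(example)
```
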
@@ -113,8 +115,11 @@ The model was evaluated on standard base language model benchmarks:
  - PIQA
  - XNLI

- These evaluations verify that the model exhibits reasonable base language modeling behavior given its scale and training budget.

+ Performance differences across TokSuite models on these benchmarks arise **solely from tokenizer choice**.
+ <p align="left">
+   <img src="./model-performance-comparison.png" alt="Model performance comparison across TokSuite models" width="700"/>
+ </p>
  ### TokSuite Robustness Benchmark

  TokSuite–BLOOM is evaluated on the **TokSuite robustness benchmark**, which measures sensitivity to real-world text perturbations, including:
@@ -132,7 +137,7 @@ Values represent **relative performance drop**, computed as `(Acc_clean − Acc_pert
  Perturbation types include:
  - **Input:** non-native keyboard input and romanization
  - **Diacr.:** optional diacritics
- - **Orth.:** orthographic errors
+ - **Orth. & Gram.:** orthographic and grammatical errors
  - **Morph:** morphological variations including derivations, inflections, and contractions
  - **Noise:** homoglyph substitutions, OCR artifacts, typos, and spacing errors
  - **LaTeX:** LaTeX-style mathematical formatting
@@ -141,13 +146,13 @@ Perturbation types include:

  **NEN** denotes non-English inputs and **EN** denotes English inputs. The **Avg** column reports the average relative performance drop across all perturbation categories.

- | Model | Input (NEN) | Diacr. (NEN) | Orth. (EN) | Gram. (NEN) | Morph (EN) | Morph (NEN) | Noise (EN) | Noise (NEN) | LaTeX (EN) | STEM (EN) | Unic. (EN) | Avg ↓ |
- |-------|-------------|--------------|------------|-------------|------------|-------------|------------|-------------|------------|-----------|------------|-------|
- | TokenMonster | **0.23** | **0.33** | 0.08 | **0.01** | 0.23 | **-0.07** | **0.10** | 0.18 | 0.21 | **0.10** | 0.51 | **0.17** |
+ | Model | Input (NEN) | Diacr. (NEN) | Orth. & Gram. (EN) | Orth. & Gram. (NEN) | Morph (EN) | Morph (NEN) | Noise (EN) | Noise (NEN) | LaTeX (EN) | STEM (EN) | Unic. (EN) | Avg ↓ |
+ |-------|-------------|--------------|--------------------|---------------------|------------|-------------|------------|-------------|------------|-----------|------------|-------|
+ | TokenMonster | **0.23** | **0.33** | 0.08 | **0.01** | 0.23 | **-0.07** | **0.10** | **0.18** | 0.21 | **0.10** | 0.51 | **0.17** |
  | XGLM | 0.34 | 0.49 | 0.10 | 0.11 | 0.25 | 0.07 | 0.12 | 0.22 | **0.29** | 0.29 | **0.11** | 0.22 |
- | BLOOM | 0.30 | 0.34 | 0.13 | 0.07 | 0.18 | **0.11** | 0.18 | 0.18 | 0.24 | 0.11 | 0.57 | 0.22 |
+ | BLOOM | 0.30 | 0.34 | 0.13 | 0.07 | **0.18** | **0.11** | 0.18 | **0.18** | 0.24 | 0.11 | 0.57 | 0.22 |
  | ByT5 | 0.30 | 0.44 | **0.04** | 0.06 | 0.27 | 0.04 | 0.14 | **0.18** | 0.17 | 0.29 | 0.53 | 0.22 |
- | Comma | 0.28 | 0.43 | 0.05 | 0.07 | **0.18** | -0.00 | 0.11 | 0.20 | 0.23 | 0.29 | 0.61 | 0.22 |
+ | Comma | 0.28 | 0.43 | 0.05 | 0.07 | **0.18** | 0.00 | 0.11 | 0.20 | 0.23 | 0.29 | 0.61 | 0.22 |
  | mBERT | 0.33 | 0.44 | 0.11 | 0.11 | 0.23 | 0.06 | 0.18 | 0.22 | **0.14** | 0.22 | **0.61** | 0.24 |
  | GPT-4o | 0.30 | 0.51 | 0.08 | 0.05 | 0.21 | 0.05 | 0.16 | 0.19 | 0.24 | 0.33 | 0.55 | 0.24 |
  | GPT-2 | 0.34 | 0.46 | 0.07 | 0.10 | 0.25 | 0.06 | 0.14 | 0.21 | 0.24 | 0.35 | 0.53 | 0.25 |
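
As a sanity check on how the table is read: each cell is a relative performance drop, and the **Avg** column is the mean over the perturbation columns. A minimal sketch; the formula is truncated in the hunk header above, and its completion as `(Acc_clean − Acc_perturbed) / Acc_clean` is assumed from the stated definition of relative performance drop:

```python
# Minimal sketch (not part of this commit): the robustness metric as described,
# assuming the truncated formula completes as (Acc_clean - Acc_perturbed) / Acc_clean.
def relative_drop(acc_clean: float, acc_perturbed: float) -> float:
    return (acc_clean - acc_perturbed) / acc_clean

# A model scoring 0.60 clean and 0.48 under a perturbation drops by 0.20.
print(round(relative_drop(0.60, 0.48), 2))  # 0.2

# The Avg column is the mean drop across the eleven perturbation columns,
# e.g. the BLOOM row above averages to ~0.22.
bloom_row = [0.30, 0.34, 0.13, 0.07, 0.18, 0.11, 0.18, 0.18, 0.24, 0.11, 0.57]
print(round(sum(bloom_row) / len(bloom_row), 2))  # 0.22
```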
 