### Tokenizer Benchmark Results

Comparison with state-of-the-art tokenizers on 60,000 samples (30k Arabic + 30k English).

**Dataset:** [almaghrabima/deeplatent-benchmark-data](https://huggingface.co/datasets/almaghrabima/deeplatent-benchmark-data)

| Tokenizer | Vocab | AR Fert | EN Fert | Avg Fert | AR C/T | EN C/T | Parity |
|-----------|-------|---------|---------|----------|--------|--------|--------|
| **SARFTokenizer** | 64,641 | **1.72** | 1.57 | **1.64** | 3.45 | 2.99 | 1.156 |
| ALLaM-7B | 64,000 | 1.82 | 1.48 | 1.65 | 3.08 | 2.65 | 1.163 |
| Gemma-3-4B | 262,145 | 2.78 | 1.33 | 2.05 | 2.42 | 3.00 | 0.805 |
| Falcon-H1-7B | 130,049 | 2.65 | 1.55 | 2.10 | 2.55 | 2.75 | **0.926** |
| Fanar-1-9B | 128,256 | 2.85 | 1.36 | 2.11 | 2.27 | 2.93 | 0.775 |
| Hala-9B | 128,256 | 2.85 | 1.36 | 2.11 | 2.27 | 2.93 | 0.775 |
| GPT-4o | 200,019 | 2.81 | 1.44 | 2.12 | 2.45 | 3.37 | 0.726 |
| Command-R-Arabic | 255,033 | 3.00 | 1.33 | 2.16 | 2.17 | 3.04 | 0.714 |
| Qwen3-4B | 151,669 | 3.06 | 1.50 | 2.28 | 2.04 | 2.92 | 0.697 |
| GPT-4 | 100,277 | 4.59 | 1.50 | 3.05 | 1.35 | 3.24 | 0.417 |
| Mistral-7B-v0.3 | 32,768 | 5.56 | 1.48 | 3.52 | 1.11 | 2.64 | 0.418 |

**Metrics explained:**

- **Fertility**: Average tokens per word (lower is better; more efficient encoding)
- **C/T**: Characters per token (higher is better; more characters encoded per token)
- **Parity**: AR chars/token ÷ EN chars/token (1.0 = equal treatment of both languages)
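
The three metrics above can be sketched in a few lines of Python. The helper names and the exact counting conventions (whitespace-delimited words, characters counted with `len`) are assumptions for illustration, not the benchmark's published code; any callable that maps a string to a token list can stand in for a real tokenizer such as SARFTokenizer.

```python
# Illustrative sketch of the benchmark metrics (assumed conventions:
# words are whitespace-delimited, characters counted via Python's len).
# `tokenize` is any callable mapping a string to a list of tokens.

def fertility(texts, tokenize):
    """Average tokens per word; lower is better."""
    n_tokens = sum(len(tokenize(t)) for t in texts)
    n_words = sum(len(t.split()) for t in texts)
    return n_tokens / n_words

def chars_per_token(texts, tokenize):
    """Average characters encoded per token (C/T); higher is better."""
    n_chars = sum(len(t) for t in texts)
    n_tokens = sum(len(tokenize(t)) for t in texts)
    return n_chars / n_tokens

def parity(ar_texts, en_texts, tokenize):
    """AR chars/token divided by EN chars/token; 1.0 = equal treatment."""
    return chars_per_token(ar_texts, tokenize) / chars_per_token(en_texts, tokenize)

# Toy character-level "tokenizer", for demonstration only:
toy = lambda t: list(t.replace(" ", ""))
print(fertility(["ab cd", "efg"], toy))  # 7 tokens / 3 words ≈ 2.33
```

Under this reading, the table's Avg Fert is consistent with the mean of the AR and EN columns, e.g. (1.72 + 1.57) / 2 ≈ 1.64 for SARFTokenizer.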

**Key findings:**

- **SARFTokenizer achieves the best Arabic fertility** (1.72 tokens/word), 35% better than GPT-4o
- **Lowest average fertility** (1.64) among all tokenizers tested
- **Best Arabic characters/token** (3.45): encodes more Arabic per token than any competitor
- Compact vocabulary (64k) while maintaining top performance
- ALLaM-7B shows similar efficiency (both use morpheme-aware approaches)
- Falcon-H1-7B has the best parity (0.926) but 28% higher average fertility than SARF
- GPT-4 and Mistral struggle with Arabic (4.6-5.6 tokens/word vs 1.7 for SARF)

### Throughput Benchmark (1M samples, 680 MB)
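
A throughput figure of this kind can be reproduced with a minimal harness along these lines; the function and variable names are illustrative assumptions, not the repo's actual benchmark script.

```python
# Hypothetical throughput harness (illustrative names, not the repo's
# actual benchmark script): measures tokens/s and MB/s for any tokenizer.
import time

def throughput(texts, tokenize):
    start = time.perf_counter()
    n_tokens = sum(len(tokenize(t)) for t in texts)
    # Guard against a zero timer reading on very small inputs.
    elapsed = max(time.perf_counter() - start, 1e-9)
    n_bytes = sum(len(t.encode("utf-8")) for t in texts)
    return {"tokens_per_sec": n_tokens / elapsed,
            "mb_per_sec": n_bytes / elapsed / 1e6}
```

Note that only the tokenization call is timed; corpus size in bytes is computed outside the timed region.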

url={https://huggingface.co/almaghrabima/SARFTokenizer},
note={Independent research, part of Suhail Project}
}
```