Update benchmark results with new tokenizers (Falcon-H1, ALLaM, Hala, Mistral)
README.md CHANGED
@@ -53,25 +53,29 @@ Comparison with state-of-the-art tokenizers (5 runs, 5000 samples each).
**Dataset used:** [almaghrabima/deeplatent-benchmark-data](https://huggingface.co/datasets/almaghrabima/deeplatent-benchmark-data) (60k samples: 30k Arabic + 30k English)
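
To reproduce a run, the corpus can be pulled straight from the Hub. Below is a minimal loading sketch; the split name, the `text`/`language` column names, and the sampling scheme are illustrative assumptions — check the dataset card for the actual schema:

```python
from datasets import load_dataset

# Load the benchmark corpus (60k samples: 30k Arabic + 30k English).
# NOTE: the split and column names ("train", "text", "language") are
# assumptions for illustration; see the dataset card for the real schema.
ds = load_dataset("almaghrabima/deeplatent-benchmark-data", split="train")

# Each benchmark run draws 5,000 samples; the table averages 5 such runs.
run = ds.shuffle(seed=0).select(range(5000))
ar_texts = [r["text"] for r in run if r["language"] == "ar"]
en_texts = [r["text"] for r in run if r["language"] == "en"]
```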

| Tokenizer | Vocab | AR Fert | EN Fert | Avg Fert | Parity | Fert Rank | Parity Rank |
|-----------|-------|---------|---------|----------|--------|-----------|-------------|
| **SARFTokenizer** | 64,641 | 1.71 | 1.57 | **1.64** | 1.155 | **#1** | #2 |
| ALLaM-7B | 64,000 | 1.81 | 1.48 | 1.65 | 1.162 | #2 | #3 |
| Falcon-H1-7B | 130,049 | 2.64 | 1.55 | 2.10 | **0.926** | #3 | **#1** |
| Fanar-1-9B | 128,256 | 2.85 | 1.36 | 2.10 | 0.774 | #4 | #4 |
| Hala-9B | 128,256 | 2.85 | 1.36 | 2.10 | 0.774 | #5 | #5 |
| GPT-4o | 200,019 | 2.81 | 1.44 | 2.12 | 0.725 | #6 | #6 |
| Command-R-Arabic | 255,033 | 3.00 | 1.33 | 2.16 | 0.713 | #7 | #7 |
| Qwen3-4B | 151,669 | 3.05 | 1.50 | 2.28 | 0.696 | #8 | #8 |
| GPT-4 | 100,277 | 4.59 | 1.50 | 3.05 | 0.416 | #9 | #10 |
| Mistral-7B-v0.3 | 32,768 | 5.56 | 1.48 | 3.52 | 0.417 | #10 | #9 |

**Metrics explained:**
- **Fertility**: Average tokens per word (lower is better)
- **Parity**: AR chars/token ÷ EN chars/token (1.0 = equal treatment of both languages)
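
As a concrete reading of the two definitions above, here is a minimal sketch of the computation with a Hugging Face tokenizer. The whitespace word split and the tokenizer choice are illustrative assumptions rather than the benchmark's exact harness (the table's Avg Fert values are consistent with a simple mean of the AR and EN figures):

```python
from transformers import AutoTokenizer

def token_count(tok, texts):
    """Total tokens over a list of texts (no special tokens)."""
    return sum(len(tok.encode(t, add_special_tokens=False)) for t in texts)

def fertility(tok, texts):
    """Average tokens per whitespace-delimited word (lower is better)."""
    return token_count(tok, texts) / sum(len(t.split()) for t in texts)

def chars_per_token(tok, texts):
    """Average characters per token; the building block of parity."""
    return sum(len(t) for t in texts) / token_count(tok, texts)

# Any tokenizer from the table works here; Mistral is only an example.
tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.3")

# Tiny stand-ins; substitute ar_texts/en_texts from the loading sketch above.
ar_texts = ["السلام عليكم ورحمة الله"]
en_texts = ["Peace be upon you all"]

ar_fert, en_fert = fertility(tok, ar_texts), fertility(tok, en_texts)
avg_fert = (ar_fert + en_fert) / 2  # "Avg Fert" column
parity = chars_per_token(tok, ar_texts) / chars_per_token(tok, en_texts)  # 1.0 = parity
```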
**Key findings:**
- **SARFTokenizer ranks #1 in fertility** (1.64 avg tokens/word) and #2 in parity (1.155)
- **Falcon-H1-7B has the best parity** (0.926) but weaker fertility (2.10 avg tokens/word, ranking #3)
- **SARFTokenizer achieves the best Arabic fertility** (1.71 tokens/word, vs 1.81 for ALLaM-7B and 2.6+ for the rest)
- Morpheme-aware encoding significantly improves Arabic tokenization efficiency
- SARFTokenizer does this with a compact ~64k vocabulary, roughly half the size of most competitors' (128k–255k)
### Throughput Benchmark (1M samples, 680 MB)