almaghrabima committed
Commit 2d1f910 · verified · 1 Parent(s): d3c23ec

Update README.md

Files changed (1)
  1. README.md +27 -23
README.md CHANGED
@@ -82,33 +82,37 @@ print(text)
 
 ### Tokenizer Benchmark Results
 
- Comparison with state-of-the-art tokenizers (5 runs, 5000 samples each).
-
- **Dataset used:** [almaghrabima/deeplatent-benchmark-data](https://huggingface.co/datasets/almaghrabima/deeplatent-benchmark-data) (60k samples: 30k Arabic + 30k English)
-
- | Tokenizer | Vocab | AR Fert | EN Fert | Avg Fert | Parity | Fert Rank | Parity Rank |
- |-----------|-------|---------|---------|----------|--------|-----------|-------------|
- | **SARFTokenizer** | 64,641 | 1.71 | 1.57 | **1.64** | 1.155 | **#1** | #2 |
- | ALLaM-7B | 64,000 | 1.81 | 1.48 | 1.65 | 1.162 | #2 | #3 |
- | Falcon-H1-7B | 130,049 | 2.64 | 1.55 | 2.10 | **0.926** | #3 | **#1** |
- | Fanar-1-9B | 128,256 | 2.85 | 1.36 | 2.10 | 0.774 | #4 | #4 |
- | Hala-9B | 128,256 | 2.85 | 1.36 | 2.10 | 0.774 | #5 | #5 |
- | GPT-4o | 200,019 | 2.81 | 1.44 | 2.12 | 0.725 | #6 | #6 |
- | Command-R-Arabic | 255,033 | 3.00 | 1.33 | 2.16 | 0.713 | #7 | #7 |
- | Qwen3-4B | 151,669 | 3.05 | 1.50 | 2.28 | 0.696 | #8 | #8 |
- | GPT-4 | 100,277 | 4.59 | 1.50 | 3.05 | 0.416 | #9 | #10 |
- | Mistral-7B-v0.3 | 32,768 | 5.56 | 1.48 | 3.52 | 0.417 | #10 | #9 |
+ Comparison with state-of-the-art tokenizers on 60,000 samples (30k Arabic + 30k English).
+
+ **Dataset:** [almaghrabima/deeplatent-benchmark-data](https://huggingface.co/datasets/almaghrabima/deeplatent-benchmark-data)
+
+ | Tokenizer | Vocab | AR Fert | EN Fert | Avg Fert | AR C/T | EN C/T | Parity |
+ |-----------|-------|---------|---------|----------|--------|--------|--------|
+ | **SARFTokenizer** | 64,641 | **1.72** | 1.57 | **1.64** | 3.45 | 2.99 | 1.156 |
+ | ALLaM-7B | 64,000 | 1.82 | 1.48 | 1.65 | 3.08 | 2.65 | 1.163 |
+ | Gemma-3-4B | 262,145 | 2.78 | 1.33 | 2.05 | 2.42 | 3.00 | 0.805 |
+ | Falcon-H1-7B | 130,049 | 2.65 | 1.55 | 2.10 | 2.55 | 2.75 | **0.926** |
+ | Fanar-1-9B | 128,256 | 2.85 | 1.36 | 2.11 | 2.27 | 2.93 | 0.775 |
+ | Hala-9B | 128,256 | 2.85 | 1.36 | 2.11 | 2.27 | 2.93 | 0.775 |
+ | GPT-4o | 200,019 | 2.81 | 1.44 | 2.12 | 2.45 | 3.37 | 0.726 |
+ | Command-R-Arabic | 255,033 | 3.00 | 1.33 | 2.16 | 2.17 | 3.04 | 0.714 |
+ | Qwen3-4B | 151,669 | 3.06 | 1.50 | 2.28 | 2.04 | 2.92 | 0.697 |
+ | GPT-4 | 100,277 | 4.59 | 1.50 | 3.05 | 1.35 | 3.24 | 0.417 |
+ | Mistral-7B-v0.3 | 32,768 | 5.56 | 1.48 | 3.52 | 1.11 | 2.64 | 0.418 |
 
 **Metrics explained:**
- - **Fertility**: Average tokens per word (lower is better)
+ - **Fertility**: Average tokens per word (lower is better; more efficient encoding)
+ - **C/T**: Characters per token (higher is better; more characters encoded per token)
 - **Parity**: AR chars/token ÷ EN chars/token (1.0 = equal treatment of both languages)
 
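+ These metrics can be reproduced in a few lines. A minimal sketch (illustrative only: the whitespace word-splitting, the sample texts, and the `AutoTokenizer` loading path are assumptions, not the exact benchmark code):
+
+ ```python
+ from transformers import AutoTokenizer
+
+ def fertility_and_cpt(tokenizer, texts):
+     """Return (fertility, chars-per-token) aggregated over a list of texts."""
+     n_tokens = n_words = n_chars = 0
+     for text in texts:
+         n_tokens += len(tokenizer.encode(text, add_special_tokens=False))
+         n_words += len(text.split())  # assumption: whitespace-delimited words
+         n_chars += len(text)
+     return n_tokens / n_words, n_chars / n_tokens
+
+ # Tiny illustrative samples; the benchmark uses the dataset linked above.
+ ar_texts = ["تشرق الشمس كل صباح"]
+ en_texts = ["The sun rises every morning"]
+
+ tok = AutoTokenizer.from_pretrained("almaghrabima/SARFTokenizer")  # assumed loading path
+ ar_fert, ar_cpt = fertility_and_cpt(tok, ar_texts)
+ en_fert, en_cpt = fertility_and_cpt(tok, en_texts)
+ parity = ar_cpt / en_cpt  # e.g. SARF from the table: 3.45 / 2.99 ≈ 1.156
+ ```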
 
 **Key findings:**
- - **SARFTokenizer ranks #1 in fertility** (1.64 avg tokens/word) and #2 in parity (1.155)
- - **Falcon-H1-7B has best parity** (0.926) but lower fertility efficiency
- - **SARFTokenizer achieves best Arabic fertility** (1.71 tokens/word vs 2.6+ for others)
- - Morpheme-aware encoding significantly improves Arabic tokenization efficiency
- - SARFTokenizer uses smallest vocab (64k) among top performers
+ - **SARFTokenizer achieves the best Arabic fertility** (1.72 tokens/word), about 39% lower than GPT-4o's 2.81
+ - **Lowest average fertility** (1.64) among all tokenizers tested
+ - **Best Arabic characters/token** (3.45): encodes more Arabic per token than any competitor
+ - Compact vocabulary (64k) while maintaining top performance
+ - ALLaM-7B shows similar efficiency (both use morpheme-aware approaches)
+ - Falcon-H1-7B has the best parity (0.926) but 28% higher average fertility than SARF (2.10 vs 1.64)
+ - GPT-4 and Mistral struggle with Arabic (4.6-5.6 tokens/word vs 1.7 for SARF)
 
 ### Throughput Benchmark (1M samples, 680 MB)
 
@@ -195,4 +199,4 @@ CC-BY-NC-4.0
 url={https://huggingface.co/almaghrabima/SARFTokenizer},
 note={Independent research, part of Suhail Project}
 }
-```
+```