almaghrabima commited on
Commit
55db6a1
·
verified ·
1 Parent(s): a88f8a7

Update benchmark results with new tokenizers (Falcon-H1, ALLaM, Hala, Mistral)

Browse files
Files changed (1) hide show
  1. README.md +17 -13
README.md CHANGED
@@ -53,25 +53,29 @@ Comparison with state-of-the-art tokenizers (5 runs, 5000 samples each).
53
 
54
  **Dataset used:** [almaghrabima/deeplatent-benchmark-data](https://huggingface.co/datasets/almaghrabima/deeplatent-benchmark-data) (60k samples: 30k Arabic + 30k English)
55
 
56
- | Rank | Tokenizer | Vocab | AR Fertility | EN Fertility | AR C/T | EN C/T | Parity |
57
- |------|-----------|-------|--------------|--------------|--------|--------|--------|
58
- | 1 | **SARFTokenizer** | 64,641 | 1.71 | 1.57 | 3.45 | 2.99 | **1.155** |
59
- | 2 | Gemma-3-4B | 262,145 | 2.78 | 1.33 | 2.42 | 3.01 | 0.804 |
60
- | 3 | Fanar-1-9B | 128,256 | 2.85 | 1.36 | 2.27 | 2.94 | 0.774 |
61
- | 4 | GPT-4o | 200,019 | 2.81 | 1.44 | 2.45 | 3.38 | 0.725 |
62
- | 5 | Command-R-Arabic | 255,033 | 3.00 | 1.33 | 2.17 | 3.04 | 0.713 |
63
- | 6 | Qwen3-4B | 151,669 | 3.05 | 1.50 | 2.04 | 2.93 | 0.696 |
64
- | 7 | GPT-4 | 100,277 | 4.59 | 1.50 | 1.35 | 3.25 | 0.416 |
 
 
 
65
 
66
  **Metrics explained:**
67
  - **Fertility**: Average tokens per word (lower is better)
68
- - **C/T**: Characters per token (higher is better - more compression)
69
- - **Parity**: AR C/T ÷ EN C/T (1.0 = equal treatment of both languages)
70
 
71
  **Key findings:**
72
- - SARFTokenizer achieves parity closest to 1.0 (1.155), meaning near-equal treatment of Arabic and English
73
- - SARF tokenizers have the lowest Arabic fertility (1.7 tokens/word vs 2.8+ for others)
 
74
  - Morpheme-aware encoding significantly improves Arabic tokenization efficiency
 
75
 
76
  ### Throughput Benchmark (1M samples, 680 MB)
77
 
 
53
 
54
  **Dataset used:** [almaghrabima/deeplatent-benchmark-data](https://huggingface.co/datasets/almaghrabima/deeplatent-benchmark-data) (60k samples: 30k Arabic + 30k English)
55
 
56
+ | Tokenizer | Vocab | AR Fert | EN Fert | Avg Fert | Parity | Fert Rank | Parity Rank |
57
+ |-----------|-------|---------|---------|----------|--------|-----------|-------------|
58
+ | **SARFTokenizer** | 64,641 | 1.71 | 1.57 | **1.64** | 1.155 | **#1** | #2 |
59
+ | ALLaM-7B | 64,000 | 1.81 | 1.48 | 1.65 | 1.162 | #2 | #3 |
60
+ | Falcon-H1-7B | 130,049 | 2.64 | 1.55 | 2.10 | **0.926** | #3 | **#1** |
61
+ | Fanar-1-9B | 128,256 | 2.85 | 1.36 | 2.10 | 0.774 | #4 | #4 |
62
+ | Hala-9B | 128,256 | 2.85 | 1.36 | 2.10 | 0.774 | #5 | #5 |
63
+ | GPT-4o | 200,019 | 2.81 | 1.44 | 2.12 | 0.725 | #6 | #6 |
64
+ | Command-R-Arabic | 255,033 | 3.00 | 1.33 | 2.16 | 0.713 | #7 | #7 |
65
+ | Qwen3-4B | 151,669 | 3.05 | 1.50 | 2.28 | 0.696 | #8 | #8 |
66
+ | GPT-4 | 100,277 | 4.59 | 1.50 | 3.05 | 0.416 | #9 | #10 |
67
+ | Mistral-7B-v0.3 | 32,768 | 5.56 | 1.48 | 3.52 | 0.417 | #10 | #9 |
68
 
69
  **Metrics explained:**
70
  - **Fertility**: Average tokens per word (lower is better)
71
+ - **Parity**: AR chars/token ÷ EN chars/token (1.0 = equal treatment of both languages)
 
72
 
73
  **Key findings:**
74
+ - **SARFTokenizer ranks #1 in fertility** (1.64 avg tokens/word) and #2 in parity (1.155)
75
+ - **Falcon-H1-7B has best parity** (0.926) but lower fertility efficiency
76
+ - **SARFTokenizer achieves best Arabic fertility** (1.71 tokens/word vs 2.6+ for others)
77
  - Morpheme-aware encoding significantly improves Arabic tokenization efficiency
78
+ - SARFTokenizer uses smallest vocab (64k) among top performers
79
 
80
  ### Throughput Benchmark (1M samples, 680 MB)
81