---
license: cc-by-nc-4.0
language:
  - ar
  - en
tags:
  - tokenizer
  - arabic
  - morphology
  - benchmark
---

# SARF Tokenizer

**SARF** (Segmentation-Aware Rewriting Framework) is a morphologically-aware tokenizer for Arabic that combines unsupervised morphological segmentation (Morfessor) with Byte-Pair Encoding. It uses Unicode Private Use Area (PUA) characters to map Arabic morphemes to single tokens before BPE training, achieving strong Arabic tokenization with a compact 72K vocabulary.

## Benchmark Results

Evaluated on ~5,000 Arabic and ~5,000 English samples from the eval-test-data dataset. Tokenizers are ranked by parity (closest to 1.0), then by average chars/token.

| Rank | Tokenizer | Vocab | AR Fertility | AR Chars/Tok | EN Fertility | EN Chars/Tok | Parity |
|------|-----------|------:|-------------:|-------------:|-------------:|-------------:|-------:|
| 1 | Gemma-3-4B | 262,145 | 2.311 | 2.864 | 1.137 | 2.911 | 0.9840 |
| 2 | Fanar-1-9B | 128,256 | 2.264 | 2.812 | 1.141 | 2.880 | 0.9764 |
| 3 | Hala-9B | 128,256 | 2.264 | 2.812 | 1.141 | 2.880 | 0.9764 |
| 4 | Command-R-Arabic | 255,033 | 2.320 | 2.799 | 1.142 | 2.906 | 0.9631 |
| 5 | **SARF (Ours)** | **72,195** | **1.978** | **2.832** | 1.561 | 3.163 | **0.8952** |
| 6 | GPT-4o | 200,019 | 2.249 | 3.111 | 1.213 | 3.492 | 0.8909 |
| 7 | Qwen3-4B | 151,669 | 2.314 | 2.599 | 1.225 | 2.964 | 0.8767 |
| 8 | Qwen3-VL-4B | 151,669 | 2.314 | 2.599 | 1.225 | 2.964 | 0.8767 |
| 9 | Falcon-H1-7B | 130,049 | 2.083 | 3.272 | 1.266 | 2.835 | 1.1543 |
| 10 | ALLaM-7B | 64,000 | 1.286 | 3.898 | 1.197 | 2.699 | 1.4442 |
| 11 | Mistral-7B-v0.3 | 32,768 | 5.133 | 1.131 | 1.218 | 2.702 | 0.4185 |
| 12 | GPT-4 | 100,277 | 4.111 | 1.430 | 1.225 | 3.452 | 0.4144 |
| 13 | AceGPT-13B | 32,000 | 5.236 | 1.110 | 1.237 | 2.691 | 0.4124 |

### Metric Definitions

- **AR Fertility**: Arabic tokens per word (lower = better)
- **AR Chars/Tok**: Arabic characters per token (higher = better compression)
- **EN Fertility**: English tokens per word (lower = better)
- **EN Chars/Tok**: English characters per token (higher = better compression)
- **Parity**: AR Chars/Tok / EN Chars/Tok (closer to 1.0 = more balanced)
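The definitions above can be sketched in code. This is a minimal illustration, not the benchmark script itself: it assumes whitespace word splitting and a generic `encode` callable returning a token list, and the function names are illustrative (the exact counting rules in `tokenizer_benchmark.py` may differ).

```python
# Illustrative metric definitions; `encode` is any callable mapping
# a string to a sequence of tokens (e.g. a tokenizer's encode method).

def fertility(texts, encode):
    """Tokens per whitespace-separated word (lower is better)."""
    n_tokens = sum(len(encode(t)) for t in texts)
    n_words = sum(len(t.split()) for t in texts)
    return n_tokens / n_words

def chars_per_token(texts, encode):
    """Characters per token (higher means better compression)."""
    n_chars = sum(len(t) for t in texts)
    n_tokens = sum(len(encode(t)) for t in texts)
    return n_chars / n_tokens

def parity(ar_texts, en_texts, encode):
    """AR chars/token divided by EN chars/token; 1.0 is balanced."""
    return chars_per_token(ar_texts, encode) / chars_per_token(en_texts, encode)

# Sanity check with a character-level "tokenizer":
char_encode = list
assert fertility(["a b"], char_encode) == 1.5        # 3 chars / 2 words
assert chars_per_token(["abcd"], char_encode) == 1.0
assert parity(["abcd"], ["ab"], char_encode) == 1.0
```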

### Key Findings

- Among tokenizers with vocabularies under 130K, SARF achieves the **lowest Arabic fertility** (1.978 tokens/word) of those that keep parity near 1.0 (ALLaM's lower 1.286 comes at a heavily Arabic-skewed parity of 1.444), demonstrating that morphological preprocessing enables efficient Arabic tokenization without massive vocabularies.
- With only a **72K vocabulary**, SARF achieves Arabic compression (2.832 chars/token) competitive with tokenizers 2-4x its size.
- SARF has **high parity** (0.895), meaning Arabic and English text are tokenized with similar efficiency, unlike GPT-4 (0.414) or ALLaM (1.444), which show strong language bias.
- SARF ranks **5th in parity** out of 13 tokenizers despite having the **smallest vocabulary** among the top 9.

## Tokenizers Compared

| Tokenizer | Model | Source |
|-----------|-------|--------|
| SARF | DeepLatent | [almaghrabima/deeplatent-tokenizer-parity](https://huggingface.co/almaghrabima/deeplatent-tokenizer-parity) |
| GPT-4o | o200k_base | tiktoken |
| GPT-4 | cl100k_base | tiktoken |
| ALLaM-7B | humain-ai/ALLaM-7B-Instruct-preview | HuggingFace |
| AceGPT-13B | FreedomIntelligence/AceGPT-13B-chat | HuggingFace |
| Gemma-3-4B | google/gemma-3-4b-it | HuggingFace |
| Command-R Arabic | CohereLabs/c4ai-command-r7b-arabic-02-2025 | HuggingFace |
| Fanar-1-9B | QCRI/Fanar-1-9B-Instruct | HuggingFace |
| Hala-9B | hammh0a/Hala-9B | HuggingFace |
| Qwen3-4B | Qwen/Qwen3-4B-Instruct-2507 | HuggingFace |
| Qwen3-VL-4B | Qwen/Qwen3-VL-4B-Instruct | HuggingFace |
| Mistral-7B-v0.3 | mistralai/Mistral-7B-Instruct-v0.3 | HuggingFace |
| Falcon-H1-7B | tiiuae/Falcon-H1-7B-Instruct | HuggingFace |

## How SARF Works

SARF uses a morphologically-aware preprocessing pipeline before BPE:

1. **Morfessor** segments Arabic words into morphemes unsupervised
2. **Morpheme-to-PUA mapping** assigns each morpheme a Unicode Private Use Area character
3. **ByteRewriter** rewrites Arabic text so morphemes become single characters
4. **BPE** trains on the rewritten text, naturally learning morpheme-level tokens

This approach achieves strong Arabic compression without inflating the vocabulary for English or other languages.
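Steps 2-3 of the pipeline can be sketched as follows. This is a toy illustration, not the actual ByteRewriter: the morpheme inventory here is a hypothetical stand-in for Morfessor output, and the greedy longest-match replacement is a simplification of whatever segmentation-consistent rewriting SARF actually performs.

```python
# Toy sketch of morpheme-to-PUA rewriting (steps 2-3 above).
# The morpheme list below is hypothetical; SARF derives its inventory
# from unsupervised Morfessor segmentation.

PUA_START = 0xE000  # Unicode Private Use Area: U+E000..U+F8FF

morphemes = ["ال", "كتاب", "مدرس", "ون", "ها"]  # illustrative inventory

# Step 2: assign each morpheme a single PUA character.
to_pua = {m: chr(PUA_START + i) for i, m in enumerate(morphemes)}
from_pua = {v: k for k, v in to_pua.items()}

def rewrite(text: str) -> str:
    """Step 3: greedily replace known morphemes with their PUA characters,
    longest morpheme first, so BPE later sees each morpheme as one symbol."""
    for m in sorted(to_pua, key=len, reverse=True):
        text = text.replace(m, to_pua[m])
    return text

def restore(text: str) -> str:
    """Invert the rewriting after detokenization."""
    return "".join(from_pua.get(ch, ch) for ch in text)

word = "المدرسون"            # "the teachers" = ال + مدرس + ون
rewritten = rewrite(word)
assert len(rewritten) == 3    # three morphemes -> three PUA characters
assert restore(rewritten) == word
```

After this rewriting, a standard BPE trainer (step 4) operates on the PUA-compressed text, so frequent morphemes already occupy single symbols and merges compose at the morpheme level rather than the byte level.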

## Files

- `results.json` — Raw benchmark data
- `tokenizer_benchmark.py` — Benchmark script (reproduces results)

## Citation

```bibtex
@misc{sarf2025,
  title={SARF: Segmentation-Aware Rewriting Framework for Arabic Tokenization},
  author={Al-Maghrabima},
  year={2025},
  url={https://huggingface.co/almaghrabima/SARF-Tokenizer}
}
```

## License

CC-BY-NC-4.0