# SARF Tokenizer
SARF (Segmentation-Aware Rewriting Framework) is a morphologically-aware tokenizer for Arabic that combines unsupervised morphological segmentation (Morfessor) with Byte-Pair Encoding. It uses Unicode Private Use Area (PUA) characters to map Arabic morphemes to single tokens before BPE training, achieving strong Arabic tokenization with a compact 72K vocabulary.
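The morpheme-to-PUA rewriting idea can be sketched in a few lines. This is an illustrative toy, not SARF's actual tables: the morpheme inventory, the mapping scheme, and the `toy_segment` stand-in for Morfessor are all assumptions.

```python
# Sketch of morpheme-to-PUA rewriting (illustrative; the morpheme inventory
# and mapping scheme are assumptions, not SARF's actual tables).

PUA_START = 0xE000  # first code point of the Unicode Private Use Area


def build_pua_map(morphemes):
    """Assign each morpheme a unique PUA character."""
    return {m: chr(PUA_START + i) for i, m in enumerate(morphemes)}


def rewrite(text, segment, pua_map):
    """Replace each morpheme produced by `segment` with its PUA character,
    so BPE later sees whole morphemes as atomic symbols."""
    out = []
    for word in text.split():
        out.append("".join(pua_map.get(m, m) for m in segment(word)))
    return " ".join(out)


# Toy segmenter standing in for Morfessor: split off the definite article.
def toy_segment(word):
    return ["ال", word[2:]] if word.startswith("ال") else [word]


pua_map = build_pua_map(["ال"])
rewritten = rewrite("الكتاب الجديد", toy_segment, pua_map)
```

After rewriting, each mapped morpheme occupies exactly one code point, so BPE merge statistics never split it.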
## Benchmark Results
Evaluated on ~5,000 Arabic and ~5,000 English samples from the eval-test-data dataset. Tokenizers are ranked by parity (closest to 1.0), then by average chars/token.
| Rank | Tokenizer | Vocab | AR Fertility | AR Chars/Tok | EN Fertility | EN Chars/Tok | Parity |
|---|---|---|---|---|---|---|---|
| 1 | Gemma-3-4B | 262,145 | 2.311 | 2.864 | 1.137 | 2.911 | 0.9840 |
| 2 | Fanar-1-9B | 128,256 | 2.264 | 2.812 | 1.141 | 2.880 | 0.9764 |
| 3 | Hala-9B | 128,256 | 2.264 | 2.812 | 1.141 | 2.880 | 0.9764 |
| 4 | Command-R-Arabic | 255,033 | 2.320 | 2.799 | 1.142 | 2.906 | 0.9631 |
| 5 | SARF (Ours) | 72,195 | 1.978 | 2.832 | 1.561 | 3.163 | 0.8952 |
| 6 | GPT-4o | 200,019 | 2.249 | 3.111 | 1.213 | 3.492 | 0.8909 |
| 7 | Qwen3-4B | 151,669 | 2.314 | 2.599 | 1.225 | 2.964 | 0.8767 |
| 8 | Qwen3-VL-4B | 151,669 | 2.314 | 2.599 | 1.225 | 2.964 | 0.8767 |
| 9 | Falcon-H1-7B | 130,049 | 2.083 | 3.272 | 1.266 | 2.835 | 1.1543 |
| 10 | ALLaM-7B | 64,000 | 1.286 | 3.898 | 1.197 | 2.699 | 1.4442 |
| 11 | Mistral-7B-v0.3 | 32,768 | 5.133 | 1.131 | 1.218 | 2.702 | 0.4185 |
| 12 | GPT-4 | 100,277 | 4.111 | 1.430 | 1.225 | 3.452 | 0.4144 |
| 13 | AceGPT-13B | 32,000 | 5.236 | 1.110 | 1.237 | 2.691 | 0.4124 |
### Metric Definitions
- AR Fertility: Arabic tokens per word (lower = better)
- AR Chars/Tok: Arabic characters per token (higher = better compression)
- EN Fertility: English tokens per word (lower = better)
- EN Chars/Tok: English characters per token (higher = better compression)
- Parity: AR Chars/Tok / EN Chars/Tok (closer to 1.0 = more balanced)
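The definitions above map directly to code. A minimal sketch follows; the whitespace word splitting and the generic `tokenize` callable are assumptions for illustration, not the benchmark script's exact logic.

```python
# Minimal sketch of the table's metrics. `tokenize` is any callable
# str -> list[str]; corpus handling here is an assumption, not the
# benchmark script's exact implementation.

def fertility(texts, tokenize):
    """Tokens per whitespace-delimited word (lower = better)."""
    words = sum(len(t.split()) for t in texts)
    tokens = sum(len(tokenize(t)) for t in texts)
    return tokens / words


def chars_per_token(texts, tokenize):
    """Characters per token (higher = better compression)."""
    chars = sum(len(t) for t in texts)
    tokens = sum(len(tokenize(t)) for t in texts)
    return chars / tokens


def parity(ar_texts, en_texts, tokenize):
    """AR chars/token divided by EN chars/token (closer to 1.0 = balanced)."""
    return chars_per_token(ar_texts, tokenize) / chars_per_token(en_texts, tokenize)
```

Plugging any tokenizer's encode function into `tokenize` reproduces the three column families in the table.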
### Key Findings
- SARF achieves the second-lowest Arabic fertility (1.978 tokens/word) of any tokenizer with a vocabulary under 130K; only ALLaM-7B is lower (1.286), at the cost of heavy Arabic bias (parity 1.444). This shows that morphological preprocessing enables efficient Arabic tokenization without massive vocabularies.
- With only 72K vocabulary, SARF achieves Arabic compression (2.832 chars/token) competitive with tokenizers 2-3x its size.
- SARF has strong parity (0.895), meaning Arabic and English text are tokenized with similar efficiency, unlike GPT-4 (0.414) or ALLaM (1.444), which show strong language bias.
- SARF ranks 5th in parity out of 13 tokenizers despite having the smallest vocabulary among the top 9.
## Tokenizers Compared
| Tokenizer | Model | Source |
|---|---|---|
| SARF | DeepLatent | almaghrabima/deeplatent-tokenizer-parity |
| GPT-4o | o200k_base | tiktoken |
| GPT-4 | cl100k_base | tiktoken |
| ALLaM-7B | humain-ai/ALLaM-7B-Instruct-preview | HuggingFace |
| AceGPT-13B | FreedomIntelligence/AceGPT-13B-chat | HuggingFace |
| Gemma-3-4B | google/gemma-3-4b-it | HuggingFace |
| Command-R Arabic | CohereLabs/c4ai-command-r7b-arabic-02-2025 | HuggingFace |
| Fanar-1-9B | QCRI/Fanar-1-9B-Instruct | HuggingFace |
| Hala-9B | hammh0a/Hala-9B | HuggingFace |
| Qwen3-4B | Qwen/Qwen3-4B-Instruct-2507 | HuggingFace |
| Qwen3-VL-4B | Qwen/Qwen3-VL-4B-Instruct | HuggingFace |
| Mistral-7B-v0.3 | mistralai/Mistral-7B-Instruct-v0.3 | HuggingFace |
| Falcon-H1-7B | tiiuae/Falcon-H1-7B-Instruct | HuggingFace |
## How SARF Works
SARF uses a morphologically-aware preprocessing pipeline before BPE:
1. Morfessor segments Arabic words into morphemes, unsupervised.
2. Morpheme-to-PUA mapping assigns each morpheme a Unicode Private Use Area character.
3. ByteRewriter rewrites Arabic text so each morpheme becomes a single character.
4. BPE trains on the rewritten text, naturally learning morpheme-level tokens.
This approach achieves strong Arabic compression without inflating the vocabulary for English or other languages.
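Step 4 can be illustrated with a toy BPE pair count over rewritten text. This is a hand-rolled sketch, not SARF's trainer; the `"\ue000"` symbol stands in for a mapped morpheme such as the article ال.

```python
# Toy BPE statistics over PUA-rewritten words: a whole morpheme (now one
# PUA character) participates in merges as a single atomic symbol.
# Illustrative sketch, not SARF's actual BPE trainer.
from collections import Counter


def pair_counts(corpus):
    """Count adjacent-symbol pairs across a list of symbol sequences."""
    counts = Counter()
    for word in corpus:
        for a, b in zip(word, word[1:]):
            counts[(a, b)] += 1
    return counts


def merge(corpus, pair):
    """Apply one BPE merge: fuse every occurrence of `pair` into one symbol."""
    merged = []
    for word in corpus:
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged.append(out)
    return merged


# "\ue000" stands for the rewritten article ال: it already occupies one
# symbol, so frequent morpheme+stem boundaries surface as top merge pairs.
corpus = [list("\ue000كتاب"), list("\ue000جديد"), list("\ue000كتب")]
best = pair_counts(corpus).most_common(1)[0][0]
corpus = merge(corpus, best)
```

Because the article is one symbol before training even starts, BPE spends its merge budget on stems and suffixes instead of re-learning common prefixes character by character.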
## Files

- `results.json`: Raw benchmark data
- `tokenizer_benchmark.py`: Benchmark script (reproduces results)
## Citation

```bibtex
@misc{sarf2025,
  title={SARF: Segmentation-Aware Rewriting Framework for Arabic Tokenization},
  author={Al-Maghrabima},
  year={2025},
  url={https://huggingface.co/almaghrabima/SARF-Tokenizer}
}
```
## License
CC-BY-NC-4.0