---
license: cc-by-nc-4.0
language:
- ar
- en
tags:
- tokenizer
- arabic
- morphology
- benchmark
---
# SARF Tokenizer
**SARF** (Segmentation-Aware Rewriting Framework) is a morphologically-aware tokenizer for Arabic that combines unsupervised morphological segmentation (Morfessor) with Byte-Pair Encoding. It uses Unicode Private Use Area (PUA) characters to map Arabic morphemes to single tokens before BPE training, achieving strong Arabic tokenization with a compact 72K vocabulary.
## Benchmark Results
Evaluation on ~5,000 Arabic and ~5,000 English samples from the eval-test-data dataset. Tokenizers are ranked by parity (closest to 1.0), with average chars/token as a tiebreaker.
| Rank | Tokenizer | Vocab | AR Fertility | AR Chars/Tok | EN Fertility | EN Chars/Tok | Parity |
|------|-----------|------:|-------------:|-------------:|-------------:|-------------:|-------:|
| 1 | Gemma-3-4B | 262,145 | 2.311 | 2.864 | 1.137 | 2.911 | 0.9840 |
| 2 | Fanar-1-9B | 128,256 | 2.264 | 2.812 | 1.141 | 2.880 | 0.9764 |
| 3 | Hala-9B | 128,256 | 2.264 | 2.812 | 1.141 | 2.880 | 0.9764 |
| 4 | Command-R-Arabic | 255,033 | 2.320 | 2.799 | 1.142 | 2.906 | 0.9631 |
| 5 | **SARF (Ours)** | **72,195** | **1.978** | **2.832** | 1.561 | 3.163 | **0.8952** |
| 6 | GPT-4o | 200,019 | 2.249 | 3.111 | 1.213 | 3.492 | 0.8909 |
| 7 | Qwen3-4B | 151,669 | 2.314 | 2.599 | 1.225 | 2.964 | 0.8767 |
| 8 | Qwen3-VL-4B | 151,669 | 2.314 | 2.599 | 1.225 | 2.964 | 0.8767 |
| 9 | Falcon-H1-7B | 130,049 | 2.083 | 3.272 | 1.266 | 2.835 | 1.1543 |
| 10 | ALLaM-7B | 64,000 | 1.286 | 3.898 | 1.197 | 2.699 | 1.4442 |
| 11 | Mistral-7B-v0.3 | 32,768 | 5.133 | 1.131 | 1.218 | 2.702 | 0.4185 |
| 12 | GPT-4 | 100,277 | 4.111 | 1.430 | 1.225 | 3.452 | 0.4144 |
| 13 | AceGPT-13B | 32,000 | 5.236 | 1.110 | 1.237 | 2.691 | 0.4124 |
### Metric Definitions
- **AR Fertility**: Arabic tokens per word (lower = better)
- **AR Chars/Tok**: Arabic characters per token (higher = better compression)
- **EN Fertility**: English tokens per word (lower = better)
- **EN Chars/Tok**: English characters per token (higher = better compression)
- **Parity**: AR Chars/Tok / EN Chars/Tok (closer to 1.0 = more balanced)
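The metric definitions above can be sketched as plain Python. This is an illustrative reimplementation, not the bundled `tokenizer_benchmark.py`: it assumes whitespace word splitting and a generic `encode` callable returning a token list.

```python
# Illustrative metric computations; `encode` is any callable mapping
# a string to a list of tokens (assumption, not the benchmark's API).

def fertility(texts, encode):
    """Tokens per whitespace-delimited word, averaged over the corpus (lower = better)."""
    n_tokens = sum(len(encode(t)) for t in texts)
    n_words = sum(len(t.split()) for t in texts)
    return n_tokens / n_words

def chars_per_token(texts, encode):
    """Characters per token (higher = better compression)."""
    n_chars = sum(len(t) for t in texts)
    n_tokens = sum(len(encode(t)) for t in texts)
    return n_chars / n_tokens

def parity(ar_texts, en_texts, encode):
    """Ratio of Arabic to English chars/token; 1.0 means perfectly balanced."""
    return chars_per_token(ar_texts, encode) / chars_per_token(en_texts, encode)
```

A parity below 1.0 means the tokenizer compresses Arabic less efficiently than English; above 1.0, the reverse.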
### Key Findings
- SARF achieves the **lowest Arabic fertility** (1.978 tokens/word) of any tokenizer in the table except ALLaM-7B (1.286), whose parity (1.444) shows a strong skew toward Arabic. This demonstrates that morphological preprocessing enables efficient Arabic tokenization without a massive vocabulary.
- With only a **72K vocabulary**, SARF achieves Arabic compression (2.832 chars/token) competitive with tokenizers whose vocabularies are roughly two to four times larger.
- SARF has **strong parity** (0.895), meaning Arabic and English text are tokenized with similar efficiency, unlike GPT-4 (0.414) or ALLaM-7B (1.444), which show a pronounced bias toward one language.
- SARF ranks **5th in parity** out of 13 tokenizers despite having the **smallest vocabulary** among the top 9.
## Tokenizers Compared
| Tokenizer | Model | Source |
|-----------|-------|--------|
| SARF | DeepLatent | [almaghrabima/deeplatent-tokenizer-parity](https://huggingface.co/almaghrabima/deeplatent-tokenizer-parity) |
| GPT-4o | o200k_base | tiktoken |
| GPT-4 | cl100k_base | tiktoken |
| ALLaM-7B | humain-ai/ALLaM-7B-Instruct-preview | HuggingFace |
| AceGPT-13B | FreedomIntelligence/AceGPT-13B-chat | HuggingFace |
| Gemma-3-4B | google/gemma-3-4b-it | HuggingFace |
| Command-R Arabic | CohereLabs/c4ai-command-r7b-arabic-02-2025 | HuggingFace |
| Fanar-1-9B | QCRI/Fanar-1-9B-Instruct | HuggingFace |
| Hala-9B | hammh0a/Hala-9B | HuggingFace |
| Qwen3-4B | Qwen/Qwen3-4B-Instruct-2507 | HuggingFace |
| Qwen3-VL-4B | Qwen/Qwen3-VL-4B-Instruct | HuggingFace |
| Mistral-7B-v0.3 | mistralai/Mistral-7B-Instruct-v0.3 | HuggingFace |
| Falcon-H1-7B | tiiuae/Falcon-H1-7B-Instruct | HuggingFace |
## How SARF Works
SARF uses a morphologically-aware preprocessing pipeline before BPE:
1. **Morfessor** segments Arabic words into morphemes unsupervised
2. **Morpheme-to-PUA mapping** assigns each morpheme a Unicode Private Use Area character
3. **ByteRewriter** rewrites Arabic text so morphemes become single characters
4. **BPE** trains on the rewritten text, naturally learning morpheme-level tokens
This approach achieves strong Arabic compression without inflating the vocabulary for English or other languages.
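The morpheme-to-PUA rewriting step (stages 2–3 above) can be sketched as follows. This is a minimal illustration with a hypothetical two-morpheme inventory; the real pipeline derives the inventory with Morfessor and then trains BPE on the rewritten text.

```python
# Minimal sketch of morpheme-to-PUA rewriting (stages 2-3 of the pipeline).
# The morpheme inventory here is hypothetical; SARF builds it with Morfessor.

PUA_START = 0xE000  # first code point of the Unicode Private Use Area

def build_pua_map(morphemes):
    """Assign each morpheme a unique single PUA character."""
    return {m: chr(PUA_START + i) for i, m in enumerate(morphemes)}

def rewrite(text, pua_map):
    """Replace each morpheme with its PUA character, longest morphemes first,
    so BPE later sees one character per morpheme."""
    for morpheme in sorted(pua_map, key=len, reverse=True):
        text = text.replace(morpheme, pua_map[morpheme])
    return text

def restore(text, pua_map):
    """Invert the rewriting after detokenization."""
    for morpheme, ch in pua_map.items():
        text = text.replace(ch, morpheme)
    return text

pua_map = build_pua_map(["ال", "كتاب"])   # definite article + "book" (toy inventory)
rewritten = rewrite("الكتاب", pua_map)    # 6 Arabic characters -> 2 PUA characters
```

Because rewriting is a bijection over the morpheme inventory, `restore` recovers the original surface text exactly after detokenization.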
## Files
- `results.json` — Raw benchmark data
- `tokenizer_benchmark.py` — Benchmark script (reproduces results)
## Citation
```bibtex
@misc{sarf2025,
  title={SARF: Segmentation-Aware Rewriting Framework for Arabic Tokenization},
  author={Al-Maghrabima},
  year={2025},
  url={https://huggingface.co/almaghrabima/SARF-Tokenizer}
}
```
## License
CC-BY-NC-4.0