---
license: cc-by-nc-4.0
language:
- ar
- en
tags:
- tokenizer
- arabic
- morphology
- benchmark
---

# SARF Tokenizer

**SARF** (Segmentation-Aware Rewriting Framework) is a morphologically aware tokenizer for Arabic that combines unsupervised morphological segmentation (Morfessor) with Byte-Pair Encoding (BPE). It uses Unicode Private Use Area (PUA) characters to map Arabic morphemes to single characters before BPE training, achieving strong Arabic tokenization with a compact 72K vocabulary.
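
Loading the tokenizer could look like the sketch below. This is a minimal usage illustration under two assumptions: that the repo id from the citation at the bottom of this card is the published artifact, and that it is transformers-compatible with the PUA rewriting bundled into its preprocessing.

```python
# Minimal usage sketch. The repo id is taken from the citation below
# and is an assumption about where the tokenizer is published.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("almaghrabima/SARF-Tokenizer")

text = "التعلم العميق يغير معالجة اللغة العربية"  # "Deep learning is changing Arabic NLP"
ids = tok.encode(text)
print(len(ids), tok.convert_ids_to_tokens(ids))
```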

## Benchmark Results

Evaluated on ~5,000 Arabic and ~5,000 English samples from the eval-test-data dataset. Tokenizers are ranked by parity (closest to 1.0), then by average chars/token.

| Rank | Tokenizer | Vocab | AR Fertility | AR Chars/Tok | EN Fertility | EN Chars/Tok | Parity |
|------|-----------|------:|-------------:|-------------:|-------------:|-------------:|-------:|
| 1 | Gemma-3-4B | 262,145 | 2.311 | 2.864 | 1.137 | 2.911 | 0.9840 |
| 2 | Fanar-1-9B | 128,256 | 2.264 | 2.812 | 1.141 | 2.880 | 0.9764 |
| 3 | Hala-9B | 128,256 | 2.264 | 2.812 | 1.141 | 2.880 | 0.9764 |
| 4 | Command-R-Arabic | 255,033 | 2.320 | 2.799 | 1.142 | 2.906 | 0.9631 |
| 5 | **SARF (Ours)** | **72,195** | **1.978** | **2.832** | 1.561 | 3.163 | **0.8952** |
| 6 | GPT-4o | 200,019 | 2.249 | 3.111 | 1.213 | 3.492 | 0.8909 |
| 7 | Qwen3-4B | 151,669 | 2.314 | 2.599 | 1.225 | 2.964 | 0.8767 |
| 8 | Qwen3-VL-4B | 151,669 | 2.314 | 2.599 | 1.225 | 2.964 | 0.8767 |
| 9 | Falcon-H1-7B | 130,049 | 2.083 | 3.272 | 1.266 | 2.835 | 1.1543 |
| 10 | ALLaM-7B | 64,000 | 1.286 | 3.898 | 1.197 | 2.699 | 1.4442 |
| 11 | Mistral-7B-v0.3 | 32,768 | 5.133 | 1.131 | 1.218 | 2.702 | 0.4185 |
| 12 | GPT-4 | 100,277 | 4.111 | 1.430 | 1.225 | 3.452 | 0.4144 |
| 13 | AceGPT-13B | 32,000 | 5.236 | 1.110 | 1.237 | 2.691 | 0.4124 |

### Metric Definitions

- **AR Fertility**: Arabic tokens per word (lower = better)
- **AR Chars/Tok**: Arabic characters per token (higher = better compression)
- **EN Fertility**: English tokens per word (lower = better)
- **EN Chars/Tok**: English characters per token (higher = better compression)
- **Parity**: AR Chars/Tok / EN Chars/Tok (closer to 1.0 = more balanced; see the sketch below)
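
As a concrete reference, the following sketch computes the three metrics for a single tokenizer. It is an illustration, not the benchmark script (`tokenizer_benchmark.py` reproduces the table); whitespace word-splitting and the generic `encode` callable are simplifying assumptions.

```python
# Metric sketch, not the official benchmark script. Assumes `encode`
# maps text -> list of token ids and that whitespace splitting
# approximates words.
from typing import Callable, List

def fertility(texts: List[str], encode: Callable[[str], list]) -> float:
    """Tokens per word (lower = better)."""
    tokens = sum(len(encode(t)) for t in texts)
    words = sum(len(t.split()) for t in texts)
    return tokens / words

def chars_per_token(texts: List[str], encode: Callable[[str], list]) -> float:
    """Characters per token (higher = better compression)."""
    chars = sum(len(t) for t in texts)
    tokens = sum(len(encode(t)) for t in texts)
    return chars / tokens

def parity(ar: List[str], en: List[str], encode: Callable[[str], list]) -> float:
    """AR chars/token divided by EN chars/token (1.0 = balanced)."""
    return chars_per_token(ar, encode) / chars_per_token(en, encode)
```

For SARF, the table gives 2.832 / 3.163 ≈ 0.895, matching the reported parity.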

### Key Findings

- SARF achieves the **lowest Arabic fertility** (1.978 tokens/word) among the top nine tokenizers by parity, demonstrating that morphological preprocessing enables efficient Arabic tokenization without a massive vocabulary. (ALLaM-7B's lower fertility of 1.286 comes with a heavily Arabic-skewed parity of 1.444.)
- With only a **72K vocabulary**, SARF achieves Arabic compression (2.832 chars/token) competitive with tokenizers two to nearly four times its size (Fanar-1-9B: 2.812; Command-R-Arabic: 2.799; Gemma-3-4B: 2.864).
- SARF has **strong parity** (0.895), meaning Arabic and English text are tokenized with similar efficiency, unlike GPT-4 (0.414) or ALLaM (1.444), which show strong language bias.
- SARF ranks **5th in parity** out of 13 tokenizers despite having the **smallest vocabulary** among the top 9.

## Tokenizers Compared

| Tokenizer | Model | Source |
|-----------|-------|--------|
| SARF | DeepLatent | [almaghrabima/deeplatent-tokenizer-parity](https://huggingface.co/almaghrabima/deeplatent-tokenizer-parity) |
| GPT-4o | o200k_base | tiktoken |
| GPT-4 | cl100k_base | tiktoken |
| ALLaM-7B | humain-ai/ALLaM-7B-Instruct-preview | HuggingFace |
| AceGPT-13B | FreedomIntelligence/AceGPT-13B-chat | HuggingFace |
| Gemma-3-4B | google/gemma-3-4b-it | HuggingFace |
| Command-R-Arabic | CohereLabs/c4ai-command-r7b-arabic-02-2025 | HuggingFace |
| Fanar-1-9B | QCRI/Fanar-1-9B-Instruct | HuggingFace |
| Hala-9B | hammh0a/Hala-9B | HuggingFace |
| Qwen3-4B | Qwen/Qwen3-4B-Instruct-2507 | HuggingFace |
| Qwen3-VL-4B | Qwen/Qwen3-VL-4B-Instruct | HuggingFace |
| Mistral-7B-v0.3 | mistralai/Mistral-7B-Instruct-v0.3 | HuggingFace |
| Falcon-H1-7B | tiiuae/Falcon-H1-7B-Instruct | HuggingFace |
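
For reproduction, the HuggingFace-hosted entries load with `AutoTokenizer` and the OpenAI entries with `tiktoken`. The sketch below shows both paths; it is a loading illustration only (the full evaluation lives in `tokenizer_benchmark.py`), and the sample string is an arbitrary example.

```python
# Sketch: loading the two kinds of tokenizers compared above.
# Requires `transformers` and `tiktoken`; ids are from the table.
import tiktoken
from transformers import AutoTokenizer

# HuggingFace-hosted tokenizers expose .encode(text) -> token ids
hf_tok = AutoTokenizer.from_pretrained("QCRI/Fanar-1-9B-Instruct")

# OpenAI tokenizers ship as tiktoken encodings
gpt4o_enc = tiktoken.get_encoding("o200k_base")

sample = "مرحبا بالعالم"  # "Hello, world"
print(len(hf_tok.encode(sample)), len(gpt4o_enc.encode(sample)))
```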

## How SARF Works

SARF applies a morphologically aware preprocessing pipeline before BPE (a minimal sketch appears below):

1. **Morfessor** segments Arabic words into morphemes without supervision
2. **Morpheme-to-PUA mapping** assigns each morpheme a Unicode Private Use Area character
3. **ByteRewriter** rewrites Arabic text so that morphemes become single characters
4. **BPE** trains on the rewritten text, naturally learning morpheme-level tokens

This approach achieves strong Arabic compression without inflating the vocabulary for English or other languages.
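
A minimal sketch of steps 2 and 3, assuming a hypothetical two-entry morpheme table; in SARF the table comes from Morfessor's unsupervised segmentation of the training corpus, and `rewrite`/`restore` are illustrative names, not SARF's actual API.

```python
# Illustrative sketch of the morpheme -> PUA rewriting step.
# The two morphemes below are hypothetical examples; SARF derives
# its table from unsupervised Morfessor segmentation.
PUA_START = 0xE000  # first code point of the BMP Private Use Area

morphemes = ["ال", "كتاب"]  # e.g. the definite article and the stem "book"
to_pua = {m: chr(PUA_START + i) for i, m in enumerate(morphemes)}
from_pua = {c: m for m, c in to_pua.items()}

def rewrite(text: str) -> str:
    """Replace known morphemes with single PUA characters (longest
    first), so BPE sees each morpheme as one symbol."""
    for m in sorted(to_pua, key=len, reverse=True):
        text = text.replace(m, to_pua[m])
    return text

def restore(text: str) -> str:
    """Invert the rewriting after detokenization."""
    for c, m in from_pua.items():
        text = text.replace(c, m)
    return text

s = "الكتاب"        # "the book" = definite article + stem
r = rewrite(s)
assert len(r) == 2   # two PUA characters instead of six Arabic letters
assert restore(r) == s
```

BPE then trains on the rewritten corpus, so each frequent morpheme is already a single symbol and becomes one vocabulary entry; at decode time the inverse mapping restores the original Arabic surface form.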

## Files

- `results.json` — Raw benchmark data
- `tokenizer_benchmark.py` — Benchmark script (reproduces the results above)

## Citation

```bibtex
@misc{sarf2025,
  title={SARF: Segmentation-Aware Rewriting Framework for Arabic Tokenization},
  author={Al-Maghrabima},
  year={2025},
  url={https://huggingface.co/almaghrabima/SARF-Tokenizer}
}
```

## License

CC-BY-NC-4.0