---
license: cc-by-nc-4.0
language:
- ar
- en
tags:
- tokenizer
- arabic
- morphology
- benchmark
---

# SARF Tokenizer

**SARF** (Segmentation-Aware Rewriting Framework) is a morphologically aware tokenizer for Arabic that combines unsupervised morphological segmentation (Morfessor) with Byte-Pair Encoding (BPE). It uses Unicode Private Use Area (PUA) characters to map Arabic morphemes to single tokens before BPE training, achieving strong Arabic tokenization with a compact 72K vocabulary.

## Benchmark Results

Evaluation on ~5,000 Arabic and ~5,000 English samples from the eval-test-data dataset. Ranked by parity (closest to 1.0), then average chars/token.

| Rank | Tokenizer | Vocab | AR Fertility | AR Chars/Tok | EN Fertility | EN Chars/Tok | Parity |
|------|-----------|------:|-------------:|-------------:|-------------:|-------------:|-------:|
| 1 | Gemma-3-4B | 262,145 | 2.311 | 2.864 | 1.137 | 2.911 | 0.9840 |
| 2 | Fanar-1-9B | 128,256 | 2.264 | 2.812 | 1.141 | 2.880 | 0.9764 |
| 3 | Hala-9B | 128,256 | 2.264 | 2.812 | 1.141 | 2.880 | 0.9764 |
| 4 | Command-R-Arabic | 255,033 | 2.320 | 2.799 | 1.142 | 2.906 | 0.9631 |
| 5 | **SARF (Ours)** | **72,195** | **1.978** | **2.832** | 1.561 | 3.163 | **0.8952** |
| 6 | GPT-4o | 200,019 | 2.249 | 3.111 | 1.213 | 3.492 | 0.8909 |
| 7 | Qwen3-4B | 151,669 | 2.314 | 2.599 | 1.225 | 2.964 | 0.8767 |
| 8 | Qwen3-VL-4B | 151,669 | 2.314 | 2.599 | 1.225 | 2.964 | 0.8767 |
| 9 | Falcon-H1-7B | 130,049 | 2.083 | 3.272 | 1.266 | 2.835 | 1.1543 |
| 10 | ALLaM-7B | 64,000 | 1.286 | 3.898 | 1.197 | 2.699 | 1.4442 |
| 11 | Mistral-7B-v0.3 | 32,768 | 5.133 | 1.131 | 1.218 | 2.702 | 0.4185 |
| 12 | GPT-4 | 100,277 | 4.111 | 1.430 | 1.225 | 3.452 | 0.4144 |
| 13 | AceGPT-13B | 32,000 | 5.236 | 1.110 | 1.237 | 2.691 | 0.4124 |

### Metric Definitions

- **AR Fertility**: Arabic tokens per word (lower = better)
- **AR Chars/Tok**: Arabic characters per token (higher = better compression)
- **EN Fertility**: English tokens per word (lower = better)
- **EN Chars/Tok**: English characters per token (higher = better compression)
- **Parity**: AR Chars/Tok divided by EN Chars/Tok (closer to 1.0 = more balanced)

### Key Findings

- SARF achieves an Arabic fertility of **1.978 tokens/word**, the second-lowest in the benchmark; only ALLaM-7B (1.286) is lower, and at the cost of a heavily Arabic-skewed parity (1.444). This demonstrates that morphological preprocessing enables efficient Arabic tokenization without a massive vocabulary.
- With only a **72K vocabulary**, SARF achieves Arabic compression (2.832 chars/token) competitive with tokenizers 2-3x its size.
- SARF shows **strong parity** (0.895), meaning Arabic and English text are tokenized with similar efficiency, unlike GPT-4 (0.414) or ALLaM (1.444), which show strong language bias.
- SARF ranks **5th in parity** out of 13 tokenizers despite having the **smallest vocabulary** among the top 9.

## Tokenizers Compared

| Tokenizer | Model | Source |
|-----------|-------|--------|
| SARF | DeepLatent | [almaghrabima/deeplatent-tokenizer-parity](https://huggingface.co/almaghrabima/deeplatent-tokenizer-parity) |
| GPT-4o | o200k_base | tiktoken |
| GPT-4 | cl100k_base | tiktoken |
| ALLaM-7B | humain-ai/ALLaM-7B-Instruct-preview | HuggingFace |
| AceGPT-13B | FreedomIntelligence/AceGPT-13B-chat | HuggingFace |
| Gemma-3-4B | google/gemma-3-4b-it | HuggingFace |
| Command-R-Arabic | CohereLabs/c4ai-command-r7b-arabic-02-2025 | HuggingFace |
| Fanar-1-9B | QCRI/Fanar-1-9B-Instruct | HuggingFace |
| Hala-9B | hammh0a/Hala-9B | HuggingFace |
| Qwen3-4B | Qwen/Qwen3-4B-Instruct-2507 | HuggingFace |
| Qwen3-VL-4B | Qwen/Qwen3-VL-4B-Instruct | HuggingFace |
| Mistral-7B-v0.3 | mistralai/Mistral-7B-Instruct-v0.3 | HuggingFace |
| Falcon-H1-7B | tiiuae/Falcon-H1-7B-Instruct | HuggingFace |

## How SARF Works

SARF applies a morphologically aware preprocessing pipeline before BPE:

1. **Morfessor** segments Arabic words into morphemes, unsupervised
2. **Morpheme-to-PUA mapping** assigns each morpheme a Unicode Private Use Area character
3. **ByteRewriter** rewrites Arabic text so that morphemes become single characters
4. **BPE** trains on the rewritten text, naturally learning morpheme-level tokens

This approach achieves strong Arabic compression without inflating the vocabulary for English or other languages.

## Files

- `results.json`: Raw benchmark data
- `tokenizer_benchmark.py`: Benchmark script (reproduces the results)

## Citation

```bibtex
@misc{sarf2025,
  title={SARF: Segmentation-Aware Rewriting Framework for Arabic Tokenization},
  author={Al-Maghrabima},
  year={2025},
  url={https://huggingface.co/almaghrabima/SARF-Tokenizer}
}
```

## License

CC-BY-NC-4.0
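## Appendix: Metric Sketch

The fertility, chars/token, and parity metrics defined above can be computed in a few lines. The sketch below is a minimal illustration, not the actual `tokenizer_benchmark.py`: it counts whitespace-delimited words and raw code points, and uses a toy whitespace tokenizer plus invented sample sentences as stand-ins for a real BPE tokenizer and the benchmark corpus. Normalization details in the real script may differ.

```python
# Minimal sketch of the benchmark metrics (illustrative, not the real script).

def fertility(texts, tokenize):
    """Tokens per whitespace-delimited word (lower = better)."""
    n_tokens = sum(len(tokenize(t)) for t in texts)
    n_words = sum(len(t.split()) for t in texts)
    return n_tokens / n_words

def chars_per_token(texts, tokenize):
    """Characters per token (higher = better compression)."""
    n_chars = sum(len(t) for t in texts)
    n_tokens = sum(len(tokenize(t)) for t in texts)
    return n_chars / n_tokens

def parity(ar_texts, en_texts, tokenize):
    """AR chars/token divided by EN chars/token (1.0 = perfectly balanced)."""
    return chars_per_token(ar_texts, tokenize) / chars_per_token(en_texts, tokenize)

# Toy tokenizer and toy samples, stand-ins for a real tokenizer and corpus.
toy = str.split
ar = ["كتب الطالب الدرس"]
en = ["the student wrote the lesson"]
print(fertility(en, toy))   # 1.0 (a whitespace tokenizer emits one token per word)
print(parity(ar, en, toy))
```

With a real tokenizer, `toy` would be replaced by something like `lambda t: tokenizer.encode(t)` for each model under test.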
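The ranking rule stated above ("by parity closest to 1.0, then average chars/token") can be expressed directly; the rows below are taken from the benchmark table (a subset, for brevity), and treating a higher average chars/token as the tie-breaker winner is an assumption about the intended direction.

```python
# Reproduce the table's ranking rule on a subset of its rows:
# (name, AR chars/tok, EN chars/tok, parity) taken from the table above.
rows = [
    ("Falcon-H1-7B", 3.272, 2.835, 1.1543),
    ("GPT-4o", 3.111, 3.492, 0.8909),
    ("Gemma-3-4B", 2.864, 2.911, 0.9840),
    ("SARF", 2.832, 3.163, 0.8952),
]

# Sort by distance of parity from 1.0; break ties by average chars/token
# (assumed: higher compression ranks first on a tie).
ranked = sorted(rows, key=lambda r: (abs(r[3] - 1.0), -(r[1] + r[2]) / 2))
print([name for name, *_ in ranked])
# ['Gemma-3-4B', 'SARF', 'GPT-4o', 'Falcon-H1-7B']
```

This reproduces the relative order of those four tokenizers in the table (ranks 1, 5, 6, and 9).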
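The morpheme-to-PUA rewriting at the heart of the pipeline (steps 2 and 3 above) can be sketched as follows. The morpheme inventory, PUA assignments, and function names here are invented for illustration; in SARF the inventory comes from unsupervised Morfessor segmentation, and the actual ByteRewriter implementation may differ.

```python
# Illustrative sketch of morpheme-to-PUA rewriting (not SARF's actual code).

PUA_START = 0xE000  # first code point of the Unicode Private Use Area

# Hypothetical morpheme inventory; in SARF this comes from Morfessor.
morphemes = ["ال", "كتاب", "مدرس", "ون", "ها"]

# Step 2: assign each morpheme a single PUA character.
to_pua = {m: chr(PUA_START + i) for i, m in enumerate(morphemes)}
from_pua = {v: k for k, v in to_pua.items()}

def rewrite(text: str) -> str:
    """Greedily replace known morphemes with their PUA characters,
    longest match first, so each morpheme becomes a single 'character'
    from BPE's point of view."""
    for m in sorted(to_pua, key=len, reverse=True):
        text = text.replace(m, to_pua[m])
    return text

def restore(text: str) -> str:
    """Invert the rewriting after tokenization/detokenization."""
    for pua, m in from_pua.items():
        text = text.replace(pua, m)
    return text

word = "المدرسون"          # "the teachers": al- + mudarris + -un
rewritten = rewrite(word)   # collapses to 3 PUA characters
assert restore(rewritten) == word
print(len(word), len(rewritten))  # 8 3
```

Because the rewritten word is only a few characters long, BPE trained on such text tends to merge whole morphemes into single vocabulary entries, which is how SARF keeps Arabic fertility low with a 72K vocabulary.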