SARF Tokenizer

SARF (Segmentation-Aware Rewriting Framework) is a morphologically-aware tokenizer for Arabic that combines unsupervised morphological segmentation (Morfessor) with Byte-Pair Encoding. It maps Arabic morphemes to Unicode Private Use Area (PUA) characters before BPE training, so each morpheme enters BPE as a single symbol, achieving strong Arabic compression and Arabic-English parity with a compact 72K vocabulary.
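A minimal usage sketch, assuming the tokenizer is published in standard Hugging Face format (the repository ID is taken from the Tokenizers Compared table below):

```python
from transformers import AutoTokenizer

# Repository ID from the Tokenizers Compared table; loading via
# AutoTokenizer assumes the standard Hugging Face tokenizer format.
tok = AutoTokenizer.from_pretrained("almaghrabima/deeplatent-tokenizer-parity")

text = "المكتبة الوطنية"  # "the national library"
tokens = tok.tokenize(text)
print(len(tokens), tokens)  # fewer tokens per word = lower fertility
```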

Benchmark Results

Evaluated on ~5,000 Arabic and ~5,000 English samples from the eval-test-data dataset. Tokenizers are ranked by parity (closest to 1.0), then by average chars/token.

| Rank | Tokenizer | Vocab | AR Fertility | AR Chars/Tok | EN Fertility | EN Chars/Tok | Parity |
|------|-----------|-------|--------------|--------------|--------------|--------------|--------|
| 1 | Gemma-3-4B | 262,145 | 2.311 | 2.864 | 1.137 | 2.911 | 0.9840 |
| 2 | Fanar-1-9B | 128,256 | 2.264 | 2.812 | 1.141 | 2.880 | 0.9764 |
| 3 | Hala-9B | 128,256 | 2.264 | 2.812 | 1.141 | 2.880 | 0.9764 |
| 4 | Command-R-Arabic | 255,033 | 2.320 | 2.799 | 1.142 | 2.906 | 0.9631 |
| 5 | SARF (Ours) | 72,195 | 1.978 | 2.832 | 1.561 | 3.163 | 0.8952 |
| 6 | GPT-4o | 200,019 | 2.249 | 3.111 | 1.213 | 3.492 | 0.8909 |
| 7 | Qwen3-4B | 151,669 | 2.314 | 2.599 | 1.225 | 2.964 | 0.8767 |
| 8 | Qwen3-VL-4B | 151,669 | 2.314 | 2.599 | 1.225 | 2.964 | 0.8767 |
| 9 | Falcon-H1-7B | 130,049 | 2.083 | 3.272 | 1.266 | 2.835 | 1.1543 |
| 10 | ALLaM-7B | 64,000 | 1.286 | 3.898 | 1.197 | 2.699 | 1.4442 |
| 11 | Mistral-7B-v0.3 | 32,768 | 5.133 | 1.131 | 1.218 | 2.702 | 0.4185 |
| 12 | GPT-4 | 100,277 | 4.111 | 1.430 | 1.225 | 3.452 | 0.4144 |
| 13 | AceGPT-13B | 32,000 | 5.236 | 1.110 | 1.237 | 2.691 | 0.4124 |

Metric Definitions

  • AR Fertility: Arabic tokens per word (lower = better)
  • AR Chars/Tok: Arabic characters per token (higher = better compression)
  • EN Fertility: English tokens per word (lower = better)
  • EN Chars/Tok: English characters per token (higher = better compression)
  • Parity: AR Chars/Tok / EN Chars/Tok (closer to 1.0 = more balanced)
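All four metrics fall out of simple token counts. A minimal sketch, assuming whitespace word-splitting (not necessarily how tokenizer_benchmark.py implements it; encode stands for any tokenizer's encode function):

```python
def fertility(texts, encode):
    """Average tokens per whitespace-delimited word (lower is better)."""
    n_tokens = sum(len(encode(t)) for t in texts)
    n_words = sum(len(t.split()) for t in texts)
    return n_tokens / n_words

def chars_per_token(texts, encode):
    """Average characters per token (higher = better compression)."""
    n_chars = sum(len(t) for t in texts)
    n_tokens = sum(len(encode(t)) for t in texts)
    return n_chars / n_tokens

def parity(ar_texts, en_texts, encode):
    """AR chars/token divided by EN chars/token; 1.0 means balanced."""
    return chars_per_token(ar_texts, encode) / chars_per_token(en_texts, encode)
```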

Key Findings

  • Among the nine top-ranked tokenizers, SARF has the lowest Arabic fertility (1.978 tokens/word), demonstrating that morphological preprocessing enables efficient Arabic tokenization without a massive vocabulary.
  • With only a 72K vocabulary, SARF achieves Arabic compression (2.832 chars/token) competitive with tokenizers two to nearly four times its size.
  • SARF maintains balanced parity (0.895), meaning Arabic and English text are tokenized with similar efficiency, unlike GPT-4 (0.414) or ALLaM (1.444), which show strong language bias.
  • SARF ranks 5th in parity out of 13 tokenizers despite having the smallest vocabulary among the top 9.

Tokenizers Compared

| Tokenizer | Model | Source |
|-----------|-------|--------|
| SARF | DeepLatent | almaghrabima/deeplatent-tokenizer-parity |
| GPT-4o | o200k_base | tiktoken |
| GPT-4 | cl100k_base | tiktoken |
| ALLaM-7B | humain-ai/ALLaM-7B-Instruct-preview | HuggingFace |
| AceGPT-13B | FreedomIntelligence/AceGPT-13B-chat | HuggingFace |
| Gemma-3-4B | google/gemma-3-4b-it | HuggingFace |
| Command-R-Arabic | CohereLabs/c4ai-command-r7b-arabic-02-2025 | HuggingFace |
| Fanar-1-9B | QCRI/Fanar-1-9B-Instruct | HuggingFace |
| Hala-9B | hammh0a/Hala-9B | HuggingFace |
| Qwen3-4B | Qwen/Qwen3-4B-Instruct-2507 | HuggingFace |
| Qwen3-VL-4B | Qwen/Qwen3-VL-4B-Instruct | HuggingFace |
| Mistral-7B-v0.3 | mistralai/Mistral-7B-Instruct-v0.3 | HuggingFace |
| Falcon-H1-7B | tiiuae/Falcon-H1-7B-Instruct | HuggingFace |

How SARF Works

SARF uses a morphologically-aware preprocessing pipeline before BPE, sketched in code after the list:

  1. Morfessor segments Arabic words into morphemes unsupervised
  2. Morpheme-to-PUA mapping assigns each morpheme a Unicode Private Use Area character
  3. ByteRewriter rewrites Arabic text so morphemes become single characters
  4. BPE trains on the rewritten text, naturally learning morpheme-level tokens
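
A minimal sketch of steps 1-3, using the Morfessor 2.0 Python API; the corpus path, word list, and rewrite helper are illustrative stand-ins, not SARF's actual code:

```python
import morfessor

# Step 1: train an unsupervised segmentation model (Morfessor 2.0 API);
# corpus.txt is an illustrative plain-text Arabic corpus.
io = morfessor.MorfessorIO()
model = morfessor.BaselineModel()
model.load_data(list(io.read_corpus_file("corpus.txt")))
model.train_batch()

# Step 2: collect morphemes and map each to a Private Use Area
# codepoint; U+E000-U+F8FF alone offers 6,400 single-character slots.
words = ["الكتاب", "كتابهم", "مكتبة"]  # illustrative word list
segmentations = {w: model.viterbi_segment(w)[0] for w in words}
morphemes = sorted({m for segs in segmentations.values() for m in segs})
pua_map = {m: chr(0xE000 + i) for i, m in enumerate(morphemes)}

# Step 3: rewrite text so each known morpheme becomes one character;
# step 4 then trains ordinary BPE on the rewritten text, which sees
# every morpheme as an atomic symbol.
def rewrite(word: str) -> str:
    segments, _ = model.viterbi_segment(word)
    return "".join(pua_map.get(m, m) for m in segments)
```

Decoding presumably inverts pua_map to restore the original morphemes from the PUA characters.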

This approach achieves strong Arabic compression without inflating the vocabulary for English or other languages.

Files

  • results.json: Raw benchmark data
  • tokenizer_benchmark.py: Benchmark script (reproduces the results)

Citation

@misc{sarf2025,
  title={SARF: Segmentation-Aware Rewriting Framework for Arabic Tokenization},
  author={Al-Maghrabima},
  year={2025},
  url={https://huggingface.co/almaghrabima/SARF-Tokenizer}
}

License

CC-BY-NC-4.0
