SARF Tokenizer

SARF (Segmentation-Aware Rewriting Framework) is a morphologically-aware tokenizer for Arabic that combines unsupervised morphological segmentation (Morfessor) with Byte-Pair Encoding. It maps Arabic morphemes to Unicode Private Use Area (PUA) characters before BPE training, so each morpheme enters BPE as a single symbol, achieving strong Arabic compression and Arabic-English parity with a compact 72K vocabulary.
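A minimal usage sketch, assuming the tokenizer is published in standard Hugging Face format (the repository ID is taken from the Tokenizers Compared table below):

```python
from transformers import AutoTokenizer

# Repository ID from the Tokenizers Compared table; loading via
# AutoTokenizer assumes the standard Hugging Face tokenizer format.
tok = AutoTokenizer.from_pretrained("almaghrabima/deeplatent-tokenizer-parity")

text = "المكتبة الوطنية"  # "the national library"
tokens = tok.tokenize(text)
print(len(tokens), tokens)  # fewer tokens per word = lower fertility
```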

Benchmark Results

Evaluated on ~5,000 Arabic and ~5,000 English samples from the eval-test-data dataset. Tokenizers are ranked by parity (closest to 1.0), then by average chars/token.

| Rank | Tokenizer | Vocab | AR Fertility | AR Chars/Tok | EN Fertility | EN Chars/Tok | Parity |
|------|-----------|-------|--------------|--------------|--------------|--------------|--------|
| 1 | Gemma-3-4B | 262,145 | 2.311 | 2.864 | 1.137 | 2.911 | 0.9840 |
| 2 | Fanar-1-9B | 128,256 | 2.264 | 2.812 | 1.141 | 2.880 | 0.9764 |
| 3 | Hala-9B | 128,256 | 2.264 | 2.812 | 1.141 | 2.880 | 0.9764 |
| 4 | Command-R-Arabic | 255,033 | 2.320 | 2.799 | 1.142 | 2.906 | 0.9631 |
| 5 | SARF (Ours) | 72,195 | 1.978 | 2.832 | 1.561 | 3.163 | 0.8952 |
| 6 | GPT-4o | 200,019 | 2.249 | 3.111 | 1.213 | 3.492 | 0.8909 |
| 7 | Qwen3-4B | 151,669 | 2.314 | 2.599 | 1.225 | 2.964 | 0.8767 |
| 8 | Qwen3-VL-4B | 151,669 | 2.314 | 2.599 | 1.225 | 2.964 | 0.8767 |
| 9 | Falcon-H1-7B | 130,049 | 2.083 | 3.272 | 1.266 | 2.835 | 1.1543 |
| 10 | ALLaM-7B | 64,000 | 1.286 | 3.898 | 1.197 | 2.699 | 1.4442 |
| 11 | Mistral-7B-v0.3 | 32,768 | 5.133 | 1.131 | 1.218 | 2.702 | 0.4185 |
| 12 | GPT-4 | 100,277 | 4.111 | 1.430 | 1.225 | 3.452 | 0.4144 |
| 13 | AceGPT-13B | 32,000 | 5.236 | 1.110 | 1.237 | 2.691 | 0.4124 |

Metric Definitions

  • AR Fertility: Arabic tokens per word (lower = better)
  • AR Chars/Tok: Arabic characters per token (higher = better compression)
  • EN Fertility: English tokens per word (lower = better)
  • EN Chars/Tok: English characters per token (higher = better compression)
  • Parity: AR Chars/Tok / EN Chars/Tok (closer to 1.0 = more balanced)
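All four metrics fall out of simple token counts. A minimal sketch, assuming whitespace word-splitting (not necessarily how tokenizer_benchmark.py implements it; encode stands for any tokenizer's encode function):

```python
def fertility(texts, encode):
    """Average tokens per whitespace-delimited word (lower is better)."""
    n_tokens = sum(len(encode(t)) for t in texts)
    n_words = sum(len(t.split()) for t in texts)
    return n_tokens / n_words

def chars_per_token(texts, encode):
    """Average characters per token (higher = better compression)."""
    n_chars = sum(len(t) for t in texts)
    n_tokens = sum(len(encode(t)) for t in texts)
    return n_chars / n_tokens

def parity(ar_texts, en_texts, encode):
    """AR chars/token divided by EN chars/token; 1.0 means balanced."""
    return chars_per_token(ar_texts, encode) / chars_per_token(en_texts, encode)
```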

Key Findings

  • Among the nine top-ranked tokenizers, SARF has the lowest Arabic fertility (1.978 tokens/word), demonstrating that morphological preprocessing enables efficient Arabic tokenization without a massive vocabulary.
  • With only a 72K vocabulary, SARF achieves Arabic compression (2.832 chars/token) competitive with tokenizers two to nearly four times its size.
  • SARF maintains balanced parity (0.895), meaning Arabic and English text are tokenized with similar efficiency, unlike GPT-4 (0.414) or ALLaM (1.444), which show strong language bias.
  • SARF ranks 5th in parity out of 13 tokenizers despite having the smallest vocabulary among the top 9.

Tokenizers Compared

| Tokenizer | Model | Source |
|-----------|-------|--------|
| SARF | DeepLatent | almaghrabima/deeplatent-tokenizer-parity |
| GPT-4o | o200k_base | tiktoken |
| GPT-4 | cl100k_base | tiktoken |
| ALLaM-7B | humain-ai/ALLaM-7B-Instruct-preview | HuggingFace |
| AceGPT-13B | FreedomIntelligence/AceGPT-13B-chat | HuggingFace |
| Gemma-3-4B | google/gemma-3-4b-it | HuggingFace |
| Command-R-Arabic | CohereLabs/c4ai-command-r7b-arabic-02-2025 | HuggingFace |
| Fanar-1-9B | QCRI/Fanar-1-9B-Instruct | HuggingFace |
| Hala-9B | hammh0a/Hala-9B | HuggingFace |
| Qwen3-4B | Qwen/Qwen3-4B-Instruct-2507 | HuggingFace |
| Qwen3-VL-4B | Qwen/Qwen3-VL-4B-Instruct | HuggingFace |
| Mistral-7B-v0.3 | mistralai/Mistral-7B-Instruct-v0.3 | HuggingFace |
| Falcon-H1-7B | tiiuae/Falcon-H1-7B-Instruct | HuggingFace |

How SARF Works

SARF uses a morphologically-aware preprocessing pipeline before BPE, sketched in code after the list:

  1. Morfessor segments Arabic words into morphemes unsupervised
  2. Morpheme-to-PUA mapping assigns each morpheme a Unicode Private Use Area character
  3. ByteRewriter rewrites Arabic text so morphemes become single characters
  4. BPE trains on the rewritten text, naturally learning morpheme-level tokens
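
A minimal sketch of steps 1-3, using the Morfessor 2.0 Python API; the corpus path, word list, and rewrite helper are illustrative stand-ins, not SARF's actual code:

```python
import morfessor

# Step 1: train an unsupervised segmentation model (Morfessor 2.0 API);
# corpus.txt is an illustrative plain-text Arabic corpus.
io = morfessor.MorfessorIO()
model = morfessor.BaselineModel()
model.load_data(list(io.read_corpus_file("corpus.txt")))
model.train_batch()

# Step 2: collect morphemes and map each to a Private Use Area
# codepoint; U+E000-U+F8FF alone offers 6,400 single-character slots.
words = ["الكتاب", "كتابهم", "مكتبة"]  # illustrative word list
segmentations = {w: model.viterbi_segment(w)[0] for w in words}
morphemes = sorted({m for segs in segmentations.values() for m in segs})
pua_map = {m: chr(0xE000 + i) for i, m in enumerate(morphemes)}

# Step 3: rewrite text so each known morpheme becomes one character;
# step 4 then trains ordinary BPE on the rewritten text, which sees
# every morpheme as an atomic symbol.
def rewrite(word: str) -> str:
    segments, _ = model.viterbi_segment(word)
    return "".join(pua_map.get(m, m) for m in segments)
```

Decoding presumably inverts pua_map to restore the original morphemes from the PUA characters.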

This approach achieves strong Arabic compression without inflating the vocabulary for English or other languages.

Files

  • results.json: Raw benchmark data
  • tokenizer_benchmark.py: Benchmark script (reproduces the results)

Citation

@misc{sarf2025,
  title={SARF: Segmentation-Aware Rewriting Framework for Arabic Tokenization},
  author={Al-Maghrabima},
  year={2025},
  url={https://huggingface.co/almaghrabima/SARF-Tokenizer}
}

License

CC-BY-NC-4.0
