---
license: cc-by-nc-4.0
language:
- ar
- en
tags:
- tokenizer
- arabic
- morphology
- benchmark
---

# SARF Tokenizer

**SARF** (Segmentation-Aware Rewriting Framework) is a morphologically aware tokenizer for Arabic that combines unsupervised morphological segmentation (Morfessor) with Byte-Pair Encoding (BPE). It uses Unicode Private Use Area (PUA) characters to map Arabic morphemes to single characters before BPE training, achieving strong Arabic tokenization with a compact 72K vocabulary.
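
Loading the tokenizer could look like the sketch below. This is a minimal usage illustration under two assumptions: that the repo id from the citation at the bottom of this card is the published artifact, and that it is transformers-compatible with the PUA rewriting bundled into its preprocessing.

```python
# Minimal usage sketch. The repo id is taken from the citation below
# and is an assumption about where the tokenizer is published.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("almaghrabima/SARF-Tokenizer")

text = "التعلم العميق يغير معالجة اللغة العربية"  # "Deep learning is changing Arabic NLP"
ids = tok.encode(text)
print(len(ids), tok.convert_ids_to_tokens(ids))
```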

## Benchmark Results

Evaluated on ~5,000 Arabic and ~5,000 English samples from the eval-test-data dataset. Tokenizers are ranked by parity (closest to 1.0), then by average chars/token.

| Rank | Tokenizer | Vocab | AR Fertility | AR Chars/Tok | EN Fertility | EN Chars/Tok | Parity |
|------|-----------|------:|-------------:|-------------:|-------------:|-------------:|-------:|
| 1 | Gemma-3-4B | 262,145 | 2.311 | 2.864 | 1.137 | 2.911 | 0.9840 |
| 2 | Fanar-1-9B | 128,256 | 2.264 | 2.812 | 1.141 | 2.880 | 0.9764 |
| 3 | Hala-9B | 128,256 | 2.264 | 2.812 | 1.141 | 2.880 | 0.9764 |
| 4 | Command-R-Arabic | 255,033 | 2.320 | 2.799 | 1.142 | 2.906 | 0.9631 |
| 5 | **SARF (Ours)** | **72,195** | **1.978** | **2.832** | 1.561 | 3.163 | **0.8952** |
| 6 | GPT-4o | 200,019 | 2.249 | 3.111 | 1.213 | 3.492 | 0.8909 |
| 7 | Qwen3-4B | 151,669 | 2.314 | 2.599 | 1.225 | 2.964 | 0.8767 |
| 8 | Qwen3-VL-4B | 151,669 | 2.314 | 2.599 | 1.225 | 2.964 | 0.8767 |
| 9 | Falcon-H1-7B | 130,049 | 2.083 | 3.272 | 1.266 | 2.835 | 1.1543 |
| 10 | ALLaM-7B | 64,000 | 1.286 | 3.898 | 1.197 | 2.699 | 1.4442 |
| 11 | Mistral-7B-v0.3 | 32,768 | 5.133 | 1.131 | 1.218 | 2.702 | 0.4185 |
| 12 | GPT-4 | 100,277 | 4.111 | 1.430 | 1.225 | 3.452 | 0.4144 |
| 13 | AceGPT-13B | 32,000 | 5.236 | 1.110 | 1.237 | 2.691 | 0.4124 |

### Metric Definitions

- **AR Fertility**: Arabic tokens per word (lower = better)
- **AR Chars/Tok**: Arabic characters per token (higher = better compression)
- **EN Fertility**: English tokens per word (lower = better)
- **EN Chars/Tok**: English characters per token (higher = better compression)
- **Parity**: AR Chars/Tok / EN Chars/Tok (closer to 1.0 = more balanced; see the sketch below)
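
As a concrete reference, the following sketch computes the three metrics for a single tokenizer. It is an illustration, not the benchmark script (`tokenizer_benchmark.py` reproduces the table); whitespace word-splitting and the generic `encode` callable are simplifying assumptions.

```python
# Metric sketch, not the official benchmark script. Assumes `encode`
# maps text -> list of token ids and that whitespace splitting
# approximates words.
from typing import Callable, List

def fertility(texts: List[str], encode: Callable[[str], list]) -> float:
    """Tokens per word (lower = better)."""
    tokens = sum(len(encode(t)) for t in texts)
    words = sum(len(t.split()) for t in texts)
    return tokens / words

def chars_per_token(texts: List[str], encode: Callable[[str], list]) -> float:
    """Characters per token (higher = better compression)."""
    chars = sum(len(t) for t in texts)
    tokens = sum(len(encode(t)) for t in texts)
    return chars / tokens

def parity(ar: List[str], en: List[str], encode: Callable[[str], list]) -> float:
    """AR chars/token divided by EN chars/token (1.0 = balanced)."""
    return chars_per_token(ar, encode) / chars_per_token(en, encode)
```

For SARF, the table gives 2.832 / 3.163 ≈ 0.895, matching the reported parity.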

### Key Findings

- SARF achieves the **lowest Arabic fertility** (1.978 tokens/word) among the top nine tokenizers by parity, demonstrating that morphological preprocessing enables efficient Arabic tokenization without a massive vocabulary. (ALLaM-7B's lower fertility of 1.286 comes with a heavily Arabic-skewed parity of 1.444.)
- With only a **72K vocabulary**, SARF achieves Arabic compression (2.832 chars/token) competitive with tokenizers two to nearly four times its size (Fanar-1-9B: 2.812; Command-R-Arabic: 2.799; Gemma-3-4B: 2.864).
- SARF has **strong parity** (0.895), meaning Arabic and English text are tokenized with similar efficiency, unlike GPT-4 (0.414) or ALLaM (1.444), which show strong language bias.
- SARF ranks **5th in parity** out of 13 tokenizers despite having the **smallest vocabulary** among the top 9.

## Tokenizers Compared

| Tokenizer | Model | Source |
|-----------|-------|--------|
| SARF | DeepLatent | [almaghrabima/deeplatent-tokenizer-parity](https://huggingface.co/almaghrabima/deeplatent-tokenizer-parity) |
| GPT-4o | o200k_base | tiktoken |
| GPT-4 | cl100k_base | tiktoken |
| ALLaM-7B | humain-ai/ALLaM-7B-Instruct-preview | HuggingFace |
| AceGPT-13B | FreedomIntelligence/AceGPT-13B-chat | HuggingFace |
| Gemma-3-4B | google/gemma-3-4b-it | HuggingFace |
| Command-R-Arabic | CohereLabs/c4ai-command-r7b-arabic-02-2025 | HuggingFace |
| Fanar-1-9B | QCRI/Fanar-1-9B-Instruct | HuggingFace |
| Hala-9B | hammh0a/Hala-9B | HuggingFace |
| Qwen3-4B | Qwen/Qwen3-4B-Instruct-2507 | HuggingFace |
| Qwen3-VL-4B | Qwen/Qwen3-VL-4B-Instruct | HuggingFace |
| Mistral-7B-v0.3 | mistralai/Mistral-7B-Instruct-v0.3 | HuggingFace |
| Falcon-H1-7B | tiiuae/Falcon-H1-7B-Instruct | HuggingFace |
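
For reproduction, the HuggingFace-hosted entries load with `AutoTokenizer` and the OpenAI entries with `tiktoken`. The sketch below shows both paths; it is a loading illustration only (the full evaluation lives in `tokenizer_benchmark.py`), and the sample string is an arbitrary example.

```python
# Sketch: loading the two kinds of tokenizers compared above.
# Requires `transformers` and `tiktoken`; ids are from the table.
import tiktoken
from transformers import AutoTokenizer

# HuggingFace-hosted tokenizers expose .encode(text) -> token ids
hf_tok = AutoTokenizer.from_pretrained("QCRI/Fanar-1-9B-Instruct")

# OpenAI tokenizers ship as tiktoken encodings
gpt4o_enc = tiktoken.get_encoding("o200k_base")

sample = "مرحبا بالعالم"  # "Hello, world"
print(len(hf_tok.encode(sample)), len(gpt4o_enc.encode(sample)))
```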

## How SARF Works

SARF applies a morphologically aware preprocessing pipeline before BPE (a minimal sketch appears below):

1. **Morfessor** segments Arabic words into morphemes without supervision
2. **Morpheme-to-PUA mapping** assigns each morpheme a Unicode Private Use Area character
3. **ByteRewriter** rewrites Arabic text so that morphemes become single characters
4. **BPE** trains on the rewritten text, naturally learning morpheme-level tokens

This approach achieves strong Arabic compression without inflating the vocabulary for English or other languages.
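
A minimal sketch of steps 2 and 3, assuming a hypothetical two-entry morpheme table; in SARF the table comes from Morfessor's unsupervised segmentation of the training corpus, and `rewrite`/`restore` are illustrative names, not SARF's actual API.

```python
# Illustrative sketch of the morpheme -> PUA rewriting step.
# The two morphemes below are hypothetical examples; SARF derives
# its table from unsupervised Morfessor segmentation.
PUA_START = 0xE000  # first code point of the BMP Private Use Area

morphemes = ["ال", "كتاب"]  # e.g. the definite article and the stem "book"
to_pua = {m: chr(PUA_START + i) for i, m in enumerate(morphemes)}
from_pua = {c: m for m, c in to_pua.items()}

def rewrite(text: str) -> str:
    """Replace known morphemes with single PUA characters (longest
    first), so BPE sees each morpheme as one symbol."""
    for m in sorted(to_pua, key=len, reverse=True):
        text = text.replace(m, to_pua[m])
    return text

def restore(text: str) -> str:
    """Invert the rewriting after detokenization."""
    for c, m in from_pua.items():
        text = text.replace(c, m)
    return text

s = "الكتاب"        # "the book" = definite article + stem
r = rewrite(s)
assert len(r) == 2   # two PUA characters instead of six Arabic letters
assert restore(r) == s
```

BPE then trains on the rewritten corpus, so each frequent morpheme is already a single symbol and becomes one vocabulary entry; at decode time the inverse mapping restores the original Arabic surface form.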

## Files

- `results.json` — Raw benchmark data
- `tokenizer_benchmark.py` — Benchmark script (reproduces the results above)

## Citation

```bibtex
@misc{sarf2025,
  title={SARF: Segmentation-Aware Rewriting Framework for Arabic Tokenization},
  author={Al-Maghrabima},
  year={2025},
  url={https://huggingface.co/almaghrabima/SARF-Tokenizer}
}
```

## License

CC-BY-NC-4.0