DeepLatent SARF Tokenizer
Part of Suhail Project - Independent Research by Mohammed Almaghrabi
This is SARF (the Sarf-Aware Representation Framework), a tokenizer trained on bilingual Arabic/English data and designed for the DeepLatent language model.
What is SARF?
SARF (صَرْف) is the Arabic term for morphology. In classical and modern Arabic linguistics, ṣarf refers to the system that governs:
- Word formation
- Roots and patterns (جذر / وزن)
- Prefixes, suffixes, infixes
- Tense, gender, number, and derivation
Ṣarf is the exact linguistic layer that makes Arabic hard for naive tokenizers.
SARF combines morphological analysis with BPE tokenization to achieve better compression, especially for morphologically rich languages like Arabic.
Most tokenizers treat Arabic as bytes or characters. SARF treats Arabic as a language.
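To make the idea concrete, the sketch below shows how a morpheme-aware pre-tokenizer might peel clitics off an Arabic word before BPE runs. The affix lists and the `segment_morphemes` helper are hypothetical illustrations, not SARF's actual rules.

```python
# A minimal sketch of morpheme-aware pre-tokenization, assuming simple
# longest-match affix stripping. The affix lists and helper name are
# illustrative only; SARF's real morphological analysis is not shown here.
PREFIXES = ["وال", "بال", "كال", "فال", "ال", "و", "ف", "ب", "ك", "ل"]
SUFFIXES = ["ها", "هم", "كم", "نا", "ون", "ين", "ات", "ة", "ه", "ك"]

def segment_morphemes(word: str) -> list[str]:
    """Peel clitic prefixes and one suffix off an Arabic word,
    leaving a stem that BPE can compress consistently."""
    prefixes = []
    stripped = True
    while stripped:
        stripped = False
        for p in PREFIXES:  # longest affixes listed first
            # Require a reasonably long remainder to avoid over-stripping
            if word.startswith(p) and len(word) - len(p) >= 4:
                prefixes.append(p)
                word = word[len(p):]
                stripped = True
                break
    suffix = []
    for s in SUFFIXES:
        if word.endswith(s) and len(word) - len(s) >= 4:
            suffix = [s]
            word = word[:-len(s)]
            break
    return prefixes + [word] + suffix

# "وبالكتاب" ("and with the book") -> conjunction, prep+article, stem
print(segment_morphemes("وبالكتاب"))  # ['و', 'بال', 'كتاب']
```

Each morpheme is then handed to the BPE layer separately, so a frequent stem such as كتاب ("book") maps to the same tokens whether or not clitics are attached.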
Installation
Install the suhail-nlp package from PyPI:
```bash
pip install suhail-nlp
```
Quick Start
```python
from suhail import SARFTokenizer

# Load tokenizer (automatically downloads from HuggingFace)
tokenizer = SARFTokenizer.from_pretrained()

# Encode text (SARF preprocessing is applied automatically)
text = "مرحبا بكم Hello world"
tokens = tokenizer.encode(text)
print(f"Tokens: {tokens}")

# Decode back to text
decoded = tokenizer.decode(tokens)
print(f"Decoded: {decoded}")
```
The suhail-nlp package applies SARF morpheme preprocessing automatically; this preprocessing is what drives the tokenization efficiency for Arabic text reported below.
Evaluation Results
| Metric | With SARF Preprocessing | Without Preprocessing |
|---|---|---|
| Arabic Fertility | 2.29 | 5.65 |
| English Fertility | 2.10 | 2.91 |
| Parity (Ar/En) | 1.09 | 1.94 |
| Interpretation | Excellent | Moderate |
Fertility = average tokens per word. Lower is better. Parity closer to 1.0 means more equal treatment between languages.
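These metrics are straightforward to reproduce. The sketch below computes fertility and parity using the tokenizer from the Quick Start; the sample sentences are placeholders, and any tokenizer exposing an `encode()` method can be scored the same way.

```python
from suhail import SARFTokenizer

tokenizer = SARFTokenizer.from_pretrained()

def fertility(tok, sentences):
    """Average number of tokens per whitespace-delimited word."""
    total_tokens = sum(len(tok.encode(s)) for s in sentences)
    total_words = sum(len(s.split()) for s in sentences)
    return total_tokens / total_words

arabic = ["مرحبا بكم في المشروع"]     # placeholder samples
english = ["Welcome to the project"]

ar_f = fertility(tokenizer, arabic)
en_f = fertility(tokenizer, english)
print(f"Arabic fertility:  {ar_f:.2f}")
print(f"English fertility: {en_f:.2f}")
print(f"Parity (Ar/En):    {ar_f / en_f:.2f}")  # closer to 1.0 is better
```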
Evaluation Dataset
Evaluation data (10,000 samples: 5,000 Arabic + 5,000 English) is available on the Hugging Face Hub at `almaghrabima/eval-test-data`.
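To rerun the numbers, the dataset can be pulled with the `datasets` library. Note that the split and column names are not documented here, so the `split="train"` argument below is an assumption to verify after loading.

```python
from datasets import load_dataset

# Split/column names are assumptions; inspect ds.column_names to confirm.
ds = load_dataset("almaghrabima/eval-test-data", split="train")
print(ds)
print(ds.column_names)
```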
Performance Comparison
SARF achieves excellent Arabic efficiency while maintaining strong English performance. Evaluated on 10,000 balanced samples (5,000 Arabic + 5,000 English):
| Tokenizer | Vocab Size | Arabic Fertility | Arabic Chars/Token | English Fertility | English Chars/Token | Score |
|---|---|---|---|---|---|---|
| SARF | 100,000 | 1.469 | 3.959 | 1.779 | 3.353 | 2.251 |
| GPT-4o (o200k_base) | 200,019 | 1.874 | 3.105 | 1.718 | 3.472 | 1.831 |
| ALLaM-7B | 64,000 | 1.496 | 3.888 | 2.234 | 2.669 | 1.758 |
| AceGPT-13B | 44,800 | 1.777 | 3.274 | 2.238 | 2.664 | 1.479 |
| Gemma-3-4B | 262,145 | 2.033 | 2.862 | 2.075 | 2.874 | 1.396 |
| Command-R Arabic | 255,033 | 2.084 | 2.791 | 2.076 | 2.873 | 1.362 |
| Fanar-1-9B | 128,256 | 2.071 | 2.809 | 2.096 | 2.845 | 1.357 |
| Hala-9B | 128,256 | 2.071 | 2.809 | 2.096 | 2.845 | 1.357 |
| Qwen2.5-7B | 151,665 | 2.240 | 2.596 | 2.035 | 2.930 | 1.293 |
| Qwen3-VL-4B | 151,669 | 2.240 | 2.596 | 2.035 | 2.930 | 1.293 |
| GPT-4 (cl100k_base) | 100,277 | 4.071 | 1.429 | 1.736 | 3.435 | 0.838 |
| Mistral-7B | 32,768 | 5.148 | 1.130 | 2.230 | 2.674 | 0.516 |
Key Metrics:
- Fertility: Tokens per word (lower = more efficient, fewer tokens needed)
- Chars/Token: Characters per token (higher = better compression per token)
- Score: Combined bilingual efficiency metric (higher = better)
Understanding the Score
The Score metric measures overall tokenizer efficiency across both languages:
Score = (Arabic Chars/Token + English Chars/Token) / (Arabic Fertility + English Fertility)
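For example, plugging the SARF and GPT-4o rows from the table into this formula reproduces the reported scores:

```python
def score(ar_cpt, en_cpt, ar_fert, en_fert):
    """Combined bilingual efficiency: total compression over total fertility."""
    return (ar_cpt + en_cpt) / (ar_fert + en_fert)

print(round(score(3.959, 3.353, 1.469, 1.779), 3))  # SARF   -> 2.251
print(round(score(3.105, 3.472, 1.874, 1.718), 3))  # GPT-4o -> 1.831
```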
Score Interpretation:
- Score > 2.0: Excellent bilingual efficiency (SARF achieves 2.251)
- Score 1.5-2.0: Good efficiency (GPT-4o, ALLaM-7B)
- Score 1.0-1.5: Moderate efficiency (most Arabic-focused models)
- Score < 1.0: Poor efficiency for Arabic (GPT-4, Mistral)
Key Findings
- SARF ranks #1 with a Score of 2.251, ahead of the other 11 tokenizers tested
- 23% better than GPT-4o: Score 2.251 vs 1.831
- Best vocabulary efficiency: With only a 100K vocabulary, SARF outperforms models with 2-2.6x larger vocabularies
- Balanced multilingual performance: Strong on both Arabic and English
Tokenizer Details
- Type: SARF (Sarf-Aware Representation Framework)
- Vocabulary Size: 100,000
- Special Tokens: 13
- Languages: Arabic + English (50/50 balanced)
- Target Model: DeepLatent
Special Tokens
`<|assistant_end|>`, `<|assistant_start|>`, `<|bos|>`, `<|end_of_text|>`, `<|mask|>`, `<|output_end|>`, `<|output_start|>`, `<|pad|>`, `<|python_end|>`, `<|python_start|>`, `<|unk|>`, `<|user_end|>`, `<|user_start|>`
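These names suggest a chat layout along the following lines. The actual DeepLatent prompt template is not documented here, so treat the exact arrangement as an assumption, purely an illustration of the token inventory.

```python
# Hypothetical chat layout built from the special-token inventory above;
# the real DeepLatent template may differ.
user_msg = "ما هو الصرف؟"  # "What is sarf?"
prompt = f"<|bos|><|user_start|>{user_msg}<|user_end|><|assistant_start|>"
print(prompt)
```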
Files
- `tokenizer.json`: Main tokenizer file (HuggingFace format)
- `tokenizer.pkl`: BPE tokenizer (native format)
- `tokenizer_config.json`: Tokenizer configuration
- `special_tokens_map.json`: Special tokens mapping
- `token_bytes.pt`: Token byte mapping
Author
- Mohammed Almaghrabi
- Email: almaghrabima@gmail.com
- Project: Suhail Project
- This is independent research
License
This tokenizer is released under CC-BY-NC-4.0 (Creative Commons Attribution-NonCommercial 4.0 International).
You are free to:
- Share: Copy and redistribute the material
- Adapt: Remix, transform, and build upon the material
Under the following terms:
- Attribution: You must give appropriate credit
- NonCommercial: You may not use the material for commercial purposes
For commercial licensing, please contact: almaghrabima@gmail.com
Citation
If you use this tokenizer in your research, please cite:
```bibtex
@misc{sarf-tokenizer-2026,
  title={SARF: A Morpheme-Aware Tokenization Framework for Arabic-English - Suhail Project},
  author={Almaghrabi, Mohammed},
  year={2026},
  url={https://huggingface.co/almaghrabima/deeplatent-tokenizer},
  note={Independent research, part of Suhail Project}
}
```