---
license: cc-by-nc-4.0
tags:
- tokenizer
- sarf
- morpheme
- bpe
- deeplatent
- bilingual
- arabic-english
- arabic
- morphology
language:
- ar
- en
---
# DeepLatent SARF Tokenizer
**Part of Suhail Project - Independent Research by Mohammed Almaghrabi**
This is the **SARF** (Sarf-Aware Representation Framework) tokenizer for the DeepLatent language model. It was trained on bilingual Arabic/English data.
## What is SARF?
**SARF (صَرْف)** is the Arabic term for **morphology**. In classical and modern Arabic linguistics, *ṣarf* refers to the system that governs:
- Word formation
- Roots and patterns (جذر / وزن)
- Prefixes, suffixes, infixes
- Tense, gender, number, and derivation
> **Ṣarf is the exact linguistic layer that makes Arabic hard for naive tokenizers.**
SARF combines morphological analysis with BPE tokenization to achieve better compression, especially for morphologically rich languages like Arabic.
Most tokenizers treat Arabic as **bytes or characters**. **SARF treats Arabic as a *language*.**
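As a toy illustration of that idea (a sketch only, not the actual SARF algorithm or its morpheme inventory), consider segmenting a word into its morphemes before any subword tokenizer sees it:

```python
# Toy sketch of morpheme-first segmentation -- NOT the real SARF implementation.
# The idea: split a word into linguistically meaningful units (prefixes, stem,
# suffixes) so that BPE operates on morphemes rather than on raw characters.

TOY_SEGMENTS = {
    # "and the library" -> conjunction + definite article + stem
    "والمكتبة": ["و", "ال", "مكتبة"],
}

def segment(word: str) -> list[str]:
    """Return the morpheme segmentation if known, else the word unchanged."""
    return TOY_SEGMENTS.get(word, [word])

print(segment("والمكتبة"))  # ['و', 'ال', 'مكتبة']
```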
## Installation
Install the `suhail-nlp` package from PyPI:
```bash
pip install suhail-nlp
```
## Quick Start
```python
from suhail import SARFTokenizer
# Load tokenizer (automatically downloads from HuggingFace)
tokenizer = SARFTokenizer.from_pretrained()
# Encode text (SARF preprocessing is applied automatically)
text = "مرحبا بكم Hello world"
tokens = tokenizer.encode(text)
print(f"Tokens: {tokens}")
# Decode back to text
decoded = tokenizer.decode(tokens)
print(f"Decoded: {decoded}")
```
The `suhail-nlp` package applies SARF morpheme preprocessing automatically; this preprocessing is what drives the Arabic tokenization efficiency reported in the evaluation below.
## Evaluation Results
| Metric | With SARF Preprocessing | Without Preprocessing |
|--------|------------------------|----------------------|
| Arabic Fertility | 2.29 | 5.65 |
| English Fertility | 2.10 | 2.91 |
| Parity (Ar/En) | 1.09 | 1.94 |
| Interpretation | Excellent | Moderate |
*Fertility = average tokens per word. Lower is better. Parity closer to 1.0 means more equal treatment between languages.*
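These numbers can be sanity-checked directly with the tokenizer from the Quick Start. A minimal sketch, assuming `encode` returns a list of token ids as shown above and using a simple whitespace word split (a real evaluation pipeline would be more careful):

```python
from suhail import SARFTokenizer

tokenizer = SARFTokenizer.from_pretrained()

def fertility(texts: list[str]) -> float:
    """Average tokens per whitespace-delimited word (lower is better)."""
    n_tokens = sum(len(tokenizer.encode(t)) for t in texts)
    n_words = sum(len(t.split()) for t in texts)
    return n_tokens / n_words

ar = fertility(["مرحبا بكم في المكتبة"])
en = fertility(["Welcome to the library"])
print(f"Parity (Ar/En): {ar / en:.2f}")  # closer to 1.0 = more balanced
```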
## Evaluation Dataset
Evaluation data (10,000 samples: 5,000 Arabic + 5,000 English) is available at:
[almaghrabima/eval-test-data](https://huggingface.co/datasets/almaghrabima/eval-test-data)
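The data can be pulled with the `datasets` library. A minimal sketch; the split and column layout are not documented here, so check the dataset card for the actual schema:

```python
from datasets import load_dataset

# Split/column names are assumptions -- see the dataset card for the actual schema.
ds = load_dataset("almaghrabima/eval-test-data")
print(ds)
```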
## Performance Comparison
SARF achieves excellent Arabic efficiency while maintaining strong English performance. Evaluated on 10,000 balanced samples (5,000 Arabic + 5,000 English):
| Tokenizer | Vocab Size | Arabic Fertility | Arabic Chars/Token | English Fertility | English Chars/Token | Score |
|-----------|------------|------------------|-------------------|------------------|---------------------|-------|
| **SARF** | **100,000** | **1.469** | **3.959** | **1.779** | **3.353** | **2.251** |
| GPT-4o (o200k_base) | 200,019 | 1.874 | 3.105 | 1.718 | 3.472 | 1.831 |
| ALLaM-7B | 64,000 | 1.496 | 3.888 | 2.234 | 2.669 | 1.758 |
| AceGPT-13B | 44,800 | 1.777 | 3.274 | 2.238 | 2.664 | 1.479 |
| Gemma-3-4B | 262,145 | 2.033 | 2.862 | 2.075 | 2.874 | 1.396 |
| Command-R Arabic | 255,033 | 2.084 | 2.791 | 2.076 | 2.873 | 1.362 |
| Fanar-1-9B | 128,256 | 2.071 | 2.809 | 2.096 | 2.845 | 1.357 |
| Hala-9B | 128,256 | 2.071 | 2.809 | 2.096 | 2.845 | 1.357 |
| Qwen2.5-7B | 151,665 | 2.24 | 2.596 | 2.035 | 2.93 | 1.293 |
| Qwen3-VL-4B | 151,669 | 2.24 | 2.596 | 2.035 | 2.93 | 1.293 |
| GPT-4 (cl100k_base) | 100,277 | 4.071 | 1.429 | 1.736 | 3.435 | 0.838 |
| Mistral-7B | 32,768 | 5.148 | 1.13 | 2.23 | 2.674 | 0.516 |
**Key Metrics:**
- **Fertility**: Tokens per word (lower = more efficient, fewer tokens needed)
- **Chars/Token**: Characters per token (higher = better compression per token)
- **Score**: Combined bilingual efficiency metric (higher = better)
### Understanding the Score
The **Score** metric measures overall tokenizer efficiency across both languages:
```
Score = (Arabic_Chars/Token + English_Chars/Token) / (Arabic_Fertility + English_Fertility)
```
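Plugging SARF's numbers from the comparison table above into this formula reproduces its reported score:

```python
# SARF's numbers from the comparison table above
ar_cpt, en_cpt = 3.959, 3.353    # chars per token
ar_fert, en_fert = 1.469, 1.779  # tokens per word (fertility)

score = (ar_cpt + en_cpt) / (ar_fert + en_fert)
print(f"{score:.3f}")  # -> 2.251
```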
**Score Interpretation:**
- Score > 2.0: Excellent bilingual efficiency (SARF achieves 2.251)
- Score 1.5-2.0: Good efficiency (GPT-4o, ALLaM-7B)
- Score 1.0-1.5: Moderate efficiency (most Arabic-focused models)
- Score < 1.0: Poor efficiency for Arabic (GPT-4, Mistral)
### Key Findings
1. **SARF ranks #1** with Score 2.251, outperforming the 11 other tokenizers tested
2. **23% better than GPT-4o**: Score 2.251 vs 1.831
3. **Best vocabulary efficiency**: With only a 100K vocabulary, SARF outperforms models with vocabularies up to 2.6x larger
4. **Balanced multilingual performance**: Strong on both Arabic and English
## Tokenizer Details
- **Type**: SARF (Sarf-Aware Representation Framework)
- **Vocabulary Size**: 100,000
- **Special Tokens**: 13
- **Languages**: Arabic + English (50/50 balanced)
- **Target Model**: DeepLatent
## Special Tokens
- `<|assistant_end|>`
- `<|assistant_start|>`
- `<|bos|>`
- `<|end_of_text|>`
- `<|mask|>`
- `<|output_end|>`
- `<|output_start|>`
- `<|pad|>`
- `<|python_end|>`
- `<|python_start|>`
- `<|unk|>`
- `<|user_end|>`
- `<|user_start|>`
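As an illustration, the chat-style markers could frame a prompt as below; the exact template DeepLatent expects is an assumption here, not a documented format.

```python
# Hypothetical prompt layout -- the template DeepLatent actually expects is an assumption.
prompt = (
    "<|bos|>"
    "<|user_start|>ما هي عاصمة فرنسا؟<|user_end|>"  # "What is the capital of France?"
    "<|assistant_start|>"
)
```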
## Files
- `tokenizer.json`: Main tokenizer file (HuggingFace format)
- `tokenizer.pkl`: BPE tokenizer (native format)
- `tokenizer_config.json`: Tokenizer configuration
- `special_tokens_map.json`: Special tokens mapping
- `token_bytes.pt`: Token byte mapping
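Because `tokenizer.json` is in the standard HuggingFace format, it should also load directly with the `tokenizers` library (a sketch, assuming the file has been downloaded from this repo). Loading it this way bypasses the SARF morpheme preprocessing that `suhail-nlp` applies, so expect higher fertility on Arabic text:

```python
from tokenizers import Tokenizer

# Assumes tokenizer.json was downloaded locally from this repo.
# Note: raw BPE only -- no SARF morpheme preprocessing is applied here.
tok = Tokenizer.from_file("tokenizer.json")
print(tok.encode("مرحبا بكم Hello world").ids)
```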
## Author
- **Mohammed Almaghrabi**
- Email: almaghrabima@gmail.com
- Project: Suhail Project
- This is independent research
## License
This tokenizer is released under **CC-BY-NC-4.0** (Creative Commons Attribution-NonCommercial 4.0 International).
**You are free to:**
- Share: Copy and redistribute the material
- Adapt: Remix, transform, and build upon the material
**Under the following terms:**
- **Attribution**: You must give appropriate credit
- **NonCommercial**: You may not use the material for commercial purposes
For commercial licensing, please contact: almaghrabima@gmail.com
## Citation
If you use this tokenizer in your research, please cite:
```bibtex
@misc{sarf-tokenizer-2026,
  title={SARF: A Morpheme-Aware Tokenization Framework for Arabic-English - Suhail Project},
  author={Almaghrabi, Mohammed},
  year={2026},
  url={https://huggingface.co/almaghrabima/deeplatent-tokenizer},
  note={Independent research, part of Suhail Project}
}
```