|
|
--- |
|
|
license: cc-by-nc-4.0 |
|
|
tags: |
|
|
- tokenizer |
|
|
- sarf |
|
|
- morpheme |
|
|
- bpe |
|
|
- deeplatent |
|
|
- bilingual |
|
|
- arabic-english |
|
|
- arabic |
|
|
- morphology |
|
|
language: |
|
|
- ar |
|
|
- en |
|
|
--- |
|
|
|
|
|
# DeepLatent SARF Tokenizer |
|
|
|
|
|
**Part of Suhail Project - Independent Research by Mohammed Almaghrabi** |
|
|
|
|
|
This is the **SARF** (Sarf-Aware Representation Framework) tokenizer designed for the DeepLatent language model. The tokenizer was trained on bilingual Arabic/English data.
|
|
|
|
|
## What is SARF? |
|
|
|
|
|
**SARF (صَرْف)** is the Arabic term for **morphology**. In classical and modern Arabic linguistics, *ṣarf* refers to the system that governs: |
|
|
|
|
|
- Word formation |
|
|
- Roots and patterns (جذر / وزن) |
|
|
- Prefixes, suffixes, infixes |
|
|
- Tense, gender, number, and derivation |
|
|
|
|
|
> **Ṣarf is the exact linguistic layer that makes Arabic hard for naive tokenizers.** |
|
|
|
|
|
SARF combines morphological analysis with BPE tokenization to achieve better compression, especially for morphologically rich languages like Arabic. |
|
|
|
|
|
Most tokenizers treat Arabic as **bytes or characters**. **SARF treats Arabic as a *language*.** |
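To make the idea concrete, here is a toy sketch of morpheme-aware pre-segmentation feeding into BPE. This is illustrative only: the real SARF pipeline ships inside `suhail-nlp`, and `segment_morphemes` with its tiny prefix list is a hypothetical stand-in, not the actual API.

```python
# Toy illustration of morpheme pre-segmentation before BPE.
# The real SARF analyzer is internal to suhail-nlp; this function
# and its prefix list are hypothetical placeholders.

def segment_morphemes(word: str) -> list[str]:
    """Split off one common Arabic prefix, if present."""
    prefixes = ["ال", "و", "ب", "ل"]  # definite article and common clitics
    for p in prefixes:
        if word.startswith(p) and len(word) > len(p) + 1:
            return [p, word[len(p):]]
    return [word]

# "الكتاب" (the book) -> ["ال", "كتاب"]: BPE now sees the reusable
# stem "كتاب" instead of a surface form it must memorize whole.
print(segment_morphemes("الكتاب"))
```

Because the stem recurs across many inflected forms, pre-segmenting this way lets BPE merges concentrate on genuinely frequent units rather than on every prefix-stem combination.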
|
|
|
|
|
## Installation |
|
|
|
|
|
Install the `suhail-nlp` package from PyPI: |
|
|
|
|
|
```bash |
|
|
pip install suhail-nlp |
|
|
``` |
|
|
|
|
|
## Quick Start |
|
|
|
|
|
```python |
|
|
from suhail import SARFTokenizer |
|
|
|
|
|
# Load tokenizer (automatically downloads from HuggingFace) |
|
|
tokenizer = SARFTokenizer.from_pretrained() |
|
|
|
|
|
# Encode text (SARF preprocessing is applied automatically) |
|
|
text = "مرحبا بكم Hello world" |
|
|
tokens = tokenizer.encode(text) |
|
|
print(f"Tokens: {tokens}") |
|
|
|
|
|
# Decode back to text |
|
|
decoded = tokenizer.decode(tokens) |
|
|
print(f"Decoded: {decoded}") |
|
|
``` |
|
|
|
|
|
The `suhail-nlp` package applies SARF morpheme preprocessing automatically, substantially reducing the number of tokens needed for Arabic text (see the evaluation results below).
|
|
|
|
|
## Evaluation Results |
|
|
|
|
|
| Metric | With SARF Preprocessing | Without Preprocessing | |
|
|
|--------|------------------------|----------------------| |
|
|
| Arabic Fertility | 2.29 | 5.65 | |
|
|
| English Fertility | 2.10 | 2.91 | |
|
|
| Parity (Ar/En) | 1.09 | 1.94 | |
|
|
| Interpretation | EXCELLENT | Moderate | |
|
|
|
|
|
*Fertility = average tokens per word; lower is better. Parity = Arabic fertility ÷ English fertility; the closer to 1.0, the more equally the two languages are treated.*
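For reference, these metrics can be computed with a short script. A minimal sketch, assuming a whitespace-delimited notion of "word" and the `encode` API from the Quick Start:

```python
from suhail import SARFTokenizer

tokenizer = SARFTokenizer.from_pretrained()

def fertility(texts: list[str]) -> float:
    """Average tokens per whitespace-delimited word over a corpus."""
    total_tokens = sum(len(tokenizer.encode(t)) for t in texts)
    total_words = sum(len(t.split()) for t in texts)
    return total_tokens / total_words

ar = fertility(["مرحبا بكم في المشروع"])              # Arabic sample
en = fertility(["Hello and welcome to the project"])  # English sample

# Parity = Arabic fertility / English fertility; 1.0 means both
# languages cost the same number of tokens per word on average.
print(f"Arabic: {ar:.2f}, English: {en:.2f}, Parity: {ar / en:.2f}")
```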
|
|
|
|
|
## Evaluation Dataset |
|
|
|
|
|
Evaluation data (10,000 samples: 5,000 Arabic + 5,000 English) is available at: |
|
|
[almaghrabima/eval-test-data](https://huggingface.co/datasets/almaghrabima/eval-test-data) |
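To pull the evaluation data for your own runs, the standard `datasets` library should work; the split and column names below are assumptions, so check the dataset card for the actual schema.

```python
from datasets import load_dataset

# "train" as the split name is an assumption; see the dataset card.
ds = load_dataset("almaghrabima/eval-test-data", split="train")
print(ds[0])
```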
|
|
|
|
|
## Performance Comparison |
|
|
|
|
|
SARF achieves excellent Arabic efficiency while maintaining strong English performance. Evaluated on 10,000 balanced samples (5,000 Arabic + 5,000 English): |
|
|
|
|
|
| Tokenizer | Vocab Size | Arabic Fertility | Arabic Chars/Token | English Fertility | English Chars/Token | Score | |
|
|
|-----------|------------|------------------|-------------------|------------------|---------------------|-------| |
|
|
| **SARF** | **100,000** | **1.469** | **3.959** | **1.779** | **3.353** | **2.251** | |
|
|
| GPT-4o (o200k_base) | 200,019 | 1.874 | 3.105 | 1.718 | 3.472 | 1.831 | |
|
|
| ALLaM-7B | 64,000 | 1.496 | 3.888 | 2.234 | 2.669 | 1.758 | |
|
|
| AceGPT-13B | 44,800 | 1.777 | 3.274 | 2.238 | 2.664 | 1.479 | |
|
|
| Gemma-3-4B | 262,145 | 2.033 | 2.862 | 2.075 | 2.874 | 1.396 | |
|
|
| Command-R Arabic | 255,033 | 2.084 | 2.791 | 2.076 | 2.873 | 1.362 | |
|
|
| Fanar-1-9B | 128,256 | 2.071 | 2.809 | 2.096 | 2.845 | 1.357 | |
|
|
| Hala-9B | 128,256 | 2.071 | 2.809 | 2.096 | 2.845 | 1.357 | |
|
|
| Qwen2.5-7B | 151,665 | 2.240 | 2.596 | 2.035 | 2.930 | 1.293 |
|
|
| Qwen3-VL-4B | 151,669 | 2.240 | 2.596 | 2.035 | 2.930 | 1.293 |
|
|
| GPT-4 (cl100k_base) | 100,277 | 4.071 | 1.429 | 1.736 | 3.435 | 0.838 | |
|
|
| Mistral-7B | 32,768 | 5.148 | 1.130 | 2.230 | 2.674 | 0.516 |
|
|
|
|
|
**Key Metrics:** |
|
|
- **Fertility**: Tokens per word (lower = more efficient, fewer tokens needed) |
|
|
- **Chars/Token**: Characters per token (higher = better compression per token) |
|
|
- **Score**: Combined bilingual efficiency metric (higher = better) |
|
|
|
|
|
### Understanding the Score |
|
|
|
|
|
The **Score** metric measures overall tokenizer efficiency across both languages: |
|
|
|
|
|
``` |
|
|
Score = (Arabic_Chars/Token + English_Chars/Token) / (Arabic_Fertility + English_Fertility) |
|
|
``` |
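Plugging the table values into this formula reproduces the reported numbers; for example, for SARF and GPT-4o:

```python
def score(ar_cpt: float, en_cpt: float, ar_fert: float, en_fert: float) -> float:
    """Combined bilingual efficiency: rewards chars/token, penalizes fertility."""
    return (ar_cpt + en_cpt) / (ar_fert + en_fert)

print(round(score(3.959, 3.353, 1.469, 1.779), 3))  # SARF   -> 2.251
print(round(score(3.105, 3.472, 1.874, 1.718), 3))  # GPT-4o -> 1.831
```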
|
|
|
|
|
**Score Interpretation:** |
|
|
- Score > 2.0: Excellent bilingual efficiency (SARF achieves 2.251) |
|
|
- Score 1.5-2.0: Good efficiency (GPT-4o, ALLaM-7B) |
|
|
- Score 1.0-1.5: Moderate efficiency (most Arabic-focused models) |
|
|
- Score < 1.0: Poor efficiency for Arabic (GPT-4, Mistral) |
|
|
|
|
|
### Key Findings |
|
|
|
|
|
1. **SARF ranks #1** with a Score of 2.251, outperforming the other 11 tokenizers tested
|
|
2. **23% better than GPT-4o**: Score 2.251 vs 1.831 |
|
|
3. **Best vocabulary efficiency**: With a vocabulary of only 100K, SARF outperforms models whose vocabularies are 2x to 2.6x larger
|
|
4. **Balanced multilingual performance**: Strong on both Arabic and English |
|
|
|
|
|
## Tokenizer Details |
|
|
|
|
|
- **Type**: SARF (Sarf-Aware Representation Framework) |
|
|
- **Vocabulary Size**: 100,000 |
|
|
- **Special Tokens**: 13 |
|
|
- **Languages**: Arabic + English (50/50 balanced) |
|
|
- **Target Model**: DeepLatent |
|
|
|
|
|
## Special Tokens |
|
|
|
|
|
- `<|assistant_end|>` |
|
|
- `<|assistant_start|>` |
|
|
- `<|bos|>` |
|
|
- `<|end_of_text|>` |
|
|
- `<|mask|>` |
|
|
- `<|output_end|>` |
|
|
- `<|output_start|>` |
|
|
- `<|pad|>` |
|
|
- `<|python_end|>` |
|
|
- `<|python_start|>` |
|
|
- `<|unk|>` |
|
|
- `<|user_end|>` |
|
|
- `<|user_start|>` |
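The role-style token names suggest a chat template along the following lines. This is a guess inferred purely from the token names; the actual template expected by DeepLatent is not documented here.

```python
# Hypothetical chat formatting inferred from the token names above;
# the actual DeepLatent template may differ.
def format_turn(user_msg: str, assistant_msg: str) -> str:
    return (
        "<|bos|>"
        f"<|user_start|>{user_msg}<|user_end|>"
        f"<|assistant_start|>{assistant_msg}<|assistant_end|>"
    )

print(format_turn("مرحبا", "Hello! How can I help?"))
```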
|
|
|
|
|
## Files |
|
|
|
|
|
- `tokenizer.json`: Main tokenizer file (HuggingFace format) |
|
|
- `tokenizer.pkl`: BPE tokenizer (native format) |
|
|
- `tokenizer_config.json`: Tokenizer configuration |
|
|
- `special_tokens_map.json`: Special tokens mapping |
|
|
- `token_bytes.pt`: Token byte mapping |
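Because `tokenizer.json` uses the standard HuggingFace format, it should also load directly with the `tokenizers` library. Note that this path bypasses the SARF morpheme preprocessing that `suhail-nlp` applies, so token counts will likely correspond to the "Without Preprocessing" numbers above.

```python
from tokenizers import Tokenizer

# Assumes tokenizer.json has been downloaded from this repo.
tok = Tokenizer.from_file("tokenizer.json")
ids = tok.encode("مرحبا Hello").ids
print(ids)
```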
|
|
|
|
|
## Author |
|
|
|
|
|
- **Mohammed Almaghrabi** |
|
|
- Email: almaghrabima@gmail.com |
|
|
- Project: Suhail Project |
|
|
- Independent research
|
|
|
|
|
## License |
|
|
|
|
|
This tokenizer is released under **CC-BY-NC-4.0** (Creative Commons Attribution-NonCommercial 4.0 International). |
|
|
|
|
|
**You are free to:** |
|
|
- Share: Copy and redistribute the material |
|
|
- Adapt: Remix, transform, and build upon the material |
|
|
|
|
|
**Under the following terms:** |
|
|
- **Attribution**: You must give appropriate credit |
|
|
- **NonCommercial**: You may not use the material for commercial purposes |
|
|
|
|
|
For commercial licensing, please contact: almaghrabima@gmail.com |
|
|
|
|
|
## Citation |
|
|
|
|
|
If you use this tokenizer in your research, please cite: |
|
|
|
|
|
```bibtex |
|
|
@misc{sarf-tokenizer-2026, |
|
|
title={SARF: A Morpheme-Aware Tokenization Framework for Arabic-English - Suhail Project}, |
|
|
author={Almaghrabi, Mohammed}, |
|
|
year={2026}, |
|
|
url={https://huggingface.co/almaghrabima/deeplatent-tokenizer}, |
|
|
note={Independent research, part of Suhail Project} |
|
|
} |
|
|
``` |
|
|
|