---
license: cc-by-nc-4.0
tags:
- tokenizer
- sarf
- morpheme
- bpe
- deeplatent
- bilingual
- arabic-english
- arabic
- morphology
language:
- ar
- en
---

# DeepLatent SARF Tokenizer

**Part of the Suhail Project - Independent Research by Mohammed Almaghrabi**

This is the **SARF** (Sarf-Aware Representation Framework) tokenizer designed for the DeepLatent language model, trained on bilingual Arabic/English data.

## What is SARF?

**SARF (صَرْف)** is the Arabic term for **morphology**. In classical and modern Arabic linguistics, *ṣarf* refers to the system that governs:

- Word formation
- Roots and patterns (جذر / وزن)
- Prefixes, suffixes, and infixes
- Tense, gender, number, and derivation

> **Ṣarf is the exact linguistic layer that makes Arabic hard for naive tokenizers.**

SARF combines morphological analysis with BPE tokenization to achieve better compression, especially for morphologically rich languages like Arabic. Most tokenizers treat Arabic as **bytes or characters**. **SARF treats Arabic as a *language*.**

## Installation

Install the `suhail-nlp` package from PyPI:

```bash
pip install suhail-nlp
```

## Quick Start

```python
from suhail import SARFTokenizer

# Load the tokenizer (automatically downloads from Hugging Face)
tokenizer = SARFTokenizer.from_pretrained()

# Encode text (SARF preprocessing is applied automatically)
text = "مرحبا بكم Hello world"
tokens = tokenizer.encode(text)
print(f"Tokens: {tokens}")

# Decode back to text
decoded = tokenizer.decode(tokens)
print(f"Decoded: {decoded}")
```

The `suhail-nlp` package includes SARF morpheme preprocessing, which substantially improves tokenization efficiency for Arabic text (see the evaluation below).

## Evaluation Results

| Metric | With SARF Preprocessing | Without Preprocessing |
|--------|-------------------------|-----------------------|
| Arabic Fertility | 2.29 | 5.65 |
| English Fertility | 2.10 | 2.91 |
| Parity (Ar/En) | 1.09 | 1.94 |
| Interpretation | Excellent | Moderate |

*Fertility = average tokens per word; lower is better.
Parity closer to 1.0 means more equal treatment of the two languages.*

## Evaluation Dataset

The evaluation data (10,000 samples: 5,000 Arabic + 5,000 English) is available at:
[almaghrabima/eval-test-data](https://huggingface.co/datasets/almaghrabima/eval-test-data)

## Performance Comparison

SARF achieves excellent Arabic efficiency while maintaining strong English performance. Evaluated on 10,000 balanced samples (5,000 Arabic + 5,000 English):

| Tokenizer | Vocab Size | Arabic Fertility | Arabic Chars/Token | English Fertility | English Chars/Token | Score |
|-----------|------------|------------------|--------------------|-------------------|---------------------|-------|
| **SARF** | **100,000** | **1.469** | **3.959** | **1.779** | **3.353** | **2.251** |
| GPT-4o (o200k_base) | 200,019 | 1.874 | 3.105 | 1.718 | 3.472 | 1.831 |
| ALLaM-7B | 64,000 | 1.496 | 3.888 | 2.234 | 2.669 | 1.758 |
| AceGPT-13B | 44,800 | 1.777 | 3.274 | 2.238 | 2.664 | 1.479 |
| Gemma-3-4B | 262,145 | 2.033 | 2.862 | 2.075 | 2.874 | 1.396 |
| Command-R Arabic | 255,033 | 2.084 | 2.791 | 2.076 | 2.873 | 1.362 |
| Fanar-1-9B | 128,256 | 2.071 | 2.809 | 2.096 | 2.845 | 1.357 |
| Hala-9B | 128,256 | 2.071 | 2.809 | 2.096 | 2.845 | 1.357 |
| Qwen2.5-7B | 151,665 | 2.240 | 2.596 | 2.035 | 2.930 | 1.293 |
| Qwen3-VL-4B | 151,669 | 2.240 | 2.596 | 2.035 | 2.930 | 1.293 |
| GPT-4 (cl100k_base) | 100,277 | 4.071 | 1.429 | 1.736 | 3.435 | 0.838 |
| Mistral-7B | 32,768 | 5.148 | 1.130 | 2.230 | 2.674 | 0.516 |

**Key Metrics:**

- **Fertility**: Tokens per word (lower = more efficient; fewer tokens needed)
- **Chars/Token**: Characters per token (higher = better compression per token)
- **Score**: Combined bilingual efficiency metric (higher = better)

### Understanding the Score

The **Score** metric measures overall tokenizer efficiency across both languages:

```
Score = (Arabic_Chars/Token + English_Chars/Token) / (Arabic_Fertility + English_Fertility)
```

**Score Interpretation:**

- Score > 2.0: Excellent
bilingual efficiency (SARF achieves 2.251)
- Score 1.5-2.0: Good efficiency (GPT-4o, ALLaM-7B)
- Score 1.0-1.5: Moderate efficiency (most Arabic-focused models)
- Score < 1.0: Poor efficiency for Arabic (GPT-4, Mistral)

### Key Findings

1. **SARF ranks #1** with a Score of 2.251, outperforming the 11 other tokenizers tested
2. **23% better than GPT-4o**: Score 2.251 vs. 1.831
3. **Best vocabulary efficiency**: with a vocabulary of only 100K, SARF outperforms models whose vocabularies are 2-2.6x larger
4. **Balanced multilingual performance**: strong on both Arabic and English

## Tokenizer Details

- **Type**: SARF (Sarf-Aware Representation Framework)
- **Vocabulary Size**: 100,000
- **Special Tokens**: 13
- **Languages**: Arabic + English (50/50 balanced)
- **Target Model**: DeepLatent

## Special Tokens

- `<|assistant_end|>`
- `<|assistant_start|>`
- `<|bos|>`
- `<|end_of_text|>`
- `<|mask|>`
- `<|output_end|>`
- `<|output_start|>`
- `<|pad|>`
- `<|python_end|>`
- `<|python_start|>`
- `<|unk|>`
- `<|user_end|>`
- `<|user_start|>`

## Files

- `tokenizer.json`: Main tokenizer file (Hugging Face format)
- `tokenizer.pkl`: BPE tokenizer (native format)
- `tokenizer_config.json`: Tokenizer configuration
- `special_tokens_map.json`: Special tokens mapping
- `token_bytes.pt`: Token byte mapping

## Author

- **Mohammed Almaghrabi**
- Email: almaghrabima@gmail.com
- Project: Suhail Project (independent research)

## License

This tokenizer is released under **CC-BY-NC-4.0** (Creative Commons Attribution-NonCommercial 4.0 International).
**You are free to:**

- Share: copy and redistribute the material
- Adapt: remix, transform, and build upon the material

**Under the following terms:**

- **Attribution**: You must give appropriate credit
- **NonCommercial**: You may not use the material for commercial purposes

For commercial licensing, please contact: almaghrabima@gmail.com

## Citation

If you use this tokenizer in your research, please cite:

```bibtex
@misc{sarf-tokenizer-2026,
  title={SARF: A Morpheme-Aware Tokenization Framework for Arabic-English - Suhail Project},
  author={Almaghrabi, Mohammed},
  year={2026},
  url={https://huggingface.co/almaghrabima/deeplatent-tokenizer},
  note={Independent research, part of the Suhail Project}
}
```
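## Appendix: Reproducing the Score Metric

As a quick sanity check, the Score column of the Performance Comparison table can be recomputed from its per-language columns. This sketch uses only values copied from the table; the row selection here is illustrative, not exhaustive.

```python
# Recompute Score = (Arabic Chars/Token + English Chars/Token)
#                 / (Arabic Fertility + English Fertility)
# using values copied from the Performance Comparison table.

rows = {
    # name: (ar_fertility, ar_chars_per_token, en_fertility, en_chars_per_token)
    "SARF": (1.469, 3.959, 1.779, 3.353),
    "GPT-4o (o200k_base)": (1.874, 3.105, 1.718, 3.472),
    "Mistral-7B": (5.148, 1.130, 2.230, 2.674),
}

def score(ar_fert, ar_cpt, en_fert, en_cpt):
    """Combined bilingual efficiency metric (higher = better)."""
    return (ar_cpt + en_cpt) / (ar_fert + en_fert)

for name, (af, ac, ef, ec) in rows.items():
    print(f"{name}: {score(af, ac, ef, ec):.3f}")
# SARF: 2.251, GPT-4o: 1.831, Mistral-7B: 0.516 — matching the table.
```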