DeepLatent SARF Tokenizer

Part of Suhail Project - Independent Research by Mohammed Almaghrabi

This is the SARF (Sarf-Aware Representation Framework) tokenizer designed for the DeepLatent language model; the tokenizer was trained on bilingual Arabic/English data.

What is SARF?

SARF (صَرْف) is the Arabic term for morphology. In classical and modern Arabic linguistics, ṣarf refers to the system that governs:

  • Word formation
  • Roots and patterns (جذر / وزن)
  • Prefixes, suffixes, infixes
  • Tense, gender, number, and derivation

Ṣarf is the exact linguistic layer that makes Arabic hard for naive tokenizers.

SARF combines morphological analysis with BPE tokenization to achieve better compression, especially for morphologically rich languages like Arabic.

Most tokenizers treat Arabic as bytes or characters. SARF treats Arabic as a language.
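To make the idea concrete, here is a toy sketch of morpheme-aware pre-segmentation: greedily strip a few common Arabic proclitics (definite article, conjunction, prepositions) before handing the remainder to a subword tokenizer. This is an illustration only; the actual SARF pipeline is not documented in this card, and a real analyzer handles far more than this short prefix list.

```python
# Toy proclitic list (NOT the real SARF inventory): definite article ال,
# conjunction و, prepositions ب and ل.
PREFIXES = ["ال", "و", "ب", "ل"]

def segment(word):
    """Greedily strip known proclitics, leaving a candidate stem.
    The stem would then go through ordinary BPE."""
    morphemes = []
    changed = True
    while changed:
        changed = False
        for p in PREFIXES:
            # Keep at least two characters of stem to avoid over-stripping.
            if word.startswith(p) and len(word) > len(p) + 1:
                morphemes.append(p)
                word = word[len(p):]
                changed = True
                break
    morphemes.append(word)
    return morphemes

# "والكتاب" (and-the-book) -> ["و", "ال", "كتاب"]
print(segment("والكتاب"))
```

Pre-splitting clitics this way means BPE sees the bare stem كتاب repeatedly instead of many inflected surface forms, which is exactly why morpheme-aware preprocessing improves compression on Arabic.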

Installation

Install the suhail-nlp package from PyPI:

pip install suhail-nlp

Quick Start

from suhail import SARFTokenizer

# Load tokenizer (automatically downloads from HuggingFace)
tokenizer = SARFTokenizer.from_pretrained()

# Encode text (SARF preprocessing is applied automatically)
text = "مرحبا بكم Hello world"
tokens = tokenizer.encode(text)
print(f"Tokens: {tokens}")

# Decode back to text
decoded = tokenizer.decode(tokens)
print(f"Decoded: {decoded}")

The suhail-nlp package applies SARF morpheme preprocessing automatically; in our evaluation this reduces Arabic fertility from 5.65 to 2.29 (see Evaluation Results below).

Evaluation Results

| Metric | With SARF Preprocessing | Without Preprocessing |
|---|---|---|
| Arabic Fertility | 2.29 | 5.65 |
| English Fertility | 2.10 | 2.91 |
| Parity (Ar/En) | 1.09 | 1.94 |
| Interpretation | EXCELLENT | Moderate |

Fertility = average tokens per word. Lower is better. Parity closer to 1.0 means more equal treatment between languages.
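These two metrics are simple to compute yourself. The sketch below defines fertility for any encoder callable (the whitespace word count is an assumption about how words are delimited) and derives the parity figure from the table above:

```python
def fertility(encode, texts):
    """Average tokens per whitespace-delimited word. Lower is better.
    `encode` is any callable mapping a string to a list of tokens."""
    total_tokens = sum(len(encode(t)) for t in texts)
    total_words = sum(len(t.split()) for t in texts)
    return total_tokens / total_words

# Parity = Arabic fertility / English fertility; 1.0 means the two
# languages cost the same number of tokens per word.
parity = 2.29 / 2.10  # figures from the table above, gives ~1.09
```

A perfectly word-aligned encoder has fertility 1.0; byte-level tokenizers on Arabic often exceed 4-5, which is the gap SARF targets.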

Evaluation Dataset

Evaluation data (10,000 samples: 5,000 Arabic + 5,000 English) is available at: almaghrabima/eval-test-data

Performance Comparison

SARF achieves excellent Arabic efficiency while maintaining strong English performance. Evaluated on 10,000 balanced samples (5,000 Arabic + 5,000 English):

| Tokenizer | Vocab Size | Arabic Fertility | Arabic Chars/Token | English Fertility | English Chars/Token | Score |
|---|---|---|---|---|---|---|
| SARF | 100,000 | 1.469 | 3.959 | 1.779 | 3.353 | 2.251 |
| GPT-4o (o200k_base) | 200,019 | 1.874 | 3.105 | 1.718 | 3.472 | 1.831 |
| ALLaM-7B | 64,000 | 1.496 | 3.888 | 2.234 | 2.669 | 1.758 |
| AceGPT-13B | 44,800 | 1.777 | 3.274 | 2.238 | 2.664 | 1.479 |
| Gemma-3-4B | 262,145 | 2.033 | 2.862 | 2.075 | 2.874 | 1.396 |
| Command-R Arabic | 255,033 | 2.084 | 2.791 | 2.076 | 2.873 | 1.362 |
| Fanar-1-9B | 128,256 | 2.071 | 2.809 | 2.096 | 2.845 | 1.357 |
| Hala-9B | 128,256 | 2.071 | 2.809 | 2.096 | 2.845 | 1.357 |
| Qwen2.5-7B | 151,665 | 2.24 | 2.596 | 2.035 | 2.93 | 1.293 |
| Qwen3-VL-4B | 151,669 | 2.24 | 2.596 | 2.035 | 2.93 | 1.293 |
| GPT-4 (cl100k_base) | 100,277 | 4.071 | 1.429 | 1.736 | 3.435 | 0.838 |
| Mistral-7B | 32,768 | 5.148 | 1.13 | 2.23 | 2.674 | 0.516 |

Key Metrics:

  • Fertility: Tokens per word (lower = more efficient, fewer tokens needed)
  • Chars/Token: Characters per token (higher = better compression per token)
  • Score: Combined bilingual efficiency metric (higher = better)

Understanding the Score

The Score metric measures overall tokenizer efficiency across both languages:

Score = (Arabic_Chars/Token + English_Chars/Token) / (Arabic_Fertility + English_Fertility)

Score Interpretation:

  • Score > 2.0: Excellent bilingual efficiency (SARF achieves 2.251)
  • Score 1.5-2.0: Good efficiency (GPT-4o, ALLaM-7B)
  • Score 1.0-1.5: Moderate efficiency (most Arabic-focused models)
  • Score < 1.0: Poor efficiency for Arabic (GPT-4, Mistral)
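Plugging the comparison table's figures into the Score formula reproduces the reported values, which also verifies the "23% better than GPT-4o" claim below:

```python
def bilingual_score(ar_cpt, en_cpt, ar_fert, en_fert):
    """Combined bilingual efficiency: total chars/token over total fertility.
    Higher is better."""
    return (ar_cpt + en_cpt) / (ar_fert + en_fert)

# Values taken from the Performance Comparison table.
sarf_score = bilingual_score(3.959, 3.353, 1.469, 1.779)   # ~2.251
gpt4o_score = bilingual_score(3.105, 3.472, 1.874, 1.718)  # ~1.831
```

Because chars/token rewards compression and fertility penalizes over-splitting, a tokenizer only scores well here if it is efficient in both languages at once.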

Key Findings

  1. SARF ranks #1 with Score 2.251, outperforming the other 11 tokenizers tested
  2. 23% better than GPT-4o: Score 2.251 vs 1.831
  3. Best vocabulary efficiency: With only a 100K vocabulary, SARF outperforms models with vocabularies 2-2.6x larger
  4. Balanced multilingual performance: Strong on both Arabic and English

Tokenizer Details

  • Type: SARF (Sarf-Aware Representation Framework)
  • Vocabulary Size: 100,000
  • Special Tokens: 13
  • Languages: Arabic + English (50/50 balanced)
  • Target Model: DeepLatent

Special Tokens

  • <|assistant_end|>
  • <|assistant_start|>
  • <|bos|>
  • <|end_of_text|>
  • <|mask|>
  • <|output_end|>
  • <|output_start|>
  • <|pad|>
  • <|python_end|>
  • <|python_start|>
  • <|unk|>
  • <|user_end|>
  • <|user_start|>
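The card lists the special tokens but does not document the exact prompt template, so the layout below is an assumption: a plausible chat-style sketch in which the user/assistant tokens bracket each turn and generation begins after an opening <|assistant_start|>.

```python
# Hypothetical prompt layout; the real DeepLatent template may differ.
BOS = "<|bos|>"

def user_turn(text):
    return f"<|user_start|>{text}<|user_end|>"

def assistant_turn(text):
    return f"<|assistant_start|>{text}<|assistant_end|>"

# Prompt ends with an unclosed assistant marker so the model continues from it.
prompt = BOS + user_turn("مرحبا") + "<|assistant_start|>"
```

The <|python_start|>/<|python_end|> and <|output_start|>/<|output_end|> pairs suggest analogous bracketing for tool code and its output, but again the card does not specify this.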

Files

  • tokenizer.json: Main tokenizer file (HuggingFace format)
  • tokenizer.pkl: BPE tokenizer (native format)
  • tokenizer_config.json: Tokenizer configuration
  • special_tokens_map.json: Special tokens mapping
  • token_bytes.pt: Token byte mapping

License

This tokenizer is released under CC-BY-NC-4.0 (Creative Commons Attribution-NonCommercial 4.0 International).

You are free to:

  • Share: Copy and redistribute the material
  • Adapt: Remix, transform, and build upon the material

Under the following terms:

  • Attribution: You must give appropriate credit
  • NonCommercial: You may not use the material for commercial purposes

For commercial licensing, please contact: almaghrabima@gmail.com

Citation

If you use this tokenizer in your research, please cite:

@misc{sarf-tokenizer-2026,
  title={SARF: A Morpheme-Aware Tokenization Framework for Arabic-English - Suhail Project},
  author={Almaghrabi, Mohammed},
  year={2026},
  url={https://huggingface.co/almaghrabima/deeplatent-tokenizer},
  note={Independent research, part of Suhail Project}
}