DeepLatent SARF Tokenizer

Part of Suhail Project - Independent Research by Mohammed Almaghrabi

This is the SARF (Sarf-Aware Representation Framework) tokenizer designed for the DeepLatent language model; the tokenizer was trained on bilingual Arabic/English data.

What is SARF?

SARF (صَرْف) is the Arabic term for morphology. In classical and modern Arabic linguistics, ṣarf refers to the system that governs:

  • Word formation
  • Roots and patterns (جذر / وزن)
  • Prefixes, suffixes, infixes
  • Tense, gender, number, and derivation

Ṣarf is the exact linguistic layer that makes Arabic hard for naive tokenizers.

SARF combines morphological analysis with BPE tokenization to achieve better compression, especially for morphologically rich languages like Arabic.

Most tokenizers treat Arabic as bytes or characters. SARF treats Arabic as a language.
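To make the idea concrete, here is a toy sketch of morpheme-aware pre-segmentation: greedily strip a few common Arabic proclitics (definite article, conjunction, prepositions) before handing the remainder to a subword tokenizer. This is an illustration only; the actual SARF pipeline is not documented in this card, and a real analyzer handles far more than this short prefix list.

```python
# Toy proclitic list (NOT the real SARF inventory): definite article ال,
# conjunction و, prepositions ب and ل.
PREFIXES = ["ال", "و", "ب", "ل"]

def segment(word):
    """Greedily strip known proclitics, leaving a candidate stem.
    The stem would then go through ordinary BPE."""
    morphemes = []
    changed = True
    while changed:
        changed = False
        for p in PREFIXES:
            # Keep at least two characters of stem to avoid over-stripping.
            if word.startswith(p) and len(word) > len(p) + 1:
                morphemes.append(p)
                word = word[len(p):]
                changed = True
                break
    morphemes.append(word)
    return morphemes

# "والكتاب" (and-the-book) -> ["و", "ال", "كتاب"]
print(segment("والكتاب"))
```

Pre-splitting clitics this way means BPE sees the bare stem كتاب repeatedly instead of many inflected surface forms, which is exactly why morpheme-aware preprocessing improves compression on Arabic.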

Installation

Install the suhail-nlp package from PyPI:

pip install suhail-nlp

Quick Start

from suhail import SARFTokenizer

# Load tokenizer (automatically downloads from HuggingFace)
tokenizer = SARFTokenizer.from_pretrained()

# Encode text (SARF preprocessing is applied automatically)
text = "مرحبا بكم Hello world"
tokens = tokenizer.encode(text)
print(f"Tokens: {tokens}")

# Decode back to text
decoded = tokenizer.decode(tokens)
print(f"Decoded: {decoded}")

The suhail-nlp package applies SARF morpheme preprocessing automatically; in our evaluation this reduces Arabic fertility from 5.65 to 2.29 (see Evaluation Results below).

Evaluation Results

| Metric | With SARF Preprocessing | Without Preprocessing |
|---|---|---|
| Arabic Fertility | 2.29 | 5.65 |
| English Fertility | 2.10 | 2.91 |
| Parity (Ar/En) | 1.09 | 1.94 |
| Interpretation | EXCELLENT | Moderate |

Fertility = average tokens per word. Lower is better. Parity closer to 1.0 means more equal treatment between languages.
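These two metrics are simple to compute yourself. The sketch below defines fertility for any encoder callable (the whitespace word count is an assumption about how words are delimited) and derives the parity figure from the table above:

```python
def fertility(encode, texts):
    """Average tokens per whitespace-delimited word. Lower is better.
    `encode` is any callable mapping a string to a list of tokens."""
    total_tokens = sum(len(encode(t)) for t in texts)
    total_words = sum(len(t.split()) for t in texts)
    return total_tokens / total_words

# Parity = Arabic fertility / English fertility; 1.0 means the two
# languages cost the same number of tokens per word.
parity = 2.29 / 2.10  # figures from the table above, gives ~1.09
```

A perfectly word-aligned encoder has fertility 1.0; byte-level tokenizers on Arabic often exceed 4-5, which is the gap SARF targets.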

Evaluation Dataset

Evaluation data (10,000 samples: 5,000 Arabic + 5,000 English) is available at: almaghrabima/eval-test-data

Performance Comparison

SARF achieves excellent Arabic efficiency while maintaining strong English performance. Evaluated on 10,000 balanced samples (5,000 Arabic + 5,000 English):

| Tokenizer | Vocab Size | Arabic Fertility | Arabic Chars/Token | English Fertility | English Chars/Token | Score |
|---|---|---|---|---|---|---|
| SARF | 100,000 | 1.469 | 3.959 | 1.779 | 3.353 | 2.251 |
| GPT-4o (o200k_base) | 200,019 | 1.874 | 3.105 | 1.718 | 3.472 | 1.831 |
| ALLaM-7B | 64,000 | 1.496 | 3.888 | 2.234 | 2.669 | 1.758 |
| AceGPT-13B | 44,800 | 1.777 | 3.274 | 2.238 | 2.664 | 1.479 |
| Gemma-3-4B | 262,145 | 2.033 | 2.862 | 2.075 | 2.874 | 1.396 |
| Command-R Arabic | 255,033 | 2.084 | 2.791 | 2.076 | 2.873 | 1.362 |
| Fanar-1-9B | 128,256 | 2.071 | 2.809 | 2.096 | 2.845 | 1.357 |
| Hala-9B | 128,256 | 2.071 | 2.809 | 2.096 | 2.845 | 1.357 |
| Qwen2.5-7B | 151,665 | 2.24 | 2.596 | 2.035 | 2.93 | 1.293 |
| Qwen3-VL-4B | 151,669 | 2.24 | 2.596 | 2.035 | 2.93 | 1.293 |
| GPT-4 (cl100k_base) | 100,277 | 4.071 | 1.429 | 1.736 | 3.435 | 0.838 |
| Mistral-7B | 32,768 | 5.148 | 1.13 | 2.23 | 2.674 | 0.516 |

Key Metrics:

  • Fertility: Tokens per word (lower = more efficient, fewer tokens needed)
  • Chars/Token: Characters per token (higher = better compression per token)
  • Score: Combined bilingual efficiency metric (higher = better)

Understanding the Score

The Score metric measures overall tokenizer efficiency across both languages:

Score = (Arabic_Chars/Token + English_Chars/Token) / (Arabic_Fertility + English_Fertility)

Score Interpretation:

  • Score > 2.0: Excellent bilingual efficiency (SARF achieves 2.251)
  • Score 1.5-2.0: Good efficiency (GPT-4o, ALLaM-7B)
  • Score 1.0-1.5: Moderate efficiency (most Arabic-focused models)
  • Score < 1.0: Poor efficiency for Arabic (GPT-4, Mistral)
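Plugging the comparison table's figures into the Score formula reproduces the reported values, which also verifies the "23% better than GPT-4o" claim below:

```python
def bilingual_score(ar_cpt, en_cpt, ar_fert, en_fert):
    """Combined bilingual efficiency: total chars/token over total fertility.
    Higher is better."""
    return (ar_cpt + en_cpt) / (ar_fert + en_fert)

# Values taken from the Performance Comparison table.
sarf_score = bilingual_score(3.959, 3.353, 1.469, 1.779)   # ~2.251
gpt4o_score = bilingual_score(3.105, 3.472, 1.874, 1.718)  # ~1.831
```

Because chars/token rewards compression and fertility penalizes over-splitting, a tokenizer only scores well here if it is efficient in both languages at once.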

Key Findings

  1. SARF ranks #1 with Score 2.251, outperforming the other 11 tokenizers tested
  2. 23% better than GPT-4o: Score 2.251 vs 1.831
  3. Best vocabulary efficiency: With only a 100K vocabulary, SARF outperforms models with vocabularies 2-2.6x larger
  4. Balanced multilingual performance: Strong on both Arabic and English

Tokenizer Details

  • Type: SARF (Sarf-Aware Representation Framework)
  • Vocabulary Size: 100,000
  • Special Tokens: 13
  • Languages: Arabic + English (50/50 balanced)
  • Target Model: DeepLatent

Special Tokens

  • <|assistant_end|>
  • <|assistant_start|>
  • <|bos|>
  • <|end_of_text|>
  • <|mask|>
  • <|output_end|>
  • <|output_start|>
  • <|pad|>
  • <|python_end|>
  • <|python_start|>
  • <|unk|>
  • <|user_end|>
  • <|user_start|>
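The card lists the special tokens but does not document the exact prompt template, so the layout below is an assumption: a plausible chat-style sketch in which the user/assistant tokens bracket each turn and generation begins after an opening <|assistant_start|>.

```python
# Hypothetical prompt layout; the real DeepLatent template may differ.
BOS = "<|bos|>"

def user_turn(text):
    return f"<|user_start|>{text}<|user_end|>"

def assistant_turn(text):
    return f"<|assistant_start|>{text}<|assistant_end|>"

# Prompt ends with an unclosed assistant marker so the model continues from it.
prompt = BOS + user_turn("مرحبا") + "<|assistant_start|>"
```

The <|python_start|>/<|python_end|> and <|output_start|>/<|output_end|> pairs suggest analogous bracketing for tool code and its output, but again the card does not specify this.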

Files

  • tokenizer.json: Main tokenizer file (HuggingFace format)
  • tokenizer.pkl: BPE tokenizer (native format)
  • tokenizer_config.json: Tokenizer configuration
  • special_tokens_map.json: Special tokens mapping
  • token_bytes.pt: Token byte mapping

License

This tokenizer is released under CC-BY-NC-4.0 (Creative Commons Attribution-NonCommercial 4.0 International).

You are free to:

  • Share: Copy and redistribute the material
  • Adapt: Remix, transform, and build upon the material

Under the following terms:

  • Attribution: You must give appropriate credit
  • NonCommercial: You may not use the material for commercial purposes

For commercial licensing, please contact: almaghrabima@gmail.com

Citation

If you use this tokenizer in your research, please cite:

@misc{sarf-tokenizer-2026,
  title={SARF: A Morpheme-Aware Tokenization Framework for Arabic-English - Suhail Project},
  author={Almaghrabi, Mohammed},
  year={2026},
  url={https://huggingface.co/almaghrabima/deeplatent-tokenizer},
  note={Independent research, part of Suhail Project}
}