|
|
--- |
|
|
license: cc-by-nc-4.0 |
|
|
tags: |
|
|
- tokenizer |
|
|
- sarf |
|
|
- morpheme |
|
|
- bpe |
|
|
- deeplatent |
|
|
- bilingual |
|
|
- arabic-english |
|
|
- arabic |
|
|
- morphology |
|
|
language: |
|
|
- ar |
|
|
- en |
|
|
--- |
|
|
|
|
|
# DeepLatent SARF Tokenizer |
|
|
|
|
|
**Part of Suhail Project - Independent Research by Mohammed Almaghrabi** |
|
|
|
|
|
This is the **SARF** (Sarf-Aware Representation Framework) tokenizer designed for the DeepLatent language model. The tokenizer was trained on bilingual Arabic/English data.
|
|
|
|
|
## What is SARF? |
|
|
|
|
|
**SARF (صَرْف)** is the Arabic term for **morphology**. In classical and modern Arabic linguistics, *ṣarf* refers to the system that governs: |
|
|
|
|
|
- Word formation |
|
|
- Roots and patterns (جذر / وزن) |
|
|
- Prefixes, suffixes, infixes |
|
|
- Tense, gender, number, and derivation |
|
|
|
|
|
> **Ṣarf is the exact linguistic layer that makes Arabic hard for naive tokenizers.** |
|
|
|
|
|
SARF combines morphological analysis with BPE tokenization to achieve better compression, especially for morphologically rich languages like Arabic. |
|
|
|
|
|
Most tokenizers treat Arabic as **bytes or characters**. **SARF treats Arabic as a *language*.** |
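To make the idea concrete, here is a toy sketch of morpheme-aware pre-segmentation feeding into BPE. This is illustrative only: the real SARF pipeline ships inside `suhail-nlp`, and `segment_morphemes` with its tiny prefix list is a hypothetical stand-in, not the actual API.

```python
# Toy illustration of morpheme pre-segmentation before BPE.
# The real SARF analyzer is internal to suhail-nlp; this function
# and its prefix list are hypothetical placeholders.

def segment_morphemes(word: str) -> list[str]:
    """Split off one common Arabic prefix, if present."""
    prefixes = ["ال", "و", "ب", "ل"]  # definite article and common clitics
    for p in prefixes:
        if word.startswith(p) and len(word) > len(p) + 1:
            return [p, word[len(p):]]
    return [word]

# "الكتاب" (the book) -> ["ال", "كتاب"]: BPE now sees the reusable
# stem "كتاب" instead of a surface form it must memorize whole.
print(segment_morphemes("الكتاب"))
```

Because the stem recurs across many inflected forms, pre-segmenting this way lets BPE merges concentrate on genuinely frequent units rather than on every prefix-stem combination.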
|
|
|
|
|
## Installation |
|
|
|
|
|
Install the `suhail-nlp` package from PyPI: |
|
|
|
|
|
```bash |
|
|
pip install suhail-nlp |
|
|
``` |
|
|
|
|
|
## Quick Start |
|
|
|
|
|
```python |
|
|
from suhail import SARFTokenizer |
|
|
|
|
|
# Load tokenizer (automatically downloads from HuggingFace) |
|
|
tokenizer = SARFTokenizer.from_pretrained() |
|
|
|
|
|
# Encode text (SARF preprocessing is applied automatically) |
|
|
text = "مرحبا بكم Hello world" |
|
|
tokens = tokenizer.encode(text) |
|
|
print(f"Tokens: {tokens}") |
|
|
|
|
|
# Decode back to text |
|
|
decoded = tokenizer.decode(tokens) |
|
|
print(f"Decoded: {decoded}") |
|
|
``` |
|
|
|
|
|
The `suhail-nlp` package applies SARF morpheme preprocessing automatically, substantially reducing the number of tokens needed for Arabic text (see the evaluation results below).
|
|
|
|
|
## Evaluation Results |
|
|
|
|
|
| Metric | With SARF Preprocessing | Without Preprocessing | |
|
|
|--------|------------------------|----------------------| |
|
|
| Arabic Fertility | 2.29 | 5.65 | |
|
|
| English Fertility | 2.10 | 2.91 | |
|
|
| Parity (Ar/En) | 1.09 | 1.94 | |
|
|
| Interpretation | EXCELLENT | Moderate | |
|
|
|
|
|
*Fertility = average tokens per word; lower is better. Parity = Arabic fertility ÷ English fertility; the closer to 1.0, the more equally the two languages are treated.*
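For reference, these metrics can be computed with a short script. A minimal sketch, assuming a whitespace-delimited notion of "word" and the `encode` API from the Quick Start:

```python
from suhail import SARFTokenizer

tokenizer = SARFTokenizer.from_pretrained()

def fertility(texts: list[str]) -> float:
    """Average tokens per whitespace-delimited word over a corpus."""
    total_tokens = sum(len(tokenizer.encode(t)) for t in texts)
    total_words = sum(len(t.split()) for t in texts)
    return total_tokens / total_words

ar = fertility(["مرحبا بكم في المشروع"])              # Arabic sample
en = fertility(["Hello and welcome to the project"])  # English sample

# Parity = Arabic fertility / English fertility; 1.0 means both
# languages cost the same number of tokens per word on average.
print(f"Arabic: {ar:.2f}, English: {en:.2f}, Parity: {ar / en:.2f}")
```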
|
|
|
|
|
## Evaluation Dataset |
|
|
|
|
|
Evaluation data (10,000 samples: 5,000 Arabic + 5,000 English) is available at: |
|
|
[almaghrabima/eval-test-data](https://huggingface.co/datasets/almaghrabima/eval-test-data) |
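To pull the evaluation data for your own runs, the standard `datasets` library should work; the split and column names below are assumptions, so check the dataset card for the actual schema.

```python
from datasets import load_dataset

# "train" as the split name is an assumption; see the dataset card.
ds = load_dataset("almaghrabima/eval-test-data", split="train")
print(ds[0])
```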
|
|
|
|
|
## Performance Comparison |
|
|
|
|
|
SARF achieves excellent Arabic efficiency while maintaining strong English performance. Evaluated on 10,000 balanced samples (5,000 Arabic + 5,000 English): |
|
|
|
|
|
| Tokenizer | Vocab Size | Arabic Fertility | Arabic Chars/Token | English Fertility | English Chars/Token | Score | |
|
|
|-----------|------------|------------------|-------------------|------------------|---------------------|-------| |
|
|
| **SARF** | **100,000** | **1.469** | **3.959** | **1.779** | **3.353** | **2.251** | |
|
|
| GPT-4o (o200k_base) | 200,019 | 1.874 | 3.105 | 1.718 | 3.472 | 1.831 | |
|
|
| ALLaM-7B | 64,000 | 1.496 | 3.888 | 2.234 | 2.669 | 1.758 | |
|
|
| AceGPT-13B | 44,800 | 1.777 | 3.274 | 2.238 | 2.664 | 1.479 | |
|
|
| Gemma-3-4B | 262,145 | 2.033 | 2.862 | 2.075 | 2.874 | 1.396 | |
|
|
| Command-R Arabic | 255,033 | 2.084 | 2.791 | 2.076 | 2.873 | 1.362 | |
|
|
| Fanar-1-9B | 128,256 | 2.071 | 2.809 | 2.096 | 2.845 | 1.357 | |
|
|
| Hala-9B | 128,256 | 2.071 | 2.809 | 2.096 | 2.845 | 1.357 | |
|
|
| Qwen2.5-7B | 151,665 | 2.240 | 2.596 | 2.035 | 2.930 | 1.293 |
|
|
| Qwen3-VL-4B | 151,669 | 2.240 | 2.596 | 2.035 | 2.930 | 1.293 |
|
|
| GPT-4 (cl100k_base) | 100,277 | 4.071 | 1.429 | 1.736 | 3.435 | 0.838 | |
|
|
| Mistral-7B | 32,768 | 5.148 | 1.130 | 2.230 | 2.674 | 0.516 |
|
|
|
|
|
**Key Metrics:** |
|
|
- **Fertility**: Tokens per word (lower = more efficient, fewer tokens needed) |
|
|
- **Chars/Token**: Characters per token (higher = better compression per token) |
|
|
- **Score**: Combined bilingual efficiency metric (higher = better) |
|
|
|
|
|
### Understanding the Score |
|
|
|
|
|
The **Score** metric measures overall tokenizer efficiency across both languages: |
|
|
|
|
|
``` |
|
|
Score = (Arabic_Chars/Token + English_Chars/Token) / (Arabic_Fertility + English_Fertility) |
|
|
``` |
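Plugging the table values into this formula reproduces the reported numbers; for example, for SARF and GPT-4o:

```python
def score(ar_cpt: float, en_cpt: float, ar_fert: float, en_fert: float) -> float:
    """Combined bilingual efficiency: rewards chars/token, penalizes fertility."""
    return (ar_cpt + en_cpt) / (ar_fert + en_fert)

print(round(score(3.959, 3.353, 1.469, 1.779), 3))  # SARF   -> 2.251
print(round(score(3.105, 3.472, 1.874, 1.718), 3))  # GPT-4o -> 1.831
```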
|
|
|
|
|
**Score Interpretation:** |
|
|
- Score > 2.0: Excellent bilingual efficiency (SARF achieves 2.251) |
|
|
- Score 1.5-2.0: Good efficiency (GPT-4o, ALLaM-7B) |
|
|
- Score 1.0-1.5: Moderate efficiency (most Arabic-focused models) |
|
|
- Score < 1.0: Poor efficiency for Arabic (GPT-4, Mistral) |
|
|
|
|
|
### Key Findings |
|
|
|
|
|
1. **SARF ranks #1** with a Score of 2.251, outperforming the other 11 tokenizers tested
|
|
2. **23% better than GPT-4o**: Score 2.251 vs 1.831 |
|
|
3. **Best vocabulary efficiency**: With a vocabulary of only 100K, SARF outperforms models whose vocabularies are 2x to 2.6x larger
|
|
4. **Balanced multilingual performance**: Strong on both Arabic and English |
|
|
|
|
|
## Tokenizer Details |
|
|
|
|
|
- **Type**: SARF (Sarf-Aware Representation Framework) |
|
|
- **Vocabulary Size**: 100,000 |
|
|
- **Special Tokens**: 13 |
|
|
- **Languages**: Arabic + English (50/50 balanced) |
|
|
- **Target Model**: DeepLatent |
|
|
|
|
|
## Special Tokens |
|
|
|
|
|
- `<|assistant_end|>` |
|
|
- `<|assistant_start|>` |
|
|
- `<|bos|>` |
|
|
- `<|end_of_text|>` |
|
|
- `<|mask|>` |
|
|
- `<|output_end|>` |
|
|
- `<|output_start|>` |
|
|
- `<|pad|>` |
|
|
- `<|python_end|>` |
|
|
- `<|python_start|>` |
|
|
- `<|unk|>` |
|
|
- `<|user_end|>` |
|
|
- `<|user_start|>` |
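The role-style token names suggest a chat template along the following lines. This is a guess inferred purely from the token names; the actual template expected by DeepLatent is not documented here.

```python
# Hypothetical chat formatting inferred from the token names above;
# the actual DeepLatent template may differ.
def format_turn(user_msg: str, assistant_msg: str) -> str:
    return (
        "<|bos|>"
        f"<|user_start|>{user_msg}<|user_end|>"
        f"<|assistant_start|>{assistant_msg}<|assistant_end|>"
    )

print(format_turn("مرحبا", "Hello! How can I help?"))
```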
|
|
|
|
|
## Files |
|
|
|
|
|
- `tokenizer.json`: Main tokenizer file (HuggingFace format) |
|
|
- `tokenizer.pkl`: BPE tokenizer (native format) |
|
|
- `tokenizer_config.json`: Tokenizer configuration |
|
|
- `special_tokens_map.json`: Special tokens mapping |
|
|
- `token_bytes.pt`: Token byte mapping |
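Because `tokenizer.json` uses the standard HuggingFace format, it should also load directly with the `tokenizers` library. Note that this path bypasses the SARF morpheme preprocessing that `suhail-nlp` applies, so token counts will likely correspond to the "Without Preprocessing" numbers above.

```python
from tokenizers import Tokenizer

# Assumes tokenizer.json has been downloaded from this repo.
tok = Tokenizer.from_file("tokenizer.json")
ids = tok.encode("مرحبا Hello").ids
print(ids)
```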
|
|
|
|
|
## Author |
|
|
|
|
|
- **Mohammed Almaghrabi** |
|
|
- Email: almaghrabima@gmail.com |
|
|
- Project: Suhail Project |
|
|
- Independent research
|
|
|
|
|
## License |
|
|
|
|
|
This tokenizer is released under **CC-BY-NC-4.0** (Creative Commons Attribution-NonCommercial 4.0 International). |
|
|
|
|
|
**You are free to:** |
|
|
- Share: Copy and redistribute the material |
|
|
- Adapt: Remix, transform, and build upon the material |
|
|
|
|
|
**Under the following terms:** |
|
|
- **Attribution**: You must give appropriate credit |
|
|
- **NonCommercial**: You may not use the material for commercial purposes |
|
|
|
|
|
For commercial licensing, please contact: almaghrabima@gmail.com |
|
|
|
|
|
## Citation |
|
|
|
|
|
If you use this tokenizer in your research, please cite: |
|
|
|
|
|
```bibtex |
|
|
@misc{sarf-tokenizer-2026, |
|
|
title={SARF: A Morpheme-Aware Tokenization Framework for Arabic-English - Suhail Project}, |
|
|
author={Almaghrabi, Mohammed}, |
|
|
year={2026}, |
|
|
url={https://huggingface.co/almaghrabima/deeplatent-tokenizer}, |
|
|
note={Independent research, part of Suhail Project} |
|
|
} |
|
|
``` |
|
|
|