|
|
--- |
|
|
license: cc-by-nc-4.0 |
|
|
language: |
|
|
- ar |
|
|
- en |
|
|
tags: |
|
|
- tokenizer |
|
|
- arabic |
|
|
- morphology |
|
|
- bpe |
|
|
- deeplatent |
|
|
- english |
|
|
|
|
pipeline_tag: text-generation |
|
|
--- |
|
|
|
|
|
# DeepLatent SARF Tokenizer |
|
|
|
|
|
**Part of Suhail Project - Independent Research by Mohammed Almaghrabi** |
|
|
|
|
|
This is **SARF** (Sarf-Aware Representation Framework), the tokenizer designed for the DeepLatent language model and trained on bilingual Arabic/English data.
|
|
|
|
|
## What is SARF? |
|
|
|
|
|
**SARF (صَرْف)** is the Arabic term for **morphology**. In classical and modern Arabic linguistics, *ṣarf* refers to the system that governs: |
|
|
|
|
|
- Word formation |
|
|
- Roots and patterns (جذر / وزن) |
|
|
- Prefixes, suffixes, infixes |
|
|
- Tense, gender, number, and derivation |
|
|
|
|
|
SARF combines morphological analysis with BPE tokenization to achieve better compression, especially for morphologically rich languages like Arabic. |
|
|
|
|
|
Most tokenizers treat Arabic as **bytes or characters**. **SARF treats Arabic as a *language*.** |
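
To make the idea concrete, here is a simplified sketch of morpheme-aware pre-segmentation: split common clitic affixes off a word before BPE sees it, so frequent stems survive as whole units. This is an illustration only, not SARF's actual algorithm; the affix lists and splitting rules below are toy examples.

```python
# Simplified illustration of morpheme-aware pre-segmentation.
# NOT the actual SARF algorithm: the affix lists and the
# one-prefix/one-suffix rule are toy choices for readability.

COMMON_PREFIXES = ["وال", "بال", "كال", "فال", "ال", "و", "ف", "ب", "ك", "ل"]
COMMON_SUFFIXES = ["ها", "هم", "كم", "نا", "ون", "ات", "ة", "ه"]

def presegment(word: str) -> list[str]:
    """Split off at most one clitic prefix and one suffix."""
    parts = []
    for p in COMMON_PREFIXES:  # longest-first order matters
        if word.startswith(p) and len(word) > len(p) + 2:
            parts.append(p)
            word = word[len(p):]
            break
    suffix = None
    for s in COMMON_SUFFIXES:
        if word.endswith(s) and len(word) > len(s) + 2:
            suffix = s
            word = word[:-len(s)]
            break
    parts.append(word)
    if suffix:
        parts.append(suffix)
    return parts

# "والكتاب" ("and the book") -> ["وال", "كتاب"]; a byte-level BPE
# must learn this boundary statistically and often splits mid-morpheme.
print(presegment("والكتاب"))
```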
|
|
|
|
|
## Features |
|
|
|
|
|
- **Arabic-Optimized**: Designed specifically for Arabic and morphologically-rich languages |
|
|
- **Fast**: Rust core with Python bindings (43,000+ texts/sec with parallel processing)
|
|
- **Accurate**: 100% roundtrip accuracy on a million-scale test set (999,999 samples)
|
|
- **Edge Case Handling**: Proper handling of diacritics (tashkeel), prefixes, suffixes, and special characters |
|
|
- **Unicode Support**: Full support for Arabic diacritics and mixed scripts
|
|
- **Parallel Processing**: Excellent thread scaling (4.4x speedup with 8 threads)
|
|
|
|
|
## Installation |
|
|
|
|
|
```bash |
|
|
uv pip install deeplatent-nlp |
|
|
``` |
|
|
|
|
|
## Quick Start |
|
|
|
|
|
```python |
|
|
from deeplatent import SARFTokenizer |
|
|
|
|
|
# Load tokenizer |
|
|
tok = SARFTokenizer.from_pretrained("almaghrabima/SARFTokenizer")
|
|
|
|
|
# Encode text |
|
|
ids = tok.encode("مرحبا بالعالم") |
|
|
print(ids) |
|
|
|
|
|
# Decode back |
|
|
text = tok.decode(ids) |
|
|
print(text) |
|
|
``` |
|
|
|
|
|
## Edge Cases Handled |
|
|
|
|
|
| Case | Example | Handling | |
|
|
|------|---------|----------| |
|
|
| Diacritics | بِسْمِ | Properly normalized | |
|
|
| Arabic-Indic digits | ٠١٢٣٤٥ | Preserved | |
|
|
| Alef variants | أ إ آ ا | Normalized to ا | |
|
|
| Taa marbuta | ة | Optionally normalized | |
|
|
| Tatweel (kashida) | كـتـاب | Removed | |
|
|
| Mixed Arabic/English | Hello مرحبا | Both handled | |
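
The behavior in this table can be inspected directly with the encode/decode calls from the Quick Start. A minimal sketch; per the table above, normalized forms (alef variants, tatweel) may decode to a different surface string than the raw input, depending on the normalization settings:

```python
from deeplatent import SARFTokenizer

tok = SARFTokenizer.from_pretrained("almaghrabima/SARFTokenizer")

# One example per edge-case row; compare input and decoded output.
cases = ["بِسْمِ", "٠١٢٣٤٥", "أ إ آ ا", "كـتـاب", "Hello مرحبا"]
for text in cases:
    ids = tok.encode(text)
    print(f"{text!r} -> {len(ids)} tokens -> {tok.decode(ids)!r}")
```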
|
|
|
|
|
## Performance |
|
|
|
|
|
### Tokenizer Benchmark Results |
|
|
|
|
|
Comparison with state-of-the-art tokenizers on 60,000 samples (30k Arabic + 30k English). |
|
|
|
|
|
**Dataset:** [almaghrabima/deeplatent-benchmark-data](https://huggingface.co/datasets/almaghrabima/deeplatent-benchmark-data) |
|
|
|
|
|
| Tokenizer | Vocab | AR Fert | EN Fert | Avg Fert | AR C/T | EN C/T | Parity | |
|
|
|-----------|-------|---------|---------|----------|--------|--------|--------| |
|
|
| **SARFTokenizer** | 64,641 | **1.72** | 1.57 | **1.64** | 3.45 | 2.99 | 1.156 |
|
|
| ALLaM-7B | 64,000 | 1.82 | 1.48 | 1.65 | 3.08 | 2.65 | 1.163 | |
|
|
| Gemma-3-4B | 262,145 | 2.78 | 1.33 | 2.05 | 2.42 | 3.00 | 0.805 | |
|
|
| Falcon-H1-7B | 130,049 | 2.65 | 1.55 | 2.10 | 2.55 | 2.75 | 0.926 | |
|
|
| Fanar-1-9B | 128,256 | 2.85 | 1.36 | 2.11 | 2.27 | 2.93 | 0.775 | |
|
|
| Hala-9B | 128,256 | 2.85 | 1.36 | 2.11 | 2.27 | 2.93 | 0.775 | |
|
|
| GPT-4o | 200,019 | 2.81 | 1.44 | 2.12 | 2.45 | 3.37 | 0.726 | |
|
|
| Command-R-Arabic | 255,033 | 3.00 | 1.33 | 2.16 | 2.17 | 3.04 | 0.714 | |
|
|
| Qwen3-4B | 151,669 | 3.06 | 1.50 | 2.28 | 2.04 | 2.92 | 0.697 | |
|
|
| GPT-4 | 100,277 | 4.59 | 1.50 | 3.05 | 1.35 | 3.24 | 0.417 | |
|
|
| Mistral-7B-v0.3 | 32,768 | 5.56 | 1.48 | 3.52 | 1.11 | 2.64 | 0.418 | |
|
|
|
|
|
**Metrics explained:** |
|
|
- **Fertility**: Average tokens per word (lower is better - more efficient encoding) |
|
|
- **C/T**: Characters per token (higher is better - more characters encoded per token) |
|
|
- **Parity**: AR chars/token ÷ EN chars/token (1.0 = equal treatment of both languages) |
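
These metrics are straightforward to recompute from raw token counts. A minimal sketch, assuming naive whitespace word splitting (the published benchmark scripts may segment words differently):

```python
from deeplatent import SARFTokenizer

tok = SARFTokenizer.from_pretrained("almaghrabima/SARFTokenizer")

def fertility_and_cpt(texts: list[str]) -> tuple[float, float]:
    """Return (tokens per word, characters per token) over a corpus."""
    n_tokens = n_words = n_chars = 0
    for t in texts:
        n_tokens += len(tok.encode(t))
        n_words += len(t.split())  # naive whitespace words
        n_chars += len(t)
    return n_tokens / n_words, n_chars / n_tokens

arabic_texts = ["مرحبا بالعالم", "الكتاب على الطاولة"]
english_texts = ["Hello world", "The book is on the table"]

ar_fert, ar_cpt = fertility_and_cpt(arabic_texts)
en_fert, en_cpt = fertility_and_cpt(english_texts)
parity = ar_cpt / en_cpt  # 1.0 = equal treatment of both languages
print(f"AR fert={ar_fert:.2f}  EN fert={en_fert:.2f}  parity={parity:.3f}")
```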
|
|
|
|
|
**Key findings:** |
|
|
- **SARFTokenizer achieves the best Arabic fertility** (1.72 tokens/word) - 39% fewer tokens per Arabic word than GPT-4o (2.81)
|
|
- **Lowest average fertility** (1.64) among all tokenizers tested |
|
|
- **Best Arabic characters/token** (3.45) - encodes more Arabic per token than any competitor |
|
|
- Compact vocabulary (64k) while maintaining top performance |
|
|
- ALLaM-7B shows similar efficiency (both use morpheme-aware approaches) |
|
|
- Falcon-H1-7B has the parity closest to 1.0 (0.926) but 28% higher average fertility than SARF
|
|
- GPT-4 and Mistral struggle with Arabic (4.6-5.6 tokens/word vs 1.7 for SARF) |
|
|
|
|
|
### Throughput Benchmark (1M samples, 680 MB) |
|
|
|
|
|
Comparison with tiktoken and Hugging Face tokenizers on 1,000,000 documents:
|
|
|
|
|
| Tokenizer | 1 Thread | 2 Threads | 4 Threads | 8 Threads | |
|
|
|-----------|----------|-----------|-----------|-----------| |
|
|
| **SARFTokenizer** | 3.14 MB/s | 5.57 MB/s | 9.00 MB/s | **13.72 MB/s** | |
|
|
| tiktoken (o200k) | 6.23 MB/s | 10.55 MB/s | 14.90 MB/s | 10.60 MB/s | |
|
|
| tiktoken (cl100k) | 7.99 MB/s | 11.68 MB/s | 12.02 MB/s | 8.47 MB/s | |
|
|
| HF tokenizers | 1.88 MB/s | 3.97 MB/s | 9.27 MB/s | 17.47 MB/s | |
|
|
|
|
|
**Key findings:** |
|
|
- **SARFTokenizer outperforms tiktoken at 8 threads** (13.72 MB/s vs 8.47-10.60 MB/s) |
|
|
- **Excellent parallel scaling**: 4.4x speedup from 1 to 8 threads |
|
|
- tiktoken degrades with more threads (peaks at 4T, drops at 8T) |
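
A minimal sketch of the thread-scaling measurement, assuming the Rust core releases the GIL during `encode` (which the scaling numbers above suggest). The corpus and script here are illustrative, not the published benchmark:

```python
import time
from concurrent.futures import ThreadPoolExecutor

from deeplatent import SARFTokenizer

tok = SARFTokenizer.from_pretrained("almaghrabima/SARFTokenizer")
docs = ["مرحبا بالعالم! Hello world. " * 50] * 10_000  # toy corpus
total_mb = sum(len(d.encode("utf-8")) for d in docs) / 1e6

for n_threads in (1, 2, 4, 8):
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        list(pool.map(tok.encode, docs))
    elapsed = time.perf_counter() - start
    print(f"{n_threads} threads: {total_mb / elapsed:.2f} MB/s")
```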
|
|
|
|
|
### Million-Scale Roundtrip Accuracy |
|
|
|
|
|
Tested on 999,999 samples from real-world data: |
|
|
|
|
|
| Category | Samples | Success | Accuracy | |
|
|
|----------|---------|---------|----------| |
|
|
| Arabic | 333,333 | 333,333 | **100.00%** | |
|
|
| English | 333,333 | 333,333 | **100.00%** | |
|
|
| Mixed | 333,333 | 333,333 | **100.00%** | |
|
|
| **TOTAL** | **999,999** | **999,999** | **100.00%** | |
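
A minimal version of this check, assuming exact string equality after decode (as the 100% figures above report):

```python
from deeplatent import SARFTokenizer

tok = SARFTokenizer.from_pretrained("almaghrabima/SARFTokenizer")

def roundtrip_accuracy(samples: list[str]) -> float:
    """Fraction of samples where decode(encode(s)) == s."""
    ok = sum(tok.decode(tok.encode(s)) == s for s in samples)
    return ok / len(samples)

samples = ["مرحبا بالعالم", "Hello world", "Hello مرحبا ١٢٣"]
print(f"Roundtrip accuracy: {roundtrip_accuracy(samples):.2%}")
```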
|
|
|
|
|
### Edge Case Tests (58/58 Passed) |
|
|
|
|
|
All 12 edge case categories pass with 100% success: |
|
|
|
|
|
| Category | Tests | Status | |
|
|
|----------|-------|--------| |
|
|
| Unicode Normalization | 6 | PASS | |
|
|
| Zero-Width Characters | 6 | PASS | |
|
|
| Unicode Whitespace | 6 | PASS | |
|
|
| Grapheme Clusters | 6 | PASS | |
|
|
| Apostrophes | 4 | PASS | |
|
|
| Dashes | 4 | PASS | |
|
|
| Decimal Separators | 3 | PASS | |
|
|
| URLs/Emails | 4 | PASS | |
|
|
| File Paths | 3 | PASS | |
|
|
| Code Identifiers | 4 | PASS | |
|
|
| Mixed Scripts/RTL | 6 | PASS | |
|
|
| Robustness | 6 | PASS | |
|
|
|
|
|
### Reproduce Benchmark Results |
|
|
|
|
|
Datasets: |
|
|
- Benchmark data (60k samples): [almaghrabima/deeplatent-benchmark-data](https://huggingface.co/datasets/almaghrabima/deeplatent-benchmark-data) |
|
|
- Eval test data: [almaghrabima/eval-test-data](https://huggingface.co/datasets/almaghrabima/eval-test-data) |
|
|
|
|
|
```bash |
|
|
# Install dependencies |
|
|
pip install deeplatent-nlp pyarrow tiktoken transformers huggingface-hub |
|
|
|
|
|
# Run parity benchmark (vs GPT-4o, Gemma, etc.) |
|
|
python benchmark_pypi.py |
|
|
|
|
|
# Run throughput benchmark (vs tiktoken) |
|
|
python benchmark_tiktoken_style.py --samples 1000000 --threads 1 2 4 8 |
|
|
|
|
|
# Run comprehensive tests (roundtrip + edge cases) |
|
|
python test_comprehensive_million.py --samples 1000000 --report |
|
|
``` |
|
|
|
|
|
## Requirements |
|
|
|
|
|
- Python 3.9+ |
|
|
- Rust 1.70+ (for building from source) |
|
|
|
|
|
## License |
|
|
|
|
|
CC-BY-NC-4.0 |
|
|
|
|
|
## Citation |
|
|
|
|
|
```bibtex |
|
|
@misc{sarf-tokenizer-2026, |
|
|
title={SARF: A Morpheme-Aware Tokenization Framework for Arabic-English - Suhail Project}, |
|
|
author={Almaghrabi, Mohammed}, |
|
|
year={2026}, |
|
|
url={https://huggingface.co/almaghrabima/SARFTokenizer}, |
|
|
note={Independent research, part of Suhail Project} |
|
|
} |
|
|
``` |
|
|
|