---
license: cc-by-nc-4.0
language:
- ar
- en
tags:
- tokenizer
- arabic
- morphology
- bpe
- deeplatent
- english
pipeline_tag: text-generation
---

# DeepLatent SARF Tokenizer

**Part of the Suhail Project - Independent Research by Mohammed Almaghrabi**

This is the **SARF** (Sarf-Aware Representation Framework) tokenizer for the DeepLatent language model, trained on bilingual Arabic/English data.

## What is SARF?

**SARF (صَرْف)** is the Arabic term for **morphology**. In classical and modern Arabic linguistics, *ṣarf* refers to the system that governs:

- Word formation
- Roots and patterns (جذر / وزن)
- Prefixes, suffixes, and infixes
- Tense, gender, number, and derivation

SARF combines morphological analysis with BPE tokenization to achieve better compression, especially for morphologically rich languages like Arabic.

Most tokenizers treat Arabic as **bytes or characters**. **SARF treats Arabic as a *language*.**

## Features

- **Arabic-Optimized**: Designed specifically for Arabic and other morphologically rich languages
- **Fast**: Rust core with Python bindings (43,000+ texts/sec with parallel processing)
- **Accurate**: 100% roundtrip accuracy on 1,000,000 test samples
- **Edge-Case Handling**: Correct handling of diacritics (tashkeel), prefixes, suffixes, and special characters
- **Unicode Support**: Full support for Arabic diacritics and mixed scripts
- **Parallel Processing**: Excellent thread scaling (5x+ speedup with 8 threads)

## Installation

```bash
uv pip install deeplatent-nlp
```

## Quick Start

```python
from deeplatent import SARFTokenizer

# Load the tokenizer
tok = SARFTokenizer.from_pretrained("SARFTokenizer")

# Encode text
ids = tok.encode("مرحبا بالعالم")
print(ids)

# Decode back
text = tok.decode(ids)
print(text)
```

## Edge Cases Handled

| Case | Example | Handling |
|------|---------|----------|
| Diacritics | بِسْمِ | Properly normalized |
| Arabic-Indic digits | ٠١٢٣٤٥ | Preserved |
| Alef variants | أ إ آ ا | Normalized to ا |
| Taa marbuta | ة | Optionally normalized |
| Tatweel (kashida) | كـتـاب | Removed |
| Mixed Arabic/English | Hello مرحبا | Both handled |

## Performance

### Tokenizer Benchmark Results

Comparison with state-of-the-art tokenizers on 60,000 samples (30k Arabic + 30k English).

**Dataset:** [almaghrabima/deeplatent-benchmark-data](https://huggingface.co/datasets/almaghrabima/deeplatent-benchmark-data)

| Tokenizer | Vocab | AR Fert | EN Fert | Avg Fert | AR C/T | EN C/T | Parity |
|-----------|-------|---------|---------|----------|--------|--------|--------|
| **SARFTokenizer** | 64,641 | **1.72** | 1.57 | **1.64** | 3.45 | 2.99 | 1.156 |
| ALLaM-7B | 64,000 | 1.82 | 1.48 | 1.65 | 3.08 | 2.65 | 1.163 |
| Gemma-3-4B | 262,145 | 2.78 | 1.33 | 2.05 | 2.42 | 3.00 | 0.805 |
| Falcon-H1-7B | 130,049 | 2.65 | 1.55 | 2.10 | 2.55 | 2.75 | 0.926 |
| Fanar-1-9B | 128,256 | 2.85 | 1.36 | 2.11 | 2.27 | 2.93 | 0.775 |
| Hala-9B | 128,256 | 2.85 | 1.36 | 2.11 | 2.27 | 2.93 | 0.775 |
| GPT-4o | 200,019 | 2.81 | 1.44 | 2.12 | 2.45 | 3.37 | 0.726 |
| Command-R-Arabic | 255,033 | 3.00 | 1.33 | 2.16 | 2.17 | 3.04 | 0.714 |
| Qwen3-4B | 151,669 | 3.06 | 1.50 | 2.28 | 2.04 | 2.92 | 0.697 |
| GPT-4 | 100,277 | 4.59 | 1.50 | 3.05 | 1.35 | 3.24 | 0.417 |
| Mistral-7B-v0.3 | 32,768 | 5.56 | 1.48 | 3.52 | 1.11 | 2.64 | 0.418 |

**Metrics explained:**

- **Fertility**: Average tokens per word (lower is better: more efficient encoding)
- **C/T**: Characters per token (higher is better: more characters encoded per token)
- **Parity**: AR chars/token ÷ EN chars/token (1.0 = equal treatment of both languages)

**Key findings:**

- **SARFTokenizer achieves the best Arabic fertility** (1.72 tokens/word), about 39% better than GPT-4o (2.81)
- **Lowest average fertility** (1.64) among all tokenizers tested
- **Best Arabic characters/token** (3.45): encodes more Arabic per token than any competitor
- Compact vocabulary (~64k) while maintaining top performance
- ALLaM-7B shows similar efficiency (both use morpheme-aware
approaches)
- Falcon-H1-7B has the parity closest to 1.0 (0.926), but 28% higher average fertility than SARF
- GPT-4 and Mistral struggle with Arabic (4.6-5.6 tokens/word vs 1.7 for SARF)

### Throughput Benchmark (1M samples, 680 MB)

Comparison with tiktoken on 1,000,000 documents:

| Tokenizer | 1 Thread | 2 Threads | 4 Threads | 8 Threads |
|-----------|----------|-----------|-----------|-----------|
| **SARFTokenizer** | 3.14 MB/s | 5.57 MB/s | 9.00 MB/s | **13.72 MB/s** |
| tiktoken (o200k) | 6.23 MB/s | 10.55 MB/s | 14.90 MB/s | 10.60 MB/s |
| tiktoken (cl100k) | 7.99 MB/s | 11.68 MB/s | 12.02 MB/s | 8.47 MB/s |
| HF tokenizers | 1.88 MB/s | 3.97 MB/s | 9.27 MB/s | 17.47 MB/s |

**Key findings:**

- **SARFTokenizer outperforms tiktoken at 8 threads** (13.72 MB/s vs 8.47-10.60 MB/s)
- **Excellent parallel scaling**: 4.4x speedup from 1 to 8 threads
- tiktoken degrades with more threads (peaks at 4 threads, drops at 8)

### Million-Scale Roundtrip Accuracy

Tested on 999,999 samples from real-world data:

| Category | Samples | Success | Accuracy |
|----------|---------|---------|----------|
| Arabic | 333,333 | 333,333 | **100.00%** |
| English | 333,333 | 333,333 | **100.00%** |
| Mixed | 333,333 | 333,333 | **100.00%** |
| **TOTAL** | **999,999** | **999,999** | **100.00%** |

### Edge Case Tests (58/58 Passed)

All 12 edge-case categories pass with 100% success:

| Category | Tests | Status |
|----------|-------|--------|
| Unicode Normalization | 6 | PASS |
| Zero-Width Characters | 6 | PASS |
| Unicode Whitespace | 6 | PASS |
| Grapheme Clusters | 6 | PASS |
| Apostrophes | 4 | PASS |
| Dashes | 4 | PASS |
| Decimal Separators | 3 | PASS |
| URLs/Emails | 4 | PASS |
| File Paths | 3 | PASS |
| Code Identifiers | 4 | PASS |
| Mixed Scripts/RTL | 6 | PASS |
| Robustness | 6 | PASS |

### Reproduce Benchmark Results

Datasets:

- Benchmark data (60k samples): [almaghrabima/deeplatent-benchmark-data](https://huggingface.co/datasets/almaghrabima/deeplatent-benchmark-data)
- Eval test data:
[almaghrabima/eval-test-data](https://huggingface.co/datasets/almaghrabima/eval-test-data)

```bash
# Install dependencies
pip install deeplatent-nlp pyarrow tiktoken transformers huggingface-hub

# Run the parity benchmark (vs GPT-4o, Gemma, etc.)
python benchmark_pypi.py

# Run the throughput benchmark (vs tiktoken)
python benchmark_tiktoken_style.py --samples 1000000 --threads 1 2 4 8

# Run comprehensive tests (roundtrip + edge cases)
python test_comprehensive_million.py --samples 1000000 --report
```

## Requirements

- Python 3.9+
- Rust 1.70+ (for building from source)

## License

CC-BY-NC-4.0

## Citation

```bibtex
@misc{sarf-tokenizer-2026,
  title={SARF: A Morpheme-Aware Tokenization Framework for Arabic-English - Suhail Project},
  author={Almaghrabi, Mohammed},
  year={2026},
  url={https://huggingface.co/almaghrabima/SARFTokenizer},
  note={Independent research, part of Suhail Project}
}
```
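## Appendix: Illustrative Sketches

The normalization rules listed in the "Edge Cases Handled" table (alef variants folded to ا, tatweel removed, Arabic-Indic digits preserved) can be sketched in plain Python. This is an illustration of the documented behavior, not the tokenizer's actual Rust implementation; `normalize_arabic` is a hypothetical helper name.

```python
TATWEEL = "\u0640"  # kashida / tatweel character
ALEF_VARIANTS = {"أ": "ا", "إ": "ا", "آ": "ا"}  # fold to bare alef

def normalize_arabic(text: str) -> str:
    """Apply the alef-folding and tatweel-removal rules from the table."""
    out = []
    for ch in text:
        if ch == TATWEEL:
            continue  # tatweel (kashida) is removed
        out.append(ALEF_VARIANTS.get(ch, ch))  # fold alef variants, keep the rest
    return "".join(out)

print(normalize_arabic("كـتـاب"))   # tatweel stripped -> كتاب
print(normalize_arabic("أإآا"))     # alef variants folded -> اااا
print(normalize_arabic("٠١٢٣٤٥"))  # Arabic-Indic digits preserved
```

Taa marbuta (ة) is handled as an optional rule in the tokenizer, so it is deliberately left untouched here.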
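The benchmark metrics defined under "Metrics explained" can be computed as below. This is a minimal sketch: `tokenize` stands in for any tokenizer's encode function (e.g. `tok.encode`), and the whitespace word split is a simplifying assumption rather than the benchmark's exact segmentation.

```python
def fertility(texts, tokenize):
    """Average tokens per word (lower is better)."""
    words = sum(len(t.split()) for t in texts)
    tokens = sum(len(tokenize(t)) for t in texts)
    return tokens / words

def chars_per_token(texts, tokenize):
    """Characters per token (higher is better)."""
    chars = sum(len(t) for t in texts)
    tokens = sum(len(tokenize(t)) for t in texts)
    return chars / tokens

def parity(ar_texts, en_texts, tokenize):
    """AR chars/token divided by EN chars/token (1.0 = equal treatment)."""
    return chars_per_token(ar_texts, tokenize) / chars_per_token(en_texts, tokenize)

# Toy check with a character-level "tokenizer" (one token per character):
toy = lambda t: list(t)
print(chars_per_token(["abcd"], toy))  # 1.0 by construction
```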
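A tiktoken-style throughput measurement (MB/s at N threads), as reported in the throughput table, can be sketched with the standard library. This is a generic harness under stated assumptions, not `benchmark_tiktoken_style.py` itself; `encode` stands in for any tokenizer's encode function, and real scaling depends on the Rust core releasing the GIL.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def throughput_mb_s(texts, encode, threads):
    """Encode all texts on a thread pool and report MB/s of UTF-8 input."""
    total_bytes = sum(len(t.encode("utf-8")) for t in texts)
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=threads) as pool:
        list(pool.map(encode, texts))  # drain the iterator to force all work
    elapsed = time.perf_counter() - start
    return total_bytes / elapsed / 1e6

# Toy run with a trivial "encoder" just to exercise the harness:
docs = ["مرحبا بالعالم"] * 1000
print(f"{throughput_mb_s(docs, lambda t: t.split(), 4):.2f} MB/s")
```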
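The roundtrip-accuracy numbers above boil down to one check per sample: `decode(encode(text))` must return the text unchanged. A minimal harness for that check, written against the Quick Start API shape (`tok.encode` / `tok.decode`), looks like the sketch below; the byte-level stand-in codec is only there so the example runs without the `deeplatent` package installed.

```python
def roundtrip_accuracy(samples, encode, decode):
    """Fraction of samples that survive encode -> decode unchanged."""
    ok = sum(1 for s in samples if decode(encode(s)) == s)
    return ok / len(samples)

# Toy lossless stand-in codec (UTF-8 bytes as "token ids"):
enc = lambda s: list(s.encode("utf-8"))
dec = lambda ids: bytes(ids).decode("utf-8")

samples = ["مرحبا بالعالم", "Hello مرحبا", "plain English"]
print(roundtrip_accuracy(samples, enc, dec))  # 1.0 for a lossless codec
```

With the real tokenizer, `encode=tok.encode` and `decode=tok.decode` plug in directly.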