# Deeplatent

High-performance Arabic tokenizer with morphology and parity awareness. Built with Rust for speed, with Python bindings for ease of use.

## Features

- **Arabic-Optimized**: Designed specifically for Arabic and other morphologically rich languages
- **Fast**: Rust core with Python bindings (~30,000 operations/sec)
- **Accurate**: 100% roundtrip accuracy on 300,000+ test samples
- **Edge Case Handling**: Proper handling of diacritics (tashkeel), prefixes, suffixes, and special characters
- **Unicode Support**: Full support for Arabic diacritics and mixed scripts

## Installation

```bash
pip install deeplatent-nlp
```

## Quick Start

```python
from deeplatent import SARFTokenizer

# Load the tokenizer
tok = SARFTokenizer.from_pretrained("SARFTokenizer")

# Encode text to token IDs
ids = tok.encode("مرحبا بالعالم")
print(ids)

# Decode back to text
text = tok.decode(ids)
print(text)
```
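
The roundtrip guarantee noted under Features (`decode(encode(x)) == x`) can be checked with a loop like the one below. `ToyCodec` is a hypothetical byte-level stand-in used only so the snippet runs without `deeplatent` installed; with the library available, pass a `SARFTokenizer` instance instead.

```python
# Hypothetical byte-level stand-in for SARFTokenizer, used only so this
# check runs without deeplatent installed.
class ToyCodec:
    def encode(self, text):
        return list(text.encode("utf-8"))

    def decode(self, ids):
        return bytes(ids).decode("utf-8")


def roundtrip_accuracy(codec, samples):
    """Fraction of samples that survive encode -> decode unchanged."""
    ok = sum(codec.decode(codec.encode(s)) == s for s in samples)
    return ok / len(samples)


samples = ["مرحبا بالعالم", "Hello مرحبا", "٠١٢٣٤٥"]
print(roundtrip_accuracy(ToyCodec(), samples))  # 1.0
```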

### Using SarfCodec

```python
from deeplatent import SarfCodec

# Load from an encrypted morpheme map
codec = SarfCodec.from_encrypted("morf_map.enc")

# Encode and decode
encoded = codec.encode("بسم الله الرحمن الرحيم")
decoded = codec.decode(encoded)
```

## Handling Diacritics (Tashkeel)

The codec properly handles Arabic diacritics:

```python
from deeplatent import SarfCodec

codec = SarfCodec.from_encrypted("morf_map.enc")

# Text with full tashkeel
text = "بِسْمِ اللَّهِ الرَّحْمَنِ الرَّحِيمِ"
encoded = codec.encode(text)
decoded = codec.decode(encoded)
```

## Edge Cases Handled

| Case | Example | Handling |
|------|---------|----------|
| Diacritics | بِسْمِ | Properly normalized |
| Arabic-Indic digits | ٠١٢٣٤٥ | Preserved |
| Alef variants | أ إ آ ا | Normalized to ا |
| Taa marbuta | ة | Optionally normalized |
| Tatweel (kashida) | كـتـاب | Removed |
| Mixed Arabic/English | Hello مرحبا | Both handled |
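
The normalizations in the table can be sketched in a few lines of plain Python. This is a simplified illustration of the behaviors described above, not the library's actual implementation (which lives in the Rust core):

```python
import re

AR_DIACRITICS = re.compile(r"[\u064B-\u0652]")  # tashkeel range: fathatan .. sukun
TATWEEL = "\u0640"                              # kashida
ALEF_VARIANTS = str.maketrans("أإآ", "ااا")     # unify alef forms to bare ا


def normalize(text, strip_tashkeel=True):
    """Simplified sketch of the normalizations in the table above."""
    text = text.replace(TATWEEL, "")        # remove kashida: كـتـاب -> كتاب
    text = text.translate(ALEF_VARIANTS)    # alef variants -> ا
    if strip_tashkeel:
        text = AR_DIACRITICS.sub("", text)  # drop diacritics: بِسْمِ -> بسم
    return text


print(normalize("كـتـاب"))  # كتاب
```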

## Performance

### Tokenizer Benchmark Results

Comparison with state-of-the-art tokenizers (5 runs, 5,000 samples each).
Benchmark data: [almaghrabima/deeplatent-benchmark-data](https://huggingface.co/datasets/almaghrabima/deeplatent-benchmark-data)

| Rank | Tokenizer | Vocab | AR Fertility | EN Fertility | AR C/T | EN C/T | Parity |
|------|-----------|-------|--------------|--------------|--------|--------|--------|
| 1 | **SARFTokenizer** | 64,641 | 1.71 | 1.57 | 3.45 | 2.99 | **1.155** |
| 2 | Gemma-3-4B | 262,145 | 2.78 | 1.33 | 2.42 | 3.01 | 0.804 |
| 3 | Fanar-1-9B | 128,256 | 2.85 | 1.36 | 2.27 | 2.94 | 0.774 |
| 4 | GPT-4o | 200,019 | 2.81 | 1.44 | 2.45 | 3.38 | 0.725 |
| 5 | Command-R-Arabic | 255,033 | 3.00 | 1.33 | 2.17 | 3.04 | 0.713 |
| 6 | Qwen3-4B | 151,669 | 3.05 | 1.50 | 2.04 | 2.93 | 0.696 |
| 7 | GPT-4 | 100,277 | 4.59 | 1.50 | 1.35 | 3.25 | 0.416 |

**Metrics explained:**

- **Fertility**: Average tokens per word (lower is better)
- **C/T**: Characters per token (higher is better; more compression)
- **Parity**: AR C/T ÷ EN C/T (1.0 = equal treatment of both languages)
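
All three metrics are simple ratios; the sketch below shows how they relate, using the SARFTokenizer row from the table. (The small gap between the computed value and the table's 1.155 presumably comes from rounding in the printed C/T columns.)

```python
def fertility(n_tokens, n_words):
    """Average tokens per word; lower is better."""
    return n_tokens / n_words


def chars_per_token(n_chars, n_tokens):
    """Compression: characters per token; higher is better."""
    return n_chars / n_tokens


def parity(ar_cpt, en_cpt):
    """AR C/T divided by EN C/T; 1.0 means both languages are treated equally."""
    return ar_cpt / en_cpt


# SARFTokenizer row: AR C/T = 3.45, EN C/T = 2.99
print(round(parity(3.45, 2.99), 2))  # 1.15
```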

**Key findings:**

- SARFTokenizer's parity (1.155) is the closest to 1.0, meaning near-equal treatment of Arabic and English
- SARFTokenizer has the lowest Arabic fertility (1.71 tokens/word vs 2.78+ for the others)
- Morpheme-aware encoding significantly improves Arabic tokenization efficiency

## Requirements

- Python 3.9+
- Rust 1.70+ (for building from source)

## License

CC-BY-NC-4.0

## Citation

```bibtex
@misc{sarf-tokenizer-2026,
  title={SARF: A Morpheme-Aware Tokenization Framework for Arabic-English - Suhail Project},
  author={Almaghrabi, Mohammed},
  year={2026},
  url={https://huggingface.co/almaghrabima/SARFTokenizer},
  note={Independent research, part of Suhail Project}
}
```