# Deeplatent

High-performance Arabic tokenizer with morphology and parity awareness. Built with Rust for speed, with Python bindings for ease of use.

## Features

- **Arabic-Optimized**: Designed specifically for Arabic and other morphologically rich languages
- **Fast**: Rust core with Python bindings (~30,000 operations/sec)
- **Accurate**: 100% roundtrip accuracy on 300,000+ test samples
- **Edge Case Handling**: Proper handling of diacritics (tashkeel), prefixes, suffixes, and special characters
- **Unicode Support**: Full support for Arabic diacritics and mixed scripts

## Installation

```bash
pip install deeplatent-nlp
```

## Quick Start

```python
from deeplatent import SARFTokenizer

# Load the tokenizer
tok = SARFTokenizer.from_pretrained("SARFTokenizer")

# Encode text to token IDs
ids = tok.encode("مرحبا بالعالم")
print(ids)

# Decode back to text
text = tok.decode(ids)
print(text)
```
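
The roundtrip guarantee noted under Features (`decode(encode(x)) == x`) can be checked with a loop like the one below. `ToyCodec` is a hypothetical byte-level stand-in used only so the snippet runs without `deeplatent` installed; with the library available, pass a `SARFTokenizer` instance instead.

```python
# Hypothetical byte-level stand-in for SARFTokenizer, used only so this
# check runs without deeplatent installed.
class ToyCodec:
    def encode(self, text):
        return list(text.encode("utf-8"))

    def decode(self, ids):
        return bytes(ids).decode("utf-8")


def roundtrip_accuracy(codec, samples):
    """Fraction of samples that survive encode -> decode unchanged."""
    ok = sum(codec.decode(codec.encode(s)) == s for s in samples)
    return ok / len(samples)


samples = ["مرحبا بالعالم", "Hello مرحبا", "٠١٢٣٤٥"]
print(roundtrip_accuracy(ToyCodec(), samples))  # 1.0
```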

### Using SarfCodec

```python
from deeplatent import SarfCodec

# Load from an encrypted morpheme map
codec = SarfCodec.from_encrypted("morf_map.enc")

# Encode and decode
encoded = codec.encode("بسم الله الرحمن الرحيم")
decoded = codec.decode(encoded)
```

## Handling Diacritics (Tashkeel)

The codec properly handles Arabic diacritics:

```python
from deeplatent import SarfCodec

codec = SarfCodec.from_encrypted("morf_map.enc")

# Text with full tashkeel
text = "بِسْمِ اللَّهِ الرَّحْمَنِ الرَّحِيمِ"
encoded = codec.encode(text)
decoded = codec.decode(encoded)
```

## Edge Cases Handled

| Case | Example | Handling |
|------|---------|----------|
| Diacritics | بِسْمِ | Properly normalized |
| Arabic-Indic digits | ٠١٢٣٤٥ | Preserved |
| Alef variants | أ إ آ ا | Normalized to ا |
| Taa marbuta | ة | Optionally normalized |
| Tatweel (kashida) | كـتـاب | Removed |
| Mixed Arabic/English | Hello مرحبا | Both handled |
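
The normalizations in the table can be sketched in a few lines of plain Python. This is a simplified illustration of the behaviors described above, not the library's actual implementation (which lives in the Rust core):

```python
import re

AR_DIACRITICS = re.compile(r"[\u064B-\u0652]")  # tashkeel range: fathatan .. sukun
TATWEEL = "\u0640"                              # kashida
ALEF_VARIANTS = str.maketrans("أإآ", "ااا")     # unify alef forms to bare ا


def normalize(text, strip_tashkeel=True):
    """Simplified sketch of the normalizations in the table above."""
    text = text.replace(TATWEEL, "")        # remove kashida: كـتـاب -> كتاب
    text = text.translate(ALEF_VARIANTS)    # alef variants -> ا
    if strip_tashkeel:
        text = AR_DIACRITICS.sub("", text)  # drop diacritics: بِسْمِ -> بسم
    return text


print(normalize("كـتـاب"))  # كتاب
```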

## Performance

### Tokenizer Benchmark Results

Comparison with state-of-the-art tokenizers (5 runs, 5,000 samples each).
Benchmark data: [almaghrabima/deeplatent-benchmark-data](https://huggingface.co/datasets/almaghrabima/deeplatent-benchmark-data)

| Rank | Tokenizer | Vocab | AR Fertility | EN Fertility | AR C/T | EN C/T | Parity |
|------|-----------|-------|--------------|--------------|--------|--------|--------|
| 1 | **SARFTokenizer** | 64,641 | 1.71 | 1.57 | 3.45 | 2.99 | **1.155** |
| 2 | Gemma-3-4B | 262,145 | 2.78 | 1.33 | 2.42 | 3.01 | 0.804 |
| 3 | Fanar-1-9B | 128,256 | 2.85 | 1.36 | 2.27 | 2.94 | 0.774 |
| 4 | GPT-4o | 200,019 | 2.81 | 1.44 | 2.45 | 3.38 | 0.725 |
| 5 | Command-R-Arabic | 255,033 | 3.00 | 1.33 | 2.17 | 3.04 | 0.713 |
| 6 | Qwen3-4B | 151,669 | 3.05 | 1.50 | 2.04 | 2.93 | 0.696 |
| 7 | GPT-4 | 100,277 | 4.59 | 1.50 | 1.35 | 3.25 | 0.416 |

**Metrics explained:**

- **Fertility**: Average tokens per word (lower is better)
- **C/T**: Characters per token (higher is better; more compression)
- **Parity**: AR C/T ÷ EN C/T (1.0 = equal treatment of both languages)
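
All three metrics are simple ratios; the sketch below shows how they relate, using the SARFTokenizer row from the table. (The small gap between the computed value and the table's 1.155 presumably comes from rounding in the printed C/T columns.)

```python
def fertility(n_tokens, n_words):
    """Average tokens per word; lower is better."""
    return n_tokens / n_words


def chars_per_token(n_chars, n_tokens):
    """Compression: characters per token; higher is better."""
    return n_chars / n_tokens


def parity(ar_cpt, en_cpt):
    """AR C/T divided by EN C/T; 1.0 means both languages are treated equally."""
    return ar_cpt / en_cpt


# SARFTokenizer row: AR C/T = 3.45, EN C/T = 2.99
print(round(parity(3.45, 2.99), 2))  # 1.15
```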

**Key findings:**

- SARFTokenizer's parity (1.155) is the closest to 1.0, meaning near-equal treatment of Arabic and English
- SARFTokenizer has the lowest Arabic fertility (1.71 tokens/word vs 2.78+ for the others)
- Morpheme-aware encoding significantly improves Arabic tokenization efficiency

## Requirements

- Python 3.9+
- Rust 1.70+ (for building from source)

## License

CC-BY-NC-4.0

## Citation

```bibtex
@misc{sarf-tokenizer-2026,
  title={SARF: A Morpheme-Aware Tokenization Framework for Arabic-English - Suhail Project},
  author={Almaghrabi, Mohammed},
  year={2026},
  url={https://huggingface.co/almaghrabima/SARFTokenizer},
  note={Independent research, part of Suhail Project}
}
```