|
|
--- |
|
|
license: cc-by-nc-4.0 |
|
|
language: |
|
|
- ar |
|
|
- en |
|
|
tags: |
|
|
- tokenizer |
|
|
- arabic |
|
|
- morphology |
|
|
- bpe |
|
|
- deeplatent |
|
|
- english |
|
|
|
|
pipeline_tag: text-generation |
|
|
--- |
|
|
|
|
|
# DeepLatent SARF Tokenizer |
|
|
|
|
|
**Part of Suhail Project - Independent Research by Mohammed Almaghrabi** |
|
|
|
|
|
This is **SARF** (Sarf-Aware Representation Framework), the tokenizer designed for the DeepLatent language model and trained on bilingual Arabic/English data.
|
|
|
|
|
## What is SARF? |
|
|
|
|
|
**SARF (صَرْف)** is the Arabic term for **morphology**. In classical and modern Arabic linguistics, *ṣarf* refers to the system that governs: |
|
|
|
|
|
- Word formation |
|
|
- Roots and patterns (جذر / وزن) |
|
|
- Prefixes, suffixes, infixes |
|
|
- Tense, gender, number, and derivation |
|
|
|
|
|
SARF combines morphological analysis with BPE tokenization to achieve better compression, especially for morphologically rich languages like Arabic. |
|
|
|
|
|
Most tokenizers treat Arabic as **bytes or characters**. **SARF treats Arabic as a *language*.** |
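
To make the idea concrete, here is a simplified sketch of morpheme-aware pre-segmentation: split common clitic affixes off a word before BPE sees it, so frequent stems survive as whole units. This is an illustration only, not SARF's actual algorithm; the affix lists and splitting rules below are toy examples.

```python
# Simplified illustration of morpheme-aware pre-segmentation.
# NOT the actual SARF algorithm: the affix lists and the
# one-prefix/one-suffix rule are toy choices for readability.

COMMON_PREFIXES = ["وال", "بال", "كال", "فال", "ال", "و", "ف", "ب", "ك", "ل"]
COMMON_SUFFIXES = ["ها", "هم", "كم", "نا", "ون", "ات", "ة", "ه"]

def presegment(word: str) -> list[str]:
    """Split off at most one clitic prefix and one suffix."""
    parts = []
    for p in COMMON_PREFIXES:  # longest-first order matters
        if word.startswith(p) and len(word) > len(p) + 2:
            parts.append(p)
            word = word[len(p):]
            break
    suffix = None
    for s in COMMON_SUFFIXES:
        if word.endswith(s) and len(word) > len(s) + 2:
            suffix = s
            word = word[:-len(s)]
            break
    parts.append(word)
    if suffix:
        parts.append(suffix)
    return parts

# "والكتاب" ("and the book") -> ["وال", "كتاب"]; a byte-level BPE
# must learn this boundary statistically and often splits mid-morpheme.
print(presegment("والكتاب"))
```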
|
|
|
|
|
## Features |
|
|
|
|
|
- **Arabic-Optimized**: Designed specifically for Arabic and morphologically-rich languages |
|
|
- **Fast**: Rust core with Python bindings (43,000+ texts/sec with parallel processing)
|
|
- **Accurate**: 100% roundtrip accuracy on a million-scale test set (999,999 samples)
|
|
- **Edge Case Handling**: Proper handling of diacritics (tashkeel), prefixes, suffixes, and special characters |
|
|
- **Unicode Support**: Full support for Arabic diacritics and mixed scripts
|
|
- **Parallel Processing**: Excellent thread scaling (4.4x speedup with 8 threads)
|
|
|
|
|
## Installation |
|
|
|
|
|
```bash |
|
|
uv pip install deeplatent-nlp |
|
|
``` |
|
|
|
|
|
## Quick Start |
|
|
|
|
|
```python |
|
|
from deeplatent import SARFTokenizer |
|
|
|
|
|
# Load tokenizer |
|
|
tok = SARFTokenizer.from_pretrained("almaghrabima/SARFTokenizer")
|
|
|
|
|
# Encode text |
|
|
ids = tok.encode("مرحبا بالعالم") |
|
|
print(ids) |
|
|
|
|
|
# Decode back |
|
|
text = tok.decode(ids) |
|
|
print(text) |
|
|
``` |
|
|
|
|
|
## Edge Cases Handled |
|
|
|
|
|
| Case | Example | Handling | |
|
|
|------|---------|----------| |
|
|
| Diacritics | بِسْمِ | Properly normalized | |
|
|
| Arabic-Indic digits | ٠١٢٣٤٥ | Preserved | |
|
|
| Alef variants | أ إ آ ا | Normalized to ا | |
|
|
| Taa marbuta | ة | Optionally normalized | |
|
|
| Tatweel (kashida) | كـتـاب | Removed | |
|
|
| Mixed Arabic/English | Hello مرحبا | Both handled | |
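
The behavior in this table can be inspected directly with the encode/decode calls from the Quick Start. A minimal sketch; per the table above, normalized forms (alef variants, tatweel) may decode to a different surface string than the raw input, depending on the normalization settings:

```python
from deeplatent import SARFTokenizer

tok = SARFTokenizer.from_pretrained("almaghrabima/SARFTokenizer")

# One example per edge-case row; compare input and decoded output.
cases = ["بِسْمِ", "٠١٢٣٤٥", "أ إ آ ا", "كـتـاب", "Hello مرحبا"]
for text in cases:
    ids = tok.encode(text)
    print(f"{text!r} -> {len(ids)} tokens -> {tok.decode(ids)!r}")
```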
|
|
|
|
|
## Performance |
|
|
|
|
|
### Tokenizer Benchmark Results |
|
|
|
|
|
Comparison with state-of-the-art tokenizers on 60,000 samples (30k Arabic + 30k English). |
|
|
|
|
|
**Dataset:** [almaghrabima/deeplatent-benchmark-data](https://huggingface.co/datasets/almaghrabima/deeplatent-benchmark-data) |
|
|
|
|
|
| Tokenizer | Vocab | AR Fert | EN Fert | Avg Fert | AR C/T | EN C/T | Parity | |
|
|
|-----------|-------|---------|---------|----------|--------|--------|--------| |
|
|
| **SARFTokenizer** | 64,641 | **1.72** | 1.57 | **1.64** | 3.45 | 2.99 | 1.156 |
|
|
| ALLaM-7B | 64,000 | 1.82 | 1.48 | 1.65 | 3.08 | 2.65 | 1.163 | |
|
|
| Gemma-3-4B | 262,145 | 2.78 | 1.33 | 2.05 | 2.42 | 3.00 | 0.805 | |
|
|
| Falcon-H1-7B | 130,049 | 2.65 | 1.55 | 2.10 | 2.55 | 2.75 | 0.926 | |
|
|
| Fanar-1-9B | 128,256 | 2.85 | 1.36 | 2.11 | 2.27 | 2.93 | 0.775 | |
|
|
| Hala-9B | 128,256 | 2.85 | 1.36 | 2.11 | 2.27 | 2.93 | 0.775 | |
|
|
| GPT-4o | 200,019 | 2.81 | 1.44 | 2.12 | 2.45 | 3.37 | 0.726 | |
|
|
| Command-R-Arabic | 255,033 | 3.00 | 1.33 | 2.16 | 2.17 | 3.04 | 0.714 | |
|
|
| Qwen3-4B | 151,669 | 3.06 | 1.50 | 2.28 | 2.04 | 2.92 | 0.697 | |
|
|
| GPT-4 | 100,277 | 4.59 | 1.50 | 3.05 | 1.35 | 3.24 | 0.417 | |
|
|
| Mistral-7B-v0.3 | 32,768 | 5.56 | 1.48 | 3.52 | 1.11 | 2.64 | 0.418 | |
|
|
|
|
|
**Metrics explained:** |
|
|
- **Fertility**: Average tokens per word (lower is better - more efficient encoding) |
|
|
- **C/T**: Characters per token (higher is better - more characters encoded per token) |
|
|
- **Parity**: AR chars/token ÷ EN chars/token (1.0 = equal treatment of both languages) |
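
These metrics are straightforward to recompute from raw token counts. A minimal sketch, assuming naive whitespace word splitting (the published benchmark scripts may segment words differently):

```python
from deeplatent import SARFTokenizer

tok = SARFTokenizer.from_pretrained("almaghrabima/SARFTokenizer")

def fertility_and_cpt(texts: list[str]) -> tuple[float, float]:
    """Return (tokens per word, characters per token) over a corpus."""
    n_tokens = n_words = n_chars = 0
    for t in texts:
        n_tokens += len(tok.encode(t))
        n_words += len(t.split())  # naive whitespace words
        n_chars += len(t)
    return n_tokens / n_words, n_chars / n_tokens

arabic_texts = ["مرحبا بالعالم", "الكتاب على الطاولة"]
english_texts = ["Hello world", "The book is on the table"]

ar_fert, ar_cpt = fertility_and_cpt(arabic_texts)
en_fert, en_cpt = fertility_and_cpt(english_texts)
parity = ar_cpt / en_cpt  # 1.0 = equal treatment of both languages
print(f"AR fert={ar_fert:.2f}  EN fert={en_fert:.2f}  parity={parity:.3f}")
```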
|
|
|
|
|
**Key findings:** |
|
|
- **SARFTokenizer achieves the best Arabic fertility** (1.72 tokens/word) - 39% fewer tokens per Arabic word than GPT-4o (2.81)
|
|
- **Lowest average fertility** (1.64) among all tokenizers tested |
|
|
- **Best Arabic characters/token** (3.45) - encodes more Arabic per token than any competitor |
|
|
- Compact vocabulary (64k) while maintaining top performance |
|
|
- ALLaM-7B shows similar efficiency (both use morpheme-aware approaches) |
|
|
- Falcon-H1-7B has the parity closest to 1.0 (0.926) but 28% higher average fertility than SARF
|
|
- GPT-4 and Mistral struggle with Arabic (4.6-5.6 tokens/word vs 1.7 for SARF) |
|
|
|
|
|
### Throughput Benchmark (1M samples, 680 MB) |
|
|
|
|
|
Comparison with tiktoken and Hugging Face tokenizers on 1,000,000 documents:
|
|
|
|
|
| Tokenizer | 1 Thread | 2 Threads | 4 Threads | 8 Threads | |
|
|
|-----------|----------|-----------|-----------|-----------| |
|
|
| **SARFTokenizer** | 3.14 MB/s | 5.57 MB/s | 9.00 MB/s | **13.72 MB/s** | |
|
|
| tiktoken (o200k) | 6.23 MB/s | 10.55 MB/s | 14.90 MB/s | 10.60 MB/s | |
|
|
| tiktoken (cl100k) | 7.99 MB/s | 11.68 MB/s | 12.02 MB/s | 8.47 MB/s | |
|
|
| HF tokenizers | 1.88 MB/s | 3.97 MB/s | 9.27 MB/s | 17.47 MB/s | |
|
|
|
|
|
**Key findings:** |
|
|
- **SARFTokenizer outperforms tiktoken at 8 threads** (13.72 MB/s vs 8.47-10.60 MB/s) |
|
|
- **Excellent parallel scaling**: 4.4x speedup from 1 to 8 threads |
|
|
- tiktoken degrades with more threads (peaks at 4T, drops at 8T) |
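
A minimal sketch of the thread-scaling measurement, assuming the Rust core releases the GIL during `encode` (which the scaling numbers above suggest). The corpus and script here are illustrative, not the published benchmark:

```python
import time
from concurrent.futures import ThreadPoolExecutor

from deeplatent import SARFTokenizer

tok = SARFTokenizer.from_pretrained("almaghrabima/SARFTokenizer")
docs = ["مرحبا بالعالم! Hello world. " * 50] * 10_000  # toy corpus
total_mb = sum(len(d.encode("utf-8")) for d in docs) / 1e6

for n_threads in (1, 2, 4, 8):
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        list(pool.map(tok.encode, docs))
    elapsed = time.perf_counter() - start
    print(f"{n_threads} threads: {total_mb / elapsed:.2f} MB/s")
```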
|
|
|
|
|
### Million-Scale Roundtrip Accuracy |
|
|
|
|
|
Tested on 999,999 samples from real-world data: |
|
|
|
|
|
| Category | Samples | Success | Accuracy | |
|
|
|----------|---------|---------|----------| |
|
|
| Arabic | 333,333 | 333,333 | **100.00%** | |
|
|
| English | 333,333 | 333,333 | **100.00%** | |
|
|
| Mixed | 333,333 | 333,333 | **100.00%** | |
|
|
| **TOTAL** | **999,999** | **999,999** | **100.00%** | |
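
A minimal version of this check, assuming exact string equality after decode (as the 100% figures above report):

```python
from deeplatent import SARFTokenizer

tok = SARFTokenizer.from_pretrained("almaghrabima/SARFTokenizer")

def roundtrip_accuracy(samples: list[str]) -> float:
    """Fraction of samples where decode(encode(s)) == s."""
    ok = sum(tok.decode(tok.encode(s)) == s for s in samples)
    return ok / len(samples)

samples = ["مرحبا بالعالم", "Hello world", "Hello مرحبا ١٢٣"]
print(f"Roundtrip accuracy: {roundtrip_accuracy(samples):.2%}")
```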
|
|
|
|
|
### Edge Case Tests (58/58 Passed) |
|
|
|
|
|
All 12 edge case categories pass with 100% success: |
|
|
|
|
|
| Category | Tests | Status | |
|
|
|----------|-------|--------| |
|
|
| Unicode Normalization | 6 | PASS | |
|
|
| Zero-Width Characters | 6 | PASS | |
|
|
| Unicode Whitespace | 6 | PASS | |
|
|
| Grapheme Clusters | 6 | PASS | |
|
|
| Apostrophes | 4 | PASS | |
|
|
| Dashes | 4 | PASS | |
|
|
| Decimal Separators | 3 | PASS | |
|
|
| URLs/Emails | 4 | PASS | |
|
|
| File Paths | 3 | PASS | |
|
|
| Code Identifiers | 4 | PASS | |
|
|
| Mixed Scripts/RTL | 6 | PASS | |
|
|
| Robustness | 6 | PASS | |
|
|
|
|
|
### Reproduce Benchmark Results |
|
|
|
|
|
Datasets: |
|
|
- Benchmark data (60k samples): [almaghrabima/deeplatent-benchmark-data](https://huggingface.co/datasets/almaghrabima/deeplatent-benchmark-data) |
|
|
- Eval test data: [almaghrabima/eval-test-data](https://huggingface.co/datasets/almaghrabima/eval-test-data) |
|
|
|
|
|
```bash |
|
|
# Install dependencies |
|
|
pip install deeplatent-nlp pyarrow tiktoken transformers huggingface-hub |
|
|
|
|
|
# Run parity benchmark (vs GPT-4o, Gemma, etc.) |
|
|
python benchmark_pypi.py |
|
|
|
|
|
# Run throughput benchmark (vs tiktoken) |
|
|
python benchmark_tiktoken_style.py --samples 1000000 --threads 1 2 4 8 |
|
|
|
|
|
# Run comprehensive tests (roundtrip + edge cases) |
|
|
python test_comprehensive_million.py --samples 1000000 --report |
|
|
``` |
|
|
|
|
|
## Requirements |
|
|
|
|
|
- Python 3.9+ |
|
|
- Rust 1.70+ (for building from source) |
|
|
|
|
|
## License |
|
|
|
|
|
CC-BY-NC-4.0 |
|
|
|
|
|
## Citation |
|
|
|
|
|
```bibtex |
|
|
@misc{sarf-tokenizer-2026, |
|
|
title={SARF: A Morpheme-Aware Tokenization Framework for Arabic-English - Suhail Project}, |
|
|
author={Almaghrabi, Mohammed}, |
|
|
year={2026}, |
|
|
url={https://huggingface.co/almaghrabima/SARFTokenizer}, |
|
|
note={Independent research, part of Suhail Project} |
|
|
} |
|
|
``` |
|
|
|