---
license: cc-by-nc-4.0
language:
- ar
- en
tags:
- tokenizer
- arabic
- morphology
- bpe
- deeplatent
- english
pipeline_tag: text-generation
---
# DeepLatent SARF Tokenizer
**Part of Suhail Project - Independent Research by Mohammed Almaghrabi**
This is the **SARF** (Sarf-Aware Representation Framework) tokenizer designed for the DeepLatent language model, trained on bilingual Arabic/English data.
## What is SARF?
**SARF (صَرْف)** is the Arabic term for **morphology**. In classical and modern Arabic linguistics, *ṣarf* refers to the system that governs:
- Word formation
- Roots and patterns (جذر / وزن)
- Prefixes, suffixes, infixes
- Tense, gender, number, and derivation
SARF combines morphological analysis with BPE tokenization to achieve better compression, especially for morphologically rich languages like Arabic.
Most tokenizers treat Arabic as **bytes or characters**. **SARF treats Arabic as a *language*.**
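The idea behind combining morphology with BPE can be sketched in a few lines: peel off common clitics before subword merging sees the word. The sketch below is a toy illustration only; the prefix and suffix lists are hypothetical stand-ins, not SARF's actual Rust segmentation.

```python
# Toy morpheme-aware pre-segmentation (illustrative only; SARF's real
# segmentation lives in its Rust core and is far more complete).
COMMON_PREFIXES = ("ال", "و", "ب", "ل", "ف")     # article, conjunctions, prepositions
COMMON_SUFFIXES = ("ها", "هم", "ون", "ات", "ة")  # pronouns, plural/feminine endings

def pre_segment(word: str) -> list[str]:
    """Split off one common Arabic prefix and one suffix before BPE runs."""
    parts = [word]
    for p in COMMON_PREFIXES:
        if word.startswith(p) and len(word) > len(p) + 1:
            parts = [p, word[len(p):]]
            break
    stem = parts[-1]
    for s in COMMON_SUFFIXES:
        if stem.endswith(s) and len(stem) > len(s) + 1:
            parts = parts[:-1] + [stem[:-len(s)], s]
            break
    return parts

print(pre_segment("كتابها"))   # "her book": stem + pronoun suffix
print(pre_segment("والكتاب"))  # "and the book": conjunction + rest
```

A BPE vocabulary trained over such segments learns stems and affixes as separate units instead of memorizing every inflected surface form.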
## Features
- **Arabic-Optimized**: Designed specifically for Arabic and morphologically-rich languages
- **Fast**: Rust core with Python bindings (43,000+ texts/sec with parallel processing)
- **Accurate**: 100% roundtrip accuracy on 999,999 real-world test samples
- **Edge Case Handling**: Proper handling of diacritics (tashkeel), prefixes, suffixes, and special characters
- **Unicode Support**: Full handling of Arabic diacritics and mixed scripts
- **Parallel Processing**: Strong thread scaling (4.4x speedup with 8 threads)
## Installation
```bash
uv pip install deeplatent-nlp
```
## Quick Start
```python
from deeplatent import SARFTokenizer
# Load tokenizer
tok = SARFTokenizer.from_pretrained("SARFTokenizer")
# Encode text
ids = tok.encode("مرحبا بالعالم")
print(ids)
# Decode back
text = tok.decode(ids)
print(text)
```
## Edge Cases Handled
| Case | Example | Handling |
|------|---------|----------|
| Diacritics | بِسْمِ | Properly normalized |
| Arabic-Indic digits | ٠١٢٣٤٥ | Preserved |
| Alef variants | أ إ آ ا | Normalized to ا |
| Taa marbuta | ة | Optionally normalized |
| Tatweel (kashida) | كـتـاب | Removed |
| Mixed Arabic/English | Hello مرحبا | Both handled |
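The normalization rules in the table map to plain Unicode ranges. The sketch below mirrors the listed behavior in pure Python (stripping diacritics is shown as one possible normalization); it is not SARF's actual Rust code.

```python
import re

# Sketch of the table's normalization rules; SARF's Rust implementation
# may differ in details (e.g. diacritics may be kept, not stripped).
ALEF_VARIANTS = re.compile("[\u0622\u0623\u0625]")  # آ أ إ
DIACRITICS = re.compile("[\u064B-\u0652]")          # tashkeel marks
TATWEEL = "\u0640"                                  # kashida / elongation

def normalize(text: str) -> str:
    text = ALEF_VARIANTS.sub("\u0627", text)  # unify alef variants to ا
    text = text.replace(TATWEEL, "")          # remove tatweel
    return DIACRITICS.sub("", text)           # strip diacritics

print(normalize("كـتـاب"))  # كتاب
print(normalize("بِسْمِ"))    # بسم
```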
## Performance
### Tokenizer Benchmark Results
Comparison with state-of-the-art tokenizers on 60,000 samples (30k Arabic + 30k English).
**Dataset:** [almaghrabima/deeplatent-benchmark-data](https://huggingface.co/datasets/almaghrabima/deeplatent-benchmark-data)
| Tokenizer | Vocab | AR Fert | EN Fert | Avg Fert | AR C/T | EN C/T | Parity |
|-----------|-------|---------|---------|----------|--------|--------|--------|
| **SARFTokenizer** | 64,641 | **1.72** | 1.57 | **1.64** | 3.45 | 2.99 | **1.156** |
| ALLaM-7B | 64,000 | 1.82 | 1.48 | 1.65 | 3.08 | 2.65 | 1.163 |
| Gemma-3-4B | 262,145 | 2.78 | 1.33 | 2.05 | 2.42 | 3.00 | 0.805 |
| Falcon-H1-7B | 130,049 | 2.65 | 1.55 | 2.10 | 2.55 | 2.75 | 0.926 |
| Fanar-1-9B | 128,256 | 2.85 | 1.36 | 2.11 | 2.27 | 2.93 | 0.775 |
| Hala-9B | 128,256 | 2.85 | 1.36 | 2.11 | 2.27 | 2.93 | 0.775 |
| GPT-4o | 200,019 | 2.81 | 1.44 | 2.12 | 2.45 | 3.37 | 0.726 |
| Command-R-Arabic | 255,033 | 3.00 | 1.33 | 2.16 | 2.17 | 3.04 | 0.714 |
| Qwen3-4B | 151,669 | 3.06 | 1.50 | 2.28 | 2.04 | 2.92 | 0.697 |
| GPT-4 | 100,277 | 4.59 | 1.50 | 3.05 | 1.35 | 3.24 | 0.417 |
| Mistral-7B-v0.3 | 32,768 | 5.56 | 1.48 | 3.52 | 1.11 | 2.64 | 0.418 |
**Metrics explained:**
- **Fertility**: Average tokens per word (lower is better - more efficient encoding)
- **C/T**: Characters per token (higher is better - more characters encoded per token)
- **Parity**: AR chars/token ÷ EN chars/token (1.0 = equal treatment of both languages)
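These metrics are simple ratios; a worked example using rounded figures from the SARFTokenizer row above:

```python
# Worked example of the table's metrics (figures are the rounded
# SARFTokenizer row; counts below are hypothetical for illustration).
def fertility(n_tokens: int, n_words: int) -> float:
    """Average tokens per word; lower is better."""
    return n_tokens / n_words

# Hypothetical corpus: 1,000 Arabic words encoded into 1,720 tokens.
print(fertility(1720, 1000))  # 1.72

# Parity = AR chars/token ÷ EN chars/token. From the rounded table values
# this comes out ≈1.154; the table's 1.156 uses unrounded counts.
parity = 3.45 / 2.99
print(round(parity, 3))
```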
**Key findings:**
- **SARFTokenizer achieves best Arabic fertility** (1.72 tokens/word) - 35% better than GPT-4o
- **Lowest average fertility** (1.64) among all tokenizers tested
- **Best Arabic characters/token** (3.45) - encodes more Arabic per token than any competitor
- Compact vocabulary (64k) while maintaining top performance
- ALLaM-7B shows similar efficiency (both use morpheme-aware approaches)
- Falcon-H1-7B has best parity (0.926) but 28% higher fertility than SARF
- GPT-4 and Mistral struggle with Arabic (4.6-5.6 tokens/word vs 1.7 for SARF)
### Throughput Benchmark (1M samples, 680 MB)
Comparison with tiktoken and Hugging Face tokenizers on 1,000,000 documents:
| Tokenizer | 1 Thread | 2 Threads | 4 Threads | 8 Threads |
|-----------|----------|-----------|-----------|-----------|
| **SARFTokenizer** | 3.14 MB/s | 5.57 MB/s | 9.00 MB/s | **13.72 MB/s** |
| tiktoken (o200k) | 6.23 MB/s | 10.55 MB/s | 14.90 MB/s | 10.60 MB/s |
| tiktoken (cl100k) | 7.99 MB/s | 11.68 MB/s | 12.02 MB/s | 8.47 MB/s |
| HF tokenizers | 1.88 MB/s | 3.97 MB/s | 9.27 MB/s | 17.47 MB/s |
**Key findings:**
- **SARFTokenizer outperforms tiktoken at 8 threads** (13.72 MB/s vs 8.47-10.60 MB/s)
- **Excellent parallel scaling**: 4.4x speedup from 1 to 8 threads
- tiktoken degrades with more threads (peaks at 4T, drops at 8T)
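The scaling figure can be recomputed directly from the SARFTokenizer row of the table:

```python
# SARFTokenizer throughput (MB/s) by thread count, from the table above.
rates = {1: 3.14, 2: 5.57, 4: 9.00, 8: 13.72}
speedup = rates[8] / rates[1]
print(round(speedup, 1))  # 4.4x speedup from 1 to 8 threads
```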
### Million-Scale Roundtrip Accuracy
Tested on 999,999 samples from real-world data:
| Category | Samples | Success | Accuracy |
|----------|---------|---------|----------|
| Arabic | 333,333 | 333,333 | **100.00%** |
| English | 333,333 | 333,333 | **100.00%** |
| Mixed | 333,333 | 333,333 | **100.00%** |
| **TOTAL** | **999,999** | **999,999** | **100.00%** |
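Roundtrip accuracy is straightforward to verify for any tokenizer exposing the encode/decode pair from the Quick Start. The harness below uses a byte-level stand-in so it runs without any package installed; swap in `SARFTokenizer.from_pretrained("SARFTokenizer")` for the real check.

```python
# Generic roundtrip harness. ByteTokenizer is an illustrative stand-in;
# replace it with the real SARFTokenizer to reproduce the table above.
class ByteTokenizer:
    def encode(self, text: str) -> list[int]:
        return list(text.encode("utf-8"))

    def decode(self, ids: list[int]) -> str:
        return bytes(ids).decode("utf-8")

def roundtrip_ok(tok, texts) -> bool:
    """True iff decode(encode(t)) == t for every text."""
    return all(tok.decode(tok.encode(t)) == t for t in texts)

samples = ["مرحبا بالعالم", "Hello world", "Hello مرحبا ٠١٢٣"]
print(roundtrip_ok(ByteTokenizer(), samples))  # True
```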
### Edge Case Tests (58/58 Passed)
All 12 edge case categories pass with 100% success:
| Category | Tests | Status |
|----------|-------|--------|
| Unicode Normalization | 6 | PASS |
| Zero-Width Characters | 6 | PASS |
| Unicode Whitespace | 6 | PASS |
| Grapheme Clusters | 6 | PASS |
| Apostrophes | 4 | PASS |
| Dashes | 4 | PASS |
| Decimal Separators | 3 | PASS |
| URLs/Emails | 4 | PASS |
| File Paths | 3 | PASS |
| Code Identifiers | 4 | PASS |
| Mixed Scripts/RTL | 6 | PASS |
| Robustness | 6 | PASS |
### Reproduce Benchmark Results
Datasets:
- Benchmark data (60k samples): [almaghrabima/deeplatent-benchmark-data](https://huggingface.co/datasets/almaghrabima/deeplatent-benchmark-data)
- Eval test data: [almaghrabima/eval-test-data](https://huggingface.co/datasets/almaghrabima/eval-test-data)
```bash
# Install dependencies
pip install deeplatent-nlp pyarrow tiktoken transformers huggingface-hub
# Run parity benchmark (vs GPT-4o, Gemma, etc.)
python benchmark_pypi.py
# Run throughput benchmark (vs tiktoken)
python benchmark_tiktoken_style.py --samples 1000000 --threads 1 2 4 8
# Run comprehensive tests (roundtrip + edge cases)
python test_comprehensive_million.py --samples 1000000 --report
```
## Requirements
- Python 3.9+
- Rust 1.70+ (for building from source)
## License
CC-BY-NC-4.0
## Citation
```bibtex
@misc{sarf-tokenizer-2026,
  title={SARF: A Morpheme-Aware Tokenization Framework for Arabic-English - Suhail Project},
  author={Almaghrabi, Mohammed},
  year={2026},
  url={https://huggingface.co/almaghrabima/SARFTokenizer},
  note={Independent research, part of Suhail Project}
}
```