---
license: cc-by-nc-4.0
language:
- ar
- en
tags:
- tokenizer
- arabic
- morphology
- bpe
- deeplatent
- english
pipeline_tag: text-generation
---

# DeepLatent SARF Tokenizer

**Part of Suhail Project - Independent Research by Mohammed Almaghrabi**

This is the **SARF** (Sarf-Aware Representation Framework) tokenizer, designed for the DeepLatent language model and trained on bilingual Arabic/English data.

## What is SARF?

**SARF (صَرْف)** is the Arabic term for **morphology**. In classical and modern Arabic linguistics, *ṣarf* refers to the system that governs:

- Word formation
- Roots and patterns (جذر / وزن)
- Prefixes, suffixes, infixes
- Tense, gender, number, and derivation

SARF combines morphological analysis with BPE tokenization to achieve better compression, especially for morphologically rich languages like Arabic.

Most tokenizers treat Arabic as **bytes or characters**. **SARF treats Arabic as a *language*.**
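
To make the idea concrete, here is an illustrative-only sketch of morpheme-aware pre-segmentation: peeling a known Arabic prefix off a word before handing the stem to a subword (BPE) tokenizer. The prefix inventory and function name are hypothetical for this example, not SARF's actual internals.

```python
# Illustrative sketch only: split a known Arabic prefix from a word before
# subword tokenization. The prefix list here is a tiny hypothetical sample.
PREFIXES = ("ال", "لل", "و", "ب")  # definite article, "to the", "and", "with/by"

def pre_segment(word: str) -> list[str]:
    # Greedy longest-match over the prefix inventory; keep at least two
    # characters of stem so a short word is not over-split.
    for p in sorted(PREFIXES, key=len, reverse=True):
        if word.startswith(p) and len(word) > len(p) + 1:
            return [p, word[len(p):]]
    return [word]

print(pre_segment("بالعالم"))  # the "ب" prefix is split off from "العالم"
print(pre_segment("كتاب"))    # no listed prefix, returned whole
```

A BPE model trained on such pre-segmented text sees prefixes and stems as separate units, which is one way a morpheme-aware pipeline can lower Arabic fertility.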

## Features

- **Arabic-Optimized**: Designed specifically for Arabic and morphologically-rich languages
- **Fast**: Rust core with Python bindings (over 43,000 texts/sec with parallel processing)
- **Accurate**: 100% roundtrip accuracy on 999,999 real-world test samples
- **Edge Case Handling**: Proper handling of diacritics (tashkeel), prefixes, suffixes, and special characters
- **Unicode Support**: Full support for Arabic diacritics and mixed scripts
- **Parallel Processing**: Excellent thread scaling (4.4x speedup with 8 threads)

## Installation

```bash
uv pip install deeplatent-nlp
```

## Quick Start

```python
from deeplatent import SARFTokenizer

# Load tokenizer
tok = SARFTokenizer.from_pretrained("almaghrabima/SARFTokenizer")

# Encode text
ids = tok.encode("مرحبا بالعالم")
print(ids)

# Decode back
text = tok.decode(ids)
print(text)
```

## Edge Cases Handled

| Case | Example | Handling |
|------|---------|----------|
| Diacritics | بِسْمِ | Properly normalized |
| Arabic-Indic digits | ٠١٢٣٤٥ | Preserved |
| Alef variants | أ إ آ ا | Normalized to ا |
| Taa marbuta | ة | Optionally normalized |
| Tatweel (kashida) | كـتـاب | Removed |
| Mixed Arabic/English | Hello مرحبا | Both handled |
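
The alef-variant and tatweel rows above can be sketched with plain Python string operations. This mirrors the table's rules for illustration; it is not the tokenizer's internal normalizer.

```python
# Sketch of two rules from the table above, using only str.translate:
# fold alef variants to bare alef, and drop tatweel (kashida) entirely.
ALEF_VARIANTS = "أإآ"   # alef with hamza above, hamza below, alef madda
TATWEEL = "\u0640"      # kashida, as in كـتـاب

_table = {ord(c): "ا" for c in ALEF_VARIANTS}  # variant -> bare alef
_table[ord(TATWEEL)] = None                    # None deletes the character

def normalize(text: str) -> str:
    return text.translate(_table)

print(normalize("أ إ آ ا"))  # -> ا ا ا ا
print(normalize("كـتـاب"))   # -> كتاب
```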

## Performance

### Tokenizer Benchmark Results

Comparison with state-of-the-art tokenizers on 60,000 samples (30k Arabic + 30k English).

**Dataset:** [almaghrabima/deeplatent-benchmark-data](https://huggingface.co/datasets/almaghrabima/deeplatent-benchmark-data)

| Tokenizer | Vocab | AR Fert | EN Fert | Avg Fert | AR C/T | EN C/T | Parity |
|-----------|-------|---------|---------|----------|--------|--------|--------|
| **SARFTokenizer** | 64,641 | **1.72** | 1.57 | **1.64** | 3.45 | 2.99 | **1.156** |
| ALLaM-7B | 64,000 | 1.82 | 1.48 | 1.65 | 3.08 | 2.65 | 1.163 |
| Gemma-3-4B | 262,145 | 2.78 | 1.33 | 2.05 | 2.42 | 3.00 | 0.805 |
| Falcon-H1-7B | 130,049 | 2.65 | 1.55 | 2.10 | 2.55 | 2.75 | 0.926 |
| Fanar-1-9B | 128,256 | 2.85 | 1.36 | 2.11 | 2.27 | 2.93 | 0.775 |
| Hala-9B | 128,256 | 2.85 | 1.36 | 2.11 | 2.27 | 2.93 | 0.775 |
| GPT-4o | 200,019 | 2.81 | 1.44 | 2.12 | 2.45 | 3.37 | 0.726 |
| Command-R-Arabic | 255,033 | 3.00 | 1.33 | 2.16 | 2.17 | 3.04 | 0.714 |
| Qwen3-4B | 151,669 | 3.06 | 1.50 | 2.28 | 2.04 | 2.92 | 0.697 |
| GPT-4 | 100,277 | 4.59 | 1.50 | 3.05 | 1.35 | 3.24 | 0.417 |
| Mistral-7B-v0.3 | 32,768 | 5.56 | 1.48 | 3.52 | 1.11 | 2.64 | 0.418 |

**Metrics explained:**
- **Fertility**: Average tokens per word (lower is better - more efficient encoding)
- **C/T**: Characters per token (higher is better - more characters encoded per token)
- **Parity**: AR chars/token ÷ EN chars/token (1.0 = equal treatment of both languages)
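
These definitions can be written down directly; `encode` below is any tokenizer's encode function (a whitespace toy stands in here, so none of these numbers reproduce the table).

```python
# Definitions of the fertility, C/T, and parity metrics used above.
def fertility(texts, encode):
    words = sum(len(t.split()) for t in texts)
    toks = sum(len(encode(t)) for t in texts)
    return toks / words                  # tokens per word, lower is better

def chars_per_token(texts, encode):
    chars = sum(len(t) for t in texts)
    toks = sum(len(encode(t)) for t in texts)
    return chars / toks                  # higher is better

toy_encode = str.split                   # stand-in: one token per word
ar, en = ["مرحبا بالعالم"], ["hello world"]
parity = chars_per_token(ar, toy_encode) / chars_per_token(en, toy_encode)
```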

**Key findings:**
- **SARFTokenizer achieves the best Arabic fertility** (1.72 tokens/word), about 39% fewer tokens per word than GPT-4o (2.81)
- **Lowest average fertility** (1.64) among all tokenizers tested
- **Best Arabic characters/token** (3.45) - encodes more Arabic per token than any competitor
- Compact vocabulary (64k) while maintaining top performance
- ALLaM-7B shows similar efficiency (both use morpheme-aware approaches)
- Falcon-H1-7B has best parity (0.926) but 28% higher fertility than SARF
- GPT-4 and Mistral struggle with Arabic (4.6-5.6 tokens/word vs 1.7 for SARF)

### Throughput Benchmark (1M samples, 680 MB)

Comparison with tiktoken on 1,000,000 documents:

| Tokenizer | 1 Thread | 2 Threads | 4 Threads | 8 Threads |
|-----------|----------|-----------|-----------|-----------|
| **SARFTokenizer** | 3.14 MB/s | 5.57 MB/s | 9.00 MB/s | **13.72 MB/s** |
| tiktoken (o200k) | 6.23 MB/s | 10.55 MB/s | 14.90 MB/s | 10.60 MB/s |
| tiktoken (cl100k) | 7.99 MB/s | 11.68 MB/s | 12.02 MB/s | 8.47 MB/s |
| HF tokenizers | 1.88 MB/s | 3.97 MB/s | 9.27 MB/s | 17.47 MB/s |

**Key findings:**
- **SARFTokenizer outperforms tiktoken at 8 threads** (13.72 MB/s vs 8.47-10.60 MB/s)
- **Excellent parallel scaling**: 4.4x speedup from 1 to 8 threads
- tiktoken degrades with more threads (peaks at 4T, drops at 8T)
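
A MB/s figure like those above can be measured with a simple thread-pool harness. `str.split` stands in for the tokenizer; note that real multi-thread scaling requires an encoder that releases the GIL (e.g. a Rust core), which pure-Python work will not show.

```python
# Sketch of a throughput measurement: bytes processed / wall time, with
# encoding fanned out over a thread pool. The encoder here is a stand-in.
import time
from concurrent.futures import ThreadPoolExecutor

def measure_mbps(texts, encode, threads):
    total_bytes = sum(len(t.encode("utf-8")) for t in texts)
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=threads) as pool:
        list(pool.map(encode, texts))    # drain the iterator to finish work
    return total_bytes / (time.perf_counter() - start) / 1e6  # MB/s

docs = ["hello world " * 50] * 200
print(f"{measure_mbps(docs, str.split, threads=4):.2f} MB/s")
```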

### Million-Scale Roundtrip Accuracy

Tested on 999,999 samples from real-world data:

| Category | Samples | Success | Accuracy |
|----------|---------|---------|----------|
| Arabic | 333,333 | 333,333 | **100.00%** |
| English | 333,333 | 333,333 | **100.00%** |
| Mixed | 333,333 | 333,333 | **100.00%** |
| **TOTAL** | **999,999** | **999,999** | **100.00%** |
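
The roundtrip criterion used above is simply `decode(encode(x)) == x` for every sample. Here it is sketched with a trivially reversible UTF-8 byte codec standing in for the tokenizer:

```python
# Minimal roundtrip check: a lossless codec must reproduce its input
# exactly. A UTF-8 byte-level codec stands in for the real tokenizer.
def encode(text: str) -> list[int]:
    return list(text.encode("utf-8"))

def decode(ids: list[int]) -> str:
    return bytes(ids).decode("utf-8")

samples = ["مرحبا بالعالم", "Hello world", "Hello مرحبا"]
assert all(decode(encode(s)) == s for s in samples)
```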

### Edge Case Tests (58/58 Passed)

All 12 edge case categories pass with 100% success:

| Category | Tests | Status |
|----------|-------|--------|
| Unicode Normalization | 6 | PASS |
| Zero-Width Characters | 6 | PASS |
| Unicode Whitespace | 6 | PASS |
| Grapheme Clusters | 6 | PASS |
| Apostrophes | 4 | PASS |
| Dashes | 4 | PASS |
| Decimal Separators | 3 | PASS |
| URLs/Emails | 4 | PASS |
| File Paths | 3 | PASS |
| Code Identifiers | 4 | PASS |
| Mixed Scripts/RTL | 6 | PASS |
| Robustness | 6 | PASS |

### Reproduce Benchmark Results

Datasets:
- Benchmark data (60k samples): [almaghrabima/deeplatent-benchmark-data](https://huggingface.co/datasets/almaghrabima/deeplatent-benchmark-data)
- Eval test data: [almaghrabima/eval-test-data](https://huggingface.co/datasets/almaghrabima/eval-test-data)

```bash
# Install dependencies
pip install deeplatent-nlp pyarrow tiktoken transformers huggingface-hub

# Run parity benchmark (vs GPT-4o, Gemma, etc.)
python benchmark_pypi.py

# Run throughput benchmark (vs tiktoken)
python benchmark_tiktoken_style.py --samples 1000000 --threads 1 2 4 8

# Run comprehensive tests (roundtrip + edge cases)
python test_comprehensive_million.py --samples 1000000 --report
```

## Requirements

- Python 3.9+
- Rust 1.70+ (for building from source)

## License

CC-BY-NC-4.0

## Citation

```bibtex
@misc{sarf-tokenizer-2026,
  title={SARF: A Morpheme-Aware Tokenization Framework for Arabic-English - Suhail Project},
  author={Almaghrabi, Mohammed},
  year={2026},
  url={https://huggingface.co/almaghrabima/SARFTokenizer},
  note={Independent research, part of Suhail Project}
}
```