almaghrabima committed on Commit ce0f0a1 · verified · 1 Parent(s): c90e6b5

Upload README.md with huggingface_hub

Files changed (1):
1. README.md ADDED (+121 −0)
# Deeplatent

High-performance Arabic tokenizer with morphology and parity awareness. Built with Rust for speed, with Python bindings for ease of use.

## Features

- **Arabic-Optimized**: Designed specifically for Arabic and other morphologically rich languages
- **Fast**: Rust core with Python bindings (~30,000 operations/sec)
- **Accurate**: 100% roundtrip accuracy on 300,000+ test samples
- **Edge Case Handling**: Proper handling of diacritics (tashkeel), prefixes, suffixes, and special characters
- **Unicode Support**: Full support for Arabic diacritics and mixed scripts
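To sanity-check a throughput figure like the one above on your own hardware, a minimal timing harness is enough. This is a sketch, not part of deeplatent: `dummy_encode` is a hypothetical stand-in you would replace with a real tokenizer's `encode`.

```python
import time

def ops_per_sec(encode, texts, runs=5):
    """Measure average encode throughput (calls/sec) over several runs."""
    rates = []
    for _ in range(runs):
        start = time.perf_counter()
        for t in texts:
            encode(t)
        elapsed = time.perf_counter() - start
        rates.append(len(texts) / elapsed)
    return sum(rates) / len(rates)

# Hypothetical stand-in for a tokenizer's encode method.
def dummy_encode(text):
    return [ord(c) for c in text]

rate = ops_per_sec(dummy_encode, ["مرحبا بالعالم"] * 1000)
print(f"{rate:,.0f} ops/sec")
```

Averaging over several runs smooths out warm-up and scheduler noise, which matters for sub-millisecond operations like tokenization.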
## Installation

```bash
pip install deeplatent-nlp
```

## Quick Start

```python
from deeplatent import SARFTokenizer

# Load tokenizer
tok = SARFTokenizer.from_pretrained("SARFTokenizer")

# Encode text
ids = tok.encode("مرحبا بالعالم")
print(ids)

# Decode back
text = tok.decode(ids)
print(text)
```

### Using SarfCodec

```python
from deeplatent import SarfCodec

# Load from encrypted morpheme map
codec = SarfCodec.from_encrypted("morf_map.enc")

# Encode/decode
encoded = codec.encode("بسم الله الرحمن الرحيم")
decoded = codec.decode(encoded)
```
## Handling Diacritics (Tashkeel)

The codec properly handles Arabic diacritics:

```python
from deeplatent import SarfCodec

codec = SarfCodec.from_encrypted("morf_map.enc")

# Text with full tashkeel
text = "بِسْمِ اللَّهِ الرَّحْمَنِ الرَّحِيمِ"
encoded = codec.encode(text)
decoded = codec.decode(encoded)
```
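Tashkeel is tricky because the harakat (U+064B–U+0652) are Unicode combining marks that attach to the preceding letter, so naive character-level handling can silently drop or misplace them. A stdlib-only sketch (not part of deeplatent's API) that separates base letters from diacritics:

```python
import unicodedata

def split_tashkeel(text):
    """Separate Arabic base letters from combining diacritics (harakat)."""
    base, marks = [], []
    for ch in text:
        # Harakat have a nonzero canonical combining class.
        if unicodedata.combining(ch):
            marks.append(ch)
        else:
            base.append(ch)
    return "".join(base), marks

letters, harakat = split_tashkeel("بِسْمِ")
print(letters)       # بسم
print(len(harakat))  # 3
```

A roundtrip-safe tokenizer must carry these marks through encode/decode rather than discard them during normalization.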
## Edge Cases Handled

| Case | Example | Handling |
|------|---------|----------|
| Diacritics | بِسْمِ | Properly normalized |
| Arabic-Indic digits | ٠١٢٣٤٥ | Preserved |
| Alef variants | أ إ آ ا | Normalized to ا |
| Taa marbuta | ة | Optionally normalized |
| Tatweel (kashida) | كـتـاب | Removed |
| Mixed Arabic/English | Hello مرحبا | Both handled |
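The alef, tatweel, and digit rules in the table can be sketched in a few lines of plain Python. This is an illustrative rendering of those rules, not deeplatent's actual implementation:

```python
ALEF_VARIANTS = "أإآ"  # hamza-above, hamza-below, madda
TATWEEL = "\u0640"     # kashida elongation character

def normalize(text):
    out = []
    for ch in text:
        if ch in ALEF_VARIANTS:
            out.append("ا")  # fold alef variants to bare alef
        elif ch == TATWEEL:
            continue         # drop tatweel entirely
        else:
            out.append(ch)   # everything else, incl. Arabic-Indic digits, preserved
    return "".join(out)

print(normalize("أهلا"))    # اهلا
print(normalize("كـتـاب"))  # كتاب
print(normalize("٠١٢٣"))    # ٠١٢٣
```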
## Performance

### Tokenizer Benchmark Results

Comparison with state-of-the-art tokenizers (5 runs, 5,000 samples each).
Benchmark data: [almaghrabima/deeplatent-benchmark-data](https://huggingface.co/datasets/almaghrabima/deeplatent-benchmark-data)

| Rank | Tokenizer | Vocab | AR Fertility | EN Fertility | AR C/T | EN C/T | Parity |
|------|-----------|-------|--------------|--------------|--------|--------|--------|
| 1 | **SARFTokenizer** | 64,641 | 1.71 | 1.57 | 3.45 | 2.99 | **1.155** |
| 2 | Gemma-3-4B | 262,145 | 2.78 | 1.33 | 2.42 | 3.01 | 0.804 |
| 3 | Fanar-1-9B | 128,256 | 2.85 | 1.36 | 2.27 | 2.94 | 0.774 |
| 4 | GPT-4o | 200,019 | 2.81 | 1.44 | 2.45 | 3.38 | 0.725 |
| 5 | Command-R-Arabic | 255,033 | 3.00 | 1.33 | 2.17 | 3.04 | 0.713 |
| 6 | Qwen3-4B | 151,669 | 3.05 | 1.50 | 2.04 | 2.93 | 0.696 |
| 7 | GPT-4 | 100,277 | 4.59 | 1.50 | 1.35 | 3.25 | 0.416 |

**Metrics explained:**
- **Fertility**: Average tokens per word (lower is better)
- **C/T**: Characters per token (higher is better: more compression)
- **Parity**: AR C/T ÷ EN C/T (1.0 means both languages are compressed equally)

**Key findings:**
- SARFTokenizer achieves the parity closest to 1.0 (1.155), meaning near-equal treatment of Arabic and English
- SARFTokenizer has the lowest Arabic fertility (1.71 tokens/word vs 2.78+ for the others)
- Morpheme-aware encoding significantly improves Arabic tokenization efficiency
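The metrics above follow directly from raw counts. A minimal sketch of the definitions (the counts below are illustrative, not the benchmark's actual data; the published parity of 1.155 is computed from unrounded C/T values, so recomputing from the rounded table entries gives 1.154):

```python
def fertility(n_tokens, n_words):
    """Average tokens per word (lower is better)."""
    return n_tokens / n_words

def chars_per_token(n_chars, n_tokens):
    """Compression: characters per token (higher is better)."""
    return n_chars / n_tokens

def parity(ar_cpt, en_cpt):
    """Ratio of Arabic to English compression; 1.0 = equal treatment."""
    return ar_cpt / en_cpt

# Illustrative counts chosen to match the table's rounded C/T values.
ar = chars_per_token(n_chars=3450, n_tokens=1000)  # 3.45
en = chars_per_token(n_chars=2990, n_tokens=1000)  # 2.99
print(round(parity(ar, en), 3))               # 1.154
print(fertility(n_tokens=171, n_words=100))   # 1.71
```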
## Requirements

- Python 3.9+
- Rust 1.70+ (for building from source)

## License

CC-BY-NC-4.0

## Citation

```bibtex
@misc{sarf-tokenizer-2026,
  title={SARF: A Morpheme-Aware Tokenization Framework for Arabic-English - Suhail Project},
  author={Almaghrabi, Mohammed},
  year={2026},
  url={https://huggingface.co/almaghrabima/SARFTokenizer},
  note={Independent research, part of Suhail Project}
}
```