almaghrabima committed · Commit 57f81bb · verified · 1 Parent(s): c24518d

Upload BENCHMARK_REPORT.md with huggingface_hub

  1. BENCHMARK_REPORT.md +179 -0
BENCHMARK_REPORT.md ADDED
@@ -0,0 +1,179 @@
# Comprehensive Test & Benchmark Report

- **Package:** deeplatent-nlp v0.3.8
- **Date:** 2026-02-04
- **Tokenizer:** almaghrabima/SARFTokenizer (from HuggingFace)
- **Rust Available:** Yes

---

## Executive Summary

All tests passed: roundtrip accuracy was **100%** on 99,999 samples (a nominal 100,000, split evenly across three categories), and all **58 edge case tests** passed. SARFTokenizer also scales well across threads in this benchmark: tiktoken is faster on a single thread, but SARFTokenizer overtakes it from 2 threads upward.

---

## 1. Roundtrip Accuracy Tests

### 100,000 Sample Test Results

The nominal 100,000 samples divide into three equal categories of 33,333 texts, so 99,999 texts were tested in total.

| Category | Samples | Success | Failed | Accuracy |
|----------|---------|---------|--------|----------|
| Arabic | 33,333 | 33,333 | 0 | **100.00%** |
| English | 33,333 | 33,333 | 0 | **100.00%** |
| Mixed | 33,333 | 33,333 | 0 | **100.00%** |
| **TOTAL** | **99,999** | **99,999** | **0** | **100.00%** |

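The per-category accuracy in the table can be computed with a loop like the one below. This is a minimal sketch, not the project's test script: `encode` and `decode` are hypothetical stand-ins (a UTF-8 byte codec) for the tokenizer's own calls, and the sample texts are made up for illustration.

```python
from typing import Callable, Dict, List


def encode(text: str) -> List[int]:
    """Hypothetical stand-in for a tokenizer's encode: UTF-8 bytes as token IDs."""
    return list(text.encode("utf-8"))


def decode(ids: List[int]) -> str:
    """Hypothetical stand-in for a tokenizer's decode."""
    return bytes(ids).decode("utf-8")


def roundtrip_accuracy(
    samples: Dict[str, List[str]],
    enc: Callable[[str], List[int]] = encode,
    dec: Callable[[List[int]], str] = decode,
) -> Dict[str, float]:
    """Return the percentage of texts per category where decode(encode(t)) == t."""
    report = {}
    for category, texts in samples.items():
        ok = sum(1 for t in texts if dec(enc(t)) == t)
        report[category] = 100.0 * ok / len(texts)
    return report


# Illustrative samples only; the report drew its texts from Parquet shards.
samples = {
    "Arabic": ["مرحبا بالعالم", "اللغة العربية"],
    "English": ["hello world", "tokenizers are fun"],
    "Mixed": ["hello مرحبا", "النسخة v0.3.8"],
}
accuracy = roundtrip_accuracy(samples)
for category, pct in accuracy.items():
    print(f"{category}: {pct:.2f}%")
```

Swapping in a real tokenizer would only require passing its encode and decode callables as `enc` and `dec`.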
### Timing Breakdown

- Arabic encode: 4.55s (7,329 texts/sec)
- English encode: 1.18s (28,294 texts/sec)
- Mixed encode: 10.80s (3,086 texts/sec)

---

## 2. Edge Case Tests (12 Categories)

All **58 tests passed** across 12 edge case categories:

| Category | Tests | Passed | Failed |
|----------|-------|--------|--------|
| Unicode Normalization | 6 | 6 | 0 |
| Zero-Width Characters | 6 | 6 | 0 |
| Unicode Whitespace | 6 | 6 | 0 |
| Grapheme Clusters | 6 | 6 | 0 |
| Apostrophes | 4 | 4 | 0 |
| Dashes | 4 | 4 | 0 |
| Decimal Separators | 3 | 3 | 0 |
| URLs/Emails | 4 | 4 | 0 |
| File Paths | 3 | 3 | 0 |
| Code Identifiers | 4 | 4 | 0 |
| Mixed Scripts/RTL | 6 | 6 | 0 |
| Robustness | 6 | 6 | 0 |
| **TOTAL** | **58** | **58** | **0** |

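To illustrate why categories such as Unicode normalization and zero-width characters need their own tests, the sketch below builds a few inputs of that kind and checks that a roundtrip returns them unchanged. The inputs are hypothetical examples rather than the project's test data, and `roundtrip` uses a UTF-8 byte codec as a stand-in for `decode(encode(text))`.

```python
import unicodedata

# Hypothetical edge-case inputs; not taken from the project's test suite.
nfc = unicodedata.normalize("NFC", "caf\u00e9")  # 'é' as a single code point
nfd = unicodedata.normalize("NFD", "caf\u00e9")  # 'e' followed by a combining accent
zero_width = "a\u200bb\u200dc"  # zero-width space and zero-width joiner
rtl_mixed = "abc \u0645\u0631\u062d\u0628\u0627 123"  # Latin, Arabic, digits


def roundtrip(text: str) -> str:
    """Stand-in for decode(encode(text)), using a UTF-8 byte codec (an assumption)."""
    token_ids = list(text.encode("utf-8"))
    return bytes(token_ids).decode("utf-8")


cases = {"nfc": nfc, "nfd": nfd, "zero_width": zero_width, "rtl_mixed": rtl_mixed}
results = {name: roundtrip(text) == text for name, text in cases.items()}
print(results)

# NFC and NFD look identical when rendered but are different strings,
# so a lossless roundtrip must not silently normalize one into the other.
print(len(nfc), len(nfd), nfc == nfd)
```

A tokenizer that normalized its input before encoding would fail the `nfd` case even though the decoded text looks the same on screen.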
### pytest Results

- `test_edge_cases.py`: **107 tests passed**
- `test_roundtrip.py`: **62 tests passed**

---

## 3. Performance Benchmarks

### Single-Threaded Performance

| Mode | Throughput (texts/sec) | Data Rate (MB/s) |
|------|------------------------|------------------|
| Single encode | 4,522 | 1.9 |
| Batch (1K) | 20,929 | 8.7 |
| Batch (10K) | 29,796 | 12.4 |

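The two throughput columns can be derived from a single timed pass, as in the sketch below. It is an illustration under stated assumptions: `fake_encode_batch` is a hypothetical stand-in for the tokenizer's batch encode, MB is taken to mean 10^6 bytes of UTF-8 input, and the sample text is invented.

```python
import time
from typing import Callable, List, Tuple


def measure_throughput(
    texts: List[str], encode_batch: Callable[[List[str]], list]
) -> Tuple[float, float]:
    """Return (texts/sec, MB/s) for one pass of encode_batch over texts."""
    total_bytes = sum(len(t.encode("utf-8")) for t in texts)
    start = time.perf_counter()
    encode_batch(texts)
    elapsed = time.perf_counter() - start
    return len(texts) / elapsed, total_bytes / 1e6 / elapsed


def fake_encode_batch(batch: List[str]) -> List[List[int]]:
    """Hypothetical stand-in for a tokenizer's batch encode."""
    return [list(t.encode("utf-8")) for t in batch]


texts = ["hello world مرحبا"] * 10_000
texts_per_sec, mb_per_sec = measure_throughput(texts, fake_encode_batch)
print(f"{texts_per_sec:,.0f} texts/sec, {mb_per_sec:.1f} MB/s")
```

Dividing the table's data rate by its text rate gives the average text size: for example, 12.4 MB/s over 29,796 texts/sec is roughly 416 bytes per text, if MB means 10^6 bytes.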
### Parallel Benchmark Comparison

| Tokenizer | Vocab | 1 Thread | 2 Threads | 4 Threads | 8 Threads | Peak Memory |
|-----------|-------|----------|-----------|-----------|-----------|-------------|
| **SARFTokenizer (HF)** | 64,641 | 8,010/s | 41,282/s | 39,911/s | **43,148/s** | 0.2 MB |
| SARFTokenizer (Local) | 64,641 | 7,714/s | 41,155/s | 42,578/s | 41,036/s | 0.2 MB |
| tiktoken (o200k) | 200,019 | 21,733/s | 18,489/s | 20,683/s | 11,939/s | 0.2 MB |
| tiktoken (cl100k) | 100,277 | 29,908/s | 19,036/s | 15,785/s | 11,093/s | 0.4 MB |

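A thread-count sweep like the one in this table can be structured as below. This is a sketch, not the contents of `benchmark_parallel.py`: `encode_chunk` is a hypothetical pure-Python stand-in, and the chunking strategy (one chunk per worker) is an assumption.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


def encode_chunk(chunk: List[str]) -> int:
    """Hypothetical stand-in for a batch encode; returns a token count."""
    return sum(len(t.encode("utf-8")) for t in chunk)


def bench(texts: List[str], n_threads: int) -> float:
    """Encode texts with n_threads workers and return texts/sec."""
    size = -(-len(texts) // n_threads)  # ceiling division: one chunk per worker
    chunks = [texts[i:i + size] for i in range(0, len(texts), size)]
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        token_counts = list(pool.map(encode_chunk, chunks))
    elapsed = time.perf_counter() - start
    if sum(token_counts) == 0:
        raise RuntimeError("stand-in encoder produced no tokens")
    return len(texts) / elapsed


texts = ["parallel benchmark sample"] * 10_000
throughput: Dict[int, float] = {n: bench(texts, n) for n in (1, 2, 4, 8)}
for n, rate in throughput.items():
    print(f"{n} thread(s): {rate:,.0f} texts/sec")
```

With this pure-Python stand-in, extra threads are not expected to help, because the interpreter lock serializes the work. Thread scaling generally shows up only when the encode call is native code that releases the lock while it runs.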
### Thread Scaling Efficiency

| Tokenizer | 2T Speedup | 4T Speedup | 8T Speedup |
|-----------|------------|------------|------------|
| **SARFTokenizer (HF)** | **5.15x** | **4.98x** | **5.39x** |
| SARFTokenizer (Local) | 5.33x | 5.52x | 5.32x |
| tiktoken (o200k) | 0.89x | 0.97x | 0.57x |
| tiktoken (cl100k) | 0.65x | 0.66x | 0.41x |

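As a worked example, the SARFTokenizer (HF) speedups can be reproduced from the parallel benchmark throughputs, assuming speedup is defined as throughput at N threads divided by throughput at 1 thread. The efficiency figure (speedup per thread) is an extra metric added here for illustration and does not appear in the report.

```python
# Throughput per second for SARFTokenizer (HF), copied from the parallel benchmark table.
throughput = {1: 8010, 2: 41282, 4: 39911, 8: 43148}

baseline = throughput[1]

# Speedup: throughput at N threads relative to the 1-thread run.
speedup = {n: round(rate / baseline, 2) for n, rate in throughput.items() if n > 1}

# Parallel efficiency: speedup divided by thread count (1.0 = ideal linear scaling).
efficiency = {
    n: round(rate / baseline / n, 2) for n, rate in throughput.items() if n > 1
}

print(speedup)     # {2: 5.15, 4: 4.98, 8: 5.39}
print(efficiency)  # {2: 2.58, 4: 1.25, 8: 0.67}
```

An efficiency above 1.0 means better-than-linear scaling relative to the 1-thread baseline, which is the case here at 2 and 4 threads.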
### Key Findings

1. **SARFTokenizer outperforms tiktoken at higher thread counts**
   - At 8 threads: SARFTokenizer (HF) at 43,148/s vs tiktoken cl100k at 11,093/s, about **3.9x faster**
   - At 1 thread the order is reversed: tiktoken cl100k (29,908/s) is about 3.7x faster than SARFTokenizer (HF) (8,010/s)

2. **Strong parallel scaling**
   - SARFTokenizer reaches roughly 5x its 1-thread throughput at 2 threads, with little further gain at 4 or 8 threads
   - tiktoken shows negative scaling in this benchmark (lower throughput at every multi-thread setting than at 1 thread)

3. **Memory efficient**
   - Peak memory usage: 0.2-0.4 MB for all tokenizers

---

## 4. Memory Usage

| Metric | Value |
|--------|-------|
| Peak Memory | 0.28 MB |
| Samples Processed | 10,000 |
| Memory per Sample | ~0.03 KB |

---

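The memory figures in section 4 could be gathered along the lines of the sketch below. It assumes peak memory was measured with something like Python's `tracemalloc`, which the report does not state, and `encode` is a hypothetical stand-in for the tokenizer call.

```python
import tracemalloc
from typing import List


def encode(text: str) -> List[int]:
    """Hypothetical stand-in for a tokenizer's encode call."""
    return list(text.encode("utf-8"))


texts = ["memory benchmark sample"] * 10_000

tracemalloc.start()
for text in texts:
    encode(text)  # result is discarded, so only transient allocations are live
_, peak_bytes = tracemalloc.get_traced_memory()
tracemalloc.stop()

peak_mb = peak_bytes / (1024 * 1024)
per_sample_kb = peak_bytes / 1024 / len(texts)
print(f"Peak: {peak_mb:.2f} MB, per sample: {per_sample_kb:.4f} KB")
```

The per-sample figure is simply peak memory divided by sample count: 0.28 MB over 10,000 samples is about 0.029 KB, matching the table's ~0.03 KB. Note that `tracemalloc` traces allocations made through Python's allocators, so memory allocated inside a native (for example Rust) extension would generally not be counted.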
## 5. PyPI vs Local Tokenizer Comparison

| Metric | PyPI (HF) | Local |
|--------|-----------|-------|
| Vocab Size | 64,641 | 64,641 |
| 1T Throughput | 8,010/s | 7,714/s |
| 8T Throughput | 43,148/s | 41,036/s |
| Memory | 0.2 MB | 0.2 MB |

**Conclusion:** Both produce identical results. The PyPI version is slightly faster at 1 and 8 threads (about 4-5%), though the local build leads at 4 threads (42,578/s vs 39,911/s).

---

## 6. Test Data Source

- **Location:** `/root/.cache/deeplatent/base_data/`
- **Format:** Parquet shards (748 files)
- **Total Samples Available:** ~150M texts
- **Languages:** Arabic, English, Mixed

---

## 7. Scripts Created

| Script | Purpose |
|--------|---------|
| `test_comprehensive_million.py` | Million-scale roundtrip + edge case tests |
| `benchmark_parallel.py` | Parallel benchmark comparison with tiktoken |

### Usage

```bash
# Quick test (3,000 samples)
python test_comprehensive_million.py --samples 3000

# Full test (100,000 samples)
python test_comprehensive_million.py --samples 100000 --report

# Million-scale test
python test_comprehensive_million.py --samples 1000000 --report

# Parallel benchmark
python benchmark_parallel.py --samples 10000 --threads 1 2 4 8

# Compare specific tokenizers
python benchmark_parallel.py --tokenizers sarf tiktoken
```

---

## 8. Conclusion

The SARFTokenizer from `almaghrabima/SARFTokenizer` on HuggingFace demonstrates:

1. **Perfect roundtrip accuracy** (100% on 99,999 samples)
2. **Robust edge case handling** (58/58 tests passed)
3. **Strong parallel performance** (about 5x its own 1-thread throughput; outperforms tiktoken at 2 or more threads)
4. **Minimal memory footprint** (under 0.3 MB peak)

On these tests, the tokenizer is ready for production use in Arabic and multilingual NLP applications.

---

## Appendix: Raw JSON Results

Results saved to:

- `test_comprehensive_results.json`
- `benchmark_parallel_results.json`