almaghrabima committed
Commit 301b160 · 1 Parent(s): 324a77f

Add benchmark results: 13-tokenizer comparison

Files changed (3)
  1. README.md +99 -0
  2. results.json +137 -0
  3. tokenizer_benchmark.py +331 -0
README.md ADDED
@@ -0,0 +1,99 @@
---
license: cc-by-nc-4.0
language:
- ar
- en
tags:
- tokenizer
- arabic
- morphology
- benchmark
---

# SARF Tokenizer

**SARF** (Segmentation-Aware Rewriting Framework) is a morphologically aware tokenizer for Arabic that combines unsupervised morphological segmentation (Morfessor) with Byte-Pair Encoding (BPE). It uses Unicode Private Use Area (PUA) characters to map Arabic morphemes to single tokens before BPE training, achieving strong Arabic tokenization with a compact 72K vocabulary.

## Benchmark Results

Evaluation on ~5,000 Arabic and ~5,000 English samples from the eval-test-data dataset.

| Rank | Tokenizer | Vocab | AR Fertility | AR Chars/Tok | EN Fertility | EN Chars/Tok | Parity | Score |
|------|-----------|------:|-------------:|-------------:|-------------:|-------------:|-------:|------:|
| 1 | GPT-4o | 200,019 | 2.249 | 3.111 | 1.213 | 3.492 | 0.8909 | 0.8280 |
| 2 | Gemma-3-4B | 262,145 | 2.311 | 2.864 | 1.137 | 2.911 | 0.9840 | 0.7740 |
| 3 | Fanar-1-9B | 128,256 | 2.264 | 2.812 | 1.141 | 2.880 | 0.9764 | 0.7603 |
| 4 | Hala-9B | 128,256 | 2.264 | 2.812 | 1.141 | 2.880 | 0.9764 | 0.7603 |
| 5 | Command-R-Arabic | 255,033 | 2.320 | 2.799 | 1.142 | 2.906 | 0.9631 | 0.7545 |
| 6 | Qwen3-4B | 151,669 | 2.314 | 2.599 | 1.225 | 2.964 | 0.8767 | 0.6720 |
| 7 | Qwen3-VL-4B | 151,669 | 2.314 | 2.599 | 1.225 | 2.964 | 0.8767 | 0.6720 |
| 8 | Falcon-H1-7B | 130,049 | 2.083 | 3.272 | 1.266 | 2.835 | 1.1543 | 0.6608 |
| 9 | **SARF (Ours)** | **72,195** | **1.978** | **2.832** | 1.561 | 3.163 | 0.8952 | 0.6203 |
| 10 | ALLaM-7B | 64,000 | 1.286 | 3.898 | 1.197 | 2.699 | 1.4442 | 0.5615 |
| 11 | GPT-4 | 100,277 | 4.111 | 1.430 | 1.225 | 3.452 | 0.4144 | 0.3583 |
| 12 | Mistral-7B-v0.3 | 32,768 | 5.133 | 1.131 | 1.218 | 2.702 | 0.4185 | 0.1459 |
| 13 | AceGPT-13B | 32,000 | 5.236 | 1.110 | 1.237 | 2.691 | 0.4124 | 0.1274 |

### Metric Definitions

- **AR Fertility**: Arabic tokens per word (lower = better)
- **AR Chars/Tok**: Arabic characters per token (higher = better compression)
- **EN Fertility**: English tokens per word (lower = better)
- **EN Chars/Tok**: English characters per token (higher = better compression)
- **Parity**: AR Chars/Tok divided by EN Chars/Tok (closer to 1.0 = more balanced)
- **Score**: Composite metric weighting Arabic efficiency, English efficiency, and parity in equal thirds, with each sub-metric min-max normalized across all tokenizers

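The metric definitions above can be sketched directly (a minimal illustration mirroring the logic of `tokenizer_benchmark.py`; the two-character chunker below is a hypothetical stand-in for a real tokenizer's `encode`):

```python
import re

AR_WORD = re.compile(r'[\u0600-\u06FF]+')  # runs of Arabic letters
EN_WORD = re.compile(r'[a-zA-Z]+')         # runs of Latin letters

def metrics(encode, ar_texts, en_texts):
    """Fertility (tokens per word), chars/token, and parity for one tokenizer."""
    def side(texts, word_re):
        chars = tokens = words = word_tokens = 0
        for text in texts:
            chars += len(text)
            tokens += len(encode(text))
            ws = word_re.findall(text)
            words += len(ws)
            word_tokens += sum(len(encode(w)) for w in ws)
        fertility = word_tokens / words if words else 0.0
        cpt = chars / tokens if tokens else 0.0
        return fertility, cpt

    ar_fert, ar_cpt = side(ar_texts, AR_WORD)
    en_fert, en_cpt = side(en_texts, EN_WORD)
    parity = ar_cpt / en_cpt if en_cpt else 0.0  # 1.0 = perfectly balanced
    return {"ar_fertility": ar_fert, "ar_cpt": ar_cpt,
            "en_fertility": en_fert, "en_cpt": en_cpt, "parity": parity}

# Toy check: a "tokenizer" that emits fixed two-character chunks
pair_encode = lambda s: [s[i:i + 2] for i in range(0, len(s), 2)]
m = metrics(pair_encode, ["كتب الولد"], ["the boy wrote"])
```

With this stand-in, the 9-character Arabic sample yields 5 tokens (1.8 chars/token) and 2.5 tokens per word, showing how the per-word and per-character views differ.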
### Key Findings

- SARF reaches an Arabic fertility of **1.978 tokens/word** while keeping near-balanced parity. The only lower fertility in the table, ALLaM-7B's 1.286, comes with a strongly Arabic-biased parity of 1.444, suggesting that morphological preprocessing enables efficient Arabic tokenization without massive vocabularies or language bias.
- With only a **72K vocabulary**, SARF achieves Arabic compression (2.832 chars/token) competitive with tokenizers 2-3x its size.
- SARF has **near-balanced parity** (0.895), meaning Arabic and English text are tokenized with similar efficiency, unlike GPT-4 (0.414) or ALLaM (1.444), which show strong language bias.

## Tokenizers Compared

| Tokenizer | Model | Source |
|-----------|-------|--------|
| SARF | DeepLatent | [almaghrabima/deeplatent-tokenizer-parity](https://huggingface.co/almaghrabima/deeplatent-tokenizer-parity) |
| GPT-4o | o200k_base | tiktoken |
| GPT-4 | cl100k_base | tiktoken |
| ALLaM-7B | humain-ai/ALLaM-7B-Instruct-preview | HuggingFace |
| AceGPT-13B | FreedomIntelligence/AceGPT-13B-chat | HuggingFace |
| Gemma-3-4B | google/gemma-3-4b-it | HuggingFace |
| Command-R Arabic | CohereLabs/c4ai-command-r7b-arabic-02-2025 | HuggingFace |
| Fanar-1-9B | QCRI/Fanar-1-9B-Instruct | HuggingFace |
| Hala-9B | hammh0a/Hala-9B | HuggingFace |
| Qwen3-4B | Qwen/Qwen3-4B-Instruct-2507 | HuggingFace |
| Qwen3-VL-4B | Qwen/Qwen3-VL-4B-Instruct | HuggingFace |
| Mistral-7B-v0.3 | mistralai/Mistral-7B-Instruct-v0.3 | HuggingFace |
| Falcon-H1-7B | tiiuae/Falcon-H1-7B-Instruct | HuggingFace |

## How SARF Works

SARF uses a morphologically aware preprocessing pipeline before BPE:

1. **Morfessor** segments Arabic words into morphemes, unsupervised
2. **Morpheme-to-PUA mapping** assigns each morpheme a Unicode Private Use Area character
3. **ByteRewriter** rewrites Arabic text so morphemes become single characters
4. **BPE** trains on the rewritten text, naturally learning morpheme-level tokens

This approach achieves strong Arabic compression without inflating the vocabulary for English or other languages.

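Steps 2-3 can be illustrated with a toy morpheme map (the morphemes and PUA code points below are hypothetical examples for illustration, not SARF's actual Morfessor-derived map):

```python
# Toy morpheme → PUA map; the real map is learned by Morfessor
MORF_MAP = {
    "\u0627\u0644": "\uE000",                  # prefix "ال" (the)
    "\u0643\u062A\u0627\u0628": "\uE001",      # stem "كتاب" (book)
}

def rewrite(text: str, morf_map: dict) -> str:
    """Replace known morphemes with single PUA characters, longest first,
    so that BPE later sees each morpheme as one atomic symbol."""
    for morpheme in sorted(morf_map, key=len, reverse=True):
        text = text.replace(morpheme, morf_map[morpheme])
    return text

word = "\u0627\u0644\u0643\u062A\u0627\u0628"  # "الكتاب" (the book), 6 characters
rewritten = rewrite(word, MORF_MAP)            # collapses to 2 PUA characters
```

After rewriting, a 6-character word is 2 symbols long, and BPE can trivially merge the two PUA characters into a single word-level token, which is the mechanism behind SARF's low Arabic fertility.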
## Files

- `results.json` — Raw benchmark data
- `tokenizer_benchmark.py` — Benchmark script (reproduces results)

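The benchmark can be re-run from the script's CLI. The flags below come from the script's argument parser; the data path is a placeholder to adapt to your environment:

```shell
# Quick sanity check on 10 samples per language
python tokenizer_benchmark.py --dry_run

# Full run: 5,000 samples per language, explicit data dir and output path
python tokenizer_benchmark.py \
    --data_dir /path/to/eval/parquet \
    --num_samples 5000 \
    --output results.json
```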
## Citation

```bibtex
@misc{sarf2025,
  title={SARF: Segmentation-Aware Rewriting Framework for Arabic Tokenization},
  author={Al-Maghrabima},
  year={2025},
  url={https://huggingface.co/almaghrabima/SARF-Tokenizer}
}
```

## License

CC-BY-NC-4.0
results.json ADDED
@@ -0,0 +1,137 @@
{
  "num_ar_samples": 4998,
  "num_en_samples": 5000,
  "results": [
    {
      "name": "GPT-4o",
      "vocab_size": 200019,
      "ar_fertility": 2.2489,
      "ar_chars_per_token": 3.1108,
      "en_fertility": 1.2132,
      "en_chars_per_token": 3.4918,
      "parity": 0.8909,
      "score": 0.828
    },
    {
      "name": "Gemma-3-4B",
      "vocab_size": 262145,
      "ar_fertility": 2.3109,
      "ar_chars_per_token": 2.8642,
      "en_fertility": 1.1368,
      "en_chars_per_token": 2.9107,
      "parity": 0.984,
      "score": 0.774
    },
    {
      "name": "Fanar-1-9B",
      "vocab_size": 128256,
      "ar_fertility": 2.2643,
      "ar_chars_per_token": 2.8119,
      "en_fertility": 1.1412,
      "en_chars_per_token": 2.88,
      "parity": 0.9764,
      "score": 0.7603
    },
    {
      "name": "Hala-9B",
      "vocab_size": 128256,
      "ar_fertility": 2.2643,
      "ar_chars_per_token": 2.8119,
      "en_fertility": 1.1412,
      "en_chars_per_token": 2.88,
      "parity": 0.9764,
      "score": 0.7603
    },
    {
      "name": "Command-R-Arabic",
      "vocab_size": 255033,
      "ar_fertility": 2.3196,
      "ar_chars_per_token": 2.7987,
      "en_fertility": 1.1422,
      "en_chars_per_token": 2.906,
      "parity": 0.9631,
      "score": 0.7545
    },
    {
      "name": "Qwen3-4B",
      "vocab_size": 151669,
      "ar_fertility": 2.314,
      "ar_chars_per_token": 2.5988,
      "en_fertility": 1.2247,
      "en_chars_per_token": 2.9641,
      "parity": 0.8767,
      "score": 0.672
    },
    {
      "name": "Qwen3-VL-4B",
      "vocab_size": 151669,
      "ar_fertility": 2.314,
      "ar_chars_per_token": 2.5988,
      "en_fertility": 1.2247,
      "en_chars_per_token": 2.9641,
      "parity": 0.8767,
      "score": 0.672
    },
    {
      "name": "Falcon-H1-7B",
      "vocab_size": 130049,
      "ar_fertility": 2.0829,
      "ar_chars_per_token": 3.2722,
      "en_fertility": 1.2661,
      "en_chars_per_token": 2.8348,
      "parity": 1.1543,
      "score": 0.6608
    },
    {
      "name": "SARF (Ours)",
      "vocab_size": 72195,
      "ar_fertility": 1.9778,
      "ar_chars_per_token": 2.8319,
      "en_fertility": 1.5609,
      "en_chars_per_token": 3.1635,
      "parity": 0.8952,
      "score": 0.6203
    },
    {
      "name": "ALLaM-7B",
      "vocab_size": 64000,
      "ar_fertility": 1.2856,
      "ar_chars_per_token": 3.8978,
      "en_fertility": 1.1973,
      "en_chars_per_token": 2.699,
      "parity": 1.4442,
      "score": 0.5615
    },
    {
      "name": "GPT-4",
      "vocab_size": 100277,
      "ar_fertility": 4.1107,
      "ar_chars_per_token": 1.4303,
      "en_fertility": 1.2247,
      "en_chars_per_token": 3.4518,
      "parity": 0.4144,
      "score": 0.3583
    },
    {
      "name": "Mistral-7B-v0.3",
      "vocab_size": 32768,
      "ar_fertility": 5.1329,
      "ar_chars_per_token": 1.1307,
      "en_fertility": 1.2185,
      "en_chars_per_token": 2.7016,
      "parity": 0.4185,
      "score": 0.1459
    },
    {
      "name": "AceGPT-13B",
      "vocab_size": 32000,
      "ar_fertility": 5.236,
      "ar_chars_per_token": 1.1098,
      "en_fertility": 1.2368,
      "en_chars_per_token": 2.6909,
      "parity": 0.4124,
      "score": 0.1274
    }
  ],
  "markdown_table": "| Rank | Tokenizer | Vocab | AR Fertility | AR Chars/Tok | EN Fertility | EN Chars/Tok | Parity | Score |\n|------|-----------|------:|-------------:|-------------:|-------------:|-------------:|-------:|------:|\n| 1 | GPT-4o | 200,019 | 2.249 | 3.111 | 1.213 | 3.492 | 0.8909 | 0.8280 |\n| 2 | Gemma-3-4B | 262,145 | 2.311 | 2.864 | 1.137 | 2.911 | 0.9840 | 0.7740 |\n| 3 | Fanar-1-9B | 128,256 | 2.264 | 2.812 | 1.141 | 2.880 | 0.9764 | 0.7603 |\n| 4 | Hala-9B | 128,256 | 2.264 | 2.812 | 1.141 | 2.880 | 0.9764 | 0.7603 |\n| 5 | Command-R-Arabic | 255,033 | 2.320 | 2.799 | 1.142 | 2.906 | 0.9631 | 0.7545 |\n| 6 | Qwen3-4B | 151,669 | 2.314 | 2.599 | 1.225 | 2.964 | 0.8767 | 0.6720 |\n| 7 | Qwen3-VL-4B | 151,669 | 2.314 | 2.599 | 1.225 | 2.964 | 0.8767 | 0.6720 |\n| 8 | Falcon-H1-7B | 130,049 | 2.083 | 3.272 | 1.266 | 2.835 | 1.1543 | 0.6608 |\n| 9 | SARF (Ours) | 72,195 | 1.978 | 2.832 | 1.561 | 3.163 | 0.8952 | 0.6203 |\n| 10 | ALLaM-7B | 64,000 | 1.286 | 3.898 | 1.197 | 2.699 | 1.4442 | 0.5615 |\n| 11 | GPT-4 | 100,277 | 4.111 | 1.430 | 1.225 | 3.452 | 0.4144 | 0.3583 |\n| 12 | Mistral-7B-v0.3 | 32,768 | 5.133 | 1.131 | 1.218 | 2.702 | 0.4185 | 0.1459 |\n| 13 | AceGPT-13B | 32,000 | 5.236 | 1.110 | 1.237 | 2.691 | 0.4124 | 0.1274 |"
}
tokenizer_benchmark.py ADDED
@@ -0,0 +1,331 @@
"""
Multi-tokenizer comparison benchmark.

Evaluates SARF against 12 other tokenizers on Arabic+English text,
computing fertility, chars/token, parity, and a composite score.
"""

import os, sys, re, json, argparse, time

sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))

# Load .env file
_env_path = os.path.join(os.path.dirname(os.path.dirname(os.path.abspath(__file__))), ".env")
if os.path.exists(_env_path):
    with open(_env_path) as _f:
        for _line in _f:
            _line = _line.strip()
            if _line and not _line.startswith("#") and "=" in _line:
                _k, _v = _line.split("=", 1)
                os.environ.setdefault(_k.strip(), _v.strip())

# Disable hf_transfer if not installed
try:
    import hf_transfer  # noqa: F401
except ImportError:
    os.environ.pop("HF_HUB_ENABLE_HF_TRANSFER", None)

import pyarrow.parquet as pq
import glob as globmod

from scripts.rewrite_bytes import ByteRewriter


# ── Tokenizer wrappers ──────────────────────────────────────────────

class SarfTokenizer:
    def __init__(self, tokenizer_dir, morf_map_path):
        from transformers import PreTrainedTokenizerFast
        self._tok = PreTrainedTokenizerFast(
            tokenizer_file=os.path.join(tokenizer_dir, "tokenizer.json")
        )
        self._rewriter = ByteRewriter(morf_map_path)

    def encode(self, text):
        return self._tok.encode(self._rewriter.rewrite_text(text), add_special_tokens=False)

    @property
    def vocab_size(self):
        return len(self._tok)

    @property
    def name(self):
        return "SARF (Ours)"


class TiktokenTokenizer:
    def __init__(self, encoding_name, display_name=None):
        import tiktoken
        self._enc = tiktoken.get_encoding(encoding_name)
        self._name = display_name or encoding_name

    def encode(self, text):
        return self._enc.encode(text, allowed_special="all")

    @property
    def vocab_size(self):
        return self._enc.n_vocab

    @property
    def name(self):
        return self._name


class HFTokenizer:
    def __init__(self, model_id, display_name=None):
        from transformers import AutoTokenizer
        try:
            self._tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
        except Exception:
            self._tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True, use_fast=False)
        self._name = display_name or model_id.split("/")[-1]

    def encode(self, text):
        return self._tok.encode(text, add_special_tokens=False)

    @property
    def vocab_size(self):
        return len(self._tok)

    @property
    def name(self):
        return self._name


# ── Tokenizer registry ──────────────────────────────────────────────

TOKENIZER_DEFS = [
    # (display_name, type, source)
    ("SARF (Ours)", "sarf", None),
    ("GPT-4o", "tiktoken", "o200k_base"),
    ("GPT-4", "tiktoken", "cl100k_base"),
    ("ALLaM-7B", "hf", "humain-ai/ALLaM-7B-Instruct-preview"),
    ("AceGPT-13B", "hf", "FreedomIntelligence/AceGPT-13B-chat"),
    ("Gemma-3-4B", "hf", "google/gemma-3-4b-it"),
    ("Command-R-Arabic", "hf", "CohereLabs/c4ai-command-r7b-arabic-02-2025"),
    ("Fanar-1-9B", "hf", "QCRI/Fanar-1-9B-Instruct"),
    ("Hala-9B", "hf", "hammh0a/Hala-9B"),
    ("Qwen3-4B", "hf", "Qwen/Qwen3-4B-Instruct-2507"),
    ("Qwen3-VL-4B", "hf", "Qwen/Qwen3-VL-4B-Instruct"),
    ("Mistral-7B-v0.3", "hf", "mistralai/Mistral-7B-Instruct-v0.3"),
    ("Falcon-H1-7B", "hf", "tiiuae/Falcon-H1-7B-Instruct"),
]


def load_all_tokenizers(tokenizer_dir, morf_map_path):
    """Load all tokenizers. Returns list of wrapper objects."""
    tokenizers = []
    for display_name, typ, source in TOKENIZER_DEFS:
        print(f"Loading {display_name}...", end=" ", flush=True)
        t0 = time.time()
        try:
            if typ == "sarf":
                tok = SarfTokenizer(tokenizer_dir, morf_map_path)
            elif typ == "tiktoken":
                tok = TiktokenTokenizer(source, display_name)
            elif typ == "hf":
                tok = HFTokenizer(source, display_name)
            else:
                raise ValueError(f"Unknown type: {typ}")
            print(f"OK (vocab={tok.vocab_size:,}, {time.time()-t0:.1f}s)")
            tokenizers.append(tok)
        except Exception as e:
            print(f"FAILED: {e}")
    return tokenizers


# ── Data loading ─────────────────────────────────────────────────────

AR_DETECT = re.compile(r'[\u0600-\u06FF]')


def load_samples(data_dir, num_ar=5000, num_en=5000):
    parquet_files = sorted(globmod.glob(os.path.join(data_dir, '*.parquet')))
    ar_samples, en_samples = [], []
    for filepath in parquet_files:
        if len(ar_samples) >= num_ar and len(en_samples) >= num_en:
            break
        pf = pq.ParquetFile(filepath)
        for rg_idx in range(pf.num_row_groups):
            rg = pf.read_row_group(rg_idx)
            for text in rg.column("text").to_pylist():
                if len(text) < 100:
                    continue
                ar_chars = len(AR_DETECT.findall(text))
                ar_ratio = ar_chars / len(text)
                if ar_ratio > 0.3 and len(ar_samples) < num_ar:
                    ar_samples.append(text[:2000])
                elif ar_ratio < 0.05 and len(en_samples) < num_en:
                    en_samples.append(text[:2000])
            if len(ar_samples) >= num_ar and len(en_samples) >= num_en:
                break
    print(f"Loaded {len(ar_samples)} Arabic, {len(en_samples)} English samples")
    return ar_samples, en_samples


# ── Metrics ──────────────────────────────────────────────────────────

AR_WORD = re.compile(r'[\u0600-\u06FF]+')
EN_WORD = re.compile(r'[a-zA-Z]+')


def compute_metrics(tokenizer, ar_texts, en_texts):
    """Compute fertility, chars/token, and parity for one tokenizer."""
    ar_total_chars = ar_total_tokens = ar_total_words = ar_total_word_tokens = 0
    for text in ar_texts:
        tokens = tokenizer.encode(text)
        ar_total_chars += len(text)
        ar_total_tokens += len(tokens)
        words = AR_WORD.findall(text)
        ar_total_words += len(words)
        for w in words:
            ar_total_word_tokens += len(tokenizer.encode(w))

    en_total_chars = en_total_tokens = en_total_words = en_total_word_tokens = 0
    for text in en_texts:
        tokens = tokenizer.encode(text)
        en_total_chars += len(text)
        en_total_tokens += len(tokens)
        words = EN_WORD.findall(text)
        en_total_words += len(words)
        for w in words:
            en_total_word_tokens += len(tokenizer.encode(w))

    ar_fertility = ar_total_word_tokens / ar_total_words if ar_total_words else 0
    ar_cpt = ar_total_chars / ar_total_tokens if ar_total_tokens else 0
    en_fertility = en_total_word_tokens / en_total_words if en_total_words else 0
    en_cpt = en_total_chars / en_total_tokens if en_total_tokens else 0
    parity = ar_cpt / en_cpt if en_cpt else 0

    return {
        "name": tokenizer.name,
        "vocab_size": tokenizer.vocab_size,
        "ar_fertility": round(ar_fertility, 4),
        "ar_chars_per_token": round(ar_cpt, 4),
        "en_fertility": round(en_fertility, 4),
        "en_chars_per_token": round(en_cpt, 4),
        "parity": round(parity, 4),
    }


# ── Composite score ──────────────────────────────────────────────────

def compute_scores(results):
    """Add normalized composite score to each result dict (in-place).

    Score = equal thirds of Arabic, English, and parity sub-scores.
    Arabic sub = 50% fertility (lower=better, inverted) + 50% chars/token (higher=better)
    English sub = same
    Parity sub = closeness to 1.0 (lower |1-p| = better, inverted)
    Each sub-metric min-max normalized across tokenizers.
    """
    def minmax(vals, invert=False):
        lo, hi = min(vals), max(vals)
        if hi == lo:
            return [0.5] * len(vals)
        normed = [(v - lo) / (hi - lo) for v in vals]
        if invert:
            normed = [1.0 - n for n in normed]
        return normed

    ar_fert_n = minmax([r["ar_fertility"] for r in results], invert=True)
    ar_cpt_n = minmax([r["ar_chars_per_token"] for r in results])
    en_fert_n = minmax([r["en_fertility"] for r in results], invert=True)
    en_cpt_n = minmax([r["en_chars_per_token"] for r in results])
    parity_dev = [abs(1.0 - r["parity"]) for r in results]
    parity_n = minmax(parity_dev, invert=True)

    for i, r in enumerate(results):
        ar_sub = 0.5 * ar_fert_n[i] + 0.5 * ar_cpt_n[i]
        en_sub = 0.5 * en_fert_n[i] + 0.5 * en_cpt_n[i]
        score = (ar_sub + en_sub + parity_n[i]) / 3.0
        r["score"] = round(score, 4)


# ── Display ──────────────────────────────────────────────────────────

def print_table(results):
    results_sorted = sorted(results, key=lambda r: r["score"], reverse=True)
    header = f"{'Rank':<5} {'Tokenizer':<22} {'Vocab':>9} {'AR Fert':>9} {'AR C/T':>9} {'EN Fert':>9} {'EN C/T':>9} {'Parity':>9} {'Score':>9}"
    print("\n" + "=" * len(header))
    print("TOKENIZER BENCHMARK RESULTS")
    print("=" * len(header))
    print(header)
    print("-" * len(header))
    for rank, r in enumerate(results_sorted, 1):
        print(f"{rank:<5} {r['name']:<22} {r['vocab_size']:>9,} {r['ar_fertility']:>9.3f} {r['ar_chars_per_token']:>9.3f} {r['en_fertility']:>9.3f} {r['en_chars_per_token']:>9.3f} {r['parity']:>9.4f} {r['score']:>9.4f}")
    print("=" * len(header))
    print("AR Fert = Arabic tokens/word (lower=better)")
    print("AR C/T  = Arabic chars/token (higher=better)")
    print("EN Fert = English tokens/word (lower=better)")
    print("EN C/T  = English chars/token (higher=better)")
    print("Parity  = AR_C/T / EN_C/T (closer to 1.0=better)")
    print("Score   = composite (higher=better)\n")


def results_to_markdown(results):
    """Return a markdown table string for the results."""
    results_sorted = sorted(results, key=lambda r: r["score"], reverse=True)
    lines = [
        "| Rank | Tokenizer | Vocab | AR Fertility | AR Chars/Tok | EN Fertility | EN Chars/Tok | Parity | Score |",
        "|------|-----------|------:|-------------:|-------------:|-------------:|-------------:|-------:|------:|",
    ]
    for rank, r in enumerate(results_sorted, 1):
        lines.append(
            f"| {rank} | {r['name']} | {r['vocab_size']:,} | {r['ar_fertility']:.3f} | {r['ar_chars_per_token']:.3f} | {r['en_fertility']:.3f} | {r['en_chars_per_token']:.3f} | {r['parity']:.4f} | {r['score']:.4f} |"
        )
    return "\n".join(lines)


# ── Main ─────────────────────────────────────────────────────────────

def main():
    parser = argparse.ArgumentParser(description="Multi-tokenizer comparison benchmark")
    parser.add_argument("--data_dir", default="/root/.cache/Deeplatent/eval_1b/data")
    parser.add_argument("--tokenizer_dir", default="/root/.cache/deeplatent/tokenizer_parity")
    parser.add_argument("--morf_map_path", default="/root/.cache/deeplatent/morfessor_models/morf_map.json")
    parser.add_argument("--num_samples", type=int, default=5000)
    parser.add_argument("--output", default="benchmark_results.json")
    parser.add_argument("--dry_run", action="store_true", help="Test on 10 samples first")
    args = parser.parse_args()

    # Load tokenizers
    print("Loading tokenizers...")
    tokenizers = load_all_tokenizers(args.tokenizer_dir, args.morf_map_path)
    print(f"\nLoaded {len(tokenizers)} tokenizers successfully.\n")

    # Load data
    n = 10 if args.dry_run else args.num_samples
    print(f"Loading {n} samples per language...")
    ar_texts, en_texts = load_samples(args.data_dir, n, n)

    # Evaluate
    results = []
    for tok in tokenizers:
        print(f"Evaluating {tok.name}...", end=" ", flush=True)
        t0 = time.time()
        m = compute_metrics(tok, ar_texts, en_texts)
        print(f"done ({time.time()-t0:.1f}s)")
        results.append(m)

    # Compute composite scores
    compute_scores(results)

    # Display
    print_table(results)

    # Save
    output = {
        "num_ar_samples": len(ar_texts),
        "num_en_samples": len(en_texts),
        "results": sorted(results, key=lambda r: r["score"], reverse=True),
        "markdown_table": results_to_markdown(results),
    }
    with open(args.output, 'w') as f:
        json.dump(output, f, indent=2, ensure_ascii=False)
    print(f"Results saved to {args.output}")


if __name__ == "__main__":
    main()