# Code Documentation: Darija Tokenizer Benchmark

This document describes every script, data file, and output artifact in the benchmark codebase.

---

## Overview

The benchmark pipeline consists of four stages: **training**, **evaluation**, **analysis**, and **reporting**. Each stage is implemented by standalone Python scripts. The diagram below shows the data flow:

```
OiQ/daa-pairs (dataset)
        │
        ▼
┌──────────────────┐     ┌──────────────────────┐
│  script.py       │────▶│  results/tokenizers/ │  (60 raw JSON files)
│  (train 8K-32K)  │     │  results/transformers│  (transformers format)
└──────────────────┘     └──────────────────────┘
        │                          ▲
        ▼                          │
┌──────────────────┐     ┌──────────────────────┐
│train_large_vocab │────▶│  80K + 110K configs  │
│train_remaining   │     │  (16 additional)     │
└──────────────────┘     └──────────────────────┘
                               │
                               ▼
┌───────────────────────────────────────────────┐
│  EVALUATION SCRIPTS                           │
│  ├── eval_test_set.py       → test_set_results│
│  ├── eval_new_and_append.py → append 80K/110K │
│  ├── eval_missing.py        → fill gaps       │
│  ├── eval_morph_large.py    → morph 80K/110K  │
│  ├── bootstrap_test_set.py  → 95% CIs         │
│  ├── eval_all_externals.py  → external comp.  │
│  ├── eval_codeswitch_...    → code-switching  │
│  └── eval_doda_independent  → DODa validation │
└───────────────────────────────────────────────┘
        │
        ▼
┌──────────────────┐     ┌──────────────────────┐
│  regen_figures   │────▶│  figures/*.png       │
│  gen_report      │────▶│  benchmark_report.md │
│ verify_arithmetic│────▶│  (stdout validation) │
└──────────────────┘     └──────────────────────┘
```

---

## Training Scripts

### `script.py`: Master Benchmark Pipeline
**Lines:** ~2032 | **Type:** Runs at import (no `__main__` guard)

The monolithic entry point. Loads the `OiQ/daa-pairs` dataset, trains 24 tokenizers (4 algorithms x 2 architectures x 3 vocab sizes: 8K, 16K, 32K), evaluates them on the full training+test corpus, computes morphological metrics via Farasa, generates bootstrap confidence intervals, and produces all plots and reports.

**Key class:** `ProductionMetricsEvaluator`, which implements script detection, tokenization, Gini coefficient, and all metric computations.

**Inputs:**
- `OiQ/daa-pairs` dataset (via `huggingface_hub`)
- `BenchmarkConfig` dataclass (vocab sizes, algorithms, hyperparameters)

**Outputs:**
- `results/tokenizers/*.json`: 48 raw tokenizer files (24 shared + 24 concat halves)
- `results/transformers_tokenizers/`: transformers-compatible exports
- `results/tokenizer_results.csv` / `.json`: full metrics with morphological data
- `results/bootstrap_ci.csv`: bootstrap CIs
- `results/benchmark_report.md`: auto-generated summary
- `results/morphology/farasa_segmentations.json`: cached Farasa segmentations (~99 MB)
- `results/plots/*.png`: all visualization figures
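
The Gini coefficient listed among the `ProductionMetricsEvaluator` metrics can be illustrated with a minimal sketch. This assumes the metric is computed over per-token frequency counts; the function name `gini` and its input format are hypothetical and not taken from `script.py`.

```python
def gini(counts):
    """Gini coefficient of non-negative counts (0.0 means perfectly even usage)."""
    values = sorted(c for c in counts if c >= 0)
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        return 0.0
    # Closed form on ascending-sorted values x_1..x_n:
    # G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n
    weighted = sum(i * x for i, x in enumerate(values, start=1))
    return 2.0 * weighted / (n * total) - (n + 1) / n


if __name__ == "__main__":
    # Illustrative counts only, not benchmark data.
    print(gini([5, 5, 5, 5]))   # evenly used vocabulary
    print(gini([0, 0, 0, 10]))  # one token carries all the mass
```

A value near 0 would indicate that vocabulary entries are used about equally often, while a value near 1 would indicate that a few tokens dominate.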

---

### `train_large_vocab.py`: Train 80K/110K Tokenizers
**Lines:** ~146

Trains additional tokenizers at 80K and 110K vocabulary sizes to match DarijaBERT-ar (80K) and DarijaBERT-az (110K) for fair head-to-head comparison.

**Inputs:** `results/corpora/train_{ar,az}.txt`

**Outputs:** `results/tokenizers/{shared,concat_ar,concat_az}_{algo}_{80000,110000}.json`

---

### `train_remaining.py`: Train Remaining Tokenizers
**Lines:** ~134

Fills in the last missing tokenizer configurations (3 shared 110K + 6 concat 110K + 1 concat 80K BBPE) that `train_large_vocab.py` did not cover.

**Inputs:** `results/corpora/train_{ar,az}.txt`

**Outputs:** Remaining `results/tokenizers/*.json` files

---

### `retrain_missing_and_compare.py`: Retrain + Full Re-evaluation
**Lines:** ~558

Retrains 4 missing concat 32K tokenizers, exports to transformers format, re-evaluates all 28 configs, fixes a WordPiece exact-match bug, and runs the external comparison pipeline.

**Inputs:** Training corpora, HF external models

**Outputs:** Updated tokenizer JSONs, `external_comparison.csv`, external comparison plot

---

## Evaluation Scripts

### `eval_test_set.py`: Test-Set Evaluation (Single Source of Truth)
**Lines:** ~227

Re-evaluates all tokenizers on the held-out **test set only** (11,282 sentences per script). This is the authoritative evaluation used in all paper tables and percentage claims.

**Key function:** `normalize_decode()`, which fixes Metaspace double-space artifacts in WordPiece decoders.

**Inputs:**
- `results/tokenizers/*.json`
- `results/corpora/test_{ar,az,mi}.txt`

**Outputs:** `results/test_set_results.csv` / `.json` (40 rows)
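
The body of `normalize_decode()` is not reproduced in this document. As a rough sketch of what such a helper might do, the version below collapses whitespace runs before an exact-match comparison; the regular expression and the `exact_match` helper are assumptions, not the script's actual code.

```python
import re

_WS_RUN = re.compile(r"\s+")


def normalize_decode(text: str) -> str:
    """Collapse runs of whitespace to a single space and trim both ends."""
    return _WS_RUN.sub(" ", text).strip()


def exact_match(original: str, decoded: str) -> bool:
    """Hypothetical round-trip check applied after normalization."""
    return normalize_decode(original) == normalize_decode(decoded)


if __name__ == "__main__":
    # A decoder that emits a doubled space would otherwise fail exact match.
    print(exact_match("salam 3likom", "salam  3likom "))
```

Normalizing both sides keeps the comparison symmetric, so a spacing artifact in the decoder is not counted as a round-trip failure.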

---

### `eval_new_and_append.py`: Append 80K/110K Results
**Lines:** ~144

Evaluates the newly trained 80K/110K tokenizers and appends their rows to `test_set_results.csv`.

**Inputs:** `test_set_results.csv`, tokenizers, test corpora

**Outputs:** Updated `test_set_results.csv` (grows from 24 to 40 rows)

---

### `eval_missing.py`: Fill Single Gap
**Lines:** ~124

Evaluates the one remaining missing tokenizer (`concat_bbpe_55000`) and merges it into the results CSV.

---

### `eval_morph_large.py`: Morphological Metrics for 80K/110K
**Lines:** ~297

Computes morphological fidelity metrics (edit distance and consistency F1) for the 16 large-vocabulary tokenizers (80K, 110K) that were not covered by the original `script.py` morph evaluation. Uses the same Farasa cache and identical algorithms.

**Inputs:**
- `results/morphology/farasa_segmentations.json`
- `results/corpora/test_ar.txt`
- 16 tokenizer JSON files

**Outputs:** `results/morph_large_vocab_results.csv` (16 rows)

---

### `bootstrap_test_set.py`: Bootstrap Confidence Intervals
**Lines:** ~163

Computes 95% bootstrap confidence intervals (500 resamples) for fertility and CPT on the test set.

**Inputs:** Tokenizers, `test_{ar,az,mi}.txt`

**Outputs:** `results/bootstrap_ci_test_set.csv` (24 rows for 8K-32K configs)
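
A percentile bootstrap with 500 resamples can be sketched as below. This is a simplified illustration, not the code in `bootstrap_test_set.py`: it assumes fertility is total tokens divided by total whitespace-separated words, resamples at sentence level, and uses a fixed seed. The names `fertility` and `bootstrap_ci` are hypothetical.

```python
import random


def fertility(token_counts, word_counts):
    """Corpus-level fertility: total tokens / total words (assumed definition)."""
    return sum(token_counts) / sum(word_counts)


def bootstrap_ci(token_counts, word_counts, n_resamples=500, alpha=0.05, seed=0):
    """Percentile bootstrap CI for fertility, resampling sentences with replacement."""
    rng = random.Random(seed)
    n = len(token_counts)
    stats = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(
            fertility([token_counts[i] for i in idx], [word_counts[i] for i in idx])
        )
    stats.sort()
    lo = stats[int((alpha / 2) * n_resamples)]
    hi = stats[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lo, hi


if __name__ == "__main__":
    # Illustrative per-sentence counts only, not benchmark data.
    tokens = [7, 12, 5, 9, 14, 6]
    words = [4, 7, 3, 5, 8, 4]
    print(fertility(tokens, words), bootstrap_ci(tokens, words))
```

Resampling whole sentences rather than individual tokens keeps each sentence's token and word counts paired, which is what a ratio statistic like fertility needs.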

---

## External Comparison Scripts

### `eval_all_externals.py`: Evaluate 9 External Tokenizers
**Lines:** ~281

Evaluates 9 external Arabic/Darija tokenizers from HuggingFace (CaMeLBERT-MSA, Asafaya-BERT, Aranizer-SP-86k, B2BERT, DarijaBERT-ar, DarijaBERT-az, Darija-Tokenizer, Translit-Darija, Qwen2.5-Darija) alongside our best 3 tokenizers.

**Inputs:** HF model repos (requires `HF_TOKEN`), test corpora

**Outputs:** `results/external_comparison.csv` / `.json`, comparison plot

---

### `compare_with_external.py`: External Comparison (Earlier Version)
**Lines:** ~269

Earlier version of the external comparison, comparing our 8K/16K/32K tokenizers against 5 external models.

---

### `eval_and_compare.py`: Combined Evaluation + Comparison
**Lines:** ~277

Combines internal evaluation and external comparison into a single pipeline run.

---

### `eval_codeswitch_and_new_baselines.py`: Code-Switching Evaluation
**Lines:** ~373

Evaluates tokenizers on mixed-script (code-switched) texts as a separate category. Also adds the `atlasia/darija_bpe_tokenizer` baseline and evaluates all tokenizers on DODa.

**Outputs:** `results/codeswitch_results.csv` / `.json`
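
Separating mixed-script text from single-script text requires some form of script detection. The sketch below shows one plausible approach based on character counts. It is not the repository's `detect_script()`: the 0.9 threshold, the Unicode range, and the reading of the `ar` / `az` / `mi` labels as Arabic script, Latin-script Arabizi, and mixed are all assumptions.

```python
def detect_script(text: str, threshold: float = 0.9) -> str:
    """Classify text as 'ar', 'az', 'mi', or 'other' by letter counts."""
    # Letters in the basic Arabic Unicode block.
    arabic = sum(1 for ch in text if "\u0600" <= ch <= "\u06FF")
    # ASCII letters; Arabizi digits such as 3 or 7 are ignored here.
    latin = sum(1 for ch in text if ch.isascii() and ch.isalpha())
    total = arabic + latin
    if total == 0:
        return "other"
    if arabic / total >= threshold:
        return "ar"
    if latin / total >= threshold:
        return "az"
    return "mi"


if __name__ == "__main__":
    for sample in ["سلام عليكم", "salam 3likom", "salam سلام"]:
        print(sample, "->", detect_script(sample))
```

A threshold below 1.0 tolerates the occasional foreign character (for example, a Latin brand name in Arabic text) without classifying the whole sentence as mixed.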

---

### `eval_doda_independent.py`: DODa Independent Validation
**Lines:** ~196

Evaluates all tokenizers on `atlasia/DODa` (87K Arabizi dictionary entries), sampled to 10K entries. Serves as an independent, out-of-training-distribution validation.

**Outputs:** `results/doda_independent_results.csv` / `.json`
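
Down-sampling the dictionary entries to 10K can be made reproducible with a seeded sampler, as in the sketch below. The helper name `sample_lines`, the seed value, and the blank-line filtering are assumptions; the script's actual sampling procedure is not described here.

```python
import random


def sample_lines(lines, k=10_000, seed=42):
    """Return up to k non-empty lines, chosen reproducibly without replacement."""
    cleaned = [ln.strip() for ln in lines if ln.strip()]
    if len(cleaned) <= k:
        return cleaned
    return random.Random(seed).sample(cleaned, k)


if __name__ == "__main__":
    # Stand-in entries; the real input would be the Arabizi column of DODa.
    entries = [f"entry_{i}" for i in range(87_000)]
    subset = sample_lines(entries)
    print(len(subset))
```

Using a dedicated `random.Random(seed)` instance avoids touching the global random state, so the same subset is drawn regardless of what else the script does.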

---

## Utility Scripts

### `fix_tokenizer_decoders.py`: Decoder Bug Fixer
**Lines:** ~152

Patches three decoder bugs in tokenizer JSON files:
1. WordPiece double-space artifact (Metaspace decoder producing `"  "` instead of `" "`)
2. NULL decoder in `concat_bpe_16000` (missing Metaspace decoder)
3. Missing WordPiece sub-decoder in `concat_wordpiece_16000`

**Warning:** Modifies tokenizer files in place. No backup is created.
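
Since the script writes in place, one way to protect the originals is to copy each file before patching it. The wrapper below is a hypothetical sketch and is not part of `fix_tokenizer_decoders.py`; the `.bak` suffix and the `patch_fn` callback are assumptions made for illustration.

```python
import json
import shutil
from pathlib import Path


def patch_with_backup(path, patch_fn):
    """Copy `path` to `<path>.bak` once, then apply `patch_fn` to the parsed JSON."""
    path = Path(path)
    backup = path.with_suffix(path.suffix + ".bak")
    if not backup.exists():
        # Keep the first backup so repeated runs never overwrite the original copy.
        shutil.copy2(path, backup)
    data = json.loads(path.read_text(encoding="utf-8"))
    patch_fn(data)  # mutates the tokenizer config dict in place
    path.write_text(json.dumps(data, ensure_ascii=False, indent=2), encoding="utf-8")
    return backup
```

Here `patch_fn` stands in for whichever decoder fix applies to a given file. Restoring is then a matter of copying the `.bak` file back over the patched one.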

---

### `gen_report.py`: Report Generator
**Lines:** ~70

Generates a Markdown summary report (`benchmark_report.md`) from `tokenizer_results.csv`, including best-by-vocabulary tables and full results.

---

### `regen_figures.py`: Figure Regenerator
**Lines:** ~313

Regenerates the 5 main paper figures from `test_set_results.csv` with larger fonts (14pt base) for readability. Reads from CSV and writes PNGs to the paper's `figures/` directory.

**Inputs:** `results/test_set_results.csv`

**Outputs:** `figures/fertility_overall_comparison_v2.png`, `fertility_overall_trends.png`, `fertility_disparity_comparison_v2.png`, `fertility_disparity_heatmap_v2.png`, `external_comparison.png`

---

### `verify_arithmetic.py`: Numeric Claims Verification
**Lines:** ~143

Validates every percentage claim in the paper and README against `test_set_results.csv`. Checks:
- Disparity formula: `|F_ar - F_az| / max(F_ar, F_az)`
- Overall fertility derivability from per-script values
- All percentage improvement claims (27-34%, 40-50%, etc.)

**Outputs:** Stdout validation report (no files written)
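
The disparity formula above translates directly into code. The sketch below also includes a relative-reduction helper for percentage claims, which assumes that "improvement" means a reduction relative to a baseline; that definition, the function names, and the example values are illustrative and not taken from `verify_arithmetic.py` or the results CSV.

```python
def disparity(f_ar: float, f_az: float) -> float:
    """Script disparity: |F_ar - F_az| / max(F_ar, F_az)."""
    return abs(f_ar - f_az) / max(f_ar, f_az)


def relative_reduction_pct(baseline: float, ours: float) -> float:
    """Percentage reduction of `ours` relative to `baseline` (assumed definition)."""
    return 100.0 * (baseline - ours) / baseline


def within(value: float, lo: float, hi: float, tol: float = 0.5) -> bool:
    """Check that a computed value falls inside a claimed range, with tolerance."""
    return lo - tol <= value <= hi + tol


if __name__ == "__main__":
    # Made-up fertility values, used only to exercise the formulas.
    print(disparity(2.0, 1.5))
    print(within(relative_reduction_pct(2.0, 1.4), 27, 34))
```

A small tolerance on the range check avoids false alarms when the paper rounds a percentage that the CSV stores at full precision.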

---

## Result Files

### Primary Data

| File | Rows | Description |
|------|------|-------------|
| `test_set_results.csv` | 40 | **Single source of truth.** Test-set metrics for all tokenizers: fertility, CPT, disparity, Gini, entropy, exact match. |
| `tokenizer_results.csv` | 24 | Full benchmark results incl. morphological metrics and per-script breakdown. Covers 8K-32K only. |
| `morph_large_vocab_results.csv` | 16 | Morphological metrics for 80K/110K tokenizers. |
| `bootstrap_ci_test_set.csv` | 24 | Bootstrap 95% CIs (500 resamples) for fertility and CPT. |
| `external_comparison.csv` | 12 | Our best 3 + 9 external tokenizers: fertility, CPT, disparity, exact match per script. |
| `codeswitch_results.csv` | 5 | Code-switching evaluation with mixed-script category. |
| `doda_independent_results.csv` | 12 | DODa independent validation (Arabizi dictionary). |

### Supporting Data

| File / Directory | Description |
|------------------|-------------|
| `corpora/` | Train/validation/test text splits: `{train,val,test}_{ar,az,mi}.txt` |
| `morphology/farasa_segmentations.json` | Cached Farasa morphological segmentations for Arabic texts (~99 MB) |
| `tokenizers/` | Raw HuggingFace `tokenizers` JSON files (60 files: 24 shared + 36 concat halves) |
| `transformers_tokenizers/` | Tokenizers exported for `transformers` library use |
| `doda_sample_10k.txt` | 10K-line Arabizi sample from DODa dataset |
| `benchmark_report.md` | Auto-generated Markdown summary report |

---

## Reproduction Guide

### Full Reproduction (from scratch)

```bash
# 1. Train 8K-32K tokenizers + initial evaluation + morphology
python script.py

# 2. Train 80K + 110K tokenizers
python train_large_vocab.py
python train_remaining.py

# 3. Evaluate on test set (appends 80K/110K to results)
python eval_test_set.py
python eval_new_and_append.py
python eval_missing.py

# 4. Compute morphological metrics for large vocabs
python eval_morph_large.py

# 5. Bootstrap confidence intervals
python bootstrap_test_set.py

# 6. External tokenizer comparison
python eval_all_externals.py

# 7. Code-switching + DODa validation
python eval_codeswitch_and_new_baselines.py
python eval_doda_independent.py

# 8. Generate figures + reports
python regen_figures.py
python gen_report.py

# 9. Verify all numeric claims
python verify_arithmetic.py
```

### Requirements

- Python 3.10+
- `tokenizers`, `transformers`, `datasets` (HuggingFace stack)
- `scikit-learn` (KMeans for morphological consistency)
- `regex` (Unicode grapheme segmentation)
- `numpy`, `pandas`, `matplotlib`, `seaborn`
- `tqdm`
- Farasa JAR (for morphological segmentation; pre-cached in `morphology/`)
- `HF_TOKEN` environment variable (for loading external models)

---

## Key Design Decisions

1. **Monolithic `script.py`**: The main pipeline runs at import level (no `__main__` guard). This is intentional for checkpoint-based resumption: the script detects existing artifacts and skips completed stages.

2. **Duplicated helper functions**: Functions like `detect_script()`, `count_graphemes()`, and `normalize_decode()` are copied across evaluation scripts rather than shared via import. This ensures each eval script is self-contained and runnable independently.

3. **Test-set-only evaluation**: All paper numbers come from `eval_test_set.py`, not `script.py`'s full-corpus evaluation. The test set (11,282 sentences per script) provides unbiased estimates.

4. **Concatenated architecture**: Each concat config is stored as two JSON files (`concat_ar_*.json` + `concat_az_*.json`). The evaluator loads both and applies ID shifting at inference time.
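
The ID shifting described in item 4 can be sketched with plain callables standing in for the two halves. Which half is shifted, the class name `ConcatTokenizer`, and the routing by a script label are assumptions made for illustration; the evaluator's actual logic may differ.

```python
class ConcatTokenizer:
    """Combine two per-script encoders into one non-overlapping ID space."""

    def __init__(self, encode_ar, vocab_size_ar, encode_az):
        self.encode_ar = encode_ar          # callable: str -> list[int]
        self.vocab_size_ar = vocab_size_ar  # offset applied to the second half
        self.encode_az = encode_az          # callable: str -> list[int]

    def encode(self, text, script):
        if script == "ar":
            # Arabic-script IDs occupy [0, vocab_size_ar).
            return self.encode_ar(text)
        # IDs from the second half are shifted past the first vocabulary.
        return [i + self.vocab_size_ar for i in self.encode_az(text)]


if __name__ == "__main__":
    # Toy encoders; real halves would come from the two concat JSON files.
    tok = ConcatTokenizer(lambda t: [0, 1], 100, lambda t: [0, 5])
    print(tok.encode("سلام", "ar"))
    print(tok.encode("salam", "az"))
```

With the `tokenizers` library, each callable could wrap a loaded half, for example `lambda t: half.encode(t).ids` with `half = Tokenizer.from_file(path)`, and the offset could come from `get_vocab_size()` on the first half.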