# Code Documentation - Darija Tokenizer Benchmark
This document describes every script, data file, and output artifact in the benchmark codebase.
---
## Overview
The benchmark pipeline consists of four stages: **training**, **evaluation**, **analysis**, and **reporting**. Each stage is implemented by standalone Python scripts. The diagram below shows the data flow:
```
OiQ/daa-pairs (dataset)
        │
        ▼
┌──────────────────┐     ┌─────────────────────┐
│    script.py     │────▶│ results/tokenizers/ │  (60 raw JSON files)
│  (train 8K-32K)  │     │ results/transformers│  (transformers format)
└──────────────────┘     └─────────────────────┘
        │                           ▲
        ▼                           │
┌──────────────────┐     ┌─────────────────────┐
│train_large_vocab │────▶│ 80K + 110K configs  │
│train_remaining   │     │ (16 additional)     │
└──────────────────┘     └─────────────────────┘
        │
        ▼
┌──────────────────────────────────────────────┐
│              EVALUATION SCRIPTS              │
│ ├── eval_test_set.py       → test_set_results│
│ ├── eval_new_and_append.py → append 80K/110K │
│ ├── eval_missing.py        → fill gaps       │
│ ├── eval_morph_large.py    → morph 80K/110K  │
│ ├── bootstrap_test_set.py  → 95% CIs         │
│ ├── eval_all_externals.py  → external comp.  │
│ ├── eval_codeswitch_...    → code-switching  │
│ └── eval_doda_independent  → DODa validation │
└──────────────────────────────────────────────┘
        │
        ▼
┌──────────────────┐     ┌─────────────────────┐
│ regen_figures    │────▶│ figures/*.png       │
│ gen_report       │────▶│ benchmark_report.md │
│ verify_arithmetic│────▶│ (stdout validation) │
└──────────────────┘     └─────────────────────┘
```
---
## Training Scripts
### `script.py` - Master Benchmark Pipeline
**Lines:** ~2032 | **Type:** Runs at import (no `__main__` guard)
The monolithic entry point. Loads the `OiQ/daa-pairs` dataset, trains 24 tokenizers (4 algorithms x 2 architectures x 3 vocab sizes: 8K, 16K, 32K), evaluates them on the full training+test corpus, computes morphological metrics via Farasa, generates bootstrap confidence intervals, and produces all plots and reports.
**Key class:** `ProductionMetricsEvaluator`, which implements script detection, tokenization, the Gini coefficient, and all other metric computations.
**Inputs:**
- `OiQ/daa-pairs` dataset (via `huggingface_hub`)
- `BenchmarkConfig` dataclass (vocab sizes, algorithms, hyperparameters)
**Outputs:**
- `results/tokenizers/*.json`: 48 raw tokenizer files (24 shared + 24 concat halves)
- `results/transformers_tokenizers/`: transformers-compatible exports
- `results/tokenizer_results.csv` / `.json`: full metrics with morphological data
- `results/bootstrap_ci.csv`: bootstrap CIs
- `results/benchmark_report.md`: auto-generated summary
- `results/morphology/farasa_segmentations.json`: cached Farasa segmentations (~99 MB)
- `results/plots/*.png`: all visualization figures
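
The source names a Gini coefficient among the metrics in `ProductionMetricsEvaluator` but does not show how it is computed. As a minimal sketch, assuming it is taken over token frequency counts (an assumption, as is the function name `gini`), the standard rank-weighted formula looks like this:

```python
from collections import Counter


def gini(counts):
    """Gini coefficient of non-negative counts (0 = perfectly even usage)."""
    values = sorted(counts)
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted sum over ascending values, with ranks starting at 1.
    weighted = sum(rank * v for rank, v in enumerate(values, start=1))
    return (2 * weighted) / (n * total) - (n + 1) / n


# Hypothetical token ID stream; real input would come from a tokenizer.
token_ids = [5, 5, 5, 7, 9, 9]
print(round(gini(Counter(token_ids).values()), 3))
```

A value near 0 would indicate that tokens are used evenly; values approaching 1 indicate that a few tokens dominate.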
---
### `train_large_vocab.py` - Train 80K/110K Tokenizers
**Lines:** ~146
Trains additional tokenizers at 80K and 110K vocabulary sizes to match DarijaBERT-ar (80K) and DarijaBERT-az (110K) for fair head-to-head comparison.
**Inputs:** `results/corpora/train_{ar,az}.txt`
**Outputs:** `results/tokenizers/{shared,concat_ar,concat_az}_{algo}_{80000,110000}.json`
---
### `train_remaining.py` - Train Remaining Tokenizers
**Lines:** ~134
Fills in the last missing tokenizer configurations (3 shared 110K + 6 concat 110K + 1 concat 80K BBPE) that `train_large_vocab.py` did not cover.
**Inputs:** `results/corpora/train_{ar,az}.txt`
**Outputs:** Remaining `results/tokenizers/*.json` files
---
### `retrain_missing_and_compare.py` - Retrain + Full Re-evaluation
**Lines:** ~558
Retrains 4 missing concat 32K tokenizers, exports to transformers format, re-evaluates all 28 configs, fixes a WordPiece exact-match bug, and runs the external comparison pipeline.
**Inputs:** Training corpora, HF external models
**Outputs:** Updated tokenizer JSONs, `external_comparison.csv`, external comparison plot
---
## Evaluation Scripts
### `eval_test_set.py` - Test-Set Evaluation (Single Source of Truth)
**Lines:** ~227
Re-evaluates all tokenizers on the held-out **test set only** (11,282 sentences per script). This is the authoritative evaluation used in all paper tables and percentage claims.
**Key function:** `normalize_decode()`, which fixes Metaspace double-space artifacts in WordPiece decoders.
**Inputs:**
- `results/tokenizers/*.json`
- `results/corpora/test_{ar,az,mi}.txt`
**Outputs:** `results/test_set_results.csv` / `.json` (40 rows)
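
The body of `normalize_decode()` is not reproduced in this document. One plausible shape, assuming it collapses whitespace runs before the exact-match comparison (the helper `exact_match` is a hypothetical name introduced for illustration), is:

```python
import re

_WS_RUN = re.compile(r"\s+")


def normalize_decode(text: str) -> str:
    """Collapse whitespace runs to a single space and trim both ends."""
    return _WS_RUN.sub(" ", text).strip()


def exact_match(original: str, decoded: str) -> bool:
    # Normalize both sides so spacing artifacts alone do not count as a mismatch.
    return normalize_decode(original) == normalize_decode(decoded)
```

Normalizing both the reference and the decoded string keeps the comparison symmetric, so a round trip is only penalized when characters actually change.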
---
### `eval_new_and_append.py` - Append 80K/110K Results
**Lines:** ~144
Evaluates the newly trained 80K/110K tokenizers and appends their rows to `test_set_results.csv`.
**Inputs:** `test_set_results.csv`, tokenizers, test corpora
**Outputs:** Updated `test_set_results.csv` (grows from 24 to 40 rows)
---
### `eval_missing.py` - Fill Single Gap
**Lines:** ~124
Evaluates the one remaining missing tokenizer (`concat_bbpe_55000`) and merges it into the results CSV.
---
### `eval_morph_large.py` - Morphological Metrics for 80K/110K
**Lines:** ~297
Computes morphological fidelity metrics (edit distance and consistency F1) for the 16 large-vocabulary tokenizers (80K, 110K) that were not covered by the original `script.py` morph evaluation. Uses the same Farasa cache and identical algorithms.
**Inputs:**
- `results/morphology/farasa_segmentations.json`
- `results/corpora/test_ar.txt`
- 16 tokenizer JSON files
**Outputs:** `results/morph_large_vocab_results.csv` (16 rows)
---
### `bootstrap_test_set.py` - Bootstrap Confidence Intervals
**Lines:** ~163
Computes 95% bootstrap confidence intervals (500 resamples) for fertility and CPT on the test set.
**Inputs:** Tokenizers, `test_{ar,az,mi}.txt`
**Outputs:** `results/bootstrap_ci_test_set.csv` (24 rows for 8K-32K configs)
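
To make the 500-resample procedure concrete, here is a minimal percentile-bootstrap sketch. It assumes fertility is total tokens divided by total words and that resampling is done at the sentence level; both are assumptions, as are the function and parameter names.

```python
import random


def bootstrap_ci(token_counts, word_counts, n_resamples=500, alpha=0.05, seed=0):
    """Percentile bootstrap CI for corpus-level fertility (tokens / words)."""
    rng = random.Random(seed)  # fixed seed keeps the interval reproducible
    n = len(token_counts)
    stats = []
    for _ in range(n_resamples):
        # Resample sentence indices with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        tokens = sum(token_counts[i] for i in idx)
        words = sum(word_counts[i] for i in idx)
        stats.append(tokens / words)
    stats.sort()
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical per-sentence counts (each sentence must contain at least one word).
print(bootstrap_ci([3, 4, 5, 6], [2, 2, 3, 3]))
```

The same routine would apply to CPT by passing per-sentence character and token counts in place of token and word counts.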
---
## External Comparison Scripts
### `eval_all_externals.py` - Evaluate 9 External Tokenizers
**Lines:** ~281
Evaluates 9 external Arabic/Darija tokenizers from HuggingFace (CaMeLBERT-MSA, Asafaya-BERT, Aranizer-SP-86k, B2BERT, DarijaBERT-ar, DarijaBERT-az, Darija-Tokenizer, Translit-Darija, Qwen2.5-Darija) alongside our best 3 tokenizers.
**Inputs:** HF model repos (requires `HF_TOKEN`), test corpora
**Outputs:** `results/external_comparison.csv` / `.json`, comparison plot
---
### `compare_with_external.py` - External Comparison (Earlier Version)
**Lines:** ~269
Earlier version of the external comparison, comparing our 8K/16K/32K tokenizers against 5 external models.
---
### `eval_and_compare.py` - Combined Evaluation + Comparison
**Lines:** ~277
Combines internal evaluation and external comparison into a single pipeline run.
---
### `eval_codeswitch_and_new_baselines.py` - Code-Switching Evaluation
**Lines:** ~373
Evaluates tokenizers on mixed-script (code-switched) texts as a separate category. Also adds the `atlasia/darija_bpe_tokenizer` baseline and evaluates all tokenizers on DODa.
**Outputs:** `results/codeswitch_results.csv` / `.json`
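
Treating mixed-script text as its own category requires a script classifier. The sketch below is one simple way to do it, not the benchmark's actual `detect_script()`. It assumes the labels `ar`, `az`, and `mi` (taken from the corpus file names, with `mi` presumed to mean mixed) and checks only the basic Arabic Unicode block and ASCII letters.

```python
def detect_script(text: str) -> str:
    """Label text as 'ar' (Arabic script), 'az' (Latin/Arabizi), 'mi' (mixed), or 'other'."""
    # Assumption: only the basic Arabic block U+0600..U+06FF is considered.
    has_arabic = any("\u0600" <= ch <= "\u06FF" for ch in text)
    has_latin = any(ch.isascii() and ch.isalpha() for ch in text)
    if has_arabic and has_latin:
        return "mi"
    if has_arabic:
        return "ar"
    if has_latin:
        return "az"
    return "other"


print(detect_script("salam 3likom"))
```

Digits are ignored on purpose: Arabizi uses numerals such as `3` and `7` as letters, but they also appear in Arabic-script text, so they carry no script signal on their own.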
---
### `eval_doda_independent.py` - DODa Independent Validation
**Lines:** ~196
Evaluates all tokenizers on `atlasia/DODa` (87K Arabizi dictionary entries), sampled to 10K entries. Serves as an independent, out-of-training-distribution validation.
**Outputs:** `results/doda_independent_results.csv` / `.json`
---
## Utility Scripts
### `fix_tokenizer_decoders.py` - Decoder Bug Fixer
**Lines:** ~152
Patches three decoder bugs in tokenizer JSON files:
1. WordPiece double-space artifact (Metaspace decoder producing `"  "` instead of `" "`)
2. NULL decoder in `concat_bpe_16000` (missing Metaspace decoder)
3. Missing WordPiece sub-decoder in `concat_wordpiece_16000`
**Warning:** Modifies tokenizer files in place. No backup is created.
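
Because the script overwrites files without a backup, it may be worth snapshotting the tokenizer directory first. A minimal sketch follows; the helper name `backup_dir` and the `backups` folder are hypothetical and not part of the codebase.

```python
import shutil
from datetime import datetime
from pathlib import Path


def backup_dir(src: str, backup_root: str = "backups") -> Path:
    """Copy `src` into a timestamped folder under `backup_root` and return the new path."""
    source = Path(src)
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    dest = Path(backup_root) / f"{source.name}-{stamp}"
    shutil.copytree(source, dest)  # fails if `dest` already exists, so nothing is overwritten
    return dest
```

Calling `backup_dir("results/tokenizers")` before running `fix_tokenizer_decoders.py` would leave an untouched copy to diff against or restore from.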
---
### `gen_report.py` - Report Generator
**Lines:** ~70
Generates a Markdown summary report (`benchmark_report.md`) from `tokenizer_results.csv`, including best-by-vocabulary tables and full results.
---
### `regen_figures.py` - Figure Regenerator
**Lines:** ~313
Regenerates the 5 main paper figures from `test_set_results.csv` with larger fonts (14pt base) for readability. Reads from CSV and writes PNGs to the paper's `figures/` directory.
**Inputs:** `results/test_set_results.csv`
**Outputs:** `figures/fertility_overall_comparison_v2.png`, `fertility_overall_trends.png`, `fertility_disparity_comparison_v2.png`, `fertility_disparity_heatmap_v2.png`, `external_comparison.png`
---
### `verify_arithmetic.py` - Numeric Claims Verification
**Lines:** ~143
Validates every percentage claim in the paper and README against `test_set_results.csv`. Checks:
- Disparity formula: `|F_ar - F_az| / max(F_ar, F_az)`
- Overall fertility derivability from per-script values
- All percentage improvement claims (27-34%, 40-50%, etc.)
**Outputs:** Stdout validation report (no files written)
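
The disparity formula listed above translates directly into code. The sketch below pairs it with a tolerance-based claim check; `check_claim`, its tolerance, and the example values are hypothetical and only illustrate the kind of stdout report the script produces.

```python
import math


def disparity(f_ar: float, f_az: float) -> float:
    """Relative fertility gap between scripts: |F_ar - F_az| / max(F_ar, F_az)."""
    largest = max(f_ar, f_az)
    if largest == 0:
        return 0.0
    return abs(f_ar - f_az) / largest


def check_claim(name: str, claimed: float, computed: float, tol: float = 0.005) -> bool:
    """Print one validation line and return whether the claim holds within `tol`."""
    ok = math.isclose(claimed, computed, abs_tol=tol)
    status = "OK  " if ok else "FAIL"
    print(f"{status} {name}: claimed={claimed:.3f} computed={computed:.3f}")
    return ok


# Made-up fertility values, for illustration only.
check_claim("disparity (example)", 0.25, disparity(2.0, 1.5))
```

The measure is symmetric in its two arguments and bounded between 0 and 1, which makes it comparable across vocabulary sizes.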
---
## Result Files
### Primary Data
| File | Rows | Description |
|------|------|-------------|
| `test_set_results.csv` | 40 | **Single source of truth.** Test-set metrics for all tokenizers: fertility, CPT, disparity, Gini, entropy, exact match. |
| `tokenizer_results.csv` | 24 | Full benchmark results incl. morphological metrics and per-script breakdown. Covers 8K-32K only. |
| `morph_large_vocab_results.csv` | 16 | Morphological metrics for 80K/110K tokenizers. |
| `bootstrap_ci_test_set.csv` | 24 | Bootstrap 95% CIs (500 resamples) for fertility and CPT. |
| `external_comparison.csv` | 12 | Our best 3 + 9 external tokenizers: fertility, CPT, disparity, exact match per script. |
| `codeswitch_results.csv` | 5 | Code-switching evaluation with mixed-script category. |
| `doda_independent_results.csv` | 12 | DODa independent validation (Arabizi dictionary). |
### Supporting Data
| File / Directory | Description |
|------------------|-------------|
| `corpora/` | Train/validation/test text splits: `{train,val,test}_{ar,az,mi}.txt` |
| `morphology/farasa_segmentations.json` | Cached Farasa morphological segmentations for Arabic texts (~99 MB) |
| `tokenizers/` | Raw HuggingFace `tokenizers` JSON files (60 files: 24 shared + 36 concat halves) |
| `transformers_tokenizers/` | Tokenizers exported for `transformers` library use |
| `doda_sample_10k.txt` | 10K-line Arabizi sample from DODa dataset |
| `benchmark_report.md` | Auto-generated Markdown summary report |
---
## Reproduction Guide
### Full Reproduction (from scratch)
```bash
# 1. Train 8K-32K tokenizers + initial evaluation + morphology
python script.py
# 2. Train 80K + 110K tokenizers
python train_large_vocab.py
python train_remaining.py
# 3. Evaluate on test set (appends 80K/110K to results)
python eval_test_set.py
python eval_new_and_append.py
python eval_missing.py
# 4. Compute morphological metrics for large vocabs
python eval_morph_large.py
# 5. Bootstrap confidence intervals
python bootstrap_test_set.py
# 6. External tokenizer comparison
python eval_all_externals.py
# 7. Code-switching + DODa validation
python eval_codeswitch_and_new_baselines.py
python eval_doda_independent.py
# 8. Generate figures + reports
python regen_figures.py
python gen_report.py
# 9. Verify all numeric claims
python verify_arithmetic.py
```
### Requirements
- Python 3.10+
- `tokenizers`, `transformers`, `datasets` (HuggingFace stack)
- `scikit-learn` (KMeans for morphological consistency)
- `regex` (Unicode grapheme segmentation)
- `numpy`, `pandas`, `matplotlib`, `seaborn`
- `tqdm`
- Farasa JAR (for morphological segmentation; pre-cached in `morphology/`)
- `HF_TOKEN` environment variable (for loading external models)
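
Since the external comparison depends on `HF_TOKEN`, a small pre-flight check can fail fast with a clear message instead of partway through model downloads. This is a hypothetical helper, not something the codebase is known to include:

```python
import os


def require_hf_token(env=os.environ) -> str:
    """Return the HF_TOKEN value, or raise if it is missing or blank."""
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        raise RuntimeError(
            "HF_TOKEN is not set; export it before running eval_all_externals.py"
        )
    return token
```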
---
## Key Design Decisions
1. **Monolithic `script.py`**: The main pipeline runs at import level (no `__main__` guard). This is intentional for checkpoint-based resumption: the script detects existing artifacts and skips completed stages.
2. **Duplicated helper functions**: Functions like `detect_script()`, `count_graphemes()`, and `normalize_decode()` are copied across evaluation scripts rather than shared via import. This ensures each eval script is self-contained and runnable independently.
3. **Test-set-only evaluation**: All paper numbers come from `eval_test_set.py`, not `script.py`'s full-corpus evaluation. The test set (11,282 sentences per script) provides unbiased estimates.
4. **Concatenated architecture**: Each concat config is stored as two JSON files (`concat_ar_*.json` + `concat_az_*.json`). The evaluator loads both and applies ID shifting at inference time.
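
The ID shifting in decision 4 can be pictured with a toy example. The source does not spell out the offset rule, so this sketch assumes the Arabizi half's IDs are shifted by the Arabic half's vocabulary size and that text is routed by its script label. `ToyTokenizer` and `ConcatTokenizer` are illustrative stand-ins, not classes from the codebase.

```python
class ToyTokenizer:
    """Stand-in for one concat half; maps whitespace-split words to local IDs."""

    def __init__(self, vocab):
        self.vocab = {tok: i for i, tok in enumerate(vocab)}

    def encode(self, text):
        return [self.vocab[word] for word in text.split()]

    def vocab_size(self):
        return len(self.vocab)


class ConcatTokenizer:
    """Combine two halves into one ID space by offsetting the second half."""

    def __init__(self, ar_half, az_half):
        self.ar = ar_half
        self.az = az_half
        # Assumption: AZ IDs start right after the AR ID range.
        self.offset = ar_half.vocab_size()

    def encode(self, text, script):
        if script == "ar":
            return self.ar.encode(text)
        return [i + self.offset for i in self.az.encode(text)]


ar = ToyTokenizer(["سلام", "بخير"])
az = ToyTokenizer(["salam", "bikhir", "labas"])
concat = ConcatTokenizer(ar, az)
print(concat.encode("salam labas", "az"))
```

With the offset applied, the two halves never collide on an ID, so the combined vocabulary size is simply the sum of the two halves.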