| # Code Documentation — Darija Tokenizer Benchmark |
|
|
| This document describes every script, data file, and output artifact in the benchmark codebase. |
|
|
| --- |
|
|
| ## Overview |
|
|
| The benchmark pipeline consists of four stages: **training**, **evaluation**, **analysis**, and **reporting**. Each stage is implemented by standalone Python scripts. The diagram below shows the data flow: |
|
|
| ``` |
| OiQ/daa-pairs (dataset) |
| │ |
| ▼ |
| ┌──────────────────┐ ┌─────────────────────┐ |
| │ script.py │────▶│ results/tokenizers/ │ (60 raw JSON files) |
| │ (train 8K-32K) │ │ results/transformers│ (transformers format) |
| └──────────────────┘ └─────────────────────┘ |
| │ ▲ |
| ▼ │ |
| ┌──────────────────┐ ┌─────────────────────┐ |
| │train_large_vocab │────▶│ 80K + 110K configs │ |
| │train_remaining │ │ (16 additional) │ |
| └──────────────────┘ └─────────────────────┘ |
| │ |
| ▼ |
| ┌──────────────────────────────────────────────┐ |
| │ EVALUATION SCRIPTS │ |
| │ ├── eval_test_set.py → test_set_results│ |
| │ ├── eval_new_and_append.py → append 80K/110K │ |
| │ ├── eval_missing.py → fill gaps │ |
| │ ├── eval_morph_large.py → morph 80K/110K │ |
| │ ├── bootstrap_test_set.py → 95% CIs │ |
| │ ├── eval_all_externals.py → external comp. │ |
| │ ├── eval_codeswitch_... → code-switching │ |
| │ └── eval_doda_independent → DODa validation │ |
| └──────────────────────────────────────────────┘ |
| │ |
| ▼ |
| ┌──────────────────┐ ┌─────────────────────┐ |
| │ regen_figures │────▶│ figures/*.png │ |
| │ gen_report │────▶│ benchmark_report.md │ |
| │ verify_arithmetic│───▶│ (stdout validation) │ |
| └──────────────────┘ └─────────────────────┘ |
| ``` |
|
|
| --- |
|
|
| ## Training Scripts |
|
|
| ### `script.py` — Master Benchmark Pipeline |
| **Lines:** ~2032 | **Type:** Runs at import (no `__main__` guard) |
|
|
| The monolithic entry point. Loads the `OiQ/daa-pairs` dataset, trains 24 tokenizers (4 algorithms x 2 architectures x 3 vocab sizes: 8K, 16K, 32K), evaluates them on the full training+test corpus, computes morphological metrics via Farasa, generates bootstrap confidence intervals, and produces all plots and reports. |
|
|
| **Key class:** `ProductionMetricsEvaluator` — implements script detection, tokenization, Gini coefficient, and all metric computations. |
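For readers unfamiliar with the Gini coefficient listed above, the sketch below shows one standard way to compute it from a list of non-negative counts. It is a minimal illustration and not the code inside `ProductionMetricsEvaluator`: the function name `gini` and the choice of token frequencies as input are assumptions.

```python
def gini(counts):
    """Gini coefficient of non-negative counts.

    0.0 means all counts are equal; values near 1.0 mean a few entries
    hold almost all of the mass. Hypothetical helper, not from script.py.
    """
    xs = sorted(counts)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted sum over the ascending-sorted values (ranks start at 1).
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


# Illustrative token frequencies (made-up numbers).
uniform_usage = [10, 10, 10, 10]
skewed_usage = [0, 0, 0, 40]
print(gini(uniform_usage), gini(skewed_usage))
```

A lower value suggests vocabulary entries are used more evenly, under the assumption that the inputs are per-token usage counts.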
|
|
| **Inputs:** |
| - `OiQ/daa-pairs` dataset (via `huggingface_hub`) |
| - `BenchmarkConfig` dataclass (vocab sizes, algorithms, hyperparameters) |
|
|
| **Outputs:** |
| - `results/tokenizers/*.json` — 36 raw tokenizer files (12 shared + 24 concat halves) |
| - `results/transformers_tokenizers/` — transformers-compatible exports |
| - `results/tokenizer_results.csv` / `.json` — full metrics with morphological data |
| - `results/bootstrap_ci.csv` — bootstrap CIs |
| - `results/benchmark_report.md` — auto-generated summary |
| - `results/morphology/farasa_segmentations.json` — cached Farasa segmentations (~99 MB) |
| - `results/plots/*.png` — all visualization figures |
|
|
| --- |
|
|
| ### `train_large_vocab.py` — Train 80K/110K Tokenizers |
| **Lines:** ~146 |
|
|
| Trains additional tokenizers at 80K and 110K vocabulary sizes to match DarijaBERT-ar (80K) and DarijaBERT-az (110K) for fair head-to-head comparison. |
|
|
| **Inputs:** `results/corpora/train_{ar,az}.txt` |
|
|
| **Outputs:** `results/tokenizers/{shared,concat_ar,concat_az}_{algo}_{80000,110000}.json` |
|
|
| --- |
|
|
| ### `train_remaining.py` — Train Remaining Tokenizers |
| **Lines:** ~134 |
| |
| Fills in the last missing tokenizer configurations (3 shared 110K + 6 concat 110K + 1 concat 80K BBPE) that `train_large_vocab.py` did not cover. |
| |
| **Inputs:** `results/corpora/train_{ar,az}.txt` |
|
|
| **Outputs:** Remaining `results/tokenizers/*.json` files |
|
|
| --- |
|
|
| ### `retrain_missing_and_compare.py` — Retrain + Full Re-evaluation |
| **Lines:** ~558 |
| |
| Retrains 4 missing concat 32K tokenizers, exports to transformers format, re-evaluates all 24 configs, fixes a WordPiece exact-match bug, and runs the external comparison pipeline. |
| |
| **Inputs:** Training corpora, HF external models |
| |
| **Outputs:** Updated tokenizer JSONs, `external_comparison.csv`, external comparison plot |
|
|
| --- |
|
|
| ## Evaluation Scripts |
|
|
| ### `eval_test_set.py` — Test-Set Evaluation (Single Source of Truth) |
| **Lines:** ~227 |
|
|
| Re-evaluates all tokenizers on the held-out **test set only** (11,282 sentences per script). This is the authoritative evaluation used in all paper tables and percentage claims. |
|
|
| **Key function:** `normalize_decode()` — fixes Metaspace double-space artifacts in WordPiece decoders. |
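As a rough illustration of the cleanup described above, the sketch below collapses runs of spaces in decoded text so that an exact-match comparison against the input is not thrown off by decoder spacing. Only the function name comes from this document; the body is an assumption, and the real `normalize_decode()` may handle additional cases.

```python
import re


def normalize_decode(text: str) -> str:
    """Collapse runs of spaces left by the decoder and trim both ends.

    Hypothetical body: only the name is taken from eval_test_set.py.
    """
    return re.sub(r" {2,}", " ", text).strip()


decoded = "labas  3lik  "  # made-up decoder output with doubled spaces
print(normalize_decode(decoded))
```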
|
|
| **Inputs:** |
| - `results/tokenizers/*.json` |
| - `results/corpora/test_{ar,az,mi}.txt` |
|
|
| **Outputs:** `results/test_set_results.csv` / `.json` (40 rows) |
|
|
| --- |
|
|
| ### `eval_new_and_append.py` — Append 80K/110K Results |
| **Lines:** ~144 |
| |
| Evaluates the newly trained 80K/110K tokenizers and appends their rows to `test_set_results.csv`. |
| |
| **Inputs:** `test_set_results.csv`, tokenizers, test corpora |
| |
| **Outputs:** Updated `test_set_results.csv` (grows from 24 to 40 rows) |
| |
| --- |
| |
| ### `eval_missing.py` — Fill Single Gap |
| **Lines:** ~124 |
|
|
| Evaluates the one remaining missing tokenizer (`concat_bbpe_55000`) and merges it into the results CSV. |
|
|
| --- |
|
|
| ### `eval_morph_large.py` — Morphological Metrics for 80K/110K |
| **Lines:** ~297 |
|
|
| Computes morphological fidelity metrics (edit distance and consistency F1) for the 16 large-vocabulary tokenizers (80K, 110K) that were not covered by the original `script.py` morph evaluation. Uses the same Farasa cache and identical algorithms. |
|
|
| **Inputs:** |
| - `results/morphology/farasa_segmentations.json` |
| - `results/corpora/test_ar.txt` |
| - 16 tokenizer JSON files |
|
|
| **Outputs:** `results/morph_large_vocab_results.csv` (16 rows) |
|
|
| --- |
|
|
| ### `bootstrap_test_set.py` — Bootstrap Confidence Intervals |
| **Lines:** ~163 |
|
|
| Computes 95% bootstrap confidence intervals (500 resamples) for fertility and CPT on the test set. |
|
|
| **Inputs:** Tokenizers, `test_{ar,az,mi}.txt` |
|
|
| **Outputs:** `results/bootstrap_ci_test_set.csv` (24 rows for 8K-32K configs) |
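The resampling idea can be sketched with the standard library alone. The example assumes fertility is measured as tokens per word at corpus level, that sentences are the resampling unit, and that the percentile method is used; none of these details are confirmed by the script description, and the function name is hypothetical.

```python
import random


def bootstrap_fertility_ci(token_counts, word_counts,
                           n_resamples=500, alpha=0.05, seed=0):
    """Percentile bootstrap CI for corpus-level fertility (tokens / words).

    token_counts[i] and word_counts[i] are per-sentence counts; sentences
    are resampled with replacement. Illustrative sketch only.
    """
    rng = random.Random(seed)
    n = len(token_counts)
    stats = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        tokens = sum(token_counts[i] for i in idx)
        words = sum(word_counts[i] for i in idx)
        stats.append(tokens / words)
    stats.sort()
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Made-up per-sentence counts for five sentences.
low, high = bootstrap_fertility_ci([3, 8, 5, 9, 4], [2, 3, 4, 3, 2])
print(f"95% CI for fertility: [{low:.3f}, {high:.3f}]")
```

Fixing the seed keeps the interval reproducible between runs, which matters when the CSV is regenerated.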
|
|
| --- |
|
|
| ## External Comparison Scripts |
|
|
| ### `eval_all_externals.py` — Evaluate 9 External Tokenizers |
| **Lines:** ~281 |
|
|
| Evaluates 9 external Arabic/Darija tokenizers from HuggingFace (CaMeLBERT-MSA, Asafaya-BERT, Aranizer-SP-86k, B2BERT, DarijaBERT-ar, DarijaBERT-az, Darija-Tokenizer, Translit-Darija, Qwen2.5-Darija) alongside our best 3 tokenizers. |
|
|
| **Inputs:** HF model repos (requires `HF_TOKEN`), test corpora |
|
|
| **Outputs:** `results/external_comparison.csv` / `.json`, comparison plot |
|
|
| --- |
|
|
| ### `compare_with_external.py` — External Comparison (Earlier Version) |
| **Lines:** ~269 |
|
|
| Earlier version of the external comparison, comparing our 8K/16K/32K tokenizers against 5 external models. |
|
|
| --- |
|
|
| ### `eval_and_compare.py` — Combined Evaluation + Comparison |
| **Lines:** ~277 |
|
|
| Combines internal evaluation and external comparison into a single pipeline run. |
|
|
| --- |
|
|
| ### `eval_codeswitch_and_new_baselines.py` — Code-Switching Evaluation |
| **Lines:** ~373 |
|
|
| Evaluates tokenizers on mixed-script (code-switched) texts as a separate category. Also adds the `atlasia/darija_bpe_tokenizer` baseline and evaluates all tokenizers on DODa. |
|
|
| **Outputs:** `results/codeswitch_results.csv` / `.json` |
|
|
| --- |
|
|
| ### `eval_doda_independent.py` — DODa Independent Validation |
| **Lines:** ~196 |
|
|
| Evaluates all tokenizers on `atlasia/DODa` (87K Arabizi dictionary entries), sampled to 10K entries. Serves as an independent, out-of-training-distribution validation. |
|
|
| **Outputs:** `results/doda_independent_results.csv` / `.json` |
|
|
| --- |
|
|
| ## Utility Scripts |
|
|
| ### `fix_tokenizer_decoders.py` — Decoder Bug Fixer |
| **Lines:** ~152 |
|
|
| Patches three decoder bugs in tokenizer JSON files: |
| 1. WordPiece double-space artifact (Metaspace decoder producing two spaces, `"  "`, instead of one, `" "`) |
| 2. NULL decoder in `concat_bpe_16000` (missing Metaspace decoder) |
| 3. Missing WordPiece sub-decoder in `concat_wordpiece_16000` |
|
|
| **Warning:** Modifies tokenizer files in place. No backup is created. |
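Since the fixer overwrites files without a backup, one option is to copy the tokenizer JSONs aside first. The helper below is a hypothetical sketch that is not part of the codebase; the directory names in the usage comment are assumptions based on the paths listed in this document.

```python
import shutil
from pathlib import Path


def backup_tokenizers(src_dir, backup_dir):
    """Copy every *.json file in src_dir into backup_dir; return the copies.

    Hypothetical helper to run before fix_tokenizer_decoders.py.
    """
    src, dst = Path(src_dir), Path(backup_dir)
    dst.mkdir(parents=True, exist_ok=True)
    return [Path(shutil.copy2(p, dst / p.name))
            for p in sorted(src.glob("*.json"))]


# Example call (paths are assumptions):
# backup_tokenizers("results/tokenizers", "results/tokenizers_backup")
```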
|
|
| --- |
|
|
| ### `gen_report.py` — Report Generator |
| **Lines:** ~70 |
| |
| Generates a Markdown summary report (`benchmark_report.md`) from `tokenizer_results.csv`, including best-by-vocabulary tables and full results. |
| |
| --- |
| |
| ### `regen_figures.py` — Figure Regenerator |
| **Lines:** ~313 |
|
|
| Regenerates the 5 main paper figures from `test_set_results.csv` with larger fonts (14pt base) for readability. Reads from CSV and writes PNGs to the paper's `figures/` directory. |
|
|
| **Inputs:** `results/test_set_results.csv` |
|
|
| **Outputs:** `figures/fertility_overall_comparison_v2.png`, `fertility_overall_trends.png`, `fertility_disparity_comparison_v2.png`, `fertility_disparity_heatmap_v2.png`, `external_comparison.png` |
|
|
| --- |
|
|
| ### `verify_arithmetic.py` — Numeric Claims Verification |
| **Lines:** ~143 |
| |
| Validates every percentage claim in the paper and README against `test_set_results.csv`. Checks: |
| - Disparity formula: `|F_ar - F_az| / max(F_ar, F_az)` |
| - Overall fertility derivability from per-script values |
| - All percentage improvement claims (27-34%, 40-50%, etc.) |
| |
| **Outputs:** Stdout validation report (no files written) |
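The disparity formula listed above translates directly into code. The second helper shows one common way to express a percentage improvement for a lower-is-better metric; that definition is an assumption and may differ from what `verify_arithmetic.py` actually checks. All numbers are made up for illustration.

```python
def fertility_disparity(f_ar: float, f_az: float) -> float:
    """Relative gap |F_ar - F_az| / max(F_ar, F_az) between the two scripts."""
    return abs(f_ar - f_az) / max(f_ar, f_az)


def relative_improvement(baseline: float, ours: float) -> float:
    """Percentage reduction relative to a baseline (assumed definition)."""
    return 100.0 * (baseline - ours) / baseline


# Illustrative values only, not taken from test_set_results.csv.
print(fertility_disparity(2.0, 1.5))   # 0.25
print(relative_improvement(2.0, 1.5))  # 25.0
```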
| |
| --- |
| |
| ## Result Files |
| |
| ### Primary Data |
| |
| | File | Rows | Description | |
| |------|------|-------------| |
| | `test_set_results.csv` | 40 | **Single source of truth.** Test-set metrics for all tokenizers: fertility, CPT, disparity, Gini, entropy, exact match. | |
| | `tokenizer_results.csv` | 24 | Full benchmark results incl. morphological metrics and per-script breakdown. Covers 8K-32K only. | |
| | `morph_large_vocab_results.csv` | 16 | Morphological metrics for 80K/110K tokenizers. | |
| | `bootstrap_ci_test_set.csv` | 24 | Bootstrap 95% CIs (500 resamples) for fertility and CPT. | |
| | `external_comparison.csv` | 12 | Our best 3 + 9 external tokenizers: fertility, CPT, disparity, exact match per script. | |
| | `codeswitch_results.csv` | 5 | Code-switching evaluation with mixed-script category. | |
| | `doda_independent_results.csv` | 12 | DODa independent validation (Arabizi dictionary). | |
|
|
| ### Supporting Data |
|
|
| | File / Directory | Description | |
| |------------------|-------------| |
| | `corpora/` | Train/validation/test text splits: `{train,val,test}_{ar,az,mi}.txt` | |
| | `morphology/farasa_segmentations.json` | Cached Farasa morphological segmentations for Arabic texts (~99 MB) | |
| | `tokenizers/` | Raw HuggingFace `tokenizers` JSON files (60 files: 20 shared + 40 concat halves) | |
| | `transformers_tokenizers/` | Tokenizers exported for `transformers` library use | |
| | `doda_sample_10k.txt` | 10K-line Arabizi sample from DODa dataset | |
| | `benchmark_report.md` | Auto-generated Markdown summary report | |
|
|
| --- |
|
|
| ## Reproduction Guide |
|
|
| ### Full Reproduction (from scratch) |
|
|
| ```bash |
| # 1. Train 8K-32K tokenizers + initial evaluation + morphology |
| python script.py |
| |
| # 2. Train 80K + 110K tokenizers |
| python train_large_vocab.py |
| python train_remaining.py |
| |
| # 3. Evaluate on test set (appends 80K/110K to results) |
| python eval_test_set.py |
| python eval_new_and_append.py |
| python eval_missing.py |
| |
| # 4. Compute morphological metrics for large vocabs |
| python eval_morph_large.py |
| |
| # 5. Bootstrap confidence intervals |
| python bootstrap_test_set.py |
| |
| # 6. External tokenizer comparison |
| python eval_all_externals.py |
| |
| # 7. Code-switching + DODa validation |
| python eval_codeswitch_and_new_baselines.py |
| python eval_doda_independent.py |
| |
| # 8. Generate figures + reports |
| python regen_figures.py |
| python gen_report.py |
| |
| # 9. Verify all numeric claims |
| python verify_arithmetic.py |
| ``` |
|
|
| ### Requirements |
|
|
| - Python 3.10+ |
| - `tokenizers`, `transformers`, `datasets` (HuggingFace stack) |
| - `scikit-learn` (KMeans for morphological consistency) |
| - `regex` (Unicode grapheme segmentation) |
| - `numpy`, `pandas`, `matplotlib`, `seaborn` |
| - `tqdm` |
| - Farasa JAR (for morphological segmentation; pre-cached in `morphology/`) |
| - `HF_TOKEN` environment variable (for loading external models) |
|
|
| --- |
|
|
| ## Key Design Decisions |
|
|
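| 1. **Monolithic `script.py`**: The main pipeline runs at import level (no `__main__` guard). This is intentional and supports checkpoint-based resumption: the script detects existing artifacts and skips completed stages. As a consequence, importing the module from other code executes the whole pipeline, so run it only as `python script.py`. |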
|
|
| 2. **Duplicated helper functions**: Functions like `detect_script()`, `count_graphemes()`, and `normalize_decode()` are copied across evaluation scripts rather than shared via import. This ensures each eval script is self-contained and runnable independently. |
|
|
| 3. **Test-set-only evaluation**: All paper numbers come from `eval_test_set.py`, not `script.py`'s full-corpus evaluation. The test set (11,282 sentences per script) provides unbiased estimates. |
|
|
| 4. **Concatenated architecture**: Each concat config is stored as two JSON files (`concat_ar_*.json` + `concat_az_*.json`). The evaluator loads both and applies ID shifting at inference time. |
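The ID shifting in decision 4 can be illustrated with toy stand-ins for the two JSON halves. In this sketch the Arabic-script half keeps its IDs and the Arabizi half is offset by the Arabic vocabulary size; which half is shifted, the routing rule, and all names here are assumptions and not the evaluator's actual code.

```python
def is_arabic_script(text: str) -> bool:
    """Toy check: True if any character falls in the basic Arabic block."""
    return any("\u0600" <= ch <= "\u06FF" for ch in text)


def encode_concat(text, encode_ar, encode_az, ar_vocab_size):
    """Route text to one half and map its IDs into a joint ID space.

    Arabic-script IDs stay in [0, ar_vocab_size); Arabizi IDs are shifted
    up by ar_vocab_size so the two ranges never collide.
    """
    if is_arabic_script(text):
        return encode_ar(text)
    return [i + ar_vocab_size for i in encode_az(text)]


# Toy whitespace "tokenizers" standing in for concat_ar_* and concat_az_*.
vocab_ar = {"سلام": 0, "عليكم": 1}
vocab_az = {"salam": 0, "3likom": 1}


def encode_ar(s):
    return [vocab_ar[w] for w in s.split()]


def encode_az(s):
    return [vocab_az[w] for w in s.split()]


print(encode_concat("salam 3likom", encode_ar, encode_az, len(vocab_ar)))
```

With two-entry toy vocabularies, the Arabizi sentence maps to IDs 2 and 3, which cannot be confused with the Arabic-script IDs 0 and 1.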
|
|
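The skip-if-exists behaviour behind the checkpoint-based resumption in decision 1 can be sketched as follows. The helper name and the artifact path are hypothetical; `script.py` may detect completed stages differently.

```python
from pathlib import Path


def run_stage(output: Path, build) -> bool:
    """Call build(output) only if the artifact is missing.

    Returns True when the stage ran and False when it was skipped.
    Hypothetical helper illustrating checkpoint-based resumption.
    """
    if output.exists():
        return False  # checkpoint hit: artifact already on disk
    output.parent.mkdir(parents=True, exist_ok=True)
    build(output)
    return True


# Example call (path and builder are assumptions):
# run_stage(Path("results/bootstrap_ci.csv"), lambda p: p.write_text("..."))
```

A stage that crashes halfway can leave a partial file behind, so writing to a temporary name and renaming on success is a common refinement.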