# Code Documentation — Darija Tokenizer Benchmark

This document describes every script, data file, and output artifact in the benchmark codebase.

---

## Overview

The benchmark pipeline consists of four stages: **training**, **evaluation**, **analysis**, and **reporting**. Each stage is implemented by standalone Python scripts. The diagram below shows the data flow:

```
OiQ/daa-pairs (dataset)
        │
        ▼
┌──────────────────┐     ┌─────────────────────┐
│  script.py       │────▶│  results/tokenizers/ │  (60 raw JSON files)
│  (train 8K-32K)  │     │  results/transformers│  (transformers format)
└──────────────────┘     └─────────────────────┘
        │                          ▲
        ▼                          │
┌──────────────────┐     ┌─────────────────────┐
│train_large_vocab │────▶│  80K + 110K configs  │
│train_remaining   │     │  (16 additional)     │
└──────────────────┘     └─────────────────────┘
                               │
                               ▼
┌──────────────────────────────────────────────┐
│  EVALUATION SCRIPTS                           │
│  ├── eval_test_set.py       → test_set_results│
│  ├── eval_new_and_append.py → append 80K/110K │
│  ├── eval_missing.py        → fill gaps       │
│  ├── eval_morph_large.py    → morph 80K/110K  │
│  ├── bootstrap_test_set.py  → 95% CIs         │
│  ├── eval_all_externals.py  → external comp.  │
│  ├── eval_codeswitch_...    → code-switching  │
│  └── eval_doda_independent  → DODa validation │
└──────────────────────────────────────────────┘
        │
        ▼
┌──────────────────┐     ┌─────────────────────┐
│  regen_figures   │────▶│  figures/*.png       │
│  gen_report      │────▶│  benchmark_report.md │
│  verify_arithmetic│───▶│  (stdout validation) │
└──────────────────┘     └─────────────────────┘
```

---

## Training Scripts

### `script.py` — Master Benchmark Pipeline
**Lines:** ~2032 | **Type:** Runs at import (no `__main__` guard)

The monolithic entry point. Loads the `OiQ/daa-pairs` dataset, trains 24 tokenizers (4 algorithms x 2 architectures x 3 vocab sizes: 8K, 16K, 32K), evaluates them on the full training+test corpus, computes morphological metrics via Farasa, generates bootstrap confidence intervals, and produces all plots and reports.

**Key class:** `ProductionMetricsEvaluator` — implements script detection, tokenization, Gini coefficient, and all metric computations.

**Inputs:**
- `OiQ/daa-pairs` dataset (via `huggingface_hub`)
- `BenchmarkConfig` dataclass (vocab sizes, algorithms, hyperparameters)

**Outputs:**
- `results/tokenizers/*.json` — 36 raw tokenizer files (12 shared + 24 concat halves)
- `results/transformers_tokenizers/` — transformers-compatible exports
- `results/tokenizer_results.csv` / `.json` — full metrics with morphological data
- `results/bootstrap_ci.csv` — bootstrap CIs
- `results/benchmark_report.md` — auto-generated summary
- `results/morphology/farasa_segmentations.json` — cached Farasa segmentations (~99 MB)
- `results/plots/*.png` — all visualization figures
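
The Gini coefficient listed among the `ProductionMetricsEvaluator` metrics can be illustrated with a minimal sketch. The source does not say which quantity the coefficient is computed over, so the example assumes a list of per-token frequency counts; the function name `gini` and the sample inputs are hypothetical and not taken from `script.py`.

```python
from typing import Sequence


def gini(values: Sequence[float]) -> float:
    """Gini coefficient of non-negative values (0 = perfectly even)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0  # assumption: treat empty or all-zero input as perfectly even
    # Rank-weighted sum over the ascending values (ranks start at 1).
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


# Illustrative counts only, not benchmark data.
print(gini([1, 1, 1, 1]))  # 0.0  (every token equally frequent)
print(gini([0, 0, 0, 4]))  # 0.75 (one token carries all the mass)
```

With this formulation the maximum for `n` values is `(n - 1) / n`, so the coefficient approaches 1 only for large vocabularies.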

---

### `train_large_vocab.py` — Train 80K/110K Tokenizers
**Lines:** ~146

Trains additional tokenizers at 80K and 110K vocabulary sizes to match DarijaBERT-ar (80K) and DarijaBERT-az (110K) for fair head-to-head comparison.

**Inputs:** `results/corpora/train_{ar,az}.txt`

**Outputs:** `results/tokenizers/{shared,concat_ar,concat_az}_{algo}_{80000,110000}.json`

---

### `train_remaining.py` — Train Remaining Tokenizers
**Lines:** ~134

Fills in the last missing tokenizer configurations (3 shared 110K + 6 concat 110K + 1 concat 80K BBPE) that `train_large_vocab.py` did not cover.

**Inputs:** `results/corpora/train_{ar,az}.txt`

**Outputs:** Remaining `results/tokenizers/*.json` files

---

### `retrain_missing_and_compare.py` — Retrain + Full Re-evaluation
**Lines:** ~558

Retrains 4 missing concat 32K tokenizers, exports to transformers format, re-evaluates all 24 configs, fixes a WordPiece exact-match bug, and runs the external comparison pipeline.

**Inputs:** Training corpora, HF external models

**Outputs:** Updated tokenizer JSONs, `external_comparison.csv`, external comparison plot

---

## Evaluation Scripts

### `eval_test_set.py` — Test-Set Evaluation (Single Source of Truth)
**Lines:** ~227

Re-evaluates all tokenizers on the held-out **test set only** (11,282 sentences per script). This is the authoritative evaluation used in all paper tables and percentage claims.

**Key function:** `normalize_decode()` — fixes Metaspace double-space artifacts in WordPiece decoders.

**Inputs:**
- `results/tokenizers/*.json`
- `results/corpora/test_{ar,az,mi}.txt`

**Outputs:** `results/test_set_results.csv` / `.json` (40 rows)
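
The body of `normalize_decode()` is not reproduced in this document. As a rough illustration of the idea, the sketch below collapses runs of whitespace before comparing decoded text with the original. Both function names are hypothetical, and the assumption that exact match is checked on whitespace-normalized strings is ours.

```python
import re


def normalize_decode_sketch(decoded: str) -> str:
    """Collapse whitespace runs (e.g. decoder double spaces) and trim the ends."""
    return re.sub(r"\s+", " ", decoded).strip()


def exact_match_sketch(original: str, decoded: str) -> bool:
    """Assumed round-trip check: compare both sides after normalization."""
    return normalize_decode_sketch(original) == normalize_decode_sketch(decoded)


# A decoder that emits a double space no longer counts as a round-trip failure.
print(exact_match_sketch("salam 3likom", "salam  3likom"))  # True
```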

---

### `eval_new_and_append.py` — Append 80K/110K Results
**Lines:** ~144

Evaluates the newly trained 80K/110K tokenizers and appends their rows to `test_set_results.csv`.

**Inputs:** `test_set_results.csv`, tokenizers, test corpora

**Outputs:** Updated `test_set_results.csv` (grows from 24 to 40 rows)

---

### `eval_missing.py` — Fill Single Gap
**Lines:** ~124

Evaluates the one remaining missing tokenizer (`concat_bbpe_80000`) and merges it into the results CSV.

---

### `eval_morph_large.py` — Morphological Metrics for 80K/110K
**Lines:** ~297

Computes morphological fidelity metrics (edit distance and consistency F1) for the 16 large-vocabulary tokenizers (80K, 110K) that were not covered by the original `script.py` morph evaluation. Uses the same Farasa cache and identical algorithms.

**Inputs:**
- `results/morphology/farasa_segmentations.json`
- `results/corpora/test_ar.txt`
- 16 tokenizer JSON files

**Outputs:** `results/morph_large_vocab_results.csv` (16 rows)

---

### `bootstrap_test_set.py` — Bootstrap Confidence Intervals
**Lines:** ~163

Computes 95% bootstrap confidence intervals (500 resamples) for fertility and CPT on the test set.

**Inputs:** Tokenizers, `test_{ar,az,mi}.txt`

**Outputs:** `results/bootstrap_ci_test_set.csv` (24 rows for 8K-32K configs)
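
A percentile bootstrap of the kind described above can be sketched with the standard library alone. The example assumes fertility is the corpus-level ratio of tokens to words and that sentences are the resampling unit; neither detail is confirmed by this document, and the function and parameter names are hypothetical.

```python
import random
from typing import List, Tuple


def bootstrap_fertility_ci(
    token_counts: List[int],
    word_counts: List[int],
    n_resamples: int = 500,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap CI for tokens-per-word, resampling sentences."""
    rng = random.Random(seed)
    n = len(token_counts)
    stats = []
    for _ in range(n_resamples):
        # Draw n sentence indices with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        tokens = sum(token_counts[i] for i in idx)
        words = sum(word_counts[i] for i in idx)
        stats.append(tokens / words)  # assumes every sentence has >= 1 word
    stats.sort()
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Toy per-sentence counts, not benchmark data.
low, high = bootstrap_fertility_ci([3, 4, 5, 6], [2, 2, 3, 3])
print(f"95% CI: [{low:.3f}, {high:.3f}]")
```

Fixing the seed keeps the interval reproducible across runs, which matters when the CSV is regenerated.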

---

## External Comparison Scripts

### `eval_all_externals.py` — Evaluate 9 External Tokenizers
**Lines:** ~281

Evaluates 9 external Arabic/Darija tokenizers from HuggingFace (CaMeLBERT-MSA, Asafaya-BERT, Aranizer-SP-86k, B2BERT, DarijaBERT-ar, DarijaBERT-az, Darija-Tokenizer, Translit-Darija, Qwen2.5-Darija) alongside our best 3 tokenizers.

**Inputs:** HF model repos (requires `HF_TOKEN`), test corpora

**Outputs:** `results/external_comparison.csv` / `.json`, comparison plot

---

### `compare_with_external.py` — External Comparison (Earlier Version)
**Lines:** ~269

Earlier version of the external comparison, comparing our 8K/16K/32K tokenizers against 5 external models.

---

### `eval_and_compare.py` — Combined Evaluation + Comparison
**Lines:** ~277

Combines internal evaluation and external comparison into a single pipeline run.

---

### `eval_codeswitch_and_new_baselines.py` — Code-Switching Evaluation
**Lines:** ~373

Evaluates tokenizers on mixed-script (code-switched) texts as a separate category. Also adds the `atlasia/darija_bpe_tokenizer` baseline and evaluates all tokenizers on DODa.

**Outputs:** `results/codeswitch_results.csv` / `.json`

---

### `eval_doda_independent.py` — DODa Independent Validation
**Lines:** ~196

Evaluates all tokenizers on `atlasia/DODa` (87K Arabizi dictionary entries), sampled to 10K entries. Serves as an independent, out-of-training-distribution validation.

**Outputs:** `results/doda_independent_results.csv` / `.json`

---

## Utility Scripts

### `fix_tokenizer_decoders.py` — Decoder Bug Fixer
**Lines:** ~152

Patches three decoder bugs in tokenizer JSON files:
1. WordPiece double-space artifact (Metaspace decoder producing `"  "` instead of `" "`)
2. NULL decoder in `concat_bpe_16000` (missing Metaspace decoder)
3. Missing WordPiece sub-decoder in `concat_wordpiece_16000`

**Warning:** Modifies tokenizer files in place. No backup is created.
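
Because the patch is applied in place, it may be worth copying the tokenizer JSON files somewhere safe first. The helper below is a hypothetical sketch and is not part of `fix_tokenizer_decoders.py`; the backup directory name in the usage note is likewise an assumption.

```python
import shutil
from pathlib import Path
from typing import List


def backup_tokenizers(src_dir: str, backup_dir: str) -> List[Path]:
    """Copy every *.json file in src_dir into backup_dir and return the copies."""
    src = Path(src_dir)
    dst = Path(backup_dir)
    dst.mkdir(parents=True, exist_ok=True)
    copied = []
    for path in sorted(src.glob("*.json")):
        target = dst / path.name
        shutil.copy2(path, target)  # copy2 also preserves timestamps
        copied.append(target)
    return copied
```

For example, `backup_tokenizers("results/tokenizers", "results/tokenizers_backup")` could be run once before invoking the fixer.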

---

### `gen_report.py` — Report Generator
**Lines:** ~70

Generates a Markdown summary report (`benchmark_report.md`) from `tokenizer_results.csv`, including best-by-vocabulary tables and full results.

---

### `regen_figures.py` — Figure Regenerator
**Lines:** ~313

Regenerates the 5 main paper figures from `test_set_results.csv` with larger fonts (14pt base) for readability. Reads from CSV and writes PNGs to the paper's `figures/` directory.

**Inputs:** `results/test_set_results.csv`

**Outputs:** `figures/fertility_overall_comparison_v2.png`, `fertility_overall_trends.png`, `fertility_disparity_comparison_v2.png`, `fertility_disparity_heatmap_v2.png`, `external_comparison.png`

---

### `verify_arithmetic.py` — Numeric Claims Verification
**Lines:** ~143

Validates every percentage claim in the paper and README against `test_set_results.csv`. Checks:
- Disparity formula: `|F_ar - F_az| / max(F_ar, F_az)`
- Overall fertility derivability from per-script values
- All percentage improvement claims (27-34%, 40-50%, etc.)

**Outputs:** Stdout validation report (no files written)
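
The disparity formula checked above translates directly into code. This is a minimal sketch: the function name is hypothetical, the zero-fertility guard is our assumption, and the sample values are illustrative rather than taken from `test_set_results.csv`.

```python
def fertility_disparity(f_ar: float, f_az: float) -> float:
    """|F_ar - F_az| / max(F_ar, F_az): relative fertility gap between scripts."""
    largest = max(f_ar, f_az)
    if largest == 0:
        return 0.0  # assumption: define disparity as 0 when both fertilities are 0
    return abs(f_ar - f_az) / largest


# Illustrative values only.
print(fertility_disparity(2.0, 1.5))  # 0.25
```

The measure is symmetric in its two arguments and bounded by 1 for non-negative fertilities, so values are comparable across tokenizers.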

---

## Result Files

### Primary Data

| File | Rows | Description |
|------|------|-------------|
| `test_set_results.csv` | 40 | **Single source of truth.** Test-set metrics for all tokenizers: fertility, CPT, disparity, Gini, entropy, exact match. |
| `tokenizer_results.csv` | 24 | Full benchmark results incl. morphological metrics and per-script breakdown. Covers 8K-32K only. |
| `morph_large_vocab_results.csv` | 16 | Morphological metrics for 80K/110K tokenizers. |
| `bootstrap_ci_test_set.csv` | 24 | Bootstrap 95% CIs (500 resamples) for fertility and CPT. |
| `external_comparison.csv` | 12 | Our best 3 + 9 external tokenizers: fertility, CPT, disparity, exact match per script. |
| `codeswitch_results.csv` | 5 | Code-switching evaluation with mixed-script category. |
| `doda_independent_results.csv` | 12 | DODa independent validation (Arabizi dictionary). |

### Supporting Data

| File / Directory | Description |
|------------------|-------------|
| `corpora/` | Train/validation/test text splits: `{train,val,test}_{ar,az,mi}.txt` |
| `morphology/farasa_segmentations.json` | Cached Farasa morphological segmentations for Arabic texts (~99 MB) |
| `tokenizers/` | Raw HuggingFace `tokenizers` JSON files (60 files: 20 shared + 40 concat halves) |
| `transformers_tokenizers/` | Tokenizers exported for `transformers` library use |
| `doda_sample_10k.txt` | 10K-line Arabizi sample from DODa dataset |
| `benchmark_report.md` | Auto-generated Markdown summary report |

---

## Reproduction Guide

### Full Reproduction (from scratch)

```bash
# 1. Train 8K-32K tokenizers + initial evaluation + morphology
python script.py

# 2. Train 80K + 110K tokenizers
python train_large_vocab.py
python train_remaining.py

# 3. Evaluate on test set (appends 80K/110K to results)
python eval_test_set.py
python eval_new_and_append.py
python eval_missing.py

# 4. Compute morphological metrics for large vocabs
python eval_morph_large.py

# 5. Bootstrap confidence intervals
python bootstrap_test_set.py

# 6. External tokenizer comparison
python eval_all_externals.py

# 7. Code-switching + DODa validation
python eval_codeswitch_and_new_baselines.py
python eval_doda_independent.py

# 8. Generate figures + reports
python regen_figures.py
python gen_report.py

# 9. Verify all numeric claims
python verify_arithmetic.py
```

### Requirements

- Python 3.10+
- `tokenizers`, `transformers`, `datasets` (HuggingFace stack)
- `scikit-learn` (KMeans for morphological consistency)
- `regex` (Unicode grapheme segmentation)
- `numpy`, `pandas`, `matplotlib`, `seaborn`
- `tqdm`
- Farasa JAR (for morphological segmentation; pre-cached in `morphology/`)
- `HF_TOKEN` environment variable (for loading external models)

---

## Key Design Decisions

1. **Monolithic `script.py`**: The main pipeline runs at import level (no `__main__` guard). This is intentional for checkpoint-based resumption — the script detects existing artifacts and skips completed stages.

2. **Duplicated helper functions**: Functions like `detect_script()`, `count_graphemes()`, and `normalize_decode()` are copied across evaluation scripts rather than shared via import. This ensures each eval script is self-contained and runnable independently.

3. **Test-set-only evaluation**: All paper numbers come from `eval_test_set.py`, not `script.py`'s full-corpus evaluation. The test set (11,282 sentences per script) provides unbiased estimates.

4. **Concatenated architecture**: Each concat config is stored as two JSON files (`concat_ar_*.json` + `concat_az_*.json`). The evaluator loads both and applies ID shifting at inference time.
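
The ID shifting used by the concatenated architecture can be sketched with toy encoders. Two details are assumptions, because the source only says that IDs are shifted at inference time: that the Arabizi half is offset by the size of the Arabic-script vocabulary, and that each text is routed to one half by script detection. All names below are hypothetical.

```python
from typing import Callable, Dict, List

Encoder = Callable[[str], List[int]]


def make_concat_encoder(
    encode_ar: Encoder,
    encode_az: Encoder,
    ar_vocab_size: int,
    is_arabic_script: Callable[[str], bool],
) -> Encoder:
    """Combine two per-script encoders into one non-overlapping ID space."""
    def encode(text: str) -> List[int]:
        if is_arabic_script(text):
            return encode_ar(text)
        # Assumed convention: Arabizi IDs start right after the Arabic vocabulary.
        return [i + ar_vocab_size for i in encode_az(text)]
    return encode


def toy_encoder(vocab: Dict[str, int]) -> Encoder:
    """Whitespace lookup encoder standing in for a trained tokenizer."""
    return lambda text: [vocab[w] for w in text.split()]


def looks_arabic(text: str) -> bool:
    """Crude script check: any character in the basic Arabic Unicode block."""
    return any("\u0600" <= ch <= "\u06FF" for ch in text)


AR_VOCAB = {"سلام": 0, "عليكم": 1}
AZ_VOCAB = {"salam": 0, "3likom": 1}

encode = make_concat_encoder(
    toy_encoder(AR_VOCAB), toy_encoder(AZ_VOCAB), len(AR_VOCAB), looks_arabic
)
print(encode("سلام عليكم"))    # [0, 1]
print(encode("salam 3likom"))  # [2, 3]
```

Shifting at load time rather than storing pre-shifted IDs keeps each half usable as a standalone tokenizer file.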