# Code Documentation — Darija Tokenizer Benchmark
This document describes every script, data file, and output artifact in the benchmark codebase.
---
## Overview
The benchmark pipeline consists of four stages: **training**, **evaluation**, **analysis**, and **reporting**. Each stage is implemented by standalone Python scripts. The diagram below shows the data flow:
```
OiQ/daa-pairs (dataset)
┌──────────────────┐ ┌─────────────────────┐
│ script.py │────▶│ results/tokenizers/ │ (60 raw JSON files)
│ (train 8K-32K) │ │ results/transformers│ (transformers format)
└──────────────────┘ └─────────────────────┘
│ ▲
▼ │
┌──────────────────┐ ┌─────────────────────┐
│train_large_vocab │────▶│ 80K + 110K configs │
│train_remaining │ │ (16 additional) │
└──────────────────┘ └─────────────────────┘
┌──────────────────────────────────────────────┐
│ EVALUATION SCRIPTS │
│ ├── eval_test_set.py → test_set_results│
│ ├── eval_new_and_append.py → append 80K/110K │
│ ├── eval_missing.py → fill gaps │
│ ├── eval_morph_large.py → morph 80K/110K │
│ ├── bootstrap_test_set.py → 95% CIs │
│ ├── eval_all_externals.py → external comp. │
│ ├── eval_codeswitch_... → code-switching │
│ └── eval_doda_independent → DODa validation │
└──────────────────────────────────────────────┘
┌──────────────────┐ ┌─────────────────────┐
│ regen_figures │────▶│ figures/*.png │
│ gen_report │────▶│ benchmark_report.md │
│ verify_arithmetic│───▶│ (stdout validation) │
└──────────────────┘ └─────────────────────┘
```
---
## Training Scripts
### `script.py` — Master Benchmark Pipeline
**Lines:** ~2032 | **Type:** Runs at import (no `__main__` guard)
The monolithic entry point. Loads the `OiQ/daa-pairs` dataset, trains 24 tokenizers (4 algorithms x 2 architectures x 3 vocab sizes: 8K, 16K, 32K), evaluates them on the full training+test corpus, computes morphological metrics via Farasa, generates bootstrap confidence intervals, and produces all plots and reports.
**Key class:** `ProductionMetricsEvaluator` — implements script detection, tokenization, Gini coefficient, and all metric computations.
**Inputs:**
- `OiQ/daa-pairs` dataset (via `huggingface_hub`)
- `BenchmarkConfig` dataclass (vocab sizes, algorithms, hyperparameters)
**Outputs:**
- `results/tokenizers/*.json` — 48 raw tokenizer files (24 shared + 24 concat halves)
- `results/transformers_tokenizers/` — transformers-compatible exports
- `results/tokenizer_results.csv` / `.json` — full metrics with morphological data
- `results/bootstrap_ci.csv` — bootstrap CIs
- `results/benchmark_report.md` — auto-generated summary
- `results/morphology/farasa_segmentations.json` — cached Farasa segmentations (~99 MB)
- `results/plots/*.png` — all visualization figures
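The Gini coefficient that `ProductionMetricsEvaluator` reports measures how unevenly token usage is spread over the vocabulary. The sketch below uses the standard rank-weighted formula and assumes the input is a list of per-token frequency counts. The function name `gini_coefficient` is illustrative and is not taken from `script.py`.

```python
from typing import Iterable


def gini_coefficient(counts: Iterable[float]) -> float:
    """Gini coefficient of non-negative counts (0 = perfectly uniform)."""
    xs = sorted(counts)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted sum over the ascending-sorted values.
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2.0 * weighted / (n * total) - (n + 1) / n
```

A vocabulary where every token is used equally often scores 0, and the score rises as usage concentrates on fewer tokens.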
---
### `train_large_vocab.py` — Train 80K/110K Tokenizers
**Lines:** ~146
Trains additional tokenizers at 80K and 110K vocabulary sizes to match DarijaBERT-ar (80K) and DarijaBERT-az (110K) for fair head-to-head comparison.
**Inputs:** `results/corpora/train_{ar,az}.txt`
**Outputs:** `results/tokenizers/{shared,concat_ar,concat_az}_{algo}_{80000,110000}.json`
---
### `train_remaining.py` — Train Remaining Tokenizers
**Lines:** ~134
Fills in the last missing tokenizer configurations (3 shared 110K + 6 concat 110K + 1 concat 80K BBPE) that `train_large_vocab.py` did not cover.
**Inputs:** `results/corpora/train_{ar,az}.txt`
**Outputs:** Remaining `results/tokenizers/*.json` files
---
### `retrain_missing_and_compare.py` — Retrain + Full Re-evaluation
**Lines:** ~558
Retrains 4 missing concat 32K tokenizers, exports to transformers format, re-evaluates all 28 configs, fixes a WordPiece exact-match bug, and runs the external comparison pipeline.
**Inputs:** Training corpora, HF external models
**Outputs:** Updated tokenizer JSONs, `external_comparison.csv`, external comparison plot
---
## Evaluation Scripts
### `eval_test_set.py` — Test-Set Evaluation (Single Source of Truth)
**Lines:** ~227
Re-evaluates all tokenizers on the held-out **test set only** (11,282 sentences per script). This is the authoritative evaluation used in all paper tables and percentage claims.
**Key function:** `normalize_decode()` — fixes Metaspace double-space artifacts in WordPiece decoders.
**Inputs:**
- `results/tokenizers/*.json`
- `results/corpora/test_{ar,az,mi}.txt`
**Outputs:** `results/test_set_results.csv` / `.json` (40 rows)
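The body of `normalize_decode()` is not reproduced in this document. As a rough sketch of what such a fix could look like, the code below assumes the normalization collapses runs of spaces and trims both ends before the exact-match comparison. Both function names are hypothetical.

```python
import re

_SPACE_RUN = re.compile(r" {2,}")


def normalize_decode_sketch(decoded: str) -> str:
    """Collapse runs of spaces left by the decoder and trim both ends."""
    return _SPACE_RUN.sub(" ", decoded).strip()


def exact_match(original: str, decoded: str) -> bool:
    """Compare round-tripped text to the source after normalization."""
    return normalize_decode_sketch(decoded) == normalize_decode_sketch(original)
```

Without this step, a decoder that emits a double space would fail exact match even when every token round-trips correctly.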
---
### `eval_new_and_append.py` — Append 80K/110K Results
**Lines:** ~144
Evaluates the newly trained 80K/110K tokenizers and appends their rows to `test_set_results.csv`.
**Inputs:** `test_set_results.csv`, tokenizers, test corpora
**Outputs:** Updated `test_set_results.csv` (grows from 24 to 40 rows)
---
### `eval_missing.py` — Fill Single Gap
**Lines:** ~124
Evaluates the one remaining missing tokenizer (`concat_bbpe_55000`) and merges it into the results CSV.
---
### `eval_morph_large.py` — Morphological Metrics for 80K/110K
**Lines:** ~297
Computes morphological fidelity metrics (edit distance and consistency F1) for the 16 large-vocabulary tokenizers (80K, 110K) that the original morphological evaluation in `script.py` did not cover. It reuses the same Farasa cache and the same algorithms.
**Inputs:**
- `results/morphology/farasa_segmentations.json`
- `results/corpora/test_ar.txt`
- 16 tokenizer JSON files
**Outputs:** `results/morph_large_vocab_results.csv` (16 rows)
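This document does not spell out how the edit-distance metric is defined. One plausible reading, shown here purely as an assumption, is a Levenshtein distance between the tokenizer's pieces for a word and the Farasa segments for the same word. The function name and the example segments are hypothetical.

```python
from typing import Sequence


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two sequences of segments."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]
```

For example, tokenizer pieces `["w", "al", "ktab"]` compared against reference segments `["wal", "ktab"]` give a distance of 2.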
---
### `bootstrap_test_set.py` — Bootstrap Confidence Intervals
**Lines:** ~163
Computes 95% bootstrap confidence intervals (500 resamples) for fertility and CPT on the test set.
**Inputs:** Tokenizers, `test_{ar,az,mi}.txt`
**Outputs:** `results/bootstrap_ci_test_set.csv` (24 rows for 8K-32K configs)
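A percentile bootstrap with 500 resamples can be sketched as follows. This is a minimal illustration, not the code in `bootstrap_test_set.py`: it assumes fertility is defined as tokens per word at corpus level and that resampling is done over sentences. The function name, the seed, and the input layout are all assumptions.

```python
import random
from typing import List, Sequence, Tuple


def bootstrap_ci(
    tokens_per_sentence: Sequence[int],
    words_per_sentence: Sequence[int],
    n_resamples: int = 500,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap CI for corpus-level fertility (tokens / words).

    Assumes every sentence contains at least one word.
    """
    rng = random.Random(seed)
    n = len(tokens_per_sentence)
    stats: List[float] = []
    for _ in range(n_resamples):
        # Resample sentence indices with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        tok = sum(tokens_per_sentence[i] for i in idx)
        wrd = sum(words_per_sentence[i] for i in idx)
        stats.append(tok / wrd)
    stats.sort()
    lo = stats[int((alpha / 2) * n_resamples)]
    hi = stats[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lo, hi
```

The same routine applies to CPT by passing character counts and token counts instead.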
---
## External Comparison Scripts
### `eval_all_externals.py` — Evaluate 9 External Tokenizers
**Lines:** ~281
Evaluates 9 external Arabic/Darija tokenizers from HuggingFace (CaMeLBERT-MSA, Asafaya-BERT, Aranizer-SP-86k, B2BERT, DarijaBERT-ar, DarijaBERT-az, Darija-Tokenizer, Translit-Darija, Qwen2.5-Darija) alongside our best 3 tokenizers.
**Inputs:** HF model repos (requires `HF_TOKEN`), test corpora
**Outputs:** `results/external_comparison.csv` / `.json`, comparison plot
---
### `compare_with_external.py` — External Comparison (Earlier Version)
**Lines:** ~269
Earlier version of the external comparison, comparing our 8K/16K/32K tokenizers against 5 external models.
---
### `eval_and_compare.py` — Combined Evaluation + Comparison
**Lines:** ~277
Combines internal evaluation and external comparison into a single pipeline run.
---
### `eval_codeswitch_and_new_baselines.py` — Code-Switching Evaluation
**Lines:** ~373
Evaluates tokenizers on mixed-script (code-switched) texts as a separate category. Also adds the `atlasia/darija_bpe_tokenizer` baseline and evaluates all tokenizers on DODa.
**Outputs:** `results/codeswitch_results.csv` / `.json`
---
### `eval_doda_independent.py` — DODa Independent Validation
**Lines:** ~196
Evaluates all tokenizers on `atlasia/DODa` (87K Arabizi dictionary entries), sampled to 10K entries. Serves as an independent, out-of-training-distribution validation.
**Outputs:** `results/doda_independent_results.csv` / `.json`
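Downsampling the dictionary to 10K entries can be made reproducible with a seeded sampler. The sketch below is one way to do it. The seed value and the function name are assumptions, and the actual script may sample differently.

```python
import random
from typing import List, Sequence


def sample_entries(entries: Sequence[str], k: int = 10_000, seed: int = 42) -> List[str]:
    """Return up to `k` entries drawn without replacement, reproducibly."""
    if len(entries) <= k:
        return list(entries)
    return random.Random(seed).sample(list(entries), k)
```

Fixing the seed keeps the evaluated subset identical across runs, so results stay comparable.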
---
## Utility Scripts
### `fix_tokenizer_decoders.py` — Decoder Bug Fixer
**Lines:** ~152
Patches three decoder bugs in tokenizer JSON files:
1. WordPiece double-space artifact (Metaspace decoder producing a double space `"  "` instead of a single space `" "`)
2. NULL decoder in `concat_bpe_16000` (missing Metaspace decoder)
3. Missing WordPiece sub-decoder in `concat_wordpiece_16000`
**Warning:** Modifies tokenizer files in place. No backup is created.
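Because the files are modified in place, it may be worth copying each one before patching. The helper below is a hypothetical wrapper, not part of `fix_tokenizer_decoders.py`. It assumes the patch is available as a callable that takes the file path.

```python
import shutil
from pathlib import Path
from typing import Callable


def backup_then_patch(path: Path, patch: Callable[[Path], None]) -> Path:
    """Copy `path` to `<name>.bak`, then apply `patch(path)` in place."""
    backup = path.with_name(path.name + ".bak")
    if not backup.exists():  # keep the earliest (unpatched) copy
        shutil.copy2(path, backup)
    patch(path)
    return backup
```

Running the fixer twice then leaves the first backup untouched, so the original decoder configuration can still be restored.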
---
### `gen_report.py` — Report Generator
**Lines:** ~70
Generates a Markdown summary report (`benchmark_report.md`) from `tokenizer_results.csv`, including best-by-vocabulary tables and full results.
---
### `regen_figures.py` — Figure Regenerator
**Lines:** ~313
Regenerates the 5 main paper figures from `test_set_results.csv` with larger fonts (14pt base) for readability. Reads from CSV and writes PNGs to the paper's `figures/` directory.
**Inputs:** `results/test_set_results.csv`
**Outputs:** `figures/fertility_overall_comparison_v2.png`, `fertility_overall_trends.png`, `fertility_disparity_comparison_v2.png`, `fertility_disparity_heatmap_v2.png`, `external_comparison.png`
---
### `verify_arithmetic.py` — Numeric Claims Verification
**Lines:** ~143
Validates every percentage claim in the paper and README against `test_set_results.csv`. Checks:
- Disparity formula: `|F_ar - F_az| / max(F_ar, F_az)`
- Whether overall fertility can be derived from the per-script values
- All percentage improvement claims (27-34%, 40-50%, etc.)
**Outputs:** Stdout validation report (no files written)
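The disparity formula listed above translates directly into code. The sketch below also includes a small tolerance check of the kind a verification script might use. The function names, the tolerance, and the numbers in the usage note are illustrative and are not taken from `verify_arithmetic.py` or the results files.

```python
def fertility_disparity(f_ar: float, f_az: float) -> float:
    """|F_ar - F_az| / max(F_ar, F_az); 0 means parity between scripts."""
    denom = max(f_ar, f_az)
    if denom <= 0:
        raise ValueError("fertility values must be positive")
    return abs(f_ar - f_az) / denom


def claim_holds(claimed: float, computed: float, tol: float = 5e-4) -> bool:
    """True when a reported value matches the recomputed one within `tol`."""
    return abs(claimed - computed) <= tol
```

With made-up fertilities of 2.0 and 1.5, the disparity is 0.5 / 2.0 = 0.25, and the result is the same whichever script is passed first.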
---
## Result Files
### Primary Data
| File | Rows | Description |
|------|------|-------------|
| `test_set_results.csv` | 40 | **Single source of truth.** Test-set metrics for all tokenizers: fertility, CPT, disparity, Gini, entropy, exact match. |
| `tokenizer_results.csv` | 24 | Full benchmark results incl. morphological metrics and per-script breakdown. Covers 8K-32K only. |
| `morph_large_vocab_results.csv` | 16 | Morphological metrics for 80K/110K tokenizers. |
| `bootstrap_ci_test_set.csv` | 24 | Bootstrap 95% CIs (500 resamples) for fertility and CPT. |
| `external_comparison.csv` | 12 | Our best 3 + 9 external tokenizers: fertility, CPT, disparity, exact match per script. |
| `codeswitch_results.csv` | 5 | Code-switching evaluation with mixed-script category. |
| `doda_independent_results.csv` | 12 | DODa independent validation (Arabizi dictionary). |
### Supporting Data
| File / Directory | Description |
|------------------|-------------|
| `corpora/` | Train/validation/test text splits: `{train,val,test}_{ar,az,mi}.txt` |
| `morphology/farasa_segmentations.json` | Cached Farasa morphological segmentations for Arabic texts (~99 MB) |
| `tokenizers/` | Raw HuggingFace `tokenizers` JSON files (60 files: 24 shared + 36 concat halves) |
| `transformers_tokenizers/` | Tokenizers exported for `transformers` library use |
| `doda_sample_10k.txt` | 10K-line Arabizi sample from DODa dataset |
| `benchmark_report.md` | Auto-generated Markdown summary report |
---
## Reproduction Guide
### Full Reproduction (from scratch)
```bash
# 1. Train 8K-32K tokenizers + initial evaluation + morphology
python script.py
# 2. Train 80K + 110K tokenizers
python train_large_vocab.py
python train_remaining.py
# 3. Evaluate on test set (appends 80K/110K to results)
python eval_test_set.py
python eval_new_and_append.py
python eval_missing.py
# 4. Compute morphological metrics for large vocabs
python eval_morph_large.py
# 5. Bootstrap confidence intervals
python bootstrap_test_set.py
# 6. External tokenizer comparison
python eval_all_externals.py
# 7. Code-switching + DODa validation
python eval_codeswitch_and_new_baselines.py
python eval_doda_independent.py
# 8. Generate figures + reports
python regen_figures.py
python gen_report.py
# 9. Verify all numeric claims
python verify_arithmetic.py
```
### Requirements
- Python 3.10+
- `tokenizers`, `transformers`, `datasets` (HuggingFace stack)
- `scikit-learn` (KMeans for morphological consistency)
- `regex` (Unicode grapheme segmentation)
- `numpy`, `pandas`, `matplotlib`, `seaborn`
- `tqdm`
- Farasa JAR (for morphological segmentation; the segmentations are pre-cached in `morphology/farasa_segmentations.json`)
- `HF_TOKEN` environment variable (for loading external models)
---
## Key Design Decisions
1. **Monolithic `script.py`**: The main pipeline runs at module level (no `__main__` guard), so importing the file executes the full pipeline. This is intentional for checkpoint-based resumption: the script detects existing artifacts and skips completed stages.
2. **Duplicated helper functions**: Functions like `detect_script()`, `count_graphemes()`, and `normalize_decode()` are copied across evaluation scripts rather than shared via import. This ensures each eval script is self-contained and runnable independently.
3. **Test-set-only evaluation**: All paper numbers come from `eval_test_set.py`, not `script.py`'s full-corpus evaluation. Because the test set (11,282 sentences per script) is held out, its metrics are not inflated by sentences the tokenizers were trained on.
4. **Concatenated architecture**: Each concat config is stored as two JSON files (`concat_ar_*.json` + `concat_az_*.json`). The evaluator loads both and applies ID shifting at inference time.
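The ID shifting described in item 4 can be sketched with two encoder callables and an offset. This is an illustration under stated assumptions, not the evaluator's code: it assumes Arabizi IDs are shifted by the Arabic half's vocabulary size, and that a text is routed by checking for characters in the basic Arabic Unicode block. All names are hypothetical.

```python
from typing import Callable, List

Encoder = Callable[[str], List[int]]


def is_arabic_script(text: str) -> bool:
    """Crude routing rule: any character in the basic Arabic block."""
    return any("\u0600" <= ch <= "\u06FF" for ch in text)


def make_concat_encoder(
    encode_ar: Encoder, encode_az: Encoder, ar_vocab_size: int
) -> Encoder:
    """Combine two half-tokenizers into one ID space.

    Arabic-script text keeps its IDs; other text is shifted by
    `ar_vocab_size` so the two ID ranges never overlap.
    """
    def encode(text: str) -> List[int]:
        if is_arabic_script(text):
            return encode_ar(text)
        return [i + ar_vocab_size for i in encode_az(text)]

    return encode
```

With the `tokenizers` library, the two callables could wrap `Tokenizer.from_file(...)` and `.encode(text).ids`, with `get_vocab_size()` supplying the offset.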