	# Code Documentation — Darija Tokenizer Benchmark

	This document describes every script, data file, and output artifact in the benchmark codebase.

	---

	## Overview

	The benchmark pipeline consists of four stages: training, evaluation, analysis, and reporting. Each stage is implemented by standalone Python scripts. The diagram below shows the data flow:

	```
	OiQ/daa-pairs (dataset)
	         │
	         ▼
	┌──────────────────┐     ┌─────────────────────┐
	│ script.py        │────▶│ results/tokenizers/ │  (60 raw JSON files)
	│ (train 8K-32K)   │     │ results/transformers│  (transformers format)
	└──────────────────┘     └─────────────────────┘
	         │                          ▲
	         ▼                          │
	┌──────────────────┐     ┌─────────────────────┐
	│train_large_vocab │────▶│ 80K + 110K configs  │
	│train_remaining   │     │ (16 additional)     │
	└──────────────────┘     └─────────────────────┘
	         │
	         ▼
	┌──────────────────────────────────────────────┐
	│              EVALUATION SCRIPTS              │
	│ ├── eval_test_set.py       → test_set_results│
	│ ├── eval_new_and_append.py → append 80K/110K │
	│ ├── eval_missing.py        → fill gaps       │
	│ ├── eval_morph_large.py    → morph 80K/110K  │
	│ ├── bootstrap_test_set.py  → 95% CIs         │
	│ ├── eval_all_externals.py  → external comp.  │
	│ ├── eval_codeswitch_...    → code-switching  │
	│ └── eval_doda_independent  → DODa validation │
	└──────────────────────────────────────────────┘
	         │
	         ▼
	┌──────────────────┐     ┌─────────────────────┐
	│ regen_figures    │────▶│ figures/*.png       │
	│ gen_report       │────▶│ benchmark_report.md │
	│ verify_arithmetic│────▶│ (stdout validation) │
	└──────────────────┘     └─────────────────────┘
	```

	---

	## Training Scripts

	### `script.py` — Master Benchmark Pipeline
	Lines: ~2032 | Type: Runs at import (no `__main__` guard)

	The monolithic entry point. Loads the `OiQ/daa-pairs` dataset, trains 24 tokenizers (4 algorithms x 2 architectures x 3 vocab sizes: 8K, 16K, 32K), evaluates them on the full training+test corpus, computes morphological metrics via Farasa, generates bootstrap confidence intervals, and produces all plots and reports.

	Key class: `ProductionMetricsEvaluator` — implements script detection, tokenization, Gini coefficient, and all metric computations.

	Inputs:
	- `OiQ/daa-pairs` dataset (via `huggingface_hub`)
	- `BenchmarkConfig` dataclass (vocab sizes, algorithms, hyperparameters)

	Outputs:
	- `results/tokenizers/*.json` — 36 raw tokenizer files (12 shared + 24 concat halves)
	- `results/transformers_tokenizers/` — transformers-compatible exports
	- `results/tokenizer_results.csv` / `.json` — full metrics with morphological data
	- `results/bootstrap_ci.csv` — bootstrap CIs
	- `results/benchmark_report.md` — auto-generated summary
	- `results/morphology/farasa_segmentations.json` — cached Farasa segmentations (~99 MB)
	- `results/plots/*.png` — all visualization figures
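
The Gini coefficient attributed to `ProductionMetricsEvaluator` is not defined in this document. A minimal sketch, assuming it is computed over token frequency counts (0 means every token is used equally often, values near 1 mean a few tokens dominate). The helper name `gini_coefficient` is illustrative and not taken from `script.py`:

```python
from collections import Counter


def gini_coefficient(counts):
    """Gini coefficient of a list of non-negative counts (hypothetical helper)."""
    values = sorted(counts)
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted formula on ascending values, with ranks i = 1..n:
    # G = 2 * sum(i * x_i) / (n * sum(x_i)) - (n + 1) / n
    weighted = sum(rank * value for rank, value in enumerate(values, start=1))
    return 2.0 * weighted / (n * total) - (n + 1.0) / n


# Toy token stream; a real run would count token IDs over the whole corpus.
token_counts = Counter(["ka", "ka", "ka", "n", "ktb"])
print(gini_coefficient(list(token_counts.values())))
```

A perfectly uniform distribution gives 0, and a distribution where one of `n` tokens receives all the mass gives `(n - 1) / n`.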

	---

	### `train_large_vocab.py` — Train 80K/110K Tokenizers
	Lines: ~146

	Trains additional tokenizers at 80K and 110K vocabulary sizes, matching the vocabulary sizes of DarijaBERT-ar (80K) and DarijaBERT-az (110K) for a fair head-to-head comparison.

	Inputs: `results/corpora/train_{ar,az}.txt`

	Outputs: `results/tokenizers/{shared,concat_ar,concat_az}_{algo}_{80000,110000}.json`
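
The output pattern above expands to one file per (prefix, algorithm, vocabulary size) combination. A small sketch of that expansion follows. The identifiers `bpe`, `bbpe`, and `wordpiece` appear in file names elsewhere in this document, while `unigram` is only an assumed placeholder for the fourth algorithm:

```python
from itertools import product

PREFIXES = ["shared", "concat_ar", "concat_az"]
# "unigram" is an assumption; the other three names occur in this document.
ALGOS = ["bpe", "bbpe", "wordpiece", "unigram"]
VOCAB_SIZES = [80000, 110000]


def expected_tokenizer_files(prefixes, algos, vocab_sizes, root="results/tokenizers"):
    """List the JSON paths implied by the {prefix}_{algo}_{vocab}.json pattern."""
    return [
        f"{root}/{prefix}_{algo}_{vocab}.json"
        for prefix, algo, vocab in product(prefixes, algos, vocab_sizes)
    ]


files = expected_tokenizer_files(PREFIXES, ALGOS, VOCAB_SIZES)
print(len(files))  # 3 prefixes x 4 algorithms x 2 sizes = 24 paths
print(files[0])    # results/tokenizers/shared_bpe_80000.json
```

Comparing such a list against the directory contents is one way to spot configurations that are still missing before running `train_remaining.py`.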

	---

	### `train_remaining.py` — Train Remaining Tokenizers
	Lines: ~134

	Fills in the last missing tokenizer configurations (3 shared 110K + 6 concat 110K + 1 concat 80K BBPE) that `train_large_vocab.py` did not cover.

	Inputs: `results/corpora/train_{ar,az}.txt`

	Outputs: Remaining `results/tokenizers/*.json` files

	---

	### `retrain_missing_and_compare.py` — Retrain + Full Re-evaluation
	Lines: ~558

	Retrains 4 missing concat 32K tokenizers, exports to transformers format, re-evaluates all 28 configs, fixes a WordPiece exact-match bug, and runs the external comparison pipeline.

	Inputs: Training corpora, HF external models

	Outputs: Updated tokenizer JSONs, `external_comparison.csv`, external comparison plot

	---

	## Evaluation Scripts

	### `eval_test_set.py` — Test-Set Evaluation (Single Source of Truth)
	Lines: ~227

	Re-evaluates all tokenizers on the held-out test set only (11,282 sentences per script). This is the authoritative evaluation used in all paper tables and percentage claims.

	Key function: `normalize_decode()` — fixes Metaspace double-space artifacts in WordPiece decoders.

	Inputs:
	- `results/tokenizers/*.json`
	- `results/corpora/test_{ar,az,mi}.txt`

	Outputs: `results/test_set_results.csv` / `.json` (40 rows once `eval_new_and_append.py` has appended the 80K/110K results)
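
The exact-match metric only makes sense if decoded text is comparable to the input. A minimal sketch of what a helper like `normalize_decode()` might do, assuming the artifact is limited to repeated or stray spaces; the real function in `eval_test_set.py` may handle more cases, and `exact_match` is a hypothetical name:

```python
import re


def normalize_decode(text: str) -> str:
    """Collapse runs of spaces and trim both ends (assumed behaviour)."""
    return re.sub(r" {2,}", " ", text).strip()


def exact_match(original: str, decoded: str) -> bool:
    """Compare the round-tripped text with the input after normalization."""
    return normalize_decode(decoded) == normalize_decode(original)


# The decoded string has a leading space and a doubled inner space.
print(exact_match("salam 3likom", " salam  3likom"))  # True
```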

	---

	### `eval_new_and_append.py` — Append 80K/110K Results
	Lines: ~144

	Evaluates the newly trained 80K/110K tokenizers and appends their rows to `test_set_results.csv`.

	Inputs: `test_set_results.csv`, tokenizers, test corpora

	Outputs: Updated `test_set_results.csv` (grows from 24 to 40 rows)
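
Appending to an existing results file is safest when it is idempotent, so that rerunning the script does not duplicate rows. A sketch of that idea on in-memory rows, under the assumption that each row carries a unique identifier column (called `tokenizer` here, which may not be the real column name); the fertility values are made up:

```python
def append_new_rows(existing, new_rows, key="tokenizer"):
    """Return existing rows plus any new rows whose key is not already present."""
    seen = {row[key] for row in existing}
    merged = list(existing)
    for row in new_rows:
        if row[key] not in seen:
            merged.append(row)
            seen.add(row[key])
    return merged


# Hypothetical rows; the real CSV has many more columns.
existing = [{"tokenizer": "shared_bpe_8000", "fertility": 1.9}]
new = [
    {"tokenizer": "shared_bpe_8000", "fertility": 1.9},   # already present, skipped
    {"tokenizer": "shared_bpe_80000", "fertility": 1.4},  # appended
]
print(len(append_new_rows(existing, new)))  # 2
```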

	---

	### `eval_missing.py` — Fill Single Gap
	Lines: ~124

	Evaluates the one remaining missing tokenizer (`concat_bbpe_55000`) and merges it into the results CSV.

	---

	### `eval_morph_large.py` — Morphological Metrics for 80K/110K
	Lines: ~297

	Computes morphological fidelity metrics (edit distance and consistency F1) for the 16 large-vocabulary tokenizers (80K, 110K) that were not covered by the original `script.py` morph evaluation. Uses the same Farasa cache and identical algorithms.

	Inputs:
	- `results/morphology/farasa_segmentations.json`
	- `results/corpora/test_ar.txt`
	- 16 tokenizer JSON files

	Outputs: `results/morph_large_vocab_results.csv` (16 rows)

	---

	### `bootstrap_test_set.py` — Bootstrap Confidence Intervals
	Lines: ~163

	Computes 95% bootstrap confidence intervals (500 resamples) for fertility and CPT on the test set.

	Inputs: Tokenizers, `test_{ar,az,mi}.txt`

	Outputs: `results/bootstrap_ci_test_set.csv` (24 rows for 8K-32K configs)
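
The bootstrap procedure is not detailed here. The sketch below shows a percentile bootstrap over sentences, assuming fertility is defined as tokens per whitespace-delimited word and that whole sentences are resampled with replacement; both are assumptions, and `toy_tokenize` merely stands in for a trained tokenizer:

```python
import random


def fertility(sentences, tokenize):
    """Tokens per whitespace-delimited word over a list of sentences."""
    n_tokens = sum(len(tokenize(s)) for s in sentences)
    n_words = sum(len(s.split()) for s in sentences)
    return n_tokens / n_words


def bootstrap_ci(sentences, tokenize, n_resamples=500, alpha=0.05, seed=0):
    """Percentile bootstrap CI for fertility, resampling sentences with replacement."""
    rng = random.Random(seed)
    stats = []
    for _ in range(n_resamples):
        sample = rng.choices(sentences, k=len(sentences))
        stats.append(fertility(sample, tokenize))
    stats.sort()
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Toy tokenizer (assumption): split each word into chunks of at most 3 characters.
def toy_tokenize(sentence):
    return [w[i:i + 3] for w in sentence.split() for i in range(0, len(w), 3)]


corpus = ["kanbghik bzaf", "salam", "chno smitk", "labas 3lik"]
low, high = bootstrap_ci(corpus, toy_tokenize)
print(low, high)
```

Fixing the seed makes the interval reproducible across runs, which matters when the CSV is regenerated and compared against published numbers.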

	---

	## External Comparison Scripts

	### `eval_all_externals.py` — Evaluate 9 External Tokenizers
	Lines: ~281

	Evaluates 9 external Arabic/Darija tokenizers from HuggingFace (CaMeLBERT-MSA, Asafaya-BERT, Aranizer-SP-86k, B2BERT, DarijaBERT-ar, DarijaBERT-az, Darija-Tokenizer, Translit-Darija, Qwen2.5-Darija) alongside our best 3 tokenizers.

	Inputs: HF model repos (requires `HF_TOKEN`), test corpora

	Outputs: `results/external_comparison.csv` / `.json`, comparison plot

	---

	### `compare_with_external.py` — External Comparison (Earlier Version)
	Lines: ~269

	Earlier version of the external comparison, comparing our 8K/16K/32K tokenizers against 5 external models.

	---

	### `eval_and_compare.py` — Combined Evaluation + Comparison
	Lines: ~277

	Combines internal evaluation and external comparison into a single pipeline run.

	---

	### `eval_codeswitch_and_new_baselines.py` — Code-Switching Evaluation
	Lines: ~373

	Evaluates tokenizers on mixed-script (code-switched) texts as a separate category. Also adds the `atlasia/darija_bpe_tokenizer` baseline and evaluates all tokenizers on DODa.

	Outputs: `results/codeswitch_results.csv` / `.json`
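
Treating mixed-script text as its own category requires a script classifier such as the `detect_script()` helper named under Key Design Decisions. A simplified sketch, assuming the labels `ar`, `az`, and `mi` (borrowed from the corpus file names) and checking only the basic Arabic block U+0600 to U+06FF; the real helper may use different rules:

```python
def detect_script(text: str) -> str:
    """Classify text as 'ar' (Arabic script), 'az' (Latin/Arabizi) or 'mi' (mixed)."""
    has_arabic = any("\u0600" <= ch <= "\u06FF" for ch in text)
    has_latin = any(ch.isascii() and ch.isalpha() for ch in text)
    if has_arabic and has_latin:
        return "mi"
    if has_arabic:
        return "ar"
    # Text with no Arabic letters (including digits-only text) falls through to 'az'.
    return "az"


arabic_word = "\u0633\u0644\u0627\u0645"  # "salam" written in Arabic script
print(detect_script(arabic_word))             # ar
print(detect_script("salam 3likom"))          # az
print(detect_script(arabic_word + " khoya"))  # mi
```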

	---

	### `eval_doda_independent.py` — DODa Independent Validation
	Lines: ~196

	Evaluates all tokenizers on `atlasia/DODa` (87K Arabizi dictionary entries), sampled to 10K entries. Serves as an independent, out-of-training-distribution validation.

	Outputs: `results/doda_independent_results.csv` / `.json`
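
Down-sampling the dictionary to 10K entries should be reproducible so that reruns evaluate the same lines. A sketch with a fixed seed; the seed value and the function name `sample_entries` are assumptions, not taken from `eval_doda_independent.py`:

```python
import random


def sample_entries(entries, k=10_000, seed=42):
    """Return a reproducible random sample of at most k entries."""
    if len(entries) <= k:
        return list(entries)
    return random.Random(seed).sample(entries, k)


# Stand-in for the ~87K dictionary entries.
entries = [f"entry_{i}" for i in range(87_000)]
sample = sample_entries(entries)
print(len(sample))  # 10000
```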

	---

	## Utility Scripts

	### `fix_tokenizer_decoders.py` — Decoder Bug Fixer
	Lines: ~152

	Patches three decoder bugs in tokenizer JSON files:
	1. WordPiece double-space artifact (the Metaspace decoder emits two spaces, `"  "`, where a single space, `" "`, is expected)
	2. NULL decoder in `concat_bpe_16000` (missing Metaspace decoder)
	3. Missing WordPiece sub-decoder in `concat_wordpiece_16000`

	Warning: Modifies tokenizer files in place. No backup is created.
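
Because the fixer overwrites files, it can be worth wrapping each patch in a backup step. A sketch of one way to do that; `patch_with_backup` and `set_missing_decoder` are hypothetical helpers, and the decoder dictionary is a placeholder rather than the full specification the real script writes:

```python
import json
import shutil
from pathlib import Path


def patch_with_backup(path, patch):
    """Copy `path` to `<name>.bak` once, then rewrite it with `patch` applied."""
    path = Path(path)
    backup = path.with_name(path.name + ".bak")
    if not backup.exists():  # keep the oldest copy if the fixer is rerun
        shutil.copy2(path, backup)
    data = json.loads(path.read_text(encoding="utf-8"))
    patch(data)  # mutates the parsed tokenizer config in place
    path.write_text(json.dumps(data, ensure_ascii=False, indent=2), encoding="utf-8")
    return backup


def set_missing_decoder(config):
    """Fill in a decoder only when it is null (placeholder spec, for illustration)."""
    if config.get("decoder") is None:
        config["decoder"] = {"type": "Metaspace"}
```

Calling `patch_with_backup("results/tokenizers/concat_bpe_16000.json", set_missing_decoder)` would leave the untouched original next to the patched file with a `.bak` suffix.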

	---

	### `gen_report.py` — Report Generator
	Lines: ~70

	Generates a Markdown summary report (`benchmark_report.md`) from `tokenizer_results.csv`, including best-by-vocabulary tables and full results.

	---

	### `regen_figures.py` — Figure Regenerator
	Lines: ~313

	Regenerates the 5 main paper figures from `test_set_results.csv` with larger fonts (14pt base) for readability. Reads from CSV and writes PNGs to the paper's `figures/` directory.

	Inputs: `results/test_set_results.csv`

	Outputs (all written to `figures/`): `fertility_overall_comparison_v2.png`, `fertility_overall_trends.png`, `fertility_disparity_comparison_v2.png`, `fertility_disparity_heatmap_v2.png`, `external_comparison.png`

	---

	### `verify_arithmetic.py` — Numeric Claims Verification
	Lines: ~143

	Validates every percentage claim in the paper and README against `test_set_results.csv`. Checks:
	- Disparity formula: `|F_ar - F_az| / max(F_ar, F_az)`
	- Overall fertility derivability from per-script values
	- All percentage improvement claims (27-34%, 40-50%, etc.)

	Outputs: Stdout validation report (no files written)
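
The disparity formula listed above translates directly into code. The sketch below also shows one plausible way to check a percentage-range claim; `improvement_pct` and `check_claim` are illustrative names, the improvement definition (relative reduction against a baseline) is an assumption, and the numbers are made up rather than read from `test_set_results.csv`:

```python
def disparity(f_ar: float, f_az: float) -> float:
    """Relative fertility gap between scripts: |F_ar - F_az| / max(F_ar, F_az)."""
    return abs(f_ar - f_az) / max(f_ar, f_az)


def improvement_pct(baseline: float, ours: float) -> float:
    """Percentage reduction of `ours` relative to `baseline` (lower is better)."""
    return 100.0 * (baseline - ours) / baseline


def check_claim(value: float, low: float, high: float, tol: float = 0.5) -> bool:
    """True if `value` lies in the claimed range, allowing for rounding."""
    return low - tol <= value <= high + tol


print(disparity(2.0, 1.5))                           # 0.25
print(improvement_pct(2.0, 1.5))                     # 25.0
print(check_claim(improvement_pct(2.0, 1.4), 27, 34))  # True
```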

	---

	## Result Files

	### Primary Data

	| File | Rows | Description |
	|------|------|-------------|
	| `test_set_results.csv` | 40 | Single source of truth. Test-set metrics for all tokenizers: fertility, CPT, disparity, Gini, entropy, exact match. |
	| `tokenizer_results.csv` | 24 | Full benchmark results incl. morphological metrics and per-script breakdown. Covers 8K-32K only. |
	| `morph_large_vocab_results.csv` | 16 | Morphological metrics for 80K/110K tokenizers. |
	| `bootstrap_ci_test_set.csv` | 24 | Bootstrap 95% CIs (500 resamples) for fertility and CPT. |
	| `external_comparison.csv` | 12 | Our best 3 + 9 external tokenizers: fertility, CPT, disparity, exact match per script. |
	| `codeswitch_results.csv` | 5 | Code-switching evaluation with mixed-script category. |
	| `doda_independent_results.csv` | 12 | DODa independent validation (Arabizi dictionary). |

	### Supporting Data

	| File / Directory | Description |
	|------------------|-------------|
	| `corpora/` | Train/validation/test text splits: `{train,val,test}_{ar,az,mi}.txt` |
	| `morphology/farasa_segmentations.json` | Cached Farasa morphological segmentations for Arabic texts (~99 MB) |
	| `tokenizers/` | Raw HuggingFace `tokenizers` JSON files (60 files: 20 shared + 40 concat halves) |
	| `transformers_tokenizers/` | Tokenizers exported for `transformers` library use |
	| `doda_sample_10k.txt` | 10K-line Arabizi sample from DODa dataset |
	| `benchmark_report.md` | Auto-generated Markdown summary report |

	---

	## Reproduction Guide

	### Full Reproduction (from scratch)

	```bash
	# 1. Train 8K-32K tokenizers + initial evaluation + morphology
	python script.py

	# 2. Train 80K + 110K tokenizers
	python train_large_vocab.py
	python train_remaining.py

	# 3. Evaluate on test set (appends 80K/110K to results)
	python eval_test_set.py
	python eval_new_and_append.py
	python eval_missing.py

	# 4. Compute morphological metrics for large vocabs
	python eval_morph_large.py

	# 5. Bootstrap confidence intervals
	python bootstrap_test_set.py

	# 6. External tokenizer comparison
	python eval_all_externals.py

	# 7. Code-switching + DODa validation
	python eval_codeswitch_and_new_baselines.py
	python eval_doda_independent.py

	# 8. Generate figures + reports
	python regen_figures.py
	python gen_report.py

	# 9. Verify all numeric claims
	python verify_arithmetic.py
	```

	### Requirements

	- Python 3.10+
	- `tokenizers`, `transformers`, `datasets` (HuggingFace stack)
	- `scikit-learn` (KMeans for morphological consistency)
	- `regex` (Unicode grapheme segmentation)
	- `numpy`, `pandas`, `matplotlib`, `seaborn`
	- `tqdm`
	- Farasa JAR (for morphological segmentation; pre-cached in `morphology/`)
	- `HF_TOKEN` environment variable (for loading external models)
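
Before starting a long run it may help to confirm that these requirements are in place. A small preflight sketch using only the standard library; `missing_requirements` is a hypothetical helper, and `sklearn` is the import name of `scikit-learn`. It does not check for the Farasa JAR:

```python
import importlib.util
import os

REQUIRED_MODULES = [
    "tokenizers", "transformers", "datasets", "sklearn",
    "regex", "numpy", "pandas", "matplotlib", "seaborn", "tqdm",
]


def missing_requirements(modules=REQUIRED_MODULES, env=os.environ):
    """Return names of modules that cannot be found, plus HF_TOKEN if unset."""
    missing = [m for m in modules if importlib.util.find_spec(m) is None]
    if not env.get("HF_TOKEN"):
        missing.append("HF_TOKEN")
    return missing


if __name__ == "__main__":
    problems = missing_requirements()
    print("missing:", ", ".join(problems) if problems else "nothing")
```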

	---

	## Key Design Decisions

	1. Monolithic `script.py`: The main pipeline runs at import level (no `__main__` guard). This is intentional for checkpoint-based resumption — the script detects existing artifacts and skips completed stages.

	2. Duplicated helper functions: Functions like `detect_script()`, `count_graphemes()`, and `normalize_decode()` are copied across evaluation scripts rather than shared via import. This ensures each eval script is self-contained and runnable independently.

	3. Test-set-only evaluation: All paper numbers come from `eval_test_set.py`, not `script.py`'s full-corpus evaluation. The test set (11,282 sentences per script) provides unbiased estimates.

	4. Concatenated architecture: Each concat config is stored as two JSON files (`concat_ar_.json` + `concat_az_.json`). The evaluator loads both and applies ID shifting at inference time.
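
The ID shifting mentioned in decision 4 can be illustrated with two toy tokenizers. This is a sketch under stated assumptions: each text is routed to one half by script, and the Arabizi half is the one whose IDs are offset by the Arabic vocabulary size. The actual evaluator may route or shift differently, and all names below are hypothetical:

```python
def make_toy_tokenizer(vocab):
    """Word-level stand-in for a trained tokenizer: word -> ID, unknown -> 0."""
    index = {word: i for i, word in enumerate(vocab)}
    return lambda text: [index.get(word, 0) for word in text.split()]


def is_arabic(text):
    """True if the text contains any character from the basic Arabic block."""
    return any("\u0600" <= ch <= "\u06FF" for ch in text)


def encode_concat(text, tok_ar, tok_az, ar_vocab_size):
    """Encode with the script-specific half; offset Arabizi IDs past the Arabic range."""
    if is_arabic(text):
        return tok_ar(text)
    return [token_id + ar_vocab_size for token_id in tok_az(text)]


tok_ar = make_toy_tokenizer(["[UNK]", "\u0633\u0644\u0627\u0645"])
tok_az = make_toy_tokenizer(["[UNK]", "salam", "khoya"])

print(encode_concat("salam khoya", tok_ar, tok_az, ar_vocab_size=2))  # [3, 4]
print(encode_concat("\u0633\u0644\u0627\u0645", tok_ar, tok_az, ar_vocab_size=2))  # [1]
```

Offsetting one half keeps the two ID ranges disjoint, so a single embedding table of size `ar_vocab_size + az_vocab_size` can serve both tokenizers.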