---
title: Darija Subword Tokenizer Benchmark
license: mit
tags:
- tokenizer
- moroccan-darija
- arabic
- bpe
- unigram
- wordpiece
- bbpe
- benchmark
language:
- ar
- lat
size_categories:
- n>100K
---

# Darija Subword Tokenizer Benchmark

In collaboration with
UM6P College of Computing

---

## Overview

The first systematic subword tokenizer benchmark for **Moroccan Darija**, a low-resource dialect written concurrently in Arabic script and Arabizi (Latin script).

We train and evaluate **40 tokenizer configurations** spanning four algorithms, two architectures, and five vocabulary sizes (8K–110K) on 112,814 parallel sentence pairs from [`OiQ/daa-pairs`](https://huggingface.co/datasets/OiQ/daa-pairs).

Our tokenizers achieve **27–33% lower fertility** than existing Darija tokenizers (DarijaBERT) at matching vocabulary sizes, and **40–50% lower fertility** than MSA-trained tokenizers. All 40 configurations maintain ≥99% exact reconstruction.

---

## Tokenizers

| Architecture | Description | Algorithms | Vocab Sizes | Count |
|---|---|---|---|---|
| **Shared** | Single vocabulary trained on mixed Arabic + Arabizi corpus | BPE, Unigram, WordPiece, BBPE | 8K, 16K, 32K, 80K, 110K | 20 |
| **Concatenated** | Separate per-script vocabularies (V/2 each) with ID shifting | BPE, Unigram, WordPiece, BBPE | 8K, 16K, 32K, 80K, 110K | 20 |

All tokenizers are released in both raw (HuggingFace `tokenizers`) and `transformers`-compatible formats. The 80K and 110K sizes match DarijaBERT's vocabulary sizes for direct comparison.

---

## Quick Start

```python
from transformers import AutoTokenizer

# Load a shared tokenizer (one vocabulary for both scripts)
tok = AutoTokenizer.from_pretrained(
    "OiQ/daa-tokenizers",
    subfolder="transformers_tokenizers/shared_bpe_32000",
)

# Tokenize Arabic-script text
text_ar = "مابقاش كيعرف شنو يدير، بين القانون وبين وليداتو."
print(tok.encode(text_ar))

# Load a concatenated tokenizer (separate Arabic and Arabizi sub-tokenizers)
tok_ar = AutoTokenizer.from_pretrained(
    "OiQ/daa-tokenizers",
    subfolder="transformers_tokenizers/concat_bpe_32000_tokenizer_ar",
)
tok_az = AutoTokenizer.from_pretrained(
    "OiQ/daa-tokenizers",
    subfolder="transformers_tokenizers/concat_bpe_32000_tokenizer_az",
)

# Each sub-tokenizer handles its own script
print(tok_ar.encode(text_ar))

text_az = "wash kayn shi jdid?"
print(tok_az.encode(text_az))
```

---

## Key Results

### Best Tokenizer per Vocabulary Size

| Vocab | Configuration | Algorithm | Fertility ↓ | Disparity ↓ | Exact Match |
|---|---|---|---|---|---|
| 8K | Shared | WordPiece | **1.572** | 0.164 | 99.9% |
| 16K | Shared | WordPiece | **1.402** | 0.138 | 99.9% |
| 32K | Shared | WordPiece | **1.274** | 0.099 | 99.9% |
| 80K | Shared | WordPiece | **1.171** | 0.049 | 99.9% |
| 110K | Concat | WordPiece | **1.155** | 0.093 | 99.6% |

### Comparison with Existing Tokenizers

| Tokenizer | Vocab | Fertility ↓ | Disparity ↓ | EM (Ar) | EM (Az) |
|---|---|---|---|---|---|
| **Ours: concat WP 110K** | **110K** | **1.155** | 0.093 | **99.9%** | **99.6%** |
| **Ours: concat WP 80K** | **80K** | **1.183** | 0.090 | **99.9%** | **99.6%** |
| **Ours: concat BPE 32K** | **32K** | **1.307** | 0.084 | **99.9%** | **99.6%** |
| DarijaBERT-ar | 80K | 1.761 | 0.410 | 13.7% | 8.0% |
| DarijaBERT-az | 110K | 1.575 | 0.055 | 14.8% | 8.0% |
| DarijaBERT-mix | 160K | 1.414 | 0.149 | 14.8% | 8.0% |
| CaMeLBERT-MSA | 30K | 2.289 | 0.427 | 29.9% | 38.9% |
| Aranizer-SP-86k | 86K | 1.918 | 0.368 | 99.8% | 99.6% |
| Qwen2.5-Darija | 152K | 2.307 | 0.040 | 100.0% | 100.0% |

At matching vocabulary sizes, our 80K tokenizer achieves **33% lower fertility** than DarijaBERT-ar (1.183 vs 1.761), and our 110K tokenizer achieves **27% lower fertility** than DarijaBERT-az (1.155 vs 1.575). Even our 32K tokenizer outperforms DarijaBERT-az (1.307 vs 1.575) despite using 3.4x fewer vocabulary slots. DarijaBERT-mix, despite its 160K vocabulary (F = 1.414), still underperforms our 32K tokenizer: vocabulary size alone cannot compensate for a suboptimal training architecture.
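
The percentages quoted above can be rechecked from the fertility column of the comparison table. The snippet below is a minimal sketch of that arithmetic, not the repository's own verification script; the helper name `relative_reduction` is hypothetical, and the numbers are copied from the table.

```python
def relative_reduction(ours: float, baseline: float) -> float:
    """Fractional fertility reduction of `ours` relative to `baseline`."""
    return (baseline - ours) / baseline


# (our fertility, baseline fertility), copied from the comparison table.
comparisons = {
    "concat WP 80K vs DarijaBERT-ar": (1.183, 1.761),
    "concat WP 110K vs DarijaBERT-az": (1.155, 1.575),
    "concat BPE 32K vs DarijaBERT-mix": (1.307, 1.414),
}

for name, (ours, baseline) in comparisons.items():
    print(f"{name}: {relative_reduction(ours, baseline):.1%} lower")
# concat WP 80K vs DarijaBERT-ar: 32.8% lower
# concat WP 110K vs DarijaBERT-az: 26.7% lower
# concat BPE 32K vs DarijaBERT-mix: 7.6% lower
```

The first two values round to the 33% and 27% figures reported in the text.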

---

## Evaluation Metrics

### Compression & Fairness

| Metric | Definition | Direction |
|---|---|---|
| **Fertility** (F) | Tokens per word, averaged over test set | Lower is better |
| **CPT** | Grapheme clusters per token (Unicode-aware) | Higher is better |
| **Disparity** (ΔF) | Relative cross-script gap: \|F_ar − F_az\| / max(F_ar, F_az) | Lower is better |
| **Exact Match** | Fraction of texts that round-trip perfectly through encode/decode | Higher is better |
| **Gini** | Vocabulary usage inequality (0 = uniform, 1 = concentrated) | Lower is better |

Overall fertility is word-count-weighted: F ≈ 0.65·F_ar + 0.35·F_az

### Morphological Fidelity (Arabic-script only)

| Metric | Definition | Direction |
|---|---|---|
| **μe** | Edit distance between tokenizer boundaries and Farasa morpheme boundaries | Lower is better |
| **μc-F1** | Whether words sharing morphemes also share tokens (KMeans + TF-IDF) | Higher is better |

### Statistical Rigor

All fertility and CPT values include **bootstrap 95% confidence intervals** (500 resamples). CI width is ≤ 0.006 for all configurations.

---

## Visualizations

### Fertility by Algorithm and Vocabulary Size

![Fertility Comparison](plots/fertility_overall_comparison_v2.png)

### Cross-Script Disparity

![Disparity Comparison](plots/fertility_disparity_comparison_v2.png)

### External Tokenizer Comparison

![External Comparison](plots/external_comparison.png)

---

## Key Findings

1. **Darija-specific tokenization dramatically improves compression.** Our tokenizers achieve 27–33% lower fertility than DarijaBERT at matching vocabulary sizes.
2. **Cross-script fairness is achievable.** Shared Unigram at 80K reaches ΔF = 0.015 (near-zero disparity) without concatenation. Concatenated architectures maintain ΔF ≤ 0.094 across all non-BBPE algorithms.
3. **Vocabulary size saturates early.** Moving from 32K to 110K yields only marginal gains (best fertility 1.274 at 32K vs 1.155 at 110K). The training corpus exhausts merge candidates around 80K.
4. **BBPE requires concatenation.** Shared BBPE exhibits extreme cross-script disparity (ΔF = 0.219–0.243). Concatenation reduces this by about 4x.
5. **MSA tokenizers transfer poorly to Darija.** Our tokenizers achieve 40–50% lower fertility than MSA-trained tokenizers, which allocate vocabulary to MSA patterns absent in Darija.

---

## Methodology

| Stage | Details |
|---|---|
| **Dataset** | `OiQ/daa-pairs`: 112,814 Moroccan Darija sentence triplets (Arabic, Arabizi, Mixed) from 12,695 unique sources |
| **Split** | 80/10/10 train/validation/test with stratified sampling |
| **Pre-tokenization** | Metaspace (BPE/Unigram/WordPiece) or ByteLevel (BBPE) |
| **Training** | HuggingFace `tokenizers` library with matched pre-tokenizer/decoder pairs |
| **Evaluation** | 14 metrics on 33,846 test texts (11,282 per script × 3 scripts) |
| **Morphological** | Farasa segmenter for gold-standard Arabic morpheme boundaries |
| **Export** | Raw + `transformers`-compatible via `PreTrainedTokenizerFast` |

---

## Code & Reproducibility

All scripts, evaluation code, and documentation are included in this repository. See [`code.md`](code.md) for a complete guide to every script and its outputs.
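
For orientation before reading the scripts, the fertility, disparity, and bootstrap definitions from the Evaluation Metrics section can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the code in `eval_test_set.py` or `bootstrap_test_set.py`: words are taken to be whitespace-delimited, fertility is pooled over the whole text set, and `toy_tokenize` is a hypothetical stand-in for a real tokenizer.

```python
import random
from typing import Callable, List, Sequence, Tuple

Tokenize = Callable[[str], List[str]]


def toy_tokenize(text: str, width: int = 3) -> List[str]:
    """Hypothetical stand-in tokenizer: chop each word into chunks of
    at most `width` code points (not grapheme clusters)."""
    return [w[i:i + width] for w in text.split() for i in range(0, len(w), width)]


def _counts(texts: Sequence[str], tokenize: Tokenize) -> List[Tuple[int, int]]:
    """(token count, word count) per text; words are whitespace-delimited (assumption)."""
    return [(len(tokenize(t)), len(t.split())) for t in texts]


def fertility(texts: Sequence[str], tokenize: Tokenize) -> float:
    """Tokens per word, pooled over all texts (pooling is an assumption)."""
    counts = _counts(texts, tokenize)
    return sum(tok for tok, _ in counts) / sum(wrd for _, wrd in counts)


def disparity(f_ar: float, f_az: float) -> float:
    """Relative cross-script gap: |F_ar - F_az| / max(F_ar, F_az)."""
    return abs(f_ar - f_az) / max(f_ar, f_az)


def bootstrap_ci(texts: Sequence[str], tokenize: Tokenize,
                 n_resamples: int = 500, alpha: float = 0.05,
                 seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap interval for fertility, resampling texts with replacement."""
    rng = random.Random(seed)
    counts = _counts(texts, tokenize)
    stats = []
    for _ in range(n_resamples):
        sample = rng.choices(counts, k=len(counts))
        stats.append(sum(tok for tok, _ in sample) / sum(wrd for _, wrd in sample))
    stats.sort()
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Tiny illustrative inputs; the benchmark itself uses 11,282 test texts per script.
texts_ar = ["مابقاش كيعرف شنو يدير", "بين القانون وبين وليداتو"]
texts_az = ["wash kayn shi jdid", "ma b9ach kay3ref"]

f_ar = fertility(texts_ar, toy_tokenize)
f_az = fertility(texts_az, toy_tokenize)
print("F_ar:", f_ar, "F_az:", f_az, "disparity:", disparity(f_ar, f_az))
print("95% CI for F_az:", bootstrap_ci(texts_az, toy_tokenize))
```

To apply the same functions to a released tokenizer, `toy_tokenize` could be replaced with a callable wrapping that tokenizer's `tokenize` method.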

### Repository Structure

```
daa-tokenizers/
├── README.md                              # This file
├── code.md                                # Script documentation
├── script.py                              # Master benchmark pipeline
├── eval_test_set.py                       # Test-set evaluation
├── eval_morph_large.py                    # Morphological metrics (80K/110K)
├── bootstrap_test_set.py                  # Bootstrap confidence intervals
├── eval_all_externals.py                  # External tokenizer comparison
├── eval_codeswitch_and_new_baselines.py
├── eval_doda_independent.py               # DODa validation
├── regen_figures.py                       # Figure generation
├── verify_arithmetic.py                   # Numeric claims verification
├── results/
│   ├── test_set_results.csv               # Primary results (40 tokenizers)
│   ├── external_comparison.csv            # External comparison
│   ├── morph_large_vocab_results.csv      # Morphological metrics (80K/110K)
│   ├── bootstrap_ci_test_set.csv          # Bootstrap 95% CIs
│   ├── tokenizers/                        # 60 raw tokenizer JSONs
│   ├── transformers_tokenizers/           # Transformers-compatible exports
│   ├── corpora/                           # Train/test text splits
│   ├── morphology/                        # Farasa segmentations cache
│   └── plots/                             # All visualization PNGs
└── plots/                                 # Paper figures
```

---

## Citation

```bibtex
@misc{laamiri2026daa-tokenizers,
  title  = {Darija Subword Tokenizer Benchmark},
  author = {Laamiri, Ouail and Berrada, Ismail and Belfadil, Anas},
  year   = {2026},
  url    = {https://huggingface.co/OiQ/daa-tokenizers},
  note   = {In collaboration with UM6P College of Computing}
}
```

---

## License

MIT License. See [LICENSE](LICENSE) for details.

## Acknowledgments

This work was developed in collaboration with the [UM6P College of Computing](https://www.um6p.ma/en/um6p-college-computing), Mohammed VI Polytechnic University, Ben Guerir, Morocco. We thank the HuggingFace community for providing the infrastructure to host these resources.