---
title: Darija Subword Tokenizer Benchmark
license: mit
tags:
- tokenizer
- moroccan-darija
- arabic
- bpe
- unigram
- wordpiece
- bbpe
- benchmark
language:
- ar
- lat
size_categories:
- n>100K
---

# Darija Subword Tokenizer Benchmark

<p align="center">
<strong>In collaboration with</strong><br>
<a href="https://www.um6p.ma/en/um6p-college-computing">UM6P College of Computing</a>
</p>

---

## Overview

This is the first systematic subword tokenizer benchmark for **Moroccan Darija**, a low-resource dialect written concurrently in Arabic script and Arabizi (Latin script). We train and evaluate **40 tokenizer configurations** spanning four algorithms, two architectures, and five vocabulary sizes (8K–110K) on 112,814 parallel sentence pairs from [`OiQ/daa-pairs`](https://huggingface.co/datasets/OiQ/daa-pairs).

Our tokenizers achieve **27–33% lower fertility** (fewer tokens per word) than existing Darija tokenizers (DarijaBERT) at matching vocabulary sizes, and **40–50% lower fertility** than MSA-trained tokenizers. All 40 configurations maintain ≥99% exact reconstruction.

---

## Tokenizers

| | Architecture | Description | Algorithms | Vocab Sizes | Count | |
| |---|---|---|---|---| |
| | **Shared** | Single vocabulary trained on mixed Arabic + Arabizi corpus | BPE, Unigram, WordPiece, BBPE | 8K, 16K, 32K, 80K, 110K | 20 | |
| | **Concatenated** | Separate per-script vocabularies (V/2 each) with ID shifting | BPE, Unigram, WordPiece, BBPE | 8K, 16K, 32K, 80K, 110K | 20 | |

All tokenizers are released in both raw (HuggingFace `tokenizers`) and `transformers`-compatible formats. The 80K and 110K sizes match DarijaBERT's vocabulary sizes for direct comparison.
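
The "ID shifting" used by the concatenated architecture can be illustrated with a small sketch. It assumes that each text is routed to one sub-tokenizer by script and that Arabizi IDs are offset by V/2 so the two vocabularies never collide. The routing heuristic, the `encode_concat` helper, and the toy encoder are hypothetical and are not taken from the released code.

```python
from typing import Callable, List

Encoder = Callable[[str], List[int]]


def is_arabic_script(text: str) -> bool:
    """Routing heuristic (an assumption): most letters fall in the Arabic Unicode block."""
    letters = [ch for ch in text if ch.isalpha()]
    arabic = sum("\u0600" <= ch <= "\u06FF" for ch in letters)
    return bool(letters) and arabic * 2 > len(letters)


def encode_concat(text: str, encode_ar: Encoder, encode_az: Encoder, half_vocab: int) -> List[int]:
    """Encode with the matching sub-tokenizer; shift Arabizi IDs into the upper half."""
    if is_arabic_script(text):
        return encode_ar(text)  # IDs in [0, V/2)
    return [i + half_vocab for i in encode_az(text)]  # IDs in [V/2, V)


# Toy stand-in for a real sub-tokenizer: one ID per word, equal to the word length.
def toy_encoder(text: str) -> List[int]:
    return [len(word) for word in text.split()]


half_vocab = 16_000  # V/2 for a 32K concatenated tokenizer
print(encode_concat("wash kayn shi jdid?", toy_encoder, toy_encoder, half_vocab))
print(encode_concat("شنو هادا", toy_encoder, toy_encoder, half_vocab))
```

In practice the two `encode_*` callables would wrap the `_tokenizer_ar` and `_tokenizer_az` sub-tokenizers; which half of the ID space each script occupies is a design choice made here only for illustration.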

---

## Quick Start

```python
from transformers import AutoTokenizer

# Load a shared tokenizer (one vocabulary for both scripts)
tok = AutoTokenizer.from_pretrained(
    "OiQ/daa-tokenizers",
    subfolder="transformers_tokenizers/shared_bpe_32000",
)

# Tokenize Arabic-script text
text_ar = "مابقاش كيعرف شنو يدير، بين القانون وبين وليداتو."
print(tok.encode(text_ar))

# Load a concatenated tokenizer (separate Arabic and Arabizi sub-tokenizers)
tok_ar = AutoTokenizer.from_pretrained(
    "OiQ/daa-tokenizers",
    subfolder="transformers_tokenizers/concat_bpe_32000_tokenizer_ar",
)
tok_az = AutoTokenizer.from_pretrained(
    "OiQ/daa-tokenizers",
    subfolder="transformers_tokenizers/concat_bpe_32000_tokenizer_az",
)

# Encode each script with its own sub-tokenizer
print(tok_ar.encode(text_ar))

text_az = "wash kayn shi jdid?"
print(tok_az.encode(text_az))
```

---

## Key Results

### Best Tokenizer per Vocabulary Size

| Vocab | Configuration | Algorithm | Fertility ↓ | Disparity ↓ | Exact Match |
| |---|---|---|---|---|---| |
| | 8K | Shared | WordPiece | **1.572** | 0.164 | 99.9% | |
| | 16K | Shared | WordPiece | **1.402** | 0.138 | 99.9% | |
| | 32K | Shared | WordPiece | **1.274** | 0.099 | 99.9% | |
| | 80K | Shared | WordPiece | **1.171** | 0.049 | 99.9% | |
| | 110K | Concat | WordPiece | **1.155** | 0.093 | 99.6% | |

### Comparison with Existing Tokenizers

| Tokenizer | Vocab | Fertility ↓ | Disparity ↓ | EM (Ar) | EM (Az) |
| |---|---|---|---|---|---| |
| | **Ours: concat WP 110K** | **110K** | **1.155** | 0.093 | **99.9%** | **99.6%** | |
| | **Ours: concat WP 80K** | **80K** | **1.183** | 0.090 | **99.9%** | **99.6%** | |
| | **Ours: concat BPE 32K** | **32K** | **1.307** | 0.084 | **99.9%** | **99.6%** | |
| | DarijaBERT-ar | 80K | 1.761 | 0.410 | 13.7% | 8.0% | |
| | DarijaBERT-az | 110K | 1.575 | 0.055 | 14.8% | 8.0% | |
| | DarijaBERT-mix | 160K | 1.414 | 0.149 | 14.8% | 8.0% | |
| | CaMeLBERT-MSA | 30K | 2.289 | 0.427 | 29.9% | 38.9% | |
| | Aranizer-SP-86k | 86K | 1.918 | 0.368 | 99.8% | 99.6% | |
| | Qwen2.5-Darija | 152K | 2.307 | 0.040 | 100.0% | 100.0% | |

At matching vocabulary sizes, our 80K tokenizer achieves **33% lower fertility** than DarijaBERT-ar (1.183 vs 1.761), and our 110K tokenizer achieves **27% lower fertility** than DarijaBERT-az (1.155 vs 1.575). Even our 32K tokenizer (F = 1.307) outperforms DarijaBERT-az despite using 3.4x fewer vocabulary slots. DarijaBERT-mix, despite its 160K vocabulary (F = 1.414), still underperforms our 32K tokenizer: vocabulary size alone cannot compensate for a suboptimal training architecture.
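
The percentages quoted above follow directly from the fertility values in the comparison table. The short check below recomputes them; the `relative_reduction` helper is introduced here only for illustration.

```python
def relative_reduction(ours: float, baseline: float) -> float:
    """Fraction by which `ours` is lower than `baseline`."""
    return 1.0 - ours / baseline


# (our fertility, baseline fertility), copied from the comparison table
comparisons = {
    "concat WP 80K vs DarijaBERT-ar": (1.183, 1.761),
    "concat WP 110K vs DarijaBERT-az": (1.155, 1.575),
    "concat BPE 32K vs DarijaBERT-mix": (1.307, 1.414),
}

for name, (ours, baseline) in comparisons.items():
    print(f"{name}: {relative_reduction(ours, baseline):.1%} lower fertility")

# Vocabulary ratio behind the "3.4x fewer vocabulary slots" statement
print(f"110K / 32K = {110 / 32:.1f}x")
```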

---

## Evaluation Metrics

### Compression & Fairness

| | Metric | Definition | Direction | |
| |---|---|---| |
| | **Fertility** (F) | Tokens per word, averaged over test set | Lower is better | |
| | **CPT** | Grapheme clusters per token (Unicode-aware) | Higher is better | |
| **Disparity** (ΔF) | Relative cross-script gap: \|F<sub>ar</sub> − F<sub>az</sub>\| / max(F<sub>ar</sub>, F<sub>az</sub>) | Lower is better |
| | **Exact Match** | Fraction of texts that round-trip perfectly through encode/decode | Higher is better | |
| | **Gini** | Vocabulary usage inequality (0 = uniform, 1 = concentrated) | Lower is better | |

Overall fertility is word-count-weighted: F ≈ 0.65·F<sub>ar</sub> + 0.35·F<sub>az</sub>
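
The definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's evaluation code: it assumes whitespace-delimited words, pools token and word counts over the whole set (rather than averaging per sentence), and uses a toy tokenizer and toy byte-level codec as stand-ins for a real tokenizer.

```python
from typing import Callable, List, Sequence

Tokenize = Callable[[str], List[str]]


def fertility(texts: Sequence[str], tokenize: Tokenize) -> float:
    """Tokens per whitespace-delimited word, pooled over all texts."""
    n_tokens = sum(len(tokenize(t)) for t in texts)
    n_words = sum(len(t.split()) for t in texts)
    return n_tokens / n_words


def disparity(f_ar: float, f_az: float) -> float:
    """Relative cross-script gap |F_ar - F_az| / max(F_ar, F_az)."""
    return abs(f_ar - f_az) / max(f_ar, f_az)


def exact_match(texts: Sequence[str], encode, decode) -> float:
    """Fraction of texts that survive an encode/decode round trip unchanged."""
    return sum(decode(encode(t)) == t for t in texts) / len(texts)


def gini(counts: Sequence[int]) -> float:
    """Gini coefficient of token usage counts (0 = perfectly uniform)."""
    xs = sorted(counts)
    n, total = len(xs), sum(xs)
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


# Toy tokenizer (hypothetical): split every word into chunks of up to 3 characters.
def toy_tokenize(text: str) -> List[str]:
    return [w[i:i + 3] for w in text.split() for i in range(0, len(w), 3)]


ar_texts = ["شنو هادا"]
az_texts = ["wash kayn shi jdid?"]

f_ar = fertility(ar_texts, toy_tokenize)
f_az = fertility(az_texts, toy_tokenize)
print("F_ar:", f_ar, "F_az:", f_az)
print("Disparity:", round(disparity(f_ar, f_az), 3))
# Pooling both scripts weights each script by its share of the words.
print("Overall F:", round(fertility(ar_texts + az_texts, toy_tokenize), 3))
print("Gini (uniform usage):", gini([5, 5, 5, 5]))
print("Exact match (byte-level toy codec):",
      exact_match(az_texts, lambda t: list(t.encode("utf-8")), lambda ids: bytes(ids).decode("utf-8")))
```

Pooling the two scripts is what makes the overall figure word-count-weighted; the 0.65 and 0.35 weights quoted above correspond to each script's share of the words in the test set.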

### Morphological Fidelity (Arabic-script only)

| | Metric | Definition | Direction | |
| |---|---|---| |
| **μ<sub>e</sub>** | Edit distance between tokenizer boundaries and Farasa morpheme boundaries | Lower is better |
| **μ<sub>c</sub>-F1** | Whether words sharing morphemes also share tokens (KMeans + TF-IDF) | Higher is better |
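
One plausible reading of μ<sub>e</sub> is a Levenshtein distance between two segmentations of the same word, each written with an explicit boundary marker. The sketch below follows that reading only. The `+` marker, the `boundary_edit_distance` helper, and the example segmentations are assumptions, and the benchmark's actual computation (in the repository scripts) may normalize or aggregate differently.

```python
from typing import List


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def boundary_edit_distance(token_pieces: List[str], gold_morphemes: List[str], sep: str = "+") -> int:
    """Edit distance between two segmentations rendered with a boundary marker."""
    return levenshtein(sep.join(token_pieces), sep.join(gold_morphemes))


# Hypothetical example: the tokenizer merges the conjunction and the article,
# while the gold segmentation separates them.
print(boundary_edit_distance(["وال", "كتاب"], ["و", "ال", "كتاب"]))
```

A lower value means the tokenizer's boundaries sit closer to the morpheme boundaries; identical segmentations give a distance of zero.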

### Statistical Rigor

All fertility and CPT values include **bootstrap 95% confidence intervals** (500 resamples). CI width is ≤ 0.006 for all configurations.
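
A percentile bootstrap of this kind can be sketched as follows. The sketch assumes sentences are resampled with replacement and fertility is recomputed on each resample. The function name, the seed, and the toy counts are illustrative, and the repository's `bootstrap_test_set.py` may differ in detail.

```python
import random
from typing import Sequence, Tuple


def bootstrap_fertility_ci(
    token_counts: Sequence[int],
    word_counts: Sequence[int],
    n_resamples: int = 500,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap CI for pooled fertility, resampling sentences with replacement."""
    rng = random.Random(seed)
    n = len(token_counts)
    stats = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        tokens = sum(token_counts[i] for i in idx)
        words = sum(word_counts[i] for i in idx)
        stats.append(tokens / words)
    stats.sort()
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Toy per-sentence counts (hypothetical): tokens produced and words contained.
token_counts = [7, 3, 5, 9, 4, 6]
word_counts = [4, 2, 4, 6, 3, 5]

point = sum(token_counts) / sum(word_counts)
low, high = bootstrap_fertility_ci(token_counts, word_counts)
print(f"fertility = {point:.3f}, 95% CI = [{low:.3f}, {high:.3f}]")
```

With only six toy sentences the interval is wide; on a test set of roughly 11,000 sentences per script, the interval narrows considerably, which is consistent with the ≤ 0.006 widths reported above.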

---

## Visualizations

### Fertility by Algorithm and Vocabulary Size
![Fertility Grid](plots/01_fertility_grid.png)

### Cross-Script Disparity
![Disparity](plots/03_disparity.png)

### External Tokenizer Comparison
![External Comparison](plots/08_external_comparison.png)

---

## Key Findings

1. **Darija-specific tokenization dramatically improves compression.** Our tokenizers achieve 27–33% lower fertility than DarijaBERT at matching vocabulary sizes.

2. **Cross-script fairness is achievable.** Shared Unigram at 80K reaches ΔF = 0.015 (near-zero disparity) without concatenation. Concatenated architectures maintain ΔF ≤ 0.094 across all non-BBPE algorithms.

3. **Vocabulary size saturates early.** Moving from 32K to 110K yields only marginal gains (best fertility 1.274 at 32K versus 1.155 at 110K). The training corpus exhausts merge candidates around 80K.

4. **BBPE requires concatenation.** Shared BBPE exhibits extreme cross-script disparity (ΔF = 0.219–0.243). Concatenation reduces this by about 4x.

5. **MSA tokenizers transfer poorly to Darija.** Our tokenizers achieve 40–50% lower fertility than MSA-trained tokenizers, which allocate vocabulary to MSA patterns absent in Darija.

---

## Methodology

| | Stage | Details | |
| |---|---| |
| **Dataset** | `OiQ/daa-pairs`: 112,814 Moroccan Darija sentence triplets (Arabic, Arabizi, Mixed) from 12,695 unique sources |
| | **Split** | 80/10/10 train/validation/test with stratified sampling | |
| | **Pre-tokenization** | Metaspace (BPE/Unigram/WordPiece) or ByteLevel (BBPE) | |
| | **Training** | HuggingFace `tokenizers` library with matched pre-tokenizer/decoder pairs | |
| **Evaluation** | 14 metrics on 33,846 test texts (11,282 per script × 3 scripts) |
| | **Morphological** | Farasa segmenter for gold-standard Arabic morpheme boundaries | |
| | **Export** | Raw + `transformers`-compatible via `PreTrainedTokenizerFast` | |
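
The 80/10/10 stratified split in the table can be sketched with the standard library alone. The stratification key used by the benchmark is not stated here, so the `source` field below is a hypothetical stand-in, and each record represents one sentence with all of its script variants so that they stay in the same partition.

```python
import random
from collections import defaultdict
from typing import Callable, Dict, Hashable, List, Tuple

Record = Dict[str, str]


def stratified_split(
    records: List[Record],
    key: Callable[[Record], Hashable],
    ratios: Tuple[float, float, float] = (0.8, 0.1, 0.1),
    seed: int = 0,
) -> Tuple[List[Record], List[Record], List[Record]]:
    """Split each stratum into train/validation/test so proportions hold per stratum."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for rec in records:
        groups[key(rec)].append(rec)

    train, val, test = [], [], []
    for _, items in sorted(groups.items(), key=lambda kv: str(kv[0])):
        rng.shuffle(items)
        n_train = round(ratios[0] * len(items))
        n_val = round(ratios[1] * len(items))
        train += items[:n_train]
        val += items[n_train:n_train + n_val]
        test += items[n_train + n_val:]  # remainder goes to the test split
    return train, val, test


# Toy records with a hypothetical `source` field as the stratification key.
records = [{"id": str(i), "source": "a" if i < 60 else "b"} for i in range(100)]
train, val, test = stratified_split(records, key=lambda r: r["source"])
print(len(train), len(val), len(test))
```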

---

## Code & Reproducibility

All scripts, evaluation code, and documentation are included in this repository. See [`code.md`](code.md) for a complete guide to every script and its outputs.

### Repository Structure

```
daa-tokenizers/
├── README.md                          # This file
├── code.md                            # Script documentation
├── script.py                          # Master benchmark pipeline
├── eval_test_set.py                   # Test-set evaluation
├── eval_morph_large.py                # Morphological metrics (80K/110K)
├── bootstrap_test_set.py              # Bootstrap confidence intervals
├── eval_all_externals.py              # External tokenizer comparison
├── eval_codeswitch_and_new_baselines.py
├── eval_doda_independent.py           # DODa validation
├── regen_figures.py                   # Figure generation
├── verify_arithmetic.py               # Numeric claims verification
├── results/
│   ├── test_set_results.csv           # Primary results (40 tokenizers)
│   ├── external_comparison.csv        # External comparison
│   ├── morph_large_vocab_results.csv  # Morphological metrics (80K/110K)
│   ├── bootstrap_ci_test_set.csv      # Bootstrap 95% CIs
│   ├── tokenizers/                    # 60 raw tokenizer JSONs
│   ├── transformers_tokenizers/       # Transformers-compatible exports
│   ├── corpora/                       # Train/test text splits
│   ├── morphology/                    # Farasa segmentations cache
│   └── plots/                         # All visualization PNGs
└── plots/                             # Paper figures
```

---

## Citation

```bibtex
@misc{laamiri2026daa-tokenizers,
  title  = {Darija Subword Tokenizer Benchmark},
  author = {Laamiri, Ouail and Berrada, Ismail and Belfadil, Anas},
  year   = {2026},
  url    = {https://huggingface.co/OiQ/daa-tokenizers},
  note   = {In collaboration with UM6P College of Computing}
}
```

---

## License

MIT License. See [LICENSE](LICENSE) for details.

## Acknowledgments

This work was developed in collaboration with the [UM6P College of Computing](https://www.um6p.ma/en/um6p-college-computing), Mohammed VI Polytechnic University, Ben Guerir, Morocco. We thank the HuggingFace community for providing the infrastructure to host these resources.