---
title: Darija Subword Tokenizer Benchmark
license: mit
tags:
  - tokenizer
  - moroccan-darija
  - arabic
  - bpe
  - unigram
  - wordpiece
  - bbpe
  - benchmark
language:
  - ar
  - lat
size_categories:
  - n>100K
---

# Darija Subword Tokenizer Benchmark

<p align="center">
  <strong>In collaboration with</strong><br>
  <a href="https://www.um6p.ma/en/um6p-college-computing">UM6P College of Computing</a>
</p>

---

## Overview

The first systematic subword tokenizer benchmark for **Moroccan Darija**, a low-resource dialect written concurrently in Arabic script and Arabizi (Latin script). We train and evaluate **40 tokenizer configurations** spanning four algorithms, two architectures, and five vocabulary sizes (8K–110K) on 112,814 parallel sentence pairs from [`OiQ/daa-pairs`](https://huggingface.co/datasets/OiQ/daa-pairs).

Our tokenizers achieve **27–33% lower fertility** than existing Darija tokenizers (DarijaBERT) at matching vocabulary sizes, and **40–50% lower fertility** than MSA-trained tokenizers. All 40 configurations maintain ≥99% exact reconstruction.

---

## Tokenizers

| Architecture | Description | Algorithms | Vocab Sizes | Count |
|---|---|---|---|---|
| **Shared** | Single vocabulary trained on mixed Arabic + Arabizi corpus | BPE, Unigram, WordPiece, BBPE | 8K, 16K, 32K, 80K, 110K | 20 |
| **Concatenated** | Separate per-script vocabularies (V/2 each) with ID shifting | BPE, Unigram, WordPiece, BBPE | 8K, 16K, 32K, 80K, 110K | 20 |

All tokenizers are released in both raw (HuggingFace `tokenizers`) and `transformers`-compatible formats. The 80K and 110K sizes match DarijaBERT's vocabulary sizes for direct comparison.
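
The ID shifting used by the concatenated architecture can be pictured with a small sketch. It assumes Arabic-script IDs occupy the lower half of the combined ID space and Arabizi IDs the upper half, and that text is routed by a simple Unicode-range check. The helpers `is_arabic_script`, `shift_ids`, and `to_combined_ids`, as well as the toy ID lists, are hypothetical illustrations and not the repository's implementation.

```python
ARABIC_START, ARABIC_END = 0x0600, 0x06FF  # basic Arabic Unicode block


def is_arabic_script(text: str) -> bool:
    """Hypothetical router: True if most letters fall in the Arabic block."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return False
    arabic = sum(ARABIC_START <= ord(ch) <= ARABIC_END for ch in letters)
    return arabic > len(letters) / 2


def shift_ids(ids: list[int], offset: int) -> list[int]:
    """Move sub-tokenizer IDs into their slice of the combined ID space."""
    return [i + offset for i in ids]


def to_combined_ids(text: str, local_ids: list[int], half_vocab: int) -> list[int]:
    """Assumed layout: Arabic IDs in [0, V/2), Arabizi IDs in [V/2, V)."""
    offset = 0 if is_arabic_script(text) else half_vocab
    return shift_ids(local_ids, offset)


half_vocab = 16_000  # V/2 for a 32K concatenated tokenizer
# The local IDs below are made up; real ones come from the sub-tokenizers.
ids_ar = to_combined_ids("شنو يدير", [5, 17, 42], half_vocab)
ids_az = to_combined_ids("wash kayn shi jdid?", [5, 17, 42], half_vocab)
print(ids_ar)  # [5, 17, 42]
print(ids_az)  # [16005, 16017, 16042]
```

Under this assumed layout, the same local ID maps to two distinct combined IDs, so the two scripts never collide in a downstream embedding table.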

---

## Quick Start

```python
from transformers import AutoTokenizer

# Load a shared tokenizer
tok = AutoTokenizer.from_pretrained(
    "OiQ/daa-tokenizers",
    subfolder="transformers_tokenizers/shared_bpe_32000"
)

# Tokenize Arabic-script text
text_ar = "مابقاش كيعرف شنو يدير، بين القانون وبين وليداتو."
print(tok.encode(text_ar))

# Load a concatenated tokenizer (separate Arabic and Arabizi sub-tokenizers)
tok_ar = AutoTokenizer.from_pretrained(
    "OiQ/daa-tokenizers",
    subfolder="transformers_tokenizers/concat_bpe_32000_tokenizer_ar"
)
tok_az = AutoTokenizer.from_pretrained(
    "OiQ/daa-tokenizers",
    subfolder="transformers_tokenizers/concat_bpe_32000_tokenizer_az"
)

# Each sub-tokenizer handles its own script
print(tok_ar.encode(text_ar))

text_az = "wash kayn shi jdid?"
print(tok_az.encode(text_az))
```

---

## Key Results

### Best Tokenizer per Vocabulary Size

| Vocab | Configuration | Algorithm | Fertility ↓ | Disparity ↓ | Exact Match |
|---|---|---|---|---|---|
| 8K | Shared | WordPiece | **1.572** | 0.164 | 99.9% |
| 16K | Shared | WordPiece | **1.402** | 0.138 | 99.9% |
| 32K | Shared | WordPiece | **1.274** | 0.099 | 99.9% |
| 80K | Shared | WordPiece | **1.171** | 0.049 | 99.9% |
| 110K | Concat | WordPiece | **1.155** | 0.093 | 99.6% |

### Comparison with Existing Tokenizers

| Tokenizer | Vocab | Fertility ↓ | Disparity ↓ | EM (Ar) | EM (Az) |
|---|---|---|---|---|---|
| **Ours: concat WP 110K** | **110K** | **1.155** | 0.093 | **99.9%** | **99.6%** |
| **Ours: concat WP 80K** | **80K** | **1.183** | 0.090 | **99.9%** | **99.6%** |
| **Ours: concat BPE 32K** | **32K** | **1.307** | 0.084 | **99.9%** | **99.6%** |
| DarijaBERT-ar | 80K | 1.761 | 0.410 | 13.7% | 8.0% |
| DarijaBERT-az | 110K | 1.575 | 0.055 | 14.8% | 8.0% |
| DarijaBERT-mix | 160K | 1.414 | 0.149 | 14.8% | 8.0% |
| CaMeLBERT-MSA | 30K | 2.289 | 0.427 | 29.9% | 38.9% |
| Aranizer-SP-86k | 86K | 1.918 | 0.368 | 99.8% | 99.6% |
| Qwen2.5-Darija | 152K | 2.307 | 0.040 | 100.0% | 100.0% |

At matching vocabulary sizes, our 80K tokenizer achieves **33% lower fertility** than DarijaBERT-ar (1.183 vs 1.761), and our 110K tokenizer achieves **27% lower fertility** than DarijaBERT-az (1.155 vs 1.575). Even our 32K tokenizer (F = 1.307) outperforms DarijaBERT-az despite using 3.4x fewer vocabulary slots. DarijaBERT-mix, despite its 160K vocabulary (F = 1.414), still underperforms our 32K tokenizer: vocabulary size alone cannot compensate for a suboptimal training setup.
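
The percentages above follow from the fertility values in the comparison table. The short check below recomputes them; the helper name `relative_reduction` is ours, and the vocabulary sizes are taken as exactly 110,000 and 32,000, which is an assumption about what "110K" and "32K" denote.

```python
def relative_reduction(ours: float, baseline: float) -> float:
    """Fractional fertility reduction of `ours` relative to `baseline`."""
    return (baseline - ours) / baseline


# Fertility values copied from the comparison table
vs_darijabert_ar = relative_reduction(1.183, 1.761)  # our 80K vs DarijaBERT-ar
vs_darijabert_az = relative_reduction(1.155, 1.575)  # our 110K vs DarijaBERT-az
vocab_ratio = 110_000 / 32_000                       # DarijaBERT-az vs our 32K

print(f"{vs_darijabert_ar:.1%}")  # 32.8%
print(f"{vs_darijabert_az:.1%}")  # 26.7%
print(f"{vocab_ratio:.1f}x")      # 3.4x
```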

---

## Evaluation Metrics

### Compression & Fairness

| Metric | Definition | Direction |
|---|---|---|
| **Fertility** (F) | Tokens per word, averaged over test set | Lower is better |
| **CPT** | Grapheme clusters per token (Unicode-aware) | Higher is better |
| **Disparity** (ΔF) | Relative cross-script gap: \|F<sub>ar</sub> − F<sub>az</sub>\| / max(F<sub>ar</sub>, F<sub>az</sub>) | Lower is better |
| **Exact Match** | Fraction of texts that round-trip perfectly through encode/decode | Higher is better |
| **Gini** | Vocabulary usage inequality (0 = uniform, 1 = concentrated) | Lower is better |

Overall fertility is word-count-weighted: F ≈ 0.65·F<sub>ar</sub> + 0.35·F<sub>az</sub>
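
As a rough illustration of how fertility, disparity, and the word-count-weighted overall value relate, the sketch below computes them with a toy tokenizer. It assumes fertility is a corpus-level ratio (total tokens over total whitespace-delimited words); the function names and `toy_tokenize` are hypothetical stand-ins, not the evaluation code in this repository.

```python
from typing import Callable, Sequence

Tokenize = Callable[[str], Sequence[str]]


def fertility(texts: Sequence[str], tokenize: Tokenize) -> float:
    """Total tokens divided by total whitespace-delimited words."""
    n_tokens = sum(len(tokenize(t)) for t in texts)
    n_words = sum(len(t.split()) for t in texts)
    return n_tokens / n_words


def disparity(f_ar: float, f_az: float) -> float:
    """Relative cross-script gap: |F_ar - F_az| / max(F_ar, F_az)."""
    return abs(f_ar - f_az) / max(f_ar, f_az)


def overall_fertility(f_ar: float, f_az: float, words_ar: int, words_az: int) -> float:
    """Word-count-weighted average of the per-script fertilities."""
    return (words_ar * f_ar + words_az * f_az) / (words_ar + words_az)


def toy_tokenize(text: str) -> list[str]:
    """Stand-in tokenizer: splits each word into chunks of at most 4 characters."""
    return [w[i:i + 4] for w in text.split() for i in range(0, len(w), 4)]


texts_ar = ["مابقاش كيعرف شنو يدير"]
texts_az = ["wash kayn shi jdid"]

f_ar = fertility(texts_ar, toy_tokenize)  # 6 tokens / 4 words
f_az = fertility(texts_az, toy_tokenize)  # 4 tokens / 4 words
print(f_ar, f_az)
print(round(disparity(f_ar, f_az), 3))
print(overall_fertility(f_ar, f_az, words_ar=65, words_az=35))
```

With a 65/35 word split, `overall_fertility` reduces to the 0.65·F<sub>ar</sub> + 0.35·F<sub>az</sub> form given above.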

### Morphological Fidelity (Arabic-script only)

| Metric | Definition | Direction |
|---|---|---|
| **μ<sub>e</sub>** | Edit distance between tokenizer boundaries and Farasa morpheme boundaries | Lower is better |
| **μ<sub>c</sub>-F1** | Whether words sharing morphemes also share tokens (KMeans + TF-IDF) | Higher is better |

### Statistical Rigor

All fertility and CPT values include **bootstrap 95% confidence intervals** (500 resamples). CI width is ≤ 0.006 for all configurations.
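
One common way to obtain such intervals is a percentile bootstrap over test texts. The sketch below assumes that approach, with 500 resamples as stated above. The function `bootstrap_fertility_ci` and the synthetic per-text counts are illustrative assumptions; the repository's `bootstrap_test_set.py` may differ in its details.

```python
import random


def bootstrap_fertility_ci(token_counts, word_counts, n_resamples=500, alpha=0.05, seed=0):
    """Percentile bootstrap CI for corpus-level fertility, resampling whole texts."""
    rng = random.Random(seed)
    n = len(token_counts)
    stats = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]  # sample texts with replacement
        tokens = sum(token_counts[i] for i in idx)
        words = sum(word_counts[i] for i in idx)
        stats.append(tokens / words)
    stats.sort()
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Synthetic per-text counts standing in for real tokenizer output
data_rng = random.Random(42)
word_counts = [data_rng.randint(3, 20) for _ in range(1000)]
token_counts = [w + data_rng.randint(0, w) for w in word_counts]

point = sum(token_counts) / sum(word_counts)
ci_low, ci_high = bootstrap_fertility_ci(token_counts, word_counts)
print(f"F = {point:.3f}, 95% CI [{ci_low:.3f}, {ci_high:.3f}]")
```

Resampling whole texts rather than individual words keeps the token and word counts of each sentence paired, which matters because fertility is a ratio of the two.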

---

## Visualizations

### Fertility by Algorithm and Vocabulary Size
![Fertility Comparison](plots/fertility_overall_comparison_v2.png)

### Cross-Script Disparity
![Disparity Comparison](plots/fertility_disparity_comparison_v2.png)

### External Tokenizer Comparison
![External Comparison](plots/external_comparison.png)

---

## Key Findings

1. **Darija-specific tokenization dramatically improves compression.** Our tokenizers achieve 27–33% lower fertility than DarijaBERT at matching vocabulary sizes.

2. **Cross-script fairness is achievable.** Shared Unigram at 80K reaches ΔF = 0.015 (near-zero disparity) without concatenation. Concatenated architectures maintain ΔF ≤ 0.094 across all non-BBPE algorithms.

3. **Vocabulary size saturates early.** Moving from 32K to 110K yields only marginal gains: the best fertility falls from 1.274 at 32K to 1.171 at 80K, then only to 1.155 at 110K. The training corpus exhausts merge candidates around 80K.

4. **BBPE requires concatenation.** Shared BBPE exhibits extreme cross-script disparity (ΔF = 0.219–0.243). Concatenation reduces this 4x.

5. **MSA tokenizers transfer poorly to Darija.** Our tokenizers achieve 40–50% lower fertility than MSA-trained tokenizers, which allocate vocabulary to MSA patterns absent in Darija.

---

## Methodology

| Stage | Details |
|---|---|
| **Dataset** | `OiQ/daa-pairs` — 112,814 Moroccan Darija sentence triplets (Arabic, Arabizi, Mixed) from 12,695 unique sources |
| **Split** | 80/10/10 train/validation/test with stratified sampling |
| **Pre-tokenization** | Metaspace (BPE/Unigram/WordPiece) or ByteLevel (BBPE) |
| **Training** | HuggingFace `tokenizers` library with matched pre-tokenizer/decoder pairs |
| **Evaluation** | 14 metrics on 33,846 test texts (11,282 per script × 3 scripts) |
| **Morphological** | Farasa segmenter for gold-standard Arabic morpheme boundaries |
| **Export** | Raw + `transformers`-compatible via `PreTrainedTokenizerFast` |
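
The pre-tokenization and training stages can be sketched with the HuggingFace `tokenizers` library as follows. This is a minimal example for the BPE + Metaspace case only. The four-sentence corpus, the `[UNK]` special token, and the toy vocabulary size are assumptions made for illustration, and the actual pipeline in `script.py` may be configured differently.

```python
from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

# Tiny stand-in corpus; the benchmark trains on the OiQ/daa-pairs train split.
corpus = [
    "wash kayn shi jdid?",
    "مابقاش كيعرف شنو يدير",
    "kayn shi jdid f lmdina",
    "شنو كاين جديد",
]

tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
# Matched pre-tokenizer/decoder pair: Metaspace marks word starts on the way in,
# and the Metaspace decoder turns those marks back into spaces on the way out.
tokenizer.pre_tokenizer = pre_tokenizers.Metaspace()
tokenizer.decoder = decoders.Metaspace()

# Toy vocabulary size; the benchmark uses 8K to 110K.
trainer = trainers.BpeTrainer(vocab_size=200, special_tokens=["[UNK]"])
tokenizer.train_from_iterator(corpus, trainer=trainer)

text = "wash kayn shi jdid?"
encoding = tokenizer.encode(text)
round_trip = tokenizer.decode(encoding.ids)
print(encoding.tokens)
print(round_trip == text)
```

Pairing the decoder with the pre-tokenizer is what makes the encode/decode round trip used by the Exact Match metric possible. Swapping in `pre_tokenizers.ByteLevel()` and `decoders.ByteLevel()` would be the analogous pairing for BBPE.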

---

## Code & Reproducibility

All scripts, evaluation code, and documentation are included in this repository. See [`code.md`](code.md) for a complete guide to every script and its outputs.

### Repository Structure

```
daa-tokenizers/
├── README.md                        # This file
├── code.md                          # Script documentation
├── script.py                        # Master benchmark pipeline
├── eval_test_set.py                 # Test-set evaluation
├── eval_morph_large.py              # Morphological metrics (80K/110K)
├── bootstrap_test_set.py            # Bootstrap confidence intervals
├── eval_all_externals.py            # External tokenizer comparison
├── eval_codeswitch_and_new_baselines.py
├── eval_doda_independent.py         # DODa validation
├── regen_figures.py                 # Figure generation
├── verify_arithmetic.py             # Numeric claims verification
├── results/
│   ├── test_set_results.csv         # Primary results (40 tokenizers)
│   ├── external_comparison.csv      # External comparison
│   ├── morph_large_vocab_results.csv # Morphological metrics (80K/110K)
│   ├── bootstrap_ci_test_set.csv    # Bootstrap 95% CIs
│   ├── tokenizers/                  # 60 raw tokenizer JSONs
│   ├── transformers_tokenizers/     # Transformers-compatible exports
│   ├── corpora/                     # Train/test text splits
│   ├── morphology/                  # Farasa segmentations cache
│   └── plots/                       # All visualization PNGs
└── plots/                           # Paper figures
```

---

## Citation

```bibtex
@misc{laamiri2026daa-tokenizers,
  title     = {Darija Subword Tokenizer Benchmark},
  author    = {Laamiri, Ouail and Berrada, Ismail and Belfadil, Anas},
  year      = {2026},
  url       = {https://huggingface.co/OiQ/daa-tokenizers},
  note      = {In collaboration with UM6P College of Computing}
}
```

---

## License

MIT License — see [LICENSE](LICENSE) for details.

## Acknowledgments

This work was developed in collaboration with the [UM6P College of Computing](https://www.um6p.ma/en/um6p-college-computing), Mohammed VI Polytechnic University, Ben Guerir, Morocco. We thank the HuggingFace community for providing the infrastructure to host these resources.