---
title: Darija Subword Tokenizer Benchmark
license: mit
tags:
- tokenizer
- moroccan-darija
- arabic
- bpe
- unigram
- wordpiece
- bbpe
- benchmark
language:
- ar
- lat
size_categories:
- n>100K
---
# Darija Subword Tokenizer Benchmark
In collaboration with UM6P College of Computing
---
## Overview
The first systematic subword tokenizer benchmark for **Moroccan Darija**, a low-resource dialect written in both Arabic script and Arabizi (Latin script). We train and evaluate **40 tokenizer configurations** spanning four algorithms, two architectures, and five vocabulary sizes (8K–110K) on 112,814 parallel sentence pairs from [`OiQ/daa-pairs`](https://huggingface.co/datasets/OiQ/daa-pairs).
Our tokenizers achieve **27–33% lower fertility** than existing Darija tokenizers (DarijaBERT) at matching vocabulary sizes, and **40–50% lower fertility** than MSA-trained tokenizers. All 40 configurations maintain ≥99% exact-match reconstruction.
---
## Tokenizers
| Architecture | Description | Algorithms | Vocab Sizes | Count |
|---|---|---|---|---|
| **Shared** | Single vocabulary trained on mixed Arabic + Arabizi corpus | BPE, Unigram, WordPiece, BBPE | 8K, 16K, 32K, 80K, 110K | 20 |
| **Concatenated** | Separate per-script vocabularies (V/2 each) with ID shifting | BPE, Unigram, WordPiece, BBPE | 8K, 16K, 32K, 80K, 110K | 20 |
All tokenizers are released in both raw (HuggingFace `tokenizers`) and `transformers`-compatible formats. The 80K and 110K sizes match DarijaBERT's vocabulary sizes for direct comparison.
---
## Quick Start
```python
from transformers import AutoTokenizer

# Load a shared tokenizer (single vocabulary for both scripts)
tok = AutoTokenizer.from_pretrained(
    "OiQ/daa-tokenizers",
    subfolder="transformers_tokenizers/shared_bpe_32000",
)

# Tokenize Arabic-script text
text_ar = "مابقاش كيعرف شنو يدير، بين القانون وبين وليداتو."
print(tok.encode(text_ar))

# Load a concatenated tokenizer (separate Arabic and Arabizi sub-tokenizers)
tok_ar = AutoTokenizer.from_pretrained(
    "OiQ/daa-tokenizers",
    subfolder="transformers_tokenizers/concat_bpe_32000_tokenizer_ar",
)
tok_az = AutoTokenizer.from_pretrained(
    "OiQ/daa-tokenizers",
    subfolder="transformers_tokenizers/concat_bpe_32000_tokenizer_az",
)

# Encode each script with its own sub-tokenizer
print(tok_ar.encode(text_ar))
text_az = "wash kayn shi jdid?"
print(tok_az.encode(text_az))
```
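
The concatenated architecture is described above as separate per-script vocabularies with ID shifting. The sketch below shows one way the two sub-tokenizers could be combined into a single ID space. It assumes that Arabizi IDs are offset by the size of the Arabic vocabulary and that the script is detected with a simple Unicode-range heuristic. Both are assumptions, and the helper names `is_arabic_script` and `encode_concat` are hypothetical rather than part of this repository. Toy encoders stand in for the real tokenizers so the block runs without downloads.

```python
from typing import Callable, List


def is_arabic_script(text: str) -> bool:
    """Heuristic: True if at least half of the letters are in the basic Arabic block."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return False
    arabic = sum(1 for ch in letters if "\u0600" <= ch <= "\u06FF")
    return 2 * arabic >= len(letters)


def encode_concat(
    text: str,
    encode_ar: Callable[[str], List[int]],
    encode_az: Callable[[str], List[int]],
    ar_vocab_size: int,
) -> List[int]:
    """Route text to a per-script encoder; shift Arabizi IDs past the Arabic range."""
    if is_arabic_script(text):
        return list(encode_ar(text))
    # Assumed shifting scheme: Arabizi IDs live in [ar_vocab_size, 2 * ar_vocab_size).
    return [token_id + ar_vocab_size for token_id in encode_az(text)]


def toy_encode(text: str) -> List[int]:
    """Toy encoder (illustration only): one ID per whitespace-separated word."""
    return list(range(len(text.split())))


print(encode_concat("شنو يدير", toy_encode, toy_encode, ar_vocab_size=16000))
print(encode_concat("wash kayn shi jdid?", toy_encode, toy_encode, ar_vocab_size=16000))
```

With the tokenizers loaded in the Quick Start, the equivalent call would be `encode_concat(text, tok_ar.encode, tok_az.encode, len(tok_ar))`. How special tokens are shared between the two ranges is not covered by this sketch.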
---
## Key Results
### Best Tokenizer per Vocabulary Size
| Vocab | Configuration | Algorithm | Fertility ↓ | Disparity ↓ | Exact Match |
|---|---|---|---|---|---|
| 8K | Shared | WordPiece | **1.572** | 0.164 | 99.9% |
| 16K | Shared | WordPiece | **1.402** | 0.138 | 99.9% |
| 32K | Shared | WordPiece | **1.274** | 0.099 | 99.9% |
| 80K | Shared | WordPiece | **1.171** | 0.049 | 99.9% |
| 110K | Concat | WordPiece | **1.155** | 0.093 | 99.6% |
### Comparison with Existing Tokenizers
| Tokenizer | Vocab | Fertility ↓ | Disparity ↓ | EM (Ar) | EM (Az) |
|---|---|---|---|---|---|
| **Ours: concat WP 110K** | **110K** | **1.155** | 0.093 | **99.9%** | **99.6%** |
| **Ours: concat WP 80K** | **80K** | **1.183** | 0.090 | **99.9%** | **99.6%** |
| **Ours: concat BPE 32K** | **32K** | **1.307** | 0.084 | **99.9%** | **99.6%** |
| DarijaBERT-ar | 80K | 1.761 | 0.410 | 13.7% | 8.0% |
| DarijaBERT-az | 110K | 1.575 | 0.055 | 14.8% | 8.0% |
| DarijaBERT-mix | 160K | 1.414 | 0.149 | 14.8% | 8.0% |
| CaMeLBERT-MSA | 30K | 2.289 | 0.427 | 29.9% | 38.9% |
| Aranizer-SP-86k | 86K | 1.918 | 0.368 | 99.8% | 99.6% |
| Qwen2.5-Darija | 152K | 2.307 | 0.040 | 100.0% | 100.0% |
At matching vocabulary sizes, our 80K tokenizer achieves **33% lower fertility** than DarijaBERT-ar (1.183 vs 1.761), and our 110K tokenizer achieves **27% lower fertility** than DarijaBERT-az (1.155 vs 1.575). Even our 32K tokenizer (F = 1.307) outperforms DarijaBERT-az while using a vocabulary about 3.4x smaller. DarijaBERT-mix, despite its 160K vocabulary (F = 1.414), still underperforms our 32K tokenizer: vocabulary size alone cannot compensate for a suboptimal training architecture.
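
The percentages above can be checked directly from the table. The short sketch below recomputes them. The function name `relative_reduction` is ours, for illustration only, and is not taken from `verify_arithmetic.py`.

```python
def relative_reduction(ours: float, baseline: float) -> float:
    """Fraction by which `ours` is lower than `baseline`."""
    return (baseline - ours) / baseline


# Fertility values copied from the comparison table.
print(f"vs DarijaBERT-ar (80K):  {relative_reduction(1.183, 1.761):.1%}")  # 32.8%
print(f"vs DarijaBERT-az (110K): {relative_reduction(1.155, 1.575):.1%}")  # 26.7%

# Vocabulary ratio between DarijaBERT-az (110K) and our 32K tokenizer.
print(f"vocabulary ratio: {110_000 / 32_000:.1f}x")  # 3.4x
```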
---
## Evaluation Metrics
### Compression & Fairness
| Metric | Definition | Direction |
|---|---|---|
| **Fertility** (F) | Tokens per word, averaged over test set | Lower is better |
| **CPT** | Grapheme clusters per token (Unicode-aware) | Higher is better |
| **Disparity** (ΔF) | Relative cross-script gap: \|Far − Faz\| / max(Far, Faz) | Lower is better |
| **Exact Match** | Fraction of texts that round-trip perfectly through encode/decode | Higher is better |
| **Gini** | Vocabulary usage inequality (0 = uniform, 1 = concentrated) | Lower is better |
Overall fertility is word-count-weighted, with weights given by each script's share of test-set words: F ≈ 0.65·Far + 0.35·Faz.
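
As a rough illustration of these definitions, the sketch below computes per-script fertility, the disparity ΔF, and the word-count-weighted overall fertility. It assumes that words are whitespace-separated and that fertility is total tokens divided by total words. The exact counting used in `eval_test_set.py` may differ. All function names and the toy tokenizer are hypothetical.

```python
from typing import Callable, Iterable, List, Tuple


def count_tokens_and_words(
    texts: Iterable[str], tokenize: Callable[[str], List[str]]
) -> Tuple[int, int]:
    """Total token count and total whitespace-word count over a set of texts."""
    n_tokens = n_words = 0
    for text in texts:
        n_tokens += len(tokenize(text))
        n_words += len(text.split())
    return n_tokens, n_words


def disparity(f_ar: float, f_az: float) -> float:
    """Relative cross-script gap: |F_ar - F_az| / max(F_ar, F_az)."""
    return abs(f_ar - f_az) / max(f_ar, f_az)


def toy_tokenize(text: str) -> List[str]:
    """Toy tokenizer (illustration only): split each word into chunks of up to 3 characters."""
    return [w[i:i + 3] for w in text.split() for i in range(0, len(w), 3)]


ar_tokens, ar_words = count_tokens_and_words(["مابقاش كيعرف شنو"], toy_tokenize)
az_tokens, az_words = count_tokens_and_words(["wash kayn shi jdid"], toy_tokenize)

f_ar = ar_tokens / ar_words
f_az = az_tokens / az_words

# Pooling both scripts is the same as weighting each fertility by its word share.
total_words = ar_words + az_words
f_weighted = (ar_words / total_words) * f_ar + (az_words / total_words) * f_az
f_pooled = (ar_tokens + az_tokens) / total_words

print(f"F_ar={f_ar:.3f}  F_az={f_az:.3f}  dF={disparity(f_ar, f_az):.3f}")
print(f"F (weighted)={f_weighted:.3f}  F (pooled)={f_pooled:.3f}")
```

Under this reading, the 0.65 and 0.35 weights in the formula above would correspond to the Arabic-script and Arabizi shares of test-set words.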
### Morphological Fidelity (Arabic-script only)
| Metric | Definition | Direction |
|---|---|---|
| **μe** | Edit distance between tokenizer boundaries and Farasa morpheme boundaries | Lower is better |
| **μc-F1** | Whether words sharing morphemes also share tokens (KMeans + TF-IDF) | Higher is better |
### Statistical Rigor
All fertility and CPT values include **bootstrap 95% confidence intervals** (500 resamples). CI width is ≤ 0.006 for all configurations.
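
A minimal sketch of how such an interval could be computed is shown below. It assumes sentence-level resampling with replacement and the percentile method. Neither detail is stated above, so `bootstrap_test_set.py` may do this differently. The function name and the toy counts are hypothetical.

```python
import random
from typing import Sequence, Tuple


def bootstrap_fertility_ci(
    token_counts: Sequence[int],
    word_counts: Sequence[int],
    n_resamples: int = 500,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap CI for fertility (total tokens / total words)."""
    rng = random.Random(seed)
    n = len(token_counts)
    stats = []
    for _ in range(n_resamples):
        # Resample sentences with replacement, then recompute the ratio.
        idx = [rng.randrange(n) for _ in range(n)]
        tokens = sum(token_counts[i] for i in idx)
        words = sum(word_counts[i] for i in idx)
        stats.append(tokens / words)
    stats.sort()
    lo = stats[int((alpha / 2) * n_resamples)]
    hi = stats[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lo, hi


# Toy per-sentence counts (illustration only, not benchmark data).
toy_tokens = [3, 5, 4, 6] * 50
toy_words = [2, 4, 3, 4] * 50
low, high = bootstrap_fertility_ci(toy_tokens, toy_words)
print(f"95% CI: [{low:.3f}, {high:.3f}]  width={high - low:.3f}")
```

The interval narrows as the number of test sentences grows, which is consistent with the small CI widths reported for a test set of this size.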
---
## Visualizations
### Fertility by Algorithm and Vocabulary Size

### Cross-Script Disparity

### External Tokenizer Comparison

---
## Key Findings
1. **Darija-specific tokenization dramatically improves compression.** Our tokenizers achieve 27–33% lower fertility than DarijaBERT at matching vocabulary sizes.
2. **Cross-script fairness is achievable.** Shared Unigram at 80K reaches ΔF = 0.015 (near-zero disparity) without concatenation. Concatenated architectures maintain ΔF ≤ 0.094 across all non-BBPE algorithms.
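3. **Vocabulary size saturates early.** Moving from 32K to 110K lowers the best fertility only from 1.274 to 1.155 (about 9%), compared with a drop from 1.572 to 1.274 (about 19%) between 8K and 32K. The training corpus exhausts merge candidates around 80K.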
4. **BBPE requires concatenation.** Shared BBPE exhibits extreme cross-script disparity (ΔF = 0.219–0.243). Concatenation reduces this 4x.
5. **MSA tokenizers transfer poorly to Darija.** Our tokenizers achieve 40–50% lower fertility than MSA-trained tokenizers, which allocate vocabulary to MSA patterns absent in Darija.
---
## Methodology
| Stage | Details |
|---|---|
| **Dataset** | `OiQ/daa-pairs` — 112,814 Moroccan Darija sentence triplets (Arabic, Arabizi, Mixed) from 12,695 unique sources |
| **Split** | 80/10/10 train/validation/test with stratified sampling |
| **Pre-tokenization** | Metaspace (BPE/Unigram/WordPiece) or ByteLevel (BBPE) |
| **Training** | HuggingFace `tokenizers` library with matched pre-tokenizer/decoder pairs |
| **Evaluation** | 14 metrics on 33,846 test texts (11,282 per script × 3 scripts) |
| **Morphological** | Farasa segmenter for gold-standard Arabic morpheme boundaries |
| **Export** | Raw + `transformers`-compatible via `PreTrainedTokenizerFast` |
---
## Code & Reproducibility
All scripts, evaluation code, and documentation are included in this repository. See [`code.md`](code.md) for a complete guide to every script and its outputs.
### Repository Structure
```
daa-tokenizers/
├── README.md # This file
├── code.md # Script documentation
├── script.py # Master benchmark pipeline
├── eval_test_set.py # Test-set evaluation
├── eval_morph_large.py # Morphological metrics (80K/110K)
├── bootstrap_test_set.py # Bootstrap confidence intervals
├── eval_all_externals.py # External tokenizer comparison
├── eval_codeswitch_and_new_baselines.py
├── eval_doda_independent.py # DODa validation
├── regen_figures.py # Figure generation
├── verify_arithmetic.py # Numeric claims verification
├── results/
│ ├── test_set_results.csv # Primary results (40 tokenizers)
│ ├── external_comparison.csv # External comparison
│ ├── morph_large_vocab_results.csv # Morphological metrics (80K/110K)
│ ├── bootstrap_ci_test_set.csv # Bootstrap 95% CIs
│ ├── tokenizers/ # 60 raw tokenizer JSONs
│ ├── transformers_tokenizers/ # Transformers-compatible exports
│ ├── corpora/ # Train/test text splits
│ ├── morphology/ # Farasa segmentations cache
│ └── plots/ # All visualization PNGs
└── plots/ # Paper figures
```
---
## Citation
```bibtex
@misc{laamiri2026daa-tokenizers,
title = {Darija Subword Tokenizer Benchmark},
author = {Laamiri, Ouail and Berrada, Ismail and Belfadil, Anas},
year = {2026},
url = {https://huggingface.co/OiQ/daa-tokenizers},
note = {In collaboration with UM6P College of Computing}
}
```
---
## License
MIT License — see [LICENSE](LICENSE) for details.
## Acknowledgments
This work was developed in collaboration with the [UM6P College of Computing](https://www.um6p.ma/en/um6p-college-computing), Mohammed VI Polytechnic University, Ben Guerir, Morocco. We thank the HuggingFace community for providing the infrastructure to host these resources.