	---
	title: Darija Subword Tokenizer Benchmark
	license: mit
	tags:
	- tokenizer
	- moroccan-darija
	- arabic
	- bpe
	- unigram
	- wordpiece
	- bbpe
	- benchmark
	language:
	- ar
	- lat
	size_categories:
	- n>100K
	---

	# Darija Subword Tokenizer Benchmark

	<p align="center">
	<strong>In collaboration with</strong><br>
	<a href="https://www.um6p.ma/en/um6p-college-computing">UM6P College of Computing</a>
	</p>

	---

	## Overview

	This is the first systematic subword tokenizer benchmark for Moroccan Darija, a low-resource dialect written in both Arabic script and Arabizi (Latin script). We train and evaluate 40 tokenizer configurations (four algorithms × two architectures × five vocabulary sizes, 8K--110K) on 112,814 parallel sentence pairs from [`OiQ/daa-pairs`](https://huggingface.co/datasets/OiQ/daa-pairs).

	Our tokenizers achieve 27--33% lower fertility than existing Darija tokenizers (DarijaBERT) at matching vocabulary sizes, and 40--50% lower fertility than MSA-trained tokenizers. All 40 configurations maintain ≥99% exact reconstruction.

	---

	## Tokenizers

	| Architecture | Description | Algorithms | Vocab Sizes | Count |
	|---|---|---|---|---|
	| Shared | Single vocabulary trained on mixed Arabic + Arabizi corpus | BPE, Unigram, WordPiece, BBPE | 8K, 16K, 32K, 80K, 110K | 20 |
	| Concatenated | Separate per-script vocabularies (V/2 each) with ID shifting | BPE, Unigram, WordPiece, BBPE | 8K, 16K, 32K, 80K, 110K | 20 |

	All tokenizers are released in two formats: raw Hugging Face `tokenizers` JSON files and `transformers`-compatible exports. The 80K and 110K sizes match DarijaBERT's vocabulary sizes for direct comparison.
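
One way to picture the concatenated architecture is as two sub-tokenizers whose ID ranges are kept disjoint. The sketch below is a toy illustration and not the repository's implementation: the script-detection rule, the `ToySubTokenizer` class, and the choice to shift Arabizi IDs by V/2 are all assumptions made for the example.

```python
from dataclasses import dataclass, field


def is_arabic_script(word):
    """Heuristic (assumption): any character in the basic Arabic Unicode block."""
    return any("\u0600" <= ch <= "\u06FF" for ch in word)


@dataclass
class ToySubTokenizer:
    """Hypothetical stand-in for a trained per-script tokenizer: one ID per known word."""
    vocab: dict = field(default_factory=dict)

    def encode(self, word):
        # ID 0 plays the role of an unknown token in this toy vocabulary.
        return [self.vocab.get(word, 0)]


def encode_concat(text, tok_ar, tok_az, half_vocab):
    """Route each word to its script's sub-tokenizer and shift Arabizi IDs by V/2."""
    ids = []
    for word in text.split():
        if is_arabic_script(word):
            # Arabic IDs stay in [0, V/2).
            ids.extend(tok_ar.encode(word))
        else:
            # Arabizi IDs are moved into [V/2, V) so the two ranges never collide.
            ids.extend(i + half_vocab for i in tok_az.encode(word))
    return ids


half = 16_000  # V/2 for a hypothetical V = 32K concatenated tokenizer
tok_ar = ToySubTokenizer({"شنو": 1, "كاين": 2})
tok_az = ToySubTokenizer({"wash": 1, "kayn": 2})

mixed_ids = encode_concat("wash كاين shi", tok_ar, tok_az, half)
print(mixed_ids)  # [16001, 2, 16000]
```

Because the two ranges do not overlap, a downstream embedding table of size V can serve both scripts without ID collisions, which is the property the "ID shifting" description suggests.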

	---

	## Quick Start

	```python
	from transformers import AutoTokenizer

	# Load a shared tokenizer (one vocabulary for both scripts)
	tok = AutoTokenizer.from_pretrained(
	    "OiQ/daa-tokenizers",
	    subfolder="transformers_tokenizers/shared_bpe_32000",
	)

	# Tokenize Arabic-script text
	text_ar = "مابقاش كيعرف شنو يدير، بين القانون وبين وليداتو."
	print(tok.encode(text_ar))

	# Load a concatenated tokenizer (separate Arabic and Arabizi sub-tokenizers)
	tok_ar = AutoTokenizer.from_pretrained(
	    "OiQ/daa-tokenizers",
	    subfolder="transformers_tokenizers/concat_bpe_32000_tokenizer_ar",
	)
	tok_az = AutoTokenizer.from_pretrained(
	    "OiQ/daa-tokenizers",
	    subfolder="transformers_tokenizers/concat_bpe_32000_tokenizer_az",
	)

	# Tokenize each script with its own sub-tokenizer
	print(tok_ar.encode(text_ar))
	text_az = "wash kayn shi jdid?"
	print(tok_az.encode(text_az))
	```

	---

	## Key Results

	### Best Tokenizer per Vocabulary Size

	| Vocab | Configuration | Algorithm | Fertility ↓ | Disparity ↓ | Exact Match |
	|---|---|---|---|---|---|
	| 8K | Shared | WordPiece | 1.572 | 0.164 | 99.9% |
	| 16K | Shared | WordPiece | 1.402 | 0.138 | 99.9% |
	| 32K | Shared | WordPiece | 1.274 | 0.099 | 99.9% |
	| 80K | Shared | WordPiece | 1.171 | 0.049 | 99.9% |
	| 110K | Concat | WordPiece | 1.155 | 0.093 | 99.6% |

	### Comparison with Existing Tokenizers

	| Tokenizer | Vocab | Fertility ↓ | Disparity ↓ | Exact Match (Ar) | Exact Match (Az) |
	|---|---|---|---|---|---|
	| Ours: concat WP 110K | 110K | 1.155 | 0.093 | 99.9% | 99.6% |
	| Ours: concat WP 80K | 80K | 1.183 | 0.090 | 99.9% | 99.6% |
	| Ours: concat BPE 32K | 32K | 1.307 | 0.084 | 99.9% | 99.6% |
	| DarijaBERT-ar | 80K | 1.761 | 0.410 | 13.7% | 8.0% |
	| DarijaBERT-az | 110K | 1.575 | 0.055 | 14.8% | 8.0% |
	| DarijaBERT-mix | 160K | 1.414 | 0.149 | 14.8% | 8.0% |
	| CaMeLBERT-MSA | 30K | 2.289 | 0.427 | 29.9% | 38.9% |
	| Aranizer-SP-86k | 86K | 1.918 | 0.368 | 99.8% | 99.6% |
	| Qwen2.5-Darija | 152K | 2.307 | 0.040 | 100.0% | 100.0% |

	At matching vocabulary sizes, our 80K tokenizer achieves 33% lower fertility than DarijaBERT-ar (1.183 vs 1.761), and our 110K tokenizer achieves 27% lower fertility than DarijaBERT-az (1.155 vs 1.575). Even our 32K tokenizer (F = 1.307) outperforms DarijaBERT-az with a vocabulary roughly 3.4x smaller. DarijaBERT-mix, despite its 160K vocabulary (F = 1.414), still underperforms our 32K tokenizer: vocabulary size alone cannot compensate for a suboptimal training architecture.
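
The percentages above follow directly from the fertility values in the comparison table. A short check, where `relative_reduction` is a helper name introduced here for illustration:

```python
def relative_reduction(ours, baseline):
    """Fractional fertility reduction of `ours` relative to `baseline`."""
    return (baseline - ours) / baseline


# Fertility values copied from the comparison table.
vs_darijabert_ar = relative_reduction(1.183, 1.761)  # our 80K vs DarijaBERT-ar
vs_darijabert_az = relative_reduction(1.155, 1.575)  # our 110K vs DarijaBERT-az
vocab_ratio = 110 / 32                               # DarijaBERT-az vocab vs our 32K

print(f"{vs_darijabert_ar:.1%}  {vs_darijabert_az:.1%}  {vocab_ratio:.1f}x")
# 32.8%  26.7%  3.4x
```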

	---

	## Evaluation Metrics

	### Compression & Fairness

	| Metric | Definition | Direction |
	|---|---|---|
	| Fertility (F) | Tokens per word, averaged over test set | Lower is better |
	| CPT | Grapheme clusters per token (Unicode-aware) | Higher is better |
	| Disparity (ΔF) | Relative cross-script gap: \|F<sub>ar</sub> − F<sub>az</sub>\| / max(F<sub>ar</sub>, F<sub>az</sub>) | Lower is better |
	| Exact Match | Fraction of texts that round-trip perfectly through encode/decode | Higher is better |
	| Gini | Vocabulary usage inequality (0 = uniform, 1 = concentrated) | Lower is better |

	Overall fertility is word-count-weighted: F ≈ 0.65·F<sub>ar</sub> + 0.35·F<sub>az</sub>
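
The definitions above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the evaluation code in this repository: fertility is computed as a corpus-level ratio over whitespace-delimited words, and `toy_encode` / `toy_decode` are hypothetical stand-ins for a real tokenizer so the example runs without downloads.

```python
def fertility(texts, encode):
    """Corpus-level fertility: total tokens / total whitespace-delimited words."""
    n_tokens = sum(len(encode(t)) for t in texts)
    n_words = sum(len(t.split()) for t in texts)
    return n_tokens / n_words


def disparity(f_ar, f_az):
    """Relative cross-script gap |F_ar - F_az| / max(F_ar, F_az)."""
    return abs(f_ar - f_az) / max(f_ar, f_az)


def exact_match(texts, encode, decode):
    """Fraction of texts that survive an encode/decode round trip unchanged."""
    return sum(decode(encode(t)) == t for t in texts) / len(texts)


def gini(counts):
    """Gini coefficient of token-usage counts (0 = uniform usage)."""
    xs = sorted(counts)
    n, total = len(xs), sum(xs)
    ranked = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2 * ranked / (n * total) - (n + 1) / n


def toy_encode(text):
    """Hypothetical tokenizer: split each word into chunks of at most 3 characters."""
    tokens = []
    for word in text.split(" "):
        pieces = [word[i:i + 3] for i in range(0, len(word), 3)] or [""]
        pieces[0] = "▁" + pieces[0]  # Metaspace-style word-start marker
        tokens.extend(pieces)
    return tokens


def toy_decode(tokens):
    """Inverse of toy_encode: join pieces and turn word-start markers into spaces."""
    return "".join(tokens).replace("▁", " ")[1:]


ar_texts = ["شنو كاين"]            # 2 words
az_texts = ["wash kayn shi jdid"]  # 4 words

f_ar = fertility(ar_texts, toy_encode)  # 3 tokens / 2 words
f_az = fertility(az_texts, toy_encode)  # 7 tokens / 4 words
f_all = fertility(ar_texts + az_texts, toy_encode)

# Overall fertility equals the word-count-weighted mean of the per-script values.
w_ar, w_az = 2 / 6, 4 / 6
f_weighted = w_ar * f_ar + w_az * f_az

print(f_ar, f_az, round(disparity(f_ar, f_az), 3))  # 1.5 1.75 0.143
print(exact_match(ar_texts + az_texts, toy_encode, toy_decode))  # 1.0
```

In the toy data the weights are 2/6 and 4/6 because the Arabizi text has twice as many words. The 0.65 / 0.35 weights quoted above play the same role for the real test set.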

	### Morphological Fidelity (Arabic-script only)

	| Metric | Definition | Direction |
	|---|---|---|
	| μ<sub>e</sub> | Edit distance between tokenizer boundaries and Farasa morpheme boundaries | Lower is better |
	| μ<sub>c</sub>-F1 | Whether words sharing morphemes also share tokens (KMeans + TF-IDF) | Higher is better |
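
As a rough illustration of a boundary-level edit distance such as μ<sub>e</sub>, the sketch below joins each segmentation with a `+` marker and computes the Levenshtein distance between the two marked strings. This formulation is an assumption made for the example; the exact definition used in the benchmark may differ, and `boundary_edit_distance` is a hypothetical helper name.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def boundary_edit_distance(token_pieces, morphemes, sep="+"):
    """Distance between two segmentations of the same word, written with boundary markers."""
    return levenshtein(sep.join(token_pieces), sep.join(morphemes))


# Gold segmentation of one word into three morphemes (illustrative example).
gold = ["و", "ال", "كتاب"]

merged_prefix = boundary_edit_distance(["وال", "كتاب"], gold)  # one boundary missing
perfect = boundary_edit_distance(["و", "ال", "كتاب"], gold)    # boundaries agree

print(merged_prefix, perfect)  # 1 0
```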

	### Statistical Rigor

	All fertility and CPT values include bootstrap 95% confidence intervals (500 resamples). CI width is ≤ 0.006 for all configurations.
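
A percentile bootstrap over test texts is one common way to obtain such intervals. The sketch below assumes that resampling is done at the text level and that the statistic is corpus-level fertility; the per-text counts are synthetic and `bootstrap_ci` is a name introduced here, not a function from this repository.

```python
import random


def bootstrap_ci(tokens_per_text, words_per_text, n_resamples=500, alpha=0.05, seed=0):
    """Percentile bootstrap CI for fertility, resampling texts with replacement."""
    rng = random.Random(seed)
    n = len(tokens_per_text)
    stats = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        tokens = sum(tokens_per_text[i] for i in idx)
        words = sum(words_per_text[i] for i in idx)
        stats.append(tokens / words)
    stats.sort()
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Synthetic per-text counts (illustration only, not benchmark data).
words = [5 + (i % 7) for i in range(200)]
tokens = [w + (i % 4) for i, w in enumerate(words)]

point = sum(tokens) / sum(words)
lower, upper = bootstrap_ci(tokens, words)
print(f"F = {point:.3f}, 95% CI = [{lower:.3f}, {upper:.3f}]")
```

With a fixed seed the interval is reproducible. The interval width `upper - lower` corresponds to the CI width figure reported above.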

	---

	## Visualizations

	### Fertility by Algorithm and Vocabulary Size
	![Fertility Comparison](plots/fertility_overall_comparison_v2.png)

	### Cross-Script Disparity
	![Disparity Comparison](plots/fertility_disparity_comparison_v2.png)

	### External Tokenizer Comparison
	![External Comparison](plots/external_comparison.png)

	---

	## Key Findings

	1. Darija-specific tokenization dramatically improves compression. Our tokenizers achieve 27--33% lower fertility than DarijaBERT at matching vocabulary sizes.

	2. Cross-script fairness is achievable. Shared Unigram at 80K reaches ΔF = 0.015 (near-zero disparity) without concatenation. Concatenated architectures maintain ΔF ≤ 0.094 across all non-BBPE algorithms.

	3. Vocabulary size saturates early. Moving from 32K to 110K yields only marginal gains. The training corpus exhausts merge candidates around 80K.

	4. BBPE requires concatenation. Shared BBPE exhibits extreme cross-script disparity (ΔF = 0.219–0.243). Concatenation reduces this 4x.

	5. MSA tokenizers transfer poorly to Darija. Our tokenizers achieve 40--50% lower fertility than MSA-trained tokenizers, which allocate vocabulary to MSA patterns absent in Darija.

	---

	## Methodology

	| Stage | Details |
	|---|---|
	| Dataset | `OiQ/daa-pairs`: 112,814 Moroccan Darija sentence triplets (Arabic, Arabizi, Mixed) from 12,695 unique sources |
	| Split | 80/10/10 train/validation/test with stratified sampling |
	| Pre-tokenization | Metaspace (BPE/Unigram/WordPiece) or ByteLevel (BBPE) |
	| Training | HuggingFace `tokenizers` library with matched pre-tokenizer/decoder pairs |
	| Evaluation | 14 metrics on 33,846 test texts (11,282 per script × 3 scripts) |
	| Morphological | Farasa segmenter for gold-standard Arabic morpheme boundaries |
	| Export | Raw + `transformers`-compatible via `PreTrainedTokenizerFast` |
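
The 80/10/10 stratified split can be sketched as follows. The stratification variable is not stated above, so the `stratum` field used here is purely hypothetical, as are the function name and the seed.

```python
import random
from collections import defaultdict


def stratified_split(items, key, ratios=(0.8, 0.1, 0.1), seed=42):
    """Split items into train/validation/test while preserving per-stratum proportions."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for item in items:
        strata[key(item)].append(item)

    train, val, test = [], [], []
    for group in strata.values():
        rng.shuffle(group)
        n_train = round(len(group) * ratios[0])
        n_val = round(len(group) * ratios[1])
        train += group[:n_train]
        val += group[n_train:n_train + n_val]
        test += group[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


# Toy records with a hypothetical stratification field.
records = [{"id": i, "stratum": "a" if i % 2 else "b"} for i in range(100)]
train, val, test = stratified_split(records, key=lambda r: r["stratum"])

print(len(train), len(val), len(test))  # 80 10 10
```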

	---

	## Code & Reproducibility

	All scripts, evaluation code, and documentation are included in this repository. See [`code.md`](code.md) for a complete guide to every script and its outputs.

	### Repository Structure

	```
	daa-tokenizers/
	├── README.md # This file
	├── code.md # Script documentation
	├── script.py # Master benchmark pipeline
	├── eval_test_set.py # Test-set evaluation
	├── eval_morph_large.py # Morphological metrics (80K/110K)
	├── bootstrap_test_set.py # Bootstrap confidence intervals
	├── eval_all_externals.py # External tokenizer comparison
	├── eval_codeswitch_and_new_baselines.py
	├── eval_doda_independent.py # DODa validation
	├── regen_figures.py # Figure generation
	├── verify_arithmetic.py # Numeric claims verification
	├── results/
	│ ├── test_set_results.csv # Primary results (40 tokenizers)
	│ ├── external_comparison.csv # External comparison
	│ ├── morph_large_vocab_results.csv # Morphological metrics (80K/110K)
	│ ├── bootstrap_ci_test_set.csv # Bootstrap 95% CIs
	│ ├── tokenizers/ # 60 raw tokenizer JSONs
	│ ├── transformers_tokenizers/ # Transformers-compatible exports
	│ ├── corpora/ # Train/test text splits
	│ ├── morphology/ # Farasa segmentations cache
	│ └── plots/ # All visualization PNGs
	└── plots/ # Paper figures
	```

	---

	## Citation

	```bibtex
	@misc{laamiri2026daa-tokenizers,
	  title  = {Darija Subword Tokenizer Benchmark},
	  author = {Laamiri, Ouail and Berrada, Ismail and Belfadil, Anas},
	  year   = {2026},
	  url    = {https://huggingface.co/OiQ/daa-tokenizers},
	  note   = {In collaboration with UM6P College of Computing}
	}
	```

	---

	## License

	MIT License. See [LICENSE](LICENSE) for details.

	## Acknowledgments

	This work was developed in collaboration with the [UM6P College of Computing](https://www.um6p.ma/en/um6p-college-computing), Mohammed VI Polytechnic University, Ben Guerir, Morocco. We thank the Hugging Face community for providing the infrastructure to host these resources.