---
tags:
- arabic
- tokenizer
- morphology
- nlp
- dialect
license: apache-2.0
language:
- ar
datasets:
- dataflare/arabic-dialect-corpus
- dataflare/egypt-legal-corpus
---

# DF-Arc

**DF-Arc** is a specialized Arabic tokenizer that minimizes the "Arabic Token Tax" (the inflated token counts that general-purpose tokenizers produce for Arabic text) by combining **Morphological Pre-tokenization** with **PMI-based Phrase Merging**.

It achieves a fertility of 1.26 tokens per word, close to the 1:1 ideal, and high semantic density per token.

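To make the phrase-merging idea concrete, here is a minimal sketch of PMI (pointwise mutual information) scoring over adjacent word pairs. It is not the DF-Arc training code: the function names `pmi_scores` and `merge_phrases`, the `min_count` and `threshold` parameters, the `_` joiner, and the toy corpus are all assumptions made for illustration.

```python
import math
from collections import Counter


def pmi_scores(sentences, min_count=2):
    """Score adjacent word pairs by PMI = log2(p(a, b) / (p(a) * p(b)))."""
    unigrams = Counter()
    bigrams = Counter()
    for sentence in sentences:
        words = sentence.split()
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))

    n_unigrams = sum(unigrams.values())
    n_bigrams = sum(bigrams.values())

    scores = {}
    for (a, b), count in bigrams.items():
        if count < min_count:
            continue  # rare pairs give unreliable PMI estimates
        p_ab = count / n_bigrams
        p_a = unigrams[a] / n_unigrams
        p_b = unigrams[b] / n_unigrams
        scores[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return scores


def merge_phrases(sentence, scores, threshold):
    """Greedily join adjacent words whose PMI meets the threshold."""
    words = sentence.split()
    merged = []
    i = 0
    while i < len(words):
        pair = tuple(words[i:i + 2])
        if len(pair) == 2 and scores.get(pair, float("-inf")) >= threshold:
            merged.append(pair[0] + "_" + pair[1])  # hypothetical joiner
            i += 2
        else:
            merged.append(words[i])
            i += 1
    return merged


# Toy corpus (illustrative only, not taken from the training data)
corpus = [
    "الذكاء الاصطناعي مفيد",
    "الذكاء الاصطناعي جديد",
    "انا بحب الذكاء الاصطناعي",
]
scores = pmi_scores(corpus, min_count=2)
merged = merge_phrases("انا بحب الذكاء الاصطناعي", scores, threshold=1.0)
print(merged)
```

Pairs that co-occur far more often than their individual frequencies predict score high and become single phrase units. A real pipeline would presumably tune the threshold and run the merge on morphologically pre-tokenized text.
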
## Key Highlights

- **Architecture**: Unigram SentencePiece (compatible with `LlamaTokenizer`).
- **Vocab Size**: 64,000 tokens.
- **Baked-in Logic**: Rules for morphology (prefixes) and identity (God/Prophet names) are built into the vocabulary, so no custom code is needed.
- **Dialect Native**: Trained on Egyptian dialogue, songs, and feedback corpora.

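The morphology and identity rules listed above can be pictured with a small rule-based sketch. This is an assumption-laden illustration, not DF-Arc's actual rule set: the prefix inventory `PREFIXES`, the protected-word list `PROTECTED`, the minimum stem length, and the helper names are all hypothetical.

```python
# Hypothetical prefix inventory, checked in order (longest first).
PREFIXES = ("ال", "ب", "و", "ل")
# Hypothetical list of identity words that must never be split.
PROTECTED = {"الله", "محمد"}
# Assumed guard: do not split if the remaining stem would be too short.
MIN_STEM_LEN = 2


def split_prefix(word):
    """Return the word with its leading prefix marked off by '_', if any."""
    if word in PROTECTED:
        return word
    for prefix in PREFIXES:
        stem = word[len(prefix):]
        if word.startswith(prefix) and len(stem) >= MIN_STEM_LEN:
            return prefix + "_" + stem
    return word


def pre_tokenize(text):
    """Apply prefix splitting to each whitespace-separated word."""
    return [split_prefix(word) for word in text.split()]


print(pre_tokenize("بسم الله الرحمن الرحيم"))
```

A purely rule-based splitter like this one over-segments words that merely begin with a prefix letter, so a production system would likely need a lexicon or statistical check on top of it.
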
## Performance

| Model | Fertility (tokens/word) | Total Tokens | Total Words |
|-------|-------------------------|--------------|-------------|
| DF-Arc | 1.260 | 144,734 | 114,882 |
| GPT-4 (cl100k) | 3.689 | 423,743 | 114,882 |
| AraBERT v2 | 1.555 | 178,609 | 114,882 |
| AraT5 | 1.193 | 137,107 | 114,882 |

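Fertility in the table is total tokens divided by total words (for DF-Arc, 144,734 / 114,882 ≈ 1.260). The sketch below shows one way to compute it for any tokenizer. It assumes words are counted by whitespace splitting, which this card does not state, and the `fertility` and `chunk_tokenize` helpers are hypothetical names introduced for illustration.

```python
def fertility(texts, tokenize):
    """Return (fertility, total_tokens, total_words) for a tokenize callable.

    Words are counted by whitespace splitting (an assumption).
    """
    total_tokens = 0
    total_words = 0
    for text in texts:
        total_words += len(text.split())
        total_tokens += len(tokenize(text))
    if total_words == 0:
        raise ValueError("texts contain no words")
    return total_tokens / total_words, total_tokens, total_words


def chunk_tokenize(text, size=3):
    """Toy stand-in tokenizer: cut each word into fixed-size character chunks."""
    return [
        word[i:i + size]
        for word in text.split()
        for i in range(0, len(word), size)
    ]


# Sanity check against the DF-Arc row of the table
print(round(144_734 / 114_882, 3))

# Fertility of the toy tokenizer on a sample sentence
print(fertility(["انا بحب الذكاء الاصطناعي جدا"], chunk_tokenize))
```

To measure a real tokenizer, pass `tokenizer.tokenize` from `transformers` as the `tokenize` argument in place of the toy function.
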
## Usage

```python
from transformers import AutoTokenizer

# Load the tokenizer from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("dataflare/df-arc")
text = "بسم الله الرحمن الرحيم، انا بحب الذكاء الاصطناعي جدا"

# Print the token strings produced for the sample sentence
print(tokenizer.tokenize(text))
# Output: ['ب_سم', 'الله', 'ال_رحمن', 'ال_رحيم', '،', 'انا', 'ب_حب', 'ال_ذكاء_ال_اصطناع_ي', 'جدا']
```

## Citation

```bibtex
@misc{df_arc,
  title={DF-Arc: The Arabic Token Tax \& Morphology-Aware Tokenization},
  author={Dataflare Lab},
  year={2026},
  publisher={Hugging Face}
}
```