---
tags:
- arabic
- tokenizer
- morphology
- nlp
- dialect
license: apache-2.0
language:
- ar
datasets:
- dataflare/arabic-dialect-corpus
- dataflare/egypt-legal-corpus
---

# DF-Arc

**DF-Arc** is a specialized Arabic tokenizer that minimizes the "Arabic Token Tax" by combining **Morphological Pre-tokenization** with **PMI-based Phrase Merging**.

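It achieves a fertility of 1.26 tokens per word (see Performance), close to the 1:1 ideal, and high semantic density: fewer tokens are needed to carry the same text.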
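
This card does not spell out how the PMI-based phrase merging is computed. As a rough illustration only, the sketch below scores adjacent word pairs with the textbook pointwise mutual information formula, `log2(p(a, b) / (p(a) * p(b)))`. The function name `pmi_scores`, the `min_count` cutoff, and the toy corpus are assumptions made for this example, not part of DF-Arc.

```python
import math
from collections import Counter


def pmi_scores(sentences, min_count=2):
    """Score adjacent word pairs by pointwise mutual information (PMI).

    Hypothetical helper for illustration; not DF-Arc's actual pipeline.
    """
    unigrams = Counter()
    bigrams = Counter()
    for sentence in sentences:
        words = sentence.split()
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))

    n_unigrams = sum(unigrams.values())
    n_bigrams = sum(bigrams.values())

    scores = {}
    for (a, b), count in bigrams.items():
        if count < min_count:
            continue  # rare pairs give unreliable PMI estimates
        p_ab = count / n_bigrams
        p_a = unigrams[a] / n_unigrams
        p_b = unigrams[b] / n_unigrams
        scores[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return scores


# Toy corpus (assumed for the example)
corpus = [
    "الذكاء الاصطناعي مفيد",
    "انا بحب الذكاء الاصطناعي",
    "انا بحب القهوة",
]
scores = pmi_scores(corpus)
for pair, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 3))
```

Pairs with a high score co-occur more often than chance would predict, which makes them candidates for merging into a single phrase token. A real pipeline would need a much larger corpus and a tuned threshold.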

## Key Highlights

- **Architecture**: Unigram SentencePiece (compatible with `LlamaTokenizer`).
- **Vocab Size**: 64,000 tokens.
- **Baked-in Logic**: Morphological rules (prefix handling) and identity rules (names of God and the Prophet) are built into the vocabulary, so no custom code is needed.
- **Dialect Native**: Trained on Egyptian dialogue, songs, and feedback corpora.
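
To make the idea of morphological pre-tokenization with protected names concrete, here is a deliberately naive sketch of the kind of rule that could be applied when preparing training text. Everything in it is an assumption for illustration: the prefix list, the protected-word set, the `min_stem` guard, and the `_` separator are inferred from the sample output in the Usage section, not taken from the DF-Arc training code. Because the rules are baked into the vocabulary, users of the tokenizer do not need to run anything like this.

```python
# Hypothetical rule set, for illustration only.
PREFIXES = ("ال", "ب")   # definite article and one attached prefix (assumed)
PROTECTED = {"الله"}     # words that must never be split (assumed)


def split_prefix(word, min_stem=2):
    """Mark a leading prefix with '_' unless the word is protected."""
    if word in PROTECTED:
        return word
    for prefix in PREFIXES:
        stem = word[len(prefix):]
        # Only split when enough of the word remains to be a plausible stem.
        if word.startswith(prefix) and len(stem) >= min_stem:
            return f"{prefix}_{stem}"
    return word


def pre_tokenize(text):
    """Apply the prefix rule to every whitespace-separated word."""
    return [split_prefix(word) for word in text.split()]


print(pre_tokenize("بسم الله الرحمن الرحيم"))
```

A string-matching rule like this will also split words whose first letters merely look like a prefix, so a production system would more plausibly rely on a morphological analyzer or a curated lexicon.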

## Performance

| Tokenizer | Fertility (tokens/word) | Total Tokens | Total Words |
|-----------|-------------------------|--------------|-------------|
| DF-Arc | 1.260 | 144,734 | 114,882 |
| GPT-4 (cl100k) | 3.689 | 423,743 | 114,882 |
| AraBERT v2 | 1.555 | 178,609 | 114,882 |
| AraT5 | 1.193 | 137,107 | 114,882 |
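
Fertility here is total tokens divided by total words, so lower is better and 1.0 means one token per word. The sketch below recomputes the column from the reported totals and shows one way to measure fertility on your own text. The helper name `fertility` and the choice to count words by whitespace splitting are assumptions; the card does not state how words were counted.

```python
def fertility(tokenize, texts):
    """Average number of tokens per whitespace-separated word.

    `tokenize` is any callable mapping a string to a list of tokens,
    for example the `tokenize` method of a Hugging Face tokenizer.
    """
    total_tokens = sum(len(tokenize(text)) for text in texts)
    total_words = sum(len(text.split()) for text in texts)
    return total_tokens / total_words


# Totals copied from the table above: (total tokens, total words)
reported = {
    "DF-Arc": (144_734, 114_882),
    "GPT-4 (cl100k)": (423_743, 114_882),
    "AraBERT v2": (178_609, 114_882),
    "AraT5": (137_107, 114_882),
}

for name, (tokens, words) in reported.items():
    print(f"{name}: {tokens / words:.3f}")
```

To evaluate a tokenizer on a different corpus, pass its `tokenize` method and a list of strings to `fertility`.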

## Usage

```python
from transformers import AutoTokenizer

# Load the tokenizer from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("dataflare/df-arc")
text = "بسم الله الرحمن الرحيم، انا بحب الذكاء الاصطناعي جدا"

# Split the text into subword tokens
print(tokenizer.tokenize(text))
# Output: ['ب_سم', 'الله', 'ال_رحمن', 'ال_رحيم', '،', 'انا', 'ب_حب', 'ال_ذكاء_ال_اصطناع_ي', 'جدا']
```

## Citation

```bibtex
@misc{df_arc,
  title={DF-Arc: The Arabic Token Tax \& Morphology-Aware Tokenization},
  author={Dataflare Lab},
  year={2026},
  publisher={Hugging Face}
}
```