tags:
  - arabic
  - tokenizer
  - morphology
  - nlp
  - dialect
license: apache-2.0
language:
  - ar
datasets:
  - dataflare/arabic-dialect-corpus
  - dataflare/egypt-legal-corpus

DF-Arc

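DF-Arc is a specialized Arabic tokenizer that reduces the "Arabic Token Tax" (the extra tokens that general-purpose tokenizers spend on Arabic text) by combining morphological pre-tokenization with phrase merging based on pointwise mutual information (PMI).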

It achieves a fertility of 1.26 tokens per word on the benchmark below, close to the 1:1 ideal, while keeping semantic density high.
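
As a rough illustration of PMI-based phrase merging, the sketch below scores adjacent word pairs in a toy corpus and joins high-scoring pairs with "_". It is not the DF-Arc training code: the function names `pmi_scores` and `merge_phrases`, the `min_count` and `threshold` parameters, and the toy corpus are all assumptions made for this example.

```python
import math
from collections import Counter


def pmi_scores(sentences, min_count=2):
    """Return {(w1, w2): PMI} for adjacent pairs seen at least min_count times."""
    unigrams = Counter()
    bigrams = Counter()
    for words in sentences:
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # rare pairs give unreliable PMI estimates
        p_xy = count / n_bi
        p_x = unigrams[w1] / n_uni
        p_y = unigrams[w2] / n_uni
        # PMI(x, y) = log2( p(x, y) / (p(x) * p(y)) )
        scores[(w1, w2)] = math.log2(p_xy / (p_x * p_y))
    return scores


def merge_phrases(words, scores, threshold):
    """Greedily join adjacent words whose PMI is at or above threshold."""
    merged = []
    i = 0
    while i < len(words):
        pair = (words[i], words[i + 1]) if i + 1 < len(words) else None
        if pair is not None and scores.get(pair, float("-inf")) >= threshold:
            merged.append(words[i] + "_" + words[i + 1])
            i += 2
        else:
            merged.append(words[i])
            i += 1
    return merged


# Toy corpus (hypothetical); each sentence is already split into words.
toy_corpus = [
    ["انا", "بحب", "الذكاء", "الاصطناعي"],
    ["الذكاء", "الاصطناعي", "جدا"],
    ["انا", "جدا"],
]
scores = pmi_scores(toy_corpus, min_count=2)
merged = merge_phrases(toy_corpus[0], scores, threshold=2.0)
print(merged)
```

In this toy corpus only the pair "الذكاء الاصطناعي" occurs often enough to be scored, so it is the only candidate for merging into a single phrase unit.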

Key Highlights

  • Architecture: Unigram SentencePiece (compatible with LlamaTokenizer).
  • Vocab Size: 64,000 tokens.
  • Baked-in Logic: Rules for morphology (prefixes) and identity (God/Prophet names) are built into the vocabulary, so no custom code is needed.
  • Dialect Native: Trained on Egyptian dialogue, songs, and feedback corpora.
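
To make the "baked-in" morphology and identity rules more concrete, here is a minimal sketch of what prefix-aware pre-tokenization with protected words could look like. The prefix list, the protected-word set, the `min_stem` guard, and the function names are hypothetical and chosen only for illustration; the actual DF-Arc rules are not described in this README, and a rule this naive would over-split many real words.

```python
# Hypothetical rule set for illustration only (not the DF-Arc rules).
PREFIXES = ("ال", "ب")   # definite article "al-" and the prefix "b-"
PROTECTED = {"الله"}     # words that are always kept whole


def segment(word, min_stem=2):
    """Split one leading prefix off a word, marking the boundary with '_'."""
    if word in PROTECTED:
        return word
    for prefix in PREFIXES:
        stem = word[len(prefix):]
        # Only split when a stem of reasonable length remains.
        if word.startswith(prefix) and len(stem) >= min_stem:
            return prefix + "_" + stem
    return word


def pre_tokenize(text):
    """Whitespace-split the text and segment each word independently."""
    return [segment(word) for word in text.split()]


print(pre_tokenize("بسم الله الرحمن الرحيم"))
```

Because DF-Arc ships these decisions inside its vocabulary, none of this logic has to run at inference time; the sketch only shows the kind of segmentation the vocabulary encodes.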

Performance

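| Model          | Fertility (tokens/word) | Total Tokens | Total Words |
|----------------|-------------------------|--------------|-------------|
| DF-Arc         | 1.260                   | 144,734      | 114,882     |
| GPT-4 (cl100k) | 3.689                   | 423,743      | 114,882     |
| AraBERT v2     | 1.555                   | 178,609      | 114,882     |
| AraT5          | 1.193                   | 137,107      | 114,882     |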
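
Fertility here is total tokens divided by total words. The short sketch below recomputes the fertility column from the token and word counts above and adds a helper for measuring fertility on your own text. The helper `measure_fertility` and its whitespace-based word count are assumptions; this README does not state how words were counted for the benchmark.

```python
def fertility(total_tokens, total_words):
    """Average number of tokens produced per word."""
    return total_tokens / total_words


# (total tokens, total words) as reported in the table above.
results = {
    "DF-Arc": (144_734, 114_882),
    "GPT-4 (cl100k)": (423_743, 114_882),
    "AraBERT v2": (178_609, 114_882),
    "AraT5": (137_107, 114_882),
}

for name, (tokens, words) in results.items():
    print(f"{name}: {fertility(tokens, words):.3f}")


def measure_fertility(tokenize, texts):
    """Fertility of any tokenize(text) -> list callable over a list of texts.

    Words are counted by whitespace splitting, which is an assumption.
    """
    total_tokens = sum(len(tokenize(text)) for text in texts)
    total_words = sum(len(text.split()) for text in texts)
    return fertility(total_tokens, total_words)
```

With a loaded tokenizer, `measure_fertility(tokenizer.tokenize, my_texts)` would give a comparable number for a custom corpus.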

Usage

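from transformers import AutoTokenizer

# Load the tokenizer from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("dataflare/df-arc")
text = "بسم الله الرحمن الرحيم، انا بحب الذكاء الاصطناعي جدا"

# Tokenize the sentence. In the output, "_" joins a prefix to its stem
# and links the parts of a merged phrase.
print(tokenizer.tokenize(text))
# Output: ['ب_سم', 'الله', 'ال_رحمن', 'ال_رحيم', '،', 'انا', 'ب_حب', 'ال_ذكاء_ال_اصطناع_ي', 'جدا']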

Citation

@misc{df_arc,
  title={DF-Arc: The Arabic Token Tax \& Morphology-Aware Tokenization},
  author={{Dataflare Lab}},
  year={2026},
  publisher={Hugging Face}
}