tags:
  - arabic
  - tokenizer
  - morphology
  - nlp
  - dialect
license: apache-2.0
language:
  - ar
datasets:
  - dataflare/arabic-dialect-corpus
  - dataflare/egypt-legal-corpus

DF-Arc

DF-Arc is a specialized Arabic tokenizer that reduces the "Arabic Token Tax" (the extra tokens a tokenizer spends per Arabic word) by combining morphological pre-tokenization with phrase merging based on pointwise mutual information (PMI).

It achieves a fertility of 1.26 tokens per word, close to the 1:1 ideal, and high semantic density.
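
This README does not describe the phrase-merging step in detail. As a rough sketch of the general idea, and not DF-Arc's actual training code, adjacent word pairs can be scored with PMI = log2(p(x, y) / (p(x) p(y))) and joined when the score passes a threshold. The function names, the toy corpus, `min_count`, and the threshold value are all illustrative assumptions, and the sketch works on whole words rather than on morphologically segmented pieces.

```python
import math
from collections import Counter

def pmi_scores(sentences, min_count=2):
    """Score adjacent word pairs by PMI = log2(p(x, y) / (p(x) * p(y)))."""
    unigrams, bigrams = Counter(), Counter()
    for words in sentences:
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (x, y), count in bigrams.items():
        if count < min_count:
            continue  # rare pairs give unreliable PMI estimates
        p_xy = count / n_bi
        p_x = unigrams[x] / n_uni
        p_y = unigrams[y] / n_uni
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return scores

def merge_phrases(words, scores, threshold):
    """Greedily join adjacent pairs whose PMI exceeds the threshold, left to right."""
    out, i = [], 0
    while i < len(words):
        pair = (words[i], words[i + 1]) if i + 1 < len(words) else None
        if pair is not None and scores.get(pair, float("-inf")) > threshold:
            out.append(pair[0] + "_" + pair[1])
            i += 2
        else:
            out.append(words[i])
            i += 1
    return out

# Toy corpus (hypothetical): only one pair occurs at least twice.
corpus = [
    "انا بحب الذكاء الاصطناعي".split(),
    "الذكاء الاصطناعي جدا".split(),
    "انا جدا بحب".split(),
]
scores = pmi_scores(corpus)
merged = merge_phrases(corpus[0], scores, threshold=1.0)
print(merged)
```

In this toy corpus only the pair "الذكاء الاصطناعي" passes `min_count`, so it is the only pair that gets joined into a single phrase unit.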

Key Highlights

Architecture: Unigram SentencePiece (compatible with LlamaTokenizer).
Vocab Size: 64,000 tokens.
Baked-in Logic: Morphology rules (prefix splitting) and identity rules (names of God and the Prophet) are built into the vocabulary, so no custom code is needed.
Dialect Native: Trained on Egyptian dialogue, songs, and feedback corpora.
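
To make the morphology and identity rules concrete, here is a minimal sketch of rule-based prefix splitting. DF-Arc bakes its rules into the vocabulary, and this README does not list them, so the prefix list, the protected-word set, and the `segment` helper below are hypothetical and chosen only to reproduce the first four tokens of the Usage output.

```python
# Hypothetical rule set for illustration; not the actual DF-Arc rules.
PREFIXES = ("ال", "ب")   # definite article "al-" and the proclitic "bi-"
PROTECTED = {"الله"}      # identity words that are never split

def segment(word: str) -> str:
    """Mark a leading prefix with '_' unless the word is protected."""
    if word in PROTECTED:
        return word
    for prefix in PREFIXES:
        # Require at least two remaining letters so very short words stay intact.
        if word.startswith(prefix) and len(word) - len(prefix) >= 2:
            return prefix + "_" + word[len(prefix):]
    return word

words = "بسم الله الرحمن الرحيم".split()
segmented = [segment(w) for w in words]
print(segmented)
```

A naive rule like this over-splits words whose first letter belongs to the stem (for example "بيت" would become "ب_يت"), which is one reason a real system needs a lexicon or learned vocabulary rather than string matching alone.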

Performance

Model	Fertility (tokens/word)	Total Tokens	Total Words
DF-Arc	1.260	144,734	114,882
GPT-4 (cl100k)	3.689	423,743	114,882
AraBERT v2	1.555	178,609	114,882
AraT5	1.193	137,107	114,882
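
The fertility column can be re-derived from the other two columns, assuming fertility means total tokens divided by total words (which matches every row above). A short check, using only the counts from the table:

```python
# Token and word counts copied from the table above.
counts = {
    "DF-Arc": (144_734, 114_882),
    "GPT-4 (cl100k)": (423_743, 114_882),
    "AraBERT v2": (178_609, 114_882),
    "AraT5": (137_107, 114_882),
}

def fertility(total_tokens: int, total_words: int) -> float:
    """Average number of tokens produced per word."""
    return total_tokens / total_words

table = {name: round(fertility(t, w), 3) for name, (t, w) in counts.items()}
for name, value in table.items():
    print(f"{name}: {value:.3f}")
```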

Usage

from transformers import AutoTokenizer

# Load the tokenizer from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("dataflare/df-arc")
text = "بسم الله الرحمن الرحيم، انا بحب الذكاء الاصطناعي جدا"

# Split the text into subword tokens.
print(tokenizer.tokenize(text))
# Output: ['ب_سم', 'الله', 'ال_رحمن', 'ال_رحيم', '،', 'انا', 'ب_حب', 'ال_ذكاء_ال_اصطناع_ي', 'جدا']
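
To estimate fertility on your own text, a small helper along these lines may be enough. It is a sketch under two assumptions: words are counted by whitespace splitting (this README does not say how the benchmark counted words), and `tokenize` is any callable that maps a string to a list of tokens. The `chunk_tokenizer` stand-in is hypothetical and only keeps the example runnable without downloading a model.

```python
from typing import Callable, Iterable, List

def measure_fertility(tokenize: Callable[[str], List[str]], texts: Iterable[str]) -> float:
    """Return total tokens divided by total whitespace-separated words."""
    total_tokens = 0
    total_words = 0
    for text in texts:
        total_tokens += len(tokenize(text))
        total_words += len(text.split())  # assumption: whitespace word boundaries
    if total_words == 0:
        raise ValueError("texts contain no words")
    return total_tokens / total_words

def chunk_tokenizer(text: str) -> List[str]:
    """Hypothetical stand-in: cut each word into pieces of at most 3 characters."""
    return [word[i:i + 3] for word in text.split() for i in range(0, len(word), 3)]

sample = ["ab cdef", "ghi"]
result = measure_fertility(chunk_tokenizer, sample)
print(round(result, 3))
```

With the real tokenizer, pass `tokenizer.tokenize` in place of `chunk_tokenizer`. For the sentence in the snippet above, the listed output has 9 tokens for 9 whitespace-separated words.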

Citation

@misc{df_arc,
  title={DF-Arc: The Arabic Token Tax \& Morphology-Aware Tokenization},
  author={Dataflare Lab},
  year={2026},
  publisher={Hugging Face}
}