---
tags:
- arabic
- tokenizer
- morphology
- nlp
- dialect
license: apache-2.0
language:
- ar
datasets:
- dataflare/arabic-dialect-corpus
- dataflare/egypt-legal-corpus
---

# DF-Arc

**DF-Arc** is a specialized Arabic tokenizer that minimizes the "Arabic Token Tax" by combining **Morphological Pre-tokenization** with **PMI-based Phrase Merging**. It achieves near 1:1 fertility (1.26) and high semantic density.

## Key Highlights

- **Architecture**: Unigram SentencePiece (compatible with `LlamaTokenizer`).
- **Vocab Size**: 64,000 tokens.
- **Baked-in Logic**: Rules for morphology (prefixes) and identity (God/Prophet names) are built into the vocabulary. No custom code needed.
- **Dialect Native**: Trained on Egyptian dialogue, songs, and feedback corpora.

## Performance

| Model | Fertility | Total Tokens | Total Words |
|-------|-----------|--------------|-------------|
| DF-Arc | 1.260 | 144,734 | 114,882 |
| GPT-4 (cl100k) | 3.689 | 423,743 | 114,882 |
| AraBERT v2 | 1.555 | 178,609 | 114,882 |
| AraT5 | 1.193 | 137,107 | 114,882 |

## Usage

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("dataflare/df-arc")

text = "بسم الله الرحمن الرحيم، انا بحب الذكاء الاصطناعي جدا"
print(tokenizer.tokenize(text))
# Output: ['ب_سم', 'الله', 'ال_رحمن', 'ال_رحيم', '،', 'انا', 'ب_حب', 'ال_ذكاء_ال_اصطناع_ي', 'جدا']
```

## Citation

```bibtex
@misc{df_arc,
  title={DF-Arc: The Arabic Token Tax & Morphology-Aware Tokenization},
  author={Dataflare Lab},
  year={2026},
  publisher={Hugging Face}
}
```
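
## Fertility Calculation

The Fertility column in the Performance table matches total tokens divided by total words. A minimal sketch of that arithmetic follows. The helper names `fertility` and `corpus_fertility` are illustrative and not part of DF-Arc, and counting words by whitespace splitting is an assumption, since the word-counting rule behind the table is not stated here.

```python
from typing import Callable, Iterable, List


def fertility(total_tokens: int, total_words: int) -> float:
    """Average number of tokens produced per word."""
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    return total_tokens / total_words


def corpus_fertility(
    texts: Iterable[str],
    tokenize: Callable[[str], List[str]],
) -> float:
    """Fertility over a corpus for any tokenize(text) -> list of tokens.

    Assumption: words are counted by whitespace splitting.
    """
    total_tokens = 0
    total_words = 0
    for text in texts:
        total_tokens += len(tokenize(text))
        total_words += len(text.split())
    return fertility(total_tokens, total_words)


# (total tokens, total words) copied from the Performance table.
TABLE = {
    "DF-Arc": (144_734, 114_882),
    "GPT-4 (cl100k)": (423_743, 114_882),
    "AraBERT v2": (178_609, 114_882),
    "AraT5": (137_107, 114_882),
}

for name, (tokens, words) in TABLE.items():
    print(f"{name}: {fertility(tokens, words):.3f}")
```

To measure fertility on your own text, pass any callable that maps a string to a list of tokens, for example `corpus_fertility(texts, tokenizer.tokenize)` with the tokenizer loaded in the Usage section.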