Arabic
arabic
tokenizer
morphology
nlp
dialect
fr3on committed on
Commit 073d643 · verified · 1 Parent(s): 7956c14

Update README.md

Files changed (1)
  1. README.md +31 -24
README.md CHANGED
@@ -10,44 +10,51 @@ language:
  - ar
  datasets:
  - dataflare/arabic-dialect-corpus
- - fr3on/egyptian-dialogue
- - fr3on/egyptian-songs
- - fr3on/arabic-feedback-corpus
+ - dataflare/egypt-legal-corpus
  ---
 
- # DF-Arc v1.1: Morphology-Aware Arabic Tokenizer
+ # DF-Arc v1.1
 
- DF-Arc is a specialized tokenizer for Arabic LLMs that minimizes the "Arabic Token Tax". By combining **Morphological Pre-tokenization** with **PMI-based Phrase Merging**, it achieves near 1:1 fertility (0.83 fertility on dialects), preserving semantic coherence better than GPT-4 or standard BERT tokenizers.
- 
- ## New in v1.1
- - **PMI-Powered Phrase Merging**: Learning phrases based on statistical coupling (Pointwise Mutual Information) rather than just frequency.
- - **Embedded Protections**: Built-in protection for sensitive entities (e.g., "Allah", "Mohamed") and common trademarks without external files.
- - **Enhanced Dialect Support**: Trained on a broader corpus including Egyptian dialogue, songs, and feedback datasets.
- - **Self-Contained**: No extra config files needed; just load and go.
+ **DF-Arc** is a specialized Arabic tokenizer that minimizes the "Arabic Token Tax" by combining **Morphological Pre-tokenization** with **PMI-based Phrase Merging** (sketched below).
+
+ It achieves a fertility of 1.26 tokens per word, approaching 1:1, with high semantic density.
+
+ ## Key Highlights
+
+ - **Architecture**: Unigram SentencePiece (compatible with `LlamaTokenizer`).
+ - **Vocab Size**: 64,000 tokens.
+ - **Baked-in Logic**: Rules for morphology (prefixes) and identity (God/Prophet names) are built into the vocabulary. No custom code needed.
+ - **Dialect Native**: Trained on Egyptian dialogue, songs, and feedback corpora.
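
To make the merge criterion above concrete, here is a minimal editorial sketch of PMI-based phrase scoring; it is not the model's training code, and the function name `pmi_merge_candidates` and the `min_count`/`threshold` values are illustrative assumptions. The point of PMI over raw frequency: a pair of individually common tokens that co-occur only at chance level scores near zero, while a tightly coupled pair such as «بسم» + «الله» scores high.

```python
import math
from collections import Counter

def pmi_merge_candidates(corpus_tokens, min_count=25, threshold=3.0):
    """Score adjacent token pairs by pointwise mutual information (PMI).

    PMI(x, y) = log( p(x, y) / (p(x) * p(y)) ). Pairs that co-occur far
    more often than their unigram frequencies predict become merge
    candidates, unlike frequency-only (BPE-style) merging, which favors
    pairs of merely common tokens. Illustrative sketch, not DF-Arc's code.
    """
    unigrams = Counter(corpus_tokens)
    bigrams = Counter(zip(corpus_tokens, corpus_tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    candidates = {}
    for (x, y), count in bigrams.items():
        if count < min_count:
            continue  # PMI estimates are unreliable for rare pairs
        p_pair = count / n_bi
        p_indep = (unigrams[x] / n_uni) * (unigrams[y] / n_uni)
        pmi = math.log(p_pair / p_indep)
        if pmi >= threshold:
            candidates[(x, y)] = pmi  # e.g. ('بسم', 'الله') -> 'بسم_الله'
    return candidates
```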
 
  ## Performance
- | Model | Fertility (lower is better) | Efficiency vs GPT-4 |
- |-------|-----------------------------|---------------------|
- | **DF-Arc v1.1** | **0.83** | **+77.6%** |
- | GPT-4 (cl100k) | 3.69 | Baseline |
- | AraBERT v2 | 1.56 | - |
+
+ | Model | Fertility (tokens/word, lower is better) | Total Tokens | Total Words |
+ |-------|------------------------------------------|--------------|-------------|
+ | DF-Arc | 1.260 | 144,734 | 114,882 |
+ | GPT-4 (cl100k) | 3.689 | 423,743 | 114,882 |
+ | AraBERT v2 | 1.555 | 178,609 | 114,882 |
+ | AraT5 | 1.193 | 137,107 | 114,882 |
+ | Granite (3B) | 3.689 | 423,743 | 114,882 |
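
The fertility column is simply Total Tokens divided by Total Words, so the table can be recomputed from its own figures (e.g. 144,734 / 114,882 ≈ 1.260 for DF-Arc). A quick editorial check:

```python
# Recompute the table above: fertility = total tokens / total words,
# plus the share of GPT-4's token count each model saves.
token_counts = {
    "DF-Arc": 144_734,
    "GPT-4 (cl100k)": 423_743,
    "AraBERT v2": 178_609,
    "AraT5": 137_107,
    "Granite (3B)": 423_743,
}
TOTAL_WORDS = 114_882

for model, tokens in token_counts.items():
    fertility = tokens / TOTAL_WORDS
    saved = 1 - tokens / token_counts["GPT-4 (cl100k)"]
    print(f"{model:16s} fertility={fertility:.3f}  tokens vs GPT-4: {saved:+.1%}")
```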
 
  ## Usage
 
  ```python
  from transformers import AutoTokenizer
 
- # trust_remote_code=True is required for custom logic
- tokenizer = AutoTokenizer.from_pretrained("dataflare/df-arc", trust_remote_code=True)
-
- # Example: Dialectal + MSA
+ tokenizer = AutoTokenizer.from_pretrained("dataflare/df-arc")
+
  text = "بسم الله الرحمن الرحيم، انا بحب الذكاء الاصطناعي جدا"
- tokens = tokenizer.tokenize(text)
- print(tokens)
+
+ print(tokenizer.tokenize(text))
  # Output: ['ب_سم', 'الله', 'ال_رحمن', 'ال_رحيم', '،', 'انا', 'ب_حب', 'ال_ذكاء_ال_اصطناع_ي', 'جدا']
- # Note "الله" preserved, phrases like "بسم الله" handled naturally.
  ```
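
Continuing that snippet (with `tokenizer` and `text` as defined above), the standard `AutoTokenizer` methods also cover ids and decoding; an editorial sketch whose outputs depend on the released vocabulary and are not shown here:

```python
# Editorial continuation of the snippet above, standard transformers API only.
enc = tokenizer(text, add_special_tokens=False)
print(len(enc["input_ids"]))               # token count: fertility numerator
print(len(text.split()))                   # word count: fertility denominator
print(tokenizer.decode(enc["input_ids"]))  # round-trip back to a string
```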
 
  ## Citation
- If you use DF-Arc, please cite our paper:
- *The Arabic Token Tax: Quantifying Tokenization Inefficiency in Large Language Models* (Dataflare Lab, 2026).
+
+ ```bibtex
+ @misc{df_arc,
+   title={DF-Arc: The Arabic Token Tax & Morphology-Aware Tokenization},
+   author={Dataflare Lab},
+   year={2026},
+   publisher={Hugging Face}
+ }
+ ```