Arabic
arabic
tokenizer
morphology
nlp
dialect
fr3on committed on
Commit 073d643 · verified · 1 Parent(s): 7956c14

Update README.md

Files changed (1)
  1. README.md +31 -24
README.md CHANGED
@@ -10,44 +10,51 @@ language:
  - ar
  datasets:
  - dataflare/arabic-dialect-corpus
- - fr3on/egyptian-dialogue
- - fr3on/egyptian-songs
- - fr3on/arabic-feedback-corpus
+ - dataflare/egypt-legal-corpus
  ---
 
- # DF-Arc v1.1: Morphology-Aware Arabic Tokenizer
+ # DF-Arc v1.1
 
- DF-Arc is a specialized tokenizer for Arabic LLMs that minimizes the "Arabic Token Tax". By combining **Morphological Pre-tokenization** with **PMI-based Phrase Merging**, it achieves near 1:1 fertility (0.83 fertility on dialects), preserving semantic coherence better than GPT-4 or standard BERT tokenizers.
- 
- ## New in v1.1
- - **PMI-Powered Phrase Merging**: Learning phrases based on statistical coupling (Pointwise Mutual Information) rather than just frequency.
- - **Embedded Protections**: Built-in protection for sensitive entities (e.g., "Allah", "Mohamed") and common trademarks without external files.
- - **Enhanced Dialect Support**: Trained on a broader corpus including Egyptian dialogue, songs, and feedback datasets.
- - **Self-Contained**: No extra config files needed; just load and go.
+ **DF-Arc** is a specialized Arabic tokenizer that minimizes the "Arabic Token Tax" by combining **Morphological Pre-tokenization** with **PMI-based Phrase Merging** (sketched below).
+
+ It achieves a fertility of 1.26 tokens per word, approaching 1:1, with high semantic density.
+
+ ## Key Highlights
+
+ - **Architecture**: Unigram SentencePiece (compatible with `LlamaTokenizer`).
+ - **Vocab Size**: 64,000 tokens.
+ - **Baked-in Logic**: Rules for morphology (prefixes) and identity (God/Prophet names) are built into the vocabulary. No custom code needed.
+ - **Dialect Native**: Trained on Egyptian dialogue, songs, and feedback corpora.
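
To make the merge criterion above concrete, here is a minimal editorial sketch of PMI-based phrase scoring; it is not the model's training code, and the function name `pmi_merge_candidates` and the `min_count`/`threshold` values are illustrative assumptions. The point of PMI over raw frequency: a pair of individually common tokens that co-occur only at chance level scores near zero, while a tightly coupled pair such as «بسم» + «الله» scores high.

```python
import math
from collections import Counter

def pmi_merge_candidates(corpus_tokens, min_count=25, threshold=3.0):
    """Score adjacent token pairs by pointwise mutual information (PMI).

    PMI(x, y) = log( p(x, y) / (p(x) * p(y)) ). Pairs that co-occur far
    more often than their unigram frequencies predict become merge
    candidates, unlike frequency-only (BPE-style) merging, which favors
    pairs of merely common tokens. Illustrative sketch, not DF-Arc's code.
    """
    unigrams = Counter(corpus_tokens)
    bigrams = Counter(zip(corpus_tokens, corpus_tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    candidates = {}
    for (x, y), count in bigrams.items():
        if count < min_count:
            continue  # PMI estimates are unreliable for rare pairs
        p_pair = count / n_bi
        p_indep = (unigrams[x] / n_uni) * (unigrams[y] / n_uni)
        pmi = math.log(p_pair / p_indep)
        if pmi >= threshold:
            candidates[(x, y)] = pmi  # e.g. ('بسم', 'الله') -> 'بسم_الله'
    return candidates
```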
 
  ## Performance
- | Model | Fertility (lower is better) | Efficiency vs GPT-4 |
- |-------|-----------------------------|---------------------|
- | **DF-Arc v1.1** | **0.83** | **+77.6%** |
- | GPT-4 (cl100k) | 3.69 | Baseline |
- | AraBERT v2 | 1.56 | - |
+
+ | Model | Fertility (tokens/word, lower is better) | Total Tokens | Total Words |
+ |-------|------------------------------------------|--------------|-------------|
+ | DF-Arc | 1.260 | 144,734 | 114,882 |
+ | GPT-4 (cl100k) | 3.689 | 423,743 | 114,882 |
+ | AraBERT v2 | 1.555 | 178,609 | 114,882 |
+ | AraT5 | 1.193 | 137,107 | 114,882 |
+ | Granite (3B) | 3.689 | 423,743 | 114,882 |
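
The fertility column is simply Total Tokens divided by Total Words, so the table can be recomputed from its own figures (e.g. 144,734 / 114,882 ≈ 1.260 for DF-Arc). A quick editorial check:

```python
# Recompute the table above: fertility = total tokens / total words,
# plus the share of GPT-4's token count each model saves.
token_counts = {
    "DF-Arc": 144_734,
    "GPT-4 (cl100k)": 423_743,
    "AraBERT v2": 178_609,
    "AraT5": 137_107,
    "Granite (3B)": 423_743,
}
TOTAL_WORDS = 114_882

for model, tokens in token_counts.items():
    fertility = tokens / TOTAL_WORDS
    saved = 1 - tokens / token_counts["GPT-4 (cl100k)"]
    print(f"{model:16s} fertility={fertility:.3f}  tokens vs GPT-4: {saved:+.1%}")
```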
 
  ## Usage
 
  ```python
  from transformers import AutoTokenizer
 
- # trust_remote_code=True is required for custom logic
- tokenizer = AutoTokenizer.from_pretrained("dataflare/df-arc", trust_remote_code=True)
-
- # Example: Dialectal + MSA
+ tokenizer = AutoTokenizer.from_pretrained("dataflare/df-arc")
+
  text = "بسم الله الرحمن الرحيم، انا بحب الذكاء الاصطناعي جدا"
- tokens = tokenizer.tokenize(text)
- print(tokens)
+
+ print(tokenizer.tokenize(text))
  # Output: ['ب_سم', 'الله', 'ال_رحمن', 'ال_رحيم', '،', 'انا', 'ب_حب', 'ال_ذكاء_ال_اصطناع_ي', 'جدا']
- # Note "الله" preserved, phrases like "بسم الله" handled naturally.
  ```
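
Continuing that snippet (with `tokenizer` and `text` as defined above), the standard `AutoTokenizer` methods also cover ids and decoding; an editorial sketch whose outputs depend on the released vocabulary and are not shown here:

```python
# Editorial continuation of the snippet above, standard transformers API only.
enc = tokenizer(text, add_special_tokens=False)
print(len(enc["input_ids"]))               # token count: fertility numerator
print(len(text.split()))                   # word count: fertility denominator
print(tokenizer.decode(enc["input_ids"]))  # round-trip back to a string
```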
 
  ## Citation
- If you use DF-Arc, please cite our paper:
- *The Arabic Token Tax: Quantifying Tokenization Inefficiency in Large Language Models* (Dataflare Lab, 2026).
+
+ ```bibtex
+ @misc{df_arc,
+   title={DF-Arc: The Arabic Token Tax & Morphology-Aware Tokenization},
+   author={Dataflare Lab},
+   year={2026},
+   publisher={Hugging Face}
+ }
+ ```