Arabic
arabic
tokenizer
morphology
nlp
dialect
fr3on committed on
Commit bedd199 · verified · 1 Parent(s): bc06a6e

Update README for v1.1 release

Files changed (1)
  1. README.md +23 -7
README.md CHANGED
@@ -4,32 +4,48 @@ tags:
  - tokenizer
  - morphology
  - nlp
  license: apache-2.0
  language:
  - ar
  datasets:
  - dataflare/arabic-dialect-corpus
  ---

- # DF-Arc: Morphology-Aware Arabic Tokenizer

- DF-Arc is a specialized tokenizer for Arabic LLMs that achieves **1.0 fertility** (one token per word) on average, eliminating the "Arabic Token Tax".

- ## Features
- - **Morphological Pre-tokenization**: Splits words into prefix-stem-suffix units.
- - **Phrase Merging**: Automatically merges common multi-word expressions (e.g., "in the name of God") into single tokens.
- - **Dialect Support**: Optimized for Egyptian, Gulf, and Levantine dialects.

  ## Usage

  ```python
  from transformers import AutoTokenizer

  tokenizer = AutoTokenizer.from_pretrained("dataflare/df-arc", trust_remote_code=True)

- text = "الكتابة بالعربية ممتعة جدا"
  tokens = tokenizer.tokenize(text)
  print(tokens)
  ```

  ## Citation
 
  - tokenizer
  - morphology
  - nlp
+ - dialect
  license: apache-2.0
  language:
  - ar
  datasets:
  - dataflare/arabic-dialect-corpus
+ - fr3on/egyptian-dialogue
+ - fr3on/egyptian-songs
+ - fr3on/arabic-feedback-corpus
  ---

+ # DF-Arc v1.1: Morphology-Aware Arabic Tokenizer

+ DF-Arc is a specialized tokenizer for Arabic LLMs that minimizes the "Arabic Token Tax". By combining **Morphological Pre-tokenization** with **PMI-based Phrase Merging**, it achieves near 1:1 fertility (0.83 fertility on dialects), preserving semantic coherence better than GPT-4 or standard BERT tokenizers.

+ ## New in v1.1
+ - **PMI-Powered Phrase Merging**: Learning phrases based on statistical coupling (Pointwise Mutual Information) rather than just frequency.
+ - **Embedded Protections**: Built-in protection for sensitive entities (e.g., "Allah", "Mohamed") and common trademarks without external files.
+ - **Enhanced Dialect Support**: Trained on a broader corpus including Egyptian dialogue, songs, and feedback datasets.
+ - **Self-Contained**: No extra config files needed; just load and go.
+
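
Reviewer note on the "PMI-Powered Phrase Merging" bullet: the commit does not show how DF-Arc learns its phrases, so the following is only a generic sketch of PMI scoring over adjacent words, not the project's implementation. The function `bigram_pmi`, the `min_count` threshold, and the toy corpus are all hypothetical and chosen for illustration. PMI is computed here as log2(p(x, y) / (p(x) · p(y))) with probabilities estimated from raw counts.

```python
import math
from collections import Counter


def bigram_pmi(sentences, min_count=2):
    """Score adjacent word pairs by PMI = log2(p(x, y) / (p(x) * p(y))).

    Hypothetical helper: words are split on whitespace, and pairs seen
    fewer than `min_count` times are skipped because PMI is unreliable
    for rare events.
    """
    unigrams = Counter()
    bigrams = Counter()
    for sentence in sentences:
        words = sentence.split()
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))

    total_words = sum(unigrams.values())
    total_pairs = sum(bigrams.values())

    scores = {}
    for (x, y), count in bigrams.items():
        if count < min_count:
            continue
        p_xy = count / total_pairs
        p_x = unigrams[x] / total_words
        p_y = unigrams[y] / total_words
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return scores


# Toy corpus (an assumption, not DF-Arc training data).
corpus = [
    "بسم الله الرحمن الرحيم",
    "بسم الله نبدأ",
    "انا بحب الذكاء الاصطناعي",
    "الذكاء الاصطناعي مفيد جدا",
    "انا بحب القراءة جدا",
]

scores = bigram_pmi(corpus)
# Print candidate phrases, highest PMI first.
for pair, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 2))
```

A pair scores high when its words co-occur more often than their individual frequencies would predict, which is what separates PMI from a plain frequency cutoff: two very common words that merely happen to sit next to each other get a low score.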
+ ## Performance
+ | Model | Fertility (lower is better) | Efficiency vs GPT-4 |
+ |-------|-----------------------------|---------------------|
+ | **DF-Arc v1.1** | **0.83** | **+77.6%** |
+ | GPT-4 (cl100k) | 3.69 | Baseline |
+ | AraBERT v2 | 1.56 | - |
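
Reviewer note on the Performance table: the README does not define its metrics, so this sketch assumes the usual reading, where fertility is tokens per whitespace-separated word and "Efficiency vs GPT-4" is the relative reduction in token count. The helper names `fertility` and `token_reduction` are hypothetical. The token list is copied from the README's usage example rather than produced by loading the model.

```python
def fertility(num_tokens, num_words):
    """Average number of tokens emitted per word (lower is better)."""
    return num_tokens / num_words


def token_reduction(model_fertility, baseline_fertility):
    """Fraction of tokens saved relative to a baseline tokenizer."""
    return 1.0 - model_fertility / baseline_fertility


# Sentence and token list as given in the README usage example.
text = "بسم الله الرحمن الرحيم، انا بحب الذكاء الاصطناعي جدا"
tokens = ['ب_سم', 'الله', 'ال_رحمن', 'ال_رحيم', '،', 'انا', 'ب_حب',
          'ال_ذكاء_ال_اصطناع_ي', 'جدا']

# 9 tokens over 9 whitespace-separated words.
print(fertility(len(tokens), len(text.split())))

# Rounded fertilities from the table: DF-Arc v1.1 vs GPT-4 (cl100k).
print(f"{token_reduction(0.83, 3.69):.1%}")
```

With the rounded table values the reduction comes out at about 77.5%, so the published +77.6% was presumably computed from unrounded fertilities or under a slightly different definition.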
 
  ## Usage

  ```python
  from transformers import AutoTokenizer

+ # trust_remote_code=True is required for custom logic
  tokenizer = AutoTokenizer.from_pretrained("dataflare/df-arc", trust_remote_code=True)

+ # Example: Dialectal + MSA
+ text = "بسم الله الرحمن الرحيم، انا بحب الذكاء الاصطناعي جدا"
  tokens = tokenizer.tokenize(text)
  print(tokens)
+ # Output: ['ب_سم', 'الله', 'ال_رحمن', 'ال_رحيم', '،', 'انا', 'ب_حب', 'ال_ذكاء_ال_اصطناع_ي', 'جدا']
+ # Note "الله" preserved, phrases like "بسم الله" handled naturally.
  ```

  ## Citation