Commit 1b8fc9e (verified) by fr3on · 1 parent: 3b90e9e

Update README.md

Files changed (1): README.md +14 -12
README.md CHANGED
@@ -1,13 +1,15 @@
- ---
- tags:
- - arabic
- - tokenizer
- - morphology
- - nlp
- license: apache-2.0
- language:
- - ar
- ---
+ ---
+ tags:
+ - arabic
+ - tokenizer
+ - morphology
+ - nlp
+ license: apache-2.0
+ language:
+ - ar
+ datasets:
+ - dataflare/arabic-dialect-corpus
+ ---
 
  # DF-Arc: Morphology-Aware Arabic Tokenizer
 
@@ -25,11 +27,11 @@ from transformers import AutoTokenizer
 
  tokenizer = AutoTokenizer.from_pretrained("dataflare/df-arc", trust_remote_code=True)
 
- text = "والكتابة بالعربية ممتعة جدا"
+ text = "الكتابة بالعربية ممتعة جدا"
  tokens = tokenizer.tokenize(text)
  print(tokens)
  ```
 
  ## Citation
  If you use DF-Arc, please cite our paper:
- *The Arabic Token Tax: Quantifying Tokenization Inefficiency in Large Language Models* (Dataflare Lab, 2026).
+ *The Arabic Token Tax: Quantifying Tokenization Inefficiency in Large Language Models* (Dataflare Lab, 2026).