varuni committed on
Commit ab1f7bf · verified · 1 Parent(s): ba86d5d

Update README.md

  1. README.md +46 -69
README.md CHANGED
@@ -1,99 +1,76 @@
- ---
- license: apache-2.0
- ---
- # Overview
-
- This repository contains a **Grapheme-Aware Tokenizer (GAT)** specifically trained for **Kannada**, designed to handle the unique orthographic and phonological structure of the language. Unlike traditional subword tokenizers such as BPE or WordPiece, this tokenizer operates at the **grapheme level**, improving representation fidelity and reducing tokenization imbalance in token count.
-
- # Available Vocabulary Sizes
-
- This repository includes **three tokenizer variants**:
-
- | Vocabulary | File |
- |------------|-------|
- | **5k** | `GAT_Kannada_8k.json` |
- | **16k** | `GAT_Kannada_16k.json` |
- | **32k** | `GAT_Kannada_32k.json` |
-
- # Why Grapheme-Aware Tokenization?
-
- Kannada is an **Abugida** script where a single grapheme may be composed of:
-
- - multiple consonants
- - halant (virama)
- - dependent vowel signs (diacritics)
-
- For example:
-
- ಕ್ರಿ is one grapheme but consists of multiple Unicode codepoints.
-
- ### BPE/SentencePiece/WordPiece Problem:
- They split Kannada graphemes incorrectly:

- ಕ್ + ರ್ + ಿ (3–4 fragments)

- ### GAT Solution:
- A custom grapheme parser merges characters into **one atomic unit**:

- ಕ್ರಿ 1 grapheme
-
- This improves token stability, compression, and efficiency.
-
- GAT uses a rule-based finite-state parser that handles:

  - consonants
  - vowels
  - halants
- - vowel diacritics
- - signs (anusvara, visarga)

  <p align="center">
  <img src="./GAT-algo.png" width="650"/>
  </p>

- This pre-tokenized output is then passed to **Byte Pair Encoding (BPE)** to learn statistically meaningful merges on top of linguistically meaningful units.
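
The grapheme-merging step described above can be illustrated with a small regex-based splitter. This is a minimal sketch, not the repository's finite-state parser, which this diff does not show. The function name `split_graphemes`, the character classes, and the cluster rule are assumptions based on the Kannada Unicode block (U+0C80 to U+0CFF), and ZWJ/ZWNJ handling is left out.

```python
import re
from typing import List

# Assumed character classes, taken from the Kannada Unicode block.
CONSONANT = "[\u0c95-\u0cb9\u0cde]"                 # ka .. ha, plus archaic fa
NUKTA = "\u0cbc"
HALANT = "\u0ccd"                                   # virama
VOWEL = "[\u0c85-\u0c94\u0ce0\u0ce1]"               # independent vowels
VOWEL_SIGN = "[\u0cbe-\u0ccc\u0cd5\u0cd6\u0ce2\u0ce3]"  # dependent vowel signs
SIGN = "[\u0c81-\u0c83]"                            # candrabindu, anusvara, visarga

# One grapheme is assumed to be either:
#   (consonant + halant)* consonant [halant | vowel signs] [sign]
#   independent vowel [sign]
#   any other single character (space, punctuation, digits, ...)
GRAPHEME = re.compile(
    f"(?:{CONSONANT}{NUKTA}?{HALANT})*{CONSONANT}{NUKTA}?"
    f"(?:{HALANT}|{VOWEL_SIGN}+)?{SIGN}?"
    f"|{VOWEL}{SIGN}?"
    "|.",
    re.DOTALL,
)


def split_graphemes(text: str) -> List[str]:
    """Split text into grapheme units; joining the result restores the input."""
    return GRAPHEME.findall(text)


if __name__ == "__main__":
    # The conjunct from the example above is kept as a single unit.
    print(split_graphemes("\u0c95\u0ccd\u0cb0\u0cbf"))
    print(split_graphemes("ನಿಮ್ಮ ಹೆಸರು ಏನು?"))
```

A sequence of such units, rather than raw code points, is what a BPE trainer would then merge. A production parser would also need to cover joiners and malformed input.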
 
 
 
 

- For tokenizer training, we utilize a **composite corpus** consisting of:

- 1. **[Samanantar dataset](https://github.com/AI4Bharat/indic-parallel-corpus)** by AI4Bharat
- 2. **[Kannada-Instruct dataset](https://huggingface.co/datasets/Cognitive-Lab/Kannada-Instruct-dataset)** by Cognitive Lab

- Together, these datasets amount to approximately **4.5 million Kannada sentences**

- # Tokenizer Metrics

- These metrics evaluate tokenizer quality independent of any NLP model.

- ## Compression Ratio (CR)**
- Higher = better

-
- ### **Results Across Vocab Sizes**

- | Tokenizer | Vocab | CR | FS |
- |-----------|-------|------|------|--------------------|
- | **GAT (ours)** | 8k | 3.588 | 2.168 |
- | SentencePiece | 8k | 3.100 | 2.445 |
- | BPE | 8k | 3.300 | 2.711 | 16,081 |
- | WordPiece | 8k | 2.343 | 3.486 |
- | **GAT (ours)** | 16k | 2.400 | 1.986 |
- | SentencePiece | 16k | 3.78 | 1.917 |
- | BPE | 16k | 3.840 | 3.940 | 347,656 |
- | WordPiece | 16k | 3.243 | 2.676 |
- | **GAT (ours)** | 32k | 4.806 | 1.827 |
- | SentencePiece | 32k | 3.855 | 1.675 |
- | BPE | 32k | 3.512 | 1.769 |
- | WordPiece | 32k | 3.143 | 1.708 |
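
The README does not state the formulas behind the CR and FS columns. The sketch below assumes two common definitions: characters per token for CR, and tokens per whitespace-separated word for FS. Both definitions and the helper name `tokenizer_metrics` are assumptions, so its output is not guaranteed to be comparable to the table.

```python
from typing import Callable, Iterable, List, Tuple


def tokenizer_metrics(
    sentences: Iterable[str],
    tokenize: Callable[[str], List[str]],
) -> Tuple[float, float]:
    """Return (compression_ratio, fertility) over a corpus.

    Assumed definitions (not confirmed by the README):
      compression_ratio = total characters / total tokens  (higher is better)
      fertility         = total tokens / total words       (lower is better)
    """
    n_chars = n_words = n_tokens = 0
    for sentence in sentences:
        n_chars += len(sentence)            # Unicode code points, not bytes
        n_words += len(sentence.split())    # whitespace-delimited words
        n_tokens += len(tokenize(sentence))
    if n_tokens == 0 or n_words == 0:
        raise ValueError("corpus produced no tokens or no words")
    return n_chars / n_tokens, n_tokens / n_words


if __name__ == "__main__":
    corpus = ["ab cd", "efg"]
    # Toy tokenizers stand in for a real one: whole words vs. single characters.
    print(tokenizer_metrics(corpus, str.split))
    print(tokenizer_metrics(corpus, list))
```

To apply this to a real tokenizer, pass the `tokenize` method of a loaded `PreTrainedTokenizerFast` as the `tokenize` argument. Swap the denominators if the metrics are meant per grapheme or per byte instead.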

- # Usage Example

  ### Load the 32k tokenizer
  ```python
  from transformers import PreTrainedTokenizerFast

  tokenizer = PreTrainedTokenizerFast.from_pretrained(
- "varuni/GAT-K",
- tokenizer_file="GAT_Kannada_32k.json"
  )

  text = "ನಿಮ್ಮ ಹೆಸರು ಏನು?"

+ This results in:

+ - stable semantic units
+ - better compression
+ - more efficient tokenization

+ GAT uses a rule-based finite-state parser that correctly handles:

  - consonants
  - vowels
  - halants
+ - vowel signs
+ - anusvara & visarga

  <p align="center">
  <img src="./GAT-algo.png" width="650"/>
  </p>

+ After grapheme segmentation, **Byte Pair Encoding (BPE)** is applied to learn higher-level merges.
+
+ ---
+
+ # 📚 Training Data

+ Tokenizer training uses a **composite 4.5M-sentence Kannada corpus**:

+ 1. **Samanantar Dataset** (AI4Bharat)
+ 2. **Kannada-Instruct Dataset** (Cognitive Lab)

+ This provides broad coverage of conversational, literary, and instruction-following Kannada.

+ ---

+ # 📊 Tokenizer Metrics

+ These metrics evaluate tokenizer quality independent of any downstream NLP model.

+ ## **Compression Ratio (CR)**
+ Higher = better (larger text compressed into fewer bytes)

+ ## **Fertility Score (FS)**
+ Lower = better (#tokens produced per grapheme/character)

+ ### **Results Across Vocabulary Sizes**
+
+ | Tokenizer | Vocab | CR | FS |
+ |-----------|-------|-------|-------|
+ | **GAT (ours)** | 8k | **3.588** | 2.168 |
+ | SentencePiece | 8k | 3.100 | 2.445 |
+ | BPE | 8k | 3.300 | 2.711 |
+ | WordPiece | 8k | 2.343 | 3.486 |
+ | **GAT (ours)** | 16k | **3.930** | 1.986 |
+ | SentencePiece | 16k | 3.780 | 1.917 |
+ | BPE | 16k | 3.540 | 3.940 |
+ | WordPiece | 16k | 3.243 | 2.676 |
+ | **GAT (ours)** | 32k | **4.806** | 1.827 |
+ | SentencePiece | 32k | 3.855 | 1.675 |
+ | BPE | 32k | 3.512 | 1.769 |
+ | WordPiece | 32k | 3.143 | 1.708 |
+
+ ---
+
+ # 💻 Usage Example

  ### Load the 32k tokenizer
+
  ```python
  from transformers import PreTrainedTokenizerFast

  tokenizer = PreTrainedTokenizerFast.from_pretrained(
+ "varuni/GAT-K",
+ tokenizer_file="GAT_Kannada_32k.json"
  )

  text = "ನಿಮ್ಮ ಹೆಸರು ಏನು?"