varuni committed on
Commit b52a444 · verified · 1 Parent(s): ab1f7bf

Update README.md

  1. README.md +48 -0
README.md CHANGED
@@ -1,4 +1,52 @@
+ ---
+ license: apache-2.0
+ ---
+
+ # 🪶 Overview
+
+ This repository contains a **Grapheme-Aware Tokenizer (GAT)** trained specifically for **Kannada** and designed around the language's orthographic and phonological structure. Unlike conventional subword tokenizers such as BPE, SentencePiece, or WordPiece, it operates at the **grapheme level**, which keeps each akshara intact and reduces subword fragmentation.
+
+ ---
+
+ # 🔤 Available Vocabulary Sizes
+
+ This repository includes **three tokenizer variants**:
+
+ | Vocabulary | File |
+ |------------|------|
+ | **8k** | `GAT_Kannada_8k.json` |
+ | **16k** | `GAT_Kannada_16k.json` |
+ | **32k** | `GAT_Kannada_32k.json` *(recommended)* |
+
+ ---
+
+ # ✨ Why Grapheme-Aware Tokenization?
+
+ Kannada is written in an **abugida** script, in which a single grapheme (akshara) may be composed of:
+
+ - multiple consonants
+ - a halant (virama)
+ - vowel diacritics (ಮಾತ್ರೆಗಳು)
+
+ For example:
+
+ ಕ್ರಿ
+
+ is **one grapheme**, but composed of multiple Unicode codepoints.
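
The codepoint structure of ಕ್ರಿ can be checked with Python's standard `unicodedata` module. This is a minimal sketch for illustration; the variable names are hypothetical and not part of this repository.

```python
import unicodedata

# The akshara ಕ್ರಿ written as explicit escapes so the codepoints are unambiguous.
akshara = "\u0c95\u0ccd\u0cb0\u0cbf"

# Pair each codepoint with its official Unicode character name.
codepoints = [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in akshara]

for code, name in codepoints:
    print(code, name)
```

The loop lists four codepoints: the consonant KA, the virama, the consonant RA, and the vowel sign I. A tokenizer that works on codepoints or bytes sees these as separate symbols.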
+
+ ### ❌ Problem with BPE / SentencePiece / WordPiece
+
+ These tokenizers build subwords from bytes or individual Unicode codepoints and have no notion of akshara boundaries, so a single akshara can be split across several tokens:
+
+ ಕ + ್ + ರ + ಿ → 3–4 fragments
+
+ ### ✅ GAT Solution
+
+ GAT applies a **custom grapheme parser** that merges these components into **one atomic unit**, so ಕ್ರಿ is kept as a single token.
+
  This results in:
 
  - stable semantic units
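
The idea of merging an akshara's components into one unit can be sketched in a few lines of Python. This is not GAT's actual parser, which is not shown in this README; it is a simplified illustration under stated assumptions: combining marks (Unicode categories `Mn` and `Mc`) attach to the preceding cluster, a Kannada letter that follows a virama joins the conjunct, and ZWJ/ZWNJ handling is ignored. The names `split_aksharas` and `is_kannada_letter` are hypothetical.

```python
import unicodedata

VIRAMA = "\u0ccd"  # KANNADA SIGN VIRAMA (halant)


def is_kannada_letter(ch):
    """True for a letter (category Lo) inside the Kannada block U+0C80..U+0CFF."""
    return "\u0c80" <= ch <= "\u0cff" and unicodedata.category(ch) == "Lo"


def split_aksharas(text):
    """Group codepoints into akshara-like clusters (simplified rules)."""
    clusters = []
    for ch in text:
        # Vowel signs, the virama, anusvara and visarga are combining marks.
        is_mark = unicodedata.category(ch) in ("Mn", "Mc")
        # A letter right after a virama continues a consonant conjunct.
        after_virama = bool(clusters) and clusters[-1].endswith(VIRAMA)
        if clusters and (is_mark or (after_virama and is_kannada_letter(ch))):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


# ಕ್ರಿ stays one unit; ಕನ್ನಡ splits into three aksharas.
print(split_aksharas("\u0c95\u0ccd\u0cb0\u0cbf"))
print(split_aksharas("\u0c95\u0ca8\u0ccd\u0ca8\u0ca1"))
```

A production parser would more likely follow the extended grapheme cluster rules of Unicode, plus script-specific handling for joiners; the sketch only shows why cluster-level units avoid splitting inside an akshara.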