varuni committed on
Commit 7e536e5 · verified · 1 Parent(s): b52a444

Update README.md

  1. README.md +11 -13
README.md CHANGED
@@ -3,13 +3,13 @@
  license: apache-2.0
  ---

- # 🪶 Overview
+ # Overview

  This repository contains a **Grapheme-Aware Tokenizer (GAT)** specifically trained for **Kannada**, designed to handle the unique orthographic and phonological structure of the language. Unlike traditional subword tokenizers such as BPE, SentencePiece, or WordPiece, this tokenizer operates at the **grapheme level**, improving representation fidelity and reducing subword fragmentation.

  ---

- # 🔤 Available Vocabulary Sizes
+ # Available Vocabulary Sizes

  This repository includes **three tokenizer variants**:

@@ -21,7 +21,7 @@ This repository includes **three tokenizer variants**:

  ---

- # Why Grapheme-Aware Tokenization?
+ # Why Grapheme-Aware Tokenization?

  Kannada is an **Abugida** script where a single grapheme (akshara) may be composed of:

@@ -37,13 +37,7 @@ is **one grapheme**, but composed of multiple Unicode codepoints.

  ಕ್ + ರ್ + ಿ → 3–4 fragments

-
- ### ✅ GAT Solution
-
- GAT applies a **custom grapheme parser** that merges the components into **one atomic unit**:
-
-
- ### ❌ Problem with BPE / SentencePiece / WordPiece
+ ### Problem with BPE / SentencePiece / WordPiece

  These tokenizers operate at the byte or character level:

@@ -51,7 +45,11 @@ This results in:

  - stable semantic units
  - better compression
- - more efficient tokenization
+ - more efficient tokenization
+ -
+ ### GAT Solution
+
+ GAT applies a **custom grapheme parser** that merges the components into **one atomic unit**:

  GAT uses a rule-based finite-state parser that correctly handles:

@@ -69,7 +67,7 @@ After grapheme segmentation, **Byte Pair Encoding (BPE)** is applied to learn hi

  ---

- # 📚 Training Data
+ # Training Data

  Tokenizer training uses a **composite 4.5M-sentence Kannada corpus**:

@@ -80,7 +78,7 @@ This provides broad coverage of conversational, literary, and instruction-follow

  ---

- # 📊 Tokenizer Metrics
+ # Tokenizer Metrics

  These metrics evaluate tokenizer quality independent of any downstream NLP model.
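
The README lines in this diff describe merging the codepoints of one akshara into a single unit. As a rough illustration only, the Python sketch below groups Kannada codepoints into akshara-like clusters with two simple rules: combining marks attach to the preceding cluster, and a consonant that follows a virama continues the conjunct. The function name `split_aksharas`, the helper `is_kannada_consonant`, and these rules are assumptions made for this example; they are not the repository's finite-state parser and may not cover every case it handles.

```python
import unicodedata

VIRAMA = "\u0CCD"               # Kannada sign virama (halant)
JOINERS = {"\u200C", "\u200D"}  # ZWNJ and ZWJ


def is_kannada_consonant(ch):
    """Hypothetical helper: True for codepoints in the Kannada consonant range."""
    return "\u0C95" <= ch <= "\u0CB9" or ch in ("\u0CDD", "\u0CDE")


def split_aksharas(text):
    """Group codepoints into akshara-like clusters (simplified, assumed rules)."""
    clusters = []
    for ch in text:
        if not clusters:
            clusters.append(ch)
            continue
        prev = clusters[-1][-1]
        # Rule 1: combining marks (vowel signs, virama, anusvara, ...) and
        # joiners attach to the cluster before them.
        is_mark = unicodedata.category(ch) in ("Mn", "Mc") or ch in JOINERS
        # Rule 2: a consonant directly after a virama continues the conjunct.
        # A consonant after a joiner starts a new cluster in this sketch.
        joins_conjunct = prev == VIRAMA and is_kannada_consonant(ch)
        if is_mark or joins_conjunct:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


# "ಕ್ರಿ" written as explicit codepoints: KA, VIRAMA, RA, VOWEL SIGN I
kri = "\u0C95\u0CCD\u0CB0\u0CBF"
print(len(kri), len(split_aksharas(kri)))  # 4 codepoints, 1 cluster
```

For the `ಕ್ರಿ` example quoted in the diff, the sketch keeps all four codepoints in one cluster, which is the kind of behaviour the README attributes to grapheme-level tokenization.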