Update README.md
README.md
CHANGED
@@ -3,13 +3,13 @@
 license: apache-2.0
 ---
 
-#
+# Overview
 
 This repository contains a **Grapheme-Aware Tokenizer (GAT)** specifically trained for **Kannada**, designed to handle the unique orthographic and phonological structure of the language. Unlike traditional subword tokenizers such as BPE, SentencePiece, or WordPiece, this tokenizer operates at the **grapheme level**, improving representation fidelity and reducing subword fragmentation.
 
 ---
 
-#
+# Available Vocabulary Sizes
 
 This repository includes **three tokenizer variants**:
 
@@ -21,7 +21,7 @@ This repository includes **three tokenizer variants**:
 
 ---
 
-#
+# Why Grapheme-Aware Tokenization?
 
 Kannada is an **Abugida** script where a single grapheme (akshara) may be composed of:
 
@@ -37,13 +37,7 @@ is **one grapheme**, but composed of multiple Unicode codepoints.
 
 ಕ್ + ರ್ + ಿ → 3–4 fragments
 
-
-### ✅ GAT Solution
-
-GAT applies a **custom grapheme parser** that merges the components into **one atomic unit**:
-
-
-### ❌ Problem with BPE / SentencePiece / WordPiece
+### Problem with BPE / SentencePiece / WordPiece
 
 These tokenizers operate at the byte or character level:
 
@@ -51,7 +45,11 @@ This results in:
 
 - stable semantic units
 - better compression
-- more efficient tokenization
+- more efficient tokenization
+-
+### GAT Solution
+
+GAT applies a **custom grapheme parser** that merges the components into **one atomic unit**:
 
 GAT uses a rule-based finite-state parser that correctly handles:
 
@@ -69,7 +67,7 @@ After grapheme segmentation, **Byte Pair Encoding (BPE)** is applied to learn hi
 
 ---
 
-#
+# Training Data
 
 Tokenizer training uses a **composite 4.5M-sentence Kannada corpus**:
 
@@ -80,7 +78,7 @@ This provides broad coverage of conversational, literary, and instruction-follow
 
 ---
 
-#
+# Tokenizer Metrics
 
 These metrics evaluate tokenizer quality independent of any downstream NLP model.
 
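
The README text in this diff describes a conjunct such as ಕ್ರಿ as one grapheme built from several Unicode codepoints. A minimal sketch of that kind of segmentation follows. It is not the repository's parser: `split_aksharas` and `is_kannada_consonant` are hypothetical names chosen for illustration, the consonant range is taken from the Unicode Kannada block, and the rules are deliberately simplified (ZWJ/ZWNJ and other edge cases are ignored).

```python
import unicodedata

VIRAMA = "\u0ccd"  # KANNADA SIGN VIRAMA (halant)


def is_kannada_consonant(ch: str) -> bool:
    """Rough check: consonants in the Unicode Kannada block (assumed range)."""
    return "\u0c95" <= ch <= "\u0cb9" or ch == "\u0cde"


def split_aksharas(text: str) -> list[str]:
    """Group codepoints into akshara-like clusters (illustrative only).

    A codepoint is appended to the previous cluster when it is
    - a combining mark (vowel sign, virama, anusvara, visarga), or
    - a consonant that directly follows a virama (a conjunct).
    Anything else starts a new cluster.
    """
    clusters: list[str] = []
    for ch in text:
        if clusters:
            prev = clusters[-1]
            is_mark = unicodedata.category(ch) in ("Mn", "Mc")
            joins_conjunct = prev.endswith(VIRAMA) and is_kannada_consonant(ch)
            if is_mark or joins_conjunct:
                clusters[-1] = prev + ch
                continue
        clusters.append(ch)
    return clusters


# ಕ್ರಿ = KA + VIRAMA + RA + VOWEL SIGN I: four codepoints, one cluster
kri = "\u0c95\u0ccd\u0cb0\u0cbf"
print(len(kri), len(split_aksharas(kri)))  # 4 1
```

Clusters produced this way could then be treated as the atomic symbols over which BPE merges are learned, which is the two-stage design the README outlines.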