Update README.md
README.md
CHANGED
@@ -1,4 +1,52 @@
---
license: apache-2.0
---

# 🪶 Overview

This repository contains a **Grapheme-Aware Tokenizer (GAT)** specifically trained for **Kannada**, designed to handle the unique orthographic and phonological structure of the language. Unlike traditional subword tokenizers such as BPE, SentencePiece, or WordPiece, this tokenizer operates at the **grapheme level**, improving representation fidelity and reducing subword fragmentation.

---

# 🔤 Available Vocabulary Sizes

This repository includes **three tokenizer variants**:

| Vocabulary | File |
|------------|-------|
| **8k** | `GAT_Kannada_8k.json` |
| **16k** | `GAT_Kannada_16k.json` |
| **32k** | `GAT_Kannada_32k.json` *(recommended)* |

---

# ✨ Why Grapheme-Aware Tokenization?

Kannada is written in an **abugida** script, where a single grapheme (akshara) may be composed of:

- multiple consonants
- a halant (virama)
- vowel diacritics (ಮಾತ್ರೆಗಳು)

For example:

ಕ್ರಿ

is **one grapheme**, but it is composed of multiple Unicode code points.

ಕ್ + ರ್ + ಿ → 3–4 fragments
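
To make the example concrete, the short Python sketch below lists the Unicode code points behind ಕ್ರಿ. It uses only the standard library; the helper name `describe_codepoints` is hypothetical and is not part of this repository.

```python
import unicodedata

# The akshara ಕ್ರಿ, written with explicit escapes so the code points are unambiguous.
akshara = "\u0c95\u0ccd\u0cb0\u0cbf"

def describe_codepoints(text):
    """Return (hex code point, Unicode name) pairs for each character."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]

for code, name in describe_codepoints(akshara):
    print(code, name)

print(len(akshara))                  # number of code points
print(len(akshara.encode("utf-8")))  # number of UTF-8 bytes
```

A tokenizer that works on code points or bytes sees several separate symbols here, even though a reader sees a single written unit.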

### ✅ GAT Solution

GAT applies a **custom grapheme parser** that merges the components into **one atomic unit**:
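
The README does not show the parser itself, so the following is only a minimal sketch of how such merging could work, not GAT's actual implementation. It assumes a simplified rule: combining marks attach to the preceding cluster, and a consonant that follows a virama joins the current conjunct. The names `split_aksharas` and `is_kannada_consonant` are hypothetical.

```python
import unicodedata

VIRAMA = "\u0ccd"  # KANNADA SIGN VIRAMA (halant)

def is_kannada_consonant(ch):
    # Main consonant range of the Kannada block, KA (U+0C95) to HA (U+0CB9).
    return "\u0c95" <= ch <= "\u0cb9"

def split_aksharas(text):
    """Group code points into akshara-like clusters (simplified assumption)."""
    clusters = []
    for ch in text:
        # Vowel signs, the virama, and the anusvara all have a mark category (M*).
        is_mark = unicodedata.category(ch).startswith("M")
        # A consonant right after a virama continues the same conjunct.
        joins_conjunct = (
            bool(clusters)
            and clusters[-1].endswith(VIRAMA)
            and is_kannada_consonant(ch)
        )
        if clusters and (is_mark or joins_conjunct):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

print(split_aksharas("\u0c95\u0ccd\u0cb0\u0cbf"))        # ಕ್ರಿ: one cluster
print(split_aksharas("\u0c95\u0ca8\u0ccd\u0ca8\u0ca1"))  # ಕನ್ನಡ: three clusters
```

This sketch ignores zero-width joiners and other edge cases that a production parser would need to handle; a rule set based on Unicode extended grapheme clusters would be a more robust starting point.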

### ❌ Problem with BPE / SentencePiece / WordPiece

These tokenizers operate at the byte or character level:

This results in:

- stable semantic units