varuni
/

GAT-K

varuni committed on Nov 29, 2025

Commit

b52a444

verified ·

1 Parent(s): ab1f7bf

Update README.md

Files changed (1)

README.md CHANGED

@@ -1,4 +1,52 @@
 This results in:
 - stable semantic units

+---
+license: apache-2.0
+---
+# 🪶 Overview
+This repository contains a **Grapheme-Aware Tokenizer (GAT)** trained specifically for **Kannada** and designed to handle the orthographic and phonological structure of the language. Unlike traditional subword tokenizers such as BPE, SentencePiece, or WordPiece, this tokenizer operates at the **grapheme level**, so each akshara is kept intact as a unit and subword fragmentation is reduced.
+---
+# 🔤 Available Vocabulary Sizes
+This repository includes **three tokenizer variants**:
+| Vocabulary | File |
+|------------|-------|
+| **8k**  | `GAT_Kannada_8k.json` |
+| **16k** | `GAT_Kannada_16k.json` |
+| **32k** | `GAT_Kannada_32k.json` *(recommended)* |
+---
+# ✨ Why Grapheme-Aware Tokenization?
+Kannada is written in an **abugida** script, in which a single grapheme (akshara) may be composed of:
+- multiple consonants
+- a halant (virama)
+- vowel diacritics (ಮಾತ್ರೆಗಳು)
+For example:
+ಕ್ರಿ
+is **one grapheme**, but it is composed of multiple Unicode code points.
+### ❌ Problem with BPE / SentencePiece / WordPiece
+These tokenizers start from bytes or individual characters, so a single akshara can be split across several tokens:
+ಕ್ + ರ್ + ಿ → 3–4 fragments
+### ✅ GAT Solution
+GAT applies a **custom grapheme parser** that merges the components into **one atomic unit**.
 This results in:
 - stable semantic units
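
Reviewer note on the grapheme example in the README above: the sketch below is not the repository's parser, which this diff does not show. It is a minimal illustration, using only Python's standard `unicodedata` module, of why ಕ್ರಿ counts as several code points and how a simple rule can regroup them into one akshara. The names `code_points` and `split_aksharas` are hypothetical, and the clustering rule is a deliberate simplification (it ignores ZWJ/ZWNJ and other edge cases covered by full grapheme segmentation).

```python
import unicodedata

VIRAMA = "\u0CCD"  # KANNADA SIGN VIRAMA (halant)


def code_points(text):
    """Return (hex code point, Unicode name) pairs for each character."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


def split_aksharas(text):
    """Group code points into approximate aksharas.

    Simplified rule (an assumption, not the GAT parser): a code point joins
    the current cluster if it is a combining mark (category Mn or Mc), or if
    it is a letter (category Lo) that directly follows a virama, which forms
    a consonant conjunct.
    """
    clusters = []
    for ch in text:
        category = unicodedata.category(ch)
        is_mark = category in ("Mn", "Mc")
        joins_conjunct = (
            bool(clusters) and clusters[-1].endswith(VIRAMA) and category == "Lo"
        )
        if clusters and (is_mark or joins_conjunct):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


if __name__ == "__main__":
    kri = "\u0C95\u0CCD\u0CB0\u0CBF"  # the akshara ಕ್ರಿ from the README
    for cp, name in code_points(kri):
        print(cp, name)
    print(split_aksharas(kri))

    kannada = "\u0C95\u0CA8\u0CCD\u0CA8\u0CA1"  # the word ಕನ್ನಡ
    print(split_aksharas(kannada))
```

Under this rule the four code points of ಕ್ರಿ come back as a single cluster, which is the behaviour the README attributes to grapheme-level tokenization. A production tokenizer would more likely follow the extended grapheme cluster rules of Unicode rather than this two-condition heuristic.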