varuni/GAT-K

varuni committed on Nov 29, 2025

Commit 7e536e5 (verified) · 1 Parent(s): b52a444

Update README.md

Files changed (1)

README.md +11 -13

README.md CHANGED

@@ -3,13 +3,13 @@
 license: apache-2.0
 ---
-# 🪶 Overview
 This repository contains a **Grapheme-Aware Tokenizer (GAT)** specifically trained for **Kannada**, designed to handle the unique orthographic and phonological structure of the language. Unlike traditional subword tokenizers such as BPE, SentencePiece, or WordPiece, this tokenizer operates at the **grapheme level**, improving representation fidelity and reducing subword fragmentation.
 ---
-# 🔤 Available Vocabulary Sizes
 This repository includes **three tokenizer variants**:
@@ -21,7 +21,7 @@ This repository includes **three tokenizer variants**:
 ---
-# ✨ Why Grapheme-Aware Tokenization?
 Kannada is an **Abugida** script where a single grapheme (akshara) may be composed of:
@@ -37,13 +37,7 @@ is **one grapheme**, but composed of multiple Unicode codepoints.
 ಕ್ + ರ್ + ಿ → 3–4 fragments
-### ✅ GAT Solution
-GAT applies a **custom grapheme parser** that merges the components into **one atomic unit**:
-### ❌ Problem with BPE / SentencePiece / WordPiece
 These tokenizers operate at the byte or character level:
@@ -51,7 +45,11 @@ This results in:
 - stable semantic units
 - better compression
-- more efficient tokenization
 GAT uses a rule-based finite-state parser that correctly handles:
@@ -69,7 +67,7 @@ After grapheme segmentation, **Byte Pair Encoding (BPE)** is applied to learn hi
 ---
-# 📚 Training Data
 Tokenizer training uses a **composite 4.5M-sentence Kannada corpus**:
@@ -80,7 +78,7 @@ This provides broad coverage of conversational, literary, and instruction-follow
 ---
-# 📊 Tokenizer Metrics
 These metrics evaluate tokenizer quality independent of any downstream NLP model.

 license: apache-2.0
 ---
+# Overview
 This repository contains a **Grapheme-Aware Tokenizer (GAT)** specifically trained for **Kannada**, designed to handle the unique orthographic and phonological structure of the language. Unlike traditional subword tokenizers such as BPE, SentencePiece, or WordPiece, this tokenizer operates at the **grapheme level**, improving representation fidelity and reducing subword fragmentation.
 ---
+# Available Vocabulary Sizes
 This repository includes **three tokenizer variants**:
 ---
+# Why Grapheme-Aware Tokenization?
 Kannada is an **Abugida** script where a single grapheme (akshara) may be composed of:
 ಕ್ + ರ್ + ಿ → 3–4 fragments
+### Problem with BPE / SentencePiece / WordPiece
 These tokenizers operate at the byte or character level:
 - stable semantic units
 - better compression
+- more efficient tokenization
+-
+### GAT Solution
+GAT applies a **custom grapheme parser** that merges the components into **one atomic unit**:
 GAT uses a rule-based finite-state parser that correctly handles:
 ---
+# Training Data
 Tokenizer training uses a **composite 4.5M-sentence Kannada corpus**:
 ---
+# Tokenizer Metrics
 These metrics evaluate tokenizer quality independent of any downstream NLP model.
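
The README text in this diff describes merging the code points of an akshara into one atomic unit before BPE is applied. The repository's actual parser is not shown in this commit, so the following is only a minimal sketch of that idea, assuming a simplified regular-expression grammar over the Kannada Unicode block (a regular expression is itself a finite-state recognizer). The names `AKSHARA` and `segment`, and the exact character classes, are illustrative assumptions and not the GAT implementation.

```python
import re

# Simplified character classes for the Kannada block (U+0C80-U+0CFF).
# These ranges are an assumption for illustration; the real GAT rules may differ.
CONS = "[\u0C95-\u0CB9\u0CDE]"                   # consonants ka..ha, plus archaic fa
NUKTA = "\u0CBC"                                 # nukta sign
VIRAMA = "\u0CCD"                                # halant / virama
JOINER = "[\u200C\u200D]"                        # ZWNJ / ZWJ
VOWEL_SIGN = "[\u0CBE-\u0CC4\u0CC6-\u0CC8\u0CCA-\u0CCC\u0CD5\u0CD6]"
MODIFIER = "[\u0C81-\u0C83]"                     # candrabindu, anusvara, visarga
INDEP_VOWEL = "[\u0C85-\u0C94\u0CE0\u0CE1]"      # independent vowels

AKSHARA = re.compile(
    # Conjunct: zero or more (consonant + virama) links, then a base consonant,
    # then either a trailing virama or one or more vowel signs, then modifiers.
    f"(?:{CONS}{NUKTA}?{VIRAMA}{JOINER}?)*{CONS}{NUKTA}?"
    f"(?:{VIRAMA}|{VOWEL_SIGN}+)?{MODIFIER}*"
    # Independent vowel with optional modifiers.
    f"|{INDEP_VOWEL}{MODIFIER}*"
    # Anything else (Latin letters, digits, whitespace) passes through as-is.
    "|.",
    re.DOTALL,
)


def segment(text: str) -> list[str]:
    """Split text into akshara-like units; joining the result restores the input."""
    return AKSHARA.findall(text)


if __name__ == "__main__":
    word = "\u0C95\u0CCD\u0CB0\u0CBF"  # ಕ್ರಿ: ka + virama + ra + vowel sign i
    print(len(word), segment(word))    # 4 code points, grouped into 1 unit
```

In a pipeline like the one the README describes, units produced this way would be the symbols that BPE merges operate on, so a merge can never split an akshara. Handling of rare signs and malformed sequences is deliberately left out of this sketch.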