varuni/GAT-K

varuni committed on Nov 29, 2025

Commit

ab1f7bf

verified ·

1 Parent(s): ba86d5d

Update README.md

Files changed (1)

README.md +46 -69

README.md CHANGED

@@ -1,99 +1,76 @@
----
-license: apache-2.0
----
-# Overview
-This repository contains a **Grapheme-Aware Tokenizer (GAT)** specifically trained for **Kannada**, designed to handle the unique orthographic and phonological structure of the language. Unlike traditional subword tokenizers such as BPE or WordPiece, this tokenizer operates at the **grapheme level**, improving representation fidelity and reducing tokenization imbalance in token count.
-# Available Vocabulary Sizes
-This repository includes **three tokenizer variants**:
-| Vocabulary | File |
-|------------|-------|
-| **5k** | `GAT_Kannada_8k.json` |
-| **16k** | `GAT_Kannada_16k.json` |
-| **32k** | `GAT_Kannada_32k.json` |
-# Why Grapheme-Aware Tokenization?
-Kannada is an **Abugida** script where a single grapheme may be composed of:
-- multiple consonants
-- halant (virama)
-- dependent vowel signs (diacritics)
-For example:
-ಕ್ರಿ is one grapheme but consists of multiple Unicode codepoints.
-### BPE/SentencePiece/WordPiece Problem:
-They split Kannada graphemes incorrectly:
-ಕ್ + ರ್ + ಿ (3–4 fragments)
-### GAT Solution:
-A custom grapheme parser merges characters into **one atomic unit**:
-ಕ್ರಿ → 1 grapheme
-This improves token stability, compression, and efficiency.
-GAT uses a rule-based finite-state parser that handles:
 - consonants
 - vowels
 - halants
-- vowel diacritics
-- signs (anusvara, visarga)
 <p align="center">
   <img src="./GAT-algo.png" width="650"/>
 </p>
-This pre-tokenized output is then passed to **Byte Pair Encoding (BPE)** to learn statistically meaningful merges on top of linguistically meaningful units.
-For tokenizer training, we utilize a **composite corpus** consisting of:
-1. **[Samanantar dataset](https://github.com/AI4Bharat/indic-parallel-corpus)** by AI4Bharat
-2. **[Kannada-Instruct dataset](https://huggingface.co/datasets/Cognitive-Lab/Kannada-Instruct-dataset)** by Cognitive Lab
-Together, these datasets amount to approximately **4.5 million Kannada sentences**
-# Tokenizer Metrics
-These metrics evaluate tokenizer quality independent of any NLP model.
-## Compression Ratio (CR)**
-Higher = better
-### **Results Across Vocab Sizes**
-| Tokenizer | Vocab | CR | FS  |
-|-----------|-------|------|------|--------------------|
-| **GAT (ours)** | 8k | 3.588 | 2.168 |
-| SentencePiece | 8k | 3.100 | 2.445 |
-| BPE | 8k | 3.300 | 2.711 | 16,081 |
-| WordPiece | 8k | 2.343 | 3.486 |
-| **GAT (ours)** | 16k | 2.400 | 1.986 |
-| SentencePiece | 16k | 3.78 | 1.917 |
-| BPE | 16k | 3.840 | 3.940 | 347,656 |
-| WordPiece | 16k | 3.243 | 2.676 |
-| **GAT (ours)** | 32k | 4.806 | 1.827 |
-| SentencePiece | 32k | 3.855 | 1.675 |
-| BPE | 32k | 3.512 | 1.769 |
-| WordPiece | 32k | 3.143 | 1.708 |
-# Usage Example
 ### Load the 32k tokenizer
 ```python
 from transformers import PreTrainedTokenizerFast
 tokenizer = PreTrainedTokenizerFast.from_pretrained(
-  "varuni/GAT-K",
-  tokenizer_file="GAT_Kannada_32k.json"
 )
 text = "ನಿಮ್ಮ ಹೆಸರು ಏನು?"

+This results in:
+- stable semantic units
+- better compression
+- more efficient tokenization
+GAT uses a rule-based finite-state parser that correctly handles:
 - consonants
 - vowels
 - halants
+- vowel signs
+- anusvara & visarga
 <p align="center">
   <img src="./GAT-algo.png" width="650"/>
 </p>
+After grapheme segmentation, **Byte Pair Encoding (BPE)** is applied to learn higher-level merges.
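The grapheme segmentation step that precedes BPE can be illustrated with a small regular-expression sketch. This is not the repository's parser, whose rules are only shown in the `GAT-algo.png` figure. The character classes, the pattern, and the function name `segment_graphemes` are assumptions made for illustration, and the pattern covers only the common consonant-cluster case of the Kannada Unicode block.

```python
import re

# Character classes from the Kannada Unicode block (U+0C80-U+0CFF).
# These groupings are an illustrative assumption, not the README's rule set.
CONS = "\u0C95-\u0CB9\u0CDE"                  # consonants
VOWEL = "\u0C85-\u0C94\u0CE0\u0CE1"           # independent vowels
SIGN = "\u0CBE-\u0CCC\u0CD5\u0CD6\u0CE2\u0CE3"  # dependent vowel signs
HALANT = "\u0CCD"                             # virama
MOD = "\u0C81-\u0C83"                         # candrabindu, anusvara, visarga
NUKTA = "\u0CBC"

# One grapheme cluster is either:
#   (consonant + halant)* consonant [halant | vowel signs] modifiers*
#   independent vowel modifiers*
#   any other single character (spaces, punctuation, digits, ...)
GRAPHEME = re.compile(
    rf"(?:[{CONS}]{NUKTA}?{HALANT})*[{CONS}]{NUKTA}?(?:{HALANT}|[{SIGN}]+)?[{MOD}]*"
    rf"|[{VOWEL}][{MOD}]*"
    rf"|.",
    re.DOTALL,
)


def segment_graphemes(text: str) -> list:
    """Split text into grapheme-like units; every character is kept."""
    return GRAPHEME.findall(text)


# The conjunct from the README (ka + halant + ra + vowel sign i) stays whole.
print(segment_graphemes("\u0C95\u0CCD\u0CB0\u0CBF"))
```

Units produced this way could then serve as the atomic symbols over which BPE merges are learned, so a merge never splits a conjunct or separates a vowel sign from its consonant.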
+---
+# 📚 Training Data
+Tokenizer training uses a **composite 4.5M-sentence Kannada corpus**:
+1. **Samanantar Dataset** (AI4Bharat)
+2. **Kannada-Instruct Dataset** (Cognitive Lab)
+This provides broad coverage of conversational, literary, and instruction-following Kannada.
+---
+# 📊 Tokenizer Metrics
+These metrics evaluate tokenizer quality independent of any downstream NLP model.
+## **Compression Ratio (CR)**
+Higher = better (more text represented by fewer tokens)
+## **Fertility Score (FS)**
+Lower = better (#tokens produced per grapheme/character)
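The README does not give formulas for these two metrics, so the following is only a sketch under stated assumptions: CR is taken as characters of text per token, and FS as tokens per counting unit, where the unit (grapheme, character, or word) is whatever the caller supplies. The function names `compression_ratio` and `fertility` are hypothetical and do not come from the repository.

```python
from typing import Sequence


def compression_ratio(text: str, tokens: Sequence[str]) -> float:
    """Assumed definition: characters of input text per emitted token."""
    if not tokens:
        raise ValueError("tokens must not be empty")
    return len(text) / len(tokens)


def fertility(units: Sequence[str], tokens: Sequence[str]) -> float:
    """Assumed definition: emitted tokens per counting unit."""
    if not units:
        raise ValueError("units must not be empty")
    return len(tokens) / len(units)


# Toy example with a made-up segmentation (not real tokenizer output).
text = "abcdefgh"
tokens = ["abc", "de", "fgh"]
units = ["abcde", "fgh"]  # e.g. two words
print(compression_ratio(text, tokens), fertility(units, tokens))
```

Under these assumed definitions, a tokenizer that emits fewer tokens for the same text raises CR and lowers FS, which matches the "higher = better" and "lower = better" directions stated above.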
+### **Results Across Vocabulary Sizes**
+| Tokenizer | Vocab | CR    | FS    |
+|-----------|-------|-------|-------|
+| **GAT (ours)**      | 8k  | **3.588** | 2.168 |
+| SentencePiece       | 8k  | 3.100 | 2.445 |
+| BPE                 | 8k  | 3.300 | 2.711 |
+| WordPiece           | 8k  | 2.343 | 3.486 |
+| **GAT (ours)**      | 16k | **3.930** | 1.986 |
+| SentencePiece       | 16k | 3.780 | 1.917 |
+| BPE                 | 16k | 3.540 | 3.940 |
+| WordPiece           | 16k | 3.243 | 2.676 |
+| **GAT (ours)**      | 32k | **4.806** | 1.827 |
+| SentencePiece       | 32k | 3.855 | 1.675 |
+| BPE                 | 32k | 3.512 | 1.769 |
+| WordPiece           | 32k | 3.143 | 1.708 |
+---
+# 💻 Usage Example
 ### Load the 32k tokenizer
 ```python
 from transformers import PreTrainedTokenizerFast
 tokenizer = PreTrainedTokenizerFast.from_pretrained(
+    "varuni/GAT-K",
+    tokenizer_file="GAT_Kannada_32k.json"
 )
 text = "ನಿಮ್ಮ ಹೆಸರು ಏನು?"