---
license: apache-2.0
---
# Overview
This repository contains a **Grapheme-Aware Tokenizer (GAT)** specifically trained for **Kannada**, designed to handle the unique orthographic and phonological structure of the language. Unlike traditional subword tokenizers such as BPE, SentencePiece, or WordPiece, this tokenizer operates at the **grapheme level**, improving representation fidelity and reducing subword fragmentation.
---
# Available Vocabulary Sizes
This repository includes **three tokenizer variants**:
| Vocabulary | File |
|------------|-------|
| **8k** | `GAT_Kannada_8k.json` |
| **16k** | `GAT_Kannada_16k.json` |
| **32k** | `GAT_Kannada_32k.json` *(recommended)* |
---
# Why Grapheme-Aware Preprocessing?
Kannada is written in an **abugida** script, in which a single grapheme (akshara) may be composed of:
- multiple consonants
- a halant (virama)
- vowel diacritics (matra)
For example, ಕ್ರಿ is **one grapheme** but is encoded as four Unicode code points: KA (U+0C95), virama (U+0CCD), RA (U+0CB0), and vowel sign I (U+0CBF).
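The code point structure of this akshara can be inspected with Python's standard `unicodedata` module. This is a small illustrative sketch; the variable names are hypothetical and not part of this repository.

```python
import unicodedata

# The akshara from the example above, written as explicit code points
akshara = "\u0c95\u0ccd\u0cb0\u0cbf"  # ಕ್ರಿ

# Pair each code point with its official Unicode character name
codepoints = [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in akshara]

for cp, name in codepoints:
    print(cp, name)

# One visible grapheme, but several code points and even more UTF-8 bytes
print(len(akshara), "code points,", len(akshara.encode("utf-8")), "UTF-8 bytes")
```

A tokenizer that starts from bytes or code points sees these parts separately, which is why a single akshara can end up split across tokens.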
### Problem with BPE / SentencePiece / WordPiece
These tokenizers start from bytes or individual Unicode code points, so a single akshara can be split apart:

ಕ್ರಿ → 3–4 fragments (for example ಕ, ್, ರ, ಿ)

### GAT Solution
GAT applies a **custom, rule-based finite-state grapheme parser** that merges these components into **one atomic unit**. This results in:
- stable semantic units
- better compression
- more efficient tokenization

The parser handles:
- consonants
- vowels
- halants
- vowel signs
- anusvara & visarga
<p align="center">
<img src="./GAT-algo.png" width="650"/>
</p>
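The parsing step can be illustrated with a minimal sketch. It is not the repository's parser: the function name `split_graphemes`, the code point ranges, and the attachment rules below are assumptions chosen for illustration, and cases such as ZWJ/ZWNJ sequences are deliberately left out.

```python
VIRAMA = "\u0ccd"  # halant

# Signs that attach to the preceding base character (assumed set for this sketch):
# candrabindu, anusvara, visarga, nukta, length marks, vocalic vowel signs,
# plus the main vowel-sign range U+0CBE..U+0CCC
DEPENDENT = set("\u0c81\u0c82\u0c83\u0cbc\u0cd5\u0cd6\u0ce2\u0ce3") | {
    chr(c) for c in range(0x0CBE, 0x0CCD)
}


def is_consonant(ch: str) -> bool:
    """True for Kannada consonant letters (KA..HA, plus the archaic FA)."""
    return "\u0c95" <= ch <= "\u0cb9" or ch == "\u0cde"


def in_kannada_block(ch: str) -> bool:
    return "\u0c80" <= ch <= "\u0cff"


def split_graphemes(text: str) -> list[str]:
    """Group code points into akshara-like clusters (simplified rules)."""
    clusters: list[str] = []
    for ch in text:
        prev = clusters[-1] if clusters else ""
        attaches = bool(prev) and in_kannada_block(prev[-1]) and (
            ch == VIRAMA                                   # halant joins its consonant
            or ch in DEPENDENT                             # vowel signs, anusvara, visarga
            or (is_consonant(ch) and prev[-1] == VIRAMA)   # conjunct: consonant after halant
        )
        if attaches:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


print(split_graphemes("\u0c95\u0ccd\u0cb0\u0cbf"))  # the akshara ಕ್ರಿ
print(split_graphemes("ನಿಮ್ಮ ಹೆಸರು ಏನು?"))
```

Under these rules, spaces and punctuation stay as separate single-character clusters, and a consonant only joins the previous cluster when it directly follows a halant.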
After grapheme segmentation, **Byte Pair Encoding (BPE)** is applied to learn higher-level merges.
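How BPE merges build on grapheme units can be shown with a toy sketch. This is not the trainer used for this repository: the function names `learn_bpe_merges` and `merge_pair`, the stopping rule, and the tiny word list are assumptions for illustration only. The input is a list of words that have already been split into graphemes.

```python
from collections import Counter


def merge_pair(symbols: list[str], pair: tuple[str, str]) -> list[str]:
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out


def learn_bpe_merges(words: list[list[str]], num_merges: int):
    """Learn up to `num_merges` merges over grapheme sequences.

    Returns the list of merged pairs and the corpus after merging.
    Stops early when no adjacent pair occurs at least twice.
    """
    corpus = [list(w) for w in words]
    merges: list[tuple[str, str]] = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols in corpus:
            pairs.update(zip(symbols, symbols[1:]))  # count adjacent pairs
        if not pairs:
            break
        best, count = pairs.most_common(1)[0]
        if count < 2:
            break
        merges.append(best)
        corpus = [merge_pair(symbols, best) for symbols in corpus]
    return merges, corpus


# Hypothetical pre-segmented words: each list element is one grapheme
words = [
    ["ಕ", "ನ್ನ", "ಡ"],
    ["ಕ", "ನ್ನ", "ಡಿ"],
    ["ಕ", "ನ್ನ", "ಡ", "ಕ"],
]
merges, corpus = learn_bpe_merges(words, num_merges=2)
print(merges)
print(corpus)
```

Because the smallest symbol is a whole grapheme, no merge in this scheme can produce a token that ends in the middle of an akshara.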
---
# Training Data
Tokenizer training uses a **composite 4.5M-sentence Kannada corpus**:
1. **Samanantar Dataset** (AI4Bharat)
2. **Kannada-Instruct Dataset** (Cognitive Lab)
This provides broad coverage of conversational, literary, and instruction-following Kannada.
---
# Tokenizer Metrics
These metrics evaluate tokenizer quality independent of any downstream NLP model.
## **Compression Ratio (CR)**
Higher is better: more text is represented by fewer tokens.
## **Fertility Score (FS)**
Lower is better: fewer tokens are produced per grapheme or character.
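As a rough sketch, both metrics reduce to simple ratios. The exact definitions used for the reported numbers are not stated here, so the ones below are assumptions: CR is taken as UTF-8 bytes of input per token, and FS as tokens per reference unit (grapheme or character). The function names are hypothetical.

```python
def compression_ratio(text: str, tokens: list) -> float:
    """Assumed definition: UTF-8 bytes of the input text per produced token."""
    if not tokens:
        raise ValueError("tokens must not be empty")
    return len(text.encode("utf-8")) / len(tokens)


def fertility(tokens: list, units: list) -> float:
    """Assumed definition: tokens produced per reference unit."""
    if not units:
        raise ValueError("units must not be empty")
    return len(tokens) / len(units)


# Hypothetical tokenization of one word, for illustration only
text = "ಕನ್ನಡ"
tokens = ["ಕ", "ನ್ನಡ"]
print(compression_ratio(text, tokens))
print(fertility(tokens, units=list(text)))  # here the unit is one code point
```

Whichever definitions are used, the same ones should be applied to every tokenizer being compared, since changing the unit (byte, character, grapheme, word) changes the scale of both scores.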
### **Results for CR and FS**
GAT consistently showed a better compression ratio and fertility score across vocabulary sizes:
- CR: 3.5 → 3.9 → 4.8
- FS: 2.1 → 1.8 → 1.6
---
# 💻 Usage Example
### Load the 32k tokenizer
```python
from transformers import PreTrainedTokenizerFast

# Load the 32k variant from the Hugging Face Hub
tokenizer = PreTrainedTokenizerFast.from_pretrained(
    "varuni/GAT-K",
    tokenizer_file="GAT_Kannada_32k.json",
)

text = "ನಿಮ್ಮ ಹೆಸರು ಏನು?"  # "What is your name?"
print(tokenizer.encode(text))  # prints the list of token IDs
```
# Related Work
- M. Velayuthan and K. Sarveswaran, “Egalitarian Language Representation in Language Models: It All Begins with Tokenizers,” COLING 2025. arXiv:2409.11501 [cs.CL]. DOI: https://doi.org/10.48550/arXiv.2409.11501
- “Unicode Normalization and Grapheme Parsing of Indic Languages,” 2023. [Online]. Available: https://arxiv.org/abs/2306.01743
- M. K. H. and A. Giri, “Orthographic Syllable Pair Encoding for Language Modelling Tasks in Indic Languages,” in 2023 IEEE MIT Undergraduate Research Technology Conference (URTC), Cambridge, MA, USA, pp. 1–6, 2023. DOI: https://doi.org/10.1109/URTC60662.2023.10534970
- R. Sennrich, B. Haddow, and A. Birch, “Neural Machine Translation of Rare Words with Subword Units,” in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL), 2016. [Online]. Available: https://arxiv.org/abs/1508.07909