---
license: apache-2.0
---

# Overview

This repository contains a **Grapheme-Aware Tokenizer (GAT)** specifically trained for **Kannada**, designed to handle the unique orthographic and phonological structure of the language. Unlike traditional subword tokenizers such as BPE, SentencePiece, or WordPiece, this tokenizer operates at the **grapheme level**, improving representation fidelity and reducing subword fragmentation.

---

# Available Vocabulary Sizes

This repository includes **three tokenizer variants**:

| Vocabulary | File |
|------------|-------|
| **8k**  | `GAT_Kannada_8k.json` |
| **16k** | `GAT_Kannada_16k.json` |
| **32k** | `GAT_Kannada_32k.json` *(recommended)* |

---

# Why Grapheme-Aware Preprocessing?

Kannada is an **Abugida** script where a single grapheme (akshara) may be composed of:

- multiple consonants  
- a halant (virama)  
- vowel diacritics (matra)

For example:

ಕ್ರಿ 

is **one grapheme**, but composed of multiple Unicode codepoints.
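
The codepoint breakdown can be checked with Python's standard `unicodedata` module. This is a small illustrative sketch; the akshara is written with escape sequences so that each codepoint is unambiguous.

```python
import unicodedata

# The akshara from the example above, spelled out codepoint by codepoint.
akshara = "\u0c95\u0ccd\u0cb0\u0cbf"

# Look up the official Unicode name of every codepoint in the string.
names = [unicodedata.name(ch) for ch in akshara]
for ch, name in zip(akshara, names):
    print(f"U+{ord(ch):04X}  {name}")

# Number of codepoints behind a single visual unit.
print(len(akshara))
```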

### Problem with BPE / SentencePiece / WordPiece

These tokenizers operate at the byte or character level, so a single akshara can be split apart:

ಕ್ + ರ್ + ಿ → 3–4 fragments

### GAT Solution

GAT applies a **custom grapheme parser** that merges the components into **one atomic unit**, so an akshara such as ಕ್ರಿ is kept whole. This results in:

- stable semantic units  
- better compression  
- more efficient tokenization

The parser is rule-based and finite-state, and correctly handles:

- consonants  
- vowels  
- halants  
- vowel signs  
- anusvara & visarga  

<p align="center">
  <img src="./GAT-algo.png" width="650"/>
</p>
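
As a rough illustration of rule-based grapheme segmentation, the sketch below groups Kannada codepoints into aksharas. It is not the parser shipped with this repository: the function names (`segment_graphemes` and its helpers) and the exact rules are assumptions made for this example, and cases such as zero-width joiners are deliberately left out.

```python
VIRAMA = "\u0ccd"  # KANNADA SIGN VIRAMA (halant)

def is_kannada(ch):
    # True for any codepoint in the Kannada Unicode block.
    return 0x0C80 <= ord(ch) <= 0x0CFF

def is_consonant(ch):
    # KA..HA, plus the rare letter at U+0CDE.
    return 0x0C95 <= ord(ch) <= 0x0CB9 or ord(ch) == 0x0CDE

def is_dependent_sign(ch):
    cp = ord(ch)
    return (
        0x0C81 <= cp <= 0x0C83       # candrabindu, anusvara, visarga
        or cp == 0x0CBC              # nukta
        or 0x0CBE <= cp <= 0x0CCD    # vowel signs (matras) and virama
        or cp in (0x0CD5, 0x0CD6)    # length marks
    )

def segment_graphemes(text):
    """Split text into grapheme clusters (hypothetical helper, simplified rules)."""
    clusters = []
    for ch in text:
        prev = clusters[-1] if clusters else ""
        if prev and is_dependent_sign(ch) and is_kannada(prev[-1]):
            # Matras, virama, anusvara and visarga attach to the current cluster.
            clusters[-1] += ch
        elif prev.endswith(VIRAMA) and is_consonant(ch):
            # A consonant right after a virama continues a conjunct.
            clusters[-1] += ch
        else:
            # Anything else (new consonant, vowel, space, punctuation) starts a cluster.
            clusters.append(ch)
    return clusters

# KA + VIRAMA + RA + VOWEL SIGN I is kept together as one cluster.
print(segment_graphemes("\u0c95\u0ccd\u0cb0\u0cbf"))
```

Non-Kannada characters such as spaces and punctuation become single-character clusters, which keeps the sketch usable on mixed text.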

After grapheme segmentation, **Byte Pair Encoding (BPE)** is applied to learn higher-level merges.
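
To make the idea of learning merges over graphemes concrete, here is a toy BPE learner that treats each grapheme as an indivisible starting symbol. It is a minimal sketch, not the training code used for these tokenizers; `learn_bpe`, `merge_pair`, and the sample input are hypothetical.

```python
from collections import Counter

def merge_pair(seq, pair):
    """Replace every adjacent occurrence of `pair` in `seq` with the joined unit."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(seq[i] + seq[i + 1])
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merges over words given as lists of graphemes."""
    corpus = [list(word) for word in words]
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent pair of units occurs.
        pairs = Counter()
        for seq in corpus:
            pairs.update(zip(seq, seq[1:]))
        if not pairs:
            break
        # Pick the most frequent pair (first seen wins on ties) and merge it everywhere.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        corpus = [merge_pair(seq, best) for seq in corpus]
    return merges, corpus

# Input is already split into graphemes, so no merge can ever cut through an akshara.
words = [["ನಿ", "ಮ್ಮ"], ["ನಿ", "ಮ್ಮ", "ನ್ನು"]]
merges, corpus = learn_bpe(words, num_merges=1)
print(merges)
print(corpus)
```

Because merges only ever join whole graphemes, every learned token is a sequence of complete aksharas.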

---

# Training Data

Tokenizer training uses a **composite 4.5M-sentence Kannada corpus**:

1. **Samanantar Dataset** (AI4Bharat)  
2. **Kannada-Instruct Dataset** (Cognitive Lab)  

This provides broad coverage of conversational, literary, and instruction-following Kannada.

---

# Tokenizer Metrics

These metrics evaluate tokenizer quality independent of any downstream NLP model.

## **Compression Ratio (CR)**  
Higher = better (more text represented by fewer tokens)

## **Fertility Score (FS)**  
Lower = better (number of tokens produced per grapheme/character)
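
The exact formulas used for the reported numbers are not given here, so the following is only a sketch under common definitions: compression ratio as text size divided by token count, and fertility as token count divided by the number of reference units. The helper names and the `tokenize`, `size`, and `count_units` parameters are assumptions for illustration.

```python
def compression_ratio(texts, tokenize, size=len):
    """Text size per token; `size` defaults to character count (an assumption)."""
    total_size = sum(size(t) for t in texts)
    total_tokens = sum(len(tokenize(t)) for t in texts)
    return total_size / total_tokens

def fertility(texts, tokenize, count_units):
    """Tokens per reference unit; `count_units` decides what a unit is."""
    total_tokens = sum(len(tokenize(t)) for t in texts)
    total_units = sum(count_units(t) for t in texts)
    return total_tokens / total_units

# Toy check with a whitespace "tokenizer" standing in for a real one.
sample = ["ab cd", "ef"]
print(compression_ratio(sample, str.split))
print(fertility(sample, str.split, count_units=lambda t: len(t.split())))
```

With a real tokenizer, `tokenize` could be any callable that returns a list of tokens or IDs, and `count_units` could count graphemes, characters, or words depending on the definition being used.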

### **Results for CR and FS**

GAT consistently showed better compression ratio and fertility score across vocabulary sizes.

| Metric | 8k | 16k | 32k |
|--------|----|-----|-----|
| **CR** | 3.5 | 3.9 | 4.8 |
| **FS** | 2.1 | 1.8 | 1.6 |

---

# 💻 Usage Example

### Load the 32k tokenizer

```python
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained(
    "varuni/GAT-K",
    tokenizer_file="GAT_Kannada_32k.json"
)

# Encode a Kannada sentence ("What is your name?") into token IDs.
text = "ನಿಮ್ಮ ಹೆಸರು ಏನು?"
print(tokenizer.encode(text))
```

# Related Work

- M. Velayuthan and K. Sarveswaran, “Egalitarian Language Representation in Language Models: It All Begins with Tokenizers,” COLING 2025.  
  arXiv:2409.11501 [cs.CL]. DOI: https://doi.org/10.48550/arXiv.2409.11501

- Unicode Normalization and Grapheme Parsing of Indic Languages, 2023. [Online]. Available: https://arxiv.org/abs/2306.01743

- M. K. H. and A. Giri, “Orthographic Syllable Pair Encoding for Language Modelling Tasks in Indic Languages,” in 2023 IEEE MIT Undergraduate Research Technology Conference (URTC), Cambridge, MA, USA, pp. 1–6, 2023.  
  DOI: https://doi.org/10.1109/URTC60662.2023.10534970
 
- R. Sennrich, B. Haddow, and A. Birch, “Neural machine translation of rare words with subword units,” in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL), 2016.
  [Online]. Available: https://arxiv.org/abs/1508.07909