---
license: apache-2.0
---
# Overview
This repository contains a **Grapheme-Aware Tokenizer (GAT)** trained specifically for **Kannada** and designed to handle the language's orthographic and phonological structure. Unlike standard subword tokenizers such as BPE, SentencePiece, or WordPiece, which start from bytes or characters, GAT first segments text at the **grapheme level**, improving representation fidelity and reducing subword fragmentation.
---
# Available Vocabulary Sizes
This repository includes **three tokenizer variants**:
| Vocabulary | File |
|------------|-------|
| **8k** | `GAT_Kannada_8k.json` |
| **16k** | `GAT_Kannada_16k.json` |
| **32k** | `GAT_Kannada_32k.json` *(recommended)* |
---
# Why Grapheme-Aware Preprocessing?
Kannada is written in an **abugida** script, in which a single grapheme (akshara) may be composed of:
- multiple consonants
- a halant (virama)
- vowel diacritics (matra)
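For example:
ಕ್ರಿ
is **one grapheme**, but it is composed of four Unicode code points: the consonant ka (U+0C95), the halant (U+0CCD), the consonant ra (U+0CB0), and the vowel sign i (U+0CBF).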
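
To make this concrete, the short snippet below lists the code points behind that akshara. It is only an illustration using the Python standard library and is not part of the GAT code; the variable names are chosen for this example.

```python
import unicodedata

# The akshara "kri", written with escapes so each code point is explicit.
akshara = "\u0C95\u0CCD\u0CB0\u0CBF"

# Pair every code point with its official Unicode character name.
code_points = [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in akshara]

for cp, name in code_points:
    print(cp, name)
```

A tokenizer that starts from code points (or from bytes) sees four separate symbols here, even though a reader sees a single written unit.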
### Problem with BPE / SentencePiece / WordPiece
These tokenizers operate at the byte or character level, so a single akshara can be split apart:
ಕ್ + ರ್ + ಿ → 3–4 fragments
### GAT Solution
GAT applies a **custom grapheme parser** that merges the components into **one atomic unit**. This results in:
- stable semantic units
- better compression
- more efficient tokenization

The parser is rule-based and finite-state, and it correctly handles:
- consonants
- vowels
- halants
- vowel signs
- anusvara & visarga
<p align="center">
<img src="./GAT-algo.png" width="650"/>
</p>
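
The rules listed above can be sketched in a few lines of Python. This is a minimal illustration written for this README, not the GAT parser itself: the function names are hypothetical, the character classes are taken from the Kannada Unicode block, and edge cases such as ZWJ/ZWNJ or digits are deliberately ignored.

```python
HALANT = "\u0CCD"  # Kannada virama

def is_consonant(ch):
    # Kannada consonants KA..HA, plus the archaic letter U+0CDE
    return "\u0C95" <= ch <= "\u0CB9" or ch == "\u0CDE"

def is_base(ch):
    # Independent vowels (U+0C85..U+0C94) or consonants can start an akshara
    return "\u0C85" <= ch <= "\u0C94" or is_consonant(ch)

def is_dependent_sign(ch):
    # Vowel signs, anusvara, visarga, nukta, and length marks
    return "\u0CBE" <= ch <= "\u0CCC" or ch in "\u0C82\u0C83\u0CBC\u0CD5\u0CD6"

def segment_graphemes(text):
    """Split text into akshara-like clusters; other characters stay single."""
    clusters = []
    for ch in text:
        prev = clusters[-1] if clusters else ""
        on_base = bool(prev) and is_base(prev[0])
        if on_base and (ch == HALANT or is_dependent_sign(ch)):
            clusters[-1] += ch      # sign or halant joins the current cluster
        elif on_base and is_consonant(ch) and prev.endswith(HALANT):
            clusters[-1] += ch      # consonant after a halant forms a conjunct
        else:
            clusters.append(ch)     # anything else starts a new cluster
    return clusters

# "kri" (4 code points) comes back as one cluster
print(segment_graphemes("\u0C95\u0CCD\u0CB0\u0CBF"))
```

Each cluster returned by such a segmenter is then treated as one indivisible symbol by the later stages.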
After grapheme segmentation, **Byte Pair Encoding (BPE)** is applied to learn higher-level merges.
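
As a rough illustration of this second stage, the sketch below learns BPE merges over sequences of atomic units instead of characters. It is a simplified textbook-style version with hypothetical names (`learn_bpe`, `words`), not the training code used for the released tokenizers, and it uses ASCII placeholders where real input would be grapheme clusters.

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Learn BPE merges over words given as sequences of atomic units."""
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent pair of units occurs
        pairs = Counter()
        for seq, freq in corpus.items():
            for a, b in zip(seq, seq[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite the corpus with the best pair fused into a single unit
        merged = Counter()
        for seq, freq in corpus.items():
            out, i = [], 0
            while i < len(seq):
                if i + 1 < len(seq) and (seq[i], seq[i + 1]) == best:
                    out.append(seq[i] + seq[i + 1])
                    i += 2
                else:
                    out.append(seq[i])
                    i += 1
            merged[tuple(out)] += freq
        corpus = merged
    return merges

# Placeholder units; with GAT these would be grapheme clusters
words = [("a", "b", "c"), ("a", "b", "d"), ("a", "b")]
print(learn_bpe(words, 1))
```

Because the starting units are whole graphemes, no merge can ever produce a token that ends in the middle of an akshara.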
---
# Training Data
Tokenizer training uses a **composite 4.5M-sentence Kannada corpus**:
1. **Samanantar Dataset** (AI4Bharat)
2. **Kannada-Instruct Dataset** (Cognitive Lab)
This provides broad coverage of conversational, literary, and instruction-following Kannada.
---
# Tokenizer Metrics
These metrics evaluate tokenizer quality independent of any downstream NLP model.
## **Compression Ratio (CR)**
Higher = better (the same text is encoded in fewer tokens)
## **Fertility Score (FS)**
Lower = better (number of tokens produced per grapheme/character)
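
The exact formulas are not spelled out here, so the sketch below rests on stated assumptions: CR is taken as input characters per produced token, and FS as tokens per counted unit (graphemes or characters, as described above). The function names are hypothetical and the inputs are toy values, not measurements.

```python
def compression_ratio(text, tokens):
    """Characters of input text per produced token (assumed definition)."""
    return len(text) / len(tokens)

def fertility(tokens, units):
    """Tokens produced per counted unit (assumed definition)."""
    return len(tokens) / len(units)

# Toy example: 8 characters encoded as 2 tokens
text = "abcdefgh"
tokens = ["abcd", "efgh"]
print(compression_ratio(text, tokens))  # characters per token
print(fertility(tokens, list(text)))    # tokens per character
```

If the reported scores were computed per byte or per word instead, only the denominator changes; the direction of "better" stays the same.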
### **Results for CR and FS**
GAT consistently showed a better compression ratio and fertility score across vocabulary sizes:
- CR: 3.5 → 3.9 → 4.8
- FS: 2.1 → 1.8 → 1.6

---
# 💻 Usage Example
### Load the 32k tokenizer
```python
from transformers import PreTrainedTokenizerFast

# Load the 32k variant from the repository
tokenizer = PreTrainedTokenizerFast.from_pretrained(
    "varuni/GAT-K",
    tokenizer_file="GAT_Kannada_32k.json",
)

text = "ನಿಮ್ಮ ಹೆಸರು ಏನು?"
print(tokenizer.encode(text))  # prints the list of token IDs
```
# Related Work
- M. Velayuthan and K. Sarveswaran, “Egalitarian Language Representation in Language Models: It All Begins with Tokenizers,” COLING 2025. arXiv:2409.11501 [cs.CL]. DOI: https://doi.org/10.48550/arXiv.2409.11501
- “Unicode Normalization and Grapheme Parsing of Indic Languages,” 2023. [Online]. Available: https://arxiv.org/abs/2306.01743
- M. K. H. and A. Giri, “Orthographic Syllable Pair Encoding for Language Modelling Tasks in Indic Languages,” in 2023 IEEE MIT Undergraduate Research Technology Conference (URTC), Cambridge, MA, USA, pp. 1–6, 2023. DOI: https://doi.org/10.1109/URTC60662.2023.10534970
- R. Sennrich, B. Haddow, and A. Birch, “Neural machine translation of rare words with subword units,” in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL), 2016. [Online]. Available: https://arxiv.org/abs/1508.07909