---
license: apache-2.0
---

# Overview

This repository contains a **Grapheme-Aware Tokenizer (GAT)** trained specifically for **Kannada** and designed to handle the orthographic and phonological structure of the language. Unlike traditional subword tokenizers such as BPE, SentencePiece, or WordPiece, this tokenizer operates at the **grapheme level**, improving representation fidelity and reducing subword fragmentation.

---

# Available Vocabulary Sizes

This repository includes **three tokenizer variants**:

| Vocabulary | File |
|------------|------|
| **8k** | `GAT_Kannada_8k.json` |
| **16k** | `GAT_Kannada_16k.json` |
| **32k** | `GAT_Kannada_32k.json` *(recommended)* |

---

# Why Grapheme-Aware Preprocessing?

Kannada is an **abugida** script in which a single grapheme (akshara) may be composed of:

- multiple consonants
- a halant (virama)
- vowel diacritics (matra)

For example, ಕ್ರಿ is **one grapheme**, but it is composed of multiple Unicode code points.

### Problem with BPE / SentencePiece / WordPiece

These tokenizers operate at the byte or character level, so a single akshara can be split apart:

ಕ್ + ರ್ + ಿ → 3–4 fragments

### GAT Solution

GAT applies a **custom grapheme parser** that merges the components into **one atomic unit**. This results in:

- stable semantic units
- better compression
- more efficient tokenization

GAT uses a rule-based finite-state parser that correctly handles:

- consonants
- vowels
- halants
- vowel signs
- anusvara & visarga
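The grapheme parsing described above can be illustrated with a minimal sketch. This is not the parser shipped with GAT; it is a simplified approximation that assumes the standard Kannada Unicode block (U+0C80 to U+0CFF) and ignores special cases such as ZWJ/ZWNJ. The function names `is_consonant`, `is_dependent_mark`, and `split_graphemes` are hypothetical and chosen only for this example.

```python
from typing import List

VIRAMA = "\u0ccd"  # Kannada halant (virama)


def is_consonant(ch: str) -> bool:
    # Kannada consonants KA..HA, plus the archaic letter FA (U+0CDE).
    return "\u0c95" <= ch <= "\u0cb9" or ch == "\u0cde"


def is_dependent_mark(ch: str) -> bool:
    # Marks that always attach to the preceding cluster.
    return (
        "\u0cbe" <= ch <= "\u0ccc"   # vowel signs (matras)
        or ch in "\u0cd5\u0cd6"      # length marks
        or ch in "\u0c82\u0c83"      # anusvara, visarga
        or ch == "\u0cbc"            # nukta
        or ch == VIRAMA
    )


def split_graphemes(text: str) -> List[str]:
    """Group code points into akshara-like clusters (simplified rules)."""
    clusters: List[str] = []
    for ch in text:
        if clusters and is_dependent_mark(ch):
            # Matra, virama, anusvara, or visarga: extend the current cluster.
            clusters[-1] += ch
        elif clusters and is_consonant(ch) and clusters[-1].endswith(VIRAMA):
            # Consonant following a virama: continue the conjunct.
            clusters[-1] += ch
        else:
            # Anything else (vowel, new consonant, space, punctuation)
            # starts a new cluster.
            clusters.append(ch)
    return clusters


akshara = "\u0c95\u0ccd\u0cb0\u0cbf"   # ಕ್ರಿ written as four code points
print(len(akshara))                    # 4
print(len(split_graphemes(akshara)))   # 1
```

In this sketch the four code points of ಕ್ರಿ end up in a single cluster, which is the kind of atomic unit a later merge step (such as BPE) could then operate on.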

After grapheme segmentation, **Byte Pair Encoding (BPE)** is applied to learn higher-level merges.

---

# Training Data

Tokenizer training uses a **composite 4.5M-sentence Kannada corpus**:

1. **Samanantar Dataset** (AI4Bharat)
2. **Kannada-Instruct Dataset** (Cognitive Lab)

This provides broad coverage of conversational, literary, and instruction-following Kannada.

---

# Tokenizer Metrics

These metrics evaluate tokenizer quality independently of any downstream NLP model.

## **Compression Ratio (CR)**

Higher = better (more text is represented by fewer tokens)

## **Fertility Score (FS)**

Lower = better (number of tokens produced per grapheme/character)

### **Results for CR and FS**

GAT consistently showed better compression ratio and fertility score across vocabulary sizes:

- CR: 3.5 → 3.9 → 4.8
- FS: 2.1 → 1.8 → 1.6

---

# 💻 Usage Example

### Load the 32k tokenizer

```python
from transformers import PreTrainedTokenizerFast

# Load the 32k variant from the Hugging Face Hub repository.
tokenizer = PreTrainedTokenizerFast.from_pretrained(
    "varuni/GAT-K",
    tokenizer_file="GAT_Kannada_32k.json"
)

# Encode a Kannada sentence and print the resulting token IDs.
text = "ನಿಮ್ಮ ಹೆಸರು ಏನು?"
print(tokenizer.encode(text))
```

# Related Work

- M. Velayuthan and K. Sarveswaran, “Egalitarian Language Representation in Language Models: It All Begins with Tokenizers,” COLING 2025. arXiv:2409.11501 [cs.CL]. DOI: https://doi.org/10.48550/arXiv.2409.11501
- “Unicode Normalization and Grapheme Parsing of Indic Languages,” 2023. [Online]. Available: https://arxiv.org/abs/2306.01743
- M. K. H. and A. Giri, “Orthographic Syllable Pair Encoding for Language Modelling Tasks in Indic Languages,” in 2023 IEEE MIT Undergraduate Research Technology Conference (URTC), Cambridge, MA, USA, pp. 1–6, 2023. DOI: https://doi.org/10.1109/URTC60662.2023.10534970
- R. Sennrich, B. Haddow, and A. Birch, “Neural Machine Translation of Rare Words with Subword Units,” in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL), 2016. [Online]. Available: https://arxiv.org/abs/1508.07909