|
|
| --- |
| license: apache-2.0 |
| --- |
| |
| # Overview |
|
|
| This repository contains a **Grapheme-Aware Tokenizer (GAT)** trained specifically for **Kannada** and designed to handle the language's orthographic and phonological structure. Unlike conventional subword tokenizers such as BPE, SentencePiece, or WordPiece, which start from bytes or characters, this tokenizer treats the **grapheme** as its atomic unit, improving representation fidelity and reducing subword fragmentation. |
|
|
| --- |
|
|
| # Available Vocabulary Sizes |
|
|
| This repository includes **three tokenizer variants**: |
|
|
| | Vocabulary | File | |
| |------------|-------| |
| | **8k** | `GAT_Kannada_8k.json` | |
| | **16k** | `GAT_Kannada_16k.json` | |
| | **32k** | `GAT_Kannada_32k.json` *(recommended)* | |
|
|
| --- |
|
|
| # Why Grapheme-Aware Preprocessing? |
|
|
| Kannada is written in an **abugida** script, where a single grapheme (akshara) may be composed of: |
|
|
| - multiple consonants |
| - a halant (virama) |
| - vowel diacritics (matra) |
|
|
| For example: |
|
|
| ಕ್ರಿ |
|
|
| is **one grapheme**, but it is composed of four Unicode code points: ಕ (U+0C95), ್ (U+0CCD, the halant), ರ (U+0CB0), and ಿ (U+0CBF). |
|
|
| ### Problem with BPE / SentencePiece / WordPiece |
|
|
| These tokenizers operate at the byte or character level, so they can split a single grapheme into several pieces: |
|
|
| ಕ್ + ರ + ಿ → 3–4 fragments |
|
|
| ### GAT Solution |
|
|
| GAT applies a **custom grapheme parser** that merges the components into **one atomic unit**. This results in: |
|
|
| - stable semantic units |
| - better compression |
| - more efficient tokenization |
|
|
| GAT uses a rule-based finite-state parser that correctly handles: |
|
|
| - consonants |
| - vowels |
| - halants |
| - vowel signs |
| - anusvara & visarga |
|
|
| <p align="center"> |
| <img src="./GAT-algo.png" width="650"/> |
| </p> |
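The parsing rules listed above can be illustrated with a minimal sketch. This is not the repository's parser: the function names, the code point ranges, and the handling of edge cases (ZWJ/ZWNJ and nukta are ignored) are simplifying assumptions made for illustration only.

```python
# Hypothetical, simplified grapheme (akshara) segmenter for Kannada.
# Assumption: input is NFC-normalized; ZWJ/ZWNJ and nukta are not handled.
VIRAMA = "\u0CCD"                            # halant
MODIFIERS = {"\u0C81", "\u0C82", "\u0C83"}   # candrabindu, anusvara, visarga


def is_consonant(ch):
    return "\u0C95" <= ch <= "\u0CB9" or ch == "\u0CDE"


def is_independent_vowel(ch):
    return "\u0C85" <= ch <= "\u0C94" or ch in ("\u0CE0", "\u0CE1")


def is_vowel_sign(ch):
    return "\u0CBE" <= ch <= "\u0CCC" or ch in ("\u0CD5", "\u0CD6", "\u0CE2", "\u0CE3")


def segment_graphemes(text):
    """Split text into clusters: consonant (halant consonant)* [halant | vowel signs] [modifiers]."""
    clusters, i, n = [], 0, len(text)
    while i < n:
        j = i + 1
        if is_consonant(text[i]):
            # Absorb conjuncts: every "halant + consonant" pair joins the cluster.
            while j + 1 < n and text[j] == VIRAMA and is_consonant(text[j + 1]):
                j += 2
            if j < n and text[j] == VIRAMA:
                j += 1                       # trailing halant (dead consonant)
            else:
                while j < n and is_vowel_sign(text[j]):
                    j += 1                   # dependent vowel sign(s)
        if is_consonant(text[i]) or is_independent_vowel(text[i]):
            while j < n and text[j] in MODIFIERS:
                j += 1                       # anusvara / visarga
        clusters.append(text[i:j])           # anything else stays a single character
        i = j
    return clusters


word = "ಕ್ರಿ"
print([f"U+{ord(c):04X}" for c in word])     # the individual code points
print(segment_graphemes(word))               # the same text as grapheme clusters
print(segment_graphemes("ನಿಮ್ಮ ಹೆಸರು"))
```

With this sketch, the code points of ಕ್ರಿ come back as a single cluster, which is the unit that the later merge step operates on.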
|
|
| After grapheme segmentation, **Byte Pair Encoding (BPE)** is applied to learn higher-level merges. |
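As a rough illustration of BPE running on top of graphemes, the sketch below learns merges over grapheme sequences instead of characters or bytes. It is a toy, not the training pipeline used for this repository: the `GRAPHEME` regex is a simplified stand-in for GAT's parser, and `learn_bpe`, `merge_pair`, and the three-word corpus are hypothetical names and data chosen for the example.

```python
import re
from collections import Counter

# Simplified stand-in for a grapheme parser (assumption, not GAT's actual rules):
# consonant (halant consonant)* [halant | vowel signs], or an independent vowel,
# optionally followed by anusvara/visarga; any other character is its own unit.
GRAPHEME = re.compile(
    r"(?:[\u0C95-\u0CB9\u0CDE](?:\u0CCD[\u0C95-\u0CB9\u0CDE])*"
    r"(?:\u0CCD|[\u0CBE-\u0CCC\u0CD5\u0CD6]*)"
    r"|[\u0C85-\u0C94])[\u0C82\u0C83]*|.",
    re.DOTALL,
)


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with the merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merges, starting from grapheme sequences."""
    vocab = Counter(tuple(GRAPHEME.findall(w)) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair wins; ties are broken by the pair itself for determinism.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        merged = Counter()
        for symbols, freq in vocab.items():
            merged[merge_pair(symbols, best)] += freq
        vocab = merged
    return merges


toy_corpus = ["ನಿಮ್ಮ", "ನಿಮ್ಮ", "ನಿನ್ನ"]
print(learn_bpe(toy_corpus, num_merges=2))
```

Because the initial symbols are whole graphemes, no merge can ever produce a token that ends in the middle of an akshara.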
|
|
| --- |
|
|
| # Training Data |
|
|
| Tokenizer training uses a **composite 4.5M-sentence Kannada corpus**: |
|
|
| 1. **Samanantar Dataset** (AI4Bharat) |
| 2. **Kannada-Instruct Dataset** (Cognitive Lab) |
|
|
| This provides broad coverage of conversational, literary, and instruction-following Kannada. |
|
|
| --- |
|
|
| # Tokenizer Metrics |
|
|
| These metrics evaluate tokenizer quality independent of any downstream NLP model. |
|
|
| ## **Compression Ratio (CR)** |
| Higher = better (more text is packed into fewer tokens) |
|
|
| ## **Fertility Score (FS)** |
| Lower = better (#tokens produced per grapheme/character) |
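The exact formulas behind these two metrics are not spelled out here, so the following is only a sketch under stated assumptions: CR is taken as UTF-8 bytes per token, and FS as tokens per unit, where the unit counter (characters, graphemes, or words) is supplied by the caller. The function names and the whitespace stand-in tokenizer are hypothetical.

```python
def compression_ratio(texts, tokenize):
    """Assumed definition: total UTF-8 bytes divided by total tokens (higher = better)."""
    total_bytes = sum(len(t.encode("utf-8")) for t in texts)
    total_tokens = sum(len(tokenize(t)) for t in texts)
    return total_bytes / total_tokens


def fertility_score(texts, tokenize, count_units):
    """Assumed definition: total tokens divided by total units (lower = better)."""
    total_tokens = sum(len(tokenize(t)) for t in texts)
    total_units = sum(count_units(t) for t in texts)
    return total_tokens / total_units


corpus = ["ನಿಮ್ಮ ಹೆಸರು ಏನು?"]
toy_tokenize = str.split          # stand-in; replace with a real tokenizer's tokenize function
print(compression_ratio(corpus, toy_tokenize))
print(fertility_score(corpus, toy_tokenize, count_units=len))  # units = code points
```

To score one of the tokenizers in this repository, `toy_tokenize` would be swapped for the loaded tokenizer's own tokenize method and `count_units` for a grapheme or character counter.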
|
|
| ### **Results for CR and FS** |
|
|
| GAT consistently showed better compression ratio and fertility score across vocabulary sizes. |
|
|
| - **CR**: 3.5 → 3.9 → 4.8 |
| - **FS**: 2.1 → 1.8 → 1.6 |
|
|
| --- |
|
|
| # 💻 Usage Example |
|
|
| ### Load the 32k tokenizer |
|
|
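| ```python |
| from transformers import PreTrainedTokenizerFast |
| |
| # Load the 32k variant from the Hub repository. |
| tokenizer = PreTrainedTokenizerFast.from_pretrained( |
|     "varuni/GAT-K", |
|     tokenizer_file="GAT_Kannada_32k.json" |
| ) |
| |
| text = "ನಿಮ್ಮ ಹೆಸರು ಏನು?"  # "What is your name?" |
| ids = tokenizer.encode(text)  # token IDs |
| print(ids) |
| print(tokenizer.convert_ids_to_tokens(ids))  # the corresponding token strings |
| ``` |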
|
| # Related Work |
|
|
| - M. Velayuthan and K. Sarveswaran, “Egalitarian Language Representation in Language Models: It All Begins with Tokenizers,” COLING 2025. |
| arXiv:2409.11501 [cs.CL]. DOI: https://doi.org/10.48550/arXiv.2409.11501 |
|
|
| - Unicode Normalization and Grapheme Parsing of Indic Languages, 2023. [Online]. Available: https://arxiv.org/abs/2306.01743 |
|
|
| - M. K. H. and A. Giri, “Orthographic Syllable Pair Encoding for Language Modelling Tasks in Indic Languages,” in 2023 IEEE MIT Undergraduate Research Technology Conference (URTC), Cambridge, MA, USA, pp. 1–6, 2023. |
| DOI: https://doi.org/10.1109/URTC60662.2023.10534970 |
| |
| - R. Sennrich, B. Haddow, and A. Birch, “Neural machine translation of rare words with subword units,” in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL), 2016. |
| [Online]. Available: https://arxiv.org/abs/1508.07909 |
|
|
|
|
|
|