varuni/GAT-K

varuni committed on Nov 10, 2025

Commit 8ce31dc (verified) · 1 Parent(s): 3ac27d4

Update README.md

Files changed (1)

README.md CHANGED

@@ -1,3 +1,13 @@
----
-license: apache-2.0
----

+---
+license: apache-2.0
+---
+## Overview
+This repository contains a **Grapheme-Aware Tokenizer (GAT)** trained for **Kannada** and designed around the language's orthographic and phonological structure. Unlike subword tokenizers such as BPE or WordPiece, it operates at the **grapheme level**, which improves representation fidelity and reduces tokenization imbalance across languages.
+For tokenizer training, we use a **composite corpus** consisting of:
+1. **[Samanantar dataset](https://github.com/AI4Bharat/indic-parallel-corpus)** by AI4Bharat
+2. **[Kannada-Instruct dataset](https://huggingface.co/datasets/Cognitive-Lab/Kannada-Instruct-dataset)** by Cognitive Lab
+
+Together, these datasets amount to approximately **4.5 million Kannada sentences**, covering general conversation, instruction-following data, and parallel text.
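
To illustrate what "grapheme level" means for Kannada, here is a minimal sketch that groups code points into approximate orthographic units using only the Python standard library. It is not the GAT implementation, which the README does not describe: the function name `split_graphemes` and the virama-joining rule are assumptions made for illustration, and full Unicode grapheme segmentation has more cases than this handles.

```python
import unicodedata

VIRAMA = "\u0CCD"  # KANNADA SIGN VIRAMA (halant), links consonants into conjuncts


def is_kannada_letter(ch):
    """True for letters (category Lo) inside the Kannada block U+0C80..U+0CFF."""
    return "\u0C80" <= ch <= "\u0CFF" and unicodedata.category(ch) == "Lo"


def split_graphemes(text):
    """Group code points into approximate Kannada grapheme units.

    Simplified, hypothetical rules:
      1. A combining mark (category Mn or Mc) attaches to the previous unit.
      2. A Kannada letter that follows a virama attaches to the previous unit,
         so consonant conjuncts stay together.
      3. Anything else starts a new unit.
    """
    units = []
    for ch in text:
        is_mark = unicodedata.category(ch) in ("Mn", "Mc")
        after_virama = bool(units) and units[-1].endswith(VIRAMA)
        if units and (is_mark or (after_virama and is_kannada_letter(ch))):
            units[-1] += ch
        else:
            units.append(ch)
    return units


# The word "Kannada" in Kannada script: KA, NA, VIRAMA, NA, DDA (5 code points).
word = "\u0C95\u0CA8\u0CCD\u0CA8\u0CA1"
units = split_graphemes(word)
print(len(word), len(units))  # 5 code points, 3 units
```

A byte-level or code-point-level vocabulary would see five separate symbols in this word, while the sketch above keeps the conjunct (NA, VIRAMA, NA) as a single unit, which is the kind of boundary a grapheme-aware tokenizer is meant to respect.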