varuni committed on
Commit 19ef440 · verified
1 Parent(s): 8ce31dc

Update README.md

Files changed (1)
  1. README.md +2 -2
README.md CHANGED
@@ -3,11 +3,11 @@ license: apache-2.0
  ---
  ## Overview
 
- This repository contains a **Grapheme-Aware Tokenizer (GAT)** specifically trained for **Kannada**, designed to handle the unique orthographic and phonological structure of the language. Unlike traditional subword tokenizers such as BPE or WordPiece, this tokenizer operates at the **grapheme level**, improving representation fidelity and reducing tokenization imbalance across languages.
+ This repository contains a **Grapheme-Aware Tokenizer (GAT)** specifically trained for **Kannada**, designed to handle the unique orthographic and phonological structure of the language. Unlike traditional subword tokenizers such as BPE or WordPiece, this tokenizer operates at the **grapheme level**, improving representation fidelity and reducing tokenization imbalance in token count.
 
  For tokenizer training, we utilize a **composite corpus** consisting of:
 
  1. **[Samanantar dataset](https://github.com/AI4Bharat/indic-parallel-corpus)** by AI4Bharat
  2. **[Kannada-Instruct dataset](https://huggingface.co/datasets/Cognitive-Lab/Kannada-Instruct-dataset)** by Cognitive Lab
 
- Together, these datasets amount to approximately **4.5 million Kannada sentences**, covering a wide range of topics including general conversation, instruction-following data, and parallel text.
+ Together, these datasets amount to approximately **4.5 million Kannada sentences**
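
To illustrate what operating at the grapheme level can mean for Kannada text, here is a minimal sketch that groups each base character with the combining marks (vowel signs, virama, anusvara) that follow it. The README does not describe how GAT actually segments text, so the function name `grapheme_clusters` and the grouping rule are assumptions made for illustration only. The sketch is a simplification of Unicode grapheme clustering and not the tokenizer's implementation.

```python
import unicodedata


def grapheme_clusters(text: str) -> list[str]:
    """Group each base character with the combining marks that follow it.

    Hypothetical helper: a simplified approximation of grapheme clustering,
    not the segmentation rule used by the GAT tokenizer.
    """
    clusters: list[str] = []
    for ch in text:
        # General categories Mn/Mc/Me cover Kannada vowel signs, virama, and anusvara.
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch  # attach the mark to the preceding base character
        else:
            clusters.append(ch)  # start a new cluster
    return clusters


# The word "Kannada" in Kannada script: KA, NA, VIRAMA, NA, DDA (five code points).
word = "\u0c95\u0ca8\u0ccd\u0ca8\u0ca1"
print(grapheme_clusters(word))
```

Under this rule the five code points of the sample word fall into four clusters, with the virama attached to the consonant before it. A production segmenter would more likely follow the full Unicode text segmentation rules, and might additionally join conjunct consonants into a single unit.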