Update README.md
README.md
CHANGED
@@ -3,11 +3,11 @@ license: apache-2.0
 ---
 ## Overview
 
-This repository contains a **Grapheme-Aware Tokenizer (GAT)** specifically trained for **Kannada**, designed to handle the unique orthographic and phonological structure of the language. Unlike traditional subword tokenizers such as BPE or WordPiece, this tokenizer operates at the **grapheme level**, improving representation fidelity and reducing tokenization imbalance
+This repository contains a **Grapheme-Aware Tokenizer (GAT)** specifically trained for **Kannada**, designed to handle the unique orthographic and phonological structure of the language. Unlike traditional subword tokenizers such as BPE or WordPiece, this tokenizer operates at the **grapheme level**, improving representation fidelity and reducing tokenization imbalance in token count.
 
 For tokenizer training, we utilize a **composite corpus** consisting of:
 
 1. **[Samanantar dataset](https://github.com/AI4Bharat/indic-parallel-corpus)** by AI4Bharat
 2. **[Kannada-Instruct dataset](https://huggingface.co/datasets/Cognitive-Lab/Kannada-Instruct-dataset)** by Cognitive Lab
 
-Together, these datasets amount to approximately **4.5 million Kannada sentences**
+Together, these datasets amount to approximately **4.5 million Kannada sentences**
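
To make the grapheme-level idea in the Overview concrete, here is a minimal Python sketch of splitting Kannada text into grapheme-like clusters. It is not the GAT implementation from this repository: the function name `split_graphemes`, the `join_conjuncts` flag, and the rule of attaching a consonant that follows a virama are illustrative assumptions, and the grouping is a simplification of the full Unicode grapheme cluster rules.

```python
import unicodedata

VIRAMA = "\u0ccd"  # KANNADA SIGN VIRAMA (halant)


def is_kannada_consonant(ch):
    # Assumption: treat U+0C95..U+0CB9 (KA..HA) as the consonant range.
    return "\u0c95" <= ch <= "\u0cb9"


def split_graphemes(text, join_conjuncts=True):
    """Group each base character with the combining marks that follow it.

    If join_conjuncts is True, a consonant that directly follows a virama
    is attached to the previous cluster, so a conjunct such as "ನ್ನ"
    stays in one piece. This is a hypothetical rule for illustration.
    """
    clusters = []
    for ch in text:
        # Vowel signs, anusvara, visarga and virama are in categories Mn/Mc.
        is_mark = unicodedata.category(ch).startswith("M")
        after_virama = bool(clusters) and clusters[-1].endswith(VIRAMA)
        joins_conjunct = (
            join_conjuncts and after_virama and is_kannada_consonant(ch)
        )
        if clusters and (is_mark or joins_conjunct):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


word = "\u0c95\u0ca8\u0ccd\u0ca8\u0ca1"  # "ಕನ್ನಡ" (Kannada)
print(split_graphemes(word))                        # ['ಕ', 'ನ್ನ', 'ಡ']
print(split_graphemes(word, join_conjuncts=False))  # ['ಕ', 'ನ್', 'ನ', 'ಡ']
```

The word has five code points but only three clusters when conjuncts are joined, which is the kind of unit a grapheme-aware tokenizer would plausibly treat as atomic instead of splitting a vowel sign or virama away from its base consonant.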