prior work
README.md
CHANGED
@@ -21,7 +21,7 @@ This repository includes **three tokenizer variants**:
 
 ---
 
-# Why Grapheme-Aware
+# Why Grapheme-Aware Preprocessing?
 
 Kannada is an **Abugida** script where a single grapheme (akshara) may be composed of:
 
@@ -114,3 +114,11 @@ tokenizer = PreTrainedTokenizerFast.from_pretrained(
 
 text = "ನಿಮ್ಮ ಹೆಸರು ಏನು?"
 print(tokenizer.encode(text))
+
+Related work :
+
+- M. Velayuthan and K. Sarveswaran, “Egalitarian Language Representation in Language Models: It All Begins with Tokenizers,” COLING 2025.
+arXiv:2409.11501 [cs.CL], DOI: https://doi.org/10.48550/arXiv.2409.11501
+- M. K. H and A. Giri, "Orthographic Syllable Pair Encoding for Language modelling tasks in Indic Languages," 2023 IEEE MIT Undergraduate Research Technology Conference (URTC), Cambridge, MA, USA, 2023, pp. 1-6, doi: 10.1109/URTC60662.2023.10534970. keywords: {Shape;Encoding;Data models;Compounds;Task analysis;Tokenization;Indic Languages;Language modelling;Large Language Models},
+
+
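
As a rough illustration of the grapheme-aware preprocessing named in the changed heading, the sketch below groups Kannada code points into akshara-like clusters using only the Python standard library. It is not taken from this repository: the function name `split_aksharas` and the joining rules are assumptions made for illustration. The rules are deliberately simplified: combining marks and zero-width joiners attach to the preceding cluster, and a letter that follows a virama joins it to form a conjunct.

```python
import unicodedata

VIRAMA = "\u0CCD"               # KANNADA SIGN VIRAMA (halant)
JOINERS = {"\u200C", "\u200D"}  # ZWNJ and ZWJ


def split_aksharas(text):
    """Group code points into akshara-like clusters (simplified, hypothetical helper)."""
    clusters = []
    for ch in text:
        category = unicodedata.category(ch)
        # Vowel signs, the virama, anusvara, etc. are combining marks (Mn/Mc).
        is_mark = category.startswith("M")
        # A letter directly after a virama continues a consonant conjunct.
        joins_conjunct = (
            bool(clusters)
            and clusters[-1].endswith(VIRAMA)
            and category == "Lo"
        )
        if clusters and (is_mark or ch in JOINERS or joins_conjunct):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


if __name__ == "__main__":
    text = "ನಿಮ್ಮ ಹೆಸರು ಏನು?"
    print(split_aksharas(text))
```

Under these assumed rules, a conjunct such as "ಮ್ಮ" stays in one cluster instead of being split at the virama, which is the kind of unit a grapheme-aware tokenizer would want to keep intact. A production implementation would more likely follow the Unicode text segmentation rules than this hand-written approximation.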