varuni/GAT-K

varuni committed on Apr 18

Commit 73e50db (verified)

1 Parent(s): 752438a

prior work

Files changed (1)

README.md CHANGED

@@ -21,7 +21,7 @@ This repository includes **three tokenizer variants**:
 ---
-# Why Grapheme-Aware Tokenization?
+# Why Grapheme-Aware Preprocessing?
 Kannada is an **Abugida** script where a single grapheme (akshara) may be composed of:
@@ -114,3 +114,11 @@ tokenizer = PreTrainedTokenizerFast.from_pretrained(
 text = "ನಿಮ್ಮ ಹೆಸರು ಏನು?"
 print(tokenizer.encode(text))
+Related work:
+- M. Velayuthan and K. Sarveswaran, "Egalitarian Language Representation in Language Models: It All Begins with Tokenizers," COLING 2025.
+  arXiv:2409.11501 [cs.CL], DOI: https://doi.org/10.48550/arXiv.2409.11501
+- M. K. H and A. Giri, "Orthographic Syllable Pair Encoding for Language modelling tasks in Indic Languages," 2023 IEEE MIT Undergraduate Research Technology Conference (URTC), Cambridge, MA, USA, 2023, pp. 1-6, doi: 10.1109/URTC60662.2023.10534970.