varuni committed on
Commit 73e50db · verified · 1 Parent(s): 752438a

prior work

Files changed (1)
  1. README.md +9 -1
README.md CHANGED
@@ -21,7 +21,7 @@ This repository includes **three tokenizer variants**:
 
 ---
 
-# Why Grapheme-Aware Tokenization?
+# Why Grapheme-Aware Preprocessing?
 
 Kannada is an **Abugida** script where a single grapheme (akshara) may be composed of:
 
@@ -114,3 +114,11 @@ tokenizer = PreTrainedTokenizerFast.from_pretrained(
 
 text = "ನಿಮ್ಮ ಹೆಸರು ಏನು?"
 print(tokenizer.encode(text))
+
+Related work :
+
+- M. Velayuthan and K. Sarveswaran, “Egalitarian Language Representation in Language Models: It All Begins with Tokenizers,” COLING 2025.
+arXiv:2409.11501 [cs.CL], DOI: https://doi.org/10.48550/arXiv.2409.11501
+- M. K. H and A. Giri, "Orthographic Syllable Pair Encoding for Language modelling tasks in Indic Languages," 2023 IEEE MIT Undergraduate Research Technology Conference (URTC), Cambridge, MA, USA, 2023, pp. 1-6, doi: 10.1109/URTC60662.2023.10534970. keywords: {Shape;Encoding;Data models;Compounds;Task analysis;Tokenization;Indic Languages;Language modelling;Large Language Models},
+
+
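Note on the renamed heading: as a rough illustration of what "grapheme-aware preprocessing" could mean for the sample text in the README, the sketch below groups Kannada code points into akshara-like clusters before any tokenizer sees them. It is not taken from this repository. The function name `split_aksharas`, the consonant range, and the joining rules are assumptions made for illustration, and the logic is a simplified approximation rather than full Unicode extended grapheme cluster segmentation.

```python
import unicodedata

VIRAMA = "\u0CCD"             # Kannada sign virama (halant)
JOINERS = {"\u200C", "\u200D"}  # ZWNJ / ZWJ


def is_kannada_consonant(ch):
    # Assumption: treat U+0C95..U+0CB9 (KA..HA) as the consonant range.
    return "\u0C95" <= ch <= "\u0CB9"


def split_aksharas(text):
    """Hypothetical helper: split text into akshara-like clusters.

    A character is attached to the previous cluster when it is a
    combining mark (vowel sign, virama, anusvara, ...), a joiner, or a
    consonant that directly follows a virama (a conjunct).
    """
    clusters = []
    for ch in text:
        attach = bool(clusters) and (
            unicodedata.category(ch) in ("Mn", "Mc")
            or ch in JOINERS
            or (clusters[-1].endswith(VIRAMA) and is_kannada_consonant(ch))
        )
        if attach:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


text = "ನಿಮ್ಮ ಹೆಸರು ಏನು?"
aksharas = split_aksharas(text)
print(aksharas)
```

Concatenating the returned clusters reproduces the input, so the split is lossless. A production pipeline would more likely rely on a full grapheme-cluster implementation, since this sketch ignores cases such as a joiner placed between a virama and the following consonant.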