Mike369williams committed on
Commit
79dd5bc
·
verified ·
1 Parent(s): 6c37197

Create tokenizer/README.md

  1. tokenizer/README.md +19 -0
tokenizer/README.md ADDED
@@ -0,0 +1,19 @@
+ # Sanchari Tokenizer
+
+ This folder contains scripts and placeholder artifacts for the Sanchari tokenizer.
+
+ The tokenizer is based on SentencePiece (Unigram or BPE) with a ~50k vocabulary optimized for:
+ - English (India)
+ - Hindi
+ - Telugu
+ - Mixed-script content
+ - Code + instruction-level text
+
+ Tokenization goals:
+ - Normalize Unicode (NFKC)
+ - Efficient segmentation for Indic languages
+ - Stable handling of whitespace, punctuation, emojis, and mixed-language text
+
+ Final tokenizer files (`sanchari_spm.model` and `sanchari_spm.vocab`) will be generated after dataset aggregation.
+
+ This version contains **placeholders only** for investor preview.
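
The README names NFKC normalization and a SentencePiece model with a ~50k vocabulary, but the commit ships no training script. The following is a minimal sketch of what such a setup might look like, not the project's actual pipeline. The corpus path, the `character_coverage` value, and the helper names `normalize_nfkc`, `spm_training_args`, and `train` are assumptions made for illustration. Only the `sanchari_spm` prefix, the vocabulary size, and the Unigram/BPE choice come from the README.

```python
import unicodedata


def normalize_nfkc(text: str) -> str:
    """Apply Unicode NFKC normalization.

    NFKC folds compatibility forms (ligatures, fullwidth letters and
    digits) into their canonical equivalents.
    """
    return unicodedata.normalize("NFKC", text)


def spm_training_args(corpus_path: str, model_type: str = "unigram") -> dict:
    """Build keyword arguments for SentencePiece training.

    Hypothetical helper: the README only states "Unigram or BPE" and
    "~50k vocabulary"; every other value here is an assumption.
    """
    if model_type not in ("unigram", "bpe"):
        raise ValueError("model_type must be 'unigram' or 'bpe'")
    return {
        "input": corpus_path,               # assumed: one sentence per line
        "model_prefix": "sanchari_spm",     # yields sanchari_spm.model / .vocab
        "vocab_size": 50000,                # "~50k" in the README
        "model_type": model_type,
        "character_coverage": 0.9995,       # assumed value, not from the README
        "normalization_rule_name": "nfkc",  # SentencePiece's built-in NFKC rule
    }


def train(corpus_path: str) -> None:
    """Train the model; needs the third-party `sentencepiece` package."""
    import sentencepiece as spm  # imported lazily so the helpers above work without it

    spm.SentencePieceTrainer.train(**spm_training_args(corpus_path))


if __name__ == "__main__":
    # Ligature and fullwidth characters are folded to plain ASCII.
    print(normalize_nfkc("ﬁle ＡＢＣ１２３"))
```

Calling `train("corpus.txt")` on an aggregated corpus (a hypothetical file name) would be expected to write `sanchari_spm.model` and `sanchari_spm.vocab` to the working directory. A corpus too small to support 50,000 pieces causes SentencePiece to reject the requested vocabulary size.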