# Sanchari Tokenizer
This folder contains scripts and placeholder artifacts for the Sanchari tokenizer.
The tokenizer is based on SentencePiece (Unigram or BPE) with a ~50k-token vocabulary optimized for:
- English (India)
- Hindi
- Telugu
- Mixed-script content
- Code + instruction-level text
Tokenization goals:
- Normalize Unicode (NFKC)
- Efficient segmentation for Indic languages
- Stable handling of whitespace, punctuation, emojis, and mixed-language text
Final tokenizer files (`sanchari_spm.model` and `sanchari_spm.vocab`) will be generated after dataset aggregation.
This version contains **placeholders only** for investor preview.