Sanchari Tokenizer
This folder contains scripts and placeholder artifacts for the Sanchari tokenizer.
The tokenizer is based on SentencePiece (Unigram or BPE) with a ~50k vocabulary optimized for:
- English (India)
- Hindi
- Telugu
- Mixed-script content
- Code + instruction-level text
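The setup above (SentencePiece, Unigram or BPE, ~50k vocabulary) could be sketched as a training configuration like the one below. This is a hypothetical sketch, not the project's actual script: the corpus path, character coverage, and byte-fallback flag are assumptions; only the vocabulary size and model type come from this README.

```python
# Hypothetical SentencePiece training config for the Sanchari tokenizer.
# Only vocab_size and model_type are stated in this README; every other
# value (corpus path, coverage, byte fallback) is an illustrative assumption.
train_config = {
    "input": "corpus/sanchari_mixed.txt",   # assumed aggregated corpus path
    "model_prefix": "sanchari_spm",         # produces sanchari_spm.model / .vocab
    "vocab_size": 50000,                    # ~50k vocabulary (from this README)
    "model_type": "unigram",                # or "bpe", per the README
    "normalization_rule_name": "nfkc",      # NFKC normalization (tokenization goal)
    "character_coverage": 0.9995,           # assumed: high coverage for Indic scripts
    "byte_fallback": True,                  # assumed: keep rare codepoints decodable
}

if __name__ == "__main__":
    # Deferred import so the config stays inspectable without the package.
    import sentencepiece as spm             # pip install sentencepiece
    spm.SentencePieceTrainer.train(**train_config)
```

A high `character_coverage` matters for Devanagari and Telugu, where dropping rare codepoints would otherwise force unknown-token fallbacks.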
Tokenization goals:
- Normalize Unicode (NFKC)
- Efficient segmentation for Indic languages
- Stable handling of whitespace, punctuation, emojis, and mixed-language text
Final tokenizer files (sanchari_spm.model and sanchari_spm.vocab) will be generated after dataset aggregation.
For the investor preview, this version contains placeholder files only.