# Sanchari Tokenizer

This folder contains scripts and placeholder artifacts for the Sanchari tokenizer.

The tokenizer is based on SentencePiece (Unigram or BPE) with a ~50k vocabulary optimized for:

- English (India)
- Hindi
- Telugu
- Mixed-script content
- Code and instruction-level text
Tokenization goals:

- Normalize Unicode (NFKC)
- Efficient segmentation for Indic languages
- Stable handling of whitespace, punctuation, emoji, and mixed-language text
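The NFKC normalization goal can be illustrated with Python's standard library; this is a minimal sketch of the intended preprocessing step, and the function name is illustrative rather than part of the released tokenizer:

```python
import unicodedata

def normalize_text(text: str) -> str:
    """Apply NFKC normalization, matching the tokenizer's stated preprocessing goal."""
    return unicodedata.normalize("NFKC", text)

# NFKC collapses compatibility characters to canonical forms,
# e.g. the ligature "ﬁ" becomes "fi" and fullwidth "Ｈｉ" becomes "Hi".
print(normalize_text("ﬁle"))  # -> file
print(normalize_text("Ｈｉ"))  # -> Hi
```

NFKC is useful here because mixed-script and copy-pasted text often contains fullwidth forms, ligatures, and compatibility variants that would otherwise fragment the vocabulary.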
Final tokenizer files (`sanchari_spm.model` and `sanchari_spm.vocab`) will be generated after dataset aggregation.

This version contains **placeholders only** for investor preview.
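Once the aggregated dataset exists, training could look like the following `spm_train` invocation. This is a sketch, not the finalized configuration: the corpus path is a placeholder, and the coverage value is an assumption chosen for Indic scripts.

```shell
# Placeholder corpus path; the real aggregated dataset is not yet available.
spm_train \
  --input=data/aggregated_corpus.txt \
  --model_prefix=sanchari_spm \
  --model_type=unigram \
  --vocab_size=50000 \
  --character_coverage=0.9995 \
  --normalization_rule_name=nfkc
```

`--model_prefix=sanchari_spm` produces the `sanchari_spm.model` and `sanchari_spm.vocab` files named above; swap `--model_type=unigram` for `bpe` if BPE is chosen instead.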