# Sanchari Tokenizer

This folder contains scripts and placeholder artifacts for the Sanchari tokenizer.

The tokenizer is based on SentencePiece (Unigram or BPE) with a ~50k vocabulary optimized for:

- English (India)
- Hindi
- Telugu
- Mixed-script content
- Code and instruction-level text

Tokenization goals:

- Normalize Unicode (NFKC)
- Efficient segmentation for Indic languages
- Stable handling of whitespace, punctuation, emojis, and mixed-language text

The final tokenizer files (`sanchari_spm.model` and `sanchari_spm.vocab`) will be generated after dataset aggregation. This version contains **placeholders only** for investor preview.
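Of the goals listed above, the NFKC normalization step can be illustrated with the Python standard library alone; the sketch below is illustrative (the function name `normalize_text` is not part of this repo), showing why NFKC matters for mixed-script Indic text:

```python
import unicodedata


def normalize_text(text: str) -> str:
    """Apply NFKC normalization, the Unicode preprocessing step
    named in the tokenization goals above."""
    return unicodedata.normalize("NFKC", text)


# Full-width Latin compatibility characters fold to plain ASCII,
# so visually identical strings map to one token sequence.
print(normalize_text("\uff34\uff45\uff4c\uff55\uff47\uff55"))  # "Telugu"

# Precomposed Devanagari letters with a nukta (e.g. U+0958 QA) are
# composition-excluded: NFKC yields base letter + combining nukta,
# giving a single stable byte sequence for equivalent inputs.
print(normalize_text("\u0958") == "\u0915\u093c")  # True
```

In an actual SentencePiece training run, the equivalent behavior would come from the trainer's normalization rule (e.g. `normalization_rule_name="nfkc"` in `spm.SentencePieceTrainer.train`), rather than from a separate preprocessing pass.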