# Sanchari Tokenizer

This folder contains scripts and placeholder artifacts for the Sanchari tokenizer.

The tokenizer is based on SentencePiece (Unigram or BPE) with a ~50k vocabulary optimized for:

- English (India)
- Hindi
- Telugu
- Mixed-script content
- Code and instruction-level text

Tokenization goals:

- Normalize Unicode (NFKC)
- Efficient segmentation for Indic languages
- Stable handling of whitespace, punctuation, emojis, and mixed-language text

The final tokenizer files (`sanchari_spm.model` and `sanchari_spm.vocab`) will be generated after dataset aggregation. This version contains **placeholders only** for investor preview.
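Of the goals listed above, the NFKC normalization step can be illustrated with the Python standard library alone; the sketch below is illustrative (the function name `normalize_text` is not part of this repo), showing why NFKC matters for mixed-script Indic text:

```python
import unicodedata


def normalize_text(text: str) -> str:
    """Apply NFKC normalization, the Unicode preprocessing step
    named in the tokenization goals above."""
    return unicodedata.normalize("NFKC", text)


# Full-width Latin compatibility characters fold to plain ASCII,
# so visually identical strings map to one token sequence.
print(normalize_text("\uff34\uff45\uff4c\uff55\uff47\uff55"))  # "Telugu"

# Precomposed Devanagari letters with a nukta (e.g. U+0958 QA) are
# composition-excluded: NFKC yields base letter + combining nukta,
# giving a single stable byte sequence for equivalent inputs.
print(normalize_text("\u0958") == "\u0915\u093c")  # True
```

In an actual SentencePiece training run, the equivalent behavior would come from the trainer's normalization rule (e.g. `normalization_rule_name="nfkc"` in `spm.SentencePieceTrainer.train`), rather than from a separate preprocessing pass.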