
# Sanchari Tokenizer

This folder contains scripts and placeholder artifacts for the Sanchari tokenizer.

The tokenizer is based on SentencePiece (Unigram or BPE) with a vocabulary of roughly 50,000 tokens, optimized for:

- English (India)
- Hindi
- Telugu
- Mixed-script content
- Code + instruction-level text

Tokenization goals:

- Normalize Unicode (NFKC)
- Efficient segmentation for Indic languages
- Stable handling of whitespace, punctuation, emojis, and mixed-language text
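The NFKC normalization goal above can be sketched with Python's standard `unicodedata` module (the `normalize` helper name is illustrative, not part of this repo):

```python
import unicodedata

def normalize(text: str) -> str:
    # NFKC-normalize text before tokenization, as the goals above describe.
    return unicodedata.normalize("NFKC", text)

# Compatibility characters collapse to their canonical forms:
assert normalize("ﬁle") == "file"          # U+FB01 "fi" ligature -> "fi"
assert normalize("Ｈｅｌｌｏ") == "Hello"    # full-width Latin -> ASCII
```

NFKC is what SentencePiece applies by default, which keeps visually equivalent inputs (ligatures, full-width forms) from producing different token sequences.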

The final tokenizer files (`sanchari_spm.model` and `sanchari_spm.vocab`) will be generated once dataset aggregation is complete.

This version contains placeholders only and is intended for investor preview.