Mike369williams/Sanchari

Mike369williams committed on Dec 7, 2025

Commit 79dd5bc · verified · 1 Parent(s): 6c37197

Create tokenizer/README.md

Files changed (1)

tokenizer/README.md +19 -0

	@@ -0,0 +1,19 @@

+# Sanchari Tokenizer
+
+This folder contains scripts and placeholder artifacts for the Sanchari tokenizer.
+
+The tokenizer is based on SentencePiece (Unigram or BPE) with a vocabulary of roughly 50k tokens, optimized for:
+- English (India)
+- Hindi
+- Telugu
+- Mixed-script content
+- Code and instruction-style text
+
+Tokenization goals:
+- Normalize Unicode (NFKC)
+- Efficient segmentation for Indic languages
+- Stable handling of whitespace, punctuation, emojis, and mixed-language text
+
+Final tokenizer files (`sanchari_spm.model` and `sanchari_spm.vocab`) will be generated after dataset aggregation.
+
+This version contains **placeholders only**, intended for the investor preview.
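
The NFKC normalization goal listed in the README can be illustrated with Python's standard `unicodedata` module. This is only a minimal sketch of what NFKC does to compatibility characters, not the repository's preprocessing code: the helper name `normalize_text` and the sample strings are assumptions made for illustration, and the real tokenizer may instead rely on SentencePiece's built-in normalization rules.

```python
import unicodedata


def normalize_text(text: str) -> str:
    """Apply Unicode NFKC normalization.

    Hypothetical helper for illustration; it is not part of the Sanchari repo.
    """
    return unicodedata.normalize("NFKC", text)


# Illustrative inputs (assumed, not taken from any Sanchari dataset):
samples = [
    "\uFF21\uFF22\uFF23\uFF11\uFF12\uFF13",  # fullwidth Latin letters and digits
    "\uFB01le",                              # "fi" ligature followed by "le"
    "a\u00A0b",                              # non-breaking space between letters
]

for raw in samples:
    # Show the raw and normalized forms side by side.
    print(repr(raw), "->", repr(normalize_text(raw)))
```

Normalizing before tokenization folds visually equivalent forms (fullwidth characters, ligatures, non-breaking spaces) into a single representation, so they need not occupy separate vocabulary entries.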