# Sanchari – Technical Blueprint (Full, Investor Grade)

> Version: v0.1
> Purpose: Complete technical blueprint to develop Sanchari-S → Sanchari-M → Sanchari-L
> Target audience: engineers, infrastructure teams, investors

---

## Summary (one line)
Build a practical, India-focused multilingual instruction-following LM family (S: ~200–350M, M: ~1–3B, L: 7B+) using modern efficient training (PyTorch + DeepSpeed/Accelerate + FlashAttention), with explicit data provenance, safety audits, and production deployment targets.

---

## 1. Design principles (what we deliver and why)
1. **Practicality first** – Sanchari-S must be cheap to train and fast to run for API and mobile use.
2. **Indian language competence** – prioritize Indian English, Hindi, Telugu, and mixed-script text.
3. **Safety & governance** – every training stage includes PII scrubbing and red-team testing before any checkpoint release.
4. **Modular pipeline** – tokenizer, preprocessing, training config, and adapter-based instruction tuning as separate, swappable stages.
5. **Deliverables at each phase** – private checkpoints for investors (under NDA), evaluation reports, and a Hugging Face repo + demo.

---

## 2. Model architecture & approach

### 2.1 Base model family
- **Sanchari-S:** ~200–350M parameters. Decoder-only transformer (GPT-like). Primary use: fast inference, API & mobile clients.
- **Sanchari-M:** ~1–3B parameters. Better instruction following and multi-turn coherence.
- **Sanchari-L:** ~7B+ parameters. Full foundation model for enterprise applications.

### 2.2 Model topology (recommended)
- **Type:** Decoder-only transformer.
- **Layer scaling (example recipes; sanity-checked in the sketch after this list):**
  - Sanchari-S: 24 layers × 16 heads × 1024 hidden (≈ 350M)
  - Sanchari-M: 24 layers × 16 heads × 2048 hidden (≈ 1.3B)
  - Sanchari-L: 32–40 layers × 32 heads × 4096 hidden (≈ 7–8B)
- **Attention:** FlashAttention-2 compatible (for GPU memory and speed).
- **Norms and embeddings:** RMSNorm or LayerNorm, rotary positional embeddings (RoPE) for stable long-range attention.
- **Efficiency:** Provide LoRA adapters and QLoRA options for cost-effective instruction tuning (a `peft` sketch follows the rationale note below).
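
As a sanity check on these recipes: a decoder-only transformer has roughly 12 · n_layers · d_model² non-embedding parameters, plus vocab_size · d_model for the embedding matrix. The sketch below applies that estimate to the example recipes; the 50k vocabulary matches the tokenizer plan in section 3, and the exact hyperparameters remain assumptions until the configs are frozen.

```python
# Back-of-the-envelope parameter counts for the example recipes.
# Assumption: standard GPT-style blocks, so non-embedding params
# are roughly 12 * n_layers * d_model**2 (attention + MLP).

VOCAB_SIZE = 50_000  # matches the ~50k tokenizer in section 3

RECIPES = {
    #             (n_layers, n_heads, d_model)
    "sanchari-s": (24, 16, 1024),
    "sanchari-m": (24, 16, 2048),
    "sanchari-l": (32, 32, 4096),
}

def approx_params_billions(n_layers: int, d_model: int) -> float:
    """Approximate total parameter count in billions."""
    non_embedding = 12 * n_layers * d_model ** 2
    embedding = VOCAB_SIZE * d_model
    return (non_embedding + embedding) / 1e9

for name, (n_layers, n_heads, d_model) in RECIPES.items():
    print(f"{name}: ~{approx_params_billions(n_layers, d_model):.2f}B "
          f"({n_layers}L x {n_heads}H x {d_model}d)")
# sanchari-s: ~0.35B, sanchari-m: ~1.31B, sanchari-l: ~6.65B
```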

> Rationale: decoder-only is the industry standard for instruction following and is easier to deploy as an API.
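
For the LoRA/QLoRA option noted in the Efficiency bullet, instruction tuning could be wired up with Hugging Face `peft` roughly as below. This is a minimal sketch: the model id `sanchari/sanchari-s` is a placeholder (no checkpoint is published yet), and `target_modules` depends on the projection names in the final architecture.

```python
# Minimal LoRA setup sketch using Hugging Face peft.
# "sanchari/sanchari-s" is a placeholder id, not a released checkpoint.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("sanchari/sanchari-s")

lora_config = LoraConfig(
    r=8,                    # adapter rank
    lora_alpha=16,          # adapter scaling
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # assumption: LLaMA-style names
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # typically well under 1% of base params
```

QLoRA would add 4-bit quantization of the frozen base weights (e.g. via `bitsandbytes`) on top of the same adapter config.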

---

## 3. Tokenizer & text processing

### 3.1 Tokenizer strategy
- **Tokenizer:** SentencePiece Unigram or BPE with a ~50k vocabulary, trained on a mixed Indic + English corpus.
- **Features required (a normalization sketch follows this list):**
  - Mixed-script normalization (Unicode NFKC)
  - Whitespace tokens preserved for code-like text
  - Subword segmentation for Indic scripts
- **Tooling:** `sentencepiece` or `tokenizers` (Hugging Face), with a training script.
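
A minimal sketch of the normalization step from the feature list, using Python's standard `unicodedata` module (the helper name is illustrative):

```python
import unicodedata

def normalize_text(text: str) -> str:
    """NFKC-normalize so compatibility variants (full-width Latin,
    presentation forms, mixed-script lookalikes) collapse to one
    canonical form before tokenizer training."""
    return unicodedata.normalize("NFKC", text)

# Full-width Latin collapses to plain ASCII under NFKC:
assert normalize_text("ＳＡＮＣＨＡＲＩ") == "SANCHARI"
```

Note that `spm_train` also applies an NFKC-based rule (`nmt_nfkc`) by default, so this pre-pass mainly matters for text that bypasses SentencePiece.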

### 3.2 Tokenizer commands (example)
```bash
# install
pip install sentencepiece tokenizers

# train sentencepiece
spm_train --input=data/all_texts.txt --model_prefix=sanchari_spm --vocab_size=50000 --model_type=unigram
```
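
The same run via the `sentencepiece` Python API, with a quick smoke test of the resulting model (the input path is the same placeholder as above):

```python
import sentencepiece as spm

# Equivalent to the spm_train command above.
spm.SentencePieceTrainer.train(
    input="data/all_texts.txt",
    model_prefix="sanchari_spm",
    vocab_size=50000,
    model_type="unigram",
)

# Smoke test: load the trained model and segment a mixed-script line.
sp = spm.SentencePieceProcessor(model_file="sanchari_spm.model")
print(sp.encode("नमस्ते Sanchari", out_type=str))
```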