# Sanchari – Technical Blueprint (Full, Investor Grade)

> Version: v0.1
> Purpose: Complete technical blueprint to develop Sanchari-S → Sanchari-M → Sanchari-L
> Target audience: engineers, infrastructure teams, investors

---

## Summary (one line)
Build a practical, India-focused multilingual instruction-following LM family (S: ~200–350M, M: ~1–3B, L: 7B+) using modern efficient training (PyTorch + DeepSpeed/Accelerate + FlashAttention), with explicit data provenance, safety audits, and production deployment targets.

---

## 1. Design principles (what we deliver and why)
1. **Practicality first** – Sanchari-S must be cheap to train and fast to run for API and mobile use.
2. **Indian language competence** – prioritize Indian English, Hindi, Telugu, and mixed-script text.
3. **Safety & governance** – every training stage includes PII scrubbing and red-team testing before any checkpoint release.
4. **Modular pipeline** – tokenizer, preprocessing, training config, and adapter-based instruction tuning as separate, swappable stages.
5. **Deliverables at each phase** – private checkpoints for investors (under NDA), evaluation reports, and a Hugging Face repo + demo.

---

## 2. Model architecture & approach

### 2.1 Base model family
- **Sanchari-S:** ~200–350M parameters. Decoder-only transformer (GPT-like). Primary use: fast inference, API & mobile clients.
- **Sanchari-M:** ~1–3B parameters. Better instruction following and multi-turn coherence.
- **Sanchari-L:** ~7B+ parameters. Full foundation model for enterprise applications.

### 2.2 Model topology (recommended)
- **Type:** Decoder-only transformer.
- **Layer scaling (example recipes; sanity-checked in the sketch after this list):**
  - Sanchari-S: 24 layers × 16 heads × 1024 hidden (≈ 350M)
  - Sanchari-M: 24 layers × 16 heads × 2048 hidden (≈ 1.3B)
  - Sanchari-L: 32–40 layers × 32 heads × 4096 hidden (≈ 7–8B)
- **Attention:** FlashAttention-2 compatible (for GPU memory and speed).
- **Norms and embeddings:** RMSNorm or LayerNorm, rotary positional embeddings (RoPE) for stable long-range attention.
- **Efficiency:** Provide LoRA adapters and QLoRA options for cost-effective instruction tuning (a `peft` sketch follows the rationale note below).
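
As a sanity check on these recipes: a decoder-only transformer has roughly 12 · n_layers · d_model² non-embedding parameters, plus vocab_size · d_model for the embedding matrix. The sketch below applies that estimate to the example recipes; the 50k vocabulary matches the tokenizer plan in section 3, and the exact hyperparameters remain assumptions until the configs are frozen.

```python
# Back-of-the-envelope parameter counts for the example recipes.
# Assumption: standard GPT-style blocks, so non-embedding params
# are roughly 12 * n_layers * d_model**2 (attention + MLP).

VOCAB_SIZE = 50_000  # matches the ~50k tokenizer in section 3

RECIPES = {
    #             (n_layers, n_heads, d_model)
    "sanchari-s": (24, 16, 1024),
    "sanchari-m": (24, 16, 2048),
    "sanchari-l": (32, 32, 4096),
}

def approx_params_billions(n_layers: int, d_model: int) -> float:
    """Approximate total parameter count in billions."""
    non_embedding = 12 * n_layers * d_model ** 2
    embedding = VOCAB_SIZE * d_model
    return (non_embedding + embedding) / 1e9

for name, (n_layers, n_heads, d_model) in RECIPES.items():
    print(f"{name}: ~{approx_params_billions(n_layers, d_model):.2f}B "
          f"({n_layers}L x {n_heads}H x {d_model}d)")
# sanchari-s: ~0.35B, sanchari-m: ~1.31B, sanchari-l: ~6.65B
```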

> Rationale: decoder-only is the industry standard for instruction following and is easier to deploy as an API.
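
For the LoRA/QLoRA option noted in the Efficiency bullet, instruction tuning could be wired up with Hugging Face `peft` roughly as below. This is a minimal sketch: the model id `sanchari/sanchari-s` is a placeholder (no checkpoint is published yet), and `target_modules` depends on the projection names in the final architecture.

```python
# Minimal LoRA setup sketch using Hugging Face peft.
# "sanchari/sanchari-s" is a placeholder id, not a released checkpoint.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("sanchari/sanchari-s")

lora_config = LoraConfig(
    r=8,                    # adapter rank
    lora_alpha=16,          # adapter scaling
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # assumption: LLaMA-style names
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # typically well under 1% of base params
```

QLoRA would add 4-bit quantization of the frozen base weights (e.g. via `bitsandbytes`) on top of the same adapter config.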

---

## 3. Tokenizer & text processing

### 3.1 Tokenizer strategy
- **Tokenizer:** SentencePiece Unigram or BPE with a ~50k vocabulary, trained on a mixed Indic + English corpus.
- **Features required (a normalization sketch follows this list):**
  - Mixed-script normalization (Unicode NFKC)
  - Whitespace tokens preserved for code-like text
  - Subword segmentation for Indic scripts
- **Tooling:** `sentencepiece` or `tokenizers` (Hugging Face), with a training script.
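
A minimal sketch of the normalization step from the feature list, using Python's standard `unicodedata` module (the helper name is illustrative):

```python
import unicodedata

def normalize_text(text: str) -> str:
    """NFKC-normalize so compatibility variants (full-width Latin,
    presentation forms, mixed-script lookalikes) collapse to one
    canonical form before tokenizer training."""
    return unicodedata.normalize("NFKC", text)

# Full-width Latin collapses to plain ASCII under NFKC:
assert normalize_text("ＳＡＮＣＨＡＲＩ") == "SANCHARI"
```

Note that `spm_train` also applies an NFKC-based rule (`nmt_nfkc`) by default, so this pre-pass mainly matters for text that bypasses SentencePiece.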

### 3.2 Tokenizer commands (example)
```bash
# install
pip install sentencepiece tokenizers

# train sentencepiece
spm_train --input=data/all_texts.txt --model_prefix=sanchari_spm --vocab_size=50000 --model_type=unigram
```
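
The same run via the `sentencepiece` Python API, with a quick smoke test of the resulting model (the input path is the same placeholder as above):

```python
import sentencepiece as spm

# Equivalent to the spm_train command above.
spm.SentencePieceTrainer.train(
    input="data/all_texts.txt",
    model_prefix="sanchari_spm",
    vocab_size=50000,
    model_type="unigram",
)

# Smoke test: load the trained model and segment a mixed-script line.
sp = spm.SentencePieceProcessor(model_file="sanchari_spm.model")
print(sp.encode("नमस्ते Sanchari", out_type=str))
```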