# Sanchari – Technical Blueprint (Full, Investor Grade)

> Version: v0.1
> Purpose: Complete technical blueprint to develop Sanchari-S → Sanchari-M → Sanchari-L
> Target audience: engineers, infrastructure teams, investors

---

## Summary (one line)
Build a practical, India-focused multilingual instruction-following LM family (S: ~200–350M, M: ~1–3B, L: 7B+ parameters) using modern, efficient training (PyTorch + DeepSpeed/Accelerate + FlashAttention), with explicit data provenance, safety audits, and production deployment targets.

---

## 1. Design principles (what we deliver and why)
1. **Practicality first** – Sanchari-S must be cheap to train and fast to run for API/mobile use.
2. **Indian-language competence** – prioritize Indian English, Hindi, Telugu, and mixed-script text.
3. **Safety & governance** – every training stage includes PII scrubbing and red-team testing before any checkpoint release.
4. **Modular pipeline** – tokenizer, preprocessing, training config, and adapter-based instruction tuning are cleanly separable stages.
5. **Deliverables at each phase** – private checkpoints for investors (under NDA), evaluation reports, HF repo + demo.

---

## 2. Model architecture & approach

### 2.1 Base model family
- **Sanchari-S:** ~200–350M parameters. Decoder-only transformer (GPT-like). Primary use: fast inference, API & mobile clients.
- **Sanchari-M:** ~1–3B parameters. Better instruction following and multi-turn coherence.
- **Sanchari-L:** ~7B+ parameters. Full foundation model for enterprise applications.

### 2.2 Model topology (recommended)
- **Type:** Decoder-only transformer.
- **Layer scaling (example recipes, sized to match the parameter targets above; see the config sketch below):**
  - Sanchari-S: 24 layers × 16 heads × 1024 hidden (≈ 300M)
  - Sanchari-M: 24 layers × 16 heads × 2048 hidden (≈ 1.3B)
  - Sanchari-L: 32 layers × 32 heads × 4096 hidden (≈ 7B)
- **Attention:** FlashAttention-2 compatible (for GPU memory and speed).
- **Norms and embeddings:** RMSNorm or LayerNorm; rotary positional embeddings (RoPE) for stable long-range attention.
- **Efficiency:** Provide LoRA adapters and QLoRA options for cost-effective instruction tuning (see the adapter sketch below).
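
To make the Sanchari-S recipe concrete, here is a minimal configuration sketch using Hugging Face `transformers`. `LlamaConfig` is an assumption on our part (chosen because it pairs RMSNorm with RoPE, matching the bullets above); the exact widths and the tied-embedding choice are illustrative, not a frozen spec.

```python
# Minimal sketch: a Sanchari-S-scale decoder-only model with Hugging Face transformers.
# LlamaConfig is an assumption (it pairs RMSNorm with RoPE); all values are illustrative.
from transformers import LlamaConfig, LlamaForCausalLM

config = LlamaConfig(
    vocab_size=50_000,            # matches the ~50k SentencePiece vocab in section 3
    hidden_size=1024,             # Sanchari-S recipe: 24 layers x 16 heads x 1024 hidden
    num_hidden_layers=24,
    num_attention_heads=16,
    intermediate_size=2688,       # ~2.6x hidden for the gated MLP
    max_position_embeddings=2048,
    rms_norm_eps=1e-5,
    tie_word_embeddings=True,     # share input/output embeddings to save parameters
)

model = LlamaForCausalLM(config)
n_params = sum(p.numel() for p in model.parameters())
print(f"total parameters: {n_params / 1e6:.0f}M")  # ~350M total, ~300M non-embedding
```

The M and L recipes drop in by widening this same config; recent `transformers` releases can also switch the attention kernel at load time via `attn_implementation="flash_attention_2"` where the hardware supports it.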

> Rationale: decoder-only is the industry standard for instruction following and is simpler to deploy behind an API.
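
For the adapter-based instruction-tuning path, the sketch below attaches LoRA adapters with the `peft` library. The rank, alpha, and target modules are illustrative defaults rather than tuned choices, and the tiny stand-in base model exists only to keep the example self-contained.

```python
# Minimal sketch: attaching LoRA adapters with peft for instruction tuning.
# The tiny base config is a stand-in; in practice, load a Sanchari checkpoint.
from peft import LoraConfig, get_peft_model
from transformers import LlamaConfig, LlamaForCausalLM

base = LlamaForCausalLM(LlamaConfig(
    vocab_size=50_000, hidden_size=256, num_hidden_layers=4,
    num_attention_heads=4, intermediate_size=1024,
))

lora = LoraConfig(
    r=16,                                 # adapter rank
    lora_alpha=32,                        # adapter scaling
    target_modules=["q_proj", "v_proj"],  # attention projections, a common default
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora)
model.print_trainable_parameters()  # only the adapter weights remain trainable
```

QLoRA follows the same pattern with the base weights loaded in 4-bit (via `bitsandbytes`), which keeps instruction tuning affordable on modest GPUs.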

---

## 3. Tokenizer & text processing
### 3.1 Tokenizer strategy
- **Tokenizer:** SentencePiece Unigram or BPE with a ~50k vocabulary, trained on a mixed Indic + English corpus.
- **Features required (see the normalization sketch below):**
  - Mixed-script normalization (Unicode NFKC)
  - Whitespace-preserving tokens for code-like text
  - Subword segmentation for Indic scripts
- **Tooling:** `sentencepiece` or `tokenizers` (Hugging Face), with a training script.
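
A minimal sketch of the normalization pass, using only the Python standard library; `clean_line` and the raw-corpus path are hypothetical names, while the output path matches the `spm_train` input used in section 3.2.

```python
# Minimal sketch: Unicode normalization pass run before tokenizer training.
# `clean_line` and data/raw_texts.txt are hypothetical; the output feeds spm_train below.
import unicodedata

def clean_line(line: str) -> str:
    # NFKC folds compatibility forms (full-width Latin, presentation forms) into
    # canonical codepoints, so mixed Indic/English text tokenizes consistently.
    return unicodedata.normalize("NFKC", line).strip()

with open("data/raw_texts.txt", encoding="utf-8") as src, \
     open("data/all_texts.txt", "w", encoding="utf-8") as dst:
    for line in src:
        cleaned = clean_line(line)
        if cleaned:  # drop empty lines so spm_train sees one sentence per line
            dst.write(cleaned + "\n")
```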

### 3.2 Tokenizer commands (example)
```bash
# install
pip install sentencepiece tokenizers

# train sentencepiece on the normalized corpus (one sentence per line)
spm_train --input=data/all_texts.txt --model_prefix=sanchari_spm --vocab_size=50000 --model_type=unigram
# for Indic scripts, consider raising --character_coverage (e.g. 0.9995),
# sentencepiece's recommendation for languages with rich character sets
```
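
Once `spm_train` finishes, the resulting model can be sanity-checked from Python on mixed-script input; the sample sentence is illustrative.

```python
# Minimal sketch: sanity-check the trained tokenizer on mixed Hindi/English text.
# Assumes spm_train above produced sanchari_spm.model; the sample is illustrative.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="sanchari_spm.model")

sample = "Sanchari का पहला demo जल्दी आ रहा है"  # "Sanchari's first demo is coming soon"
print(sp.encode(sample, out_type=str))  # subword pieces
print(sp.encode(sample, out_type=int))  # token ids
```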