---
license: apache-2.0
tags:
- pytorch
- transformer
- mamba
- hybrid
- matryoshka
- nanochat
- adaptive-compute
pipeline_tag: text-generation
---

# 🌀 Adamba: Adaptive Mamba

> **Ad**aptive **Mamba**: Elastic compute with dynamic Matryoshka scaling

**Fork of [karpathy/nanochat](https://github.com/karpathy/nanochat)**

## Available Checkpoints

| Variant | Parameters | Dim | Features | Status | Download |
|---------|------------|-----|----------|--------|----------|
| phase1_6b_base | 6.4B | 2048 | mamba_integration | ✅ | [Download](./checkpoints/phase1_6b_base.pt) |
| phase2_6b_matryoshka | 6.4B | 2048 | matryoshka, early_exit | ⏳ | — |
| phase3_9b_matryoshka | 9.3B | 2560 | matryoshka, early_exit | ⏳ | — |
| phase3_20b_matryoshka | 20B | 4096 | matryoshka, early_exit | ⏳ | — |
| sft_20b | 20B | 4096 | matryoshka, early_exit, sft | ⏳ | — |
| rl_20b | 20B | 4096 | matryoshka, early_exit, rl_agent | ⏳ | — |

## Architecture Overview

Adamba combines three efficiency techniques:

| Technique | Implementation | Purpose |
|-----------|----------------|---------|
| **Matryoshka (MRL)** | Width: 128 → 4096 per layer | Elastic compute |
| **Early Exit** | ConfidenceGate per layer | Skip when confident |
| **Static SSM** | Mamba at full dim | Stable memory backbone |

```
┌──────────────────────────────────────────────┐
│ PROMPT → LayerDimPredictor → [dim per layer] │
│                                              │
│ Attention + MLP: Dynamic (Matryoshka sliced) │
│ Mamba: Static (full dim)                     │
│                                              │
│ Gate > 0.95 → EXIT EARLY                     │
│ Gate < 0.50 → EXPAND remaining layers        │
└──────────────────────────────────────────────┘
```
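
The gate policy in the diagram can be sketched as a simple control loop. The gate scores, dim schedule, and return values below are stand-ins to show the two thresholds in action, not Adamba's real modules.

```python
# Illustrative sketch of the per-layer gate policy (assumed shapes and
# names; only the two thresholds come from the diagram above).

EXIT_THRESHOLD = 0.95    # confident enough: stop early
EXPAND_THRESHOLD = 0.50  # unsure: widen the remaining layers

def run_layers(gate_scores, dims, max_dim=4096):
    """Walk per-layer gate scores, applying the exit/expand rules."""
    used = []
    expanded = False
    for g, d in zip(gate_scores, dims):
        if expanded:
            d = max_dim                 # remaining layers run at full width
        used.append(d)
        if g > EXIT_THRESHOLD:
            return used, "early_exit"   # skip all remaining layers
        if g < EXPAND_THRESHOLD:
            expanded = True             # low confidence: expand from next layer
    return used, "full_pass"

dims = [256, 512, 1024, 2048]
used, outcome = run_layers([0.6, 0.97, 0.9, 0.9], dims)
# confident at layer 2: used == [256, 512], outcome == "early_exit"
```

Easy prompts therefore exit after a few narrow layers, while hard prompts trigger expansion and pay for the full 4096-dim stack only when needed.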

## Training Pipeline

```
nanochat-d32 (1.9B)
    ↓ Surgery (add 32 Mamba layers)
Phase 1: 6.4B (dim=2048)  ← Mamba integration
    ↓ Enable Matryoshka
Phase 2: 6.4B (dim=2048)  ← Full training
    ↓ Progressive expand
Phase 3: 9.3B → 20B (dim=4096)
    ↓ Fine-tuning
SFT: Instruction tuning
RL: Agent capabilities
```

## Model Details

- **Base**: [karpathy/nanochat-d32](https://huggingface.co/karpathy/nanochat-d32)
- **Architecture**: 64 blocks (32 Attention + 32 Mamba, interleaved)
- **Vocabulary**: 65,536 tokens
- **Matryoshka Dims**: [128, 256, 512, 1024, 2048, 4096]

## Usage

```python
# Inference code coming soon
# See: https://github.com/unixsysdev/adamba
```

## Links

- 📂 **GitHub**: [unixsysdev/adamba](https://github.com/unixsysdev/adamba)
- 📊 **Training**: [WandB](https://wandb.ai/dalletest123/nano-fractal)

## License

Apache 2.0