# Adamba: Adaptive Mamba

Adaptive Mamba: elastic compute with dynamic Matryoshka scaling.

Project: `unixsysdev/adamba`
## Available Checkpoints
| Variant | Parameters | Dim | Features | Status | Download |
|---|---|---|---|---|---|
| phase1_6b_base | 6.4B | 2048 | mamba_integration | Available | Download |
| phase2_6b_matryoshka | 6.4B | 2048 | matryoshka, early_exit | Planned | — |
| phase3_9b_matryoshka | 9.3B | 2560 | matryoshka, early_exit | Planned | — |
| phase3_20b_matryoshka | 20B | 4096 | matryoshka, early_exit | Planned | — |
| sft_20b | 20B | 4096 | matryoshka, early_exit, sft | Planned | — |
| rl_20b | 20B | 4096 | matryoshka, early_exit, rl_agent | Planned | — |
## Architecture Overview
Adamba combines three efficiency techniques:
| Technique | Implementation | Purpose |
|---|---|---|
| Matryoshka (MRL) | Width: 128 → 4096 per layer | Elastic compute |
| Early Exit | ConfidenceGate per layer | Skip when confident |
| Static SSM | Mamba at full dim | Stable memory backbone |
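The Matryoshka (MRL) row above can be made concrete with a small sketch. This is a hypothetical NumPy illustration, not the repo's actual code: a linear layer trained at full width can be evaluated at any nested prefix width simply by slicing the leading rows and columns of its weight matrix, so all widths share one set of parameters.

```python
import numpy as np

def matryoshka_linear(x, W, b, dim):
    """Run a full-width linear layer at a nested sub-width `dim`
    by slicing the leading rows/cols (the Matryoshka property)."""
    return x[..., :dim] @ W[:dim, :dim].T + b[:dim]

rng = np.random.default_rng(0)
W = rng.normal(scale=0.02, size=(4096, 4096))  # full-width weights
b = np.zeros(4096)
x = rng.normal(size=(1, 4096))

for dim in (128, 512, 4096):  # nested widths from the table above
    y = matryoshka_linear(x, W, b, dim)
    print(dim, y.shape)
```

Because every sub-width reuses the same leading weights, the per-layer dims chosen at inference time change compute cost without loading a different checkpoint.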
```
PROMPT → LayerDimPredictor → [dim per layer]

Attention + MLP: dynamic (Matryoshka-sliced)
Mamba:           static (full dim)

Gate > 0.95 → EXIT EARLY
Gate < 0.50 → EXPAND remaining layers
```
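The two gating rules in the diagram can be sketched as a per-layer control loop. This is a toy illustration under assumed thresholds from the diagram; `run_with_gates`, the identity layers, and the scripted gate confidences are all hypothetical, not Adamba's implementation.

```python
import numpy as np

MAX_DIM = 4096

def run_with_gates(h, layers, gates, dims, exit_thr=0.95, expand_thr=0.50):
    """Per-layer control loop: a confident gate exits early;
    an unconfident gate widens the remaining layers' dims."""
    dims = list(dims)
    for i, (layer, gate) in enumerate(zip(layers, gates)):
        h = layer(h, dims[i])
        conf = gate(h)
        if conf > exit_thr:                        # Gate > 0.95 -> exit early
            return h, i + 1, dims
        if conf < expand_thr:                      # Gate < 0.50 -> expand
            dims[i + 1:] = [min(2 * d, MAX_DIM) for d in dims[i + 1:]]
    return h, len(layers), dims

# Toy demo: identity layers, scripted gate confidences.
identity = lambda h, dim: h
confs = iter([0.3, 0.7, 0.96, 0.99])
gates = [lambda h: next(confs)] * 4
out, used, dims = run_with_gates(np.zeros(8), [identity] * 4, gates, [512] * 4)
print(used, dims)  # exits after layer 3; layers 2-4 were widened once
```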
## Training Pipeline
```
nanochat-d32 (1.9B)
  ↓ surgery (add 32 Mamba layers)
Phase 1: 6.4B (dim=2048) - Mamba integration
  ↓ enable Matryoshka
Phase 2: 6.4B (dim=2048) - full training
  ↓ progressive expand
Phase 3: 9.3B → 20B (dim=4096)
  ↓ fine-tuning
SFT: instruction tuning
RL:  agent capabilities
```
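The "progressive expand" step (growing from dim=2048 toward dim=4096) is commonly implemented by embedding the trained weights in a larger matrix whose new entries start near zero, so the widened model initially behaves like the smaller one. The sketch below is a hypothetical illustration of that general technique, not the repo's actual surgery code.

```python
import numpy as np

def expand_weight(W_small, new_dim, init_scale=1e-3, seed=0):
    """Embed a trained (old_dim x old_dim) weight matrix in a larger
    (new_dim x new_dim) matrix; new rows/cols get small random init."""
    old_dim = W_small.shape[0]
    rng = np.random.default_rng(seed)
    W_big = rng.normal(scale=init_scale, size=(new_dim, new_dim))
    W_big[:old_dim, :old_dim] = W_small  # preserve trained weights
    return W_big

W2048 = np.eye(2048)                 # stand-in for Phase 2 weights
W4096 = expand_weight(W2048, 4096)   # Phase 3 width
print(W4096.shape)
```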
## Model Details
- Base: karpathy/nanochat-d32
- Architecture: 64 blocks (32 Attention + 32 Mamba interleaved)
- Vocabulary: 65,536 tokens
- Matryoshka Dims: [128, 256, 512, 1024, 2048, 4096]
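The "32 Attention + 32 Mamba interleaved" layout can be written out explicitly. A minimal sketch, assuming a simple strict alternation (the actual interleaving pattern is not specified here):

```python
def build_block_types(n_pairs=32):
    """Alternate attention and Mamba blocks: 64 blocks total."""
    return [t for _ in range(n_pairs) for t in ("attention", "mamba")]

blocks = build_block_types()
print(len(blocks), blocks[:4])  # 64 ['attention', 'mamba', 'attention', 'mamba']
```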
## Usage

```python
# Coming soon - inference code
# See: https://github.com/unixsysdev/adamba
```
## Links

- GitHub: [unixsysdev/adamba](https://github.com/unixsysdev/adamba)
- Training: WandB
## License
Apache 2.0