---
license: apache-2.0
tags:
- pytorch
- transformer
- mamba
- hybrid
- matryoshka
- nanochat
- adaptive-compute
pipeline_tag: text-generation
---

# Adamba: Adaptive Mamba

> **Ad**aptive **Mamba**: Elastic compute with dynamic Matryoshka scaling

**Project repository: [unixsysdev/adamba](https://github.com/unixsysdev/adamba)**

## Available Checkpoints

| Variant | Parameters | Dim | Features | Status | Download |
|---------|------------|-----|----------|--------|----------|
| phase1_6b_base | 6.4B | 2048 | mamba_integration | ✅ | [Download](./checkpoints/phase1_6b_base.pt) |
| phase2_6b_matryoshka | 6.4B | 2048 | matryoshka, early_exit | ⏳ | - |
| phase3_9b_matryoshka | 9.3B | 2560 | matryoshka, early_exit | ⏳ | - |
| phase3_20b_matryoshka | 20B | 4096 | matryoshka, early_exit | ⏳ | - |
| sft_20b | 20B | 4096 | matryoshka, early_exit, sft | ⏳ | - |
| rl_20b | 20B | 4096 | matryoshka, early_exit, rl_agent | ⏳ | - |
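
To fetch the completed checkpoint programmatically, something like the sketch below should work via `huggingface_hub`. The `repo_id` is a placeholder assumption here; substitute this model's actual Hub id.

```python
# Hypothetical download snippet: repo_id is a placeholder, not a confirmed
# Hub id; the filename matches the table above.
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="unixsysdev/adamba",  # placeholder; use this model's real Hub id
    filename="checkpoints/phase1_6b_base.pt",
)
print(path)
```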

## Architecture Overview

Adamba combines three efficiency techniques:

| Technique | Implementation | Purpose |
|-----------|----------------|---------|
| **Matryoshka (MRL)** | Width: 128 → 4096 per layer | Elastic compute |
| **Early Exit** | ConfidenceGate per layer | Skip when confident |
| **Static SSM** | Mamba at full dim | Stable memory backbone |

```
┌──────────────────────────────────────────────┐
│ PROMPT → LayerDimPredictor → [dim per layer] │
│                                              │
│ Attention + MLP: Dynamic (Matryoshka sliced) │
│ Mamba: Static (full dim)                     │
│                                              │
│ Gate > 0.95 → EXIT EARLY                     │
│ Gate < 0.50 → EXPAND remaining layers        │
└──────────────────────────────────────────────┘
```
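
As an illustration of how the three techniques compose, here is a minimal PyTorch sketch. Only `ConfidenceGate`, the 0.95/0.50 thresholds, the Matryoshka widths, and the static-Mamba rule come from this card; everything else (`DynamicBlock`, `forward_adaptive`, the gate's internals) is a hypothetical stand-in, not the repo's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

EXIT_AT, EXPAND_AT = 0.95, 0.50  # gate thresholds from the diagram above

class ConfidenceGate(nn.Module):
    """Maps a layer's hidden state to a scalar confidence in [0, 1]."""
    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, 1)

    def forward(self, h: torch.Tensor) -> float:
        # Pool over the sequence, project to a logit, squash to [0, 1].
        return torch.sigmoid(self.proj(h.mean(dim=1))).mean().item()

class DynamicBlock(nn.Module):
    """Stand-in for an attention/MLP block that can run at a sliced width d.

    A Mamba block would ignore d and always run at full width ("static SSM").
    """
    def __init__(self, dim: int):
        super().__init__()
        self.w = nn.Linear(dim, dim)

    def forward(self, h: torch.Tensor, d: int) -> torch.Tensor:
        # Matryoshka slicing: use only the first d rows/cols of the weight,
        # so every smaller width is a prefix of the full matrix.
        out = F.linear(h[..., :d], self.w.weight[:d, :d], self.w.bias[:d])
        return torch.cat([h[..., :d] + out, h[..., d:]], dim=-1)

def forward_adaptive(blocks, gates, h, dims):
    """Run the stack at per-layer widths, exiting or expanding via the gate."""
    for i, (block, gate) in enumerate(zip(blocks, gates)):
        h = block(h, dims[i])
        conf = gate(h)
        if conf > EXIT_AT:    # confident: skip the remaining layers
            break
        if conf < EXPAND_AT:  # struggling: widen all remaining layers
            dims[i + 1:] = [h.size(-1)] * (len(dims) - i - 1)
    return h

# Toy usage: 4 blocks at width 512, with per-layer dims that a
# LayerDimPredictor would normally produce from the prompt.
blocks = nn.ModuleList(DynamicBlock(512) for _ in range(4))
gates = nn.ModuleList(ConfidenceGate(512) for _ in range(4))
out = forward_adaptive(blocks, gates, torch.randn(1, 16, 512), [128, 256, 512, 512])
```

Slicing one weight matrix, rather than keeping separate matrices per width, is what makes the widths "nested" in the Matryoshka sense: every smaller model is a prefix of the full one and shares its parameters.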

## Training Pipeline

```
nanochat-d32 (1.9B)
    ↓ Surgery (add 32 Mamba layers)
Phase 1: 6.4B (dim=2048) - Mamba integration
    ↓ Enable Matryoshka
Phase 2: 6.4B (dim=2048) - Full training
    ↓ Progressive expand
Phase 3: 9.3B → 20B (dim=4096)
    ↓ Fine-tuning
SFT: Instruction tuning
RL: Agent capabilities
```
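
The "surgery" step is what turns 32 pretrained blocks into 64. A hedged sketch of what that interleaving could look like follows; `MambaLayer` is a placeholder (a real run would pull in an actual SSM implementation such as `mamba_ssm`), and none of this is the repo's code.

```python
import torch.nn as nn

class MambaLayer(nn.Module):
    """Placeholder for a real Mamba/SSM layer (e.g. from mamba_ssm)."""
    def __init__(self, dim: int):
        super().__init__()
        self.mixer = nn.Linear(dim, dim)  # stand-in, not an actual SSM

    def forward(self, x):
        return x + self.mixer(x)

def interleave_mamba(pretrained_blocks: nn.ModuleList, dim: int) -> nn.ModuleList:
    """32 pretrained attention blocks in -> 64 blocks out (attn, mamba, ...)."""
    out = []
    for block in pretrained_blocks:
        out.append(block)            # keep the pretrained weights
        out.append(MambaLayer(dim))  # freshly initialized, full-width SSM
    return nn.ModuleList(out)
```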

## Model Details

- **Base**: [karpathy/nanochat-d32](https://huggingface.co/karpathy/nanochat-d32)
- **Architecture**: 64 blocks (32 attention + 32 Mamba, interleaved)
- **Vocabulary**: 65,536 tokens
- **Matryoshka dims**: [128, 256, 512, 1024, 2048, 4096] (collected into a config sketch below)
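
The numbers above, gathered into a config sketch; the dataclass name and fields are illustrative, not the repo's API.

```python
from dataclasses import dataclass

@dataclass
class AdambaConfig:
    n_blocks: int = 64          # 32 attention + 32 Mamba, interleaved
    vocab_size: int = 65_536
    dim: int = 4096             # full width at the 20B scale
    matryoshka_dims: tuple = (128, 256, 512, 1024, 2048, 4096)
    exit_threshold: float = 0.95    # gate > 0.95 -> exit early
    expand_threshold: float = 0.50  # gate < 0.50 -> expand remaining layers
```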

## Usage

```python
# Coming soon: inference code.
# See: https://github.com/unixsysdev/adamba
```
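
Until the official inference code lands, the phase 1 checkpoint should at least be inspectable as a regular PyTorch file. This assumes a plain `torch.save` artifact; the real layout may differ.

```python
# Assumption: the .pt file is a plain torch.save artifact; adjust once the
# official loader exists.
import torch

ckpt = torch.load("checkpoints/phase1_6b_base.pt", map_location="cpu")
print(type(ckpt))
```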

## Links

- **GitHub**: [unixsysdev/adamba](https://github.com/unixsysdev/adamba)
- **Training**: [WandB](https://wandb.ai/dalletest123/nano-fractal)

## License

Apache 2.0