---
license: apache-2.0
tags:
- pytorch
- transformer
- mamba
- hybrid
- matryoshka
- nanochat
- adaptive-compute
pipeline_tag: text-generation
---

# Adamba: Adaptive Mamba

> **Ad**aptive **Mamba**: Elastic compute with dynamic Matryoshka scaling

**Project repository: [unixsysdev/adamba](https://github.com/unixsysdev/adamba)**

## Available Checkpoints

| Variant | Parameters | Dim | Features | Status | Download |
|---------|------------|-----|----------|--------|----------|
| phase1_6b_base | 6.4B | 2048 | mamba_integration | ✅ | [Download](./checkpoints/phase1_6b_base.pt) |
| phase2_6b_matryoshka | 6.4B | 2048 | matryoshka, early_exit | ⏳ | - |
| phase3_9b_matryoshka | 9.3B | 2560 | matryoshka, early_exit | ⏳ | - |
| phase3_20b_matryoshka | 20B | 4096 | matryoshka, early_exit | ⏳ | - |
| sft_20b | 20B | 4096 | matryoshka, early_exit, sft | ⏳ | - |
| rl_20b | 20B | 4096 | matryoshka, early_exit, rl_agent | ⏳ | - |
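
To fetch the completed checkpoint programmatically, something like the sketch below should work via `huggingface_hub`. The `repo_id` is a placeholder assumption here; substitute this model's actual Hub id.

```python
# Hypothetical download snippet: repo_id is a placeholder, not a confirmed
# Hub id; the filename matches the table above.
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="unixsysdev/adamba",  # placeholder; use this model's real Hub id
    filename="checkpoints/phase1_6b_base.pt",
)
print(path)
```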

## Architecture Overview

Adamba combines three efficiency techniques:

| Technique | Implementation | Purpose |
|-----------|----------------|---------|
| **Matryoshka (MRL)** | Width: 128 → 4096 per layer | Elastic compute |
| **Early Exit** | ConfidenceGate per layer | Skip when confident |
| **Static SSM** | Mamba at full dim | Stable memory backbone |

```
┌──────────────────────────────────────────────┐
│ PROMPT → LayerDimPredictor → [dim per layer] │
│                                              │
│ Attention + MLP: Dynamic (Matryoshka sliced) │
│ Mamba: Static (full dim)                     │
│                                              │
│ Gate > 0.95 → EXIT EARLY                     │
│ Gate < 0.50 → EXPAND remaining layers        │
└──────────────────────────────────────────────┘
```
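
As an illustration of how the three techniques compose, here is a minimal PyTorch sketch. Only `ConfidenceGate`, the 0.95/0.50 thresholds, the Matryoshka widths, and the static-Mamba rule come from this card; everything else (`DynamicBlock`, `forward_adaptive`, the gate's internals) is a hypothetical stand-in, not the repo's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

EXIT_AT, EXPAND_AT = 0.95, 0.50  # gate thresholds from the diagram above

class ConfidenceGate(nn.Module):
    """Maps a layer's hidden state to a scalar confidence in [0, 1]."""
    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, 1)

    def forward(self, h: torch.Tensor) -> float:
        # Pool over the sequence, project to a logit, squash to [0, 1].
        return torch.sigmoid(self.proj(h.mean(dim=1))).mean().item()

class DynamicBlock(nn.Module):
    """Stand-in for an attention/MLP block that can run at a sliced width d.

    A Mamba block would ignore d and always run at full width ("static SSM").
    """
    def __init__(self, dim: int):
        super().__init__()
        self.w = nn.Linear(dim, dim)

    def forward(self, h: torch.Tensor, d: int) -> torch.Tensor:
        # Matryoshka slicing: use only the first d rows/cols of the weight,
        # so every smaller width is a prefix of the full matrix.
        out = F.linear(h[..., :d], self.w.weight[:d, :d], self.w.bias[:d])
        return torch.cat([h[..., :d] + out, h[..., d:]], dim=-1)

def forward_adaptive(blocks, gates, h, dims):
    """Run the stack at per-layer widths, exiting or expanding via the gate."""
    for i, (block, gate) in enumerate(zip(blocks, gates)):
        h = block(h, dims[i])
        conf = gate(h)
        if conf > EXIT_AT:    # confident: skip the remaining layers
            break
        if conf < EXPAND_AT:  # struggling: widen all remaining layers
            dims[i + 1:] = [h.size(-1)] * (len(dims) - i - 1)
    return h

# Toy usage: 4 blocks at width 512, with per-layer dims that a
# LayerDimPredictor would normally produce from the prompt.
blocks = nn.ModuleList(DynamicBlock(512) for _ in range(4))
gates = nn.ModuleList(ConfidenceGate(512) for _ in range(4))
out = forward_adaptive(blocks, gates, torch.randn(1, 16, 512), [128, 256, 512, 512])
```

Slicing one weight matrix, rather than keeping separate matrices per width, is what makes the widths "nested" in the Matryoshka sense: every smaller model is a prefix of the full one and shares its parameters.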

## Training Pipeline

```
nanochat-d32 (1.9B)
    ↓ Surgery (add 32 Mamba layers)
Phase 1: 6.4B (dim=2048) - Mamba integration
    ↓ Enable Matryoshka
Phase 2: 6.4B (dim=2048) - Full training
    ↓ Progressive expand
Phase 3: 9.3B → 20B (dim=4096)
    ↓ Fine-tuning
SFT: Instruction tuning
RL: Agent capabilities
```
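
The "surgery" step is what turns 32 pretrained blocks into 64. A hedged sketch of what that interleaving could look like follows; `MambaLayer` is a placeholder (a real run would pull in an actual SSM implementation such as `mamba_ssm`), and none of this is the repo's code.

```python
import torch.nn as nn

class MambaLayer(nn.Module):
    """Placeholder for a real Mamba/SSM layer (e.g. from mamba_ssm)."""
    def __init__(self, dim: int):
        super().__init__()
        self.mixer = nn.Linear(dim, dim)  # stand-in, not an actual SSM

    def forward(self, x):
        return x + self.mixer(x)

def interleave_mamba(pretrained_blocks: nn.ModuleList, dim: int) -> nn.ModuleList:
    """32 pretrained attention blocks in -> 64 blocks out (attn, mamba, ...)."""
    out = []
    for block in pretrained_blocks:
        out.append(block)            # keep the pretrained weights
        out.append(MambaLayer(dim))  # freshly initialized, full-width SSM
    return nn.ModuleList(out)
```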

## Model Details

- **Base**: [karpathy/nanochat-d32](https://huggingface.co/karpathy/nanochat-d32)
- **Architecture**: 64 blocks (32 attention + 32 Mamba, interleaved)
- **Vocabulary**: 65,536 tokens
- **Matryoshka dims**: [128, 256, 512, 1024, 2048, 4096] (collected into a config sketch below)
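
The numbers above, gathered into a config sketch; the dataclass name and fields are illustrative, not the repo's API.

```python
from dataclasses import dataclass

@dataclass
class AdambaConfig:
    n_blocks: int = 64          # 32 attention + 32 Mamba, interleaved
    vocab_size: int = 65_536
    dim: int = 4096             # full width at the 20B scale
    matryoshka_dims: tuple = (128, 256, 512, 1024, 2048, 4096)
    exit_threshold: float = 0.95    # gate > 0.95 -> exit early
    expand_threshold: float = 0.50  # gate < 0.50 -> expand remaining layers
```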

## Usage

```python
# Coming soon: inference code.
# See: https://github.com/unixsysdev/adamba
```
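
Until the official inference code lands, the phase 1 checkpoint should at least be inspectable as a regular PyTorch file. This assumes a plain `torch.save` artifact; the real layout may differ.

```python
# Assumption: the .pt file is a plain torch.save artifact; adjust once the
# official loader exists.
import torch

ckpt = torch.load("checkpoints/phase1_6b_base.pt", map_location="cpu")
print(type(ckpt))
```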

## Links

- **GitHub**: [unixsysdev/adamba](https://github.com/unixsysdev/adamba)
- **Training**: [WandB](https://wandb.ai/dalletest123/nano-fractal)

## License

Apache 2.0