---
license: apache-2.0
tags:
- pytorch
- transformer
- mamba
- moe
- hybrid
- matryoshka
- gpt-oss
- adaptive-compute
pipeline_tag: text-generation
---

# GPT-OSS Adamba: Hybrid MoE + Mamba

> **21.9B** parameters | **32 experts** | **Mamba-enhanced** reasoning backbone

**[GitHub](https://github.com/unixsysdev/adamba)** | **[Original Adamba](https://huggingface.co/datasysdev/adamba)**

## Available Checkpoints

| Variant | Parameters | Dim | Features | Status | Download |
|---------|------------|-----|----------|--------|----------|
| gptoss_phase1 | 21.9B | 2880 | mamba_integration, moe_32experts | ✅ | [Download](./checkpoints/gptoss_phase1.pt) |
| gptoss_phase2 | 21.9B | 2880 | matryoshka, early_exit, moe_32experts | ⏳ | — |
| gptoss_phase3 | 30B+ | 4096 | matryoshka, early_exit, moe_32experts, expansion | ⏳ | — |
| gptoss_sft | 21.9B | 2880 | matryoshka, moe_32experts, sft | ⏳ | — |

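The Phase 1 checkpoint can be fetched programmatically. The snippet below is a minimal sketch using `huggingface_hub`; the `repo_id` is left as a placeholder for this repository, and the filename is taken from the Download link in the table above.

```python
# Minimal sketch: fetch the Phase 1 checkpoint with huggingface_hub.
# "<this-repo-id>" is a placeholder for this model repo's id; the filename
# matches the Download link in the table above.
from huggingface_hub import hf_hub_download

ckpt_path = hf_hub_download(
    repo_id="<this-repo-id>",
    filename="checkpoints/gptoss_phase1.pt",
)
print(f"Checkpoint saved to: {ckpt_path}")
```
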
## Architecture

Built on [OpenAI GPT-OSS 20B](https://huggingface.co/openai/gpt-oss-20b) with Mamba integration:

| Component | Spec |
|-----------|------|
| **Base Model** | GPT-OSS 20B MoE |
| **Hidden Dim** | 2880 |
| **Attention** | 24 layers (sliding + full, alternating) |
| **Mamba** | 12 layers (interleaved 2:1) |
| **MoE** | 32 experts, top-4 routing |
| **Vocab** | 201,088 tokens |
| **Total Blocks** | 36 (24 Attn + 12 Mamba) |

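To make the layer counts above concrete, the illustrative sketch below (not code from the repo) enumerates the 36-block A-A-M ordering implied by the 2:1 interleave.

```python
# Illustrative only: the hybrid block ordering implied by the table above
# (24 attention blocks : 12 Mamba blocks, repeating A-A-M groups).
def hybrid_layout(num_groups=12):
    layout = []
    for _ in range(num_groups):
        layout += ["attn", "attn", "mamba"]  # one A-A-M group
    return layout

blocks = hybrid_layout()
assert len(blocks) == 36
assert blocks.count("attn") == 24 and blocks.count("mamba") == 12
print(blocks[:6])  # ['attn', 'attn', 'mamba', 'attn', 'attn', 'mamba']
```
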
```
GPT-OSS 20B (Attention + MoE)
  ↓ Surgery (inject 12 Mamba layers)
Hybrid: A-A-M-A-A-M-... pattern
  ↓ Phase 1 (train Mamba only)
Mamba learns to "speak GPT-OSS language"
  ↓ Phase 2 (enable Matryoshka)
Adaptive compute: 128 → 2880 dim per layer
```

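The "128 → 2880 dim per layer" line refers to Matryoshka-style nested widths: the leading slice of each layer's hidden dimension forms a usable sub-network, so compute can be dialed per layer. The sketch below is a conceptual illustration of that idea, not the Adamba implementation.

```python
# Conceptual Matryoshka-width sketch: evaluate a projection at a truncated
# hidden width by slicing the leading rows/columns of its weight.
# This illustrates the idea only; it is not this repo's code.
import torch
import torch.nn as nn

class MatryoshkaLinear(nn.Module):
    def __init__(self, dim=2880):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(dim, dim) * 0.02)
        self.bias = nn.Parameter(torch.zeros(dim))

    def forward(self, x, active_dim=2880):
        w = self.weight[:active_dim, :active_dim]  # leading slice of the weight
        b = self.bias[:active_dim]
        return x[..., :active_dim] @ w.t() + b

layer = MatryoshkaLinear()
x = torch.randn(1, 8, 2880)
print(layer(x, active_dim=128).shape)   # low-compute path  -> (1, 8, 128)
print(layer(x, active_dim=2880).shape)  # full-width path   -> (1, 8, 2880)
```
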
## Training Status

**Phase 1**: Mamba integration (freeze Attention+MoE, train Mamba)

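A minimal sketch of this recipe is shown below: freeze every parameter, then re-enable gradients only for the injected Mamba layers. The `"mamba"` name filter is an assumption about how those layers are named, not taken from the repo.

```python
# Sketch of Phase-1-style selective training: train only the Mamba layers.
# The "mamba" substring match is an assumed naming convention.
import torch.nn as nn

def freeze_for_phase1(model: nn.Module) -> None:
    for name, param in model.named_parameters():
        param.requires_grad = "mamba" in name.lower()

# Usage: freeze_for_phase1(model), then count trainable parameters with
# sum(p.numel() for p in model.parameters() if p.requires_grad).
```
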
## Usage

```python
# Coming soon - inference code
# See: https://github.com/unixsysdev/adamba
```

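Until the official inference code lands, the snippet below is only a hedged sketch for inspecting a downloaded checkpoint, assuming the `.pt` file is a plain PyTorch checkpoint (an assumption; see the GitHub repo for the supported loading path).

```python
# Hedged sketch: peek inside the checkpoint, assuming a standard torch .pt file.
import torch

state = torch.load("gptoss_phase1.pt", map_location="cpu")
if isinstance(state, dict):
    print(f"{len(state)} top-level entries")
    for key in list(state)[:5]:
        print(key)
```
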
## License

Apache 2.0 (same as GPT-OSS)