---
license: apache-2.0
tags:
- pytorch
- transformer
- mamba
- moe
- hybrid
- matryoshka
- gpt-oss
- adaptive-compute
pipeline_tag: text-generation
---

# πŸŒ€ GPT-OSS Adamba: Hybrid MoE + Mamba

> **21.9B** parameters | **32 experts** | **Mamba-enhanced** reasoning backbone

πŸ“‚ **[GitHub](https://github.com/unixsysdev/adamba)** | πŸ€— **[Original Adamba](https://huggingface.co/datasysdev/adamba)**

## Available Checkpoints

| Variant | Parameters | Dim | Features | Status | Download |
|---------|------------|-----|----------|--------|----------|
| gptoss_phase1 | 21.9B | 2880 | mamba_integration, moe_32experts | βœ… Released | [Download](./checkpoints/gptoss_phase1.pt) |
| gptoss_phase2 | 21.9B | 2880 | matryoshka, early_exit, moe_32experts | ⏳ Pending | β€” |
| gptoss_phase3 | 30B+ | 4096 | matryoshka, early_exit, moe_32experts, expansion | ⏳ Pending | β€” |
| gptoss_sft | 21.9B | 2880 | matryoshka, moe_32experts, sft | ⏳ Pending | β€” |

## Architecture

Built on [OpenAI GPT-OSS 20B](https://huggingface.co/openai/gpt-oss-20b) with Mamba integration:

| Component | Spec |
|-----------|------|
| **Base Model** | GPT-OSS 20B MoE |
| **Hidden Dim** | 2880 |
| **Attention** | 24 layers (sliding + full alternating) |
| **Mamba** | 12 layers (interleaved 2:1) |
| **MoE** | 32 experts, top-4 routing |
| **Vocab** | 201,088 tokens |
| **Total Blocks** | 36 (24 Attn + 12 Mamba) |
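
To make the "top-4 routing" row concrete, here is a minimal sketch of top-k expert selection. This is the generic MoE routing pattern, not the model's actual router code; the function name and shapes are illustrative:

```python
import torch
import torch.nn.functional as F

def route_top4(router_logits: torch.Tensor, k: int = 4):
    """Select the k highest-scoring experts per token and renormalize."""
    weights, expert_ids = torch.topk(router_logits, k, dim=-1)  # (tokens, k)
    weights = F.softmax(weights, dim=-1)  # mixture weights over the chosen experts
    return weights, expert_ids

logits = torch.randn(8, 32)           # 8 tokens scored against 32 experts
weights, expert_ids = route_top4(logits)
assert expert_ids.shape == (8, 4)     # each token is dispatched to 4 experts
```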

```
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  GPT-OSS 20B (Attention + MoE)                       β”‚
β”‚       ↓ Surgery (inject 12 Mamba layers)             β”‚
β”‚  Hybrid: A-A-M-A-A-M-... pattern                     β”‚
β”‚       ↓ Phase 1 (train Mamba only)                   β”‚
β”‚  Mamba learns to "speak GPT-OSS language"            β”‚
β”‚       ↓ Phase 2 (enable Matryoshka)                  β”‚
β”‚  Adaptive compute: 128 β†’ 2880 dim per layer          β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
```
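
The "Adaptive compute: 128 β†’ 2880 dim" line refers to Matryoshka-style nested widths: a layer can be evaluated on any prefix of its full hidden dimension, so easy tokens take a narrow, cheap path and hard tokens the full-width one. A toy sketch of the general technique (an illustration of the idea, not this model's implementation):

```python
import torch
import torch.nn as nn

class MatryoshkaLinear(nn.Module):
    """Linear layer usable at any prefix of its full width."""
    def __init__(self, max_dim: int = 2880):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(max_dim, max_dim) / max_dim**0.5)

    def forward(self, x: torch.Tensor, active_dim: int) -> torch.Tensor:
        # Use only the top-left active_dim x active_dim slice of the weights.
        w = self.weight[:active_dim, :active_dim]
        return x[..., :active_dim] @ w.T

layer = MatryoshkaLinear()
x = torch.randn(4, 2880)
cheap = layer(x, active_dim=128)    # low-compute path
full = layer(x, active_dim=2880)    # full-width path
```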

## Training Status

**Phase 1**: Mamba integration. The pretrained attention and MoE weights stay frozen while the newly injected Mamba layers are trained, as sketched below.
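
A minimal sketch of that freezing scheme, assuming Mamba parameters can be identified by name. The toy model below is a stand-in; the real module layout and parameter names are an assumption:

```python
import torch
import torch.nn as nn

# Toy stand-in for the hybrid model; real parameter names are an assumption.
model = nn.ModuleDict({
    "attn_block": nn.Linear(8, 8),
    "moe_block": nn.Linear(8, 8),
    "mamba_block": nn.Linear(8, 8),
})

# Phase 1: freeze attention + MoE, leave only Mamba parameters trainable.
for name, param in model.named_parameters():
    param.requires_grad = "mamba" in name

optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=1e-4
)
```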

## Usage

```python
# Coming soon - inference code
# See: https://github.com/unixsysdev/adamba
```
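
Until the official inference code lands, here is a hedged sketch of pulling the Phase 1 checkpoint with `huggingface_hub`. The `repo_id` below is a placeholder, and the file is assumed to be a plain PyTorch state dict:

```python
import torch
from huggingface_hub import hf_hub_download

# Placeholder repo_id -- substitute the actual Hub repository for this card.
ckpt_path = hf_hub_download(
    repo_id="datasysdev/gptoss-adamba",
    filename="checkpoints/gptoss_phase1.pt",
)

# Assumes a raw state dict; drop weights_only=True if the checkpoint
# wraps its tensors in pickled Python objects.
state_dict = torch.load(ckpt_path, map_location="cpu", weights_only=True)
print(f"loaded {len(state_dict)} entries")
```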

## License

Apache 2.0 (same as GPT-OSS)