---
license: apache-2.0
tags:
- pytorch
- transformer
- mamba
- moe
- hybrid
- matryoshka
- gpt-oss
- adaptive-compute
pipeline_tag: text-generation
---

# 🌀 GPT-OSS Adamba: Hybrid MoE + Mamba

> **21.9B** parameters | **32 experts** | **Mamba-enhanced** reasoning backbone

📂 **[GitHub](https://github.com/unixsysdev/adamba)** | 🤗 **[Original Adamba](https://huggingface.co/datasysdev/adamba)**

## Available Checkpoints

| Variant | Parameters | Dim | Features | Status | Download |
|---------|------------|-----|----------|--------|----------|
| gptoss_phase1 | 21.9B | 2880 | mamba_integration, moe_32experts | ✅ | [Download](./checkpoints/gptoss_phase1.pt) |
| gptoss_phase2 | 21.9B | 2880 | matryoshka, early_exit, moe_32experts | ⏳ | — |
| gptoss_phase3 | 30B+ | 4096 | matryoshka, early_exit, moe_32experts, expansion | ⏳ | — |
| gptoss_sft | 21.9B | 2880 | matryoshka, moe_32experts, sft | ⏳ | — |

## Architecture

Built on [OpenAI GPT-OSS 20B](https://huggingface.co/openai/gpt-oss-20b) with Mamba integration:

| Component | Spec |
|-----------|------|
| **Base Model** | GPT-OSS 20B MoE |
| **Hidden Dim** | 2880 |
| **Attention** | 24 layers (sliding + full, alternating) |
| **Mamba** | 12 layers (interleaved 2:1) |
| **MoE** | 32 experts, top-4 routing |
| **Vocab** | 201,088 tokens |
| **Total Blocks** | 36 (24 Attn + 12 Mamba) |

```
┌─────────────────────────────────────────────┐
│ GPT-OSS 20B (Attention + MoE)               │
│   ↓ Surgery (inject 12 Mamba layers)        │
│ Hybrid: A-A-M-A-A-M-... pattern             │
│   ↓ Phase 1 (train Mamba only)              │
│ Mamba learns to "speak GPT-OSS language"    │
│   ↓ Phase 2 (enable Matryoshka)             │
│ Adaptive compute: 128 → 2880 dim per layer  │
└─────────────────────────────────────────────┘
```

## Training Status

**Phase 1**: Mamba integration (freeze Attention+MoE, train Mamba)

## Usage

```python
# Coming soon - inference code
# See: https://github.com/unixsysdev/adamba
```

Until official inference code lands, a hedged checkpoint-inspection sketch is included at the end of this card.

## License

Apache 2.0 (same as GPT-OSS)
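
## Layer Schedule (Unofficial Sketch)

The hybrid layout in the architecture diagram (24 attention blocks and 12 Mamba blocks, interleaved 2:1 in an A-A-M pattern) can be expressed as a tiny schedule builder. This is a minimal sketch for illustration only; the block labels and the helper name `build_block_schedule` are assumptions, not the actual module names used in the checkpoint.

```python
# Minimal sketch of the 36-block hybrid layout: 24 attention blocks and
# 12 Mamba blocks interleaved 2:1 (A-A-M repeated). Labels are illustrative.
def build_block_schedule(n_attn: int = 24, n_mamba: int = 12) -> list[str]:
    assert n_attn == 2 * n_mamba, "2:1 interleave assumes twice as many attention blocks"
    schedule: list[str] = []
    for _ in range(n_mamba):
        schedule += ["attn", "attn", "mamba"]  # two attention blocks, then one Mamba block
    return schedule


blocks = build_block_schedule()
print(len(blocks))   # 36
print(blocks[:6])    # ['attn', 'attn', 'mamba', 'attn', 'attn', 'mamba']
```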
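
## Checkpoint Inspection (Unofficial Sketch)

Until the official inference code is published, one safe way to use the Phase 1 checkpoint is to load it on CPU and inspect which tensors the Mamba surgery added. The assumed file layout (either a raw `state_dict` or a dict wrapping one under a `"model"` key) is a guess; adjust the key handling if the checkpoint is saved differently.

```python
import torch

# Phase 1 checkpoint from the "Available Checkpoints" table above.
CKPT = "checkpoints/gptoss_phase1.pt"

obj = torch.load(CKPT, map_location="cpu")
# Assumption: the file is either a raw state_dict or wraps one under "model".
state_dict = obj.get("model", obj) if isinstance(obj, dict) else obj

total_params = sum(t.numel() for t in state_dict.values() if torch.is_tensor(t))
mamba_keys = [k for k in state_dict if "mamba" in k.lower()]

print(f"total parameters: {total_params / 1e9:.1f}B")   # expected ~21.9B
print(f"tensors with 'mamba' in the name: {len(mamba_keys)}")
print("examples:", mamba_keys[:5])
```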