---
license: apache-2.0
---
# NanoGPT-X

## What This Is

NanoGPT-X is a 15.6M-parameter decoder-only transformer trained on WikiText-2.
It integrates architectural innovations from DeepSeek, Meta, Google, Microsoft,
Mistral, and Stanford into a single file that trains on a T4 GPU in under 5 hours.

## Architecture

```
Input -> [Embed] -> [Block x 4] -> [RMSNorm] -> [LM Head] -> Output
                    |
                    v
              Block = x + alpha * Attn(Norm(x)) + alpha * MLP(Norm(x))
                    |
                    v
              Attn  = MLA (default) | DiffAttn (optional)
              MLP   = SwiGLU
              alpha = DeepNorm scaling = sqrt(2N) = 2.83 (N = 4 blocks)
```
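As a rough illustration of the diagram, the sketch below reads the block formula literally: both sub-layers see the same normalized input, and their outputs are scaled by `alpha` before being added to the residual. The helper names (`deepnorm_alpha`, `rms_norm`, `block`) are hypothetical and operate on plain Python lists rather than tensors; the actual training file may order or fuse these steps differently.

```python
import math

def deepnorm_alpha(num_blocks: int) -> float:
    """Scaling factor as stated in the diagram: alpha = sqrt(2N)."""
    return math.sqrt(2 * num_blocks)

def rms_norm(x, eps=1e-6):
    """RMSNorm without a learned gain: divide by the root mean square."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

def block(x, attn, mlp, alpha, norm=rms_norm):
    """Block = x + alpha * Attn(Norm(x)) + alpha * MLP(Norm(x)).

    `attn` and `mlp` are any callables mapping a list to a list of the
    same length (stand-ins for the real sub-layers; an assumption here).
    """
    h = norm(x)
    a = attn(h)
    m = mlp(h)
    return [xi + alpha * ai + alpha * mi for xi, ai, mi in zip(x, a, m)]

alpha = deepnorm_alpha(4)  # N = 4 blocks, as in the diagram
```

With `N = 4`, `deepnorm_alpha` evaluates `sqrt(8)`, which rounds to the 2.83 shown in the diagram.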

## Components

| Component | Source | What It Does |
|-----------|--------|--------------|
| MLA | DeepSeek-V3 | KV cache compression to 32-dim latent (8x smaller) |
| MTP | DeepSeek-V3 | Predicts t+2, t+3 alongside t+1 for a denser training signal |
| DiffAttn | Microsoft 2024 | Signal-minus-noise attention filtering |
| SWA | Mistral | Local attention window of 128 tokens |
| RoPE+NTK | Meta / CodeLLaMA | Relative position with length extrapolation |
| DeepNorm | Microsoft | Residual scaling for deep network stability |
| RMSNorm | LLaMA / PaLM | Fast normalization without mean-centering |
| QK-Norm | Gemma 3 | Pre-attention query/key normalization |
| SwiGLU | PaLM / LLaMA | Gated FFN activation (8/3 ratio) |
| Z-loss | PaLM / Chinchilla | Logit regularization preventing softmax drift |
| Lion | Google Brain 2023 | Sign-momentum optimizer |
| WSD | DeepSeek / MiniCPM | Warmup-Stable-Decay LR schedule |
| Flash Attention 2 | Stanford | Fused attention kernel with O(N) memory in sequence length |
| torch.compile | PyTorch 2.0+ | Graph compilation with operator fusion |
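The training-side rows of the table (Z-loss, WSD, Lion) are easier to follow with the arithmetic written out. The following is a minimal scalar sketch, not the repository's implementation: the function names, the `1e-4` z-loss coefficient (the value reported for PaLM), the linear decay shape, and the Lion defaults `beta1=0.9`, `beta2=0.99` are all assumptions chosen for illustration.

```python
import math

def z_loss(logits, coeff=1e-4):
    """Auxiliary penalty coeff * log(Z)^2 with Z = sum(exp(logits)).

    Pushes log(Z) toward 0 so the softmax normalizer does not drift.
    """
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return coeff * log_z ** 2

def wsd_lr(step, total_steps, peak_lr, warmup_steps, decay_steps, min_lr=0.0):
    """Warmup-Stable-Decay: linear warmup, flat plateau, then decay.

    The decay here is linear; other decay shapes are also used in practice.
    """
    decay_start = total_steps - decay_steps
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    if step < decay_start:
        return peak_lr
    progress = min(1.0, (step - decay_start) / max(1, decay_steps))
    return peak_lr + (min_lr - peak_lr) * progress

def lion_step(param, grad, momentum, lr, beta1=0.9, beta2=0.99, weight_decay=0.0):
    """One Lion update for a scalar parameter.

    The step direction is only the sign of an interpolation between the
    momentum and the current gradient; returns (new_param, new_momentum).
    """
    interp = beta1 * momentum + (1 - beta1) * grad
    sign = (interp > 0) - (interp < 0)  # -1, 0, or +1
    new_param = param - lr * (sign + weight_decay * param)
    new_momentum = beta2 * momentum + (1 - beta2) * grad
    return new_param, new_momentum
```

Because Lion applies the sign of the update, every parameter moves by the same magnitude `lr` per step (plus weight decay), which is why it is typically run with a smaller learning rate than AdamW.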