---
license: apache-2.0
---
# NanoGPT-X

## What This Is

NanoGPT-X is a 15.6M-parameter decoder-only transformer trained on WikiText-2.
It combines architectural techniques from DeepSeek, Meta, Google, Microsoft,
Mistral, and Stanford in a single file that trains on a T4 GPU in under 5 hours.

## Architecture
```
Input -> [Embed] -> [Block x 4] -> [RMSNorm] -> [LM Head] -> Output
                         |
                         v
Block = x + alpha * Attn(Norm(x)) + alpha * MLP(Norm(x))
                     |
                     v
Attn = MLA (default) | DiffAttn (optional)
MLP = SwiGLU
alpha = DeepNorm scaling = sqrt(2N) = sqrt(8) ≈ 2.83 (N = 4 blocks)
```
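
The block formula in the diagram can be read as a few lines of arithmetic. The following is a minimal pure-Python sketch of that reading, not the code from the training file: the names `rms_norm`, `residual_alpha`, and `block` are hypothetical, the norm has no learned gain, and a single `Norm(x)` is shared by both branches, which is one literal interpretation of the formula (the actual file may use separate norms or sequential sublayers).

```python
import math


def rms_norm(x, eps=1e-6):
    """RMSNorm without a learned gain: divide by the root mean square.

    Unlike LayerNorm, the mean is not subtracted first.
    """
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]


def residual_alpha(n_layers):
    """Residual scale as stated in the diagram: alpha = sqrt(2N)."""
    return math.sqrt(2 * n_layers)


def block(x, attn, mlp, alpha):
    """One block as drawn: x + alpha * Attn(Norm(x)) + alpha * MLP(Norm(x)).

    `attn` and `mlp` are placeholders: any callables that map a list of
    floats to a list of the same length (assumption for illustration).
    """
    h = rms_norm(x)
    a = attn(h)
    m = mlp(h)
    return [xi + alpha * ai + alpha * mi for xi, ai, mi in zip(x, a, m)]


if __name__ == "__main__":
    alpha = residual_alpha(4)  # N = 4 blocks, so alpha = sqrt(8)
    identity = lambda h: list(h)
    zero = lambda h: [0.0] * len(h)
    print(round(alpha, 2))
    print(block([3.0, 4.0], identity, zero, alpha))
```

With both sublayers returning zeros, `block` returns its input unchanged, which is the residual path in isolation.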

## Components

| | Component | Source | What It Does | |
| |-----------|--------|--------------| |
| | MLA | DeepSeek-V3 | KV cache compression to 32-dim latent (8x smaller) | |
| MTP | DeepSeek-V3 | Predicts tokens t+2 and t+3 alongside t+1 for better sample efficiency |
| DiffAttn | Microsoft 2024 | Subtracts one softmax attention map from another to cancel attention noise |
| | SWA | Mistral | Local attention window of 128 tokens | |
| RoPE+NTK | Meta / CodeLLaMA | Rotary relative position encoding with NTK-style scaling for length extrapolation |
| | DeepNorm | Microsoft | Residual scaling for deep network stability | |
| | RMSNorm | LLaMA / PaLM | Fast normalization without mean-centering | |
| | QK-Norm | Gemma 2 | Pre-attention query/key normalization | |
| | SwiGLU | PaLM / LLaMA | Gated FFN activation (8/3 ratio) | |
| | Z-loss | PaLM / Chinchilla | Logit regularization preventing softmax drift | |
| | Lion | Google Brain 2023 | Sign-momentum optimizer | |
| | WSD | DeepSeek / MiniMax | Warmup-Stable-Decay LR schedule | |
| Flash Attention 2 | Stanford | Fused attention kernel with memory linear, not quadratic, in sequence length |
| | torch.compile | PyTorch 2.0+ | Graph compilation with operator fusion | |
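
Three of the training-side rows (WSD, Z-loss, Lion) are small enough to write out as scalar arithmetic. The sketch below is illustrative only and is not taken from the training file: the function names, the linear decay shape in `wsd_lr`, and all default values are assumptions. The `1e-4` z-loss coefficient and the Lion betas of 0.9 and 0.99 are the defaults from the respective papers; what this repository actually uses is not stated here.

```python
import math


def wsd_lr(step, total_steps, peak_lr, warmup_steps, decay_steps, min_lr=0.0):
    """Warmup-Stable-Decay: linear warmup, flat plateau, then decay.

    The decay phase is linear here (assumption); other shapes are possible.
    """
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    decay_start = total_steps - decay_steps
    if decay_steps <= 0 or step < decay_start:
        return peak_lr
    frac = min(1.0, (step - decay_start) / decay_steps)
    return peak_lr + (min_lr - peak_lr) * frac


def z_loss(logits, coeff=1e-4):
    """Auxiliary loss coeff * (log Z)^2, where Z = sum(exp(logits)).

    Penalizes the softmax normalizer for drifting away from log Z = 0.
    """
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return coeff * log_z ** 2


def lion_step(param, grad, momentum, lr, beta1=0.9, beta2=0.99, weight_decay=0.0):
    """One Lion update for a single scalar parameter.

    The step direction is the sign of an interpolation between momentum and
    gradient; the momentum itself is updated with a second coefficient.
    """
    blended = beta1 * momentum + (1 - beta1) * grad
    sign = (blended > 0) - (blended < 0)  # -1, 0, or 1
    new_param = param - lr * (sign + weight_decay * param)
    new_momentum = beta2 * momentum + (1 - beta2) * grad
    return new_param, new_momentum


if __name__ == "__main__":
    # Hypothetical run: 100 steps, 10 warmup, 20 decay.
    for s in (0, 9, 50, 90, 100):
        print(s, wsd_lr(s, 100, 1.0, 10, 20))
    print(z_loss([2.0, 1.0, 0.5]))
    print(lion_step(1.0, 0.5, 0.0, lr=0.1))
```

Because Lion only uses the sign of the blended term, every parameter moves by the same magnitude `lr` per step (before weight decay), which is why it is typically run with a smaller learning rate than AdamW.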