---
license: apache-2.0
---

# NanoGPT-X

## What This Is

NanoGPT-X is a 15.6M-parameter decoder-only transformer trained on WikiText-2. It combines architectural techniques from DeepSeek, Meta, Google, Microsoft, Mistral, and Stanford in a single file that trains on a T4 GPU in under 5 hours.

## Architecture

```
Input -> [Embed] -> [Block x 4] -> [RMSNorm] -> [LM Head] -> Output
                        |
                        v
        Block = x + alpha * Attn(Norm(x)) + alpha * MLP(Norm(x))
                        |
                        v
        Attn  = MLA (default) | DiffAttn (optional)
        MLP   = SwiGLU
        alpha = DeepNorm scaling = sqrt(2N) = 2.83   (N = 4 blocks)
```

## Components

| Component | Source | What It Does |
|-----------|--------|--------------|
| MLA | DeepSeek-V3 | KV cache compression to 32-dim latent (8x smaller) |
| MTP | DeepSeek-V3 | Predicts t+2 and t+3 alongside t+1 for better data efficiency |
| DiffAttn | Microsoft 2024 | Signal-minus-noise attention filtering |
| SWA | Mistral | Local attention window of 128 tokens |
| RoPE+NTK | Meta / CodeLLaMA | Relative position with length extrapolation |
| DeepNorm | Microsoft | Residual scaling for deep network stability |
| RMSNorm | LLaMA / PaLM | Fast normalization without mean-centering |
| QK-Norm | Gemma 2 | Pre-attention query/key normalization |
| SwiGLU | PaLM / LLaMA | Gated FFN activation (hidden size 8/3 x model dim) |
| Z-loss | PaLM / Chinchilla | Logit regularization preventing softmax drift |
| Lion | Google Brain 2023 | Sign-momentum optimizer |
| WSD | DeepSeek / MiniMax | Warmup-Stable-Decay LR schedule |
| Flash Attention 2 | Stanford | Fused attention kernel with O(N) memory in sequence length (no N x N score matrix stored) |
| torch.compile | PyTorch 2.0+ | Graph compilation with operator fusion |
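
The SWA row above describes a local attention window of 128 tokens. The following is a minimal sketch of what such a mask looks like, not the project's actual code: the function name `sliding_window_causal_mask` is hypothetical, and it assumes the window counts the current token, so each query sees itself plus the `window - 1` tokens before it.

```python
def sliding_window_causal_mask(seq_len: int, window: int = 128) -> list[list[bool]]:
    """Return mask[i][j] == True where query position i may attend to key position j.

    Assumption: the window includes the current token, so position i sees
    positions i - window + 1 .. i (clipped at 0) and nothing in the future.
    """
    return [
        [(j <= i) and (i - j < window) for j in range(seq_len)]
        for i in range(seq_len)
    ]


# Tiny example with a window of 3 so the band structure is visible.
mask = sliding_window_causal_mask(seq_len=6, window=3)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
# x.....
# xx....
# xxx...
# .xxx..
# ..xxx.
# ...xxx
```

With the default `window=128`, sequences of 128 tokens or fewer reduce to an ordinary causal mask.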
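
The Z-loss row above can be illustrated with a small pure-Python sketch. It follows the PaLM formulation, where the auxiliary term is a coefficient times the squared log of the softmax normalizer. The helper names and the `1e-4` coefficient (the value PaLM reports) are assumptions here; the coefficient NanoGPT-X actually uses is not stated in this README.

```python
import math


def logsumexp(logits: list[float]) -> float:
    """Numerically stable log(sum(exp(logits)))."""
    m = max(logits)
    return m + math.log(sum(math.exp(v - m) for v in logits))


def cross_entropy_with_z_loss(logits: list[float], target: int, z_coeff: float = 1e-4):
    """Return (total, cross_entropy, z_loss) for a single token.

    z_loss = z_coeff * log(Z)^2 penalizes a softmax normalizer Z that
    drifts away from 1, without changing the predicted distribution.
    """
    log_z = logsumexp(logits)
    ce = log_z - logits[target]
    z_loss = z_coeff * log_z ** 2
    return ce + z_loss, ce, z_loss


# Shifting every logit by a constant leaves cross-entropy unchanged,
# but the z-loss term grows, which is what discourages the drift.
total, ce, z = cross_entropy_with_z_loss([0.0, 0.0], target=0)
shifted_total, shifted_ce, shifted_z = cross_entropy_with_z_loss([10.0, 10.0], target=0)
print(f"ce={ce:.4f} z={z:.6f} | shifted ce={shifted_ce:.4f} z={shifted_z:.6f}")
```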
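
For the Lion row above, the sign-momentum update can be sketched on plain Python lists. This follows the published Lion update rule with its default betas (0.9 and 0.99); the function name `lion_step` and the learning rate used in the example are illustrative assumptions, not settings taken from NanoGPT-X.

```python
def sign(v: float) -> int:
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def lion_step(params, grads, momentum, lr=1e-4, beta1=0.9, beta2=0.99, weight_decay=0.0):
    """One Lion update over flat lists of floats; returns (new_params, new_momentum)."""
    new_params, new_momentum = [], []
    for p, g, m in zip(params, grads, momentum):
        # The update direction is only the sign of an interpolation between
        # the momentum and the current gradient, so every step has magnitude lr.
        update = sign(beta1 * m + (1 - beta1) * g)
        new_params.append(p - lr * (update + weight_decay * p))
        # The momentum itself is tracked with a separate, slower beta.
        new_momentum.append(beta2 * m + (1 - beta2) * g)
    return new_params, new_momentum


# Hypothetical values chosen so the arithmetic is easy to follow.
params, momentum = lion_step([1.0, -2.0], [0.5, -3.0], [0.0, 0.0], lr=0.1)
print(params, momentum)
```

Because only the sign of the update is used, the two parameters move by the same amount (0.1) even though their gradients differ by a factor of six.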
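
The WSD row above names a three-phase learning-rate schedule. Here is one way it might look, assuming linear warmup and linear decay; the function name `wsd_lr`, the phase lengths, and the peak rate are hypothetical, and the real schedule may use a different decay shape.

```python
def wsd_lr(step: int, total_steps: int, peak_lr: float = 1e-3,
           warmup_steps: int = 100, decay_steps: int = 200, min_lr: float = 0.0) -> float:
    """Warmup-Stable-Decay learning rate for a given optimizer step.

    Assumptions: linear warmup to peak_lr, a flat plateau, then a linear
    decay to min_lr over the last decay_steps steps.
    """
    if step < warmup_steps:
        # Warmup: ramp linearly, reaching peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    decay_start = total_steps - decay_steps
    if step < decay_start:
        # Stable: hold the peak learning rate.
        return peak_lr
    # Decay: interpolate linearly from peak_lr down to min_lr.
    progress = min((step - decay_start) / decay_steps, 1.0)
    return peak_lr + (min_lr - peak_lr) * progress


for s in (0, 99, 500, 900, 1000):
    print(s, wsd_lr(s, total_steps=1000))
```

A practical appeal of this shape is that the stable phase does not depend on `total_steps`, so a run can be extended and only the final decay needs to be rescheduled.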