NanoGPT-X

What This Is

NanoGPT-X is a 15.6M-parameter decoder-only transformer trained on WikiText-2. It combines architectural and training techniques from DeepSeek, Meta, Google, Microsoft, Mistral, and Stanford in a single file that trains on a T4 GPU in under 5 hours.

Architecture

Input -> [Embed] -> [Block x 4] -> [RMSNorm] -> [LM Head] -> Output
                    |
                    v
              Block = x + alpha * Attn(Norm(x)) + alpha * MLP(Norm(x))
                    |
                    v
              Attn  = MLA (default) | DiffAttn (optional)
              MLP   = SwiGLU
              alpha = DeepNorm scaling = sqrt(2N) = sqrt(8) ≈ 2.83 (N = 4 blocks)
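
The residual formula in the diagram can be sketched in a few lines of plain Python. This is a literal reading of the formula, not the code from the model file: both branches read the same normalized input, and `attn`, `mlp`, and `norm` are hypothetical stand-in callables that operate on lists of floats. The actual implementation may apply the sub-layers sequentially or use a separate norm per branch.

```python
import math

N_LAYERS = 4                      # "Block x 4" in the diagram
ALPHA = math.sqrt(2 * N_LAYERS)   # sqrt(2N) = sqrt(8), about 2.83


def block(x, attn, mlp, norm, alpha=ALPHA):
    """Return x + alpha * attn(norm(x)) + alpha * mlp(norm(x)).

    x is a list of floats; attn, mlp, and norm are placeholder callables
    (assumptions for illustration) that map a list to a list of equal length.
    """
    h = norm(x)   # one shared normalized input for both branches (assumption)
    a = attn(h)
    m = mlp(h)
    return [xi + alpha * ai + alpha * mi for xi, ai, mi in zip(x, a, m)]


if __name__ == "__main__":
    identity = lambda v: v
    zeros = lambda v: [0.0] * len(v)
    # With an identity attention branch and a zero MLP branch,
    # each output element is x * (1 + alpha).
    print(block([1.0, 2.0], attn=identity, mlp=zeros, norm=identity))
```

Passing the sub-layers in as callables keeps the sketch independent of whether attention is MLA or DiffAttn.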

Components

| Component | Source | What It Does |
|---|---|---|
| MLA | DeepSeek-V3 | KV cache compression to 32-dim latent (8x smaller) |
| MTP | DeepSeek-V3 | Predicts tokens t+2 and t+3 alongside t+1 for denser training signals and better data efficiency |
| DiffAttn | Microsoft 2024 | Signal-minus-noise attention filtering |
| SWA | Mistral | Local attention window of 128 tokens |
| RoPE+NTK | Meta / CodeLLaMA | Relative position with length extrapolation |
| DeepNorm | Microsoft | Residual scaling for deep network stability |
| RMSNorm | LLaMA / PaLM | Fast normalization without mean-centering |
| QK-Norm | Gemma 2 | Pre-attention query/key normalization |
| SwiGLU | PaLM / LLaMA | Gated FFN activation (8/3 ratio) |
| Z-loss | PaLM / Chinchilla | Logit regularization preventing softmax drift |
| Lion | Google Brain 2023 | Sign-momentum optimizer |
| WSD | DeepSeek / MiniMax | Warmup-Stable-Decay LR schedule |
| Flash Attention 2 | Stanford | Fused attention kernel; memory linear in sequence length (no full attention matrix stored) |
| torch.compile | PyTorch 2.0+ | Graph compilation with operator fusion |
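
Three of the smaller components in the table (RMSNorm, Z-loss, and the WSD schedule) reduce to short formulas. The sketch below uses only the standard library to show the arithmetic. It is not taken from the model file: the function names, the default coefficients (`eps`, `coeff`, `peak_lr`, `warmup_frac`, `decay_frac`), and the linear decay shape are all assumptions chosen for illustration.

```python
import math


def rms_norm(x, weight=None, eps=1e-6):
    """Divide x by its root-mean-square; no mean is subtracted."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    weight = weight or [1.0] * len(x)   # learned gain; ones here (assumption)
    return [w * v / rms for v, w in zip(x, weight)]


def z_loss(logits, coeff=1e-4):
    """Auxiliary penalty coeff * log(Z)^2, with Z = sum(exp(logits)).

    Pushes log(Z) toward 0 so the softmax normalizer does not drift.
    The coefficient value is an assumption.
    """
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return coeff * log_z ** 2


def wsd_lr(step, total_steps, peak_lr=1e-4,
           warmup_frac=0.05, decay_frac=0.2, min_lr=0.0):
    """Warmup-Stable-Decay: linear warmup, flat plateau, linear decay.

    The phase fractions and the linear decay shape are assumptions.
    """
    warmup_steps = max(1, int(total_steps * warmup_frac))
    decay_steps = max(1, int(total_steps * decay_frac))
    decay_start = total_steps - decay_steps
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    if step < decay_start:
        return peak_lr
    progress = min(1.0, (step - decay_start) / decay_steps)
    return peak_lr + (min_lr - peak_lr) * progress


if __name__ == "__main__":
    print(rms_norm([3.0, 4.0]))
    print(z_loss([0.0, 0.0], coeff=1.0))
    print([wsd_lr(s, 1000) for s in (0, 500, 900)])
```

In a PyTorch training loop, a function like `wsd_lr` would typically be wrapped so that it returns a multiplier of the base learning rate rather than an absolute value.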