dharun2049/Transformer

---
license: apache-2.0
---
# NanoGPT-X

## What This Is

NanoGPT-X is a 15.6M-parameter decoder-only transformer trained on WikiText-2.
It combines architectural techniques from DeepSeek, Meta, Google, Microsoft,
Mistral, and Stanford in a single file that trains on a single T4 GPU in under 5 hours.

## Architecture

```
Input -> [Embed] -> [Block x 4] -> [RMSNorm] -> [LM Head] -> Output
                        |
                        v
Block = x + alpha * Attn(Norm(x)) + alpha * MLP(Norm(x))
                        |
                        v
Attn  = MLA (default) | DiffAttn (optional)
MLP   = SwiGLU
alpha = DeepNorm scaling = sqrt(2N) = 2.83 for N = 4 blocks
```
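The residual update in the diagram can be sketched in plain Python. This is a minimal illustration that takes the `Block` line literally (both sublayers read the same normalized input), not the repository's actual code: `deepnorm_alpha`, `rms_norm`, and `block` are hypothetical names, the norm has no learned gain, and the attention and MLP sublayers are passed in as placeholder callables.

```python
import math


def deepnorm_alpha(num_blocks: int) -> float:
    """Residual scale as stated in the diagram: alpha = sqrt(2N)."""
    return math.sqrt(2 * num_blocks)


def rms_norm(x, eps=1e-6):
    """RMSNorm without a learned gain: divide by the root mean square."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]


def block(x, attn, mlp, alpha):
    """One block as drawn: x + alpha * Attn(Norm(x)) + alpha * MLP(Norm(x)).

    `attn` and `mlp` are placeholder callables (assumptions) that map a
    vector to a vector of the same length.
    """
    h = rms_norm(x)
    a = attn(h)
    m = mlp(h)
    return [xi + alpha * ai + alpha * mi for xi, ai, mi in zip(x, a, m)]


if __name__ == "__main__":
    alpha = deepnorm_alpha(4)  # N = 4 blocks, as in the diagram
    print(round(alpha, 2))
    # Identity sublayers stand in for real attention and MLP modules.
    print(block([3.0, 4.0], lambda h: h, lambda h: h, alpha))
```

With N = 4 the scale rounds to 2.83, matching the value quoted in the diagram. If the real file applies the two sublayers one after the other instead of in parallel, `block` would need two residual updates.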

## Components

| Component | Source | What It Does |
|-----------|--------|--------------|
| MLA | DeepSeek-V2 / V3 | Compresses the KV cache into a 32-dim latent (8x smaller) |
| MTP | DeepSeek-V3 | Predicts tokens t+2 and t+3 alongside t+1 for better sample efficiency |
| DiffAttn | Microsoft 2024 | Subtracts two softmax attention maps to cancel attention noise |
| SWA | Mistral | Restricts attention to a local window of 128 tokens |
| RoPE+NTK | Meta / Code Llama | Relative position encoding with length extrapolation |
| DeepNorm | Microsoft | Residual scaling for deep-network stability |
| RMSNorm | LLaMA / PaLM | Fast normalization without mean-centering |
| QK-Norm | Gemma 3 | Normalizes queries and keys before the attention dot product |
| SwiGLU | PaLM / LLaMA | Gated FFN activation (hidden width 8/3 x d_model) |
| Z-loss | PaLM / Chinchilla | Logit regularization that keeps the softmax normalizer from drifting |
| Lion | Google Brain 2023 | Sign-momentum optimizer |
| WSD | DeepSeek / MiniCPM | Warmup-Stable-Decay LR schedule |
| Flash Attention 2 | Stanford | Fused attention kernel with O(N) memory in sequence length instead of O(N^2) |
| torch.compile | PyTorch 2.0+ | Graph compilation with operator fusion |
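
Two of the training-side rows in the table, WSD and Z-loss, are small enough to sketch in plain Python. This is an illustration under assumptions, not the code from the repository: the function names, the linear decay shape, the step counts, and the `1e-4` coefficient are all hypothetical choices made for the example.

```python
import math


def wsd_lr(step, total_steps, peak_lr, warmup_steps, decay_steps, min_lr=0.0):
    """Warmup-Stable-Decay: linear warmup, flat plateau, then linear decay.

    The linear decay shape is an assumption; other decay curves are possible.
    """
    decay_start = total_steps - decay_steps
    if step < warmup_steps:
        return peak_lr * (step + 1) / max(1, warmup_steps)
    if step < decay_start:
        return peak_lr
    progress = min(1.0, (step - decay_start) / max(1, decay_steps))
    return peak_lr + (min_lr - peak_lr) * progress


def z_loss(logits, coeff=1e-4):
    """Auxiliary penalty coeff * (log Z)^2, where Z is the softmax normalizer.

    log Z is computed with the max-subtraction trick for numerical stability.
    """
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return coeff * log_z ** 2


if __name__ == "__main__":
    # Hypothetical schedule: 1000 steps, 50 warmup steps, 200 decay steps.
    for step in (0, 49, 500, 900, 1000):
        print(step, wsd_lr(step, 1000, 1e-3, warmup_steps=50, decay_steps=200))
    print(z_loss([2.0, 1.0, 0.1]))
```

The Z-loss term is added to the usual cross-entropy loss. It is zero only when the logits' log-sum-exp is zero, so it nudges the softmax normalizer toward 1.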