nanoGPT: Step-by-Step Tutorial
This tutorial walks through building and training a tiny GPT from scratch in pure PyTorch. No transformers library, no pre-trained weights: just ~200 lines of clean code.
Table of Contents
- Overview
- Dataset Preparation
- Model Architecture
- Training Loop
- Generation / Inference
- Results
- Files in this Repo
1. Overview
We train a character-level language model on tiny Shakespeare (~1.1M characters, 65 unique characters).
The model learns to predict the next character given all previous characters, autoregressively. This is exactly how GPT-2, GPT-3, and ChatGPT work, just at character scale instead of word/BPE token scale.
Model size: ~10.8M parameters
Architecture: 6 layers, 6 heads, 384 embedding dim, 256 context length
2. Dataset Preparation (prepare.py)
What happens:
- Download tiny Shakespeare text
- Discover vocabulary: find all unique characters → 65 chars
- Build mappings:
  - stoi (string-to-int): 'a' → 0, 'b' → 1, ...
  - itos (int-to-string): the reverse lookup
- Encode the entire text as integers
- Split 90% train / 10% validation
- Save as data.pt (PyTorch tensors for fast loading)
Key concept: Character-level tokenization
```python
chars = sorted(list(set(text)))                  # vocabulary
vocab_size = len(chars)                          # 65
encode = lambda s: [stoi[c] for c in s]          # "hello" -> [46, 43, 50, 50, 53]
decode = lambda l: "".join([itos[i] for i in l])
```
No tokenizer library needed! For plain English text like this corpus, ~65 characters cover everything.
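A minimal sketch of the whole preparation step (variable names here are illustrative; prepare.py may differ in detail):

```python
import torch

# read the raw corpus
with open("input.txt", "r", encoding="utf-8") as f:
    text = f.read()

chars = sorted(list(set(text)))               # the 65-character vocabulary
stoi = {ch: i for i, ch in enumerate(chars)}  # string-to-int
itos = {i: ch for ch, i in stoi.items()}      # int-to-string

data = torch.tensor([stoi[c] for c in text], dtype=torch.long)
n = int(0.9 * len(data))                      # 90% train / 10% val
torch.save({"train": data[:n], "val": data[n:],
            "stoi": stoi, "itos": itos}, "data.pt")
```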
3. Model Architecture (model.py)
3.1 Configuration (GPTConfig)
```python
from dataclasses import dataclass

@dataclass
class GPTConfig:
    block_size: int = 256  # max sequence length
    vocab_size: int = 65   # number of unique characters
    n_layer: int = 6       # transformer blocks
    n_head: int = 6        # attention heads per block
    n_embd: int = 384      # embedding dimension
```
3.2 Causal Self-Attention
The core idea: every token can "look at" all previous tokens to decide what comes next.
For each token:
- Query = "What am I looking for?"
- Key = "What do I contain?"
- Value = "What information do I have?"
- Attention score = Query · Key (scaled)
- Causal mask = prevent looking at future tokens
- Output = weighted sum of Values
We use multi-head attention: split embeddings into 6 parallel attention operations (heads), run them simultaneously, then concatenate.
Code flow:
```text
Input (B, T, C)
  → c_attn → (Q, K, V), each (B, T, C)
  → reshape to (B, n_head, T, head_size)
  → Q @ K.T → attention scores (B, n_head, T, T)
  → causal mask → softmax → weighted sum of V
  → reshape back → c_proj → Output (B, T, C)
```
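Here is a minimal, self-contained sketch of that flow (assuming the GPTConfig above; model.py may organize the code differently):

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.n_head = config.n_head
        self.n_embd = config.n_embd
        self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd)  # Q, K, V in one matmul
        self.c_proj = nn.Linear(config.n_embd, config.n_embd)
        # lower-triangular matrix: position t may only attend to positions <= t
        mask = torch.tril(torch.ones(config.block_size, config.block_size))
        self.register_buffer("mask", mask.view(1, 1, config.block_size, config.block_size))

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.c_attn(x).split(self.n_embd, dim=2)
        # (B, T, C) -> (B, n_head, T, head_size)
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        att = (q @ k.transpose(-2, -1)) / math.sqrt(k.size(-1))  # scaled scores (B, n_head, T, T)
        att = att.masked_fill(self.mask[:, :, :T, :T] == 0, float("-inf"))  # causal mask
        att = F.softmax(att, dim=-1)
        y = att @ v                                       # weighted sum of values
        y = y.transpose(1, 2).contiguous().view(B, T, C)  # merge heads back
        return self.c_proj(y)
```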
3.3 MLP (Feed-Forward)
After attention, each token gets a private "thinking step":
```text
(B, T, C) → Linear(4*C) → GELU → Linear(C) → (B, T, C)
```
The 4× expansion is standard in transformers.
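As a sketch (reusing the imports and GPTConfig from the attention example above):

```python
class MLP(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.c_fc = nn.Linear(config.n_embd, 4 * config.n_embd)    # 4x expansion
        self.gelu = nn.GELU()
        self.c_proj = nn.Linear(4 * config.n_embd, config.n_embd)  # project back down

    def forward(self, x):
        return self.c_proj(self.gelu(self.c_fc(x)))
```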
3.4 Transformer Block
```python
x = x + Attention(LayerNorm(x))  # pre-norm residual
x = x + MLP(LayerNorm(x))        # pre-norm residual
```
Pre-LayerNorm (normalize before sublayer) is used by GPT-2/3/Llama.
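A sketch combining the two sublayers defined above:

```python
class Block(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.ln_1 = nn.LayerNorm(config.n_embd)
        self.attn = CausalSelfAttention(config)
        self.ln_2 = nn.LayerNorm(config.n_embd)
        self.mlp = MLP(config)

    def forward(self, x):
        x = x + self.attn(self.ln_1(x))  # pre-norm residual
        x = x + self.mlp(self.ln_2(x))   # pre-norm residual
        return x
```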
3.5 Full GPT Model
1. Token Embedding (wte): char index β vector
2. Position Embedding (wpe): position index β vector
3. Sum them: x = wte + wpe
4. Pass through N transformer blocks
5. Final LayerNorm
6. Language Model Head: project to vocab_size logits
7. Cross-entropy loss against next-character targets
Weight tying: wte (input embedding) shares weights with lm_head (output projection). Saves parameters, improves training.
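A sketch of the full model and its forward pass, including the weight tying (names follow GPT-2 conventions; details in model.py may differ):

```python
class GPT(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.wte = nn.Embedding(config.vocab_size, config.n_embd)  # token embedding
        self.wpe = nn.Embedding(config.block_size, config.n_embd)  # position embedding
        self.blocks = nn.ModuleList([Block(config) for _ in range(config.n_layer)])
        self.ln_f = nn.LayerNorm(config.n_embd)                    # final LayerNorm
        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)
        self.lm_head.weight = self.wte.weight                      # weight tying

    def forward(self, idx, targets=None):
        B, T = idx.shape
        pos = torch.arange(T, device=idx.device)
        x = self.wte(idx) + self.wpe(pos)  # (B, T, C)
        for block in self.blocks:
            x = block(x)
        x = self.ln_f(x)
        logits = self.lm_head(x)           # (B, T, vocab_size)
        loss = None
        if targets is not None:
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
        return logits, loss
```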
4. Training Loop (train.py / train_standalone.py)
4.1 Batch sampling
For each training step, grab random contiguous chunks:
```python
def get_batch(split):
    # pick the right split (train_data/val_data loaded from data.pt; names illustrative)
    data = train_data if split == "train" else val_data
    ix = torch.randint(len(data) - BLOCK_SIZE, (BATCH_SIZE,))
    x = torch.stack([data[i : i + BLOCK_SIZE] for i in ix])          # inputs
    y = torch.stack([data[i + 1 : i + BLOCK_SIZE + 1] for i in ix])  # targets, shifted by 1
    return x, y
```
4.2 Learning rate schedule
Cosine with linear warmup:
- Steps 0-200: LR ramps up from 0 → 1e-3 (linear warmup)
- Steps 200-5000: LR decays along a cosine curve to 1e-4 (cosine annealing)
Warmup prevents early loss spikes when gradients are large.
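A sketch of the schedule as a standalone function, using the constants quoted above:

```python
import math

def get_lr(step, warmup=200, max_steps=5000, max_lr=1e-3, min_lr=1e-4):
    if step < warmup:
        return max_lr * (step + 1) / warmup             # linear warmup: 0 -> max_lr
    progress = (step - warmup) / (max_steps - warmup)   # 0 -> 1 over the decay phase
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))  # cosine: 1 -> 0
    return min_lr + coeff * (max_lr - min_lr)
```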
4.3 Optimizer
AdamW with separated weight decay:
- 2D parameters (weights) → weight_decay = 0.1
- 1D parameters (biases, LayerNorm) → weight_decay = 0.0
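One common way to build the two parameter groups (a sketch; the exact selection logic in train.py may differ, and the lr here is the warmup peak from section 4.2):

```python
# 2D+ tensors (linear/embedding weights) get decay; 1D tensors (biases, LayerNorm) do not
decay = [p for p in model.parameters() if p.requires_grad and p.dim() >= 2]
no_decay = [p for p in model.parameters() if p.requires_grad and p.dim() < 2]
optimizer = torch.optim.AdamW(
    [{"params": decay, "weight_decay": 0.1},
     {"params": no_decay, "weight_decay": 0.0}],
    lr=1e-3,
)
```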
4.4 Gradient clipping
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0) caps the global gradient norm at 1.0, preventing exploding gradients.
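The clip goes between loss.backward() and optimizer.step(); a sketch of one training step using the pieces above:

```python
x, y = get_batch("train")
logits, loss = model(x, y)
optimizer.zero_grad(set_to_none=True)
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # cap global grad norm at 1.0
optimizer.step()
```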
4.5 Evaluation
Every 500 steps, we evaluate on 200 random validation batches and report:
```text
step 500 | train loss 1.8234 | val loss 1.9012 | lr 9.12e-04 | time 45.2s
```
The best validation checkpoint is saved as best.pt.
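A sketch of the evaluation helper (names are illustrative; train.py may differ):

```python
@torch.no_grad()
def estimate_loss(model, eval_iters=200):
    model.eval()
    out = {}
    for split in ("train", "val"):
        losses = torch.zeros(eval_iters)
        for i in range(eval_iters):
            x, y = get_batch(split)
            _, loss = model(x, y)
            losses[i] = loss.item()
        out[split] = losses.mean().item()
    model.train()  # back to training mode
    return out
```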
5. Generation / Inference (generate.py)
Autoregressive generation:
1. Encode a prompt (e.g., "\nROMEO:")
2. Run forward pass → get logits for the last token
3. Apply temperature + top-k sampling → probability distribution
4. Sample next token from distribution
5. Append token to sequence
6. Repeat from step 2
Temperature: lower = more conservative/deterministic, higher = more random/creative
Top-k: only sample from the k most likely tokens (prevents gibberish)
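A sketch of that loop (reusing the imports from the earlier sketches; the block_size, temperature, and top_k defaults here are illustrative, and generate.py may differ in detail):

```python
@torch.no_grad()
def generate(model, idx, max_new_tokens, block_size=256, temperature=0.8, top_k=50):
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -block_size:]              # crop to the context window
        logits, _ = model(idx_cond)
        logits = logits[:, -1, :] / temperature      # logits for the last position only
        v, _ = torch.topk(logits, top_k)
        logits[logits < v[:, [-1]]] = float("-inf")  # drop everything outside top-k
        probs = F.softmax(logits, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)
        idx = torch.cat([idx, next_token], dim=1)    # append and repeat
    return idx
```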
6. Results
Expected after 5000 steps on a T4 GPU (~30-60 minutes):
| Metric | Value |
|---|---|
| Initial loss | ~4.3 (near random guessing among 65 chars: -ln(1/65) ≈ 4.17) |
| Final train loss | ~1.2β1.5 |
| Final val loss | ~1.3β1.6 |
| Parameters | 10.77 M |
Generated sample (should look vaguely Shakespeare-like):
```text
ROMEO:
What say you, then? I have heard you say
The hour is come, and I must hence depart.
```
7. Files in this Repo
| File | Purpose |
|---|---|
| model.py | Pure PyTorch GPT architecture (standalone) |
| prepare.py | Downloads data, builds char-level vocab, saves data.pt |
| train.py | Training script (imports from model.py) |
| train_standalone.py | Self-contained training script (model + train in one file) |
| generate.py | Inference script: load checkpoint and generate text |
| input.txt | Raw tiny Shakespeare text |
| data.pt | Preprocessed train/val tensors + vocab mappings |
| best.pt | Best model checkpoint (saved during training) |
How to Run
```bash
# 1. Prepare data
python prepare.py

# 2. Train (GPU recommended)
python train_standalone.py

# 3. Generate
python generate.py --prompt "ROMEO:" --length 500 --temperature 0.8
```
Learning Checklist
- Read model.py → understand attention masking, pre-norm, weight tying
- Read prepare.py → understand character-level tokenization
- Read train.py → understand batching, LR schedule, gradient clipping
- Run training and watch loss go down
- Tweak hyperparameters (n_layer, n_embd, learning rate) and observe changes
- Generate with different temperatures and top-k values
Based on Andrej Karpathy's build-nanogpt and nanoGPT.