# nanoGPT: Step-by-Step Tutorial
This tutorial walks through building and training a **tiny GPT from scratch** in pure PyTorch. No `transformers` library, no pre-trained weights, just ~200 lines of clean code.
---
## Table of Contents
1. [Overview](#1-overview)
2. [Dataset Preparation](#2-dataset-preparation)
3. [Model Architecture](#3-model-architecture)
4. [Training Loop](#4-training-loop)
5. [Generation / Inference](#5-generation--inference)
6. [Results](#6-results)
7. [Files in this Repo](#7-files-in-this-repo)
---
## 1. Overview
We train a **character-level** language model on [tiny Shakespeare](https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt) (~1.1M characters, 65 unique characters).
The model learns to predict the next character given all previous characters, autoregressively. This is exactly how GPT-2, GPT-3, and ChatGPT work, just at character scale instead of word/BPE-token scale.
**Model size**: ~10.8M parameters
**Architecture**: 6 layers, 6 heads, 384 embedding dim, 256 context length
---
## 2. Dataset Preparation (`prepare.py`)
### What happens:
1. **Download** tiny Shakespeare text
2. **Discover vocabulary**: find all unique characters → 65 chars
3. **Build mappings**:
   - `stoi` (string-to-int): `'a' → 0`, `'b' → 1`, ...
- `itos` (int-to-string): reverse lookup
4. **Encode** the entire text as integers
5. **Split** 90% train / 10% validation
6. **Save** as `data.pt` (PyTorch tensors for fast loading)
### Key concept: Character-level tokenization
```python
chars = sorted(list(set(text))) # vocabulary
vocab_size = len(chars) # 65
encode = lambda s: [stoi[c] for c in s] # "hello" -> [46, 43, 50, 50, 53]
decode = lambda l: "".join([itos[i] for i in l])
```
No tokenizer library needed! For English text, ~65 chars is enough.
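Putting the steps together, a minimal sketch of the preparation script might look like this (the 90/10 split and the `data.pt` filename come from the steps above; the exact dictionary keys saved by `prepare.py` may differ):

```python
import torch

with open("input.txt", "r", encoding="utf-8") as f:
    text = f.read()

chars = sorted(list(set(text)))               # 65 unique characters
stoi = {ch: i for i, ch in enumerate(chars)}  # char -> int
itos = {i: ch for i, ch in enumerate(chars)}  # int -> char

data = torch.tensor([stoi[c] for c in text], dtype=torch.long)
n = int(0.9 * len(data))                      # 90% train / 10% val
torch.save(
    {"train": data[:n], "val": data[n:], "stoi": stoi, "itos": itos},
    "data.pt",
)
```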
---
## 3. Model Architecture (`model.py`)
### 3.1 Configuration (`GPTConfig`)
```python
@dataclass
class GPTConfig:
    block_size: int = 256   # max sequence length
    vocab_size: int = 65    # number of unique characters
    n_layer: int = 6        # transformer blocks
    n_head: int = 6         # attention heads per block
    n_embd: int = 384       # embedding dimension
```
### 3.2 Causal Self-Attention
The core idea: every token can "look at" all previous tokens to decide what comes next.
```
For each token:
    Query = "What am I looking for?"
    Key   = "What do I contain?"
    Value = "What information do I have?"
    Attention score = Query · Key (scaled)
    Causal mask     = prevent looking at future tokens
    Output          = weighted sum of Values
```
We use **multi-head attention**: split embeddings into 6 parallel attention operations (heads), run them simultaneously, then concatenate.
**Code flow:**
```
Input (B, T, C)
  → c_attn → (Q, K, V) each (B, T, C)
  → reshape to (B, n_head, T, head_size)
  → Q @ K.T → attention scores (B, n_head, T, T)
  → causal mask → softmax → weighted sum of V
  → reshape back → c_proj → Output (B, T, C)
```
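A compact PyTorch sketch of this flow is below. Module and field names (`c_attn`, `c_proj`) follow the GPT-2 convention; the exact details live in `model.py`.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.n_head = config.n_head
        self.n_embd = config.n_embd
        self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd)  # fused Q, K, V projection
        self.c_proj = nn.Linear(config.n_embd, config.n_embd)      # output projection
        # lower-triangular mask: token t may only attend to positions <= t
        mask = torch.tril(torch.ones(config.block_size, config.block_size))
        self.register_buffer("mask", mask.view(1, 1, config.block_size, config.block_size))

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.c_attn(x).split(self.n_embd, dim=2)
        # reshape to (B, n_head, T, head_size)
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        att = (q @ k.transpose(-2, -1)) / math.sqrt(k.size(-1))    # scaled dot product
        att = att.masked_fill(self.mask[:, :, :T, :T] == 0, float("-inf"))
        att = F.softmax(att, dim=-1)
        y = att @ v                                                # weighted sum of Values
        y = y.transpose(1, 2).contiguous().view(B, T, C)           # merge heads back
        return self.c_proj(y)
```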
### 3.3 MLP (Feed-Forward)
After attention, each token gets a private "thinking step":
```
(B, T, C) → Linear(4*C) → GELU → Linear(C) → (B, T, C)
```
The 4× expansion is standard in transformers.
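As a sketch (the `c_fc`/`c_proj` names are conventional; the real module may differ slightly):

```python
import torch.nn as nn

class MLP(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.c_fc = nn.Linear(config.n_embd, 4 * config.n_embd)    # expand 4x
        self.gelu = nn.GELU()
        self.c_proj = nn.Linear(4 * config.n_embd, config.n_embd)  # project back

    def forward(self, x):
        return self.c_proj(self.gelu(self.c_fc(x)))
```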
### 3.4 Transformer Block
```
x = x + Attention(LayerNorm(x)) # pre-norm residual
x = x + MLP(LayerNorm(x)) # pre-norm residual
```
**Pre-LayerNorm** (normalize before sublayer) is used by GPT-2/3/Llama.
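In PyTorch, the block is just those two residual lines, assuming the `CausalSelfAttention` and `MLP` modules sketched above:

```python
import torch.nn as nn

class Block(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.ln_1 = nn.LayerNorm(config.n_embd)
        self.attn = CausalSelfAttention(config)
        self.ln_2 = nn.LayerNorm(config.n_embd)
        self.mlp = MLP(config)

    def forward(self, x):
        x = x + self.attn(self.ln_1(x))  # pre-norm residual: attention
        x = x + self.mlp(self.ln_2(x))   # pre-norm residual: MLP
        return x
```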
### 3.5 Full GPT Model
```
1. Token Embedding (wte): char index → vector
2. Position Embedding (wpe): position index → vector
3. Sum them: x = wte + wpe
4. Pass through N transformer blocks
5. Final LayerNorm
6. Language Model Head: project to vocab_size logits
7. Cross-entropy loss against next-character targets
```
**Weight tying**: `wte` (input embedding) shares weights with `lm_head` (output projection). Saves parameters (here 65 × 384 ≈ 25k) and typically improves training.
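A sketch of the forward pass and the weight-tying line, using the `Block` sketched above (illustrative; `model.py` is the source of truth):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GPT(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.wte = nn.Embedding(config.vocab_size, config.n_embd)  # token embeddings
        self.wpe = nn.Embedding(config.block_size, config.n_embd)  # position embeddings
        self.blocks = nn.ModuleList([Block(config) for _ in range(config.n_layer)])
        self.ln_f = nn.LayerNorm(config.n_embd)
        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)
        self.lm_head.weight = self.wte.weight                      # weight tying

    def forward(self, idx, targets=None):
        B, T = idx.shape
        pos = torch.arange(T, device=idx.device)
        x = self.wte(idx) + self.wpe(pos)         # (B, T, C)
        for block in self.blocks:
            x = block(x)
        x = self.ln_f(x)
        logits = self.lm_head(x)                  # (B, T, vocab_size)
        loss = None
        if targets is not None:
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
        return logits, loss
```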
---
## 4. Training Loop (`train.py` / `train_standalone.py`)
### 4.1 Batch sampling
For each training step, grab random contiguous chunks:
```python
def get_batch(split):
    data = train_data if split == "train" else val_data
    ix = torch.randint(len(data) - BLOCK_SIZE, (BATCH_SIZE,))
    x = torch.stack([data[i : i + BLOCK_SIZE] for i in ix])          # inputs
    y = torch.stack([data[i + 1 : i + BLOCK_SIZE + 1] for i in ix])  # targets (shifted by 1)
    return x, y
```
### 4.2 Learning rate schedule
**Cosine with linear warmup**:
```
Step 0-200:    LR ramps up from 0 → 1e-3   (linear warmup)
Step 200-5000: LR decays from 1e-3 → 1e-4  (cosine annealing)
```
Warmup prevents early loss spikes when gradients are large.
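A sketch of the schedule as a function of step (the 200/5000 step counts and 1e-3/1e-4 rates come from the numbers above):

```python
import math

MAX_LR, MIN_LR = 1e-3, 1e-4
WARMUP_STEPS, MAX_STEPS = 200, 5000

def get_lr(step):
    if step < WARMUP_STEPS:                      # linear warmup
        return MAX_LR * (step + 1) / WARMUP_STEPS
    if step > MAX_STEPS:                         # after decay, hold at the floor
        return MIN_LR
    # cosine decay from MAX_LR down to MIN_LR
    progress = (step - WARMUP_STEPS) / (MAX_STEPS - WARMUP_STEPS)
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))
    return MIN_LR + coeff * (MAX_LR - MIN_LR)
```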
### 4.3 Optimizer
**AdamW** with separated weight decay:
- 2D parameters (weights) → weight_decay = 0.1
- 1D parameters (biases, LayerNorm) → weight_decay = 0.0
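A sketch of that parameter grouping (assumes a `model` instance is in scope; decay is applied only to parameters with 2 or more dimensions):

```python
import torch

decay = [p for p in model.parameters() if p.requires_grad and p.dim() >= 2]
no_decay = [p for p in model.parameters() if p.requires_grad and p.dim() < 2]
optimizer = torch.optim.AdamW(
    [
        {"params": decay, "weight_decay": 0.1},     # 2D weights
        {"params": no_decay, "weight_decay": 0.0},  # biases, LayerNorm
    ],
    lr=1e-3,
)
```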
### 4.4 Gradient clipping
`torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)` prevents exploding gradients.
### 4.5 Evaluation
Every 500 steps, we evaluate on 200 random validation batches and report:
```
step 500 | train loss 1.8234 | val loss 1.9012 | lr 9.12e-04 | time 45.2s
```
The best validation checkpoint is saved as `best.pt`.
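A sketch of the evaluation loop (200 batches per split, as above; `get_batch` is the sampler from 4.1):

```python
import torch

@torch.no_grad()
def estimate_loss(model, eval_iters=200):
    model.eval()
    out = {}
    for split in ("train", "val"):
        losses = torch.zeros(eval_iters)
        for i in range(eval_iters):
            x, y = get_batch(split)
            _, loss = model(x, y)
            losses[i] = loss.item()
        out[split] = losses.mean().item()
    model.train()
    return out
```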
---
## 5. Generation / Inference (`generate.py`)
Autoregressive generation:
```
1. Encode a prompt (e.g., "\nROMEO:")
2. Run forward pass → get logits for last token
3. Apply temperature + top-k sampling → probability distribution
4. Sample next token from distribution
5. Append token to sequence
6. Repeat from step 2
```
**Temperature**: lower = more conservative/deterministic, higher = more random/creative
**Top-k**: only sample from the k most likely tokens (prevents gibberish)
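A sketch of the sampling loop with temperature and top-k (illustrative; `generate.py` may differ, and `BLOCK_SIZE` is assumed to be the model's context length):

```python
import torch
import torch.nn.functional as F

BLOCK_SIZE = 256

@torch.no_grad()
def generate(model, idx, max_new_tokens, temperature=0.8, top_k=50):
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -BLOCK_SIZE:]                      # crop to the context window
        logits, _ = model(idx_cond)
        logits = logits[:, -1, :] / temperature              # logits for the last position
        if top_k is not None:
            v, _ = torch.topk(logits, top_k)
            logits[logits < v[:, [-1]]] = float("-inf")      # keep only the top-k logits
        probs = F.softmax(logits, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)    # sample one token
        idx = torch.cat((idx, next_id), dim=1)               # append and repeat
    return idx
```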
---
## 6. Results
Expected after 5000 steps on a T4 GPU (~30-60 minutes):
| Metric | Value |
|--------|-------|
| Initial loss | ~4.2 (≈ ln 65, random guessing among 65 chars) |
| Final train loss | ~1.2β1.5 |
| Final val loss | ~1.3β1.6 |
| Parameters | 10.77 M |
**Generated sample** (should look vaguely Shakespeare-like):
```
ROMEO:
What say you, then? I have heard you say
The hour is come, and I must hence depart.
```
---
## 7. Files in this Repo
| File | Purpose |
|------|---------|
| `model.py` | Pure PyTorch GPT architecture (standalone) |
| `prepare.py` | Downloads data, builds char-level vocab, saves `data.pt` |
| `train.py` | Training script (imports from `model.py`) |
| `train_standalone.py` | Self-contained training script (model + train in one file) |
| `generate.py` | Inference script → load checkpoint and generate text |
| `input.txt` | Raw tiny Shakespeare text |
| `data.pt` | Preprocessed train/val tensors + vocab mappings |
| `best.pt` | Best model checkpoint (saved during training) |
---
## How to Run
```bash
# 1. Prepare data
python prepare.py
# 2. Train (GPU recommended)
python train_standalone.py
# 3. Generate
python generate.py --prompt "ROMEO:" --length 500 --temperature 0.8
```
---
## Learning Checklist
- [ ] Read `model.py` → understand attention masking, pre-norm, weight tying
- [ ] Read `prepare.py` → understand character-level tokenization
- [ ] Read `train.py` → understand batching, LR schedule, gradient clipping
- [ ] Run training and watch loss go down
- [ ] Tweak hyperparameters (n_layer, n_embd, learning rate) and observe changes
- [ ] Generate with different temperatures and top-k values
---
Based on Andrej Karpathy's [build-nanogpt](https://github.com/karpathy/build-nanogpt) and [nanoGPT](https://github.com/karpathy/nanoGPT).