# nanoGPT: Step-by-Step Tutorial


This tutorial walks through building and training a **tiny GPT from scratch** in pure PyTorch. No `transformers` library, no pre-trained weights, just ~200 lines of clean code.


---


## Table of Contents


1. [Overview](#1-overview)
2. [Dataset Preparation](#2-dataset-preparation)
3. [Model Architecture](#3-model-architecture)
4. [Training Loop](#4-training-loop)
5. [Generation / Inference](#5-generation--inference)
6. [Results](#6-results)
7. [Files in this Repo](#7-files-in-this-repo)


---


## 1. Overview


We train a **character-level** language model on [tiny Shakespeare](https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt) (~1.1M characters, 65 unique characters).


The model learns to predict the next character given all previous characters, autoregressively. This is exactly how GPT-2, GPT-3, and ChatGPT work, just at character scale instead of word/BPE token scale.


**Model size**: ~10.8M parameters
**Architecture**: 6 layers, 6 heads, 384 embedding dim, 256 context length


---


## 2. Dataset Preparation (`prepare.py`)


### What happens:
1. **Download** tiny Shakespeare text
2. **Discover vocabulary**: find all unique characters → 65 chars
3. **Build mappings**:
   - `stoi` (string-to-int): `'a' → 0`, `'b' → 1`, ...
   - `itos` (int-to-string): reverse lookup
4. **Encode** the entire text as integers
5. **Split** 90% train / 10% validation
6. **Save** as `data.pt` (PyTorch tensors for fast loading)


### Key concept: Character-level tokenization
```python
chars = sorted(list(set(text)))               # vocabulary
vocab_size = len(chars)                       # 65
stoi = {ch: i for i, ch in enumerate(chars)}  # string-to-int
itos = {i: ch for i, ch in enumerate(chars)}  # int-to-string
encode = lambda s: [stoi[c] for c in s]       # "hello" -> [46, 43, 50, 50, 53]
decode = lambda l: "".join([itos[i] for i in l])
```


No tokenizer library needed! For English text, ~65 chars is enough.
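
Steps 4-6 are just as short. A minimal sketch of the rest of `prepare.py` (the variable names and the exact keys stored in `data.pt` are assumptions, not necessarily what this repo uses):

```python
import torch

# encode the whole corpus as one long tensor of character ids
data = torch.tensor(encode(text), dtype=torch.long)

# 90% train / 10% validation split
n = int(0.9 * len(data))
train_data, val_data = data[:n], data[n:]

# save the tensors plus the vocab mappings so train.py / generate.py can reload them
torch.save({"train": train_data, "val": val_data, "stoi": stoi, "itos": itos}, "data.pt")
```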


---


## 3. Model Architecture (`model.py`)


### 3.1 Configuration (`GPTConfig`)
```python
from dataclasses import dataclass

@dataclass
class GPTConfig:
    block_size: int = 256   # max sequence length
    vocab_size: int = 65    # number of unique characters
    n_layer: int = 6        # transformer blocks
    n_head: int = 6         # attention heads per block
    n_embd: int = 384       # embedding dimension
```


### 3.2 Causal Self-Attention
The core idea: every token can "look at" all previous tokens to decide what comes next.


```
For each token:
  Query = "What am I looking for?"
  Key   = "What do I contain?"
  Value = "What information do I have?"

  Attention score = Query · Key (scaled)
  Causal mask     = prevent looking at future tokens
  Output          = weighted sum of Values
```


We use **multi-head attention**: split embeddings into 6 parallel attention operations (heads), run them simultaneously, then concatenate.


**Code flow:**
```
Input (B, T, C)
  → c_attn → (Q, K, V) each (B, T, C)
  → reshape to (B, n_head, T, head_size)
  → Q @ K.T → attention scores (B, n_head, T, T)
  → causal mask → softmax → weighted sum of V
  → reshape back → c_proj → Output (B, T, C)
```


### 3.3 MLP (Feed-Forward)
After attention, each token gets a private "thinking step":
```
(B, T, C) → Linear(4*C) → GELU → Linear(C) → (B, T, C)
```
The 4× expansion is standard in transformers.
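
A matching sketch (layer names are illustrative; dropout omitted):

```python
import torch.nn as nn

class MLP(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.c_fc   = nn.Linear(config.n_embd, 4 * config.n_embd)  # expand to 4*C
        self.gelu   = nn.GELU()
        self.c_proj = nn.Linear(4 * config.n_embd, config.n_embd)  # project back to C

    def forward(self, x):
        return self.c_proj(self.gelu(self.c_fc(x)))
```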


### 3.4 Transformer Block
```
x = x + Attention(LayerNorm(x))   # pre-norm residual
x = x + MLP(LayerNorm(x))         # pre-norm residual
```
**Pre-LayerNorm** (normalize before each sublayer) is used by GPT-2/3/Llama.
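
Using the two modules sketched above, the block is literally those two residual additions plus the LayerNorms:

```python
import torch.nn as nn

class Block(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.ln_1 = nn.LayerNorm(config.n_embd)
        self.attn = CausalSelfAttention(config)
        self.ln_2 = nn.LayerNorm(config.n_embd)
        self.mlp  = MLP(config)

    def forward(self, x):
        x = x + self.attn(self.ln_1(x))  # pre-norm residual
        x = x + self.mlp(self.ln_2(x))   # pre-norm residual
        return x
```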


### 3.5 Full GPT Model
```
1. Token Embedding (wte): char index → vector
2. Position Embedding (wpe): position index → vector
3. Sum them: x = wte + wpe
4. Pass through N transformer blocks
5. Final LayerNorm
6. Language Model Head: project to vocab_size logits
7. Cross-entropy loss against next-character targets
```


**Weight tying**: `wte` (input embedding) shares weights with `lm_head` (output projection). Saves parameters, improves training.
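
Putting steps 1-7 and the tied head together, a compact sketch of the full model (the real `model.py` adds dropout and weight initialization and may organize its submodules differently):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GPT(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.wte = nn.Embedding(config.vocab_size, config.n_embd)   # token embedding
        self.wpe = nn.Embedding(config.block_size, config.n_embd)   # position embedding
        self.blocks = nn.ModuleList([Block(config) for _ in range(config.n_layer)])
        self.ln_f = nn.LayerNorm(config.n_embd)
        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)
        self.wte.weight = self.lm_head.weight                       # weight tying

    def forward(self, idx, targets=None):
        B, T = idx.shape
        pos = torch.arange(T, device=idx.device)
        x = self.wte(idx) + self.wpe(pos)         # (B, T, n_embd)
        for block in self.blocks:
            x = block(x)
        x = self.ln_f(x)                          # final LayerNorm
        logits = self.lm_head(x)                  # (B, T, vocab_size)
        loss = None
        if targets is not None:
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
        return logits, loss
```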


---


## 4. Training Loop (`train.py` / `train_standalone.py`)

### 4.1 Batch sampling
For each training step, grab random contiguous chunks:
```python
def get_batch(split):
    data = train_data if split == "train" else val_data
    ix = torch.randint(len(data) - BLOCK_SIZE, (BATCH_SIZE,))
    x = torch.stack([data[i : i + BLOCK_SIZE] for i in ix])          # inputs
    y = torch.stack([data[i + 1 : i + BLOCK_SIZE + 1] for i in ix])  # targets (shifted by 1)
    return x, y
```

### 4.2 Learning rate schedule
**Cosine with linear warmup**:
```
Step 0-200:    LR ramps up linearly from 0 to 1e-3 (warmup)
Step 200-5000: LR decays from 1e-3 down to 1e-4 (cosine annealing)
```
Warmup prevents early loss spikes when gradients are large.
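
A sketch of such a schedule (the constant names here are illustrative, not necessarily the ones in `train.py`):

```python
import math

MAX_LR, MIN_LR = 1e-3, 1e-4
WARMUP_STEPS, MAX_STEPS = 200, 5000

def get_lr(step):
    # linear warmup from 0 to MAX_LR over the first WARMUP_STEPS steps
    if step < WARMUP_STEPS:
        return MAX_LR * (step + 1) / WARMUP_STEPS
    # cosine decay from MAX_LR down to MIN_LR over the remaining steps
    progress = min((step - WARMUP_STEPS) / (MAX_STEPS - WARMUP_STEPS), 1.0)
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))
    return MIN_LR + coeff * (MAX_LR - MIN_LR)

# each step: for g in optimizer.param_groups: g["lr"] = get_lr(step)
```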


### 4.3 Optimizer
**AdamW** with separated weight decay (see the sketch below):
- 2D parameters (weights) → weight_decay = 0.1
- 1D parameters (biases, LayerNorm) → weight_decay = 0.0
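
A sketch of that parameter grouping (the decay values come from above; everything else is illustrative):

```python
import torch

# matrices (embeddings, linear weights) get weight decay; biases and LayerNorm gains do not
decay_params    = [p for p in model.parameters() if p.requires_grad and p.dim() >= 2]
no_decay_params = [p for p in model.parameters() if p.requires_grad and p.dim() < 2]

optimizer = torch.optim.AdamW(
    [
        {"params": decay_params, "weight_decay": 0.1},
        {"params": no_decay_params, "weight_decay": 0.0},
    ],
    lr=1e-3,
)
```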


### 4.4 Gradient clipping
`torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)` prevents exploding gradients.


### 4.5 Evaluation
Every 500 steps, we evaluate on 200 random validation batches and report:
```
step 500 | train loss 1.8234 | val loss 1.9012 | lr 9.12e-04 | time 45.2s
```
The best validation checkpoint is saved as `best.pt`.
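
A sketch of the evaluation and checkpointing logic (assuming the model's forward returns `(logits, loss)` as sketched earlier and reusing `get_batch` from above; the real script and the exact contents of `best.pt` may differ):

```python
import torch

EVAL_ITERS = 200  # number of random batches per split

@torch.no_grad()
def estimate_loss(model):
    model.eval()
    out = {}
    for split in ("train", "val"):
        losses = torch.zeros(EVAL_ITERS)
        for i in range(EVAL_ITERS):
            x, y = get_batch(split)
            _, loss = model(x, y)
            losses[i] = loss.item()
        out[split] = losses.mean().item()
    model.train()
    return out

# before the loop: best_val_loss = float("inf")
# inside the training loop, every 500 steps:
losses = estimate_loss(model)
if losses["val"] < best_val_loss:
    best_val_loss = losses["val"]
    torch.save(model.state_dict(), "best.pt")
```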


---


## 5. Generation / Inference (`generate.py`)


Autoregressive generation:
```
1. Encode a prompt (e.g., "\nROMEO:")
2. Run forward pass → get logits for last token
3. Apply temperature + top-k sampling → probability distribution
4. Sample next token from distribution
5. Append token to sequence
6. Repeat from step 2
```


**Temperature**: lower = more conservative/deterministic, higher = more random/creative
**Top-k**: only sample from the k most likely tokens (prevents gibberish)
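
A sketch of that loop (assuming the model returns `(logits, loss)` as above; `BLOCK_SIZE` is the 256-token context window):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def generate(model, idx, max_new_tokens, temperature=0.8, top_k=50):
    # idx: (B, T) tensor of character ids; grows by one id per iteration
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -BLOCK_SIZE:]                  # crop to the context window
        logits, _ = model(idx_cond)
        logits = logits[:, -1, :] / temperature          # logits for the last position only
        if top_k is not None:
            v, _ = torch.topk(logits, top_k)
            logits[logits < v[:, [-1]]] = float("-inf")  # keep only the k most likely tokens
        probs = F.softmax(logits, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)
        idx = torch.cat((idx, next_id), dim=1)
    return idx
```

For example, `decode(generate(model, torch.tensor([encode("\nROMEO:")]), 500)[0].tolist())` would produce 500 new characters after the prompt.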


---


## 6. Results


Expected after 5000 steps on a T4 GPU (~30-60 minutes):


| Metric | Value |
|--------|-------|
| Initial loss | ~4.3 (random guessing among 65 chars; -ln(1/65) ≈ 4.17) |
| Final train loss | ~1.2-1.5 |
| Final val loss | ~1.3-1.6 |
| Parameters | 10.77 M |


**Generated sample** (should look vaguely Shakespeare-like):
```
ROMEO:
What say you, then? I have heard you say
The hour is come, and I must hence depart.
```


---


## 7. Files in this Repo


| File | Purpose |
|------|---------|
| `model.py` | Pure PyTorch GPT architecture (standalone) |
| `prepare.py` | Downloads data, builds char-level vocab, saves `data.pt` |
| `train.py` | Training script (imports from `model.py`) |
| `train_standalone.py` | Self-contained training script (model + train in one file) |
| `generate.py` | Inference script: load checkpoint and generate text |
| `input.txt` | Raw tiny Shakespeare text |
| `data.pt` | Preprocessed train/val tensors + vocab mappings |
| `best.pt` | Best model checkpoint (saved during training) |


---


## How to Run


```bash
# 1. Prepare data
python prepare.py

# 2. Train (GPU recommended)
python train_standalone.py

# 3. Generate
python generate.py --prompt "ROMEO:" --length 500 --temperature 0.8
```


---


## Learning Checklist


- [ ] Read `model.py`: understand attention masking, pre-norm, weight tying
- [ ] Read `prepare.py`: understand character-level tokenization
- [ ] Read `train.py`: understand batching, LR schedule, gradient clipping
- [ ] Run training and watch loss go down
- [ ] Tweak hyperparameters (n_layer, n_embd, learning rate) and observe changes
- [ ] Generate with different temperatures and top-k values


---


Based on Andrej Karpathy's [build-nanogpt](https://github.com/karpathy/build-nanogpt) and [nanoGPT](https://github.com/karpathy/nanoGPT).