# nanoGPT: Step-by-Step Tutorial
This tutorial walks through building and training a **tiny GPT from scratch** in pure PyTorch. No `transformers` library, no pre-trained weights, just ~200 lines of clean code.
---
## Table of Contents
1. [Overview](#1-overview)
2. [Dataset Preparation](#2-dataset-preparation)
3. [Model Architecture](#3-model-architecture)
4. [Training Loop](#4-training-loop)
5. [Generation / Inference](#5-generation--inference)
6. [Results](#6-results)
7. [Files in this Repo](#7-files-in-this-repo)
---
## 1. Overview
We train a **character-level** language model on [tiny Shakespeare](https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt) (~1.1M characters, 65 unique characters).
The model learns to predict the next character given all previous characters, autoregressively. This is exactly how GPT-2, GPT-3, and ChatGPT work, just at character scale instead of word/BPE-token scale.
**Model size**: ~10.8M parameters
**Architecture**: 6 layers, 6 heads, 384 embedding dim, 256 context length
---
## 2. Dataset Preparation (`prepare.py`)
### What happens:
1. **Download** tiny Shakespeare text
2. **Discover vocabulary**: find all unique characters → 65 chars
3. **Build mappings**:
   - `stoi` (string-to-int): `'a' → 0`, `'b' → 1`, ...
- `itos` (int-to-string): reverse lookup
4. **Encode** the entire text as integers
5. **Split** 90% train / 10% validation
6. **Save** as `data.pt` (PyTorch tensors for fast loading)
### Key concept: Character-level tokenization
```python
chars = sorted(list(set(text))) # vocabulary
vocab_size = len(chars) # 65
encode = lambda s: [stoi[c] for c in s] # "hello" -> [46, 43, 50, 50, 53]
decode = lambda l: "".join([itos[i] for i in l])
```
No tokenizer library needed! For English text, ~65 chars is enough.
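Putting the steps together, a minimal sketch of the preparation script might look like this (the 90/10 split and the `data.pt` filename come from the steps above; the exact dictionary keys saved by `prepare.py` may differ):

```python
import torch

with open("input.txt", "r", encoding="utf-8") as f:
    text = f.read()

chars = sorted(list(set(text)))               # 65 unique characters
stoi = {ch: i for i, ch in enumerate(chars)}  # char -> int
itos = {i: ch for i, ch in enumerate(chars)}  # int -> char

data = torch.tensor([stoi[c] for c in text], dtype=torch.long)
n = int(0.9 * len(data))                      # 90% train / 10% val
torch.save(
    {"train": data[:n], "val": data[n:], "stoi": stoi, "itos": itos},
    "data.pt",
)
```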
---
## 3. Model Architecture (`model.py`)
### 3.1 Configuration (`GPTConfig`)
```python
@dataclass
class GPTConfig:
    block_size: int = 256   # max sequence length
    vocab_size: int = 65    # number of unique characters
    n_layer: int = 6        # transformer blocks
    n_head: int = 6         # attention heads per block
    n_embd: int = 384       # embedding dimension
```
### 3.2 Causal Self-Attention
The core idea: every token can "look at" all previous tokens to decide what comes next.
```
For each token:
    Query = "What am I looking for?"
    Key   = "What do I contain?"
    Value = "What information do I have?"
    Attention score = Query · Key (scaled)
    Causal mask     = prevent looking at future tokens
    Output          = weighted sum of Values
```
We use **multi-head attention**: split embeddings into 6 parallel attention operations (heads), run them simultaneously, then concatenate.
**Code flow:**
```
Input (B, T, C)
  → c_attn → (Q, K, V) each (B, T, C)
  → reshape to (B, n_head, T, head_size)
  → Q @ K.T → attention scores (B, n_head, T, T)
  → causal mask → softmax → weighted sum of V
  → reshape back → c_proj → Output (B, T, C)
```
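A compact PyTorch sketch of this flow is below. Module and field names (`c_attn`, `c_proj`) follow the GPT-2 convention; the exact details live in `model.py`.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.n_head = config.n_head
        self.n_embd = config.n_embd
        self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd)  # fused Q, K, V projection
        self.c_proj = nn.Linear(config.n_embd, config.n_embd)      # output projection
        # lower-triangular mask: token t may only attend to positions <= t
        mask = torch.tril(torch.ones(config.block_size, config.block_size))
        self.register_buffer("mask", mask.view(1, 1, config.block_size, config.block_size))

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.c_attn(x).split(self.n_embd, dim=2)
        # reshape to (B, n_head, T, head_size)
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        att = (q @ k.transpose(-2, -1)) / math.sqrt(k.size(-1))    # scaled dot product
        att = att.masked_fill(self.mask[:, :, :T, :T] == 0, float("-inf"))
        att = F.softmax(att, dim=-1)
        y = att @ v                                                # weighted sum of Values
        y = y.transpose(1, 2).contiguous().view(B, T, C)           # merge heads back
        return self.c_proj(y)
```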
### 3.3 MLP (Feed-Forward)
After attention, each token gets a private "thinking step":
```
(B, T, C) → Linear(4*C) → GELU → Linear(C) → (B, T, C)
```
The 4× expansion is standard in transformers.
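As a sketch (the `c_fc`/`c_proj` names are conventional; the real module may differ slightly):

```python
import torch.nn as nn

class MLP(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.c_fc = nn.Linear(config.n_embd, 4 * config.n_embd)    # expand 4x
        self.gelu = nn.GELU()
        self.c_proj = nn.Linear(4 * config.n_embd, config.n_embd)  # project back

    def forward(self, x):
        return self.c_proj(self.gelu(self.c_fc(x)))
```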
### 3.4 Transformer Block
```
x = x + Attention(LayerNorm(x)) # pre-norm residual
x = x + MLP(LayerNorm(x)) # pre-norm residual
```
**Pre-LayerNorm** (normalize before sublayer) is used by GPT-2/3/Llama.
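In PyTorch, the block is just those two residual lines, assuming the `CausalSelfAttention` and `MLP` modules sketched above:

```python
import torch.nn as nn

class Block(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.ln_1 = nn.LayerNorm(config.n_embd)
        self.attn = CausalSelfAttention(config)
        self.ln_2 = nn.LayerNorm(config.n_embd)
        self.mlp = MLP(config)

    def forward(self, x):
        x = x + self.attn(self.ln_1(x))  # pre-norm residual: attention
        x = x + self.mlp(self.ln_2(x))   # pre-norm residual: MLP
        return x
```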
### 3.5 Full GPT Model
```
1. Token Embedding (wte): char index → vector
2. Position Embedding (wpe): position index → vector
3. Sum them: x = wte + wpe
4. Pass through N transformer blocks
5. Final LayerNorm
6. Language Model Head: project to vocab_size logits
7. Cross-entropy loss against next-character targets
```
**Weight tying**: `wte` (input embedding) shares weights with `lm_head` (output projection). Saves parameters (here 65 × 384 ≈ 25k) and typically improves training.
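A sketch of the forward pass and the weight-tying line, using the `Block` sketched above (illustrative; `model.py` is the source of truth):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GPT(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.wte = nn.Embedding(config.vocab_size, config.n_embd)  # token embeddings
        self.wpe = nn.Embedding(config.block_size, config.n_embd)  # position embeddings
        self.blocks = nn.ModuleList([Block(config) for _ in range(config.n_layer)])
        self.ln_f = nn.LayerNorm(config.n_embd)
        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)
        self.lm_head.weight = self.wte.weight                      # weight tying

    def forward(self, idx, targets=None):
        B, T = idx.shape
        pos = torch.arange(T, device=idx.device)
        x = self.wte(idx) + self.wpe(pos)         # (B, T, C)
        for block in self.blocks:
            x = block(x)
        x = self.ln_f(x)
        logits = self.lm_head(x)                  # (B, T, vocab_size)
        loss = None
        if targets is not None:
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
        return logits, loss
```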
---
## 4. Training Loop (`train.py` / `train_standalone.py`)
### 4.1 Batch sampling
For each training step, grab random contiguous chunks:
```python
def get_batch(split):
    data = train_data if split == "train" else val_data
    ix = torch.randint(len(data) - BLOCK_SIZE, (BATCH_SIZE,))
    x = torch.stack([data[i : i + BLOCK_SIZE] for i in ix])          # inputs
    y = torch.stack([data[i + 1 : i + BLOCK_SIZE + 1] for i in ix])  # targets (shifted by 1)
    return x, y
```
### 4.2 Learning rate schedule
**Cosine with linear warmup**:
```
Step 0-200:    LR ramps up from 0 → 1e-3   (linear warmup)
Step 200-5000: LR decays from 1e-3 → 1e-4  (cosine annealing)
```
Warmup prevents early loss spikes when gradients are large.
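A sketch of the schedule as a function of step (the 200/5000 step counts and 1e-3/1e-4 rates come from the numbers above):

```python
import math

MAX_LR, MIN_LR = 1e-3, 1e-4
WARMUP_STEPS, MAX_STEPS = 200, 5000

def get_lr(step):
    if step < WARMUP_STEPS:                      # linear warmup
        return MAX_LR * (step + 1) / WARMUP_STEPS
    if step > MAX_STEPS:                         # after decay, hold at the floor
        return MIN_LR
    # cosine decay from MAX_LR down to MIN_LR
    progress = (step - WARMUP_STEPS) / (MAX_STEPS - WARMUP_STEPS)
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))
    return MIN_LR + coeff * (MAX_LR - MIN_LR)
```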
### 4.3 Optimizer
**AdamW** with separated weight decay:
- 2D parameters (weights) → weight_decay = 0.1
- 1D parameters (biases, LayerNorm) → weight_decay = 0.0
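A sketch of that parameter grouping (assumes a `model` instance is in scope; decay is applied only to parameters with 2 or more dimensions):

```python
import torch

decay = [p for p in model.parameters() if p.requires_grad and p.dim() >= 2]
no_decay = [p for p in model.parameters() if p.requires_grad and p.dim() < 2]
optimizer = torch.optim.AdamW(
    [
        {"params": decay, "weight_decay": 0.1},     # 2D weights
        {"params": no_decay, "weight_decay": 0.0},  # biases, LayerNorm
    ],
    lr=1e-3,
)
```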
### 4.4 Gradient clipping
`torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)` prevents exploding gradients.
### 4.5 Evaluation
Every 500 steps, we evaluate on 200 random validation batches and report:
```
step 500 | train loss 1.8234 | val loss 1.9012 | lr 9.12e-04 | time 45.2s
```
The best validation checkpoint is saved as `best.pt`.
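A sketch of the evaluation loop (200 batches per split, as above; `get_batch` is the sampler from 4.1):

```python
import torch

@torch.no_grad()
def estimate_loss(model, eval_iters=200):
    model.eval()
    out = {}
    for split in ("train", "val"):
        losses = torch.zeros(eval_iters)
        for i in range(eval_iters):
            x, y = get_batch(split)
            _, loss = model(x, y)
            losses[i] = loss.item()
        out[split] = losses.mean().item()
    model.train()
    return out
```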
---
## 5. Generation / Inference (`generate.py`)
Autoregressive generation:
```
1. Encode a prompt (e.g., "\nROMEO:")
2. Run forward pass → get logits for last token
3. Apply temperature + top-k sampling → probability distribution
4. Sample next token from distribution
5. Append token to sequence
6. Repeat from step 2
```
**Temperature**: lower = more conservative/deterministic, higher = more random/creative
**Top-k**: only sample from the k most likely tokens (prevents gibberish)
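A sketch of the sampling loop with temperature and top-k (illustrative; `generate.py` may differ, and `BLOCK_SIZE` is assumed to be the model's context length):

```python
import torch
import torch.nn.functional as F

BLOCK_SIZE = 256

@torch.no_grad()
def generate(model, idx, max_new_tokens, temperature=0.8, top_k=50):
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -BLOCK_SIZE:]                      # crop to the context window
        logits, _ = model(idx_cond)
        logits = logits[:, -1, :] / temperature              # logits for the last position
        if top_k is not None:
            v, _ = torch.topk(logits, top_k)
            logits[logits < v[:, [-1]]] = float("-inf")      # keep only the top-k logits
        probs = F.softmax(logits, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)    # sample one token
        idx = torch.cat((idx, next_id), dim=1)               # append and repeat
    return idx
```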
---
## 6. Results
Expected after 5000 steps on a T4 GPU (~30-60 minutes):
| Metric | Value |
|--------|-------|
| Initial loss | ~4.2 (≈ ln 65, random guessing among 65 chars) |
| Final train loss | ~1.2β1.5 |
| Final val loss | ~1.3β1.6 |
| Parameters | 10.77 M |
**Generated sample** (should look vaguely Shakespeare-like):
```
ROMEO:
What say you, then? I have heard you say
The hour is come, and I must hence depart.
```
---
## 7. Files in this Repo
| File | Purpose |
|------|---------|
| `model.py` | Pure PyTorch GPT architecture (standalone) |
| `prepare.py` | Downloads data, builds char-level vocab, saves `data.pt` |
| `train.py` | Training script (imports from `model.py`) |
| `train_standalone.py` | Self-contained training script (model + train in one file) |
| `generate.py` | Inference script → load checkpoint and generate text |
| `input.txt` | Raw tiny Shakespeare text |
| `data.pt` | Preprocessed train/val tensors + vocab mappings |
| `best.pt` | Best model checkpoint (saved during training) |
---
## How to Run
```bash
# 1. Prepare data
python prepare.py
# 2. Train (GPU recommended)
python train_standalone.py
# 3. Generate
python generate.py --prompt "ROMEO:" --length 500 --temperature 0.8
```
---
## Learning Checklist
- [ ] Read `model.py` → understand attention masking, pre-norm, weight tying
- [ ] Read `prepare.py` → understand character-level tokenization
- [ ] Read `train.py` → understand batching, LR schedule, gradient clipping
- [ ] Run training and watch loss go down
- [ ] Tweak hyperparameters (n_layer, n_embd, learning rate) and observe changes
- [ ] Generate with different temperatures and top-k values
---
Based on Andrej Karpathy's [build-nanogpt](https://github.com/karpathy/build-nanogpt) and [nanoGPT](https://github.com/karpathy/nanoGPT).