# nanoGPT: Step-by-Step Tutorial


This tutorial walks through building and training a **tiny GPT from scratch** in pure PyTorch. No `transformers` library, no pre-trained weights, just ~200 lines of clean code.


---


## Table of Contents


1. [Overview](#1-overview)
2. [Dataset Preparation](#2-dataset-preparation)
3. [Model Architecture](#3-model-architecture)
4. [Training Loop](#4-training-loop)
5. [Generation / Inference](#5-generation--inference)
6. [Results](#6-results)
7. [Files in this Repo](#7-files-in-this-repo)


---


## 1. Overview


We train a **character-level** language model on [tiny Shakespeare](https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt) (~1.1M characters, 65 unique characters).


The model learns to predict the next character given all previous characters, autoregressively. This is exactly how GPT-2, GPT-3, and ChatGPT work, just at character scale instead of word/BPE token scale.


**Model size**: ~10.8M parameters
**Architecture**: 6 layers, 6 heads, 384 embedding dim, 256 context length


---


## 2. Dataset Preparation (`prepare.py`)


### What happens:
1. **Download** tiny Shakespeare text
2. **Discover vocabulary**: find all unique characters → 65 chars
3. **Build mappings**:
   - `stoi` (string-to-int): `'a' → 0`, `'b' → 1`, ...
   - `itos` (int-to-string): reverse lookup
4. **Encode** the entire text as integers
5. **Split** 90% train / 10% validation
6. **Save** as `data.pt` (PyTorch tensors for fast loading)


### Key concept: Character-level tokenization
```python
chars = sorted(list(set(text)))               # vocabulary
vocab_size = len(chars)                       # 65
stoi = {ch: i for i, ch in enumerate(chars)}  # string-to-int
itos = {i: ch for i, ch in enumerate(chars)}  # int-to-string
encode = lambda s: [stoi[c] for c in s]       # "hello" -> [46, 43, 50, 50, 53]
decode = lambda l: "".join([itos[i] for i in l])
```


No tokenizer library needed! For English text, ~65 chars is enough.
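
Steps 4-6 are just as short. A minimal sketch of the rest of `prepare.py` (the variable names and the exact keys stored in `data.pt` are assumptions, not necessarily what this repo uses):

```python
import torch

# encode the whole corpus as one long tensor of character ids
data = torch.tensor(encode(text), dtype=torch.long)

# 90% train / 10% validation split
n = int(0.9 * len(data))
train_data, val_data = data[:n], data[n:]

# save the tensors plus the vocab mappings so train.py / generate.py can reload them
torch.save({"train": train_data, "val": val_data, "stoi": stoi, "itos": itos}, "data.pt")
```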


---


## 3. Model Architecture (`model.py`)


### 3.1 Configuration (`GPTConfig`)
```python
from dataclasses import dataclass

@dataclass
class GPTConfig:
    block_size: int = 256   # max sequence length
    vocab_size: int = 65    # number of unique characters
    n_layer: int = 6        # transformer blocks
    n_head: int = 6         # attention heads per block
    n_embd: int = 384       # embedding dimension
```


### 3.2 Causal Self-Attention
The core idea: every token can "look at" all previous tokens to decide what comes next.


```
For each token:
  Query = "What am I looking for?"
  Key   = "What do I contain?"
  Value = "What information do I have?"

  Attention score = Query · Key (scaled)
  Causal mask     = prevent looking at future tokens
  Output          = weighted sum of Values
```


We use **multi-head attention**: split embeddings into 6 parallel attention operations (heads), run them simultaneously, then concatenate.


**Code flow:**
```
Input (B, T, C)
  → c_attn → (Q, K, V) each (B, T, C)
  → reshape to (B, n_head, T, head_size)
  → Q @ K.T → attention scores (B, n_head, T, T)
  → causal mask → softmax → weighted sum of V
  → reshape back → c_proj → Output (B, T, C)
```


### 3.3 MLP (Feed-Forward)
After attention, each token gets a private "thinking step":
```
(B, T, C) → Linear(4*C) → GELU → Linear(C) → (B, T, C)
```
The 4× expansion is standard in transformers.
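
A matching sketch (layer names are illustrative; dropout omitted):

```python
import torch.nn as nn

class MLP(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.c_fc   = nn.Linear(config.n_embd, 4 * config.n_embd)  # expand to 4*C
        self.gelu   = nn.GELU()
        self.c_proj = nn.Linear(4 * config.n_embd, config.n_embd)  # project back to C

    def forward(self, x):
        return self.c_proj(self.gelu(self.c_fc(x)))
```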


### 3.4 Transformer Block
```
x = x + Attention(LayerNorm(x))   # pre-norm residual
x = x + MLP(LayerNorm(x))         # pre-norm residual
```
**Pre-LayerNorm** (normalize before each sublayer) is used by GPT-2/3/Llama.
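
Using the two modules sketched above, the block is literally those two residual additions plus the LayerNorms:

```python
import torch.nn as nn

class Block(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.ln_1 = nn.LayerNorm(config.n_embd)
        self.attn = CausalSelfAttention(config)
        self.ln_2 = nn.LayerNorm(config.n_embd)
        self.mlp  = MLP(config)

    def forward(self, x):
        x = x + self.attn(self.ln_1(x))  # pre-norm residual
        x = x + self.mlp(self.ln_2(x))   # pre-norm residual
        return x
```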


### 3.5 Full GPT Model
```
1. Token Embedding (wte): char index → vector
2. Position Embedding (wpe): position index → vector
3. Sum them: x = wte + wpe
4. Pass through N transformer blocks
5. Final LayerNorm
6. Language Model Head: project to vocab_size logits
7. Cross-entropy loss against next-character targets
```


**Weight tying**: `wte` (input embedding) shares weights with `lm_head` (output projection). Saves parameters, improves training.
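
Putting steps 1-7 and the tied head together, a compact sketch of the full model (the real `model.py` adds dropout and weight initialization and may organize its submodules differently):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GPT(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.wte = nn.Embedding(config.vocab_size, config.n_embd)   # token embedding
        self.wpe = nn.Embedding(config.block_size, config.n_embd)   # position embedding
        self.blocks = nn.ModuleList([Block(config) for _ in range(config.n_layer)])
        self.ln_f = nn.LayerNorm(config.n_embd)
        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)
        self.wte.weight = self.lm_head.weight                       # weight tying

    def forward(self, idx, targets=None):
        B, T = idx.shape
        pos = torch.arange(T, device=idx.device)
        x = self.wte(idx) + self.wpe(pos)         # (B, T, n_embd)
        for block in self.blocks:
            x = block(x)
        x = self.ln_f(x)                          # final LayerNorm
        logits = self.lm_head(x)                  # (B, T, vocab_size)
        loss = None
        if targets is not None:
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
        return logits, loss
```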


---


## 4. Training Loop (`train.py` / `train_standalone.py`)

### 4.1 Batch sampling
For each training step, grab random contiguous chunks:
```python
def get_batch(split):
    data = train_data if split == "train" else val_data
    ix = torch.randint(len(data) - BLOCK_SIZE, (BATCH_SIZE,))
    x = torch.stack([data[i : i + BLOCK_SIZE] for i in ix])          # inputs
    y = torch.stack([data[i + 1 : i + BLOCK_SIZE + 1] for i in ix])  # targets (shifted by 1)
    return x, y
```

### 4.2 Learning rate schedule
**Cosine with linear warmup**:
```
Step 0-200:    LR ramps up linearly from 0 to 1e-3 (warmup)
Step 200-5000: LR decays from 1e-3 down to 1e-4 (cosine annealing)
```
Warmup prevents early loss spikes when gradients are large.
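
A sketch of such a schedule (the constant names here are illustrative, not necessarily the ones in `train.py`):

```python
import math

MAX_LR, MIN_LR = 1e-3, 1e-4
WARMUP_STEPS, MAX_STEPS = 200, 5000

def get_lr(step):
    # linear warmup from 0 to MAX_LR over the first WARMUP_STEPS steps
    if step < WARMUP_STEPS:
        return MAX_LR * (step + 1) / WARMUP_STEPS
    # cosine decay from MAX_LR down to MIN_LR over the remaining steps
    progress = min((step - WARMUP_STEPS) / (MAX_STEPS - WARMUP_STEPS), 1.0)
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))
    return MIN_LR + coeff * (MAX_LR - MIN_LR)

# each step: for g in optimizer.param_groups: g["lr"] = get_lr(step)
```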


### 4.3 Optimizer
**AdamW** with separated weight decay (see the sketch below):
- 2D parameters (weights) → weight_decay = 0.1
- 1D parameters (biases, LayerNorm) → weight_decay = 0.0
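
A sketch of that parameter grouping (the decay values come from above; everything else is illustrative):

```python
import torch

# matrices (embeddings, linear weights) get weight decay; biases and LayerNorm gains do not
decay_params    = [p for p in model.parameters() if p.requires_grad and p.dim() >= 2]
no_decay_params = [p for p in model.parameters() if p.requires_grad and p.dim() < 2]

optimizer = torch.optim.AdamW(
    [
        {"params": decay_params, "weight_decay": 0.1},
        {"params": no_decay_params, "weight_decay": 0.0},
    ],
    lr=1e-3,
)
```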


### 4.4 Gradient clipping
`torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)` prevents exploding gradients.


### 4.5 Evaluation
Every 500 steps, we evaluate on 200 random validation batches and report:
```
step 500 | train loss 1.8234 | val loss 1.9012 | lr 9.12e-04 | time 45.2s
```
The best validation checkpoint is saved as `best.pt`.
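
A sketch of the evaluation and checkpointing logic (assuming the model's forward returns `(logits, loss)` as sketched earlier and reusing `get_batch` from above; the real script and the exact contents of `best.pt` may differ):

```python
import torch

EVAL_ITERS = 200  # number of random batches per split

@torch.no_grad()
def estimate_loss(model):
    model.eval()
    out = {}
    for split in ("train", "val"):
        losses = torch.zeros(EVAL_ITERS)
        for i in range(EVAL_ITERS):
            x, y = get_batch(split)
            _, loss = model(x, y)
            losses[i] = loss.item()
        out[split] = losses.mean().item()
    model.train()
    return out

# before the loop: best_val_loss = float("inf")
# inside the training loop, every 500 steps:
losses = estimate_loss(model)
if losses["val"] < best_val_loss:
    best_val_loss = losses["val"]
    torch.save(model.state_dict(), "best.pt")
```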


---


## 5. Generation / Inference (`generate.py`)


Autoregressive generation:
```
1. Encode a prompt (e.g., "\nROMEO:")
2. Run forward pass → get logits for last token
3. Apply temperature + top-k sampling → probability distribution
4. Sample next token from distribution
5. Append token to sequence
6. Repeat from step 2
```


**Temperature**: lower = more conservative/deterministic, higher = more random/creative
**Top-k**: only sample from the k most likely tokens (prevents gibberish)
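
A sketch of that loop (assuming the model returns `(logits, loss)` as above; `BLOCK_SIZE` is the 256-token context window):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def generate(model, idx, max_new_tokens, temperature=0.8, top_k=50):
    # idx: (B, T) tensor of character ids; grows by one id per iteration
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -BLOCK_SIZE:]                  # crop to the context window
        logits, _ = model(idx_cond)
        logits = logits[:, -1, :] / temperature          # logits for the last position only
        if top_k is not None:
            v, _ = torch.topk(logits, top_k)
            logits[logits < v[:, [-1]]] = float("-inf")  # keep only the k most likely tokens
        probs = F.softmax(logits, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)
        idx = torch.cat((idx, next_id), dim=1)
    return idx
```

For example, `decode(generate(model, torch.tensor([encode("\nROMEO:")]), 500)[0].tolist())` would produce 500 new characters after the prompt.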


---


## 6. Results


Expected after 5000 steps on a T4 GPU (~30-60 minutes):


| Metric | Value |
|--------|-------|
| Initial loss | ~4.3 (random guessing among 65 chars; -ln(1/65) ≈ 4.17) |
| Final train loss | ~1.2-1.5 |
| Final val loss | ~1.3-1.6 |
| Parameters | 10.77 M |


**Generated sample** (should look vaguely Shakespeare-like):
```
ROMEO:
What say you, then? I have heard you say
The hour is come, and I must hence depart.
```


---


## 7. Files in this Repo


| File | Purpose |
|------|---------|
| `model.py` | Pure PyTorch GPT architecture (standalone) |
| `prepare.py` | Downloads data, builds char-level vocab, saves `data.pt` |
| `train.py` | Training script (imports from `model.py`) |
| `train_standalone.py` | Self-contained training script (model + train in one file) |
| `generate.py` | Inference script: load checkpoint and generate text |
| `input.txt` | Raw tiny Shakespeare text |
| `data.pt` | Preprocessed train/val tensors + vocab mappings |
| `best.pt` | Best model checkpoint (saved during training) |


---


## How to Run


```bash
# 1. Prepare data
python prepare.py

# 2. Train (GPU recommended)
python train_standalone.py

# 3. Generate
python generate.py --prompt "ROMEO:" --length 500 --temperature 0.8
```


---


## Learning Checklist


- [ ] Read `model.py`: understand attention masking, pre-norm, weight tying
- [ ] Read `prepare.py`: understand character-level tokenization
- [ ] Read `train.py`: understand batching, LR schedule, gradient clipping
- [ ] Run training and watch loss go down
- [ ] Tweak hyperparameters (n_layer, n_embd, learning rate) and observe changes
- [ ] Generate with different temperatures and top-k values


---


Based on Andrej Karpathy's [build-nanogpt](https://github.com/karpathy/build-nanogpt) and [nanoGPT](https://github.com/karpathy/nanoGPT).