
nanoGPT: Step-by-Step Tutorial

This tutorial walks through building and training a tiny GPT from scratch in pure PyTorch. No transformers library, no pre-trained weights: just ~200 lines of clean code.


Table of Contents

  1. Overview
  2. Dataset Preparation
  3. Model Architecture
  4. Training Loop
  5. Generation / Inference
  6. Results
  7. Files in this Repo

1. Overview

We train a character-level language model on tiny Shakespeare (~1.1M characters, 65 unique characters).

The model learns to predict the next character given all previous characters, autoregressively. This is exactly how GPT-2, GPT-3, and ChatGPT work, just at character scale instead of word/BPE token scale.

Model size: ~10.8M parameters
Architecture: 6 layers, 6 heads, 384 embedding dim, 256 context length


2. Dataset Preparation (prepare.py)

What happens:

  1. Download tiny Shakespeare text
  2. Discover vocabulary: find all unique characters → 65 chars
  3. Build mappings:
    • stoi (string-to-int): 'a' → 0, 'b' → 1, ...
    • itos (int-to-string): reverse lookup
  4. Encode the entire text as integers
  5. Split 90% train / 10% validation
  6. Save as data.pt (PyTorch tensors for fast loading)

Key concept: Character-level tokenization

chars = sorted(list(set(text)))          # vocabulary
vocab_size = len(chars)                  # 65
encode = lambda s: [stoi[c] for c in s]  # "hello" -> [46, 43, 50, 50, 53]
decode = lambda l: "".join([itos[i] for i in l])

No tokenizer library needed! For plain English text like tiny Shakespeare, 65 characters cover the whole corpus.
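The remaining steps (encode, split, save) can be sketched as follows; this reuses encode, stoi, itos, and vocab_size from the snippet above, and the exact key names in data.pt may differ from prepare.py:

import torch

data = torch.tensor(encode(text), dtype=torch.long)  # whole corpus as integer ids
n = int(0.9 * len(data))                              # 90% train / 10% validation
train_data, val_data = data[:n], data[n:]

torch.save({
    "train": train_data,
    "val": val_data,
    "stoi": stoi,
    "itos": itos,
    "vocab_size": vocab_size,
}, "data.pt")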


3. Model Architecture (model.py)

3.1 Configuration (GPTConfig)

@dataclass
class GPTConfig:
    block_size: int = 256    # max sequence length
    vocab_size: int = 65     # number of unique characters
    n_layer: int = 6         # transformer blocks
    n_head: int = 6          # attention heads per block
    n_embd: int = 384        # embedding dimension

3.2 Causal Self-Attention

The core idea: every token can "look at" all previous tokens to decide what comes next.

For each token:
  Query = "What am I looking for?"
  Key   = "What do I contain?"
  Value = "What information do I have?"

Attention score = Query · Key  (scaled)
Causal mask     = prevent looking at future tokens
Output          = weighted sum of Values

We use multi-head attention: split embeddings into 6 parallel attention operations (heads), run them simultaneously, then concatenate.

Code flow:

Input (B, T, C)
  → c_attn → (Q, K, V) each (B, T, C)
  → reshape to (B, n_head, T, head_size)
  → Q @ K.T → attention scores (B, n_head, T, T)
  → causal mask → softmax → weighted sum of V
  → reshape back → c_proj → Output (B, T, C)
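The same flow as a minimal PyTorch module. This is a sketch built around the GPTConfig above, not a line-for-line copy of model.py:

import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        assert config.n_embd % config.n_head == 0
        self.n_head = config.n_head
        self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd)  # fused Q, K, V projection
        self.c_proj = nn.Linear(config.n_embd, config.n_embd)      # output projection
        # lower-triangular mask: token t may only attend to positions <= t
        self.register_buffer("mask", torch.tril(torch.ones(config.block_size, config.block_size)))

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.c_attn(x).split(C, dim=2)
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)  # (B, n_head, T, head_size)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        att = (q @ k.transpose(-2, -1)) / math.sqrt(k.size(-1))          # scaled scores (B, n_head, T, T)
        att = att.masked_fill(self.mask[:T, :T] == 0, float("-inf"))     # causal mask
        att = F.softmax(att, dim=-1)
        y = att @ v                                                      # weighted sum of values
        y = y.transpose(1, 2).contiguous().view(B, T, C)                 # merge heads back
        return self.c_proj(y)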

3.3 MLP (Feed-Forward)

After attention, each token gets a private "thinking step":

(B, T, C) → Linear(4*C) → GELU → Linear(C) → (B, T, C)

The 4× expansion is standard in transformers.
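As a module, following the shapes above (a sketch, reusing the imports from the attention snippet):

class MLP(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.c_fc = nn.Linear(config.n_embd, 4 * config.n_embd)    # expand 4x
        self.gelu = nn.GELU()
        self.c_proj = nn.Linear(4 * config.n_embd, config.n_embd)  # project back

    def forward(self, x):
        return self.c_proj(self.gelu(self.c_fc(x)))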

3.4 Transformer Block

x = x + Attention(LayerNorm(x))   # pre-norm residual
x = x + MLP(LayerNorm(x))         # pre-norm residual

Pre-LayerNorm (normalize before sublayer) is used by GPT-2/3/Llama.
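Putting the two sublayers together, using the CausalSelfAttention and MLP sketches above:

class Block(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.ln_1 = nn.LayerNorm(config.n_embd)
        self.attn = CausalSelfAttention(config)
        self.ln_2 = nn.LayerNorm(config.n_embd)
        self.mlp = MLP(config)

    def forward(self, x):
        x = x + self.attn(self.ln_1(x))  # pre-norm residual
        x = x + self.mlp(self.ln_2(x))   # pre-norm residual
        return x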

3.5 Full GPT Model

1. Token Embedding  (wte): char index → vector
2. Position Embedding (wpe): position index → vector
3. Sum them: x = wte + wpe
4. Pass through N transformer blocks
5. Final LayerNorm
6. Language Model Head: project to vocab_size logits
7. Cross-entropy loss against next-character targets

Weight tying: wte (input embedding) shares weights with lm_head (output projection). Saves parameters, improves training.
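A condensed forward pass covering steps 1-7 and the weight tying. This is a sketch; model.py contains the full module with initialization details:

class GPT(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.wte = nn.Embedding(config.vocab_size, config.n_embd)   # token embedding
        self.wpe = nn.Embedding(config.block_size, config.n_embd)   # position embedding
        self.blocks = nn.ModuleList(Block(config) for _ in range(config.n_layer))
        self.ln_f = nn.LayerNorm(config.n_embd)                     # final LayerNorm
        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)
        self.lm_head.weight = self.wte.weight                       # weight tying

    def forward(self, idx, targets=None):
        B, T = idx.shape
        pos = torch.arange(T, device=idx.device)
        x = self.wte(idx) + self.wpe(pos)         # sum token and position embeddings (B, T, C)
        for block in self.blocks:
            x = block(x)
        x = self.ln_f(x)
        logits = self.lm_head(x)                  # (B, T, vocab_size)
        loss = None
        if targets is not None:
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
        return logits, loss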


4. Training Loop (train.py / train_standalone.py)

4.1 Batch sampling

For each training step, grab random contiguous chunks:

def get_batch(split):
    data = train_data if split == "train" else val_data
    ix = torch.randint(len(data) - BLOCK_SIZE, (BATCH_SIZE,))                 # random start offsets
    x = torch.stack([data[i : i + BLOCK_SIZE] for i in ix])                   # inputs
    y = torch.stack([data[i + 1 : i + BLOCK_SIZE + 1] for i in ix])           # targets (shifted by 1)
    return x, y

4.2 Learning rate schedule

Cosine with linear warmup:

Step 0-200:    LR ramps up from 0 → 1e-3   (warmup)
Step 200-5000: LR decays cosine to 1e-4    (cosine annealing)

Warmup prevents early loss spikes when gradients are large.
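A sketch of the schedule; the constants mirror the numbers above, and the names are illustrative rather than copied from the training script:

import math

MAX_LR, MIN_LR = 1e-3, 1e-4
WARMUP_STEPS, MAX_STEPS = 200, 5000

def get_lr(step):
    if step < WARMUP_STEPS:                                    # linear warmup
        return MAX_LR * (step + 1) / WARMUP_STEPS
    if step >= MAX_STEPS:                                      # hold at the floor after decay
        return MIN_LR
    ratio = (step - WARMUP_STEPS) / (MAX_STEPS - WARMUP_STEPS)
    coeff = 0.5 * (1.0 + math.cos(math.pi * ratio))            # goes 1 -> 0 over the decay window
    return MIN_LR + coeff * (MAX_LR - MIN_LR)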

4.3 Optimizer

AdamW with separated weight decay:

  • 2D parameters (weights) → weight_decay = 0.1
  • 1D parameters (biases, LayerNorm) → weight_decay = 0.0
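In code, the split can be built from tensor dimensionality. A sketch, assuming model is the GPT instance; the lr and betas values here are illustrative:

import torch

decay = [p for p in model.parameters() if p.requires_grad and p.dim() >= 2]
no_decay = [p for p in model.parameters() if p.requires_grad and p.dim() < 2]
optimizer = torch.optim.AdamW(
    [
        {"params": decay, "weight_decay": 0.1},     # weights
        {"params": no_decay, "weight_decay": 0.0},  # biases, LayerNorm parameters
    ],
    lr=1e-3,
    betas=(0.9, 0.95),
)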

4.4 Gradient clipping

torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0) prevents exploding gradients.

4.5 Evaluation

Every 500 steps, we evaluate on 200 random validation batches and report:

step  500 | train loss 1.8234 | val loss 1.9012 | lr 9.12e-04 | time 45.2s

The best validation checkpoint is saved as best.pt.
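A sketch of the evaluation helper: average the loss over a fixed number of random batches with gradients disabled, reusing get_batch from above (names are illustrative):

import torch

@torch.no_grad()
def estimate_loss(model, eval_iters=200):
    model.eval()
    out = {}
    for split in ("train", "val"):
        losses = torch.zeros(eval_iters)
        for i in range(eval_iters):
            x, y = get_batch(split)
            _, loss = model(x, y)
            losses[i] = loss.item()
        out[split] = losses.mean().item()
    model.train()
    return out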


5. Generation / Inference (generate.py)

Autoregressive generation:

1. Encode a prompt (e.g., "\nROMEO:")
2. Run forward pass → get logits for the last token
3. Apply temperature + top-k sampling → probability distribution
4. Sample next token from distribution
5. Append token to sequence
6. Repeat from step 2

Temperature: lower = more conservative/deterministic, higher = more random/creative
Top-k: only sample from the k most likely tokens (prevents gibberish)
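A sketch of the sampling loop, with temperature and top-k applied to the last position's logits. It assumes model, encode, decode, and BLOCK_SIZE as defined earlier; generate.py's actual interface may differ:

import torch
import torch.nn.functional as F

@torch.no_grad()
def generate(model, prompt, max_new_tokens=500, temperature=0.8, top_k=50):
    idx = torch.tensor([encode(prompt)], dtype=torch.long)       # (1, T) prompt tokens
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -BLOCK_SIZE:]                          # crop to the context length
        logits, _ = model(idx_cond)
        logits = logits[:, -1, :] / temperature                  # keep only the last position
        if top_k is not None:
            v, _ = torch.topk(logits, top_k)
            logits[logits < v[:, [-1]]] = float("-inf")          # mask everything outside top-k
        probs = F.softmax(logits, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)        # sample one token
        idx = torch.cat([idx, next_id], dim=1)                   # append and repeat
    return decode(idx[0].tolist())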


6. Results

Expected after 5000 steps on a T4 GPU (~30-60 minutes):

| Metric | Value |
|---|---|
| Initial loss | ~4.3 (random guessing among 65 chars) |
| Final train loss | ~1.2–1.5 |
| Final val loss | ~1.3–1.6 |
| Parameters | 10.77 M |

Generated sample (should look vaguely Shakespeare-like):

ROMEO:
What say you, then? I have heard you say
The hour is come, and I must hence depart.

7. Files in this Repo

| File | Purpose |
|---|---|
| model.py | Pure PyTorch GPT architecture (standalone) |
| prepare.py | Downloads data, builds char-level vocab, saves data.pt |
| train.py | Training script (imports from model.py) |
| train_standalone.py | Self-contained training script (model + training in one file) |
| generate.py | Inference script: load checkpoint and generate text |
| input.txt | Raw tiny Shakespeare text |
| data.pt | Preprocessed train/val tensors + vocab mappings |
| best.pt | Best model checkpoint (saved during training) |

How to Run

# 1. Prepare data
python prepare.py

# 2. Train (GPU recommended)
python train_standalone.py

# 3. Generate
python generate.py --prompt "ROMEO:" --length 500 --temperature 0.8

Learning Checklist

  • Read model.py β€” understand attention masking, pre-norm, weight tying
  • Read prepare.py β€” understand character-level tokenization
  • Read train.py β€” understand batching, LR schedule, gradient clipping
  • Run training and watch loss go down
  • Tweak hyperparameters (n_layer, n_embd, learning rate) and observe changes
  • Generate with different temperatures and top-k values

Based on Andrej Karpathy's build-nanogpt and nanoGPT.