# nanoGPT Tutorial
A step-by-step implementation of a tiny GPT model from scratch in pure PyTorch.
## What is this?
This repository contains a complete, tutorial-style implementation of a small GPT (Generative Pre-trained Transformer) trained on tiny Shakespeare. Every line of code is commented to explain **what** it does and **why**.
## Files
| File | Purpose |
|------|---------|
| `model.py` | The full GPT architecture: CausalSelfAttention, MLP, Block, GPT |
| `prepare.py` | Data preparation: character-level tokenization, train/val split (see the sketch after this table) |
| `train.py` | Training loop with AdamW, cosine LR schedule, and generation |
| `input.txt` | The tiny Shakespeare dataset (~1.1M characters, 65 unique chars) |
| `data.pt` | Preprocessed tensors (generated by `prepare.py`) |
| `best.pt` | Best model checkpoint (generated by `train.py`) |
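
For orientation, here is roughly what the data-preparation step boils down to. This is an illustrative sketch, not the contents of `prepare.py`; the 90/10 split ratio and the exact layout of `data.pt` are assumptions.

```python
import torch

# Illustrative sketch of character-level data preparation (not the actual prepare.py).
text = open("input.txt", "r", encoding="utf-8").read()

chars = sorted(set(text))                      # ~65 unique characters in tiny Shakespeare
stoi = {ch: i for i, ch in enumerate(chars)}   # character -> integer id
itos = {i: ch for ch, i in stoi.items()}       # integer id -> character

data = torch.tensor([stoi[ch] for ch in text], dtype=torch.long)
n = int(0.9 * len(data))                       # train/val split (90/10 is an assumption)
torch.save({"train": data[:n], "val": data[n:], "stoi": stoi, "itos": itos}, "data.pt")
```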
## Model Architecture
```
GPT(
  wte (Embedding): vocab_size -> n_embd   (token embeddings)
  wpe (Embedding): block_size -> n_embd   (position embeddings)
  h (6x Block):
    ln_1 (LayerNorm)
    attn (CausalSelfAttention: multi-head self-attention with causal mask)
    ln_2 (LayerNorm)
    mlp  (MLP: expand 4x -> GELU -> project back)
  ln_f (LayerNorm)
  lm_head (Linear): n_embd -> vocab_size  (next-token prediction)
)
```
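
For readers who want that picture in code, here is a condensed PyTorch sketch of the same structure. It is a simplification meant to be read alongside `model.py`, not a copy of it: the `cfg` field names are illustrative, and the real code adds details such as dropout and weight initialization.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.n_head = cfg.n_head
        self.c_attn = nn.Linear(cfg.n_embd, 3 * cfg.n_embd)  # fused query/key/value projection
        self.c_proj = nn.Linear(cfg.n_embd, cfg.n_embd)       # output projection

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.c_attn(x).split(C, dim=2)
        # reshape to (B, n_head, T, head_dim) so each head attends independently
        q, k, v = (t.view(B, T, self.n_head, C // self.n_head).transpose(1, 2) for t in (q, k, v))
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)  # causal mask applied here
        return self.c_proj(y.transpose(1, 2).contiguous().view(B, T, C))

class Block(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.ln_1 = nn.LayerNorm(cfg.n_embd)
        self.attn = CausalSelfAttention(cfg)
        self.ln_2 = nn.LayerNorm(cfg.n_embd)
        self.mlp = nn.Sequential(
            nn.Linear(cfg.n_embd, 4 * cfg.n_embd),  # expand 4x
            nn.GELU(),
            nn.Linear(4 * cfg.n_embd, cfg.n_embd),  # project back
        )

    def forward(self, x):
        x = x + self.attn(self.ln_1(x))  # pre-LayerNorm residual
        x = x + self.mlp(self.ln_2(x))   # pre-LayerNorm residual
        return x

class GPT(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.wte = nn.Embedding(cfg.vocab_size, cfg.n_embd)    # token embeddings
        self.wpe = nn.Embedding(cfg.block_size, cfg.n_embd)    # position embeddings
        self.h = nn.ModuleList(Block(cfg) for _ in range(cfg.n_layer))
        self.ln_f = nn.LayerNorm(cfg.n_embd)
        self.lm_head = nn.Linear(cfg.n_embd, cfg.vocab_size, bias=False)
        self.lm_head.weight = self.wte.weight                  # weight tying

    def forward(self, idx):
        B, T = idx.shape
        pos = torch.arange(T, device=idx.device)
        x = self.wte(idx) + self.wpe(pos)   # (B, T, n_embd)
        for block in self.h:
            x = block(x)
        return self.lm_head(self.ln_f(x))   # logits over the vocabulary
```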
**Key design choices:**
- **Character-level vocabulary** — no tokenizer library needed
- **Pre-LayerNorm** residuals — standard in modern transformers
- **Weight tying** — shared weights between input embedding and output projection
- **Causal (autoregressive) attention** — can only attend to past tokens
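
The causal mask is what makes the model autoregressive: position `t` may attend to positions `0..t` but never to the future. The sketch above hands this off to `is_causal=True`; the snippet below spells out the same masking by hand, purely for illustration.

```python
import torch

T = 5                                              # toy sequence length
scores = torch.randn(T, T)                         # dummy attention scores (query x key)
mask = torch.tril(torch.ones(T, T))                # lower-triangular: 1 where attention is allowed
scores = scores.masked_fill(mask == 0, float("-inf"))
weights = torch.softmax(scores, dim=-1)            # each row sums to 1 over past positions only
print(weights)                                     # upper triangle is exactly zero
```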
## How to Run
```bash
# 1. Prepare data
python prepare.py
# 2. Train (a GPU is much faster, but CPU works too)
python train.py
# 3. The model will print generated Shakespeare-style text at the end!
```
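
The generation at the end of training is a plain autoregressive sampling loop. The sketch below shows its general shape under the assumption that the model returns logits of shape `(B, T, vocab_size)`; the actual code in `train.py` may differ in details such as temperature or top-k sampling.

```python
import torch

@torch.no_grad()
def generate(model, idx, max_new_tokens, block_size=256):
    """Extend a (B, T) tensor of token ids by sampling one character at a time."""
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -block_size:]                  # crop to the context length
        logits = model(idx_cond)                         # (B, T, vocab_size)
        probs = torch.softmax(logits[:, -1, :], dim=-1)  # distribution over the next character
        next_id = torch.multinomial(probs, num_samples=1)
        idx = torch.cat([idx, next_id], dim=1)           # append and continue
    return idx
```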
## Training Details
| Hyperparameter | Value |
|---------------|-------|
| Layers | 6 |
| Heads | 6 |
| Embedding dim | 384 |
| Context length | 256 |
| Batch size | 64 |
| Training steps | 5,000 |
| Optimizer | AdamW (β₁=0.9, β₂=0.95) |
| Learning rate | 1e-3 (cosine decay to 1e-4) |
| Warmup | 200 steps |
| Gradient clip | 1.0 |
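
The learning-rate row describes a warmup-then-cosine schedule. Here is a sketch of that shape using the numbers from the table; the exact function in `train.py` may be written differently.

```python
import math

max_lr, min_lr = 1e-3, 1e-4
warmup_steps, max_steps = 200, 5_000

def get_lr(step):
    if step < warmup_steps:                       # linear warmup up to max_lr
        return max_lr * (step + 1) / warmup_steps
    if step >= max_steps:                         # past the schedule: hold the floor
        return min_lr
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))  # decays from 1 to 0 over training
    return min_lr + coeff * (max_lr - min_lr)
```

Each step, the value is typically written into the optimizer with `for g in optimizer.param_groups: g["lr"] = get_lr(step)`.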
## Acknowledgments
Based on Andrej Karpathy's legendary [build-nanogpt](https://github.com/karpathy/build-nanogpt) and [nanoGPT](https://github.com/karpathy/nanoGPT) repositories.