# nanoGPT Tutorial
A step-by-step implementation of a tiny GPT model from scratch in pure PyTorch.
## What is this?
This repository contains a complete, tutorial-style implementation of a small GPT (Generative Pre-trained Transformer) trained on tiny Shakespeare. Every line of code is commented to explain **what** it does and **why**.
## Files
| File | Purpose |
|------|---------|
| `model.py` | The full GPT architecture: CausalSelfAttention, MLP, Block, GPT |
| `prepare.py` | Data preparation: character-level tokenization, train/val split (see the sketch after this table) |
| `train.py` | Training loop with AdamW, cosine LR schedule, and generation |
| `input.txt` | The tiny Shakespeare dataset (~1.1M characters, 65 unique chars) |
| `data.pt` | Preprocessed tensors (generated by `prepare.py`) |
| `best.pt` | Best model checkpoint (generated by `train.py`) |
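
For orientation, here is roughly what the data-preparation step boils down to. This is an illustrative sketch, not the contents of `prepare.py`; the 90/10 split ratio and the exact layout of `data.pt` are assumptions.

```python
import torch

# Illustrative sketch of character-level data preparation (not the actual prepare.py).
text = open("input.txt", "r", encoding="utf-8").read()

chars = sorted(set(text))                      # ~65 unique characters in tiny Shakespeare
stoi = {ch: i for i, ch in enumerate(chars)}   # character -> integer id
itos = {i: ch for ch, i in stoi.items()}       # integer id -> character

data = torch.tensor([stoi[ch] for ch in text], dtype=torch.long)
n = int(0.9 * len(data))                       # train/val split (90/10 is an assumption)
torch.save({"train": data[:n], "val": data[n:], "stoi": stoi, "itos": itos}, "data.pt")
```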
## Model Architecture
```
GPT(
  wte (Embedding): vocab_size -> n_embd   (token embeddings)
  wpe (Embedding): block_size -> n_embd   (position embeddings)
  h (6x Block):
    ln_1 (LayerNorm)
    attn (CausalSelfAttention: multi-head self-attention with causal mask)
    ln_2 (LayerNorm)
    mlp  (MLP: expand 4x -> GELU -> project back)
  ln_f (LayerNorm)
  lm_head (Linear): n_embd -> vocab_size  (next-token prediction)
)
```
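
For readers who want that picture in code, here is a condensed PyTorch sketch of the same structure. It is a simplification meant to be read alongside `model.py`, not a copy of it: the `cfg` field names are illustrative, and the real code adds details such as dropout and weight initialization.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.n_head = cfg.n_head
        self.c_attn = nn.Linear(cfg.n_embd, 3 * cfg.n_embd)  # fused query/key/value projection
        self.c_proj = nn.Linear(cfg.n_embd, cfg.n_embd)       # output projection

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.c_attn(x).split(C, dim=2)
        # reshape to (B, n_head, T, head_dim) so each head attends independently
        q, k, v = (t.view(B, T, self.n_head, C // self.n_head).transpose(1, 2) for t in (q, k, v))
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)  # causal mask applied here
        return self.c_proj(y.transpose(1, 2).contiguous().view(B, T, C))

class Block(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.ln_1 = nn.LayerNorm(cfg.n_embd)
        self.attn = CausalSelfAttention(cfg)
        self.ln_2 = nn.LayerNorm(cfg.n_embd)
        self.mlp = nn.Sequential(
            nn.Linear(cfg.n_embd, 4 * cfg.n_embd),  # expand 4x
            nn.GELU(),
            nn.Linear(4 * cfg.n_embd, cfg.n_embd),  # project back
        )

    def forward(self, x):
        x = x + self.attn(self.ln_1(x))  # pre-LayerNorm residual
        x = x + self.mlp(self.ln_2(x))   # pre-LayerNorm residual
        return x

class GPT(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.wte = nn.Embedding(cfg.vocab_size, cfg.n_embd)    # token embeddings
        self.wpe = nn.Embedding(cfg.block_size, cfg.n_embd)    # position embeddings
        self.h = nn.ModuleList(Block(cfg) for _ in range(cfg.n_layer))
        self.ln_f = nn.LayerNorm(cfg.n_embd)
        self.lm_head = nn.Linear(cfg.n_embd, cfg.vocab_size, bias=False)
        self.lm_head.weight = self.wte.weight                  # weight tying

    def forward(self, idx):
        B, T = idx.shape
        pos = torch.arange(T, device=idx.device)
        x = self.wte(idx) + self.wpe(pos)   # (B, T, n_embd)
        for block in self.h:
            x = block(x)
        return self.lm_head(self.ln_f(x))   # logits over the vocabulary
```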
**Key design choices:**
- **Character-level vocabulary** — no tokenizer library needed
- **Pre-LayerNorm** residuals — standard in modern transformers
- **Weight tying** — shared weights between input embedding and output projection
- **Causal (autoregressive) attention** — can only attend to past tokens
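
The causal mask is what makes the model autoregressive: position `t` may attend to positions `0..t` but never to the future. The sketch above hands this off to `is_causal=True`; the snippet below spells out the same masking by hand, purely for illustration.

```python
import torch

T = 5                                              # toy sequence length
scores = torch.randn(T, T)                         # dummy attention scores (query x key)
mask = torch.tril(torch.ones(T, T))                # lower-triangular: 1 where attention is allowed
scores = scores.masked_fill(mask == 0, float("-inf"))
weights = torch.softmax(scores, dim=-1)            # each row sums to 1 over past positions only
print(weights)                                     # upper triangle is exactly zero
```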
## How to Run
```bash
# 1. Prepare data
python prepare.py
# 2. Train (a GPU is much faster, but CPU works too)
python train.py
# 3. The model will print generated Shakespeare-style text at the end!
```
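
The generation at the end of training is a plain autoregressive sampling loop. The sketch below shows its general shape under the assumption that the model returns logits of shape `(B, T, vocab_size)`; the actual code in `train.py` may differ in details such as temperature or top-k sampling.

```python
import torch

@torch.no_grad()
def generate(model, idx, max_new_tokens, block_size=256):
    """Extend a (B, T) tensor of token ids by sampling one character at a time."""
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -block_size:]                  # crop to the context length
        logits = model(idx_cond)                         # (B, T, vocab_size)
        probs = torch.softmax(logits[:, -1, :], dim=-1)  # distribution over the next character
        next_id = torch.multinomial(probs, num_samples=1)
        idx = torch.cat([idx, next_id], dim=1)           # append and continue
    return idx
```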
## Training Details
| Hyperparameter | Value |
|---------------|-------|
| Layers | 6 |
| Heads | 6 |
| Embedding dim | 384 |
| Context length | 256 |
| Batch size | 64 |
| Training steps | 5,000 |
| Optimizer | AdamW (β₁=0.9, β₂=0.95) |
| Learning rate | 1e-3 (cosine decay to 1e-4) |
| Warmup | 200 steps |
| Gradient clip | 1.0 |
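
The learning-rate row describes a warmup-then-cosine schedule. Here is a sketch of that shape using the numbers from the table; the exact function in `train.py` may be written differently.

```python
import math

max_lr, min_lr = 1e-3, 1e-4
warmup_steps, max_steps = 200, 5_000

def get_lr(step):
    if step < warmup_steps:                       # linear warmup up to max_lr
        return max_lr * (step + 1) / warmup_steps
    if step >= max_steps:                         # past the schedule: hold the floor
        return min_lr
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))  # decays from 1 to 0 over training
    return min_lr + coeff * (max_lr - min_lr)
```

Each step, the value is typically written into the optimizer with `for g in optimizer.param_groups: g["lr"] = get_lr(step)`.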
## Acknowledgments
Based on Andrej Karpathy's legendary [build-nanogpt](https://github.com/karpathy/build-nanogpt) and [nanoGPT](https://github.com/karpathy/nanoGPT) repositories.