| # nanoGPT Tutorial |
|
|
| A step-by-step implementation of a tiny GPT model from scratch in pure PyTorch. |
|
|
| ## What is this? |
|
|
| This repository contains a complete, tutorial-style implementation of a small GPT (Generative Pre-trained Transformer) trained on tiny Shakespeare. Every line of code is commented to explain **what** it does and **why**. |
|
|
| ## Files |
|
|
| | File | Purpose | |
| |------|---------| |
| | `model.py` | The full GPT architecture: CausalSelfAttention, MLP, Block, GPT | |
| | `prepare.py` | Data preparation: character-level tokenization, train/val split | |
| | `train.py` | Training loop with AdamW, cosine LR schedule, and generation | |
| | `input.txt` | The tiny Shakespeare dataset (~1.1M characters, 65 unique chars) | |
| | `data.pt` | Preprocessed tensors (generated by `prepare.py`) | |
| | `best.pt` | Best model checkpoint (generated by `train.py`) | |
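
For intuition, character-level tokenization boils down to two lookup tables built from the corpus. Here is a minimal sketch of what `prepare.py` does (variable names and the 90/10 split ratio are illustrative, not necessarily the repo's):

```python
# Character-level tokenization sketch. Names and the split ratio are
# illustrative; see prepare.py for the repo's actual implementation.
import torch

text = open("input.txt", encoding="utf-8").read()

chars = sorted(set(text))                      # the 65 unique characters
stoi = {ch: i for i, ch in enumerate(chars)}   # char -> integer id
itos = {i: ch for ch, i in stoi.items()}       # integer id -> char

def encode(s: str) -> list[int]:
    return [stoi[c] for c in s]

def decode(ids: list[int]) -> str:
    return "".join(itos[i] for i in ids)

assert decode(encode("ROMEO: ")) == "ROMEO: "

# Encode the whole corpus once, then split it for training/validation
data = torch.tensor(encode(text), dtype=torch.long)
n = int(0.9 * len(data))                       # illustrative 90/10 split
train_data, val_data = data[:n], data[n:]
```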
|
|
| ## Model Architecture |
|
|
| ``` |
| GPT( |
| wte (Embedding): vocab_size -> n_embd (token embeddings) |
| wpe (Embedding): block_size -> n_embd (position embeddings) |
| h (6x Block): |
| ln_1 (LayerNorm) |
| attn (CausalSelfAttention: multi-head self-attention with causal mask) |
| ln_2 (LayerNorm) |
| mlp (MLP: expand 4x -> GELU -> project back) |
| ln_f (LayerNorm) |
| lm_head (Linear): n_embd -> vocab_size (next-token prediction) |
| ) |
| ``` |
|
|
**Key design choices:**
- **Character-level vocabulary** — each of the 65 unique characters is a token, so no tokenizer library is needed
- **Pre-LayerNorm residuals** — LayerNorm is applied before each sublayer, standard in modern transformers
- **Weight tying** — the input embedding and the output projection share one weight matrix (see the sketch below)
- **Causal (autoregressive) attention** — each position attends only to itself and earlier tokens
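
Weight tying, for instance, is a one-liner in PyTorch. A sketch using the module names from the diagram above (the repo's attribute paths may differ):

```python
# Weight tying sketch: token embedding and output projection share one
# matrix. Names follow the diagram above, not necessarily model.py.
import torch.nn as nn

vocab_size, n_embd = 65, 384
wte = nn.Embedding(vocab_size, n_embd)               # token embeddings
lm_head = nn.Linear(n_embd, vocab_size, bias=False)  # next-token logits
lm_head.weight = wte.weight   # both modules now update the same tensor
```

Since both weights have shape `(vocab_size, n_embd)`, the assignment is legal, and the model saves an entire `vocab_size × n_embd` matrix of parameters.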
|
|
| ## How to Run |
|
|
| ```bash |
| # 1. Prepare data |
| python prepare.py |
| |
# 2. Train (a GPU makes this fast; CPU works but is slow)
| python train.py |
| |
| # 3. The model will print generated Shakespeare-style text at the end! |
| ``` |
|
|
| ## Training Details |
|
|
| | Hyperparameter | Value | |
| |---------------|-------| |
| | Layers | 6 | |
| | Heads | 6 | |
| | Embedding dim | 384 | |
| | Context length | 256 | |
| | Batch size | 64 | |
| | Training steps | 5,000 | |
| | Optimizer | AdamW (β₁=0.9, β₂=0.95) | |
| | Learning rate | 1e-3 (cosine decay to 1e-4) | |
| | Warmup | 200 steps | |
| | Gradient clip | 1.0 | |
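
The warmup-plus-cosine schedule from the table fits in a few lines. A sketch using the hyperparameters above; the exact formula in `train.py` may differ slightly:

```python
import math

max_lr, min_lr = 1e-3, 1e-4
warmup_steps, max_steps = 200, 5000

def get_lr(step: int) -> float:
    # Linear warmup from 0 to max_lr over the first 200 steps
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    # Cosine decay from max_lr down to min_lr over the remaining steps
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes 1 -> 0
    return min_lr + coeff * (max_lr - min_lr)
```

In the training loop, `get_lr(step)` is evaluated once per step and written into each of the optimizer's `param_groups` before calling `optimizer.step()`.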
|
|
| ## Acknowledgments |
|
|
| Based on Andrej Karpathy's legendary [build-nanogpt](https://github.com/karpathy/build-nanogpt) and [nanoGPT](https://github.com/karpathy/nanoGPT) repositories. |
|
|