# nanoGPT Tutorial
A step-by-step implementation of a tiny GPT model from scratch in pure PyTorch.
## What is this?
This repository contains a complete, tutorial-style implementation of a small GPT (Generative Pre-trained Transformer) trained on tiny Shakespeare. Every line of code is commented to explain what it does and why.
## Files

| File | Purpose |
|---|---|
| `model.py` | The full GPT architecture: `CausalSelfAttention`, `MLP`, `Block`, `GPT` |
| `prepare.py` | Data preparation: character-level tokenization, train/val split |
| `train.py` | Training loop with AdamW, cosine LR schedule, and generation |
| `input.txt` | The tiny Shakespeare dataset (~1.1M characters, 65 unique chars) |
| `data.pt` | Preprocessed tensors (generated by `prepare.py`) |
| `best.pt` | Best model checkpoint (generated by `train.py`) |
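For orientation, here is a minimal sketch of what the character-level preparation step typically looks like; the variable and key names below are illustrative assumptions, not necessarily what `prepare.py` uses:

```python
import torch

# Read the raw corpus and build the 65-character vocabulary.
text = open("input.txt", "r", encoding="utf-8").read()
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}  # char -> id
itos = {i: ch for ch, i in stoi.items()}      # id -> char

encode = lambda s: [stoi[c] for c in s]
decode = lambda ids: "".join(itos[i] for i in ids)

# Encode the whole corpus and split into train/val (90/10 is a common choice).
data = torch.tensor(encode(text), dtype=torch.long)
n = int(0.9 * len(data))
torch.save({"train": data[:n], "val": data[n:]}, "data.pt")
```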
## Model Architecture
```
GPT(
  wte (Embedding): vocab_size -> n_embd   (token embeddings)
  wpe (Embedding): block_size -> n_embd   (position embeddings)
  h (6x Block):
    ln_1 (LayerNorm)
    attn (CausalSelfAttention: multi-head self-attention with causal mask)
    ln_2 (LayerNorm)
    mlp (MLP: expand 4x -> GELU -> project back)
  ln_f (LayerNorm)
  lm_head (Linear): n_embd -> vocab_size  (next-token prediction)
)
```
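To make the diagram concrete, here is a hedged PyTorch sketch of one pre-LayerNorm `Block` and its `CausalSelfAttention` (assuming PyTorch ≥ 2.0 for `F.scaled_dot_product_attention`; dropout, initialization, and other details may differ from `model.py`):

```python
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    """Multi-head self-attention where each position attends only to the past."""
    def __init__(self, n_embd: int, n_head: int):
        super().__init__()
        self.n_head = n_head
        self.c_attn = nn.Linear(n_embd, 3 * n_embd)  # fused q, k, v projection
        self.c_proj = nn.Linear(n_embd, n_embd)      # output projection

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.c_attn(x).split(C, dim=2)
        # Reshape each to (B, n_head, T, head_dim).
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        # is_causal=True applies the lower-triangular mask for us.
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        y = y.transpose(1, 2).contiguous().view(B, T, C)
        return self.c_proj(y)

class Block(nn.Module):
    """Pre-LayerNorm residual block: x + attn(ln(x)), then x + mlp(ln(x))."""
    def __init__(self, n_embd: int, n_head: int):
        super().__init__()
        self.ln_1 = nn.LayerNorm(n_embd)
        self.attn = CausalSelfAttention(n_embd, n_head)
        self.ln_2 = nn.LayerNorm(n_embd)
        self.mlp = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),  # expand 4x
            nn.GELU(),
            nn.Linear(4 * n_embd, n_embd),  # project back
        )

    def forward(self, x):
        x = x + self.attn(self.ln_1(x))  # residual around attention
        x = x + self.mlp(self.ln_2(x))   # residual around MLP
        return x
```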
Key design choices:
- Character-level vocabulary → no tokenizer library needed
- Pre-LayerNorm residuals → standard in modern transformers
- Weight tying → shared weights between input embedding and output projection (see the sketch below)
- Causal (autoregressive) attention → can only attend to past tokens
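Weight tying in particular reduces to a one-line parameter share, since `nn.Embedding(vocab_size, n_embd)` and `nn.Linear(n_embd, vocab_size)` both store a `(vocab_size, n_embd)` weight matrix. A minimal illustration, with sizes taken from the tables in this README:

```python
import torch.nn as nn

vocab_size, n_embd = 65, 384

wte = nn.Embedding(vocab_size, n_embd)               # token embeddings
lm_head = nn.Linear(n_embd, vocab_size, bias=False)  # next-token logits

# Tie the weights: both modules now point at the same parameter tensor,
# saving vocab_size * n_embd parameters.
lm_head.weight = wte.weight
assert lm_head.weight is wte.weight
```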
## How to Run

```bash
# 1. Prepare data
python prepare.py

# 2. Train (GPU recommended for speed; CPU works too)
python train.py

# 3. The model will print generated Shakespeare-style text at the end!
```
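If you want to sample from a saved checkpoint yourself, the core loop is a standard autoregressive decoder. A hedged sketch, assuming the model's forward pass returns logits of shape `(B, T, vocab_size)` (the actual generation code in `train.py` may differ):

```python
import torch

@torch.no_grad()
def generate(model, idx, max_new_tokens, block_size=256, temperature=1.0):
    """Sample tokens one at a time, feeding each new token back as context."""
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -block_size:]          # crop to the context window
        logits = model(idx_cond)                 # assumed shape (B, T, vocab_size)
        logits = logits[:, -1, :] / temperature  # keep only the last position
        probs = torch.softmax(logits, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)
        idx = torch.cat([idx, next_id], dim=1)
    return idx
```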
## Training Details
| Hyperparameter | Value |
|---|---|
| Layers | 6 |
| Heads | 6 |
| Embedding dim | 384 |
| Context length | 256 |
| Batch size | 64 |
| Training steps | 5,000 |
| Optimizer | AdamW (β₁=0.9, β₂=0.95) |
| Learning rate | 1e-3 (cosine decay to 1e-4) |
| Warmup | 200 steps |
| Gradient clip | 1.0 |
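The warmup-plus-cosine schedule in the table can be expressed as a small function of the training step. A sketch using the table's values (the exact implementation in `train.py` may differ):

```python
import math

max_lr, min_lr = 1e-3, 1e-4
warmup_steps, max_steps = 200, 5_000

def get_lr(step: int) -> float:
    # Linear warmup from 0 to max_lr over the first 200 steps.
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    # Cosine decay from max_lr down to min_lr over the remaining steps.
    progress = min((step - warmup_steps) / (max_steps - warmup_steps), 1.0)
    return min_lr + 0.5 * (1.0 + math.cos(math.pi * progress)) * (max_lr - min_lr)
```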
## Acknowledgments
Based on Andrej Karpathy's legendary build-nanogpt and nanoGPT repositories.