# nanoGPT Tutorial
A step-by-step implementation of a tiny GPT model from scratch in pure PyTorch.
## What is this?
This repository contains a complete, tutorial-style implementation of a small GPT (Generative Pre-trained Transformer) trained on tiny Shakespeare. Every line of code is commented to explain what it does and why.
## Files

| File | Purpose |
|---|---|
| `model.py` | The full GPT architecture: `CausalSelfAttention`, `MLP`, `Block`, `GPT` |
| `prepare.py` | Data preparation: character-level tokenization, train/val split |
| `train.py` | Training loop with AdamW, cosine LR schedule, and generation |
| `input.txt` | The tiny Shakespeare dataset (~1.1M characters, 65 unique chars) |
| `data.pt` | Preprocessed tensors (generated by `prepare.py`) |
| `best.pt` | Best model checkpoint (generated by `train.py`) |
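For orientation, here is a minimal sketch of what the character-level preparation step typically looks like; the variable and key names below are illustrative assumptions, not necessarily what `prepare.py` uses:

```python
import torch

# Read the raw corpus and build the 65-character vocabulary.
text = open("input.txt", "r", encoding="utf-8").read()
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}  # char -> id
itos = {i: ch for ch, i in stoi.items()}      # id -> char

encode = lambda s: [stoi[c] for c in s]
decode = lambda ids: "".join(itos[i] for i in ids)

# Encode the whole corpus and split into train/val (90/10 is a common choice).
data = torch.tensor(encode(text), dtype=torch.long)
n = int(0.9 * len(data))
torch.save({"train": data[:n], "val": data[n:]}, "data.pt")
```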
## Model Architecture
```
GPT(
  wte (Embedding): vocab_size -> n_embd   (token embeddings)
  wpe (Embedding): block_size -> n_embd   (position embeddings)
  h (6x Block):
    ln_1 (LayerNorm)
    attn (CausalSelfAttention: multi-head self-attention with causal mask)
    ln_2 (LayerNorm)
    mlp (MLP: expand 4x -> GELU -> project back)
  ln_f (LayerNorm)
  lm_head (Linear): n_embd -> vocab_size  (next-token prediction)
)
```
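To make the diagram concrete, here is a hedged PyTorch sketch of one pre-LayerNorm `Block` and its `CausalSelfAttention` (assuming PyTorch ≥ 2.0 for `F.scaled_dot_product_attention`; dropout, initialization, and other details may differ from `model.py`):

```python
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    """Multi-head self-attention where each position attends only to the past."""
    def __init__(self, n_embd: int, n_head: int):
        super().__init__()
        self.n_head = n_head
        self.c_attn = nn.Linear(n_embd, 3 * n_embd)  # fused q, k, v projection
        self.c_proj = nn.Linear(n_embd, n_embd)      # output projection

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.c_attn(x).split(C, dim=2)
        # Reshape each to (B, n_head, T, head_dim).
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        # is_causal=True applies the lower-triangular mask for us.
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        y = y.transpose(1, 2).contiguous().view(B, T, C)
        return self.c_proj(y)

class Block(nn.Module):
    """Pre-LayerNorm residual block: x + attn(ln(x)), then x + mlp(ln(x))."""
    def __init__(self, n_embd: int, n_head: int):
        super().__init__()
        self.ln_1 = nn.LayerNorm(n_embd)
        self.attn = CausalSelfAttention(n_embd, n_head)
        self.ln_2 = nn.LayerNorm(n_embd)
        self.mlp = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),  # expand 4x
            nn.GELU(),
            nn.Linear(4 * n_embd, n_embd),  # project back
        )

    def forward(self, x):
        x = x + self.attn(self.ln_1(x))  # residual around attention
        x = x + self.mlp(self.ln_2(x))   # residual around MLP
        return x
```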
Key design choices:
- Character-level vocabulary → no tokenizer library needed
- Pre-LayerNorm residuals → standard in modern transformers
- Weight tying → shared weights between input embedding and output projection (see the sketch below)
- Causal (autoregressive) attention → can only attend to past tokens
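Weight tying in particular reduces to a one-line parameter share, since `nn.Embedding(vocab_size, n_embd)` and `nn.Linear(n_embd, vocab_size)` both store a `(vocab_size, n_embd)` weight matrix. A minimal illustration, with sizes taken from the tables in this README:

```python
import torch.nn as nn

vocab_size, n_embd = 65, 384

wte = nn.Embedding(vocab_size, n_embd)               # token embeddings
lm_head = nn.Linear(n_embd, vocab_size, bias=False)  # next-token logits

# Tie the weights: both modules now point at the same parameter tensor,
# saving vocab_size * n_embd parameters.
lm_head.weight = wte.weight
assert lm_head.weight is wte.weight
```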
## How to Run

```bash
# 1. Prepare data
python prepare.py

# 2. Train (GPU recommended for speed; CPU works too)
python train.py

# 3. The model will print generated Shakespeare-style text at the end!
```
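If you want to sample from a saved checkpoint yourself, the core loop is a standard autoregressive decoder. A hedged sketch, assuming the model's forward pass returns logits of shape `(B, T, vocab_size)` (the actual generation code in `train.py` may differ):

```python
import torch

@torch.no_grad()
def generate(model, idx, max_new_tokens, block_size=256, temperature=1.0):
    """Sample tokens one at a time, feeding each new token back as context."""
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -block_size:]          # crop to the context window
        logits = model(idx_cond)                 # assumed shape (B, T, vocab_size)
        logits = logits[:, -1, :] / temperature  # keep only the last position
        probs = torch.softmax(logits, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)
        idx = torch.cat([idx, next_id], dim=1)
    return idx
```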
## Training Details
| Hyperparameter | Value |
|---|---|
| Layers | 6 |
| Heads | 6 |
| Embedding dim | 384 |
| Context length | 256 |
| Batch size | 64 |
| Training steps | 5,000 |
| Optimizer | AdamW (β₁=0.9, β₂=0.95) |
| Learning rate | 1e-3 (cosine decay to 1e-4) |
| Warmup | 200 steps |
| Gradient clip | 1.0 |
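The warmup-plus-cosine schedule in the table can be expressed as a small function of the training step. A sketch using the table's values (the exact implementation in `train.py` may differ):

```python
import math

max_lr, min_lr = 1e-3, 1e-4
warmup_steps, max_steps = 200, 5_000

def get_lr(step: int) -> float:
    # Linear warmup from 0 to max_lr over the first 200 steps.
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    # Cosine decay from max_lr down to min_lr over the remaining steps.
    progress = min((step - warmup_steps) / (max_steps - warmup_steps), 1.0)
    return min_lr + 0.5 * (1.0 + math.cos(math.pi * progress)) * (max_lr - min_lr)
```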
## Acknowledgments
Based on Andrej Karpathy's legendary build-nanogpt and nanoGPT repositories.