| # nanoGPT Tutorial |
|
|
| A step-by-step implementation of a tiny GPT model from scratch in pure PyTorch. |
|
|
| ## What is this? |
|
|
| This repository contains a complete, tutorial-style implementation of a small GPT (Generative Pre-trained Transformer) trained on tiny Shakespeare. Every line of code is commented to explain **what** it does and **why**. |
|
|
| ## Files |
|
|
| | File | Purpose | |
| |------|---------| |
| | `model.py` | The full GPT architecture: CausalSelfAttention, MLP, Block, GPT | |
| | `prepare.py` | Data preparation: character-level tokenization, train/val split | |
| | `train.py` | Training loop with AdamW, cosine LR schedule, and generation | |
| | `input.txt` | The tiny Shakespeare dataset (~1.1M characters, 65 unique chars) | |
| | `data.pt` | Preprocessed tensors (generated by `prepare.py`) | |
| | `best.pt` | Best model checkpoint (generated by `train.py`) | |
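
For intuition, character-level tokenization boils down to two lookup tables built from the corpus. Here is a minimal sketch of what `prepare.py` does (variable names and the 90/10 split ratio are illustrative, not necessarily the repo's):

```python
# Character-level tokenization sketch. Names and the split ratio are
# illustrative; see prepare.py for the repo's actual implementation.
import torch

text = open("input.txt", encoding="utf-8").read()

chars = sorted(set(text))                      # the 65 unique characters
stoi = {ch: i for i, ch in enumerate(chars)}   # char -> integer id
itos = {i: ch for ch, i in stoi.items()}       # integer id -> char

def encode(s: str) -> list[int]:
    return [stoi[c] for c in s]

def decode(ids: list[int]) -> str:
    return "".join(itos[i] for i in ids)

assert decode(encode("ROMEO: ")) == "ROMEO: "

# Encode the whole corpus once, then split it for training/validation
data = torch.tensor(encode(text), dtype=torch.long)
n = int(0.9 * len(data))                       # illustrative 90/10 split
train_data, val_data = data[:n], data[n:]
```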
|
|
| ## Model Architecture |
|
|
| ``` |
| GPT( |
| wte (Embedding): vocab_size -> n_embd (token embeddings) |
| wpe (Embedding): block_size -> n_embd (position embeddings) |
| h (6x Block): |
| ln_1 (LayerNorm) |
| attn (CausalSelfAttention: multi-head self-attention with causal mask) |
| ln_2 (LayerNorm) |
| mlp (MLP: expand 4x -> GELU -> project back) |
| ln_f (LayerNorm) |
| lm_head (Linear): n_embd -> vocab_size (next-token prediction) |
| ) |
| ``` |
|
|
**Key design choices:**
- **Character-level vocabulary** — each of the 65 unique characters is a token, so no tokenizer library is needed
- **Pre-LayerNorm residuals** — LayerNorm is applied before each sublayer, standard in modern transformers
- **Weight tying** — the input embedding and the output projection share one weight matrix (see the sketch below)
- **Causal (autoregressive) attention** — each position attends only to itself and earlier tokens
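
Weight tying, for instance, is a one-liner in PyTorch. A sketch using the module names from the diagram above (the repo's attribute paths may differ):

```python
# Weight tying sketch: token embedding and output projection share one
# matrix. Names follow the diagram above, not necessarily model.py.
import torch.nn as nn

vocab_size, n_embd = 65, 384
wte = nn.Embedding(vocab_size, n_embd)               # token embeddings
lm_head = nn.Linear(n_embd, vocab_size, bias=False)  # next-token logits
lm_head.weight = wte.weight   # both modules now update the same tensor
```

Since both weights have shape `(vocab_size, n_embd)`, the assignment is legal, and the model saves an entire `vocab_size × n_embd` matrix of parameters.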
|
|
| ## How to Run |
|
|
| ```bash |
| # 1. Prepare data |
| python prepare.py |
| |
# 2. Train (a GPU makes this fast; CPU works but is slow)
| python train.py |
| |
| # 3. The model will print generated Shakespeare-style text at the end! |
| ``` |
|
|
| ## Training Details |
|
|
| | Hyperparameter | Value | |
| |---------------|-------| |
| | Layers | 6 | |
| | Heads | 6 | |
| | Embedding dim | 384 | |
| | Context length | 256 | |
| | Batch size | 64 | |
| | Training steps | 5,000 | |
| | Optimizer | AdamW (β₁=0.9, β₂=0.95) | |
| | Learning rate | 1e-3 (cosine decay to 1e-4) | |
| | Warmup | 200 steps | |
| | Gradient clip | 1.0 | |
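
The warmup-plus-cosine schedule from the table fits in a few lines. A sketch using the hyperparameters above; the exact formula in `train.py` may differ slightly:

```python
import math

max_lr, min_lr = 1e-3, 1e-4
warmup_steps, max_steps = 200, 5000

def get_lr(step: int) -> float:
    # Linear warmup from 0 to max_lr over the first 200 steps
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    # Cosine decay from max_lr down to min_lr over the remaining steps
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes 1 -> 0
    return min_lr + coeff * (max_lr - min_lr)
```

In the training loop, `get_lr(step)` is evaluated once per step and written into each of the optimizer's `param_groups` before calling `optimizer.step()`.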
|
|
| ## Acknowledgments |
|
|
| Based on Andrej Karpathy's legendary [build-nanogpt](https://github.com/karpathy/build-nanogpt) and [nanoGPT](https://github.com/karpathy/nanoGPT) repositories. |
|
|