# nanoGPT: Step-by-Step Tutorial

This tutorial walks through building and training a **tiny GPT from scratch** in pure PyTorch. No `transformers` library, no pre-trained weights, just ~200 lines of clean code.

---

## Table of Contents

1. [Overview](#1-overview)
2. [Dataset Preparation](#2-dataset-preparation)
3. [Model Architecture](#3-model-architecture)
4. [Training Loop](#4-training-loop)
5. [Generation / Inference](#5-generation--inference)
6. [Results](#6-results)
7. [Files in this Repo](#7-files-in-this-repo)

---

## 1. Overview

We train a **character-level** language model on [tiny Shakespeare](https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt) (~1.1M characters, 65 unique characters).

The model learns to predict the next character given all previous characters, autoregressively. This is exactly how GPT-2, GPT-3, and ChatGPT work, just at character scale instead of word/BPE token scale.

**Model size**: ~10.8M parameters  
**Architecture**: 6 layers, 6 heads, 384 embedding dim, 256 context length

---

## 2. Dataset Preparation (`prepare.py`)

### What happens:
1. **Download** tiny Shakespeare text
2. **Discover vocabulary**: find all unique characters → 65 chars
3. **Build mappings**:
   - `stoi` (string-to-int): `'a' → 0`, `'b' → 1`, ...
   - `itos` (int-to-string): reverse lookup
4. **Encode** the entire text as integers
5. **Split** 90% train / 10% validation
6. **Save** as `data.pt` (PyTorch tensors for fast loading)

### Key concept: Character-level tokenization
```python
chars = sorted(list(set(text)))          # vocabulary
vocab_size = len(chars)                  # 65
encode = lambda s: [stoi[c] for c in s]  # "hello" -> [46, 43, 50, 50, 53]
decode = lambda l: "".join([itos[i] for i in l])
```

No tokenizer library needed! For this corpus, the 65 characters cover every symbol that appears.
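A minimal sketch of how `prepare.py` might tie these steps together (file and dictionary key names are assumptions based on the description above; the real script may differ in details):

```python
import os
import urllib.request
import torch

URL = "https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt"

# 1. download the raw text once
if not os.path.exists("input.txt"):
    urllib.request.urlretrieve(URL, "input.txt")
text = open("input.txt", encoding="utf-8").read()

# 2-3. vocabulary and mappings
chars = sorted(list(set(text)))
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for ch, i in stoi.items()}

# 4-6. encode everything, split 90/10, save tensors + vocab
data = torch.tensor([stoi[c] for c in text], dtype=torch.long)
n = int(0.9 * len(data))
torch.save({"train": data[:n], "val": data[n:], "stoi": stoi, "itos": itos}, "data.pt")
```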

---

## 3. Model Architecture (`model.py`)

### 3.1 Configuration (`GPTConfig`)
```python
@dataclass
class GPTConfig:
    block_size: int = 256    # max sequence length
    vocab_size: int = 65     # number of unique characters
    n_layer: int = 6         # transformer blocks
    n_head: int = 6          # attention heads per block
    n_embd: int = 384        # embedding dimension
```

### 3.2 Causal Self-Attention
The core idea: every token can "look at" all previous tokens to decide what comes next.

```
For each token:
  Query = "What am I looking for?"
  Key   = "What do I contain?"
  Value = "What information do I have?"

Attention score = Query · Key  (scaled)
Causal mask     = prevent looking at future tokens
Output          = weighted sum of Values
```

We use **multi-head attention**: the 384-dim embedding is split into 6 heads of 64 dims each, attention runs in all heads in parallel, and the results are concatenated back together.

**Code flow:**
```
Input (B, T, C)
  → c_attn → (Q, K, V) each (B, T, C)
  → reshape to (B, n_head, T, head_size)
  → Q @ K.T → attention scores (B, n_head, T, T)
  → causal mask → softmax → weighted sum of V
  → reshape back → c_proj → Output (B, T, C)
```
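A condensed PyTorch sketch of that flow (the module names `c_attn`/`c_proj` follow the GPT-2 convention used above; treat this as illustrative rather than a copy of `model.py`):

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        assert config.n_embd % config.n_head == 0
        self.n_head = config.n_head
        self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd)  # Q, K, V in one matmul
        self.c_proj = nn.Linear(config.n_embd, config.n_embd)
        mask = torch.tril(torch.ones(config.block_size, config.block_size))
        self.register_buffer("mask", mask.view(1, 1, config.block_size, config.block_size))

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.c_attn(x).split(C, dim=2)
        # reshape to (B, n_head, T, head_size)
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        att = (q @ k.transpose(-2, -1)) / math.sqrt(k.size(-1))             # scaled scores (B, nh, T, T)
        att = att.masked_fill(self.mask[:, :, :T, :T] == 0, float("-inf"))  # causal mask
        att = F.softmax(att, dim=-1)
        y = att @ v                                                         # weighted sum of values
        y = y.transpose(1, 2).contiguous().view(B, T, C)                    # merge heads back to (B, T, C)
        return self.c_proj(y)
```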

### 3.3 MLP (Feed-Forward)
After attention, each token gets a private "thinking step":
```
(B, T, C) → Linear(4*C) → GELU → Linear(C) → (B, T, C)
```
The 4× expansion is standard in transformers.
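As a sketch, the whole sublayer is just two linear layers around a GELU (reusing `nn` from the attention sketch above):

```python
class MLP(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(config.n_embd, 4 * config.n_embd),  # expand 4x
            nn.GELU(),
            nn.Linear(4 * config.n_embd, config.n_embd),  # project back
        )

    def forward(self, x):
        return self.net(x)
```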

### 3.4 Transformer Block
```
x = x + Attention(LayerNorm(x))   # pre-norm residual
x = x + MLP(LayerNorm(x))         # pre-norm residual
```
**Pre-LayerNorm** (normalize before sublayer) is used by GPT-2/3/Llama.
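Combining the two sublayers (building on the `CausalSelfAttention` and `MLP` sketches above), a pre-norm block might look like:

```python
class Block(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.ln_1 = nn.LayerNorm(config.n_embd)
        self.attn = CausalSelfAttention(config)
        self.ln_2 = nn.LayerNorm(config.n_embd)
        self.mlp = MLP(config)

    def forward(self, x):
        x = x + self.attn(self.ln_1(x))  # pre-norm residual around attention
        x = x + self.mlp(self.ln_2(x))   # pre-norm residual around MLP
        return x
```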

### 3.5 Full GPT Model
```
1. Token Embedding  (wte): char index → vector
2. Position Embedding (wpe): position index → vector
3. Sum them: x = wte + wpe
4. Pass through N transformer blocks
5. Final LayerNorm
6. Language Model Head: project to vocab_size logits
7. Cross-entropy loss against next-character targets
```

**Weight tying**: `wte` (input embedding) shares its weight matrix with `lm_head` (output projection). This saves parameters and typically helps training.
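A sketch of the forward pass described above (the names `wte`, `wpe`, `lm_head` follow the list; the real `model.py` may organize this differently):

```python
class GPT(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.wte = nn.Embedding(config.vocab_size, config.n_embd)   # token embedding
        self.wpe = nn.Embedding(config.block_size, config.n_embd)   # position embedding
        self.blocks = nn.ModuleList([Block(config) for _ in range(config.n_layer)])
        self.ln_f = nn.LayerNorm(config.n_embd)                     # final LayerNorm
        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)
        self.lm_head.weight = self.wte.weight                       # weight tying

    def forward(self, idx, targets=None):
        B, T = idx.shape
        pos = torch.arange(T, device=idx.device)
        x = self.wte(idx) + self.wpe(pos)        # sum token + position embeddings
        for block in self.blocks:
            x = block(x)
        logits = self.lm_head(self.ln_f(x))      # (B, T, vocab_size)
        loss = None
        if targets is not None:
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
        return logits, loss
```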

---

## 4. Training Loop (`train.py` / `train_standalone.py`)

### 4.1 Batch sampling
For each training step, grab random contiguous chunks:
```python
def get_batch(split):
    data = train_data if split == "train" else val_data
    ix = torch.randint(len(data) - BLOCK_SIZE, (BATCH_SIZE,))
    x = torch.stack([data[i : i + BLOCK_SIZE] for i in ix])          # input
    y = torch.stack([data[i + 1 : i + BLOCK_SIZE + 1] for i in ix])  # target (shifted by 1)
    return x, y
```

### 4.2 Learning rate schedule
**Cosine with linear warmup**:
```
Step 0-200:    LR ramps up from 0 → 1e-3   (linear warmup)
Step 200-5000: LR decays cosine to 1e-4    (cosine annealing)
```
Warmup prevents early loss spikes when gradients are large.
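A sketch of such a schedule (step counts and learning rates mirror the numbers above; the actual `train.py` may structure it differently):

```python
import math

MAX_LR, MIN_LR = 1e-3, 1e-4
WARMUP_STEPS, MAX_STEPS = 200, 5000

def get_lr(step):
    if step < WARMUP_STEPS:                        # linear warmup
        return MAX_LR * (step + 1) / WARMUP_STEPS
    if step >= MAX_STEPS:                          # past the schedule: hold at the floor
        return MIN_LR
    progress = (step - WARMUP_STEPS) / (MAX_STEPS - WARMUP_STEPS)
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes 1 -> 0 over the decay phase
    return MIN_LR + coeff * (MAX_LR - MIN_LR)
```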

### 4.3 Optimizer
**AdamW** with separated weight decay (sketched below):
- 2D parameters (weights) → weight_decay = 0.1
- 1D parameters (biases, LayerNorm) → weight_decay = 0.0
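One way to set up those parameter groups (a sketch assuming a `model` built from `GPTConfig`; grouping by `p.dim()` is the usual nanoGPT-style trick):

```python
decay_params   = [p for p in model.parameters() if p.requires_grad and p.dim() >= 2]
nodecay_params = [p for p in model.parameters() if p.requires_grad and p.dim() < 2]
optimizer = torch.optim.AdamW(
    [
        {"params": decay_params, "weight_decay": 0.1},    # weight matrices
        {"params": nodecay_params, "weight_decay": 0.0},  # biases, LayerNorm params
    ],
    lr=1e-3,
)
```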

### 4.4 Gradient clipping
`torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)` prevents exploding gradients.

### 4.5 Evaluation
Every 500 steps, we evaluate on 200 random validation batches and report:
```
step  500 | train loss 1.8234 | val loss 1.9012 | lr 9.12e-04 | time 45.2s
```
The best validation checkpoint is saved as `best.pt`.
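One common way to implement this evaluation (a sketch reusing `torch` and the `get_batch` helper from section 4.1; the 200-batch count follows the text, and running the train split the same way gives the `train loss` column in the log line):

```python
@torch.no_grad()
def estimate_loss(model, eval_iters=200):
    model.eval()
    out = {}
    for split in ("train", "val"):
        losses = torch.zeros(eval_iters)
        for i in range(eval_iters):
            x, y = get_batch(split)      # random batches from that split
            _, loss = model(x, y)
            losses[i] = loss.item()
        out[split] = losses.mean().item()
    model.train()
    return out
```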

---

## 5. Generation / Inference (`generate.py`)

Autoregressive generation:
```
1. Encode a prompt (e.g., "\nROMEO:")
2. Run forward pass → get logits for last token
3. Apply temperature + top-k sampling → probability distribution
4. Sample next token from distribution
5. Append token to sequence
6. Repeat from step 2
```

**Temperature**: lower = more conservative/deterministic, higher = more random/creative  
**Top-k**: only sample from the k most likely tokens (prevents gibberish)
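A sketch of this sampling loop with temperature and top-k (illustrative only; `generate.py` may expose more options, and `model` is assumed to return `(logits, loss)` as in the sketches above):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def generate(model, idx, max_new_tokens, temperature=0.8, top_k=50, block_size=256):
    # idx: (B, T) tensor of token indices encoding the prompt
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -block_size:]                  # crop to the context window
        logits, _ = model(idx_cond)
        logits = logits[:, -1, :] / temperature          # logits for the last position
        if top_k is not None:
            v, _ = torch.topk(logits, top_k)
            logits[logits < v[:, [-1]]] = float("-inf")  # keep only the k most likely tokens
        probs = F.softmax(logits, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)
        idx = torch.cat([idx, next_token], dim=1)        # append and repeat
    return idx
```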

---

## 6. Results

Expected after 5000 steps on a T4 GPU (~30-60 minutes):

| Metric | Value |
|--------|-------|
| Initial loss | ~4.3 (random guessing among 65 chars) |
| Final train loss | ~1.2–1.5 |
| Final val loss | ~1.3–1.6 |
| Parameters | 10.77 M |

**Generated sample** (should look vaguely Shakespeare-like):
```
ROMEO:
What say you, then? I have heard you say
The hour is come, and I must hence depart.
```

---

## 7. Files in this Repo

| File | Purpose |
|------|---------|
| `model.py` | Pure PyTorch GPT architecture (standalone) |
| `prepare.py` | Downloads data, builds char-level vocab, saves `data.pt` |
| `train.py` | Training script (imports from `model.py`) |
| `train_standalone.py` | Self-contained training script (model + train in one file) |
| `generate.py` | Inference script: load a checkpoint and generate text |
| `input.txt` | Raw tiny Shakespeare text |
| `data.pt` | Preprocessed train/val tensors + vocab mappings |
| `best.pt` | Best model checkpoint (saved during training) |

---

## How to Run

```bash
# 1. Prepare data
python prepare.py

# 2. Train (GPU recommended)
python train_standalone.py

# 3. Generate
python generate.py --prompt "ROMEO:" --length 500 --temperature 0.8
```

---

## Learning Checklist

- [ ] Read `model.py`: understand attention masking, pre-norm, weight tying
- [ ] Read `prepare.py`: understand character-level tokenization
- [ ] Read `train.py`: understand batching, LR schedule, gradient clipping
- [ ] Run training and watch loss go down
- [ ] Tweak hyperparameters (n_layer, n_embd, learning rate) and observe changes
- [ ] Generate with different temperatures and top-k values

---

Based on Andrej Karpathy's [build-nanogpt](https://github.com/karpathy/build-nanogpt) and [nanoGPT](https://github.com/karpathy/nanoGPT).