bmeyer2025 committed · Commit 7cb75d6 · verified · 1 Parent(s): e6f1e6b

Upload README.md with huggingface_hub

Files changed (1): README.md (+105 -114)

README.md CHANGED
@@ -1,51 +1,53 @@
- # tinyllm
  <p align="center">
- <img src="images/robot_shakespeare.png" alt="A tiny robot reading Shakespeare by candlelight, with transformer layers glowing inside its transparent head" width="700">
  </p>

  <p align="center">
- <strong>I built a tiny LLM from scratch to understand how GPT-4, Claude, and LLaMA actually work.</strong>
  </p>

  <p align="center">
- <em>10M parameters. Trained on Shakespeare. Modernized with the same architecture as LLaMA and Qwen. Every line of code written from scratch.</em>
  </p>

  <p align="center">
- <a href="DEVLOG.md">Learning Journal</a> |
- <a href="https://huggingface.co/bmeyer2025/tiny-gpt-shakespeare">HuggingFace</a> |
- <a href="MODEL_CARD.md">Model Card</a>
  </p>

  ---

- ## The idea
-
- GPT-4, Claude, and LLaMA are all scaled-up versions of the same architecture. I wanted to understand it from the ground up — not by reading papers, but by building it myself.
-
- So I built a 10M parameter transformer, trained it on Shakespeare, then upgraded it piece by piece with the same components used in production LLMs. Every mistake, crash, and debugging session is documented in the [DEVLOG](DEVLOG.md).

- ## What makes it "modern"?

- I started with a vanilla GPT-2-style transformer, then swapped in four upgrades — one at a time, measuring each:

- | Component | GPT-2 era | Modern (LLaMA/Qwen) | Impact |
- |-----------|-----------|---------------------|--------|
- | Normalization | LayerNorm | **RMSNorm** | Free efficiency win |
- | FFN | ReLU | **SwiGLU** | **-0.11** val loss |
- | Position | Learned embeddings | **RoPE** | **-0.31** val loss |
- | Inference | Recompute all | **KV Cache** | Faster generation |
-
- **RoPE was the star** — biggest improvement, fewer parameters, and the position encoding math is genuinely beautiful.
-
- ## Results
-
- | Model | Best Val Loss | Training Time |
- |-------|---------------|---------------|
- | Vanilla (10.8M params) | 1.4804 | 57 min |
- | **Modern (10.6M params)** | **1.4754** | **64 min** |

  ```
  ROMEO:
  A gallant-house! what says the woe?
@@ -59,116 +61,105 @@ Which hath a sin by him come to the crown,
  That he is reports for me; for ever is he.
  ```

- *A 10M parameter model generating Shakespeare dialogue after 67 minutes of training.*

- ## Project structure

  ```
- tinyllm/
- ├── src/                  # Core model code (built from scratch)
- │   ├── tokenizer.py      # Character-level tokenizer + data loading
- │   ├── attention.py      # Single-head causal self-attention
- │   ├── transformer.py    # Multi-head attention, FFN, transformer Block
- │   ├── model.py          # Full vanilla GPT (10.8M params)
- │   ├── modernize.py      # Modern components: RMSNorm, SwiGLU, RoPE, KV cache
- │   ├── model_modern.py   # Modernized GPT (10.6M params)
- │   └── generate.py       # Text generation with sampling
- │
- ├── experiments/          # Per-swap A/B comparisons
- │   ├── swap1_rmsnorm.py  # LayerNorm → RMSNorm (2000 steps)
- │   ├── swap2_swiglu.py   # ReLU → SwiGLU (2000 steps)
- │   ├── swap3_rope.py     # Learned pos → RoPE (2000 steps)
- │   └── swap4_kvcache.py  # KV cache speed benchmark
- │
- ├── training/             # Training scripts
- │   ├── train.py          # Vanilla GPT (5000 steps)
- │   ├── train_modern.py   # Modern GPT with early stopping
- │   ├── train_bpe.py      # BPE + gradient accumulation
- │   └── benchmark.py      # Samples, latency, throughput comparison
- │
- ├── colab/                # Google Colab
- │   └── train_colab.py    # All-in-one: vanilla + modern + BPE + benchmarks
- │
- ├── data/input.txt        # Tiny Shakespeare (~1.1MB)
- ├── images/               # Generated graphics
- ├── DEVLOG.md             # Full learning journal (the real value)
- ├── MODEL_CARD.md         # HuggingFace model card
- └── publish.py            # Upload to HuggingFace
  ```

- ## Quick start

- **On Google Colab (recommended):**
- ```python
- !git clone https://github.com/brianmeyer/tinyllm.git
- %cd tinyllm
- !pip install tiktoken
- !python -u colab/train_colab.py
- ```

- **Locally (M4 Mac / any GPU):**
- ```bash
- git clone https://github.com/brianmeyer/tinyllm.git
- cd tinyllm
- python3 -m venv .venv && source .venv/bin/activate
- pip install -r requirements.txt
- python -u training/train.py          # vanilla, ~60 min
- python -u training/train_modern.py   # modern, ~67 min
- python src/generate.py --demo        # see the output
- ```

- ## The learning journey

- **Phase 1 — Build from scratch:** Tokenizer, attention mechanism, multi-head attention, feed-forward network, transformer block, full GPT model. Every component explained in the [DEVLOG](DEVLOG.md).

- **Phase 2 — Modernize one swap at a time:** Replace LayerNorm, ReLU, learned positions, and naive inference with RMSNorm, SwiGLU, RoPE, and KV cache. Each swap tested in isolation so you can see exactly what it does.

- **Phase 3 — Scale up:** BPE tokenization (50K vocab), mixed precision, gradient accumulation. Learned why BPE needs way more data than 1MB of Shakespeare.

- **Phase 4 — Break everything:** MPS memory leaks, silent process kills, float16 divergence, RoPE position bugs, Colab runtime evictions, lost checkpoints. Each failure documented with root cause and fix.

- ## What I learned

- 1. **RoPE is the most impactful modern change** — 0.31 better loss, fewer params, beautiful math
- 2. **More powerful models overfit faster on small data** — early stopping is essential
- 3. **MPS (Apple Silicon) silently kills training** after 60-80 min due to memory leaks
- 4. **When loss is good but output is garbage, the bug is in inference** — our RoPE position bug only appeared during KV cache generation
- 5. **Change one thing at a time** — the per-swap comparison approach is how real ML research works
- 6. **Always save checkpoints to persistent storage** — we lost 3 hours of Colab training to a runtime disconnect

- ## Architecture

- ```
- ModernGPT (10.6M params)
-   token_emb: Embedding(65, 384)
-   blocks × 6:
-     RMSNorm → MultiHeadAttention(6 heads, RoPE, KV cache) → residual
-     RMSNorm → SwiGLU(384 → 1024 → 384) → residual
-   RMSNorm → lm_head (tied with token_emb)
  ```

- ## 9 things that went wrong
-
- | # | What happened | Root cause |
- |---|---------------|------------|
- | 1 | MPS training died silently | Memory leak in PyTorch MPS backend |
- | 2 | Bundled all 4 swaps together | Rushing — should test one at a time |
- | 3 | Python output hidden during training | stdout buffering — use `python -u` |
- | 4 | Modern model generated garbage | RoPE position bug in KV cache inference |
- | 5 | Modern model memorized Shakespeare | 10M params too powerful for 1MB data |
- | 6 | BPE training diverged | float16 on MPS overflows with 50K vocab |
- | 7 | MPS kept killing all retrains | Memory leak unfixable on 16GB |
- | 8 | Lost all Colab checkpoints | Runtime disconnected — ephemeral storage |
- | 9 | Colab GPU quota exhausted | Used all free T4 hours in one session |

- Full analysis of each: [DEVLOG.md](DEVLOG.md)
  ## References

  - [build-nanogpt](https://github.com/karpathy/build-nanogpt) — Karpathy's step-by-step GPT build
- - [nanochat](https://github.com/karpathy/nanochat) — nanoGPT successor
  - [RoPE paper](https://arxiv.org/abs/2104.09864) — Su et al.
  - [SwiGLU paper](https://arxiv.org/abs/2002.05202) — Shazeer

  ## License
+ ---
+ license: mit
+ language:
+ - en
+ tags:
+ - pytorch
+ - transformer
+ - language-model
+ - from-scratch
+ - educational
+ - shakespeare
+ - rope
+ - swiglu
+ - rmsnorm
+ - kv-cache
+ datasets:
+ - tiny-shakespeare
+ pipeline_tag: text-generation
+ ---
+
+ # tiny-gpt-shakespeare

  <p align="center">
+ <img src="images/brain_book.png" alt="A glowing neural network brain floating above an open Shakespeare book" width="600">
  </p>

  <p align="center">
+ <strong>I built a tiny LLM from scratch to understand how GPT-4 and LLaMA actually work.</strong>
  </p>

  <p align="center">
+ <em>10M parameters. Trained on Shakespeare. Every line of code written from scratch. Every mistake documented.</em>
  </p>

  <p align="center">
+ <a href="https://github.com/brianmeyer/tinyllm">GitHub</a> |
+ <a href="https://github.com/brianmeyer/tinyllm/blob/main/DEVLOG.md">Learning Journal</a>
  </p>

  ---

+ ## What is this?

+ A ~10M parameter decoder-only transformer — no HuggingFace Transformers library, no pretrained weights, no shortcuts. Built from an empty file to a working Shakespeare generator, then modernized with the same architecture used in LLaMA, Qwen, and Mistral.

+ This is a learning project. The model itself is tiny and toy-scale. The value is in the code, the [DEVLOG](https://github.com/brianmeyer/tinyllm/blob/main/DEVLOG.md), and the 9 things that went wrong along the way.

+ ## It generates Shakespeare

+ **Modern model, temp=0.8 (RMSNorm + SwiGLU + RoPE + KV cache):**
  ```
  ROMEO:
  A gallant-house! what says the woe?

  That he is reports for me; for ever is he.
  ```

+ **Vanilla model, temp=0.5:**
+ ```
+ KING HENRY:
+ The father of the marriage of my son,
+ And then we will be no longer to be then,
+ And but the Lord Hastings of Semiram Stanley.
+ ```
+
+ Not perfect. But recognizable Shakespeare — proper character names, dialogue formatting, verse rhythm — from a 10M param model trained for ~60 minutes on 1MB of text.
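
The temp settings above control how sampling trades safety for variety. A minimal sketch of temperature scaling (plain softmax with illustrative names, not the repo's sampling code):

```python
import math

def temperature_probs(logits, temperature=1.0):
    """Softmax over logits / temperature. A low temperature (0.5)
    concentrates probability on the top logit; a higher one (0.8)
    keeps more of the tail alive for varied output."""
    scaled = [v / temperature for v in logits]
    m = max(scaled)                        # subtract max for numerical stability
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]

sharp = temperature_probs([2.0, 1.0, 0.0], temperature=0.5)
soft = temperature_probs([2.0, 1.0, 0.0], temperature=0.8)
```

The vanilla sample above used the lower setting, which favors safer, more conservative continuations.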

+ ## Architecture

  ```
+ ModernGPT (10.6M params)
+   token_emb: Embedding(65, 384)
+   blocks × 6:
+     RMSNorm → MultiHeadAttention(6 heads, RoPE, KV cache) → residual
+     RMSNorm → SwiGLU(384 → 1024 → 384) → residual
+   RMSNorm → lm_head (tied with token_emb)
  ```
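
The RMSNorm that wraps each sublayer fits in a few lines. A framework-free sketch (illustrative, not the actual code in `src/modernize.py`):

```python
import math

def rms_norm(x, gain, eps=1e-6):
    """RMSNorm: rescale activations by their root-mean-square.
    Unlike LayerNorm there is no mean subtraction and no bias,
    which is what makes it the cheaper drop-in replacement.
    `gain` is the learned per-feature scale."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for g, v in zip(gain, x)]

normed = rms_norm([3.0, 4.0], gain=[1.0, 1.0])
```

After normalization the activations have unit root-mean-square, regardless of their original scale.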

+ Four upgrades over vanilla GPT-2, each tested in isolation:

+ | Upgrade | What changed | Impact |
+ |---------|--------------|--------|
+ | **RMSNorm** | Drop mean subtraction from LayerNorm | Free efficiency win |
+ | **SwiGLU** | Smooth gating replaces hard ReLU cutoff | **-0.11** val loss at step 500 |
+ | **RoPE** | Rotate Q/K vectors instead of adding position embeddings | **-0.31** val loss at step 500 |
+ | **KV Cache** | Cache keys/values during generation | Faster inference |
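
The RoPE row is worth pausing on. The whole idea fits in a short sketch: rotate each (even, odd) feature pair of a query or key by an angle proportional to its position, so the attention dot product depends only on relative position (illustrative code, not the repo's implementation):

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate consecutive (even, odd) feature pairs by pos * base^(-i/d).
    Low-index pairs spin fast, high-index pairs slowly, like the hands
    of a clock, so each position gets a distinct set of rotations."""
    d, out = len(vec), []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q, k = [0.3, -1.2, 0.8, 0.5], [1.0, 0.4, -0.6, 0.2]
```

Because the rotations compose, `dot(rope(q, 7), rope(k, 4))` equals `dot(rope(q, 3), rope(k, 0))`: only the offset of 3 matters, and no position embedding parameters are needed at all.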
 
+ ## Results

+ | Model | Params | Best Val Loss | Time |
+ |-------|--------|---------------|------|
+ | Vanilla | 10.8M | 1.4804 | 57 min |
+ | **Modern** | **10.6M** | **1.4754** | **64 min** |

+ Modern beats vanilla with fewer params. RoPE was the star — biggest single improvement.

+ ## 9 things that went wrong

+ Building this was not smooth. Every failure is documented in the [DEVLOG](https://github.com/brianmeyer/tinyllm/blob/main/DEVLOG.md):

+ 1. MPS training died silently (memory leak)
+ 2. Bundled all 4 architecture swaps together instead of testing one at a time
+ 3. Python stdout buffering hid training progress
+ 4. RoPE position bug in KV cache made the model generate garbage
+ 5. Modern model memorized Shakespeare (overfitting on 1MB)
+ 6. Float16 diverged on MPS with 50K BPE vocab
+ 7. MPS kept killing every retrain attempt
+ 8. Lost all Colab checkpoints when runtime disconnected
+ 9. Ran out of free Colab GPU quota
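
Bug 4 is easier to understand with the cache's contract in view. A toy decode step (assumed shapes and names, not the repo's code): each step appends one key/value pair and attends the new query over everything cached, so the position fed to RoPE must keep advancing with the cache length rather than restarting at zero.

```python
import math

def attend(q, keys, values):
    """Softmax(q . k) attention of a single query over keys/values."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    z = sum(w)
    return [sum(wi * v[j] for wi, v in zip(w, values)) / z
            for j in range(len(values[0]))]

def decode_step(cache, k_new, v_new, q_new):
    """One cached generation step: store this token's K/V once, then
    attend over the whole cache. Without the cache, every previous
    token's K/V would be recomputed at every step."""
    cache["k"].append(k_new)
    cache["v"].append(v_new)
    return attend(q_new, cache["k"], cache["v"])

cache = {"k": [], "v": []}
out1 = decode_step(cache, [1.0, 0.0], [0.5, 0.5], [1.0, 1.0])
out2 = decode_step(cache, [0.0, 1.0], [2.0, 0.0], [1.0, 1.0])
```

The cached result must match a full recompute over the same sequence; when it doesn't (as in bug 4), the model trains fine but generates garbage.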

+ ## Training details

+ | Setting | Value |
+ |---------|-------|
+ | Dataset | Tiny Shakespeare (~1.1MB, 65 unique characters) |
+ | Optimizer | AdamW, lr=3e-4 |
+ | Batch size | 64, block size 256 |
+ | Steps | 5,000 (best checkpoint via early stopping) |
+ | Hardware | Google Colab T4 (and an M4 Mac that kept crashing) |
+ | Dropout | 0.3 (increased from 0.2 to fight overfitting) |
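
The 65-symbol vocab comes from a character-level tokenizer: one id per unique character in the corpus. A minimal sketch (illustrative names, not necessarily the API in `src/tokenizer.py`):

```python
def build_char_tokenizer(text):
    """Map each unique character to an integer id, sorted so the
    mapping is deterministic across runs."""
    chars = sorted(set(text))
    stoi = {ch: i for i, ch in enumerate(chars)}
    itos = {i: ch for i, ch in enumerate(chars)}
    encode = lambda s: [stoi[c] for c in s]
    decode = lambda ids: "".join(itos[i] for i in ids)
    return encode, decode, len(chars)

encode, decode, vocab_size = build_char_tokenizer("ROMEO: O Romeo!")
```

On the full Tiny Shakespeare file this yields 65 ids: a tiny vocabulary but long token sequences, which is part of why the later BPE experiment needed far more data.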

+ ## How to use

+ ```python
+ import torch
+ import sys
+ sys.path.append('src')
+ from model_modern import ModernGPT
+ from tokenizer import encode, decode
+
+ device = "cuda" if torch.cuda.is_available() else "cpu"
+ ckpt = torch.load("model.pt", map_location=device, weights_only=False)
+ model = ModernGPT(**ckpt["config"]).to(device)
+ model.load_state_dict(ckpt["model_state"])
+ model.eval()
+
+ idx = torch.tensor([encode("ROMEO:")], dtype=torch.long, device=device)
+ out = model.generate(idx, max_new_tokens=200, temperature=0.8)
+ print(decode(out[0].tolist()))
  ```

+ ## What I learned

+ 1. **RoPE is the most impactful modern architecture change** — beautiful math, fewer params, better results
+ 2. **More powerful models overfit faster on small data** — early stopping is essential
+ 3. **When loss is good but output is garbage, the bug is in inference code** — not the model
+ 4. **MPS is not ready for serious training** — use CUDA
+ 5. **Always save checkpoints to persistent storage** — Colab runtimes are ephemeral
+ 6. **Change one thing at a time and measure** — this is how real ML research works

  ## References

  - [build-nanogpt](https://github.com/karpathy/build-nanogpt) — Karpathy's step-by-step GPT build
  - [RoPE paper](https://arxiv.org/abs/2104.09864) — Su et al.
  - [SwiGLU paper](https://arxiv.org/abs/2002.05202) — Shazeer
+ - [RMSNorm paper](https://arxiv.org/abs/1910.07467) — Zhang & Sennrich

  ## License