bmeyer2025 committed · verified
Commit ec47f0d · 1 parent: 9508da5

Upload README.md with huggingface_hub

Files changed (1): README.md (+175 lines)
# tinyllm

<p align="center">
  <img src="images/robot_shakespeare.png" alt="A tiny robot reading Shakespeare by candlelight, with transformer layers glowing inside its transparent head" width="700">
</p>

<p align="center">
  <strong>I built a tiny LLM from scratch to understand how GPT-4, Claude, and LLaMA actually work.</strong>
</p>

<p align="center">
  <em>10M parameters. Trained on Shakespeare. Modernized with the same architecture as LLaMA and Qwen. Every line of code written from scratch.</em>
</p>

<p align="center">
  <a href="DEVLOG.md">Learning Journal</a> |
  <a href="https://huggingface.co/bmeyer2025/tiny-gpt-shakespeare">HuggingFace</a> |
  <a href="MODEL_CARD.md">Model Card</a>
</p>

---

## The idea

GPT-4, Claude, and LLaMA are all scaled-up versions of the same architecture. I wanted to understand it from the ground up: not by reading papers, but by building it myself.

So I built a 10M parameter transformer, trained it on Shakespeare, then upgraded it piece by piece with the same components used in production LLMs. Every mistake, crash, and debugging session is documented in the [DEVLOG](DEVLOG.md).

## What makes it "modern"?

I started with a vanilla GPT-2-style transformer, then swapped in four upgrades, one at a time, measuring each:

| Component | GPT-2 era | Modern (LLaMA/Qwen) | Impact |
|-----------|-----------|---------------------|--------|
| Normalization | LayerNorm | **RMSNorm** | Free efficiency win |
| FFN | ReLU | **SwiGLU** | **-0.11** val loss |
| Position | Learned embeddings | **RoPE** | **-0.31** val loss |
| Inference | Recompute all | **KV Cache** | Faster generation |
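To make the first two swaps concrete, here is a minimal PyTorch sketch of RMSNorm and a SwiGLU feed-forward layer. This is an illustration, not the actual code in `src/modernize.py`; the class names and signatures are my own assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """LayerNorm minus the mean subtraction and bias: scale each vector
    by its root-mean-square, then apply a learned per-channel gain."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * rms * self.weight

class SwiGLU(nn.Module):
    """LLaMA-style gated FFN: silu(w1(x)) elementwise-gates w3(x),
    then w2 projects back down to the model dimension."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden, bias=False)  # gate branch
        self.w3 = nn.Linear(dim, hidden, bias=False)  # value branch
        self.w2 = nn.Linear(hidden, dim, bias=False)  # down projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(F.silu(self.w1(x)) * self.w3(x))
```

RMSNorm dropping the mean-centering and bias is where the "free efficiency win" comes from: fewer ops, fewer parameters, same stabilizing effect in practice.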

**RoPE was the star**: biggest improvement, fewer parameters, and the position encoding math is genuinely beautiful.
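The rotation at the heart of RoPE fits in a few lines. This is my own sketch, not the repo's `swap3_rope.py`, and it uses the half-split pairing convention (as in LLaMA); other implementations pair adjacent channels instead.

```python
import torch

def rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply rotary position embeddings to x of shape (seq, dim).

    Each channel pair (x1, x2) is treated as 2-D coordinates and rotated
    by an angle proportional to the token's position, with a different
    frequency per pair."""
    seq, dim = x.shape
    half = dim // 2
    # One frequency per channel pair; positions 0 .. seq-1.
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)
    angles = torch.arange(seq, dtype=torch.float32)[:, None] * freqs[None, :]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, :half], x[:, half:]
    # Standard 2-D rotation applied to every pair.
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
```

Because rotations preserve dot products up to the angle *difference*, the attention score between a rotated query and key depends only on their relative distance, which is why RoPE can use fewer parameters than learned absolute embeddings.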

## Results

| Model | Best Val Loss | Training Time |
|-------|---------------|---------------|
| Vanilla (10.8M params) | 1.4804 | 57 min |
| **Modern (10.6M params)** | **1.4754** | **64 min** |

```
ROMEO:
A gallant-house! what says the woe?

MERCUTIO:
Good madam, my lord.

ROMEO:
Villain, for I do not say it is true,
Which hath a sin by him come to the crown,
That he is reports for me; for ever is he.
```

*A 10M parameter model generating Shakespeare dialogue after 67 minutes of training.*

## Project structure

```
tinyllm/
├── src/                    # Core model code (built from scratch)
│   ├── tokenizer.py        # Character-level tokenizer + data loading
│   ├── attention.py        # Single-head causal self-attention
│   ├── transformer.py      # Multi-head attention, FFN, transformer Block
│   ├── model.py            # Full vanilla GPT (10.8M params)
│   ├── modernize.py        # Modern components: RMSNorm, SwiGLU, RoPE, KV cache
│   ├── model_modern.py     # Modernized GPT (10.6M params)
│   └── generate.py         # Text generation with sampling
│
├── experiments/            # Per-swap A/B comparisons
│   ├── swap1_rmsnorm.py    # LayerNorm → RMSNorm (2000 steps)
│   ├── swap2_swiglu.py     # ReLU → SwiGLU (2000 steps)
│   ├── swap3_rope.py       # Learned pos → RoPE (2000 steps)
│   └── swap4_kvcache.py    # KV cache speed benchmark
│
├── training/               # Training scripts
│   ├── train.py            # Vanilla GPT (5000 steps)
│   ├── train_modern.py     # Modern GPT with early stopping
│   ├── train_bpe.py        # BPE + gradient accumulation
│   └── benchmark.py        # Samples, latency, throughput comparison
│
├── colab/                  # Google Colab
│   └── train_colab.py      # All-in-one: vanilla + modern + BPE + benchmarks
│
├── data/input.txt          # Tiny Shakespeare (~1.1MB)
├── images/                 # Generated graphics
├── DEVLOG.md               # Full learning journal (the real value)
├── MODEL_CARD.md           # HuggingFace model card
└── publish.py              # Upload to HuggingFace
```

## Quick start

**On Google Colab (recommended):**
```python
!git clone https://github.com/brianmeyer/tinyllm.git
%cd tinyllm
!pip install tiktoken
!python -u colab/train_colab.py
```

**Locally (M4 Mac / any GPU):**
```bash
git clone https://github.com/brianmeyer/tinyllm.git
cd tinyllm
python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
python -u training/train.py          # vanilla, ~60 min
python -u training/train_modern.py   # modern, ~67 min
python src/generate.py --demo        # see the output
```

## The learning journey

**Phase 1: Build from scratch.** Tokenizer, attention mechanism, multi-head attention, feed-forward network, transformer block, full GPT model. Every component is explained in the [DEVLOG](DEVLOG.md).

**Phase 2: Modernize one swap at a time.** Replace LayerNorm, ReLU, learned positions, and naive inference with RMSNorm, SwiGLU, RoPE, and a KV cache. Each swap is tested in isolation so you can see exactly what it does.

**Phase 3: Scale up.** BPE tokenization (50K vocab), mixed precision, gradient accumulation. I learned why BPE needs far more data than 1MB of Shakespeare.

**Phase 4: Break everything.** MPS memory leaks, silent process kills, float16 divergence, RoPE position bugs, Colab runtime evictions, lost checkpoints. Each failure is documented with its root cause and fix.
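The gradient accumulation from Phase 3 is worth a sketch: run several micro-batch forward/backward passes, scale each loss by the accumulation count so the gradients average rather than sum, and take a single optimizer step. This is a toy stand-in (a linear model in place of the GPT), not `training/train_bpe.py`:

```python
import torch
import torch.nn as nn

# Toy setup: a linear layer stands in for the GPT; all names are illustrative.
model = nn.Linear(16, 4)
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
accum_steps = 4  # effective batch = accum_steps * micro-batch size

opt.zero_grad()
for step in range(accum_steps):
    x = torch.randn(8, 16)             # one micro-batch
    loss = model(x).pow(2).mean()      # stand-in loss
    (loss / accum_steps).backward()    # scale so grads average, not sum
opt.step()                             # one update for all micro-batches
opt.zero_grad()
```

The trick is that `backward()` adds into `.grad` buffers until you zero them, so memory only ever holds one micro-batch of activations.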

## What I learned

1. **RoPE is the most impactful modern change**: 0.31 better val loss, fewer params, beautiful math
2. **More powerful models overfit faster on small data**: early stopping is essential
3. **MPS (Apple Silicon) silently kills training** after 60-80 minutes due to memory leaks
4. **When loss is good but output is garbage, the bug is in inference**: our RoPE position bug only appeared during KV-cache generation
5. **Change one thing at a time**: the per-swap comparison approach is how real ML research works
6. **Always save checkpoints to persistent storage**: we lost 3 hours of Colab training to a runtime disconnect

## Architecture

```
ModernGPT (10.6M params)
  token_emb: Embedding(65, 384)
  blocks × 6:
    RMSNorm → MultiHeadAttention(6 heads, RoPE, KV cache) → residual
    RMSNorm → SwiGLU(384 → 1024 → 384) → residual
  RMSNorm → lm_head (tied with token_emb)
```
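The KV cache in the attention layers can be illustrated with a single-head decoding step: each new token appends one row of keys and values, and attention runs over the cache instead of re-projecting the whole prefix. The function name and cache layout below are my own, not the repo's:

```python
import torch
import torch.nn.functional as F

def decode_step(q_t: torch.Tensor, k_t: torch.Tensor, v_t: torch.Tensor,
                cache: dict) -> torch.Tensor:
    """One causal-attention decoding step with a KV cache.

    q_t, k_t, v_t are the (1, dim) projections of the newest token only;
    all past keys/values live in the cache, so nothing is recomputed."""
    cache["k"] = torch.cat([cache["k"], k_t], dim=0)           # (t, dim)
    cache["v"] = torch.cat([cache["v"], v_t], dim=0)
    scores = q_t @ cache["k"].T / cache["k"].shape[-1] ** 0.5  # (1, t)
    return F.softmax(scores, dim=-1) @ cache["v"]              # (1, dim)
```

Causality comes for free at decode time: the cache only ever holds past and current tokens, so no mask is needed.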

## 9 things that went wrong

| # | What happened | Root cause |
|---|---------------|------------|
| 1 | MPS training died silently | Memory leak in the PyTorch MPS backend |
| 2 | Bundled all 4 swaps together | Rushing; should have tested one at a time |
| 3 | Python output hidden during training | stdout buffering; use `python -u` |
| 4 | Modern model generated garbage | RoPE position bug in KV-cache inference |
| 5 | Modern model memorized Shakespeare | 10M params is too powerful for 1MB of data |
| 6 | BPE training diverged | float16 on MPS overflows with a 50K vocab |
| 7 | MPS kept killing all retrains | Memory leak unfixable on 16GB |
| 8 | Lost all Colab checkpoints | Runtime disconnected; storage is ephemeral |
| 9 | Colab GPU quota exhausted | Used all free T4 hours in one session |

Full analysis of each: [DEVLOG.md](DEVLOG.md)

## References

- [build-nanogpt](https://github.com/karpathy/build-nanogpt): Karpathy's step-by-step GPT build
- [nanochat](https://github.com/karpathy/nanochat): the nanoGPT successor
- [RoPE paper](https://arxiv.org/abs/2104.09864): Su et al.
- [SwiGLU paper](https://arxiv.org/abs/2002.05202): Shazeer

## License

MIT