Upload README.md with huggingface_hub
---
license: mit
language:
- en
tags:
- pytorch
- transformer
- language-model
- from-scratch
- educational
- shakespeare
- rope
- swiglu
- rmsnorm
- kv-cache
datasets:
- tiny-shakespeare
pipeline_tag: text-generation
---

# tiny-gpt-shakespeare

<p align="center">
  <img src="images/brain_book.png" alt="A glowing neural network brain floating above an open Shakespeare book" width="600">
</p>

<p align="center">
  <strong>I built a tiny LLM from scratch to understand how GPT-4 and LLaMA actually work.</strong>
</p>

<p align="center">
  <em>10M parameters. Trained on Shakespeare. Every line of code written from scratch. Every mistake documented.</em>
</p>

<p align="center">
  <a href="https://github.com/brianmeyer/tinyllm">GitHub</a> |
  <a href="https://github.com/brianmeyer/tinyllm/blob/main/DEVLOG.md">Learning Journal</a>
</p>

---

## What is this?

A ~10M parameter decoder-only transformer: no HuggingFace Transformers library, no pretrained weights, no shortcuts. Built from an empty file to a working Shakespeare generator, then modernized with the same architecture used in LLaMA, Qwen, and Mistral.

This is a learning project. The model itself is tiny and toy-scale. The value is in the code, the [DEVLOG](https://github.com/brianmeyer/tinyllm/blob/main/DEVLOG.md), and the 9 things that went wrong along the way.

## It generates Shakespeare

**Modern model, temp=0.8 (RMSNorm + SwiGLU + RoPE + KV cache):**

```
ROMEO:
A gallant-house! what says the woe?
...
Which hath a sin by him come to the crown,
That he is reports for me; for ever is he.
```

**Vanilla model, temp=0.5:**

```
KING HENRY:
The father of the marriage of my son,
And then we will be no longer to be then,
And but the Lord Hastings of Semiram Stanley.
```

Not perfect. But recognizable Shakespeare, with proper character names, dialogue formatting, and verse rhythm, from a 10M param model trained for ~60 minutes on 1MB of text.
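The `temp=` values in the captions above refer to temperature sampling. A minimal sketch of the idea in plain NumPy (`sample_next` is an illustrative name, not a function from this repo):

```python
import numpy as np

def sample_next(logits, temperature=0.8, rng=None):
    # Lower temperature sharpens the distribution (more conservative text);
    # higher temperature flattens it (more adventurous text).
    if rng is None:
        rng = np.random.default_rng(0)
    z = logits / temperature
    z = z - z.max()                    # stabilize exp()
    p = np.exp(z) / np.exp(z).sum()    # softmax over the vocabulary
    return int(rng.choice(len(p), p=p))

token = sample_next(np.array([2.0, 1.0, 0.5, -1.0]))
```

At temp=0.5 the vanilla sample above stays close to the most likely tokens; at temp=0.8 the modern sample takes slightly more risks.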

## Architecture

```
ModernGPT (10.6M params)
  token_emb: Embedding(65, 384)
  blocks × 6:
    RMSNorm → MultiHeadAttention(6 heads, RoPE, KV cache) → residual
    RMSNorm → SwiGLU(384 → 1024 → 384) → residual
  RMSNorm → lm_head (tied with token_emb)
```

Four upgrades over vanilla GPT-2, each tested in isolation:

| Upgrade | What changed | Impact |
|---------|--------------|--------|
| **RMSNorm** | Drop mean subtraction from LayerNorm | Free efficiency win |
| **SwiGLU** | Smooth gating replaces hard ReLU cutoff | **-0.11** val loss at step 500 |
| **RoPE** | Rotate Q/K vectors instead of adding position embeddings | **-0.31** val loss at step 500 |
| **KV Cache** | Cache keys/values during generation | Faster inference |
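The RMSNorm row is the easiest upgrade to see in code. A toy NumPy sketch of the difference, without the learned gain/bias terms the real layers carry:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # LayerNorm: center by the mean, then scale by the standard deviation
    mu = x.mean(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(x.var(axis=-1, keepdims=True) + eps)

def rms_norm(x, eps=1e-5):
    # RMSNorm: skip the centering step, scale by the root mean square only
    return x / np.sqrt((x * x).mean(axis=-1, keepdims=True) + eps)

x = np.array([1.0, 2.0, 3.0, 4.0])
```

Dropping the mean subtraction removes one reduction per call with no quality loss, which is why it counts as a free efficiency win.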

## Results

| Model | Params | Best Val Loss | Time |
|-------|--------|---------------|------|
| Vanilla | 10.8M | 1.4804 | 57 min |
| **Modern** | **10.6M** | **1.4754** | **64 min** |

Modern beats vanilla with fewer params. RoPE was the star: the biggest single improvement.
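To make the RoPE result concrete, here is a toy NumPy sketch of the rotation (illustrative only; the real implementation applies this per attention head to Q and K):

```python
import numpy as np

def rope(x, pos, base=10000.0):
    # Rotate consecutive feature pairs (x[0],x[1]), (x[2],x[3]), ... by
    # angles pos * theta_i. Dot products of rotated Q and K then depend
    # only on the *relative* distance between their positions.
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)   # one frequency per pair
    ang = pos * theta
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * np.cos(ang) - x2 * np.sin(ang)
    out[..., 1::2] = x1 * np.sin(ang) + x2 * np.cos(ang)
    return out

q = np.array([1.0, 0.0, 1.0, 0.0])
```

Rotations preserve vector norms, and `rope(q, m) · rope(k, n)` depends only on `m - n`, which is the relative-position property that replaces learned position embeddings.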

## 9 things that went wrong

Building this was not smooth. Every failure is documented in the [DEVLOG](https://github.com/brianmeyer/tinyllm/blob/main/DEVLOG.md):

1. MPS training died silently (memory leak in the PyTorch MPS backend)
2. Bundled all 4 architecture swaps together instead of testing one at a time
3. Python stdout buffering hid training progress
4. A RoPE position bug in the KV cache made the model generate garbage
5. The modern model memorized Shakespeare (10M params overfit 1MB of data)
6. Float16 training diverged on MPS with the 50K BPE vocab
7. MPS kept killing every retrain attempt
8. Lost all Colab checkpoints when the runtime disconnected
9. Ran out of free Colab GPU quota in one session
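Failure 3 has a one-flag fix worth calling out: run training unbuffered so progress lines survive pipes and crashes (shown with a stand-in print; the repo's training scripts are run the same way with `python -u`):

```shell
# Without -u, Python block-buffers stdout when it is piped to a log file,
# so progress lines only appear when the process exits (or never, if it crashes).
python3 -u -c 'print("step 100 | val loss 1.48")'
# PYTHONUNBUFFERED=1 in the environment is equivalent to the -u flag.
```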

## Training details

| Setting | Value |
|---------|-------|
| Dataset | Tiny Shakespeare (~1.1MB, 65 unique characters) |
| Optimizer | AdamW, lr=3e-4 |
| Batch size | 64, block size 256 |
| Steps | 5,000 (best checkpoint via early stopping) |
| Hardware | Google Colab T4 (and an M4 Mac that kept crashing) |
| Dropout | 0.3 (increased from 0.2 to fight overfitting) |
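The "best checkpoint via early stopping" row boils down to a small loop. An illustrative sketch (`best_eval`, the patience value, and the loss sequence are assumptions, not the repo's exact code; the real loop also saves a checkpoint whenever the best loss improves):

```python
def best_eval(val_losses, patience=3):
    # Track the lowest validation loss seen so far; stop once `patience`
    # consecutive evals fail to improve on it, and report the best index.
    best, best_i, bad = float("inf"), -1, 0
    for i, loss in enumerate(val_losses):
        if loss < best:
            best, best_i, bad = loss, i, 0
        else:
            bad += 1
            if bad >= patience:
                break
    return best_i

best_i = best_eval([2.01, 1.72, 1.55, 1.4754, 1.481, 1.486, 1.492])
```

This is what keeps a 10M-param model from memorizing a 1MB dataset: training halts at the checkpoint where validation loss bottomed out, not at step 5,000.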

## How to use

```python
import torch
import sys
sys.path.append('src')
from model_modern import ModernGPT
from tokenizer import encode, decode

device = "cuda" if torch.cuda.is_available() else "cpu"
ckpt = torch.load("model.pt", map_location=device, weights_only=False)
model = ModernGPT(**ckpt["config"]).to(device)
model.load_state_dict(ckpt["model_state"])
model.eval()

idx = torch.tensor([encode("ROMEO:")], dtype=torch.long, device=device)
out = model.generate(idx, max_new_tokens=200, temperature=0.8)
print(decode(out[0].tolist()))
```

## What I learned

1. **RoPE is the most impactful modern architecture change**: beautiful math, fewer params, better results
2. **More powerful models overfit faster on small data**: early stopping is essential
3. **When loss is good but output is garbage, the bug is in the inference code**, not the model
4. **MPS is not ready for serious training**: use CUDA
5. **Always save checkpoints to persistent storage**: Colab runtimes are ephemeral
6. **Change one thing at a time and measure**: this is how real ML research works

## References

- [build-nanogpt](https://github.com/karpathy/build-nanogpt) – Karpathy's step-by-step GPT build
- [RoPE paper](https://arxiv.org/abs/2104.09864) – Su et al.
- [SwiGLU paper](https://arxiv.org/abs/2002.05202) – Shazeer
- [RMSNorm paper](https://arxiv.org/abs/1910.07467) – Zhang & Sennrich

## License