bmeyer2025 committed
Commit abec5c1 · verified · Parent: 7cb75d6

Upload README.md with huggingface_hub

Files changed (1): README.md (+76 -97)
README.md CHANGED
@@ -20,48 +20,84 @@ pipeline_tag: text-generation

 # tiny-gpt-shakespeare

- <p align="center">
- <img src="images/brain_book.png" alt="A glowing neural network brain floating above an open Shakespeare book" width="600">
- </p>

- <p align="center">
- <strong>I built a tiny LLM from scratch to understand how GPT-4 and LLaMA actually work.</strong>
- </p>

- <p align="center">
- <em>10M parameters. Trained on Shakespeare. Every line of code written from scratch. Every mistake documented.</em>
- </p>

- <p align="center">
- <a href="https://github.com/brianmeyer/tinyllm">GitHub</a> |
- <a href="https://github.com/brianmeyer/tinyllm/blob/main/DEVLOG.md">Learning Journal</a>
- </p>

- ---

- ## What is this?

- A ~10M parameter decoder-only transformer: no HuggingFace Transformers library, no pretrained weights, no shortcuts. Built from an empty file to a working Shakespeare generator, then modernized with the same architecture used in LLaMA, Qwen, and Mistral.

- This is a learning project. The model itself is tiny and toy-scale. The value is in the code, the [DEVLOG](https://github.com/brianmeyer/tinyllm/blob/main/DEVLOG.md), and the 9 things that went wrong along the way.

- ## It generates Shakespeare

- **Modern model, temp=0.8 (RMSNorm + SwiGLU + RoPE + KV cache):**
 ```
 ROMEO:
- A gallant-house! what says the woe?
-
- MERCUTIO:
- Good madam, my lord.

 ROMEO:
- Villain, for I do not say it is true,
- Which hath a sin by him come to the crown,
- That he is reports for me; for ever is he.
 ```

- **Vanilla model, temp=0.5:**
 ```
 KING HENRY:
 The father of the marriage of my son,
@@ -69,63 +105,15 @@ And then we will be no longer to be then,
 And but the Lord Hastings of Semiram Stanley.
 ```

- Not perfect. But recognizable Shakespeare — proper character names, dialogue formatting, verse rhythm — from a 10M param model trained for ~60 minutes on 1MB of text.
-
- ## Architecture
-
- ```
- ModernGPT (10.6M params)
- token_emb: Embedding(65, 384)
- blocks × 6:
- RMSNorm → MultiHeadAttention(6 heads, RoPE, KV cache) → residual
- RMSNorm → SwiGLU(384 → 1024 → 384) → residual
- RMSNorm → lm_head (tied with token_emb)
- ```
-
- Four upgrades over vanilla GPT-2, each tested in isolation:
-
- | Upgrade | What changed | Impact |
- |---------|-------------|--------|
- | **RMSNorm** | Drop mean subtraction from LayerNorm | Free efficiency win |
- | **SwiGLU** | Smooth gating replaces hard ReLU cutoff | **-0.11** val loss at step 500 |
- | **RoPE** | Rotate Q/K vectors instead of adding position embeddings | **-0.31** val loss at step 500 |
- | **KV Cache** | Cache keys/values during generation | Faster inference |
-
- ## Results
-
- | Model | Params | Best Val Loss | Time |
- |-------|--------|-------------|------|
- | Vanilla | 10.8M | 1.4804 | 57 min |
- | **Modern** | **10.6M** | **1.4754** | **64 min** |

- Modern beats vanilla with fewer params. RoPE was the biggest single improvement.

- ## 9 things that went wrong
-
- Building this was not smooth. Every failure is documented in the [DEVLOG](https://github.com/brianmeyer/tinyllm/blob/main/DEVLOG.md):
-
- 1. MPS training died silently (memory leak)
- 2. Bundled all 4 architecture swaps together instead of testing one at a time
- 3. Python stdout buffering hid training progress
- 4. RoPE position bug in KV cache made the model generate garbage
- 5. Modern model memorized Shakespeare (overfitting on 1MB)
- 6. Float16 diverged on MPS with 50K BPE vocab
- 7. MPS kept killing every retrain attempt
- 8. Lost all Colab checkpoints when runtime disconnected
- 9. Ran out of free Colab GPU quota
-
- ## Training details
-
- | | |
- |---|---|
- | Dataset | Tiny Shakespeare (~1.1MB, 65 unique characters) |
- | Optimizer | AdamW, lr=3e-4 |
- | Batch size | 64, block size 256 |
- | Steps | 5,000 (best checkpoint via early stopping) |
- | Hardware | Google Colab T4 (and an M4 Mac that kept crashing) |
- | Dropout | 0.3 (increased from 0.2 to fight overfitting) |
-
- ## How to use

 ```python
 import torch
@@ -145,22 +133,13 @@ out = model.generate(idx, max_new_tokens=200, temperature=0.8)
 print(decode(out[0].tolist()))
 ```

- ## What I learned

- 1. **RoPE is the most impactful modern architecture change** — beautiful math, fewer params, better results
- 2. **More powerful models overfit faster on small data** — early stopping is essential
- 3. **When loss is good but output is garbage, the bug is in inference code** — not the model
- 4. **MPS is not ready for serious training** — use CUDA
- 5. **Always save checkpoints to persistent storage** — Colab runtimes are ephemeral
- 6. **Change one thing at a time and measure** — this is how real ML research works

 ## References

- - [build-nanogpt](https://github.com/karpathy/build-nanogpt) — Karpathy's step-by-step GPT build
- - [RoPE paper](https://arxiv.org/abs/2104.09864) — Su et al.
- - [SwiGLU paper](https://arxiv.org/abs/2002.05202) — Shazeer
- - [RMSNorm paper](https://arxiv.org/abs/1910.07467) — Zhang & Sennrich
-
- ## License
-
- MIT

 # tiny-gpt-shakespeare

+ A 10M parameter decoder-only transformer trained on the Tiny Shakespeare dataset. Built from scratch in PyTorch as an educational project — no pretrained weights or external libraries used for the model itself.

+ ## Model Description

+ - **Architecture:** Decoder-only transformer with modern components (RMSNorm, SwiGLU, RoPE, KV cache)
+ - **Parameters:** 10.6M (modern) / 10.8M (vanilla)
+ - **Training data:** [Tiny Shakespeare](https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt) (~1.1MB, 65 unique characters)
+ - **Tokenization:** Character-level (65 tokens)
+ - **Context length:** 256 tokens
+ - **License:** MIT
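
The character-level scheme above is small enough to sketch in full. A minimal, illustrative version — the names `stoi`/`itos` and the sample `text` are this sketch's, not necessarily the repo's:

```python
# Character-level tokenization: every distinct character becomes a token id.
# `text` here is a toy sample; on the full Tiny Shakespeare file the
# vocabulary has 65 entries.
text = "ROMEO: But, soft! what light through yonder window breaks?"
vocab = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(vocab)}  # char -> id
itos = {i: ch for ch, i in stoi.items()}      # id -> char

def encode(s):
    return [stoi[c] for c in s]

def decode(ids):
    return "".join(itos[i] for i in ids)

print(decode(encode("ROMEO")))  # → ROMEO
```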

+ ## Architecture Details

+ | Component | Implementation |
+ |-----------|---------------|
+ | Layers | 6 transformer blocks |
+ | Attention | 6 heads, 64 dims each, with RoPE |
+ | FFN | SwiGLU (384 → 1024 → 384) |
+ | Normalization | RMSNorm (pre-norm) |
+ | Position encoding | Rotary Position Embeddings (RoPE) |
+ | Inference | KV cache for autoregressive generation |
+ | Weight tying | lm_head shares weights with token embedding |
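
The RMSNorm row is the simplest to make concrete: unlike LayerNorm, it skips mean subtraction and rescales by the root mean square alone. A minimal pure-Python sketch (learned per-feature gain weights omitted for brevity):

```python
import math

def rms_norm(x, eps=1e-5):
    """RMSNorm: divide by the root-mean-square of the vector.
    No centering step and no bias, which is what makes it cheaper
    than LayerNorm."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

def layer_norm(x, eps=1e-5):
    """LayerNorm for comparison: subtract the mean, divide by std-dev."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

x = [1.0, 2.0, 3.0, 4.0]
print(rms_norm(x))    # rescaled, but not centered
print(layer_norm(x))  # centered and rescaled
```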
+
+ ## Training
+
+ | Parameter | Value |
+ |-----------|-------|
+ | Optimizer | AdamW |
+ | Learning rate | 3e-4 |
+ | Batch size | 64 |
+ | Block size | 256 |
+ | Dropout | 0.3 |
+ | Training steps | 5,000 (best checkpoint at step 2,500) |
+ | Hardware | Google Colab T4 GPU |
+ | Training time | ~64 minutes |
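
The early-stopping scheme implied by the table (5,000 steps, best checkpoint kept at step 2,500) boils down to snapshotting whenever validation loss hits a new low. A sketch with invented intermediate loss values; the string `best_state` is a stand-in for an actual `torch.save(...)` call:

```python
# Keep the checkpoint with the lowest validation loss seen so far.
# Intermediate loss values below are invented for illustration; only the
# 1.4754 minimum at step 2,500 matches the reported run.
val_history = {500: 1.68, 1000: 1.58, 1500: 1.52, 2000: 1.49,
               2500: 1.4754, 3000: 1.49, 3500: 1.53}

best_loss, best_step, best_state = float("inf"), None, None
for step in sorted(val_history):
    loss = val_history[step]
    if loss < best_loss:                   # new best: snapshot the model
        best_loss, best_step = loss, step
        best_state = f"checkpoint@{step}"  # stand-in for torch.save(...)

print(best_step, best_loss)  # → 2500 1.4754
```

Training continues past the minimum, but only the snapshot from the best step is kept, so later overfitting does not degrade the released weights.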
+
+ ### Training Results
+
+ | Model | Parameters | Best Val Loss | Best Step |
+ |-------|-----------|-------------|-----------|
+ | Vanilla (LayerNorm + ReLU + learned pos) | 10.8M | 1.4804 | 3,000 |
+ | Modern (RMSNorm + SwiGLU + RoPE + KV cache) | 10.6M | 1.4754 | 2,500 |
+
+ Early stopping was used: the checkpoint with the lowest validation loss was kept.
+
+ ### Component Comparison
+
+ Each modern component was tested in isolation against the vanilla baseline (2,000 training steps each):
+
+ | Component | Val Loss at Step 500 | vs Vanilla |
+ |-----------|---------------------|-----------|
+ | Vanilla (baseline) | 1.99 | — |
+ | RMSNorm | 1.99 | No change |
+ | SwiGLU | 1.88 | -0.11 |
+ | RoPE | 1.68 | -0.31 |
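
RoPE's key property, that attention scores depend only on the relative offset between positions, already shows up in two dimensions. An illustrative pure-Python sketch (real implementations rotate many frequency pairs of the Q/K vectors at once):

```python
import math

def rotate(vec, pos, theta=1.0):
    """Rotate a 2-D query/key pair by an angle proportional to its position."""
    x, y = vec
    c, s = math.cos(pos * theta), math.sin(pos * theta)
    return (x * c - y * s, x * s + y * c)

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1]

q, k = (0.3, -1.1), (0.7, 0.4)  # arbitrary example vectors

# The score between positions 2 and 5 equals the score between 12 and 15:
# since R(a)^T R(b) = R(b - a), only the offset (here 3) matters.
s1 = dot(rotate(q, 2), rotate(k, 5))
s2 = dot(rotate(q, 12), rotate(k, 15))
assert abs(s1 - s2) < 1e-9
```

This relative-position behavior is what position embeddings added to the input cannot provide, and it comes with zero learned parameters.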

+ ## Intended Use

+ This is an **educational model**. It is not intended for production use. It generates Shakespeare-style text and serves as a reference implementation for understanding transformer architectures.

+ ## Sample Outputs
+
+ **Prompt: "ROMEO:", temperature=0.8:**
 ```
 ROMEO:
+ Marry, good day with me!

 ROMEO:
+ And not your lady command her at arms.
+
+ THOMAS MOWBRAY:
+ My dear lord, but go on.
+
+ MERCUTIO:
+ Hence will not speak against a marriage.
 ```

+ **Prompt: "KING HENRY:", temperature=0.5:**
 ```
 KING HENRY:
 The father of the marriage of my son,
 And then we will be no longer to be then,
 And but the Lord Hastings of Semiram Stanley.
 ```
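
The temperature values quoted above (0.8 and 0.5) divide the logits before the softmax, so lower temperatures concentrate probability on the likeliest next character. A minimal sketch with invented logits:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Scale logits by 1/temperature, then apply a numerically stable softmax.
    temperature < 1 sharpens the distribution; temperature > 1 flattens it."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                          # subtract max for stability
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]                     # invented next-character logits
p_sharp = softmax_with_temperature(logits, 0.5)
p_soft = softmax_with_temperature(logits, 0.8)
# The top character gets more probability mass at the lower temperature.
assert p_sharp[0] > p_soft[0]
```

In generation, the next token is then drawn from this distribution, which is why the temp=0.5 sample reads more conservative than the temp=0.8 one.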

+ ## Limitations

+ - **Tiny dataset:** Trained on only 1.1MB of text. The model overfits after ~2,500 steps.
+ - **Character-level tokenization:** Inefficient compared to BPE. Each character is a separate token.
+ - **No instruction tuning:** This is a base model — it completes text, it does not follow instructions or answer questions.
+ - **Small context window:** 256 tokens maximum.
+ - **Quality:** Output is recognizably Shakespeare-like but contains grammatical errors and occasionally mixes characters from different plays.

+ ## How to Use

 ```python
 import torch

 print(decode(out[0].tolist()))
 ```

+ ## Source Code

+ Full implementation with detailed documentation: [github.com/brianmeyer/tinyllm](https://github.com/brianmeyer/tinyllm)

 ## References

+ - Karpathy, A. [build-nanogpt](https://github.com/karpathy/build-nanogpt) — primary reference
+ - Su et al. (2021). [RoFormer: Enhanced Transformer with Rotary Position Embedding](https://arxiv.org/abs/2104.09864)
+ - Shazeer, N. (2020). [GLU Variants Improve Transformer](https://arxiv.org/abs/2002.05202)
+ - Zhang & Sennrich (2019). [Root Mean Square Layer Normalization](https://arxiv.org/abs/1910.07467)