# tiny-gpt-shakespeare

A 10M parameter decoder-only transformer trained on the Tiny Shakespeare dataset. Built from scratch in PyTorch as an educational project — no pretrained weights or external libraries used for the model itself.

## Model Description

- **Architecture:** Decoder-only transformer with modern components (RMSNorm, SwiGLU, RoPE, KV cache)
- **Parameters:** 10.6M (modern) / 10.8M (vanilla)
- **Training data:** [Tiny Shakespeare](https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt) (~1.1MB, 65 unique characters)
- **Tokenization:** Character-level (65 tokens)
- **Context length:** 256 tokens
- **License:** MIT
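Character-level tokenization as described above fits in a few lines. A minimal sketch (the corpus snippet and helper names here are illustrative, not the repo's actual API):

```python
# Build a character-level vocabulary: each unique character in the
# corpus gets one integer id, so the vocab size equals the number
# of distinct characters (65 for Tiny Shakespeare).
text = "First Citizen: Before we proceed any further, hear me speak."  # stand-in corpus

chars = sorted(set(text))                     # unique characters, stable order
stoi = {ch: i for i, ch in enumerate(chars)}  # char -> id
itos = {i: ch for ch, i in stoi.items()}      # id -> char

def encode(s: str) -> list[int]:
    """One token per character."""
    return [stoi[c] for c in s]

def decode(ids: list[int]) -> str:
    return "".join(itos[i] for i in ids)

assert decode(encode("hear me")) == "hear me"  # lossless round trip
```

This is why the vocab is so small (65) and why sequences are long relative to BPE: every character costs a token.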

## Architecture Details

| Component | Implementation |
|-----------|---------------|
| Layers | 6 transformer blocks |
| Attention | 6 heads, 64 dims each, with RoPE |
| FFN | SwiGLU (384 → 1024 → 384) |
| Normalization | RMSNorm (pre-norm) |
| Position encoding | Rotary Position Embeddings (RoPE) |
| Inference | KV cache for autoregressive generation |
| Weight tying | lm_head shares weights with token embedding |
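Two of the table's components, RMSNorm and the SwiGLU FFN, can be sketched with NumPy. This is an illustrative sketch following the shapes above (384 → 1024 → 384), not the repo's code:

```python
import numpy as np

def rms_norm(x, g, eps=1e-6):
    # RMSNorm: rescale by the root mean square. Unlike LayerNorm there is
    # no mean subtraction and no bias term, which is slightly cheaper.
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * g

def swiglu(x, w_gate, w_up, w_down):
    # SwiGLU FFN: a SiLU-gated branch multiplies a linear "up" branch,
    # then a "down" projection returns to model width.
    gate = x @ w_gate
    silu = gate / (1.0 + np.exp(-gate))   # SiLU(z) = z * sigmoid(z)
    return (silu * (x @ w_up)) @ w_down

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 384))
y = rms_norm(x, g=np.ones(384))
out = swiglu(y,
             rng.standard_normal((384, 1024)) * 0.02,   # gate projection
             rng.standard_normal((384, 1024)) * 0.02,   # up projection
             rng.standard_normal((1024, 384)) * 0.02)   # down projection
assert out.shape == (2, 384)
```

In the real block each of these is wrapped in a residual connection, with RMSNorm applied pre-norm as listed in the table.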

## Training

| Parameter | Value |
|-----------|-------|
| Optimizer | AdamW |
| Learning rate | 3e-4 |
| Batch size | 64 |
| Block size | 256 |
| Dropout | 0.3 |
| Training steps | 5,000 (best checkpoint at step 2,500) |
| Hardware | Google Colab T4 GPU |
| Training time | ~64 minutes |
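What distinguishes AdamW from plain Adam is that weight decay is applied directly to the weights rather than folded into the gradient. A single-parameter NumPy illustration using this model's lr=3e-4 (a sketch, not the training script):

```python
import numpy as np

def adamw_step(p, grad, m, v, t, lr=3e-4, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    # Adam first- and second-moment estimates with bias correction...
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # ...then decoupled weight decay: shrink the weights directly,
    # instead of adding an L2 term to the gradient as classic Adam+L2 does.
    p = p - lr * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * p)
    return p, m, v

p, m, v = np.float64(1.0), 0.0, 0.0
p, m, v = adamw_step(p, grad=np.float64(0.5), m=m, v=v, t=1)
assert p < 1.0   # positive gradient and decay both pull the weight down
```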

### Training Results

| Model | Parameters | Best Val Loss | Best Step |
|-------|-----------|-------------|-----------|
| Vanilla (LayerNorm + ReLU + learned pos) | 10.8M | 1.4804 | 3,000 |
| Modern (RMSNorm + SwiGLU + RoPE + KV cache) | 10.6M | 1.4754 | 2,500 |

Early stopping was used — the model checkpointed at the lowest validation loss.
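The early-stopping scheme above amounts to checkpointing only when validation loss improves, then keeping the best checkpoint. A minimal sketch (function names are illustrative):

```python
def train_with_early_stopping(eval_steps, evaluate, save_checkpoint):
    """Keep the checkpoint with the lowest validation loss seen so far.

    evaluate(step) returns the validation loss at that step;
    save_checkpoint(step) persists the current weights.
    """
    best_loss, best_step = float("inf"), None
    for step in eval_steps:
        val_loss = evaluate(step)
        if val_loss < best_loss:          # improvement -> checkpoint
            best_loss, best_step = val_loss, step
            save_checkpoint(step)
    return best_step, best_loss

# Toy loss curve that bottoms out at step 2,500, then rises as the
# model starts overfitting the 1.1MB corpus:
losses = {500: 1.9, 1000: 1.7, 1500: 1.6, 2000: 1.52,
          2500: 1.4754, 3000: 1.49, 3500: 1.53}
saved = []
step, loss = train_with_early_stopping(sorted(losses), losses.get, saved.append)
assert step == 2500 and saved[-1] == 2500
```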

### Component Comparison

Each modern component was tested in isolation against the vanilla baseline (2,000 training steps each):

| Component | Val Loss at Step 500 | vs Vanilla |
|-----------|---------------------|-----------|
| Vanilla (baseline) | 1.99 | — |
| RMSNorm | 1.99 | No change |
| SwiGLU | 1.88 | -0.11 |
| RoPE | 1.68 | -0.31 |
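RoPE, the biggest single win in the table, rotates paired dimensions of each query/key head vector by position-dependent angles instead of adding a learned position embedding; attention scores then depend only on relative offsets. A minimal NumPy sketch (the pairing convention varies across codebases; this is illustrative, not the repo's implementation):

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate the (x[i], x[i+half]) pairs of a head vector by pos-dependent angles."""
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)   # one frequency per pair
    theta = pos * freqs
    x1, x2 = x[:half], x[half:]
    return np.concatenate([x1 * np.cos(theta) - x2 * np.sin(theta),
                           x1 * np.sin(theta) + x2 * np.cos(theta)])

rng = np.random.default_rng(0)
q, k = rng.standard_normal(64), rng.standard_normal(64)  # one 64-dim head

# Rotations preserve norms, so attention magnitudes are unchanged...
assert np.isclose(np.linalg.norm(rope(q, 7)), np.linalg.norm(q))
# ...and q.k depends only on the relative offset (here 2 in both cases):
assert np.isclose(rope(q, 3) @ rope(k, 5), rope(q, 10) @ rope(k, 12))
```

That relative-offset property is also why a position bookkeeping bug in a KV cache produces garbage: cached keys rotated with the wrong positions break every subsequent attention score.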

## Intended Use

This is an **educational model**. It is not intended for production use. It generates Shakespeare-style text and serves as a reference implementation for understanding transformer architectures.

## Sample Outputs

**Prompt: "ROMEO:", temperature=0.8:**
```
ROMEO:
Marry, good day with me!

ROMEO:
And not your lady command her at arms.

THOMAS MOWBRAY:
My dear lord, but go on.

MERCUTIO:
Hence will not speak against a marriage.
```

**Prompt: "KING HENRY:", temperature=0.5:**
```
KING HENRY:
The father of the marriage of my son,
And then we will be no longer to be then,
And but the Lord Hastings of Semiram Stanley.
```

## Limitations

- **Tiny dataset:** Trained on only 1.1MB of text. The model overfits after ~2,500 steps.
- **Character-level tokenization:** Inefficient compared to BPE. Each character is a separate token.
- **No instruction tuning:** This is a base model — it completes text, it does not follow instructions or answer questions.
- **Small context window:** 256 tokens maximum.
- **Quality:** Output is recognizably Shakespeare-like but contains grammatical errors and occasionally mixes characters from different plays.

## How to Use

```python
import torch

# ... (model construction and checkpoint loading elided) ...

out = model.generate(idx, max_new_tokens=200, temperature=0.8)
print(decode(out[0].tolist()))
```
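The `temperature=0.8` argument above scales the logits before the softmax: lower values sharpen the distribution toward the most likely next character, higher values flatten it. A standalone NumPy illustration (not the repo's generate method):

```python
import numpy as np

def sample_probs(logits, temperature=1.0):
    """Softmax over temperature-scaled logits."""
    z = np.asarray(logits, dtype=float) / temperature
    z -= z.max()                 # stabilize the exponentials
    p = np.exp(z)
    return p / p.sum()

logits = [2.0, 1.0, 0.1]
p_hot = sample_probs(logits, temperature=1.5)   # flatter -> more diverse text
p_cold = sample_probs(logits, temperature=0.5)  # sharper -> more repetitive text
assert p_cold[0] > p_hot[0]      # cooling boosts the top candidate
```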

## Source Code

Full implementation with detailed documentation: [github.com/brianmeyer/tinyllm](https://github.com/brianmeyer/tinyllm)

## References

- Karpathy, A. [build-nanogpt](https://github.com/karpathy/build-nanogpt) — primary reference
- Su et al. (2021). [RoFormer: Enhanced Transformer with Rotary Position Embedding](https://arxiv.org/abs/2104.09864)
- Shazeer, N. (2020). [GLU Variants Improve Transformer](https://arxiv.org/abs/2002.05202)
- Zhang & Sennrich (2019). [Root Mean Square Layer Normalization](https://arxiv.org/abs/1910.07467)