# tiny-gpt-shakespeare

A 10M parameter decoder-only transformer trained on the Tiny Shakespeare dataset. Built from scratch in PyTorch as an educational project — no pretrained weights or external libraries used for the model itself.

## Model Description

- **Architecture:** Decoder-only transformer with modern components (RMSNorm, SwiGLU, RoPE, KV cache)
- **Parameters:** 10.6M (modern) / 10.8M (vanilla)
- **Training data:** [Tiny Shakespeare](https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt) (~1.1MB, 65 unique characters)
- **Tokenization:** Character-level (65 tokens)
- **Context length:** 256 tokens
- **License:** MIT
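Character-level tokenization as described above fits in a few lines. A minimal sketch (the corpus snippet and helper names here are illustrative, not the repo's actual API):

```python
# Build a character-level vocabulary: each unique character in the
# corpus gets one integer id, so the vocab size equals the number
# of distinct characters (65 for Tiny Shakespeare).
text = "First Citizen: Before we proceed any further, hear me speak."  # stand-in corpus

chars = sorted(set(text))                     # unique characters, stable order
stoi = {ch: i for i, ch in enumerate(chars)}  # char -> id
itos = {i: ch for ch, i in stoi.items()}      # id -> char

def encode(s: str) -> list[int]:
    """One token per character."""
    return [stoi[c] for c in s]

def decode(ids: list[int]) -> str:
    return "".join(itos[i] for i in ids)

assert decode(encode("hear me")) == "hear me"  # lossless round trip
```

This is why the vocab is so small (65) and why sequences are long relative to BPE: every character costs a token.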

## Architecture Details

| Component | Implementation |
|-----------|---------------|
| Layers | 6 transformer blocks |
| Attention | 6 heads, 64 dims each, with RoPE |
| FFN | SwiGLU (384 → 1024 → 384) |
| Normalization | RMSNorm (pre-norm) |
| Position encoding | Rotary Position Embeddings (RoPE) |
| Inference | KV cache for autoregressive generation |
| Weight tying | lm_head shares weights with token embedding |
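Two of the table's components, RMSNorm and the SwiGLU FFN, can be sketched with NumPy. This is an illustrative sketch following the shapes above (384 → 1024 → 384), not the repo's code:

```python
import numpy as np

def rms_norm(x, g, eps=1e-6):
    # RMSNorm: rescale by the root mean square. Unlike LayerNorm there is
    # no mean subtraction and no bias term, which is slightly cheaper.
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * g

def swiglu(x, w_gate, w_up, w_down):
    # SwiGLU FFN: a SiLU-gated branch multiplies a linear "up" branch,
    # then a "down" projection returns to model width.
    gate = x @ w_gate
    silu = gate / (1.0 + np.exp(-gate))   # SiLU(z) = z * sigmoid(z)
    return (silu * (x @ w_up)) @ w_down

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 384))
y = rms_norm(x, g=np.ones(384))
out = swiglu(y,
             rng.standard_normal((384, 1024)) * 0.02,   # gate projection
             rng.standard_normal((384, 1024)) * 0.02,   # up projection
             rng.standard_normal((1024, 384)) * 0.02)   # down projection
assert out.shape == (2, 384)
```

In the real block each of these is wrapped in a residual connection, with RMSNorm applied pre-norm as listed in the table.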

## Training

| Parameter | Value |
|-----------|-------|
| Optimizer | AdamW |
| Learning rate | 3e-4 |
| Batch size | 64 |
| Block size | 256 |
| Dropout | 0.3 |
| Training steps | 5,000 (best checkpoint at step 2,500) |
| Hardware | Google Colab T4 GPU |
| Training time | ~64 minutes |
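What distinguishes AdamW from plain Adam is that weight decay is applied directly to the weights rather than folded into the gradient. A single-parameter NumPy illustration using this model's lr=3e-4 (a sketch, not the training script):

```python
import numpy as np

def adamw_step(p, grad, m, v, t, lr=3e-4, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    # Adam first- and second-moment estimates with bias correction...
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # ...then decoupled weight decay: shrink the weights directly,
    # instead of adding an L2 term to the gradient as classic Adam+L2 does.
    p = p - lr * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * p)
    return p, m, v

p, m, v = np.float64(1.0), 0.0, 0.0
p, m, v = adamw_step(p, grad=np.float64(0.5), m=m, v=v, t=1)
assert p < 1.0   # positive gradient and decay both pull the weight down
```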

### Training Results

| Model | Parameters | Best Val Loss | Best Step |
|-------|-----------|-------------|-----------|
| Vanilla (LayerNorm + ReLU + learned pos) | 10.8M | 1.4804 | 3,000 |
| Modern (RMSNorm + SwiGLU + RoPE + KV cache) | 10.6M | 1.4754 | 2,500 |

Early stopping was used — the model checkpointed at the lowest validation loss.
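The early-stopping scheme above amounts to checkpointing only when validation loss improves, then keeping the best checkpoint. A minimal sketch (function names are illustrative):

```python
def train_with_early_stopping(eval_steps, evaluate, save_checkpoint):
    """Keep the checkpoint with the lowest validation loss seen so far.

    evaluate(step) returns the validation loss at that step;
    save_checkpoint(step) persists the current weights.
    """
    best_loss, best_step = float("inf"), None
    for step in eval_steps:
        val_loss = evaluate(step)
        if val_loss < best_loss:          # improvement -> checkpoint
            best_loss, best_step = val_loss, step
            save_checkpoint(step)
    return best_step, best_loss

# Toy loss curve that bottoms out at step 2,500, then rises as the
# model starts overfitting the 1.1MB corpus:
losses = {500: 1.9, 1000: 1.7, 1500: 1.6, 2000: 1.52,
          2500: 1.4754, 3000: 1.49, 3500: 1.53}
saved = []
step, loss = train_with_early_stopping(sorted(losses), losses.get, saved.append)
assert step == 2500 and saved[-1] == 2500
```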

### Component Comparison

Each modern component was tested in isolation against the vanilla baseline (2,000 training steps each):

| Component | Val Loss at Step 500 | vs Vanilla |
|-----------|---------------------|-----------|
| Vanilla (baseline) | 1.99 | — |
| RMSNorm | 1.99 | No change |
| SwiGLU | 1.88 | -0.11 |
| RoPE | 1.68 | -0.31 |
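RoPE, the biggest single win in the table, rotates paired dimensions of each query/key head vector by position-dependent angles instead of adding a learned position embedding; attention scores then depend only on relative offsets. A minimal NumPy sketch (the pairing convention varies across codebases; this is illustrative, not the repo's implementation):

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate the (x[i], x[i+half]) pairs of a head vector by pos-dependent angles."""
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)   # one frequency per pair
    theta = pos * freqs
    x1, x2 = x[:half], x[half:]
    return np.concatenate([x1 * np.cos(theta) - x2 * np.sin(theta),
                           x1 * np.sin(theta) + x2 * np.cos(theta)])

rng = np.random.default_rng(0)
q, k = rng.standard_normal(64), rng.standard_normal(64)  # one 64-dim head

# Rotations preserve norms, so attention magnitudes are unchanged...
assert np.isclose(np.linalg.norm(rope(q, 7)), np.linalg.norm(q))
# ...and q.k depends only on the relative offset (here 2 in both cases):
assert np.isclose(rope(q, 3) @ rope(k, 5), rope(q, 10) @ rope(k, 12))
```

That relative-offset property is also why a position bookkeeping bug in a KV cache produces garbage: cached keys rotated with the wrong positions break every subsequent attention score.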

## Intended Use

This is an **educational model**. It is not intended for production use. It generates Shakespeare-style text and serves as a reference implementation for understanding transformer architectures.

## Sample Outputs

**Prompt: "ROMEO:", temperature=0.8:**
```
ROMEO:
Marry, good day with me!

ROMEO:
And not your lady command her at arms.

THOMAS MOWBRAY:
My dear lord, but go on.

MERCUTIO:
Hence will not speak against a marriage.
```

**Prompt: "KING HENRY:", temperature=0.5:**
```
KING HENRY:
The father of the marriage of my son,
And then we will be no longer to be then,
And but the Lord Hastings of Semiram Stanley.
```

## Limitations

- **Tiny dataset:** Trained on only 1.1MB of text. The model overfits after ~2,500 steps.
- **Character-level tokenization:** Inefficient compared to BPE. Each character is a separate token.
- **No instruction tuning:** This is a base model — it completes text, it does not follow instructions or answer questions.
- **Small context window:** 256 tokens maximum.
- **Quality:** Output is recognizably Shakespeare-like but contains grammatical errors and occasionally mixes characters from different plays.

## How to Use

```python
import torch

# ... (model construction and checkpoint loading elided) ...

out = model.generate(idx, max_new_tokens=200, temperature=0.8)
print(decode(out[0].tolist()))
```
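The `temperature=0.8` argument above scales the logits before the softmax: lower values sharpen the distribution toward the most likely next character, higher values flatten it. A standalone NumPy illustration (not the repo's generate method):

```python
import numpy as np

def sample_probs(logits, temperature=1.0):
    """Softmax over temperature-scaled logits."""
    z = np.asarray(logits, dtype=float) / temperature
    z -= z.max()                 # stabilize the exponentials
    p = np.exp(z)
    return p / p.sum()

logits = [2.0, 1.0, 0.1]
p_hot = sample_probs(logits, temperature=1.5)   # flatter -> more diverse text
p_cold = sample_probs(logits, temperature=0.5)  # sharper -> more repetitive text
assert p_cold[0] > p_hot[0]      # cooling boosts the top candidate
```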

## Source Code

Full implementation with detailed documentation: [github.com/brianmeyer/tinyllm](https://github.com/brianmeyer/tinyllm)

## References

- Karpathy, A. [build-nanogpt](https://github.com/karpathy/build-nanogpt) — primary reference
- Su et al. (2021). [RoFormer: Enhanced Transformer with Rotary Position Embedding](https://arxiv.org/abs/2104.09864)
- Shazeer, N. (2020). [GLU Variants Improve Transformer](https://arxiv.org/abs/2002.05202)
- Zhang & Sennrich (2019). [Root Mean Square Layer Normalization](https://arxiv.org/abs/1910.07467)