---
license: mit
language:
- en
tags:
- pytorch
- transformer
- language-model
- from-scratch
- educational
- shakespeare
- rope
- swiglu
- rmsnorm
- kv-cache
datasets:
- tiny-shakespeare
pipeline_tag: text-generation
---
# tiny-gpt-shakespeare
A 10M parameter decoder-only transformer trained on the Tiny Shakespeare dataset. Built from scratch in PyTorch as an educational project — no pretrained weights or external libraries used for the model itself.
## Model Description
- **Architecture:** Decoder-only transformer with modern components (RMSNorm, SwiGLU, RoPE, KV cache)
- **Parameters:** 10.6M (modern) / 10.8M (vanilla)
- **Training data:** [Tiny Shakespeare](https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt) (~1.1MB, 65 unique characters)
- **Tokenization:** Character-level (65 tokens)
- **Context length:** 256 tokens
- **License:** MIT
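Character-level tokenization means the vocabulary is simply the set of unique characters in the corpus. A minimal sketch of how such a tokenizer works (the repo's `tokenizer.py` likely follows this pattern; the variable names here are illustrative):

```python
# Build a character-level vocabulary from the corpus: each unique character
# becomes one token id. Tiny Shakespeare yields 65 such characters.
text = "First Citizen: Before we proceed any further, hear me speak."
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}   # char -> id
itos = {i: ch for ch, i in stoi.items()}       # id -> char

def encode(s: str) -> list[int]:
    """Map a string to one token id per character."""
    return [stoi[c] for c in s]

def decode(ids: list[int]) -> str:
    """Map token ids back to a string."""
    return "".join(itos[i] for i in ids)

assert decode(encode("hear me")) == "hear me"  # lossless round trip
```

Because every character is its own token, sequences are long relative to BPE-tokenized text, which is why the 256-token context covers only a few lines of dialogue.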
## Architecture Details
| Component | Implementation |
|-----------|---------------|
| Layers | 6 transformer blocks |
| Attention | 6 heads, 64 dims each, with RoPE |
| FFN | SwiGLU (384 → 1024 → 384) |
| Normalization | RMSNorm (pre-norm) |
| Position encoding | Rotary Position Embeddings (RoPE) |
| Inference | KV cache for autoregressive generation |
| Weight tying | lm_head shares weights with token embedding |
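The SwiGLU feed-forward block replaces the vanilla ReLU MLP with a gated unit, as in Shazeer (2020): the hidden activation is `SiLU(x W_gate) * (x W_up)`, then projected back down. A sketch matching the card's 384 → 1024 → 384 shape (layer names are illustrative, not the repo's):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """Gated feed-forward block: SiLU(x @ W_gate) * (x @ W_up) -> W_down.
    Three weight matrices instead of the vanilla MLP's two, which is why
    the hidden width (1024) is smaller than the usual 4x expansion."""

    def __init__(self, d_model: int = 384, d_hidden: int = 1024):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_hidden, bias=False)
        self.w_up = nn.Linear(d_model, d_hidden, bias=False)
        self.w_down = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

x = torch.randn(2, 16, 384)      # (batch, seq, d_model)
y = SwiGLU()(x)                  # shape preserved: (2, 16, 384)
```

The third matrix is the reason the modern model has slightly fewer parameters overall than the vanilla one despite the gating: the hidden dimension is reduced to keep the block's parameter count roughly comparable.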
## Training
| Parameter | Value |
|-----------|-------|
| Optimizer | AdamW |
| Learning rate | 3e-4 |
| Batch size | 64 |
| Block size | 256 |
| Dropout | 0.3 |
| Training steps | 5,000 (best checkpoint at step 2,500) |
| Hardware | Google Colab T4 GPU |
| Training time | ~64 minutes |
### Training Results
| Model | Parameters | Best Val Loss | Best Step |
|-------|-----------|-------------|-----------|
| Vanilla (LayerNorm + ReLU + learned pos) | 10.8M | 1.4804 | 3,000 |
| Modern (RMSNorm + SwiGLU + RoPE + KV cache) | 10.6M | 1.4754 | 2,500 |
Early stopping was used: the saved checkpoint is the one with the lowest validation loss.
### Component Comparison
Each modern component was tested in isolation against the vanilla baseline (2,000 training steps each):
| Component | Val Loss at Step 500 | vs Vanilla |
|-----------|---------------------|-----------|
| Vanilla (baseline) | 1.99 | — |
| RMSNorm | 1.99 | No change |
| SwiGLU | 1.88 | -0.11 |
| RoPE | 1.68 | -0.31 |
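RoPE, the largest single improvement in the table, encodes position by rotating pairs of channels through position-dependent angles, so attention scores depend only on relative position (Su et al., 2021). A minimal single-sequence sketch; the repo's implementation likely applies this per attention head to queries and keys:

```python
import torch

def rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply rotary position embeddings to x of shape (seq, d), d even.
    Channel pairs (0,1), (2,3), ... are rotated by angle pos * freq_i,
    with frequencies geometrically spaced from 1 down to base**(-1)."""
    seq, d = x.shape
    pos = torch.arange(seq, dtype=torch.float32).unsqueeze(1)        # (seq, 1)
    freqs = base ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)
    angles = pos * freqs                                             # (seq, d/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = torch.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin   # 2-D rotation of each pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out
```

Because each step is a pure rotation, the norm of every position's vector is unchanged, and position 0 (angle zero) passes through untouched.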
## Intended Use
This is an **educational model**. It is not intended for production use. It generates Shakespeare-style text and serves as a reference implementation for understanding transformer architectures.
## Sample Outputs
**Modern model, prompt: "ROMEO:", temperature=0.8:**
```
ROMEO:
A gallant-house! what says the woe?
MERCUTIO:
Good madam, my lord.
ROMEO:
Villain, for I do not say it is true,
Which hath a sin by him come to the crown,
That he is reports for me; for ever is he.
```
**Vanilla model, prompt: "ROMEO:", temperature=0.8:**
```
ROMEO:
Good father, cousin, my lord, I could not need me.
First Servant:
Sir, but you came to this humour of the king,
Lest hear him withis heart flowers.
```
## Limitations
- **Tiny dataset:** Trained on only 1.1MB of text. The model overfits after ~2,500 steps.
- **Character-level tokenization:** Inefficient compared to BPE. Each character is a separate token.
- **No instruction tuning:** This is a base model — it completes text, it does not follow instructions or answer questions.
- **Small context window:** 256 tokens maximum.
- **Quality:** Output is recognizably Shakespeare-like but contains grammatical errors and occasionally mixes characters from different plays.
## How to Use
```python
import sys
import torch

sys.path.append('src')  # model and tokenizer modules live under src/
from model_modern import ModernGPT
from tokenizer import encode, decode

device = "cuda" if torch.cuda.is_available() else "cpu"

# weights_only=False: the checkpoint stores the config dict alongside weights
ckpt = torch.load("model.pt", map_location=device, weights_only=False)
model = ModernGPT(**ckpt["config"]).to(device)
model.load_state_dict(ckpt["model_state"])
model.eval()

# Encode the prompt and sample 200 new characters at temperature 0.8
idx = torch.tensor([encode("ROMEO:")], dtype=torch.long, device=device)
out = model.generate(idx, max_new_tokens=200, temperature=0.8)
print(decode(out[0].tolist()))
```
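For reference, each step of `generate` amounts to temperature-scaled sampling from the model's next-token distribution. A sketch of one such step (illustrative names; the repo's `generate` additionally maintains a KV cache so each step reuses cached keys and values rather than recomputing attention over the whole prefix):

```python
import torch

def sample_next(logits: torch.Tensor, temperature: float = 0.8) -> torch.Tensor:
    """One autoregressive sampling step: divide the last-position logits by
    the temperature, softmax, and draw one token id. Lower temperature
    sharpens the distribution; higher flattens it."""
    probs = torch.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1)

logits = torch.tensor([[2.0, 0.5, -1.0]])  # (batch=1, vocab=3)
token = sample_next(logits)                # shape (1, 1)
```

The sampled id is appended to the running sequence and fed back in, repeating until `max_new_tokens` characters have been produced.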
## Source Code
Full implementation with detailed documentation: [github.com/brianmeyer/tinyllm](https://github.com/brianmeyer/tinyllm)
## References
- Karpathy, A. [build-nanogpt](https://github.com/karpathy/build-nanogpt) — primary reference
- Su et al. (2021). [RoFormer: Enhanced Transformer with Rotary Position Embedding](https://arxiv.org/abs/2104.09864)
- Shazeer, N. (2020). [GLU Variants Improve Transformer](https://arxiv.org/abs/2002.05202)
- Zhang & Sennrich (2019). [Root Mean Square Layer Normalization](https://arxiv.org/abs/1910.07467)