# tiny-gpt-shakespeare
A 10M-parameter decoder-only transformer trained on the Tiny Shakespeare dataset. Built from scratch in PyTorch as an educational project; no pretrained weights or external libraries are used for the model itself.
## Model Description
- Architecture: Decoder-only transformer with modern components (RMSNorm, SwiGLU, RoPE, KV cache)
- Parameters: 10.6M (modern) / 10.8M (vanilla)
- Training data: Tiny Shakespeare (~1.1MB, 65 unique characters)
- Tokenization: Character-level (65 tokens)
- Context length: 256 tokens
- License: MIT
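Character-level tokenization means the vocabulary is simply the sorted set of unique characters in the corpus (65 for Tiny Shakespeare). A minimal sketch of how such a tokenizer works; the repo's actual `tokenizer` module may differ in details:

```python
# Minimal character-level tokenizer sketch (illustrative, not the repo's exact code).

def build_vocab(text):
    """Map each unique character in the corpus to an integer id."""
    chars = sorted(set(text))
    stoi = {ch: i for i, ch in enumerate(chars)}
    itos = {i: ch for ch, i in stoi.items()}
    return stoi, itos

def encode(s, stoi):
    """String -> list of integer token ids (one id per character)."""
    return [stoi[c] for c in s]

def decode(ids, itos):
    """List of integer token ids -> string."""
    return "".join(itos[i] for i in ids)
```

Encoding followed by decoding is an exact round trip, which is one reason character-level tokenizers are convenient for small educational models.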
## Architecture Details
| Component | Implementation |
|---|---|
| Layers | 6 transformer blocks |
| Attention | 6 heads, 64 dims each, with RoPE |
| FFN | SwiGLU (384 → 1024 → 384) |
| Normalization | RMSNorm (pre-norm) |
| Position encoding | Rotary Position Embeddings (RoPE) |
| Inference | KV cache for autoregressive generation |
| Weight tying | lm_head shares weights with token embedding |
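The two less-standard layers can be sketched roughly as follows. This is a hedged sketch under common conventions (bias-free linears, pre-norm scaling), not necessarily the repo's exact code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Normalize by the root-mean-square of the features; no mean subtraction, no bias."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * rms * self.weight

class SwiGLU(nn.Module):
    """Gated FFN: silu(W1 x) * (W3 x), projected back down by W2 (384 -> 1024 -> 384 here)."""
    def __init__(self, dim=384, hidden=1024):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden, bias=False)  # gate projection
        self.w3 = nn.Linear(dim, hidden, bias=False)  # value projection
        self.w2 = nn.Linear(hidden, dim, bias=False)  # down projection

    def forward(self, x):
        return self.w2(F.silu(self.w1(x)) * self.w3(x))
```

Note that SwiGLU uses three weight matrices instead of two, which is why the hidden width (1024) is smaller than the 4× expansion (1536) a vanilla FFN of the same parameter budget would use.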
## Training
| Parameter | Value |
|---|---|
| Optimizer | AdamW |
| Learning rate | 3e-4 |
| Batch size | 64 |
| Block size | 256 |
| Dropout | 0.3 |
| Training steps | 5,000 (best checkpoint at step 2,500) |
| Hardware | Google Colab T4 GPU |
| Training time | ~64 minutes |
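With the corpus encoded as a 1-D tensor of character ids, each training step samples random windows of `block_size` characters, with targets shifted one character ahead. A sketch of that setup under the hyperparameters above (`get_batch` is an assumed helper name, not necessarily the repo's):

```python
import torch

def get_batch(data, block_size=256, batch_size=64, device="cpu"):
    """Sample random (input, target) windows; targets are inputs shifted by one character."""
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i : i + block_size] for i in ix])
    y = torch.stack([data[i + 1 : i + 1 + block_size] for i in ix])
    return x.to(device), y.to(device)

# Optimizer as listed in the table above:
# optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
```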
## Training Results
| Model | Parameters | Best Val Loss | Best Step |
|---|---|---|---|
| Vanilla (LayerNorm + ReLU + learned pos) | 10.8M | 1.4804 | 3,000 |
| Modern (RMSNorm + SwiGLU + RoPE + KV cache) | 10.6M | 1.4754 | 2,500 |
Early stopping was used: the checkpoint with the lowest validation loss was kept.
## Component Comparison
Each modern component was tested in isolation against the vanilla baseline (2,000 training steps each):
| Component | Val Loss at Step 500 | vs Vanilla |
|---|---|---|
| Vanilla (baseline) | 1.99 | – |
| RMSNorm | 1.99 | No change |
| SwiGLU | 1.88 | -0.11 |
| RoPE | 1.68 | -0.31 |
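RoPE accounts for the largest single gain in this ablation. It injects position by rotating (even, odd) pairs of query/key dimensions through position-dependent angles, so attention scores depend on relative offsets. A self-contained sketch; the repo's implementation may precompute the angles differently:

```python
import torch

def apply_rope(x, base=10000.0):
    """Rotate dim pairs of x with shape (batch, seq, heads, head_dim) by position-dependent angles."""
    _, t, _, d = x.shape
    pos = torch.arange(t, dtype=torch.float32)
    freqs = base ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)
    angles = torch.outer(pos, freqs)          # (seq, head_dim/2)
    cos = angles.cos()[None, :, None, :]      # broadcast over batch and heads
    sin = angles.sin()[None, :, None, :]
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin      # 2-D rotation applied per pair
    out[..., 1::2] = x1 * sin + x2 * cos
    return out
```

Because each pair is a pure rotation, vector norms are preserved and position 0 is left unchanged, which makes the function easy to sanity-check.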
## Intended Use
This is an educational model. It is not intended for production use. It generates Shakespeare-style text and serves as a reference implementation for understanding transformer architectures.
## Sample Outputs
Modern model, prompt `"ROMEO:"`, temperature=0.8:

```
ROMEO:
A gallant-house! what says the woe?

MERCUTIO:
Good madam, my lord.

ROMEO:
Villain, for I do not say it is true,
Which hath a sin by him come to the crown,
That he is reports for me; for ever is he.
```
Vanilla model, prompt `"ROMEO:"`, temperature=0.8:

```
ROMEO:
Good father, cousin, my lord, I could not need me.

First Servant:
Sir, but you came to this humour of the king,
Lest hear him withis heart flowers.
```
## Limitations
- Tiny dataset: Trained on only 1.1MB of text. The model overfits after ~2,500 steps.
- Character-level tokenization: Inefficient compared to BPE. Each character is a separate token.
- No instruction tuning: This is a base model; it completes text, it does not follow instructions or answer questions.
- Small context window: 256 tokens maximum.
- Quality: Output is recognizably Shakespeare-like but contains grammatical errors and occasionally mixes characters from different plays.
## How to Use
```python
import sys
import torch

sys.path.append("src")  # model and tokenizer modules live under src/
from model_modern import ModernGPT
from tokenizer import encode, decode

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the checkpoint, which stores both the model config and the weights
ckpt = torch.load("model.pt", map_location=device, weights_only=False)
model = ModernGPT(**ckpt["config"]).to(device)
model.load_state_dict(ckpt["model_state"])
model.eval()

# Encode a prompt and sample 200 characters
idx = torch.tensor([encode("ROMEO:")], dtype=torch.long, device=device)
out = model.generate(idx, max_new_tokens=200, temperature=0.8)
print(decode(out[0].tolist()))
```
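The `temperature=0.8` argument sharpens the next-character distribution before sampling. The core of one sampling step can be sketched as follows (hedged: the repo's `generate` may add top-k filtering or the KV-cache path on top of this):

```python
import torch

@torch.no_grad()
def sample_next(logits, temperature=0.8):
    """Scale logits by 1/temperature, softmax into probabilities, draw one token id."""
    probs = torch.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1)
```

Temperatures below 1.0 concentrate probability on the likeliest characters, trading diversity for coherence; at temperature 1.0 the model's raw distribution is sampled unchanged.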
## Source Code
Full implementation with detailed documentation: github.com/brianmeyer/tinyllm
## References
- Karpathy, A. build-nanogpt (primary reference)
- Su et al. (2021). RoFormer: Enhanced Transformer with Rotary Position Embedding
- Shazeer, N. (2020). GLU Variants Improve Transformer
- Zhang & Sennrich (2019). Root Mean Square Layer Normalization