---
license: mit
language:
- en
tags:
- pytorch
- transformer
- language-model
- from-scratch
- educational
- shakespeare
- rope
- swiglu
- rmsnorm
- kv-cache
datasets:
- tiny-shakespeare
pipeline_tag: text-generation
---

# tiny-gpt-shakespeare

A 10M-parameter decoder-only transformer trained on the Tiny Shakespeare dataset. Built from scratch in PyTorch as an educational project; no pretrained weights or external libraries were used for the model itself.

## Model Description

- **Architecture:** Decoder-only transformer with modern components (RMSNorm, SwiGLU, RoPE, KV cache)
- **Parameters:** 10.6M (modern) / 10.8M (vanilla)
- **Training data:** [Tiny Shakespeare](https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt) (~1.1 MB, 65 unique characters)
- **Tokenization:** Character-level (65-token vocabulary)
- **Context length:** 256 tokens
- **License:** MIT

## Architecture Details

| Component | Implementation |
|-----------|----------------|
| Layers | 6 transformer blocks |
| Attention | 6 heads, 64 dims each, with RoPE |
| FFN | SwiGLU (384 → 1024 → 384) |
| Normalization | RMSNorm (pre-norm) |
| Position encoding | Rotary Position Embeddings (RoPE) |
| Inference | KV cache for autoregressive generation |
| Weight tying | lm_head shares weights with the token embedding |

## Training

| Parameter | Value |
|-----------|-------|
| Optimizer | AdamW |
| Learning rate | 3e-4 |
| Batch size | 64 |
| Block size | 256 |
| Dropout | 0.3 |
| Training steps | 5,000 (best checkpoint at step 2,500) |
| Hardware | Google Colab T4 GPU |
| Training time | ~64 minutes |

### Training Results

| Model | Parameters | Best Val Loss | Best Step |
|-------|------------|---------------|-----------|
| Vanilla (LayerNorm + ReLU + learned pos) | 10.8M | 1.4804 | 3,000 |
| Modern (RMSNorm + SwiGLU + RoPE + KV cache) | 10.6M | 1.4754 | 2,500 |

Early stopping was used: the checkpoint with the lowest validation loss was kept.
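Two of the modern components, RMSNorm and the SwiGLU FFN, are compact enough to sketch here. This is an illustrative PyTorch sketch using the dimensions from the Architecture Details table (384 → 1024 → 384); the class names, the `eps` value, and the bias-free linear layers are assumptions, not the repository's actual code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Normalize by the root-mean-square only (no mean subtraction, no bias), then scale."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        # rsqrt of the mean of squares over the feature dimension
        inv_rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * inv_rms * self.weight

class SwiGLU(nn.Module):
    """Gated FFN: silu(W1 x) * (W3 x), projected back down (384 -> 1024 -> 384)."""
    def __init__(self, dim=384, hidden=1024):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden, bias=False)  # gate branch
        self.w3 = nn.Linear(dim, hidden, bias=False)  # value branch
        self.w2 = nn.Linear(hidden, dim, bias=False)  # down-projection

    def forward(self, x):
        return self.w2(F.silu(self.w1(x)) * self.w3(x))
```

Compared with the vanilla block, RMSNorm drops LayerNorm's mean-centering and bias, and SwiGLU replaces the single ReLU expansion with a gated product, which is where the modern model saves its ~0.2M parameters despite the extra projection.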
### Component Comparison

Each modern component was tested in isolation against the vanilla baseline (2,000 training steps each):

| Component | Val Loss at Step 500 | vs Vanilla |
|-----------|----------------------|------------|
| Vanilla (baseline) | 1.99 | — |
| RMSNorm | 1.99 | No change |
| SwiGLU | 1.88 | -0.11 |
| RoPE | 1.68 | -0.31 |

## Intended Use

This is an **educational model**. It is not intended for production use. It generates Shakespeare-style text and serves as a reference implementation for understanding transformer architectures.

## Sample Outputs

**Modern model, prompt: "ROMEO:", temperature=0.8:**

```
ROMEO:
A gallant-house! what says the woe?

MERCUTIO:
Good madam, my lord.

ROMEO:
Villain, for I do not say it is true,
Which hath a sin by him come to the crown,
That he is reports for me; for ever is he.
```

**Vanilla model, prompt: "ROMEO:", temperature=0.8:**

```
ROMEO:
Good father, cousin, my lord, I could not need me.

First Servant:
Sir, but you came to this humour of the king,
Lest hear him withis heart flowers.
```

## Limitations

- **Tiny dataset:** Trained on only 1.1 MB of text. The model overfits after ~2,500 steps.
- **Character-level tokenization:** Inefficient compared to BPE; each character is a separate token.
- **No instruction tuning:** This is a base model. It completes text; it does not follow instructions or answer questions.
- **Small context window:** 256 tokens maximum.
- **Quality:** Output is recognizably Shakespeare-like but contains grammatical errors and occasionally mixes characters from different plays.
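RoPE gave the largest single improvement in the component comparison above. The idea is to encode position by rotating each (even, odd) pair of query/key dimensions through a position-dependent angle. A minimal sketch, assuming an interleaved pairing and the 64-dim heads from the architecture table (function names and layout are illustrative, not the repository's code):

```python
import torch

def rope_freqs(head_dim, seq_len, base=10000.0):
    # one rotation frequency per dimension pair, geometrically spaced
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    t = torch.arange(seq_len).float()
    angles = torch.outer(t, inv_freq)            # (seq_len, head_dim // 2)
    return torch.cos(angles), torch.sin(angles)

def apply_rope(x, cos, sin):
    # x: (batch, heads, seq, head_dim); rotate each (x1, x2) pair by its angle
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out
```

Because the rotation angle depends only on the token's position, attention scores between rotated queries and keys depend on relative offsets, which is why RoPE generalizes better than learned absolute position embeddings on a dataset this small.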
## How to Use

```python
import sys
import torch

sys.path.append('src')
from model_modern import ModernGPT
from tokenizer import encode, decode

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the checkpoint (stores both the model config and the weights)
ckpt = torch.load("model.pt", map_location=device, weights_only=False)
model = ModernGPT(**ckpt["config"]).to(device)
model.load_state_dict(ckpt["model_state"])
model.eval()

# Encode a prompt and sample 200 new characters (1 token = 1 character)
idx = torch.tensor([encode("ROMEO:")], dtype=torch.long, device=device)
out = model.generate(idx, max_new_tokens=200, temperature=0.8)
print(decode(out[0].tolist()))
```

## Source Code

Full implementation with detailed documentation: [github.com/brianmeyer/tinyllm](https://github.com/brianmeyer/tinyllm)

## References

- Karpathy, A. [build-nanogpt](https://github.com/karpathy/build-nanogpt) (primary reference)
- Su et al. (2021). [RoFormer: Enhanced Transformer with Rotary Position Embedding](https://arxiv.org/abs/2104.09864)
- Shazeer, N. (2020). [GLU Variants Improve Transformer](https://arxiv.org/abs/2002.05202)
- Zhang & Sennrich (2019). [Root Mean Square Layer Normalization](https://arxiv.org/abs/1910.07467)