# gpt2-nano
A 44M-parameter GPT-2-style language model built from scratch in PyTorch, with a hand-implemented BPE tokenizer and transformer architecture. Trained on 99M tokens from the FineWeb-Edu dataset on Apple Silicon (MPS).
## Model details
| Component | Detail |
|---|---|
| Parameters | 44M |
| Layers | 12 |
| Embedding dim | 512 |
| Attention heads | 8 (head_dim = 64) |
| MLP expansion | 4x (512 → 2048 → 512) |
| Context length | 1024 tokens |
| Positional encoding | Sinusoidal (fixed) |
| Normalization | Pre-norm LayerNorm |
| Vocab size | 9,157 (custom BPE) |
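The fixed sinusoidal positional encoding in the table presumably follows the original Transformer formulation (sin on even dimensions, cos on odd ones). A minimal plain-Python sketch under that assumption, using the model's context length and embedding dimension:

```python
import math

def sinusoidal_encoding(context_len: int, dim: int) -> list[list[float]]:
    """Fixed positional encodings, one row per position:

    PE[pos, 2i]   = sin(pos / 10000^(2i/dim))
    PE[pos, 2i+1] = cos(pos / 10000^(2i/dim))
    """
    table = []
    for pos in range(context_len):
        row = []
        for i in range(dim):
            # Each sin/cos pair shares the same frequency exponent 2i/dim.
            angle = pos / (10000 ** ((i // 2 * 2) / dim))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

pe = sinusoidal_encoding(1024, 512)  # context_len=1024, embedding dim=512
```

Because the encodings are fixed rather than learned, they add no parameters and extrapolate deterministically to any position up to the context length.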
## Training
- Data: 99M tokens from FineWeb-edu (10BT sample)
- Optimizer: AdamW (lr=3e-4, weight_decay=0.1, cosine schedule)
- Gradient clipping: max_norm=1.0
- Hardware: Apple Silicon MPS
- Duration: ~13 hours, 10,000 steps
- Final val loss: 2.15
- Final val perplexity: 8.5
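The reported perplexity is simply the exponential of the cross-entropy validation loss (in nats per token), which lines up with the numbers above:

```python
import math

val_loss = 2.15                  # final validation loss, nats per token
perplexity = math.exp(val_loss)  # exp(2.15) ~ 8.58, consistent with the reported 8.5
print(round(perplexity, 2))
```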
## Usage

```python
import torch
from src.gpt import GPT, generate
from data.bpe_tokenizer import encode, decode, load_tokenizer

# Load model
ckpt = torch.load("checkpoints/final.pt", map_location="cpu")
gpt = GPT(**ckpt["config"])
gpt.load_state_dict(ckpt["model_state_dict"])
gpt.eval()

# Load tokenizer
merges, vocab = load_tokenizer()

# Generate
prompt_tokens = encode("The ", merges, vocab)
text = generate(gpt, merges, vocab, prompt_tokens, context_len=1024, max_new_tokens=50)
print(text)
```
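`generate` presumably runs a standard autoregressive loop: crop the running token list to the context window, score the next token, append, and repeat. A minimal greedy sketch of that loop with a stub model (the loop structure is an assumption; the real `generate` also handles sampling and decoding back to text):

```python
def greedy_generate(model, tokens: list[int], context_len: int, max_new_tokens: int) -> list[int]:
    """Greedy autoregressive decoding: append the argmax token each step."""
    tokens = list(tokens)
    for _ in range(max_new_tokens):
        window = tokens[-context_len:]  # crop to the model's context length
        logits = model(window)          # next-token scores over the vocab
        tokens.append(max(range(len(logits)), key=logits.__getitem__))
    return tokens

# Stub "model" that always scores token 3 highest, to show the loop shape.
stub = lambda window: [0.0, 0.1, 0.2, 0.9]
out = greedy_generate(stub, [1, 2], context_len=4, max_new_tokens=3)
print(out)  # [1, 2, 3, 3, 3]
```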
## Files

- `checkpoints/final.pt`: model weights, optimizer state, and config
- `bpe-tokenizer/merges.json`: BPE merge rules
- `bpe-tokenizer/vocab.json`: token-to-id mapping
- `bpe-shards/*.bin`: pre-tokenized training data (binary format)