gpt2-nano

A 44M parameter GPT-2-style language model built from scratch in PyTorch. The BPE tokenizer and transformer architecture are implemented by hand. Trained on 99M tokens from FineWeb-edu on Apple Silicon MPS.

Model details

Component            Detail
Parameters           44M
Layers               12
Embedding dim        512
Attention heads      8 (head_dim = 64)
MLP expansion        4x (512 → 2048 → 512)
Context length       1024 tokens
Positional encoding  Sinusoidal (fixed)
Normalization        Pre-norm LayerNorm
Vocab size           9,157 (custom BPE)
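The fixed sinusoidal positional encoding listed above can be sketched as follows. This is the standard formulation from "Attention Is All You Need", not necessarily the exact code in this repo; the function name is illustrative.

```python
import torch

def sinusoidal_positions(context_len: int, dim: int) -> torch.Tensor:
    """Fixed sinusoidal positional encodings: sin on even dims, cos on odd dims."""
    pos = torch.arange(context_len, dtype=torch.float32).unsqueeze(1)   # (T, 1)
    i = torch.arange(0, dim, 2, dtype=torch.float32)                    # even dim indices
    angles = pos / (10000 ** (i / dim))                                 # (T, dim/2)
    pe = torch.zeros(context_len, dim)
    pe[:, 0::2] = torch.sin(angles)
    pe[:, 1::2] = torch.cos(angles)
    return pe

# Shape matches this model's config: 1024 positions x 512 embedding dims.
pe = sinusoidal_positions(1024, 512)
```

Because the encoding is fixed rather than learned, it adds no parameters to the 44M count.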

Training

  • Data: 99M tokens from FineWeb-edu (10BT sample)
  • Optimizer: AdamW (lr=3e-4, weight_decay=0.1, cosine schedule)
  • Gradient clipping: max_norm=1.0
  • Hardware: Apple Silicon MPS
  • Duration: ~13 hours, 10,000 steps
  • Final val loss: 2.15
  • Final val perplexity: 8.5
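The reported perplexity is consistent with the loss, since perplexity = exp(loss) and exp(2.15) ≈ 8.6. The optimizer setup above can be sketched as a minimal training step; the warmup details, data loading, and the real GPT model are omitted, and the small `Linear` stand-in and placeholder loss are purely illustrative.

```python
import math
import torch

max_steps, base_lr = 10_000, 3e-4  # from the card

def lr_at(step: int) -> float:
    # Cosine decay from base_lr to 0 over max_steps (any warmup phase omitted).
    return base_lr * 0.5 * (1 + math.cos(math.pi * step / max_steps))

model = torch.nn.Linear(8, 8)  # stand-in for the GPT model
opt = torch.optim.AdamW(model.parameters(), lr=base_lr, weight_decay=0.1)

for step in range(3):  # one optimizer update per step
    for group in opt.param_groups:
        group["lr"] = lr_at(step)                  # cosine schedule
    loss = model(torch.randn(4, 8)).pow(2).mean()  # placeholder loss
    opt.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    opt.step()
```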

Usage

import torch
from src.gpt import GPT, generate
from data.bpe_tokenizer import encode, decode, load_tokenizer

# Load model
ckpt = torch.load("checkpoints/final.pt", map_location="cpu")
gpt = GPT(**ckpt["config"])
gpt.load_state_dict(ckpt["model_state_dict"])
gpt.eval()

# Load tokenizer
merges, vocab = load_tokenizer()

# Generate
prompt_tokens = encode("The ", merges, vocab)
text = generate(gpt, merges, vocab, prompt_tokens, context_len=1024, max_new_tokens=50)
print(text)
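Internally, generate presumably runs an autoregressive sampling loop. A minimal sketch of that loop is below, assuming the model maps a (batch, time) id tensor to (batch, time, vocab) logits; the function name and temperature parameter are illustrative, and the repo's generate may differ (e.g. greedy decoding or top-k).

```python
import torch

@torch.no_grad()
def generate_ids(model, ids, context_len, max_new_tokens, temperature=1.0):
    # Autoregressive sampling: feed the last context_len tokens, sample one id, append.
    ids = torch.tensor([ids])
    for _ in range(max_new_tokens):
        logits = model(ids[:, -context_len:])[0, -1]        # logits for the next token
        probs = torch.softmax(logits / temperature, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)
        ids = torch.cat([ids, next_id.view(1, 1)], dim=1)
    return ids[0].tolist()
```

Truncating the input to the last context_len tokens is what keeps generation within the model's 1024-token window.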

Files

  • checkpoints/final.pt — model weights, optimizer state, and config
  • bpe-tokenizer/merges.json — BPE merge rules
  • bpe-tokenizer/vocab.json — token-to-id mapping
  • bpe-shards/*.bin — pre-tokenized training data (binary format)
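A sketch of how merge rules and a vocab combine during BPE encoding: repeatedly merge the adjacent pair with the highest merge priority, then map the resulting tokens to ids. This assumes merges are an ordered list of token pairs and the vocab is a token-string-to-id mapping; the actual JSON schemas in this repo may differ.

```python
def bpe_encode(tokens, merges, vocab):
    """Greedily apply BPE merges in priority order, then look up token ids."""
    ranks = {pair: i for i, pair in enumerate(merges)}  # lower rank = earlier merge
    tokens = list(tokens)
    while len(tokens) > 1:
        pairs = [(tokens[i], tokens[i + 1]) for i in range(len(tokens) - 1)]
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break  # no applicable merge rule remains
        i = pairs.index(best)
        tokens[i:i + 2] = [tokens[i] + tokens[i + 1]]
    return [vocab[t] for t in tokens]
```

For example, with merges [("a","b"), ("ab","c")], the sequence ["a","b","c"] collapses to the single token "abc" before the vocab lookup.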

Source

GitHub repository

