# gpt2-nano
A 44M-parameter GPT-2-style language model built from scratch in PyTorch, with a hand-implemented BPE tokenizer and transformer architecture. Trained on 99M tokens from the FineWeb-Edu dataset on Apple Silicon (MPS).
## Model details
| Component | Detail |
|---|---|
| Parameters | 44M |
| Layers | 12 |
| Embedding dim | 512 |
| Attention heads | 8 (head_dim = 64) |
| MLP expansion | 4x (512 → 2048 → 512) |
| Context length | 1024 tokens |
| Positional encoding | Sinusoidal (fixed) |
| Normalization | Pre-norm LayerNorm |
| Vocab size | 9,157 (custom BPE) |
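The fixed sinusoidal positional encoding in the table presumably follows the original Transformer formulation (sin on even dimensions, cos on odd ones). A minimal plain-Python sketch under that assumption, using the model's context length and embedding dimension:

```python
import math

def sinusoidal_encoding(context_len: int, dim: int) -> list[list[float]]:
    """Fixed positional encodings, one row per position:

    PE[pos, 2i]   = sin(pos / 10000^(2i/dim))
    PE[pos, 2i+1] = cos(pos / 10000^(2i/dim))
    """
    table = []
    for pos in range(context_len):
        row = []
        for i in range(dim):
            # Each sin/cos pair shares the same frequency exponent 2i/dim.
            angle = pos / (10000 ** ((i // 2 * 2) / dim))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

pe = sinusoidal_encoding(1024, 512)  # context_len=1024, embedding dim=512
```

Because the encodings are fixed rather than learned, they add no parameters and extrapolate deterministically to any position up to the context length.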
## Training
- Data: 99M tokens from FineWeb-edu (10BT sample)
- Optimizer: AdamW (lr=3e-4, weight_decay=0.1, cosine schedule)
- Gradient clipping: max_norm=1.0
- Hardware: Apple Silicon MPS
- Duration: ~13 hours, 10,000 steps
- Final val loss: 2.15
- Final val perplexity: 8.5
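The reported perplexity is simply the exponential of the cross-entropy validation loss (in nats per token), which lines up with the numbers above:

```python
import math

val_loss = 2.15                  # final validation loss, nats per token
perplexity = math.exp(val_loss)  # exp(2.15) ~ 8.58, consistent with the reported 8.5
print(round(perplexity, 2))
```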
## Usage

```python
import torch
from src.gpt import GPT, generate
from data.bpe_tokenizer import encode, decode, load_tokenizer

# Load model
ckpt = torch.load("checkpoints/final.pt", map_location="cpu")
gpt = GPT(**ckpt["config"])
gpt.load_state_dict(ckpt["model_state_dict"])
gpt.eval()

# Load tokenizer
merges, vocab = load_tokenizer()

# Generate
prompt_tokens = encode("The ", merges, vocab)
text = generate(gpt, merges, vocab, prompt_tokens, context_len=1024, max_new_tokens=50)
print(text)
```
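`generate` presumably runs a standard autoregressive loop: crop the running token list to the context window, score the next token, append, and repeat. A minimal greedy sketch of that loop with a stub model (the loop structure is an assumption; the real `generate` also handles sampling and decoding back to text):

```python
def greedy_generate(model, tokens: list[int], context_len: int, max_new_tokens: int) -> list[int]:
    """Greedy autoregressive decoding: append the argmax token each step."""
    tokens = list(tokens)
    for _ in range(max_new_tokens):
        window = tokens[-context_len:]  # crop to the model's context length
        logits = model(window)          # next-token scores over the vocab
        tokens.append(max(range(len(logits)), key=logits.__getitem__))
    return tokens

# Stub "model" that always scores token 3 highest, to show the loop shape.
stub = lambda window: [0.0, 0.1, 0.2, 0.9]
out = greedy_generate(stub, [1, 2], context_len=4, max_new_tokens=3)
print(out)  # [1, 2, 3, 3, 3]
```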
## Files

- `checkpoints/final.pt`: model weights, optimizer state, and config
- `bpe-tokenizer/merges.json`: BPE merge rules
- `bpe-tokenizer/vocab.json`: token-to-id mapping
- `bpe-shards/*.bin`: pre-tokenized training data (binary format)