---
language:
  - en
license: mit
tags:
  - causal-lm
  - gpt
  - from-scratch
  - fineweb
  - pytorch
---

# FineWeb GPT — trained from scratch

A GPT-style language model built entirely from scratch as a learning exercise: the BPE tokenizer, the transformer architecture, and the training loop are all custom implementations.

## Architecture

| Component | Value |
|---|---|
| Parameters | 8.4M |
| Layers | 6 |
| d_model | 256 |
| Attention heads | 8 |
| Context length | 512 |
| Vocabulary | 8,192 (byte-level BPE) |
| Positional encoding | RoPE |
| Normalization | RMSNorm |
| Activation | SwiGLU |
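
As a sanity check, the 8.4M figure can be reproduced from the numbers above, assuming tied input/output embeddings and a SwiGLU hidden width of 4×d_model (both are assumptions, not stated in the table):

```python
# Back-of-the-envelope parameter count from the architecture table.
# Assumes tied embedding/LM-head weights and a SwiGLU hidden size of
# 4 * d_model (three projection matrices: gate, up, down).
vocab, d_model, n_layers = 8192, 256, 6

embedding = vocab * d_model              # shared with the LM head (tied)
attention = 4 * d_model * d_model        # Q, K, V, O projections
mlp = 3 * d_model * (4 * d_model)        # SwiGLU gate/up/down

total = embedding + n_layers * (attention + mlp)
print(f"{total / 1e6:.1f}M parameters")  # -> 8.4M
```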

## Training

| Setting | Value |
|---|---|
| Dataset | FineWeb-Edu sample-10BT (~5M tokens) |
| Steps | 1,800 |
| Optimizer | AdamW, cosine LR schedule with warmup |
| Val loss | 5.2764 |
| Perplexity | 195.7 |
| Hardware | Apple Silicon (MPS) |
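
The reported perplexity is simply the exponential of the validation cross-entropy loss:

```python
import math

val_loss = 5.2764                 # validation cross-entropy, in nats/token
perplexity = math.exp(val_loss)
print(f"{perplexity:.1f}")        # -> 195.7
```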

## Load the tokenizer

```python
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained("REPO_ID")
print(tokenizer("The study of mathematics").tokens())
```
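
A tokenizer like this one can be trained with the `tokenizers` library. Below is a minimal sketch of fitting an 8,192-entry byte-level BPE vocabulary; the tiny corpus and the `<|endoftext|>` special token are illustrative assumptions, not the actual training setup:

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Illustrative corpus only; the real tokenizer was trained on FineWeb-Edu text.
corpus = ["The study of mathematics", "A GPT-style language model"]

tok = Tokenizer(models.BPE())
tok.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
trainer = trainers.BpeTrainer(
    vocab_size=8192,
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
    special_tokens=["<|endoftext|>"],  # assumed special token
)
tok.train_from_iterator(corpus, trainer=trainer)

print(tok.encode("The study of mathematics").tokens)
```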

## Limitations

This is a learning exercise only: with ~5M training tokens and a perplexity of ~196, outputs are repetitive and often incoherent.

## Stack

PyTorch · HuggingFace datasets · tokenizers · wandb · huggingface_hub