---
language:
  - en
license: mit
tags:
  - causal-lm
  - gpt
  - from-scratch
  - fineweb
  - pytorch
---

# FineWeb GPT — trained from scratch

A GPT-style language model built entirely from scratch as a learning exercise: the BPE tokenizer, the transformer architecture, and the training loop are all custom implementations.

## Architecture

| Component | Value |
|---|---|
| Parameters | 8.4M |
| Layers | 6 |
| d_model | 256 |
| Attention heads | 8 |
| Context length | 512 |
| Vocabulary | 8,192 (byte-level BPE) |
| Positional encoding | RoPE |
| Normalization | RMSNorm |
| Activation | SwiGLU |
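
As a sanity check, the 8.4M figure can be reproduced from the numbers above, assuming tied input/output embeddings and a SwiGLU hidden width of 4×d_model (both are assumptions, not stated in the table):

```python
# Back-of-the-envelope parameter count from the architecture table.
# Assumes tied embedding/LM-head weights and a SwiGLU hidden size of
# 4 * d_model (three projection matrices: gate, up, down).
vocab, d_model, n_layers = 8192, 256, 6

embedding = vocab * d_model              # shared with the LM head (tied)
attention = 4 * d_model * d_model        # Q, K, V, O projections
mlp = 3 * d_model * (4 * d_model)        # SwiGLU gate/up/down

total = embedding + n_layers * (attention + mlp)
print(f"{total / 1e6:.1f}M parameters")  # -> 8.4M
```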

## Training

| Setting | Value |
|---|---|
| Dataset | FineWeb-Edu sample-10BT (~5M tokens) |
| Steps | 1,800 |
| Optimizer | AdamW, cosine LR schedule with warmup |
| Val loss | 5.2764 |
| Perplexity | 195.7 |
| Hardware | Apple Silicon (MPS) |
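
The reported perplexity is simply the exponential of the validation cross-entropy loss:

```python
import math

val_loss = 5.2764                 # validation cross-entropy, in nats/token
perplexity = math.exp(val_loss)
print(f"{perplexity:.1f}")        # -> 195.7
```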

## Load the tokenizer

```python
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained("REPO_ID")
print(tokenizer("The study of mathematics").tokens())
```
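
A tokenizer like this one can be trained with the `tokenizers` library. Below is a minimal sketch of fitting an 8,192-entry byte-level BPE vocabulary; the tiny corpus and the `<|endoftext|>` special token are illustrative assumptions, not the actual training setup:

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Illustrative corpus only; the real tokenizer was trained on FineWeb-Edu text.
corpus = ["The study of mathematics", "A GPT-style language model"]

tok = Tokenizer(models.BPE())
tok.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
trainer = trainers.BpeTrainer(
    vocab_size=8192,
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
    special_tokens=["<|endoftext|>"],  # assumed special token
)
tok.train_from_iterator(corpus, trainer=trainer)

print(tok.encode("The study of mathematics").tokens)
```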

## Limitations

This is a learning exercise only: with ~5M training tokens and a perplexity of ~196, outputs are repetitive and often incoherent.

## Stack

PyTorch · HuggingFace datasets · tokenizers · wandb · huggingface_hub