# GPT-152M trained from scratch on FineWeb-Edu
A 152 million parameter GPT-style decoder-only transformer trained completely from scratch using raw PyTorch. No pretrained weights were used at any point. Built as a learning project to understand every component of a modern language model.
## Model Details
| Property | Value |
|---|---|
| Parameters | 152M |
| Architecture | Decoder-only Transformer |
| Layers | 12 |
| Hidden dim | 768 |
| Attention heads | 12 |
| FFN type | SwiGLU (3-matrix gated) |
| Positional encoding | RoPE (Rotary) |
| Context length | 512 tokens |
| Tokenizer | GPT-2 BPE (50,257 vocab) |
## Training Details
| Property | Value |
|---|---|
| Dataset | FineWeb-Edu (sample-10BT) |
| Tokens trained | 197M |
| Optimizer steps | 6,000 |
| Optimizer | AdamW (β₁=0.9, β₂=0.95) |
| Peak learning rate | 3e-4 |
| LR schedule | Linear warmup (200 steps) + Cosine decay |
| Effective batch size | 64 sequences (4 × 16 grad accumulation) |
| Hardware | NVIDIA Tesla T4 (free Kaggle GPU) |
| Training time | ~8.5 hours |
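The effective batch size of 64 sequences comes from a micro-batch of 4 with 16 gradient-accumulation steps, which keeps memory within the T4's limits. A minimal sketch of that pattern (the model and data here are illustrative placeholders, not the repo's actual training code):

```python
import torch

# Gradient accumulation: micro-batch of 4, accumulated over 16 steps,
# giving an effective batch of 64 sequences per optimizer step.
model = torch.nn.Linear(8, 8)  # stand-in for the transformer
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, betas=(0.9, 0.95))
accum_steps = 16

optimizer.zero_grad()
for step, x in enumerate(torch.randn(64, 4, 8).unbind(0)):  # micro-batches of 4
    loss = model(x).pow(2).mean()
    (loss / accum_steps).backward()   # scale so gradients average, not sum
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```

Dividing the loss by `accum_steps` before `backward()` makes the accumulated gradient equal to the mean over the effective batch, matching what a single large batch would produce.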
## Training Results
| Metric | Value |
|---|---|
| Initial loss (random) | 10.99 |
| Final train loss | 3.91 |
| Final val loss | 4.00 |
| Final val perplexity | 54.6 |
| Random baseline PPL | 59,832 |
| Improvement over random | 1,096× |
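Validation perplexity in the table is simply the exponential of the mean cross-entropy loss, so the final value follows directly from the final val loss:

```python
import math

# Perplexity is exp(mean cross-entropy loss).
def perplexity(loss: float) -> float:
    return math.exp(loss)

print(round(perplexity(4.00), 1))  # final val loss 4.00 -> 54.6
```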
## Architecture Highlights
- Pre-LayerNorm: Normalise before each sublayer for stable gradients
- SwiGLU FFN: Gated activation (used in LLaMA, PaLM)
- RoPE: Rotary positional embeddings (relative position sensitivity)
- Weight tying: Embedding and LM head share weights (saves 38M params)
- Gradient checkpointing: Recompute activations to save VRAM
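The "3-matrix gated" SwiGLU FFN replaces the usual two-matrix MLP with a gate projection, an up projection, and a down projection. A minimal sketch consistent with the table's hidden dim of 768 (the FFN inner size of 2048 and the module/attribute names are assumptions, not the repo's exact code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """Gated FFN: down(silu(gate(x)) * up(x)) -- three matrices, no biases."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.gate = nn.Linear(dim, hidden, bias=False)
        self.up = nn.Linear(dim, hidden, bias=False)
        self.down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.silu(self.gate(x)) * self.up(x))

ffn = SwiGLU(dim=768, hidden=2048)   # inner size 2048 is an assumption
y = ffn(torch.randn(2, 16, 768))     # (batch, seq, dim) shape preserved
```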
## How to Load and Use
```python
import torch
from transformers import GPT2Tokenizer

# Download the model file from this repo,
# then load it with the custom model class.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
ckpt = torch.load("pytorch_model.pt", map_location="cpu")
# See the full model code and inference script in this repo.
```
## Sample Outputs (temperature=0.8, top-k=50)
**Prompt:** "Quantum mechanics is the branch of physics that"

> Quantum mechanics is the branch of physics that describes the behavior of matter and energy at the smallest scales...

**Prompt:** "The French Revolution began in 1789 because"

> The French Revolution began in 1789 because of growing social inequality and the financial crisis facing the French monarchy...
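The samples above were drawn with temperature 0.8 and top-k 50. A minimal sketch of one top-k decoding step (the function name is illustrative; real logits would come from the model's forward pass):

```python
import torch

def sample_top_k(logits: torch.Tensor, temperature: float = 0.8, k: int = 50) -> int:
    # Scale logits by temperature, keep only the k most likely tokens,
    # and sample from the renormalised distribution.
    logits = logits / temperature
    topk_vals, topk_idx = torch.topk(logits, k)
    probs = torch.softmax(topk_vals, dim=-1)
    choice = torch.multinomial(probs, num_samples=1)
    return topk_idx[choice].item()

token_id = sample_top_k(torch.randn(50257))  # one step over the GPT-2 vocab
```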
## Limitations
- Trained on only 197M tokens (GPT-3 used 300B)
- May produce factually incorrect statements
- Best with educational/textbook-style prompts
- Greedy decoding produces repetition; use top-k sampling instead
## Intended Use
This model is released for educational purposes only, as a demonstration that a working language model can be built and trained from scratch using freely available tools and compute.
## Training Code
Full training code with detailed comments explaining every component is available at the GitHub repository linked below.
Built with ❤️ using PyTorch on Kaggle free GPUs.