# GPT-152M trained from scratch on FineWeb-Edu

A 152 million parameter GPT-style decoder-only transformer trained completely from scratch using raw PyTorch. No pretrained weights were used at any point. Built as a learning project to understand every component of a modern language model.

## Model Details

| Property | Value |
|---|---|
| Parameters | 152M |
| Architecture | Decoder-only Transformer |
| Layers | 12 |
| Hidden dim | 768 |
| Attention heads | 12 |
| FFN type | SwiGLU (3-matrix gated) |
| Positional encoding | RoPE (Rotary) |
| Context length | 512 tokens |
| Tokenizer | GPT-2 BPE (50,257 vocab) |
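As a sanity check, the configuration above roughly accounts for the stated parameter count. This is a back-of-the-envelope sketch: the FFN inner dimension (3072 = 4 × 768) is an assumption not stated in the card, and biases/LayerNorm gains are ignored as negligible.

```python
# Rough parameter count from the table above (d_ff = 3072 is assumed).
vocab, d_model, n_layers = 50257, 768, 12
d_ff = 4 * d_model  # assumed SwiGLU inner dimension

embedding = vocab * d_model              # shared with the LM head (weight tying)
attn_per_layer = 4 * d_model * d_model   # Q, K, V, and output projections
ffn_per_layer = 3 * d_model * d_ff       # three SwiGLU matrices

total = embedding + n_layers * (attn_per_layer + ffn_per_layer)
print(f"{total / 1e6:.1f}M parameters")  # 151.8M, close to the reported 152M
```

Under these assumptions the count lands within half a percent of the reported 152M, with the tied embedding contributing about 38.6M of it.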

## Training Details

| Property | Value |
|---|---|
| Dataset | FineWeb-Edu (sample-10BT) |
| Tokens trained | 197M |
| Optimizer steps | 6,000 |
| Optimizer | AdamW (β₁ = 0.9, β₂ = 0.95) |
| Peak learning rate | 3e-4 |
| LR schedule | Linear warmup (200 steps) + cosine decay |
| Effective batch size | 64 sequences (micro-batch 4 × 16 gradient-accumulation steps) |
| Hardware | NVIDIA Tesla T4 (free Kaggle GPU) |
| Training time | ~8.5 hours |
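The schedule above can be sketched as a small function: 200 steps of linear warmup to the 3e-4 peak, then cosine decay over the remaining steps. The decay floor of 0 is an assumption; the card does not state a minimum learning rate.

```python
import math

PEAK_LR, WARMUP, TOTAL = 3e-4, 200, 6000

def lr_at(step):
    """Learning rate at a given optimizer step (assumed floor of 0)."""
    if step < WARMUP:
        # Linear warmup from ~0 to the peak over the first 200 steps
        return PEAK_LR * (step + 1) / WARMUP
    # Cosine decay from the peak down to 0 over the remaining steps
    progress = (step - WARMUP) / (TOTAL - WARMUP)
    return 0.5 * PEAK_LR * (1 + math.cos(math.pi * progress))

print(lr_at(199))   # 3e-4 at the end of warmup
print(lr_at(3100))  # 1.5e-4 at the midpoint of the decay
```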

## Training Results

| Metric | Value |
|---|---|
| Initial loss (random init) | 10.99 |
| Final train loss | 3.91 |
| Final val loss | 4.00 |
| Final val perplexity | 54.6 |
| Random baseline PPL | 59,832 |
| Improvement over random | 1,096× |
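The last three rows follow directly from the losses: perplexity is exp(cross-entropy loss), and the improvement factor is the ratio of the random-baseline perplexity to the final one.

```python
import math

# Perplexity = exp(cross-entropy loss), using the values from the table
final_val_loss = 4.00
ppl = math.exp(final_val_loss)
print(f"val perplexity ≈ {ppl:.1f}")        # ≈ 54.6

improvement = 59832 / ppl
print(f"improvement ≈ {improvement:.0f}×")  # ≈ 1096×
```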

## Architecture Highlights

- **Pre-LayerNorm:** normalise before each sublayer for stable gradients
- **SwiGLU FFN:** gated activation (used in LLaMA, PaLM)
- **RoPE:** rotary positional embeddings (relative position sensitivity)
- **Weight tying:** embedding and LM head share weights (saves 38M params)
- **Gradient checkpointing:** recompute activations during backward to save VRAM
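For illustration, the 3-matrix gated SwiGLU FFN named above can be sketched in a few lines of PyTorch. The layer names (`w_gate`, `w_up`, `w_down`) and the inner dimension are illustrative, not taken from this repo's code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """Minimal SwiGLU FFN sketch (LLaMA/PaLM-style gated activation)."""
    def __init__(self, d_model=768, d_ff=3072):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_ff, bias=False)
        self.w_up = nn.Linear(d_model, d_ff, bias=False)
        self.w_down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x):
        # silu(x W_gate) gates the "up" projection elementwise,
        # then W_down projects back to the model dimension
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

ffn = SwiGLU()
out = ffn(torch.randn(2, 512, 768))
print(out.shape)  # torch.Size([2, 512, 768])
```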

## How to Load and Use

```python
import torch
from transformers import GPT2Tokenizer

# Download pytorch_model.pt from this repo, then load it with the
# custom model class defined in this repo's training code.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
ckpt = torch.load("pytorch_model.pt", map_location="cpu")

# model = GPT(config)          # custom class from this repo
# model.load_state_dict(ckpt)
# model.eval()
```

See the full model code and inference script in this repo.

## Sample Outputs (temperature=0.8, top-k=50)

**Prompt:** "Quantum mechanics is the branch of physics that"

> Quantum mechanics is the branch of physics that describes the behavior of matter and energy at the smallest scales...

**Prompt:** "The French Revolution began in 1789 because"

> The French Revolution began in 1789 because of growing social inequality and the financial crisis facing the French monarchy...

## Limitations

- Trained on only 197M tokens (GPT-3 used 300B)
- May produce factually incorrect statements
- Works best with educational/textbook-style prompts
- Greedy decoding produces repetition — always use top-k sampling
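The top-k sampling recommended above can be sketched as follows; this is a generic implementation of temperature-scaled top-k sampling, and the repo's own sampler may differ in details.

```python
import torch

def sample_top_k(logits, k=50, temperature=0.8):
    """Sample one token id from the k highest-probability logits."""
    logits = logits / temperature
    # Restrict the distribution to the k largest logits
    topk_vals, topk_idx = torch.topk(logits, k)
    probs = torch.softmax(topk_vals, dim=-1)
    # Draw one token from the renormalised top-k distribution
    choice = torch.multinomial(probs, num_samples=1)
    return topk_idx[choice].item()

# With k=1 this reduces to greedy decoding (always the argmax token):
logits = torch.tensor([0.1, 2.5, 0.3, 1.0])
print(sample_top_k(logits, k=1))  # 1
```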

## Intended Use

This model is released for educational purposes only to demonstrate that a working language model can be built and trained from scratch using freely available tools and compute.

## Training Code

Full training code with detailed comments explaining every component is available at the GitHub repository linked below.

Built with ❤️ using PyTorch on Kaggle free GPUs.
