# GPT-152M trained from scratch on FineWeb-Edu
A 152 million parameter GPT-style decoder-only transformer trained completely from scratch using raw PyTorch. No pretrained weights were used at any point. Built as a learning project to understand every component of a modern language model.
## Model Details
| Property | Value |
|---|---|
| Parameters | 152M |
| Architecture | Decoder-only Transformer |
| Layers | 12 |
| Hidden dim | 768 |
| Attention heads | 12 |
| FFN type | SwiGLU (3-matrix gated) |
| Positional encoding | RoPE (Rotary) |
| Context length | 512 tokens |
| Tokenizer | GPT-2 BPE (50,257 vocab) |
## Training Details
| Property | Value |
|---|---|
| Dataset | FineWeb-Edu (sample-10BT) |
| Tokens trained | 197M |
| Optimizer steps | 6,000 |
| Optimizer | AdamW (β₁=0.9, β₂=0.95) |
| Peak learning rate | 3e-4 |
| LR schedule | Linear warmup (200 steps) + Cosine decay |
| Effective batch size | 64 sequences (4 × 16 grad accumulation) |
| Hardware | NVIDIA Tesla T4 (free Kaggle GPU) |
| Training time | ~8.5 hours |
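The effective batch size of 64 sequences comes from a micro-batch of 4 with 16 gradient-accumulation steps, which keeps memory within the T4's limits. A minimal sketch of that pattern (the model and data here are illustrative placeholders, not the repo's actual training code):

```python
import torch

# Gradient accumulation: micro-batch of 4, accumulated over 16 steps,
# giving an effective batch of 64 sequences per optimizer step.
model = torch.nn.Linear(8, 8)  # stand-in for the transformer
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, betas=(0.9, 0.95))
accum_steps = 16

optimizer.zero_grad()
for step, x in enumerate(torch.randn(64, 4, 8).unbind(0)):  # micro-batches of 4
    loss = model(x).pow(2).mean()
    (loss / accum_steps).backward()   # scale so gradients average, not sum
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```

Dividing the loss by `accum_steps` before `backward()` makes the accumulated gradient equal to the mean over the effective batch, matching what a single large batch would produce.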
## Training Results
| Metric | Value |
|---|---|
| Initial loss (random) | 10.99 |
| Final train loss | 3.91 |
| Final val loss | 4.00 |
| Final val perplexity | 54.6 |
| Random baseline PPL | 59,832 |
| Improvement over random | 1,096× |
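Validation perplexity in the table is simply the exponential of the mean cross-entropy loss, so the final value follows directly from the final val loss:

```python
import math

# Perplexity is exp(mean cross-entropy loss).
def perplexity(loss: float) -> float:
    return math.exp(loss)

print(round(perplexity(4.00), 1))  # final val loss 4.00 -> 54.6
```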
## Architecture Highlights
- Pre-LayerNorm: Normalise before each sublayer for stable gradients
- SwiGLU FFN: Gated activation (used in LLaMA, PaLM)
- RoPE: Rotary positional embeddings (relative position sensitivity)
- Weight tying: Embedding and LM head share weights (saves 38M params)
- Gradient checkpointing: Recompute activations to save VRAM
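The "3-matrix gated" SwiGLU FFN replaces the usual two-matrix MLP with a gate projection, an up projection, and a down projection. A minimal sketch consistent with the table's hidden dim of 768 (the FFN inner size of 2048 and the module/attribute names are assumptions, not the repo's exact code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """Gated FFN: down(silu(gate(x)) * up(x)) -- three matrices, no biases."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.gate = nn.Linear(dim, hidden, bias=False)
        self.up = nn.Linear(dim, hidden, bias=False)
        self.down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.silu(self.gate(x)) * self.up(x))

ffn = SwiGLU(dim=768, hidden=2048)   # inner size 2048 is an assumption
y = ffn(torch.randn(2, 16, 768))     # (batch, seq, dim) shape preserved
```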
## How to Load and Use
```python
import torch
from transformers import GPT2Tokenizer

# Download the model file from this repo,
# then load it with the custom model class.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
ckpt = torch.load("pytorch_model.pt", map_location="cpu")
# See the full model code and inference script in this repo.
```
## Sample Outputs (temperature=0.8, top-k=50)
**Prompt:** "Quantum mechanics is the branch of physics that"

> Quantum mechanics is the branch of physics that describes the behavior of matter and energy at the smallest scales...

**Prompt:** "The French Revolution began in 1789 because"

> The French Revolution began in 1789 because of growing social inequality and the financial crisis facing the French monarchy...
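The samples above were drawn with temperature 0.8 and top-k 50. A minimal sketch of one top-k decoding step (the function name is illustrative; real logits would come from the model's forward pass):

```python
import torch

def sample_top_k(logits: torch.Tensor, temperature: float = 0.8, k: int = 50) -> int:
    # Scale logits by temperature, keep only the k most likely tokens,
    # and sample from the renormalised distribution.
    logits = logits / temperature
    topk_vals, topk_idx = torch.topk(logits, k)
    probs = torch.softmax(topk_vals, dim=-1)
    choice = torch.multinomial(probs, num_samples=1)
    return topk_idx[choice].item()

token_id = sample_top_k(torch.randn(50257))  # one step over the GPT-2 vocab
```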
## Limitations
- Trained on only 197M tokens (GPT-3 used 300B)
- May produce factually incorrect statements
- Best with educational/textbook-style prompts
- Greedy decoding produces repetition; use top-k sampling instead
## Intended Use
This model is released for educational purposes only, as a demonstration that a working language model can be built and trained from scratch using freely available tools and compute.
## Training Code
Full training code with detailed comments explaining every component is available at the GitHub repository linked below.
Built with ❤️ using PyTorch on Kaggle free GPUs.