# GPT-2 Small: Trained on Filtered Common Crawl

A GPT-2 small model (124M parameters) trained on filtered Common Crawl data as part of ECE405 Assignment 2 (based on Stanford CS336 Assignment 4).

## Model Details

| Parameter | Value |
| --- | --- |
| Architecture | GPT-2 small (124M params) |
| Layers | 12 |
| Hidden dim | 768 |
| Attention heads | 12 |
| Context length | 512 |
| FFN dim | 2048 |
| Vocab size | 50,257 (GPT-2 BPE) |
| RoPE theta | 10,000 |

## Training

| Parameter | Value |
| --- | --- |
| Hardware | 2× NVIDIA RTX A4000 (16 GB each) |
| Precision | bfloat16 |
| Training steps | 12,000 |
| Effective batch size | 256 sequences/step |
| Tokens per step | 131,072 |
| Total tokens seen | ~1.57B |
| Training time | ~14 hours |
| Best val loss | 2.856 |
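The token accounting in the table is internally consistent: 256 sequences × 512-token context = 131,072 tokens per step, and 12,000 steps × 131,072 tokens ≈ 1.57B tokens. A quick check:

```python
# Sanity-check the token counts reported in the training table.
batch_size = 256       # sequences per optimizer step
context_length = 512   # tokens per sequence
steps = 12_000

tokens_per_step = batch_size * context_length
total_tokens = tokens_per_step * steps

print(tokens_per_step)          # 131072
print(total_tokens / 1e9)       # ~1.57 (billion tokens)
```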

## Training Data

5,000 Common Crawl WET files (CC-MAIN-2026-08) filtered through:

  1. Language identification (English, fastText lid.176.bin, threshold 0.80)
  2. Gopher quality rules (word count, mean word length, ellipsis ratio, alphabetic ratio)
  3. Quality classifier (fastText, trained on Wikipedia-linked pages vs random CC)
  4. NSFW/toxic content removal (Dolma/Jigsaw fastText models)
  5. PII masking (emails, phone numbers, IP addresses)
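As an illustration of step 2, a minimal Gopher-style filter might look like the following. The thresholds are commonly cited Gopher defaults, not necessarily the exact values used in this run:

```python
def passes_gopher_rules(text: str) -> bool:
    """Minimal Gopher-style quality filter (illustrative thresholds)."""
    words = text.split()
    # Word-count bounds
    if not (50 <= len(words) <= 100_000):
        return False
    # Mean word length between 3 and 10 characters
    mean_len = sum(len(w) for w in words) / len(words)
    if not (3 <= mean_len <= 10):
        return False
    # At most 30% of lines ending in an ellipsis
    lines = text.splitlines()
    if lines:
        ellipsis = sum(l.rstrip().endswith("...") for l in lines) / len(lines)
        if ellipsis > 0.3:
            return False
    # At least 80% of words containing an alphabetic character
    alpha = sum(any(c.isalpha() for c in w) for w in words) / len(words)
    return alpha >= 0.8
```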

Result: 5.75M documents, 8.7B tokens (17 GB tokenized with GPT-2 BPE)
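Step 5's PII masking can be sketched with regular expressions; the patterns and placeholder tokens below are illustrative, not the exact ones used in the pipeline:

```python
import re

# Illustrative patterns for the three PII categories masked in step 5.
PII_PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"), "|||EMAIL|||"),
    (re.compile(r"\b(?:\+?1[-. ]?)?\(?\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}\b"), "|||PHONE|||"),
    (re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"), "|||IP|||"),
]

def mask_pii(text: str) -> str:
    """Replace emails, phone numbers, and IPv4 addresses with placeholders."""
    for pattern, token in PII_PATTERNS:
        text = pattern.sub(token, text)
    return text
```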

## Validation Loss Curve

| Step | Val loss |
| --- | --- |
| 1,000 | 4.032 |
| 2,000 | 3.683 |
| 3,000 | 3.507 |
| 4,000 | 3.379 |
| 5,000 | 3.273 |
| 6,000 | 3.202 |
| 7,000 | 3.101 |
| 8,000 | 3.030 |
| 9,000 | 2.982 |
| 10,000 | 2.927 |
| 11,000 | 2.883 |
| 12,000 | 2.856 |
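Assuming the reported loss is mean per-token cross-entropy in nats (the usual convention), perplexity is `exp(loss)`, so the final checkpoint sits at roughly exp(2.856) ≈ 17.4:

```python
import math

# Convert a few of the validation losses above to perplexity.
val_losses = {1_000: 4.032, 6_000: 3.202, 12_000: 2.856}
for step, loss in val_losses.items():
    print(f"step {step:>6}: loss {loss:.3f}, perplexity {math.exp(loss):.1f}")
# final step: perplexity ~17.4
```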

## Usage

This model uses a custom architecture (`BasicsTransformerLM` from cs336-basics) with RoPE position embeddings and a SwiGLU FFN. Load it with:

```python
import json
import torch

# Assumed import path within the cs336-basics package; adjust if it differs.
from cs336_basics.model import BasicsTransformerLM

config = json.load(open("model_config.json"))
model = BasicsTransformerLM(**config)  # assumes config keys match the constructor
model.load_state_dict(torch.load("model.pt", map_location="cpu"))
model.eval()
```
