# GPT-2 Small – Trained on Filtered Common Crawl
A GPT-2 small model (124M parameters) trained on filtered Common Crawl data as part of ECE405 Assignment 2 (based on Stanford CS336 Assignment 4).
## Model Details
| Parameter | Value |
|---|---|
| Architecture | GPT-2 small (124M params) |
| Layers | 12 |
| Hidden dim | 768 |
| Attention heads | 12 |
| Context length | 512 |
| FFN dim | 2048 |
| Vocab size | 50,257 (GPT-2 BPE) |
| RoPE theta | 10,000 |
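As a sanity check, the table is consistent with the stated ~124M parameter count, assuming tied input/output embeddings and a three-matrix SwiGLU FFN (both are assumptions about the cs336-basics architecture, not stated in the table):

```python
# Rough parameter count from the architecture table above.
# Assumes tied embeddings and a three-matrix SwiGLU FFN (gate, up, down);
# norms and biases are ignored.
vocab_size, d_model, num_layers, d_ff = 50_257, 768, 12, 2048

embedding = vocab_size * d_model         # token embeddings (tied with LM head)
attn_per_layer = 4 * d_model * d_model   # Q, K, V, O projections
ffn_per_layer = 3 * d_model * d_ff       # SwiGLU: gate, up, down matrices
total = embedding + num_layers * (attn_per_layer + ffn_per_layer)

print(f"{total / 1e6:.1f}M")  # ~123.5M, i.e. "124M" once rounded
```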
## Training
| Parameter | Value |
|---|---|
| Hardware | 2x NVIDIA RTX A4000 (16 GB each) |
| Precision | bfloat16 |
| Training steps | 12,000 |
| Effective batch size | 256 sequences/step |
| Tokens per step | 131,072 |
| Total tokens seen | ~1.57B |
| Training time | ~14 hours |
| Best val loss | 2.856 |
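The token accounting in the table follows directly from the batch size and context length:

```python
# Sanity-check the training token counts reported above.
batch_size = 256      # sequences per optimizer step
context_length = 512  # tokens per sequence
steps = 12_000

tokens_per_step = batch_size * context_length
total_tokens = tokens_per_step * steps

print(tokens_per_step)                # 131072
print(f"{total_tokens / 1e9:.2f}B")   # 1.57B
```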
## Training Data
5,000 Common Crawl WET files (CC-MAIN-2026-08) filtered through:
- Language identification (English, fastText lid.176.bin, threshold 0.80)
- Gopher quality rules (word count, mean word length, ellipsis ratio, alphabetic ratio)
- Quality classifier (fastText, trained on Wikipedia-linked pages vs random CC)
- NSFW/toxic content removal (Dolma/Jigsaw fastText models)
- PII masking (emails, phone numbers, IP addresses)
Result: 5.75M documents, 8.7B tokens (17 GB tokenized with GPT-2 BPE)
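A minimal sketch of the Gopher-style quality rules listed above; the thresholds here are illustrative assumptions, not the exact values used in the assignment pipeline:

```python
# Hypothetical Gopher-style document filter. Thresholds are illustrative
# assumptions; the actual pipeline's values may differ.
def passes_gopher_rules(text: str) -> bool:
    words = text.split()
    # Word count: drop very short or absurdly long documents.
    if not (50 <= len(words) <= 100_000):
        return False
    # Mean word length: catches symbol soup and run-on tokens.
    mean_len = sum(len(w) for w in words) / len(words)
    if not (3 <= mean_len <= 10):
        return False
    # Ellipsis ratio: too many lines ending in "..." suggests truncated lists.
    lines = text.splitlines() or [text]
    ellipsis = sum(line.rstrip().endswith(("...", "…")) for line in lines)
    if ellipsis / len(lines) > 0.3:
        return False
    # Alphabetic ratio: most words should contain at least one letter.
    alpha = sum(any(c.isalpha() for c in w) for w in words)
    if alpha / len(words) < 0.8:
        return False
    return True
```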
## Validation Loss Curve
| Step | Val Loss |
|---|---|
| 1,000 | 4.032 |
| 2,000 | 3.683 |
| 3,000 | 3.507 |
| 4,000 | 3.379 |
| 5,000 | 3.273 |
| 6,000 | 3.202 |
| 7,000 | 3.101 |
| 8,000 | 3.030 |
| 9,000 | 2.982 |
| 10,000 | 2.927 |
| 11,000 | 2.883 |
| 12,000 | 2.856 |
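Since the loss is a per-token cross-entropy in nats, the corresponding validation perplexity is just its exponential:

```python
import math

# Convert the final validation loss (nats/token) to perplexity.
val_loss = 2.856
perplexity = math.exp(val_loss)
print(f"{perplexity:.1f}")  # ~17.4
```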
## Usage
This model uses a custom architecture (cs336-basics `BasicsTransformerLM`) with RoPE embeddings and a SwiGLU FFN. Load the config and weights with:

```python
import json

import torch

with open("model_config.json") as f:
    config = json.load(f)

state_dict = torch.load("model.pt", map_location="cpu")
```
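Once the model is instantiated from the state dict, generation can be as simple as a greedy decoding loop. The sketch below is framework-agnostic: `model` stands in for the loaded `BasicsTransformerLM` and is assumed (an assumption, not the actual API) to map a token-id list to next-token logits.

```python
# Minimal greedy-decoding sketch. `model` is any callable mapping a list of
# token ids to a list of next-token logits over the vocabulary (assumed
# interface, not the actual BasicsTransformerLM API).
def greedy_generate(model, prompt_ids, max_new_tokens, eos_id=None):
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = model(ids)
        # Pick the highest-scoring token id (greedy / argmax decoding).
        next_id = max(range(len(logits)), key=logits.__getitem__)
        ids.append(next_id)
        if next_id == eos_id:
            break
    return ids

# Toy stand-in model that always prefers token 2:
toy = lambda ids: [0.1, 0.2, 0.9]
print(greedy_generate(toy, [0], 3))  # [0, 2, 2, 2]
```

Swapping argmax for sampling from a softmax over `logits` gives temperature-based generation instead.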