---
license: apache-2.0
language:
- en
tags:
- llama
- pretrained
- from-scratch
- gpuburnout
pipeline_tag: text-generation
---
# GPUburnout-3B-75K
A 3.12-billion-parameter Llama-style decoder-only transformer, pretrained from scratch as the Season 4 model in the GPUburnout blog series. This is the final pretraining checkpoint, at 75,000 steps.
This is the base model, before any instruction tuning. For the chat-tuned version, see [GPUburnout/GPUburnout-3B-75K-Chat](https://huggingface.co/GPUburnout/GPUburnout-3B-75K-Chat) (the shipped SFT champion).
## Model Details
- **Architecture:** Llama-style decoder-only transformer
- **Parameters:** 3.12B
- **Hidden size:** 2,560
- **Intermediate size:** 10,240
- **Layers:** 32
- **Attention heads:** 40 query, 10 key/value (GQA)
- **Head dim:** 64
- **Vocab size:** 32,005
- **Max position:** 2,048
- **Tie word embeddings:** True
- **Precision:** float16
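
As a sanity check on the figures above, the parameter count can be rebuilt from the listed dimensions. This is a minimal sketch, not the training code: it assumes the usual Llama layout (bias-free linear layers, a gated MLP with three weight matrices, two RMSNorm weights per layer plus a final norm), none of which is stated explicitly in this card. The function names are illustrative only.

```python
# Architecture values copied from the Model Details list.
VOCAB = 32_005
HIDDEN = 2_560
INTERMEDIATE = 10_240
LAYERS = 32
Q_HEADS = 40
KV_HEADS = 10
HEAD_DIM = 64
MAX_POS = 2_048
BYTES_FP16 = 2


def count_parameters() -> int:
    """Parameter count under the assumed Llama layout (no biases, gated MLP, RMSNorm)."""
    q_dim = Q_HEADS * HEAD_DIM    # 2,560, equal to the hidden size
    kv_dim = KV_HEADS * HEAD_DIM  # 640, shared across query groups (GQA)
    # q_proj + (k_proj and v_proj) + o_proj
    attention = HIDDEN * q_dim + 2 * HIDDEN * kv_dim + q_dim * HIDDEN
    # gate, up, and down projections
    mlp = 3 * HIDDEN * INTERMEDIATE
    # two norm weight vectors per layer
    norms = 2 * HIDDEN
    per_layer = attention + mlp + norms
    # Tied word embeddings: the input and output matrices are counted once.
    embeddings = VOCAB * HIDDEN
    final_norm = HIDDEN
    return LAYERS * per_layer + embeddings + final_norm


def kv_cache_bytes(seq_len: int = MAX_POS) -> int:
    """float16 KV cache size for one sequence: K and V for every layer and KV head."""
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * seq_len * BYTES_FP16


params = count_parameters()
print(f"parameters: {params:,} (~{params / 1e9:.2f}B)")
print(f"fp16 weights: {params * BYTES_FP16 / 1e9:.2f} GB")
print(f"fp16 KV cache at {MAX_POS} tokens: {kv_cache_bytes() / 2**20:.0f} MiB")
```

Under these assumptions the total is 3,122,969,600 parameters, consistent with the 3.12B figure. That is about 6.25 GB of float16 weights, and with 10 key/value heads instead of 40 the KV cache for a full 2,048-token sequence is 160 MiB.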
## Training
- **Steps:** 75,000
- **Final val loss:** 2.2475
- **Training cost:** ~$425 (RunPod A100/H200)
- **Pretraining data mix:** FineWeb-Edu, FineMath, Stack-Edu-Python, PubMed abstracts
- **Optimizer:** 8-bit AdamW
- **Hardware:** Mix of A100 SXM 80GB and H200 NVL GPUs over the course of the training run
## Benchmarks (float16; 0-shot unless noted)
| Benchmark | Score |
|---|---|
| TruthfulQA MC2 | 47.61% |
| HellaSwag (acc_norm) | 28.30% |
| ARC-Easy (acc_norm) | 43.06% |
| ARC-Challenge (acc_norm) | 21.84% |
| MMLU (5-shot acc) | 23.02% |
These numbers are typical for a small model trained on a limited token budget. The model has not seen enough data to develop deep academic knowledge: MMLU sits near the 25% random baseline for four-choice questions. It did absorb enough factual content for ARC-Easy to land at roughly 1.7x the ~25% chance level and for TruthfulQA to score above random on calibrated truthfulness, which is the strongest signal at this scale.
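
To put the multiple-choice scores on a common footing, each one can be divided by its chance level. The sketch below assumes a flat 25% baseline (uniform guessing over four options) for all four tasks. That is exact for MMLU and HellaSwag and only approximate for ARC, where a few questions have three or five options. TruthfulQA MC2 is left out because it is not a simple pick-one-of-N accuracy. The helper name `ratio_to_chance` is illustrative.

```python
# Scores copied from the benchmark table (percent).
scores = {
    "HellaSwag (acc_norm)": 28.30,
    "ARC-Easy (acc_norm)": 43.06,
    "ARC-Challenge (acc_norm)": 21.84,
    "MMLU (5-shot acc)": 23.02,
}

# Assumption: uniform guessing over four answer options.
CHANCE = 25.0


def ratio_to_chance(score: float, chance: float = CHANCE) -> float:
    """Return how many times the chance level a score represents."""
    return score / chance


for name, score in scores.items():
    print(f"{name:26s} {score:6.2f}%  {ratio_to_chance(score):.2f}x chance")
```

A ratio near 1.0 means the score is indistinguishable from guessing under this baseline. Only ARC-Easy clears it by a wide margin.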
## Related Models
- [GPUburnout/GPUburnout-3B-75K-Chat](https://huggingface.co/GPUburnout/GPUburnout-3B-75K-Chat) - SFT champion (step 1500, lr=5e-5, r=16)
- [GPUburnout/GPUburnout-3B-75K-Chat-step3000](https://huggingface.co/GPUburnout/GPUburnout-3B-75K-Chat-step3000) - Alternative SFT checkpoint (lowest val loss, not shipped)
- [GPUburnout/GPUburnout-3B-checkpoints](https://huggingface.co/GPUburnout/GPUburnout-3B-checkpoints) - Intermediate pretraining checkpoints
## Blog
This model is part of the GPUburnout LLM-from-scratch blog series. Season 4 documents the 3B build, including the platform pivot from Thunder to RunPod, MooseFS storage debugging, and the 75K-step pretraining run.
https://gpuburnout.com
## License
Apache 2.0. Free to use, modify, and redistribute. Attribution appreciated.