	---
	license: apache-2.0
	language:
	- en
	tags:
	- llama
	- pretrained
	- from-scratch
	- gpuburnout
	pipeline_tag: text-generation
	---

	# GPUburnout-3B-75K

	A 3.12-billion-parameter Llama-style decoder-only transformer, pretrained from scratch as the Season 4 model in the GPUburnout blog series. This is the final pretraining checkpoint, taken at 75,000 steps.

	This is the base model, before any instruction tuning. For the chat-tuned version, see [GPUburnout/GPUburnout-3B-75K-Chat](https://huggingface.co/GPUburnout/GPUburnout-3B-75K-Chat) (the shipped SFT champion); the alternative SFT checkpoint is listed under Related Models.
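A minimal loading sketch using the Hugging Face `transformers` library. The repository id `GPUburnout/GPUburnout-3B-75K` is an assumption inferred from the model name and the chat model's URL, and the helper name `generate_text` is hypothetical. Because this is a base model, it continues text rather than following instructions.

```python
# Assumed repo id (inferred from the model name; not confirmed by this card).
REPO_ID = "GPUburnout/GPUburnout-3B-75K"


def generate_text(prompt: str, max_new_tokens: int = 64) -> str:
    """Load the base model in float16 and return the prompt plus its continuation."""
    # Imported lazily so this file can be imported without torch installed.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(REPO_ID)
    model = AutoModelForCausalLM.from_pretrained(REPO_ID, torch_dtype=torch.float16)

    # float16 inference is intended for a GPU; CPU fallback may be slow.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model.to(device)
    model.eval()

    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling `generate_text("The three states of matter are")` downloads the weights on first use. Prompt and continuation together should stay within the 2,048-token maximum position.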

	## Model Details

	- Architecture: Llama-style decoder-only transformer
	- Parameters: 3.12B
	- Hidden size: 2,560
	- Intermediate size: 10,240
	- Layers: 32
	- Attention heads: 40 query, 10 key/value (GQA)
	- Head dim: 64
	- Vocab size: 32,005
	- Max position: 2,048
	- Tie word embeddings: True
	- Precision: float16
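
As a sanity check, the parameter count can be reproduced from the dimensions listed above. This is a sketch under assumptions the card does not state: bias-free linear layers, a three-matrix gated MLP (gate, up, down) as in Llama, and two RMSNorm weight vectors per layer plus a final norm.

```python
# Values from the Model Details list.
hidden = 2560
intermediate = 10240
layers = 32
q_heads, kv_heads, head_dim = 40, 10, 64
vocab = 32005

# Assumptions (not stated in the card): no biases, gated 3-matrix MLP,
# two RMSNorm weight vectors per layer and one final norm.
attn = hidden * (q_heads * head_dim)          # q_proj
attn += 2 * hidden * (kv_heads * head_dim)    # k_proj and v_proj (GQA: 10 KV heads)
attn += (q_heads * head_dim) * hidden         # o_proj
mlp = 3 * hidden * intermediate               # gate, up, down projections
norms = 2 * hidden
per_layer = attn + mlp + norms

embeddings = vocab * hidden   # counted once because input/output embeddings are tied
final_norm = hidden
total = layers * per_layer + embeddings + final_norm

print(f"per layer: {per_layer:,}")
print(f"total:     {total:,} (~{total / 1e9:.2f}B)")
```

Under these assumptions the total comes to 3,122,969,600, which agrees with the stated 3.12B. Note that 40 query heads at head dim 64 give exactly the hidden size of 2,560.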

	## Training

	- Steps: 75,000
	- Final val loss: 2.2475
	- Training cost: ~$425 (RunPod A100/H200)
	- Pretraining data mix: FineWeb-Edu, FineMath, Stack-Edu-Python, PubMed abstracts
	- Optimizer: 8-bit AdamW
	- Hardware: mix of A100 SXM 80GB and H200 NVL GPUs over the course of the training run
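
Two quantities follow directly from the figures above. This short sketch converts the validation loss to perplexity and spreads the approximate cost over the run; it assumes the reported loss is mean per-token cross-entropy in nats, which the card does not state explicitly.

```python
import math

val_loss = 2.2475   # final validation loss from the list above
steps = 75_000
cost_usd = 425.0    # approximate total cost from the list above

# Assumption: the loss is mean per-token cross-entropy measured in nats.
perplexity = math.exp(val_loss)
cost_per_1k_steps = cost_usd / (steps / 1000)

print(f"perplexity: {perplexity:.2f}")
print(f"cost per 1,000 steps: ${cost_per_1k_steps:.2f}")
```

That works out to a validation perplexity of roughly 9.5 and a cost of under $6 per 1,000 steps.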

	## Benchmarks (0-shot unless noted, float16)

	| Benchmark | Score |
	|---|---|
	| TruthfulQA MC2 | 47.61% |
	| HellaSwag (acc_norm) | 28.30% |
	| ARC-Easy (acc_norm) | 43.06% |
	| ARC-Challenge (acc_norm) | 21.84% |
	| MMLU (5-shot acc) | 23.02% |

	These numbers are typical for a small model trained on a limited token budget. The model has not seen enough data to develop deep academic knowledge: MMLU at 23.02% is near the 25% random baseline for four-choice questions, and ARC-Challenge and HellaSwag are close to chance as well. It did absorb enough factual content for ARC-Easy to land at about 1.7x random and for TruthfulQA MC2 to score above random, which is the strongest signal at this scale.
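
To make the comparison against chance concrete, the sketch below divides each score by an assumed 25% random baseline. The baseline is an assumption: MMLU and HellaSwag are four-choice tasks, and ARC questions have four options in most cases. TruthfulQA MC2 is left out because its chance level is not a single fixed fraction.

```python
# Scores from the benchmark table, in percent.
scores = {
    "HellaSwag": 28.30,
    "ARC-Easy": 43.06,
    "ARC-Challenge": 21.84,
    "MMLU": 23.02,
}

# Assumed chance accuracy for four-way multiple choice.
ASSUMED_CHANCE = 25.0

ratios = {name: score / ASSUMED_CHANCE for name, score in scores.items()}

# Print from strongest to weakest relative to chance.
for name, ratio in sorted(ratios.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<14} {ratio:.2f}x random")
```

Under this assumption only ARC-Easy clears chance by a wide margin (about 1.72x), while MMLU and ARC-Challenge fall slightly below 1.0x.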

	## Related Models

	- [GPUburnout/GPUburnout-3B-75K-Chat](https://huggingface.co/GPUburnout/GPUburnout-3B-75K-Chat) - SFT champion (step 1500, lr=5e-5, r=16)
	- [GPUburnout/GPUburnout-3B-75K-Chat-step3000](https://huggingface.co/GPUburnout/GPUburnout-3B-75K-Chat-step3000) - Alternative SFT checkpoint (lowest val loss, not shipped)
	- [GPUburnout/GPUburnout-3B-checkpoints](https://huggingface.co/GPUburnout/GPUburnout-3B-checkpoints) - Intermediate pretraining checkpoints

	## Blog

	This model is part of the GPUburnout LLM-from-scratch blog series. Season 4 documents the 3B build, including the platform pivot from Thunder to RunPod, MooseFS storage debugging, and the 75K-step pretraining run.

	https://gpuburnout.com

	## License

	Apache 2.0. Free to use, modify, and redistribute. Attribution appreciated.