---
license: apache-2.0
language:
- en
tags:
- llama
- pretrained
- from-scratch
- gpuburnout
pipeline_tag: text-generation
---

# GPUburnout-3B-75K

A 3.12-billion-parameter Llama-style decoder-only transformer, pretrained from scratch as the Season 4 model in the GPUburnout blog series. This is the final pretraining checkpoint, at 75,000 steps.

This is the base model, before any instruction tuning. For the chat-tuned version, see [GPUburnout/GPUburnout-3B-75K-Chat](https://huggingface.co/GPUburnout/GPUburnout-3B-75K-Chat) (the shipped SFT champion).
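
A minimal loading sketch with the `transformers` library. The repository ID `GPUburnout/GPUburnout-3B-75K` is an assumption inferred from the chat model's URL, and the helper name `complete`, the default `max_new_tokens`, and the greedy decoding setting are illustrative choices rather than recommended settings.

```python
MODEL_ID = "GPUburnout/GPUburnout-3B-75K"  # assumed repo ID, not confirmed by this card


def complete(prompt: str, max_new_tokens: int = 64) -> str:
    """Greedy-decode a continuation of `prompt` with the base model (hypothetical helper)."""
    # Imported lazily so that defining this function does not require the heavy dependencies.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Use float16 on GPU (the precision listed in this card); fall back to float32 on CPU.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    dtype = torch.float16 if device == "cuda" else torch.float32

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=dtype).to(device)
    model.eval()

    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling `complete("The three states of matter are")` downloads the weights on first use. Because this is a base model, it continues the prompt rather than following instructions.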

## Model Details

- **Architecture:** Llama-style decoder-only transformer
- **Parameters:** 3.12B
- **Hidden size:** 2,560
- **Intermediate size:** 10,240
- **Layers:** 32
- **Attention heads:** 40 query, 10 key/value (GQA)
- **Head dim:** 64
- **Vocab size:** 32,005
- **Max position:** 2,048
- **Tie word embeddings:** True
- **Precision:** float16
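
As a sanity check, the parameter count can be reproduced from the dimensions above. This is a sketch that assumes the standard Llama layout (bias-free linear layers, a three-matrix SwiGLU MLP, and RMSNorm weights), none of which is stated explicitly in this card.

```python
# Values from the Model Details list.
hidden = 2560
intermediate = 10240
layers = 32
q_heads, kv_heads, head_dim = 40, 10, 64
vocab = 32005

# Assumptions (standard Llama layout, not confirmed by this card):
# bias-free linear layers, a three-matrix SwiGLU MLP, and RMSNorm weights.
attn = (
    hidden * q_heads * head_dim           # q_proj
    + 2 * hidden * kv_heads * head_dim    # k_proj and v_proj (GQA: 10 KV heads)
    + q_heads * head_dim * hidden         # o_proj
)
mlp = 3 * hidden * intermediate           # gate_proj, up_proj, down_proj
norms = 2 * hidden                        # two RMSNorm weight vectors per layer
per_layer = attn + mlp + norms

embeddings = vocab * hidden               # counted once because the embeddings are tied
final_norm = hidden

total_params = layers * per_layer + embeddings + final_norm
fp16_gb = total_params * 2 / 1e9          # 2 bytes per float16 parameter

print(f"{total_params:,} parameters (~{total_params / 1e9:.2f}B)")
print(f"~{fp16_gb:.2f} GB of weights in float16")
```

Under these assumptions the total is 3,122,969,600 parameters, which matches the stated 3.12B and corresponds to roughly 6.25 GB of float16 weights.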

## Training

- **Steps:** 75,000
- **Final val loss:** 2.2475
- **Training cost:** ~$425 (RunPod A100/H200)
- **Pretraining data mix:** FineWeb-Edu, FineMath, Stack-Edu-Python, PubMed abstracts
- **Optimizer:** 8-bit AdamW
- **Hardware:** A mix of A100 SXM 80GB and H200 NVL GPUs across the training run
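
For intuition, the final validation loss can be converted to perplexity and bits per token. This sketch assumes the reported loss is a mean per-token cross-entropy in nats, which is the usual convention but is not stated in this card.

```python
import math

val_loss = 2.2475  # final validation loss from the Training list

# Assumption: the loss is mean per-token cross-entropy measured in nats.
perplexity = math.exp(val_loss)
bits_per_token = val_loss / math.log(2)

print(f"perplexity ~ {perplexity:.2f}, bits per token ~ {bits_per_token:.2f}")
```

Under that assumption, a loss of 2.2475 corresponds to a perplexity of about 9.46, or about 3.24 bits per token, on the validation set.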

## Benchmarks (0-shot unless noted, float16)

| Benchmark | Score |
|---|---|
| TruthfulQA MC2 | 47.61% |
| HellaSwag (acc_norm) | 28.30% |
| ARC-Easy (acc_norm) | 43.06% |
| ARC-Challenge (acc_norm) | 21.84% |
| MMLU (5-shot acc) | 23.02% |

These are typical numbers for a small model trained on a limited token budget. The model has not seen enough data to develop deep academic knowledge: MMLU is near the 25% random baseline. It absorbed enough factual content for ARC-Easy to land at roughly 1.7x random and for TruthfulQA to score above random on calibrated truthfulness, which is the strongest signal at this scale.
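
The comparison against chance can be made explicit with a few lines of arithmetic. This sketch assumes a 25% chance level for every listed task, treating each as 4-way multiple choice; that is only approximate for ARC, where a few questions have three or five options. TruthfulQA MC2 is left out because it has no comparably simple fixed chance level.

```python
# Scores from the benchmark table, in percent.
scores = {
    "HellaSwag": 28.30,
    "ARC-Easy": 43.06,
    "ARC-Challenge": 21.84,
    "MMLU": 23.02,
}

# Assumption: every task is treated as 4-way multiple choice (25% chance).
CHANCE = 25.0

ratios = {name: score / CHANCE for name, score in scores.items()}
for name, ratio in ratios.items():
    print(f"{name:<14} {ratio:.2f}x chance")
```

With these inputs, ARC-Easy comes out at about 1.72x chance and MMLU at about 0.92x, while HellaSwag and ARC-Challenge sit close to the chance level.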

## Related Models

- [GPUburnout/GPUburnout-3B-75K-Chat](https://huggingface.co/GPUburnout/GPUburnout-3B-75K-Chat) - SFT champion (step 1500, lr=5e-5, r=16)
- [GPUburnout/GPUburnout-3B-75K-Chat-step3000](https://huggingface.co/GPUburnout/GPUburnout-3B-75K-Chat-step3000) - Alternative SFT checkpoint (lowest val loss, not shipped)
- [GPUburnout/GPUburnout-3B-checkpoints](https://huggingface.co/GPUburnout/GPUburnout-3B-checkpoints) - Intermediate pretraining checkpoints

## Blog

This model is part of the GPUburnout LLM-from-scratch blog series. Season 4 documents the 3B build, including the platform pivot from Thunder to RunPod, MooseFS storage debugging, and the 75K-step pretraining run.

https://gpuburnout.com

## License

Apache 2.0. Free to use, modify, and redistribute. Attribution appreciated.