# GPUburnout-1B-160K
A 1.04-billion-parameter Llama-style language model trained from scratch to the Chinchilla-optimal budget of 20.97B tokens.

This is the 160K-step (Chinchilla-optimal) checkpoint. For the earlier 90K-step checkpoint, see GPUburnout-1B.
## Model Details
- Architecture: Llama-style decoder-only transformer
- Parameters: 1.04B
- Hidden dim: 2048
- Layers: 16
- Attention: GQA (32 query heads, 8 KV heads)
- FFN: SwiGLU (intermediate 8192)
- Position encoding: RoPE (theta=500000)
- Context length: 2048 tokens
- Vocabulary: 32,005 tokens (BPE + 5 special tokens)
- Weight tying: Yes (embedding + LM head)
## Training
- Data: 20.97B tokens (FineWeb-Edu 85%, Python-Edu 4.2%, FineMath 10.8%)
- Hardware: A100 SXM 80GB on RunPod
- Steps: 160,000 (Chinchilla-optimal: ~20 tokens per parameter)
- Final loss: 2.446
- Throughput: ~30,500 tokens/sec
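At the reported throughput, the token budget implies roughly 8 days of A100 time. A back-of-envelope estimate (assumes a batch of 64 sequences × 2048 tokens = 131,072 tokens/step, consistent with 20.97B / 160K; ignores restarts and spot interruptions):

```python
# Back-of-envelope wall-clock estimate (batch size is an assumption).
tokens_per_step = 64 * 2048   # 131,072 tokens/step
steps = 160_000
throughput = 30_500           # tokens/sec, as reported

total_tokens = tokens_per_step * steps   # ~20.97B
seconds = total_tokens / throughput
print(f"{total_tokens/1e9:.2f}B tokens, ~{seconds/3600:.0f} GPU-hours "
      f"(~{seconds/86400:.1f} days)")
# → 20.97B tokens, ~191 GPU-hours (~8.0 days)
```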
## Training Phases
| Phase | Steps (cumulative) | Loss | Cost |
|---|---|---|---|
| Phase 1 (smoke test) | 200 | ~6-7 | ~$0.50 |
| Phase 2 (proof of life) | 10K | 2.93 | ~$22 |
| Phase 3 | 60K | 2.57 | ~$94 |
| Phase 4 | 90K | 2.494 | ~$61 |
| Phase 5 (spot) | 120K | 2.530 | ~$34 |
| Phase 6 (spot, Chinchilla) | 160K | 2.446 | — |
## Tokenizer
Includes ChatML special tokens for SFT:

- `<|im_start|>` (32000)
- `<|im_end|>` (32001)
- `<|system|>` (32002)
- `<|user|>` (32003)
- `<|assistant|>` (32004)
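For chat-style inference after SFT, prompts would be assembled with these tokens in the usual ChatML layout. A sketch (the `chatml` helper is hypothetical; the exact template used in fine-tuning may differ, e.g. it may use the extra `<|system|>`/`<|user|>`/`<|assistant|>` tokens directly):

```python
# Build a ChatML-style prompt string from (role, content) pairs.
# Hypothetical helper -- the model's actual fine-tuning template may differ.
def chatml(messages):
    parts = []
    for role, content in messages:
        parts.append(f"<|im_start|>{role}\n{content}<|im_end|>\n")
    parts.append("<|im_start|>assistant\n")  # cue the model to respond
    return "".join(parts)

prompt = chatml([("system", "You are a helpful assistant."),
                 ("user", "What is the capital of France?")])
print(prompt)
```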
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the 160K-step checkpoint in fp16
model = AutoModelForCausalLM.from_pretrained(
    "GPUburnout/GPUburnout-1B-160K", torch_dtype="float16"
)
tokenizer = AutoTokenizer.from_pretrained("GPUburnout/GPUburnout-1B-160K")

# Greedy-ish completion of a short prompt
inputs = tokenizer("The capital of France is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## Blog
Full training journey documented at gpuburnout.com
## Author
Jun Park (@GPUburnout)