GPUburnout-1B-160K

A 1.04 billion parameter Llama-style language model trained from scratch on a Chinchilla-optimal budget of 20.97B tokens.

This is the 160K step (Chinchilla-optimal) checkpoint. For the earlier 90K step checkpoint, see GPUburnout-1B.

Model Details

  • Architecture: Llama-style decoder-only transformer
  • Parameters: 1.04B
  • Hidden dim: 2048
  • Layers: 16
  • Attention: GQA (32 query heads, 8 KV heads)
  • FFN: SwiGLU (intermediate 8192)
  • Position encoding: RoPE (theta=500000)
  • Context length: 2048 tokens
  • Vocabulary: 32,005 tokens (BPE + 5 special tokens)
  • Weight tying: Yes (embedding + LM head)
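As a sanity check, the parameter count can be reproduced from the configuration above. The sketch below assumes standard Llama conventions (no biases, two RMSNorm weight vectors per layer, head dim = hidden / query heads); it is an estimate of the layout, not a dump of the actual checkpoint.

```python
# Estimate parameter count from the architecture listed above.
# Assumes standard Llama conventions: no biases, RMSNorm, head_dim = hidden // n_heads.
hidden, layers, ffn = 2048, 16, 8192
n_heads, n_kv_heads = 32, 8
vocab = 32005
head_dim = hidden // n_heads  # 64

# Attention: Q/O project hidden -> hidden; K/V project hidden -> (kv_heads * head_dim).
attn = 2 * hidden * hidden + 2 * hidden * (n_kv_heads * head_dim)
# SwiGLU FFN: gate + up (hidden -> ffn each) plus down (ffn -> hidden).
mlp = 3 * hidden * ffn
# Two RMSNorm weight vectors per layer.
norms = 2 * hidden

per_layer = attn + mlp + norms
# The embedding is shared with the LM head (weight tying), so it is counted
# once; add the final RMSNorm.
total = layers * per_layer + vocab * hidden + hidden
print(f"{total / 1e9:.2f}B parameters")  # ~1.04B
```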

Training

  • Data: 20.97B tokens (FineWeb-Edu 85%, Python-Edu 4.2%, FineMath 10.8%)
  • Hardware: A100 SXM 80GB on RunPod
  • Steps: 160,000 (Chinchilla-optimal: ~20 tokens per parameter)
  • Final loss: 2.446
  • Throughput: ~30,500 tokens/sec
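The token budget is plain arithmetic: 20 tokens per parameter for a 1.04B model is ~20.8B tokens, and 160K steps covers exactly 20.97B tokens if each step processes 2048 × 64 = 131,072 tokens. The global batch of 64 sequences is an assumption inferred from the numbers above, not stated in the card.

```python
# Chinchilla budget check. The per-step token count assumes a global batch of
# 64 sequences x 2048-token context; the batch size is a guess that makes the
# totals line up, not a documented setting.
params = 1.04e9
chinchilla_tokens = 20 * params            # ~20.8B tokens
tokens_per_step = 64 * 2048                # 131,072
total_tokens = 160_000 * tokens_per_step   # 20,971,520,000 = 20.97B
print(f"target ~{chinchilla_tokens / 1e9:.1f}B, trained {total_tokens / 1e9:.2f}B")
```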

Training Phases

Phase                        Steps    Loss    Cost
Phase 1 (smoke test)         200      ~6-7    ~$0.50
Phase 2 (proof of life)      10K      2.93    ~$22
Phase 3                      60K      2.57    ~$94
Phase 4                      90K      2.494   ~$61
Phase 5 (spot)               120K     2.530   ~$34
Phase 6 (spot, Chinchilla)   160K     2.446   —

Tokenizer

Includes ChatML special tokens for SFT:

  • <|im_start|> (32000), <|im_end|> (32001)
  • <|system|> (32002), <|user|> (32003), <|assistant|> (32004)
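These tokens let a later SFT stage format conversations in ChatML style. A minimal sketch of one plausible template, using `<|im_start|>`/`<|im_end|>` with plain role names (the card also reserves dedicated role tokens, so the exact intended format may differ):

```python
# Build a ChatML-style prompt from the special tokens above.
# Illustrative template only; the author's exact SFT format is not specified.
def chatml(messages):
    parts = []
    for role, content in messages:
        parts.append(f"<|im_start|>{role}\n{content}<|im_end|>\n")
    parts.append("<|im_start|>assistant\n")  # cue the model to respond
    return "".join(parts)

prompt = chatml([("system", "You are a helpful assistant."),
                 ("user", "What is the capital of France?")])
print(prompt)
```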

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load in fp16 (the checkpoint ships F16 safetensors).
model = AutoModelForCausalLM.from_pretrained("GPUburnout/GPUburnout-1B-160K", torch_dtype="float16")
tokenizer = AutoTokenizer.from_pretrained("GPUburnout/GPUburnout-1B-160K")

# Greedy completion of a short prompt.
inputs = tokenizer("The capital of France is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Blog

Full training journey documented at gpuburnout.com

Author

Jun Park (@GPUburnout)
