GPUburnout-1B-160K

A 1.04 billion parameter Llama-style language model trained from scratch on a Chinchilla-optimal budget of 20.97B tokens.

This is the 160K step (Chinchilla-optimal) checkpoint. For the earlier 90K step checkpoint, see GPUburnout-1B.

Model Details

  • Architecture: Llama-style decoder-only transformer
  • Parameters: 1.04B
  • Hidden dim: 2048
  • Layers: 16
  • Attention: GQA (32 query heads, 8 KV heads)
  • FFN: SwiGLU (intermediate 8192)
  • Position encoding: RoPE (theta=500000)
  • Context length: 2048 tokens
  • Vocabulary: 32,005 tokens (BPE + 5 special tokens)
  • Weight tying: Yes (embedding + LM head)
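As a sanity check, the parameter count can be reproduced from the configuration above. The sketch below assumes standard Llama conventions (no biases, two RMSNorm weight vectors per layer, head dim = hidden / query heads); it is an estimate of the layout, not a dump of the actual checkpoint.

```python
# Estimate parameter count from the architecture listed above.
# Assumes standard Llama conventions: no biases, RMSNorm, head_dim = hidden // n_heads.
hidden, layers, ffn = 2048, 16, 8192
n_heads, n_kv_heads = 32, 8
vocab = 32005
head_dim = hidden // n_heads  # 64

# Attention: Q/O project hidden -> hidden; K/V project hidden -> (kv_heads * head_dim).
attn = 2 * hidden * hidden + 2 * hidden * (n_kv_heads * head_dim)
# SwiGLU FFN: gate + up (hidden -> ffn each) plus down (ffn -> hidden).
mlp = 3 * hidden * ffn
# Two RMSNorm weight vectors per layer.
norms = 2 * hidden

per_layer = attn + mlp + norms
# The embedding is shared with the LM head (weight tying), so it is counted
# once; add the final RMSNorm.
total = layers * per_layer + vocab * hidden + hidden
print(f"{total / 1e9:.2f}B parameters")  # ~1.04B
```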

Training

  • Data: 20.97B tokens (FineWeb-Edu 85%, Python-Edu 4.2%, FineMath 10.8%)
  • Hardware: A100 SXM 80GB on RunPod
  • Steps: 160,000 (Chinchilla-optimal: ~20 tokens per parameter)
  • Final loss: 2.446
  • Throughput: ~30,500 tokens/sec
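The token budget is plain arithmetic: 20 tokens per parameter for a 1.04B model is ~20.8B tokens, and 160K steps covers exactly 20.97B tokens if each step processes 2048 × 64 = 131,072 tokens. The global batch of 64 sequences is an assumption inferred from the numbers above, not stated in the card.

```python
# Chinchilla budget check. The per-step token count assumes a global batch of
# 64 sequences x 2048-token context; the batch size is a guess that makes the
# totals line up, not a documented setting.
params = 1.04e9
chinchilla_tokens = 20 * params            # ~20.8B tokens
tokens_per_step = 64 * 2048                # 131,072
total_tokens = 160_000 * tokens_per_step   # 20,971,520,000 = 20.97B
print(f"target ~{chinchilla_tokens / 1e9:.1f}B, trained {total_tokens / 1e9:.2f}B")
```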

Training Phases

Phase                        Steps    Loss    Cost
Phase 1 (smoke test)         200      ~6-7    ~$0.50
Phase 2 (proof of life)      10K      2.93    ~$22
Phase 3                      60K      2.57    ~$94
Phase 4                      90K      2.494   ~$61
Phase 5 (spot)               120K     2.530   ~$34
Phase 6 (spot, Chinchilla)   160K     2.446   —

Tokenizer

Includes ChatML special tokens for SFT:

  • <|im_start|> (32000), <|im_end|> (32001)
  • <|system|> (32002), <|user|> (32003), <|assistant|> (32004)
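These tokens let a later SFT stage format conversations in ChatML style. A minimal sketch of one plausible template, using `<|im_start|>`/`<|im_end|>` with plain role names (the card also reserves dedicated role tokens, so the exact intended format may differ):

```python
# Build a ChatML-style prompt from the special tokens above.
# Illustrative template only; the author's exact SFT format is not specified.
def chatml(messages):
    parts = []
    for role, content in messages:
        parts.append(f"<|im_start|>{role}\n{content}<|im_end|>\n")
    parts.append("<|im_start|>assistant\n")  # cue the model to respond
    return "".join(parts)

prompt = chatml([("system", "You are a helpful assistant."),
                 ("user", "What is the capital of France?")])
print(prompt)
```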

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load in fp16 (the checkpoint ships F16 safetensors).
model = AutoModelForCausalLM.from_pretrained("GPUburnout/GPUburnout-1B-160K", torch_dtype="float16")
tokenizer = AutoTokenizer.from_pretrained("GPUburnout/GPUburnout-1B-160K")

# Greedy completion of a short prompt.
inputs = tokenizer("The capital of France is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Blog

Full training journey documented at gpuburnout.com

Author

Jun Park (@GPUburnout)
