GPUburnout-1B

A 1.04-billion-parameter Llama-style language model trained from scratch on 11.8B tokens for roughly $175 of GPU compute.

Model Details

  • Architecture: Llama-style decoder-only transformer
  • Parameters: 1.04B
  • Hidden dim: 2048
  • Layers: 16
  • Attention: GQA (32 query heads, 8 KV heads)
  • FFN: SwiGLU (intermediate 8192)
  • Position encoding: RoPE (theta=500000)
  • Context length: 2048 tokens
  • Vocabulary: 32,005 tokens (BPE + 5 special tokens)
  • Weight tying: Yes (embedding + LM head)
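
For reference, the hyperparameters above map onto a standard Hugging Face LlamaConfig roughly as follows. This is a minimal sketch: the field values mirror the list above, but the exact contents of the released config.json are not reproduced here.

from transformers import LlamaConfig, LlamaForCausalLM

# Sketch of a config matching the listed hyperparameters (assumed, not copied from the repo)
config = LlamaConfig(
    vocab_size=32005,             # 32,000 BPE tokens + 5 special tokens
    hidden_size=2048,
    num_hidden_layers=16,
    num_attention_heads=32,       # query heads
    num_key_value_heads=8,        # GQA: 8 KV heads
    intermediate_size=8192,       # SwiGLU FFN
    max_position_embeddings=2048,
    rope_theta=500000.0,
    tie_word_embeddings=True,     # shared embedding / LM head
)

model = LlamaForCausalLM(config)
# Tied weights are counted once; this lands at roughly 1.04B parameters
print(sum(p.numel() for p in model.parameters()) / 1e9)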

Training

  • Data: 11.8B tokens (FineWeb-Edu 85%, Python-Edu 4.2%, FineMath 10.8%)
  • Hardware: A100 SXM 80GB on RunPod
  • Steps: 90,000
  • Final loss: 2.494
  • Total cost: ~$175 of GPU compute
  • Throughput: ~28,535 tokens/sec
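
As a back-of-envelope check, the throughput and token count above imply roughly 115 GPU-hours, which is consistent with the ~$175 bill at typical A100 rental rates. The sketch below uses only the listed figures; the per-step batch size and hourly GPU rate are inferred, not stated in the card.

# Back-of-envelope consistency check using the figures listed above
tokens_total = 11.8e9
tokens_per_sec = 28_535
steps = 90_000

train_hours = tokens_total / tokens_per_sec / 3600
print(f"~{train_hours:.0f} GPU-hours")                  # ~115 hours

tokens_per_step = tokens_total / steps
print(f"~{tokens_per_step:,.0f} tokens/step")           # ~131,000, i.e. ~64 sequences of 2048 tokens

implied_rate = 175 / train_hours
print(f"implied GPU rate ~${implied_rate:.2f}/hour")    # ~$1.52/hour (inferred, not stated)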

Benchmarks (0-shot)

Benchmark       Metric     Score   Random baseline
ARC-Easy        acc        47.1%   25%
HellaSwag       acc_norm   28.8%   25%
ARC-Challenge   acc_norm   23.3%   25%
MMLU            acc        23.0%   25%
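
The metric names (acc, acc_norm) follow lm-evaluation-harness conventions. Assuming that harness was used, a 0-shot run of the following shape should approximate the numbers above; the exact harness version and arguments are not stated in the card.

# Sketch of a 0-shot evaluation with lm-evaluation-harness (assumed tooling, not confirmed by the card)
from lm_eval import simple_evaluate

results = simple_evaluate(
    model="hf",
    model_args="pretrained=GPUburnout/GPUburnout-1B,dtype=float16",
    tasks=["arc_easy", "arc_challenge", "hellaswag", "mmlu"],
    num_fewshot=0,
)
print(results["results"])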

Tokenizer

The tokenizer includes ChatML special tokens reserved for future supervised fine-tuning (SFT):

  • <|im_start|> (32000), <|im_end|> (32001)
  • <|system|> (32002), <|user|> (32003), <|assistant|> (32004)
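
These tokens are already in the vocabulary, but the released checkpoint is pretraining-only, so no chat template ships with it yet. Below is an illustrative sketch of looking up the reserved IDs and assembling a ChatML-style prompt by hand; the role formatting is one plausible layout, not an official template from the repo.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("GPUburnout/GPUburnout-1B")

# The special tokens map to the reserved IDs listed above
for tok in ["<|im_start|>", "<|im_end|>", "<|system|>", "<|user|>", "<|assistant|>"]:
    print(tok, tokenizer.convert_tokens_to_ids(tok))

# Illustrative ChatML-style prompt built by hand (not an official chat template)
prompt = (
    "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
    "<|im_start|>user\nWhat is the capital of France?<|im_end|>\n"
    "<|im_start|>assistant\n"
)
ids = tokenizer(prompt, return_tensors="pt").input_ids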

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load the checkpoint in half precision (F16 weights)
model = AutoModelForCausalLM.from_pretrained("GPUburnout/GPUburnout-1B", torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained("GPUburnout/GPUburnout-1B")

# Greedy completion of a short prompt
inputs = tokenizer("The capital of France is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
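
Since this is a base (pretraining-only) checkpoint with a 2048-token context, prompts work best as plain completions rather than chat instructions, and sampling usually reads better than greedy decoding. An illustrative variant, continuing from the snippet above (the sampling values are placeholders, not tuned settings):

outputs = model.generate(
    **inputs,
    max_new_tokens=50,
    do_sample=True,     # sampled decoding instead of greedy
    temperature=0.8,    # illustrative value, not a tuned setting
    top_p=0.95,         # illustrative value, not a tuned setting
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))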

Blog

Full training journey documented at gpuburnout.com

Author

Jun Park (@GPUburnout)
