GPUburnout-3B-75K-Chat-v2

3B parameter Llama-style chat model trained from scratch by Jun Park (GPUburnout).

This is v2 of the 3B-Chat. The original GPUburnout-3B-75K-Chat was trained with a plain-concat SFT data formatter instead of the proper apply_chat_template flow. That bug caused loop collapse at inference and made the 3B underperform the 2B on ARC-Easy by 8 points.

v2 is the same base pretraining (GPUburnout-3B-75K) with the correct chat-template SFT recipe applied (Run A from the recipe-parity ablation: lr=2e-4 + apply_chat_template). It wins on 5 of 6 benchmarks vs the 2B chat model.

Full backstory: It Took Me Two Weeks to Read My Own Code

Benchmarks

Metric 2B-Chat 3B-Chat (v1, retired) 3B-Chat-v2 (this)
TruthfulQA 42.42 42.43 43.54
IFEval 17.03 25.18 19.78
HellaSwag 46.20 46.60 47.79
ARC-Easy 58.12 49.83 59.89
ARC-Challenge 32.76 32.34 34.13
MMLU 25.81 24.93 25.15

(The v1 +25.18 IFEval is a format-mismatch artifact. The honest v2 number is -3.36 vs base, consistent with the linear trend across 1B and 2B.)

Files

  • model.safetensors and tokenizer files: full-precision HF format
  • GPUburnout-3B-75K-Chat-v2-f16.gguf: GGUF f16 for llama.cpp
  • GPUburnout-3B-75K-Chat-v2-Q4_K_M.gguf: Q4_K_M quantized GGUF (~1.9 GB, recommended for CPU inference)

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer
tok = AutoTokenizer.from_pretrained("GPUburnout/GPUburnout-3B-75K-Chat-v2")
model = AutoModelForCausalLM.from_pretrained("GPUburnout/GPUburnout-3B-75K-Chat-v2")

Or via llama.cpp / llama-cpp-python with the Q4_K_M GGUF.

Downloads last month
145
Safetensors
Model size
3B params
Tensor type
BF16
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Space using GPUburnout/GPUburnout-3B-75K-Chat-v2 1