Benchmark,Base %,Distilled %,Std Dev
AIME 2024,1.5,35.2,0.8
MATH-500,25.0,89.1,1.2
GSM8K,65.0,92.8,0.5
GPQA Diamond,28.0,45.5,1.5
LiveCodeBench,15.0,32.5,2.1
HumanEval,55.0,82.3,1.8