Cannot reproduce claimed 96.3 AIME 2025 score

#7
by NonoRiri-7 - opened

Hi, thanks for releasing this checkpoint. I'm unable to reproduce the reported AIME 2025 score of 96.3%; I consistently get ~90% with both vLLM v0.18.1 and SGLang.

Setup:

  • Hardware: 8xB200
  • Sampling: temperature=1.0, top_p=0.95, max_tokens=98304, avg@32 (following Kimi-K2.5 model card)
  • Inference: vLLM latest + SGLang, both give the same result
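
To be explicit about the metric, here is a minimal sketch of how avg@32 is computed for a single question, assuming each of the 32 sampled completions is graded as simply correct or incorrect. The function name and the example flags are illustrative, not taken from any evaluation harness.

```python
def avg_at_k(correct_flags):
    """Return the fraction of the k sampled completions graded correct."""
    return sum(correct_flags) / len(correct_flags)


# Hypothetical grading results for one question: 31 of 32 samples correct.
flags = [True] * 31 + [False]
score = avg_at_k(flags)
print(round(score, 3))  # 0.969
```

The benchmark-level number is then the mean of these per-question values over all 30 questions.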

Per-question avg@32 scores (INT4 baseline vs NVFP4):

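| Q | INT4 | NVFP4 | Δ (NVFP4 - INT4) |
|---|---|---|---|
| 0 | 1.000 | 1.000 | 0 |
| 1 | 1.000 | 0.969 | -0.031 |
| 2 | 1.000 | 1.000 | 0 |
| 3 | 1.000 | 1.000 | 0 |
| 4 | 1.000 | 1.000 | 0 |
| 5 | 1.000 | 1.000 | 0 |
| 6 | 1.000 | 0.969 | -0.031 |
| 7 | 1.000 | 1.000 | 0 |
| 8 | 1.000 | 1.000 | 0 |
| 9 | 1.000 | 0.969 | -0.031 |
| 10 | 1.000 | 1.000 | 0 |
| 11 | 1.000 | 1.000 | 0 |
| 12 | 0.813 | 0.688 | -0.125 |
| 13 | 0.531 | 0.281 | -0.250 |
| 14 | 0.281 | 0.188 | -0.094 |
| 15 | 1.000 | 0.969 | -0.031 |
| 16 | 1.000 | 1.000 | 0 |
| 17 | 1.000 | 1.000 | 0 |
| 18 | 1.000 | 1.000 | 0 |
| 19 | 0.969 | 1.000 | +0.031 |
| 20 | 1.000 | 1.000 | 0 |
| 21 | 1.000 | 1.000 | 0 |
| 22 | 1.000 | 1.000 | 0 |
| 23 | 1.000 | 1.000 | 0 |
| 24 | 1.000 | 0.938 | -0.063 |
| 25 | 1.000 | 1.000 | 0 |
| 26 | 1.000 | 1.000 | 0 |
| 27 | 0.844 | 0.719 | -0.125 |
| 28 | 0.938 | 0.969 | +0.031 |
| 29 | 0.844 | 0.281 | -0.563 |
| Total | 94.1% | 89.8% | -4.3 pp |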

The gap is concentrated in four questions: Q12, Q13, Q27, and Q29.
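
As a sanity check on the totals and on where the gap comes from, the per-question numbers can be re-aggregated with a short script. This is only a sketch: the two lists are copied from the table above (already rounded to three decimals), and the variable names are illustrative.

```python
# Per-question avg@32 scores for Q0..Q29, copied from the table above.
int4 = [1.000] * 12 + [0.813, 0.531, 0.281] + [1.000] * 4 + [0.969] \
     + [1.000] * 7 + [0.844, 0.938, 0.844]
nvfp4 = [1.000, 0.969, 1.000, 1.000, 1.000, 1.000, 0.969, 1.000, 1.000, 0.969,
         1.000, 1.000, 0.688, 0.281, 0.188, 0.969, 1.000, 1.000, 1.000, 1.000,
         1.000, 1.000, 1.000, 1.000, 0.938, 1.000, 1.000, 0.719, 0.969, 0.281]
assert len(int4) == len(nvfp4) == 30

# Benchmark score is the mean of the per-question scores.
total_int4 = sum(int4) / len(int4)
total_nvfp4 = sum(nvfp4) / len(nvfp4)

# Per-question change, NVFP4 minus INT4 (negative means NVFP4 is worse).
deltas = [n - i for i, n in zip(int4, nvfp4)]

# The four questions with the largest drops, and their share of the net gap.
worst = sorted(range(len(deltas)), key=lambda q: deltas[q])[:4]
share = sum(deltas[q] for q in worst) / sum(deltas)

print(f"INT4  total: {100 * total_int4:.1f}%")
print(f"NVFP4 total: {100 * total_nvfp4:.1f}%")
print(f"Largest drops: {sorted(worst)}")
print(f"Share of net gap from those four: {share:.0%}")
```

Because the inputs are rounded, the share is approximate, but it shows that most of the net difference comes from those four questions rather than from a uniform drop.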

Could you share:

  1. The exact vLLM version/docker image used for evaluation?
  2. Any additional flags beyond what's in the model card?

Happy to provide more details or logs if helpful.
