Cannot reproduce the claimed 96.3% AIME 2025 score

by NonoRiri-7 - opened Apr 3


Hi, thanks for releasing this checkpoint. I'm unable to reproduce the reported AIME 2025 score of 96.3%. I consistently get ~90% with both vLLM v0.18.1 and SGLang.

Setup:

Hardware: 8xB200
Sampling: temperature=1.0, top_p=0.95, max_tokens=98304, avg@32 (following Kimi-K2.5 model card)
Inference: vLLM (latest) and SGLang; both give the same result

Per-question avg@32 scores (INT4 baseline vs NVFP4):

Q	INT4	NVFP4	Δ (NVFP4 - INT4)
0	1.000	1.000	0
1	1.000	0.969	-0.031
2	1.000	1.000	0
3	1.000	1.000	0
4	1.000	1.000	0
5	1.000	1.000	0
6	1.000	0.969	-0.031
7	1.000	1.000	0
8	1.000	1.000	0
9	1.000	0.969	-0.031
10	1.000	1.000	0
11	1.000	1.000	0
12	0.813	0.688	-0.125
13	0.531	0.281	-0.250
14	0.281	0.188	-0.094
15	1.000	0.969	-0.031
16	1.000	1.000	0
17	1.000	1.000	0
18	1.000	1.000	0
19	0.969	1.000	+0.031
20	1.000	1.000	0
21	1.000	1.000	0
22	1.000	1.000	0
23	1.000	1.000	0
24	1.000	0.938	-0.063
25	1.000	1.000	0
26	1.000	1.000	0
27	0.844	0.719	-0.125
28	0.938	0.969	+0.031
29	0.844	0.281	-0.563
Total	94.1%	89.8%	-4.3 pp

The gap is concentrated in four questions: Q12, Q13, Q27, and Q29.
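For anyone who wants to check the arithmetic, here is a minimal sketch that recomputes the totals from the per-question table above and measures how much of the net gap comes from those four questions. The values are copied from the table; the helper name `mean_pct` and the variable names are only illustrative.

```python
# Per-question avg@32 accuracy copied from the table above (list index = question number).
INT4 = [
    1.000, 1.000, 1.000, 1.000, 1.000, 1.000, 1.000, 1.000, 1.000, 1.000,
    1.000, 1.000, 0.813, 0.531, 0.281, 1.000, 1.000, 1.000, 1.000, 0.969,
    1.000, 1.000, 1.000, 1.000, 1.000, 1.000, 1.000, 0.844, 0.938, 0.844,
]
NVFP4 = [
    1.000, 0.969, 1.000, 1.000, 1.000, 1.000, 0.969, 1.000, 1.000, 0.969,
    1.000, 1.000, 0.688, 0.281, 0.188, 0.969, 1.000, 1.000, 1.000, 1.000,
    1.000, 1.000, 1.000, 1.000, 0.938, 1.000, 1.000, 0.719, 0.969, 0.281,
]


def mean_pct(scores):
    """Average per-question accuracy, expressed as a percentage."""
    return 100.0 * sum(scores) / len(scores)


int4_total = mean_pct(INT4)
nvfp4_total = mean_pct(NVFP4)

# Per-question difference (NVFP4 minus INT4); negative means NVFP4 is worse.
deltas = [n - i for i, n in zip(INT4, NVFP4)]

# Share of the net gap contributed by the four questions named above.
focus = [12, 13, 27, 29]
focus_gap = sum(deltas[q] for q in focus)
net_gap = sum(deltas)
focus_share = focus_gap / net_gap

if __name__ == "__main__":
    print(f"INT4:  {int4_total:.1f}%")
    print(f"NVFP4: {nvfp4_total:.1f}%")
    print(f"Gap:   {nvfp4_total - int4_total:+.1f} pp")
    print(f"Share of net gap from Q{focus}: {focus_share:.0%}")
```

With these numbers, the four questions account for roughly 83% of the net gap. The rest is spread across single-sample differences (1/32, about 0.031) on a handful of other questions, plus Q14 and Q24.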

Could you share:

The exact vLLM version/docker image used for evaluation?
Any additional flags beyond what's in the model card?

Happy to provide more details or logs if helpful.
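To make version comparisons easier, below is a small sketch of how installed package versions could be captured on either side. It uses only the standard library; the package list (`vllm`, `sglang`, `torch`, `transformers`) is an assumption about what is relevant here and can be adjusted.

```python
import platform
from importlib import metadata


def collect_versions(packages=("vllm", "sglang", "torch", "transformers")):
    """Return a dict mapping package names to installed versions.

    The default package list is an assumption; pass your own tuple as needed.
    """
    versions = {"python": platform.python_version()}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Package is not installed in this environment.
            versions[name] = "not installed"
    return versions


if __name__ == "__main__":
    for name, version in collect_versions().items():
        print(f"{name}: {version}")
```

Pasting this output alongside the docker image tag and launch command should be enough to spot any version mismatch.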
