Cannot reproduce claimed 96.3 AIME 2025 score
#7
by NonoRiri-7 - opened
Hi, thanks for releasing this checkpoint. I'm unable to reproduce the AIME 2025 score of 96.3% and consistently get ~90% across both vLLM v0.18.1 and SGLang.
Setup:
- Hardware: 8xB200
- Sampling: temperature=1.0, top_p=0.95, max_tokens=98304, avg@32 (following Kimi-K2.5 model card)
- Inference: vLLM latest + SGLang, both give the same result
Per-question avg@32 scores (INT4 baseline vs NVFP4):
| Q | INT4 | NVFP4 | Δ |
|---|---|---|---|
| 0 | 1.000 | 1.000 | 0 |
| 1 | 1.000 | 0.969 | -0.031 |
| 2 | 1.000 | 1.000 | 0 |
| 3 | 1.000 | 1.000 | 0 |
| 4 | 1.000 | 1.000 | 0 |
| 5 | 1.000 | 1.000 | 0 |
| 6 | 1.000 | 0.969 | -0.031 |
| 7 | 1.000 | 1.000 | 0 |
| 8 | 1.000 | 1.000 | 0 |
| 9 | 1.000 | 0.969 | -0.031 |
| 10 | 1.000 | 1.000 | 0 |
| 11 | 1.000 | 1.000 | 0 |
| 12 | 0.813 | 0.688 | -0.125 |
| 13 | 0.531 | 0.281 | -0.250 |
| 14 | 0.281 | 0.188 | -0.094 |
| 15 | 1.000 | 0.969 | -0.031 |
| 16 | 1.000 | 1.000 | 0 |
| 17 | 1.000 | 1.000 | 0 |
| 18 | 1.000 | 1.000 | 0 |
| 19 | 0.969 | 1.000 | +0.031 |
| 20 | 1.000 | 1.000 | 0 |
| 21 | 1.000 | 1.000 | 0 |
| 22 | 1.000 | 1.000 | 0 |
| 23 | 1.000 | 1.000 | 0 |
| 24 | 1.000 | 0.938 | -0.063 |
| 25 | 1.000 | 1.000 | 0 |
| 26 | 1.000 | 1.000 | 0 |
| 27 | 0.844 | 0.719 | -0.125 |
| 28 | 0.938 | 0.969 | +0.031 |
| 29 | 0.844 | 0.281 | -0.563 |
| Total | 94.1% | 89.8% | -4.3% |
The gap concentrates on (Q12, Q13, Q27, Q29).
Could you share:
- The exact vLLM version/docker image used for evaluation?
- Any additional flags beyond what's in the model card?
Happy to provide more details or logs if helpful.