Some errors were found in the model evaluation [Important]

#3
by Alicia-Ross - opened

I'm an LLM engineer. I reviewed your model's evaluation results today and found some errors; I also have a few questions.

In IMOAnswerBench, Gemini 3 Pro's score is 83.3, not 82.16 (refer to the Step-3.5-Flash report).

In AIME 2026, Kimi K2.5's score is 92.5, not 90.62 (refer to the GLM5 report).

In HMMT Nov. 2025, GPT-5.2 (xhigh)'s score is 97.1, not 95.83 (refer to the GLM5 report).

In LiveCodeBench v6, DeepSeek-V3.2's score should be 83.3, not 82.71; Gemini 3 Pro's should be 90.7, not 88.22; and Claude Opus 4.5's should be 84.8, not 83.70 (refer to the Step-3.5-Flash report).

Gaia2-search appears to include many models without published results, making it impossible to verify the accuracy of the reported metrics.

In τ²-Bench, the scores for DeepSeek-V3.2, Kimi K2.5, Claude Opus 4.5, Gemini 3 Pro, and GPT-5.2 (xhigh) should be 85.2, 85.4, 92.5, 90.7, and 85.5 respectively; your test results are significantly lower than StepFun's. Similar conclusions can be found in the Step-3.5-Flash and GLM5 reports.

In SWE-Bench Verified, GPT-5.2-thinking's score is not as low as 71.8; it should be 80.0. This is a very serious error (refer to the Kimi K2.5 report).

Reference results for other models on ARC-AGI-v2 are also scarce.
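
To make the discrepancies easier to scan, here is a minimal Python sketch that tabulates the disputed numbers quoted above and prints the deltas. All figures come from this post itself, not from any external source; τ²-Bench is omitted because the model card's own values are not quoted here.

```python
# Disputed scores quoted in this post.
# Format: benchmark -> model -> (score in the model card, reference score).
disputed = {
    "IMOAnswerBench": {"Gemini 3 Pro": (82.16, 83.3)},
    "AIME 2026": {"Kimi K2.5": (90.62, 92.5)},
    "HMMT Nov. 2025": {"GPT-5.2 (xhigh)": (95.83, 97.1)},
    "LiveCodeBench v6": {
        "DeepSeek-V3.2": (82.71, 83.3),
        "Gemini 3 Pro": (88.22, 90.7),
        "Claude Opus 4.5": (83.70, 84.8),
    },
    "SWE-Bench Verified": {"GPT-5.2-thinking": (71.8, 80.0)},
}

# Print one row per disputed entry with the gap between the two numbers.
for benchmark, models in disputed.items():
    for model, (reported, reference) in models.items():
        delta = reference - reported
        print(f"{benchmark:20s} {model:20s} "
              f"card={reported:6.2f} reference={reference:6.2f} "
              f"delta={delta:+.2f}")
```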

Discussion is welcome.
