[Important] Some errors found in the model evaluation
I'm an LLM engineer. I reviewed your model evaluation today and found some errors; I also have a few questions.
In IMOAnswerBench, Gemini 3 Pro's score is 83.3, not 82.16 (refer to Step-3.5-Flash).
In AIME 2026, Kimi K2.5's score is 92.5, not 90.62 (refer to GLM5).
In HMMT Nov. 2025, GPT-5.2 (xhigh)'s score is 97.1, not 95.83 (refer to GLM5).
In LiveCodeBench v6, DeepSeek-V3.2's score should be 83.3, not 82.71; Gemini 3 Pro's score is 90.7, not 88.22; and Claude Opus 4.5's score is 84.8, not 83.70 (refer to Step-3.5-Flash).
Gaia2-search is missing results for many models, so the accuracy of the reported metrics cannot be verified.
τ²-Bench
The scores for DeepSeek-V3.2, Kimi K2.5, Claude Opus 4.5, Gemini 3 Pro, and GPT-5.2 (xhigh) should be 85.2, 85.4, 92.5, 90.7, and 85.5, respectively. Your reported results are significantly lower than StepFun's; similar conclusions can be drawn from the Step-3.5-Flash and GLM5 reports.
In SWE-Bench Verified, GPT-5.2-thinking's score is not as low as 71.8; it should be 80.0. This is a very serious error (refer to the Kimi K2.5 report).
ARC-AGI-v2 is also missing results for most other models.
Discussion is welcome.