When I tried to reproduce the multilingual evaluation results for Kimi-K2.5 and DeepSeek-V3.2, my scores differed from those in the official report by 5 to 10 points. Was a different evaluation method used?