Does the evaluation framework use mini-swe-agent?

#102
by zhang7471 - opened

When I tried to reproduce the results on the multilingual evaluation set for Kimi-K2.5 and DeepSeek-V3.2, my scores differed from those in the official report by 5 to 10 points. Was a different method used for the evaluation?
