Evaluation methodology for HumanEval

#15
by viplismism - opened

Hi @shunxing1234 and @arieldeng,

Congrats on the impressive KAT-Coder results!
I'm trying to replicate your HumanEval evaluation to benchmark my own models. Could you clarify:

  1. What temperature did you use for HumanEval (96.3% in Table 1)?
  2. Is this pass@1 or pass@k? If pass@k, what's k and n?
  3. Did you use the Python HumanEval or MultiPL-E (Rust or other languages)?
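For context on question 2: pass@k on HumanEval is conventionally reported with the unbiased estimator from the original HumanEval paper (Chen et al., 2021), where n samples are drawn per problem and c of them pass the tests. A minimal sketch of that estimator (the function name and example numbers here are illustrative, not from the KAT-Coder report):

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k).

    n: total samples generated per problem
    c: number of samples that pass the unit tests
    k: the k in pass@k
    """
    if n - c < k:
        # Fewer than k incorrect samples: every size-k draw contains a pass.
        return 1.0
    # Stable product form of 1 - C(n-c, k) / C(n, k).
    return 1.0 - math.prod(1.0 - k / i for i in range(n - c + 1, n + 1))

# Example: 10 samples, 2 correct -> pass@1 estimate of 0.2
print(pass_at_k(10, 2, 1))
```

Knowing n, k, and the sampling temperature is enough to reproduce a reported score with this estimator, which is why all three matter.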

[Screenshot attached: 2025-11-21 at 9.55.56 PM]

I would also be interested to see whether you can replicate these numbers. The model is smart, but it does not work with Roo Code, Cline, etc., and the devs do not update or fix it.
