
Program Execution

LLM solutions are regarded as failed when they time out, exceed the memory limit (OOM), or otherwise crash during execution.

Time Limits

We set the timeout $T=\max(T_{base}, T_{gt}\times k)$ where:

  • $T_{base}$ is the minimal timeout (configurable via --min-time-limit; defaults to 4s);
  • $T_{gt}$ is the runtime of the ground-truth solutions (obtained by profiling);
  • $k$ is a configurable factor, --gt-time-limit-factor (defaults to 4).

If your machine is slow and you are getting high-variance results, try using a larger $k$ and $T_{base}$.
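
For example, if profiling gives $T_{gt}=1.5$s with the defaults $k=4$ and $T_{base}=4$s, then $T=\max(4, 1.5\times 4)=6$s. A minimal sketch of loosening both knobs on a slow machine (the flag values are illustrative, and samples.jsonl stands in for your own sample file):

evalplus.evaluate --dataset humaneval --samples samples.jsonl --min-time-limit 10 --gt-time-limit-factor 8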

Memory Limits

  • Default behavior: the memory limit per process is min(4GB, system_maximum).
  • Environment variable EVALPLUS_MAX_MEMORY_BYTES (see the example below):
    • -1 means no limit.
    • Otherwise, the limit is set to the specified value in bytes.
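
For example, to cap each process at 8GB (8589934592 bytes), or to lift the limit entirely (a sketch; samples.jsonl stands in for your own sample file):

EVALPLUS_MAX_MEMORY_BYTES=8589934592 evalplus.evaluate --dataset humaneval --samples samples.jsonl
# ...Or with no memory limit ⚠️
# EVALPLUS_MAX_MEMORY_BYTES=-1 evalplus.evaluate --dataset humaneval --samples samples.jsonl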


Parallelism

You are NOT encouraged to over-stress your test-bed while running the evaluation. For example, using --parallel 64 on a 4-core machine, or running other heavy workloads during the evaluation, are bad ideas and can distort timing results.

Tips for Fast Evaluation

If you do greedy decoding, where there is only one sample per task, the evaluation should take just a few seconds. Running 200 samples × 164 tasks × ~700+ tests can take around 2-10 minutes with --parallel 64 and --test-details. Here are some tips to speed up the evaluation:

  • Use --parallel $(nproc)
  • Do NOT use --test-details if you just want to quickly get pass@k: --test-details runs all tests (700+ on average for each task), whereas without it, testing of a sample stops as soon as it fails its first test (see the sketch after this list).
  • Use our pre-evaluated results (see LLM-generated code)
  • Use HumanEval+ Mini
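
Combining the first two tips, a typical fast pass@k run looks like this (a sketch; samples.jsonl stands in for your own sample file):

evalplus.evaluate --dataset humaneval --samples samples.jsonl --parallel $(nproc)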

🚀 Try out HumanEvalPlus-Mini! It selects a minimal set of additional tests with the highest quality, achieving almost the same effectiveness as the full version. Just add the --mini flag and it can run 23+% faster! (Even faster if you evaluate all tests without fail-stop via --test-details.)

docker run -v $(pwd):/app ganler/evalplus:latest --dataset humaneval --samples samples.jsonl --mini
# ...Or locally ⚠️
# evalplus.evaluate --dataset humaneval --samples samples.jsonl --mini