# Program Execution

LLM solutions are regarded as **failed** on timeout, out-of-memory (OOM), and similar errors.

## Time Limits

We set the timeout to $T=\max(T_{base}, T_{gt}\times k)$, where:

- $T_{base}$ is the minimum timeout (configurable via `--min-time-limit`; defaults to 4s);
- $T_{gt}$ is the runtime of the ground-truth solutions (measured via profiling);
- $k$ is a configurable factor set via `--gt-time-limit-factor` (defaults to 4).

If your machine is slow and you are getting high-variance results, try a larger $k$ and $T_{base}$.
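
As a quick illustration of the formula above, here is a minimal sketch in Python. The function name `compute_time_limit` and its parameter names are hypothetical (chosen to mirror the CLI flags); this is not EvalPlus's actual implementation.

```python
def compute_time_limit(
    gt_time: float,
    min_time_limit: float = 4.0,        # T_base, mirrors --min-time-limit
    gt_time_limit_factor: float = 4.0,  # k, mirrors --gt-time-limit-factor
) -> float:
    """Return T = max(T_base, T_gt * k), in seconds."""
    return max(min_time_limit, gt_time * gt_time_limit_factor)


# Fast ground truth: 0.05s * 4 = 0.2s, so the 4s floor applies.
print(compute_time_limit(0.05))  # 4.0
# Slow ground truth: 2.5s * 4 = 10s, which exceeds the floor.
print(compute_time_limit(2.5))   # 10.0
```

For most tasks the ground truth is fast, so $T_{base}$ is the limit that applies; $k$ only matters for tasks whose ground truth takes longer than $T_{base}/k$.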

## Memory Limits

- **Default behavior**: The memory limit per process is `min(4GB, system_maximum)`.
- Environment variable `EVALPLUS_MAX_MEMORY_BYTES`:
  - `-1` means no limit.
  - Otherwise, the limit is set to the specified value in bytes.
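
The rules above can be sketched as a small resolver. This is an illustrative sketch only: the function `resolve_memory_limit`, the `system_maximum` argument, and the reading of "4GB" as 4 GiB are assumptions, not EvalPlus's actual code.

```python
import os
from typing import Mapping, Optional

# Assumption: "4GB" is interpreted as 4 GiB here.
DEFAULT_LIMIT_BYTES = 4 * 1024**3


def resolve_memory_limit(
    system_maximum: int,
    env: Optional[Mapping[str, str]] = None,
) -> Optional[int]:
    """Return the per-process limit in bytes, or None for "no limit"."""
    env = os.environ if env is None else env
    raw = env.get("EVALPLUS_MAX_MEMORY_BYTES")
    if raw is None:
        # Variable unset: fall back to min(4GB, system_maximum).
        return min(DEFAULT_LIMIT_BYTES, system_maximum)
    value = int(raw)
    if value == -1:
        # -1 disables the limit.
        return None
    # Any other value is taken as a byte count.
    return value


print(resolve_memory_limit(2 * 1024**3, env={}))  # 2147483648
print(resolve_memory_limit(64 * 1024**3, env={"EVALPLUS_MAX_MEMORY_BYTES": "-1"}))  # None
```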

Related discussion threads:

- https://github.com/evalplus/evalplus/pull/225

## Parallelism

Do **NOT** overload your test bed while running the evaluation.
For example, using `--parallel 64` on a 4-core machine or running other workloads during evaluation is a bad idea.

## Tips for Fast Evaluation

With greedy decoding, where there is only one sample per task, the evaluation should take just a few seconds.
Running 200 samples x 164 tasks x ~700+ tests can take around 2-10 minutes with `--parallel 64` and `--test-details`.
Here are some tips to speed up the evaluation:

- Use `--parallel $(nproc)`
- Do **NOT** use `--test-details` if you just want to get pass@k quickly: with `--test-details`, all tests are run (700+ on average per task), whereas without it, testing for a sample stops as soon as it fails its first test.
- Use our pre-evaluated results (see [LLM-generated code](#-LLM-generated-code))
- Use HumanEval+ Mini
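
To make the fail-stop behavior behind the `--test-details` tip concrete, here is a minimal sketch. The `run_tests` helper and its `test_details` parameter are hypothetical stand-ins for illustration, not EvalPlus's actual API.

```python
from typing import Callable, List, Sequence


def run_tests(tests: Sequence[Callable[[], bool]], test_details: bool = False) -> List[bool]:
    """Run tests in order; without test_details, stop at the first failure."""
    results: List[bool] = []
    for test in tests:
        passed = test()
        results.append(passed)
        if not passed and not test_details:
            break  # fail-stop: the sample is already known to be wrong
    return results


tests = [lambda: True, lambda: False, lambda: True]
print(run_tests(tests))                     # [True, False]
print(run_tests(tests, test_details=True))  # [True, False, True]
```

Either way the sample counts as failed; fail-stop only skips the work of running the remaining tests.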

</div>
</details>

> [!Tip]
>
> 🚀 **Try out `HumanEvalPlus-Mini`!** It selects a _minimal_ set of additional tests with the highest quality, achieving almost the same effectiveness as the full version. Just add the **`--mini`** flag and it can run 23+% faster (even faster if you evaluate all tests without fail-stop using `--test-details`).
>
> ```bash
> docker run -v $(pwd):/app ganler/evalplus:latest --dataset humaneval --samples samples.jsonl --mini
> # ...Or locally ⚠️
> # evalplus.evaluate --dataset humaneval --samples samples.jsonl --mini
> ```