File size: 6,886 Bytes
24c2665
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
# EvalPlus Commands

* `evalplus.codegen`: Code generation + Code post-processing
* `evalplus.evaluate`: Code generation + Code post-processing + Evaluation
* `evalplus.sanitize`: Code post-processing

## Code Generation

`evalplus.codegen` support following backends:

- `vllm`: Set `--model` as Hugging Face model ID such as `microsoft/Phi-3-mini-128k-instruct`
- `hf`: HuggingFace Transformers; same way to setup `--model`
- `openai`: Configure `OPENAI_API_KEY`; one can configure `--base-url`
- `anthropic`: Configure `ANTHROPIC_API_KEY`
- `google`: Configure `GOOGLE_API_KEY`
- `bedrock`: Configure `BEDROCK_ROLE_ARN`
- `gptqmodel`: Set quantized `--model` as Hugging Face model ID such as `ModelCloud/Qwen2.5-Coder-32B-Instruct-gptqmodel-4bit-vortex-v1`

```shell
evalplus.codegen --model "mistralai/Mistral-7B-Instruct-v0.3" --greedy --root [result_path] --dataset [mbpp|humaneval] --backend [vllm|hf|openai|...]
```

To perform code generation using user-defined tasks and datasets:

```shell
# Override HumanEval datasets
HUMANEVAL_OVERRIDE_PATH="/path/to/HumanEvalPlus.jsonl.gz" evalplus.codegen --model "mistralai/Mistral-7B-Instruct-v0.3" --greedy --root [result_path] --dataset humaneval --backend [vllm|hf|openai|...]
# Override MBPP datasets
MBPP_OVERRIDE_PATH="/path/to/MbppPlus.jsonl.gz" evalplus.codegen --model "mistralai/Mistral-7B-Instruct-v0.3" --greedy --root [result_path] --dataset mbpp --backend [vllm|hf|openai|...]
```

## Customized Code Generation

You can perform your own code generation from scratch by doing something like this:

```python
from evalplus.data import get_[human_eval|mbpp]_plus, write_jsonl

samples = [
    dict(task_id=task_id, solution=GEN_SOLUTION(problem["prompt"]))
    for task_id, problem in get_[human_eval|mbpp]_plus().items()
]
write_jsonl("samples.jsonl", samples)
```

> [!Note]
>
> The main structure of `problem` is as follows:
>
> - `task_id` is the identifier string for the task
> - `entry_point` is name of the function
> - `prompt` is the function signature with docstring
> - `canonical_solution` is the ground-truth implementation (re-implemented to fix bugs in HumanEval)
> - `base_input` is the test inputs in original HumanEval
> - `plus_input` is the test inputs brought by EvalPlus

> [!Note]
>
> **Expected Schema of `samples.jsonl`**
>
> 1. `task_id`: Task ID, which are the keys of `get_[human_eval|mbpp]_plus()`
> 2. `solution` (optional): Self-contained solution (usually including the prompt)
>    - Example: `{"task_id": "HumanEval/?", "solution": "def f():\n    return 1"}`
> 3. `completion` (optional): Function body without prompt
>    - Example: `{"task_id": "HumanEval/?", "completion": "    return 1"}`
>
> Only one of `solution` and `completion` is required. If both are provided, `solution` will be used.
> We also accept solutions in the form of directory, i.e., `--samples ${SAMPLE_DIR}` where `${SAMPLE_DIR}` is organized as: `${SAMPLE_DIR}/${TASK_ID}/{SAMPLE_ID}.py` (`${TASK_ID} = task_id.replace("/", "_")`).

## Code post-processing

> [!Note]
>
> This step is by default performed in `evalplus.codegen`.
> Yet, you might want to use it if you have generated the code using other tools.

LLM-generated text may not be compilable code for including natural language lines or incomplete extra code.
We provide a tool namely `evalplus.sanitize` to clean up the code:

```shell
# πŸ’‘ If you are storing codes in jsonl:
evalplus.sanitize --samples samples.jsonl
# Sanitized code will be produced to `samples-sanitized.jsonl`

# πŸ’‘ If you are storing codes in directories:
evalplus.sanitize --samples /path/to/vicuna-[??]b_temp_[??]
# Sanitized code will be produced to `/path/to/vicuna-[??]b_temp_[??]-sanitized`
```

<details><summary>πŸ”Ž Checking the compilability of post-processed code<i>:: click to expand ::</i></summary>
<div>

To double-check the post-processing results, you can use `evalplus.syncheck` to check the code validity before and after sanitization, which will print erroneous code snippets and why they are wrong:

```shell
# πŸ’‘ If you are storing codes in jsonl:
evalplus.syncheck --samples samples.jsonl --dataset [humaneval|mbpp]

# πŸ’‘ If you are storing codes in directories:
evalplus.syncheck --samples /path/to/vicuna-[??]b_temp_[??] --dataset [humaneval|mbpp]
```

</div>
</details>



## Code Evaluation

You are strongly recommended to use a sandbox such as [docker](https://docs.docker.com/get-docker/):

```bash
docker run --rm --pull=always -v $(pwd)/evalplus_results:/app ganler/evalplus:latest \
           evalplus.evaluate --dataset humaneval                                     \
           --samples /app/humaneval/ise-uiuc--Magicoder-S-DS-6.7B_vllm_temp_0.0.jsonl
```

...Or if you want to try it locally regardless of the risks ⚠️:

```bash
evalplus.evaluate --dataset [humaneval|mbpp] --samples samples.jsonl
```

To use a user-defined dataset locally, you can set `HUMANEVAL_OVERRIDE_PATH` or `MBPP_OVERRIDE_PATH`:

```bash
HUMANEVAL_OVERRIDE_PATH="/path/to/HumanEvalPlus.jsonl.gz" evalplus.evaluate --dataset humaneval --samples samples.jsonl
```

> [!Tip]
>
> Program execution can be configured. See [Program Execution in EvalPlus and EvalPerf](./docs/execution.md).

<details><summary>πŸ€” Evaluate with local GitHub repo? <i>:: click to expand ::</i></summary>
<div>

```bash
export PYTHONPATH=$PYTHONPATH:$(pwd)
python evalplus/evaluate.py --dataset humaneval --samples samples.jsonl
```

</div>
</details>

<details><summary>⌨️ More command-line flags <i>:: click to expand ::</i></summary>
<div>

- `--parallel`: by default half of the cores
- `--base-only` (store_ture): only run base HumanEval tests
- `--i-just-wanna-run`: force a re-run

</div>
</details>

The output should be like (below is GPT-4 greedy decoding example):

```
Computing expected output...
Expected outputs computed in 15.18s
Reading samples...
164it [00:04, 37.79it/s]
Evaluating samples...
100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 164/164 [00:03<00:00, 44.75it/s]
Base
{'pass@1': 0.8841463414634146}
Base + Extra
{'pass@1': 0.768}
```

- `Base` is the `pass@k` for the original HumanEval
- `Base + Extra` is the `pass@k` for the our **HumanEval+** (with extra tests)
- The "k" includes `[1, 10, 100]` where k values `<=` the sample size will be used
- A cache file named like `samples_eval_results.jsonl` will be cached. Remove it to re-run the evaluation

## Test input generation using EvalPlus

Please check `evalplus/inputgen.py`.

## Useful tools

We provide some useful tools for curation, visualization, and analysis of the EvalPlus datasets in the `tools/` directory.
To use these tools, please first install the repository from GitHub:

```bash
git clone https://github.com/evalplus/evalplus.git
cd evalplus
pip install -r tools/requirements.txt
```