# EvalPlus Commands
* `evalplus.codegen`: Code generation + Code post-processing
* `evalplus.evaluate`: Code generation + Code post-processing + Evaluation
* `evalplus.sanitize`: Code post-processing
## Code Generation
`evalplus.codegen` supports the following backends:
- `vllm`: Set `--model` as Hugging Face model ID such as `microsoft/Phi-3-mini-128k-instruct`
- `hf`: HuggingFace Transformers; set `--model` the same way
- `openai`: Configure `OPENAI_API_KEY`; optionally set `--base-url`
- `anthropic`: Configure `ANTHROPIC_API_KEY`
- `google`: Configure `GOOGLE_API_KEY`
- `bedrock`: Configure `BEDROCK_ROLE_ARN`
- `gptqmodel`: Set quantized `--model` as Hugging Face model ID such as `ModelCloud/Qwen2.5-Coder-32B-Instruct-gptqmodel-4bit-vortex-v1`
```shell
evalplus.codegen --model "mistralai/Mistral-7B-Instruct-v0.3" --greedy --root [result_path] --dataset [mbpp|humaneval] --backend [vllm|hf|openai|...]
```
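For example, a run against the `openai` backend might look like this (the model name and endpoint below are placeholders, not defaults prescribed by EvalPlus):
```shell
# Hypothetical example: replace the key, model name, and endpoint with your own
export OPENAI_API_KEY="sk-..."
evalplus.codegen --model "gpt-4o-mini" --greedy --root results \
                 --dataset humaneval --backend openai \
                 --base-url "https://api.openai.com/v1"
```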
To perform code generation using user-defined tasks and datasets:
```shell
# Override HumanEval datasets
HUMANEVAL_OVERRIDE_PATH="/path/to/HumanEvalPlus.jsonl.gz" evalplus.codegen --model "mistralai/Mistral-7B-Instruct-v0.3" --greedy --root [result_path] --dataset humaneval --backend [vllm|hf|openai|...]
# Override MBPP datasets
MBPP_OVERRIDE_PATH="/path/to/MbppPlus.jsonl.gz" evalplus.codegen --model "mistralai/Mistral-7B-Instruct-v0.3" --greedy --root [result_path] --dataset mbpp --backend [vllm|hf|openai|...]
```
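The override file is expected to follow the same JSONL(.gz) schema as the released datasets. As a hedged sketch (not an official recipe), you could start from the released problems, patch them, and re-save them; whether every field round-trips unchanged this way is an assumption worth verifying:
```python
import gzip
import json

from evalplus.data import get_human_eval_plus

problems = get_human_eval_plus()
# ... modify `problems` here, e.g., tweak a prompt or add extra inputs ...

# Write the patched dataset; point HUMANEVAL_OVERRIDE_PATH at this file afterwards
with gzip.open("/tmp/MyHumanEvalPlus.jsonl.gz", "wt") as f:
    for problem in problems.values():
        f.write(json.dumps(problem) + "\n")
```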
## Customized Code Generation
You can perform your own code generation from scratch by doing something like this:
```python
from evalplus.data import get_human_eval_plus, write_jsonl  # or get_mbpp_plus for MBPP

# GEN_SOLUTION is your own generator: it takes a prompt string and
# returns a self-contained solution string
samples = [
    dict(task_id=task_id, solution=GEN_SOLUTION(problem["prompt"]))
    for task_id, problem in get_human_eval_plus().items()
]
write_jsonl("samples.jsonl", samples)
```
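As one possible sketch, `GEN_SOLUTION` could be backed by an OpenAI-compatible chat API (the model name here is a placeholder, not something EvalPlus prescribes):
```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def GEN_SOLUTION(prompt: str) -> str:
    # The model name is a placeholder; any chat-completion model works the same way
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": f"Complete this Python function:\n\n{prompt}"}],
    )
    return resp.choices[0].message.content
```
Since raw model output may contain explanations or markdown fences, run `evalplus.sanitize` on the resulting `samples.jsonl` afterwards (see below).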
> [!Note]
>
> The main structure of `problem` is as follows:
>
> - `task_id` is the identifier string for the task
> - `entry_point` is the name of the function
> - `prompt` is the function signature with docstring
> - `canonical_solution` is the ground-truth implementation (re-implemented to fix bugs in HumanEval)
> - `base_input` is the test inputs from the original HumanEval
> - `plus_input` is the extra test inputs contributed by EvalPlus

> [!Note]
>
> **Expected Schema of `samples.jsonl`**
>
> 1. `task_id`: Task ID, i.e., one of the keys of `get_human_eval_plus()` / `get_mbpp_plus()`
> 2. `solution` (optional): Self-contained solution (usually including the prompt)
> - Example: `{"task_id": "HumanEval/?", "solution": "def f():\n return 1"}`
> 3. `completion` (optional): Function body without prompt
> - Example: `{"task_id": "HumanEval/?", "completion": " return 1"}`
>
> Only one of `solution` and `completion` is required. If both are provided, `solution` will be used.
> We also accept solutions in the form of a directory, i.e., `--samples ${SAMPLE_DIR}`, where `${SAMPLE_DIR}` is organized as `${SAMPLE_DIR}/${TASK_ID}/{SAMPLE_ID}.py` (with `${TASK_ID} = task_id.replace("/", "_")`).
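
For the directory format, a minimal sketch of producing such a layout (the directory name is arbitrary, and `canonical_solution` is used only as a stand-in for model output):
```python
import os

from evalplus.data import get_human_eval_plus

SAMPLE_DIR = "my_samples"  # hypothetical output directory
for task_id, problem in get_human_eval_plus().items():
    task_dir = os.path.join(SAMPLE_DIR, task_id.replace("/", "_"))
    os.makedirs(task_dir, exist_ok=True)
    # 0.py is the first sample for this task; write more files for more samples
    with open(os.path.join(task_dir, "0.py"), "w") as f:
        f.write(problem["prompt"] + problem["canonical_solution"])
```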
## Code post-processing
> [!Note]
>
> This step is by default performed in `evalplus.codegen`.
> However, you may still want to run it separately if you generated the code using other tools.
LLM-generated text may not be compilable code, since it can include natural-language lines or incomplete extra code.
We provide a tool named `evalplus.sanitize` to clean up the code:
```shell
# 💡 If you are storing codes in jsonl:
evalplus.sanitize --samples samples.jsonl
# Sanitized code will be produced to `samples-sanitized.jsonl`
# 💡 If you are storing codes in directories:
evalplus.sanitize --samples /path/to/vicuna-[??]b_temp_[??]
# Sanitized code will be produced to `/path/to/vicuna-[??]b_temp_[??]-sanitized`
```
<details><summary>🔎 Checking the compilability of post-processed code? <i>:: click to expand ::</i></summary>
<div>
To double-check the post-processing results, you can use `evalplus.syncheck` to check the code validity before and after sanitization, which will print erroneous code snippets and why they are wrong:
```shell
# 💡 If you are storing codes in jsonl:
evalplus.syncheck --samples samples.jsonl --dataset [humaneval|mbpp]
# 💡 If you are storing codes in directories:
evalplus.syncheck --samples /path/to/vicuna-[??]b_temp_[??] --dataset [humaneval|mbpp]
```
</div>
</details>
## Code Evaluation
We strongly recommend using a sandbox such as [Docker](https://docs.docker.com/get-docker/):
```bash
docker run --rm --pull=always -v $(pwd)/evalplus_results:/app ganler/evalplus:latest \
evalplus.evaluate --dataset humaneval \
--samples /app/humaneval/ise-uiuc--Magicoder-S-DS-6.7B_vllm_temp_0.0.jsonl
```
...Or if you want to try it locally regardless of the risks ⚠️:
```bash
evalplus.evaluate --dataset [humaneval|mbpp] --samples samples.jsonl
```
To use a user-defined dataset locally, you can set `HUMANEVAL_OVERRIDE_PATH` or `MBPP_OVERRIDE_PATH`:
```bash
HUMANEVAL_OVERRIDE_PATH="/path/to/HumanEvalPlus.jsonl.gz" evalplus.evaluate --dataset humaneval --samples samples.jsonl
```
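If you evaluate inside Docker with an overridden dataset, the override file must be visible inside the container and the variable passed through; a hedged example (paths are purely illustrative):
```bash
# Illustrative: mount the override file into the container and pass the env var
docker run --rm --pull=always \
  -v $(pwd)/evalplus_results:/app \
  -v /path/to/HumanEvalPlus.jsonl.gz:/app/HumanEvalPlus.jsonl.gz \
  -e HUMANEVAL_OVERRIDE_PATH=/app/HumanEvalPlus.jsonl.gz \
  ganler/evalplus:latest \
  evalplus.evaluate --dataset humaneval --samples /app/samples.jsonl
```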
> [!Tip]
>
> Program execution can be configured. See [Program Execution in EvalPlus and EvalPerf](./docs/execution.md).
<details><summary>🤔 Evaluate with local GitHub repo? <i>:: click to expand ::</i></summary>
<div>
```bash
export PYTHONPATH=$PYTHONPATH:$(pwd)
python evalplus/evaluate.py --dataset humaneval --samples samples.jsonl
```
</div>
</details>
<details><summary>⌨️ More command-line flags <i>:: click to expand ::</i></summary>
<div>
- `--parallel`: number of parallel evaluation processes (by default, half of the cores)
- `--base-only` (store_true): only run the base HumanEval tests
- `--i-just-wanna-run`: force a re-run
</div>
</details>
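For example, a quick re-run of only the base tests might look like:
```bash
evalplus.evaluate --dataset humaneval --samples samples.jsonl --base-only --i-just-wanna-run
```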
The output should look like the following (a GPT-4 greedy-decoding example):
```
Computing expected output...
Expected outputs computed in 15.18s
Reading samples...
164it [00:04, 37.79it/s]
Evaluating samples...
100%|██████████████████████████████████████████| 164/164 [00:03<00:00, 44.75it/s]
Base
{'pass@1': 0.8841463414634146}
Base + Extra
{'pass@1': 0.768}
```
- `Base` is the `pass@k` for the original HumanEval
- `Base + Extra` is the `pass@k` for our **HumanEval+** (with extra tests)
- The "k" values include `[1, 10, 100]`; only k values `<=` the sample size are used
- A cache file named like `samples_eval_results.jsonl` will be created. Remove it (or pass `--i-just-wanna-run`) to re-run the evaluation
## Test input generation using EvalPlus
Please check `evalplus/inputgen.py`.
## Useful tools
We provide some useful tools for curation, visualization, and analysis of the EvalPlus datasets in the `tools/` directory.
To use these tools, first clone the repository from GitHub and install the tool dependencies:
```bash
git clone https://github.com/evalplus/evalplus.git
cd evalplus
pip install -r tools/requirements.txt
```