|
|
Metadata-Version: 2.2 |
|
|
Name: evalplus |
|
|
Version: 0.4.0.dev33 |
|
|
Summary: EvalPlus for rigorous evaluation of LLM-synthesized code
|
|
Home-page: https://github.com/evalplus/evalplus |
|
|
License: Apache-2.0 |
|
|
Platform: any |
|
|
Classifier: Operating System :: OS Independent |
|
|
Classifier: Programming Language :: Python :: 3 |
|
|
Classifier: License :: OSI Approved :: Apache Software License |
|
|
Requires-Python: >=3.9 |
|
|
Description-Content-Type: text/markdown |
|
|
License-File: LICENSE |
|
|
Requires-Dist: wget>=3.2 |
|
|
Requires-Dist: tempdir>=0.7.1 |
|
|
Requires-Dist: multipledispatch>=0.6.0 |
|
|
Requires-Dist: appdirs>=1.4.4 |
|
|
Requires-Dist: numpy>=1.19.5 |
|
|
Requires-Dist: tqdm>=4.56.0 |
|
|
Requires-Dist: termcolor>=2.0.0 |
|
|
Requires-Dist: fire>=0.6.0 |
|
|
Requires-Dist: openai>=1.11.1 |
|
|
Requires-Dist: tree_sitter>=0.22.0 |
|
|
Requires-Dist: tree-sitter-python>=0.21.0 |
|
|
Requires-Dist: rich>=12.3.0 |
|
|
Requires-Dist: transformers>=4.43.0 |
|
|
Requires-Dist: anthropic>=0.34.1 |
|
|
Requires-Dist: google-generativeai>=0.7.2 |
|
|
Requires-Dist: datasets>=2.21.0 |
|
|
Requires-Dist: psutil>=5.9.0 |
|
|
Requires-Dist: boto3>=1.35.0 |
|
|
Provides-Extra: perf |
|
|
Requires-Dist: Pympler>=1.0.1; extra == "perf" |
|
|
Requires-Dist: cirron>=0.4; extra == "perf" |
|
|
Provides-Extra: vllm |
|
|
Requires-Dist: vllm>=0.5.1; extra == "vllm" |
|
|
Provides-Extra: gptqmodel |
|
|
Requires-Dist: gptqmodel>=1.4.1; extra == "gptqmodel" |
|
|
|
|
|
|
|
|
|
|
|
<p align="center"> |
|
|
<a href="https://evalplus.github.io"><img src="https://img.shields.io/badge/%F0%9F%8F%86-leaderboard-8A2BE2"></a> |
|
|
<a href="https://openreview.net/forum?id=1qvx610Cu7"><img src="https://img.shields.io/badge/EvalPlus-NeurIPS'23-a55fed.svg"></a> |
|
|
<a href="https://openreview.net/forum?id=IBCBMeAhmC"><img src="https://img.shields.io/badge/EvalPerf-COLM'24-a55fed.svg"></a> |
|
|
<a href="https://huggingface.co/evalplus/"><img src="https://img.shields.io/badge/π€%20Hugging%20Face-evalplus-%23ff8811.svg"></a> |
|
|
<a href="https://pypi.org/project/evalplus/"><img src="https://img.shields.io/pypi/v/evalplus?color=g"></a> |
|
|
<a href="https://hub.docker.com/r/ganler/evalplus" title="Docker"><img src="https://img.shields.io/docker/image-size/ganler/evalplus"></a> |
|
|
</p> |
|
|
|
|
|
<p align="center"> |
|
|
<a href="#-about">πAbout</a> β’ |
|
|
<a href="#-quick-start">π₯Quick Start</a> β’ |
|
|
<a href="#-llm-backends">πLLM Backends</a> β’ |
|
|
<a href="#-documents">πDocuments</a> β’ |
|
|
<a href="#-citation">πCitation</a> β’ |
|
|
<a href="#-acknowledgement">πAcknowledgement</a> |
|
|
</p> |
|
|
|
|
|
|
|
|
|
|
|
Who's using EvalPlus datasets? EvalPlus has been used by various LLM teams, including: |
|
|
|
|
|
* [Meta Llama 3.1 and 3.3](https://ai.meta.com/blog/meta-llama-3-1/) |
|
|
* [Allen AI TÜLU 1/2/3](https://github.com/allenai/open-instruct/blob/main/docs/tulu1_tulu2.md#benchmark-based-eval)
|
|
* [Qwen2.5-Coder](https://qwenlm.github.io/blog/qwen2.5-coder-family/) |
|
|
* [CodeQwen 1.5](https://qwenlm.github.io/blog/codeqwen1.5/) |
|
|
* [DeepSeek-Coder V2](https://arxiv.org/pdf/2406.11931) |
|
|
* [Qwen2](https://arxiv.org/pdf/2407.10671) |
|
|
* [Snowflake Arctic](https://www.snowflake.com/en/data-cloud/arctic/) |
|
|
* [StarCoder2](https://arxiv.org/pdf/2402.19173) |
|
|
* [Magicoder](https://arxiv.org/pdf/2312.02120) |
|
|
* [WizardCoder](https://arxiv.org/pdf/2306.08568) |
|
|
|
|
|
Below are the notable updates of EvalPlus:
|
|
|
|
|
- **[2024-10-20 `v0.3.1`]**: EvalPlus `v0.3.1` is officially released! Highlights: *(i)* Code efficiency evaluation via EvalPerf, *(ii)* one command to run all: generation + post-processing + evaluation, *(iii)* support for more inference backends such as Google Gemini & Anthropic, etc. |
|
|
- **[2024-06-09 pre `v0.3.0`]**: Improved ground-truth solutions for MBPP+ tasks (IDs: 459, 102, 559). Thanks to [EvalArena](https://github.com/crux-eval/eval-arena). |
|
|
- **[2024-04-17 pre `v0.3.0`]**: MBPP+ is upgraded to `v0.2.0` by removing some broken tasks (399 -> 378 tasks). ~4pp pass@1 improvement could be expected. |
|
|
|
|
|
<details><summary>Earlier news <i>:: click to expand ::</i></summary> |
|
|
<div> |
|
|
|
|
|
- ([`v0.2.1`](https://github.com/evalplus/evalplus/releases/tag/v0.2.1)) You can use EvalPlus datasets via [bigcode-evaluation-harness](https://github.com/bigcode-project/bigcode-evaluation-harness)! HumanEval+ oracle fixes (32). |
|
|
- ([`v0.2.0`](https://github.com/evalplus/evalplus/releases/tag/v0.2.0)) MBPP+ is released! HumanEval contract & input fixes (0/3/9/148/114/1/2/99/28/32/35/160). |
|
|
- ([`v0.1.7`](https://github.com/evalplus/evalplus/releases/tag/v0.1.7)) [Leaderboard](https://evalplus.github.io/leaderboard.html) release; HumanEval+ contract and input fixes (32/166/126/6) |
|
|
- ([`v0.1.6`](https://github.com/evalplus/evalplus/releases/tag/v0.1.6)) Configurable and by-default-conservative timeout settings; HumanEval+ contract & ground-truth fixes (129/148/75/53/0/3/9/140) |
|
|
- ([`v0.1.5`](https://github.com/evalplus/evalplus/releases/tag/v0.1.5)) HumanEval+ mini is released for ultra-fast evaluation when you have too many samples! |
|
|
- ([`v0.1.1`](https://github.com/evalplus/evalplus/releases/tag/v0.1.1)) Optimizing user experiences: evaluation speed, PyPI package, Docker, etc. |
|
|
- ([`v0.1.0`](https://github.com/evalplus/evalplus/releases/tag/v0.1.0)) HumanEval+ is released! |
|
|
|
|
|
</div> |
|
|
</details> |
|
|
|
|
|
|
|
|
## 🔎 About
|
|
|
|
|
EvalPlus is a rigorous evaluation framework for LLM4Code, with: |
|
|
|
|
|
- ✨ **HumanEval+**: 80x more tests than the original HumanEval!
- ✨ **MBPP+**: 35x more tests than the original MBPP!
- ✨ **EvalPerf**: evaluating the efficiency of LLM-generated code!
- ✨ **Framework**: our packages/images/tools can easily and safely evaluate LLMs on the above benchmarks.
|
|
|
|
|
Why EvalPlus? |
|
|
|
|
|
- ✨ **Precise evaluation**: See [our leaderboard](https://evalplus.github.io/leaderboard.html) for the latest LLM rankings before & after rigorous evaluation.
- ✨ **Coding rigorousness**: Look at the score differences before & after using EvalPlus tests! A smaller drop means the model generates more rigorous code, while a larger drop means the generated code tends to be fragile.
- ✨ **Code efficiency**: Beyond correctness, our EvalPerf dataset evaluates the efficiency of LLM-generated code via performance-exercising coding tasks and test inputs.
|
|
|
|
|
Want to know more details? Read our papers & materials! |
|
|
|
|
|
- **EvalPlus**: [NeurIPS'23 paper](https://openreview.net/forum?id=1qvx610Cu7), [Slides](https://docs.google.com/presentation/d/1eTxzUQG9uHaU13BGhrqm4wH5NmMZiM3nI0ezKlODxKs), [Poster](https://jw-liu.xyz/assets/pdf/EvalPlus_Poster.pdf), [Leaderboard](https://evalplus.github.io/leaderboard.html) |
|
|
- **EvalPerf**: [COLM'24 paper](https://openreview.net/forum?id=IBCBMeAhmC), [Poster](https://jw-liu.xyz/assets/pdf/jiawei-colm-evalperf-poster.pdf), [Documentation](./docs/evalperf.md), [Leaderboard](https://evalplus.github.io/evalperf.html) |
|
|
|
|
|
|
|
|
## 🔥 Quick Start
|
|
|
|
|
### Code Correctness Evaluation: HumanEval(+) or MBPP(+) |
|
|
|
|
|
```bash |
|
|
pip install --upgrade "evalplus[vllm] @ git+https://github.com/evalplus/evalplus" |
|
|
# Or `pip install "evalplus[vllm]" --upgrade` for the latest stable release |
|
|
|
|
|
evalplus.evaluate --model "ise-uiuc/Magicoder-S-DS-6.7B" \ |
|
|
--dataset [humaneval|mbpp] \ |
|
|
--backend vllm \ |
|
|
--greedy |
|
|
``` |
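If you already have model outputs, generation can be skipped entirely: `evalplus.evaluate` also accepts a pre-generated samples file via `--samples` (the same flag used in the Docker examples below; the path here is only an illustration).

```bash
# Score an existing samples file (hypothetical path); each JSONL line
# holds a task ID and its generated solution
evalplus.evaluate --dataset humaneval \
--samples evalplus_results/humaneval/your_samples.jsonl
```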
|
|
|
|
|
<details><summary>🛡️ Safe code execution within Docker <i>:: click to expand ::</i></summary>
|
|
<div> |
|
|
|
|
|
```bash |
|
|
# Local generation |
|
|
evalplus.codegen --model "ise-uiuc/Magicoder-S-DS-6.7B" \ |
|
|
--dataset humaneval \ |
|
|
--backend vllm \ |
|
|
--greedy |
|
|
|
|
|
# Code execution within Docker |
|
|
docker run --rm --pull=always -v $(pwd)/evalplus_results:/app ganler/evalplus:latest \ |
|
|
evalplus.evaluate --dataset humaneval \ |
|
|
--samples /app/humaneval/ise-uiuc--Magicoder-S-DS-6.7B_vllm_temp_0.0.jsonl |
|
|
``` |
|
|
|
|
|
</div> |
|
|
</details> |
|
|
|
|
|
### Code Efficiency Evaluation: EvalPerf (*nix only) |
|
|
|
|
|
```bash |
|
|
pip install --upgrade "evalplus[perf,vllm] @ git+https://github.com/evalplus/evalplus" |
|
|
# Or `pip install "evalplus[perf,vllm]" --upgrade` for the latest stable release |
|
|
|
|
|
sudo sh -c 'echo 0 > /proc/sys/kernel/perf_event_paranoid' # Enable perf |
|
|
evalplus.evalperf --model "ise-uiuc/Magicoder-S-DS-6.7B" --backend vllm |
|
|
``` |
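EvalPerf profiles code with Linux performance counters, which is why the `perf_event_paranoid` knob must be lowered first. A quick sanity check before running (plain procfs, nothing EvalPlus-specific):

```bash
# Unprivileged perf counting generally requires a low value such as 0
cat /proc/sys/kernel/perf_event_paranoid
```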
|
|
|
|
|
<details><summary>🛡️ Safe code execution within Docker <i>:: click to expand ::</i></summary>
|
|
<div> |
|
|
|
|
|
```bash |
|
|
# Local generation |
|
|
evalplus.codegen --model "ise-uiuc/Magicoder-S-DS-6.7B" \ |
|
|
--dataset evalperf \ |
|
|
--backend vllm \ |
|
|
--temperature 1.0 \ |
|
|
--n-samples 100 |
|
|
|
|
|
# Code execution within Docker |
|
|
sudo sh -c 'echo 0 > /proc/sys/kernel/perf_event_paranoid' # Enable perf |
|
|
docker run --cap-add PERFMON --rm --pull=always -v $(pwd)/evalplus_results:/app ganler/evalplus:latest \ |
|
|
evalplus.evalperf --samples /app/evalperf/ise-uiuc--Magicoder-S-DS-6.7B_vllm_temp_1.0.jsonl |
|
|
``` |
|
|
|
|
|
</div> |
|
|
</details> |
|
|
|
|
|
## 🚀 LLM Backends
|
|
|
|
|
### HuggingFace models |
|
|
|
|
|
- `transformers` backend: |
|
|
|
|
|
```bash |
|
|
evalplus.evaluate --model "ise-uiuc/Magicoder-S-DS-6.7B" \ |
|
|
--dataset [humaneval|mbpp] \ |
|
|
--backend hf \ |
|
|
--greedy |
|
|
``` |
|
|
|
|
|
> [!NOTE]
|
|
> |
|
|
> EvalPlus uses different prompts for base and chat models.
> By default, the model type is detected via `tokenizer.chat_template` when using the `hf`/`vllm` backends.
> Other backends only support chat mode.
>
> Therefore, if your base model ships with a `tokenizer.chat_template`,
> please add `--force-base-prompt` to avoid being evaluated in chat mode (see the sketch below).
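A minimal sketch of that flag (the model name below is only a placeholder for a base model whose tokenizer happens to ship a chat template):

```bash
# Force completion-style (base) prompting despite a chat template
evalplus.evaluate --model "your-org/your-base-model" \
--dataset humaneval \
--backend vllm \
--force-base-prompt \
--greedy
```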
|
|
|
|
|
<details><summary>Enable Flash Attention 2 <i>:: click to expand ::</i></summary> |
|
|
<div> |
|
|
|
|
|
```bash
# Install FlashAttention-2 (requires a CUDA build environment)
pip install packaging ninja
pip install flash-attn --no-build-isolation

# Run the HF backend with FlashAttention-2 (or PyTorch SDPA)
evalplus.evaluate --model "ise-uiuc/Magicoder-S-DS-6.7B" \
--dataset [humaneval|mbpp] \
--backend hf \
--attn-implementation [flash_attention_2|sdpa] \
--greedy
```
|
|
|
|
|
</div> |
|
|
</details> |
|
|
|
|
|
- `vllm` backend: |
|
|
|
|
|
```bash |
|
|
evalplus.evaluate --model "ise-uiuc/Magicoder-S-DS-6.7B" \ |
|
|
--dataset [humaneval|mbpp] \ |
|
|
--backend vllm \ |
|
|
--tp [TENSOR_PARALLEL_SIZE] \ |
|
|
--greedy |
|
|
``` |
|
|
|
|
|
- `openai` compatible servers (e.g., [vLLM](https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html)): |
|
|
|
|
|
```bash |
|
|
# OpenAI models |
|
|
export OPENAI_API_KEY="{KEY}" # https://platform.openai.com/settings/organization/api-keys |
|
|
evalplus.evaluate --model "gpt-4o-2024-08-06" \ |
|
|
--dataset [humaneval|mbpp] \ |
|
|
--backend openai --greedy |
|
|
|
|
|
# DeepSeek |
|
|
export OPENAI_API_KEY="{KEY}" # https://platform.deepseek.com/api_keys |
|
|
evalplus.evaluate --model "deepseek-chat" \ |
|
|
--dataset [humaneval|mbpp] \ |
|
|
--base-url https://api.deepseek.com \ |
|
|
--backend openai --greedy |
|
|
|
|
|
# Grok |
|
|
export OPENAI_API_KEY="{KEY}" # https://console.x.ai/ |
|
|
evalplus.evaluate --model "grok-beta" \ |
|
|
--dataset [humaneval|mbpp] \ |
|
|
--base-url https://api.x.ai/v1 \ |
|
|
--backend openai --greedy |
|
|
|
|
|
# vLLM server |
|
|
# First, launch a vLLM server: https://docs.vllm.ai/en/latest/serving/deploying_with_docker.html |
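# e.g., assuming a recent vLLM release with the `vllm serve` CLI
# (older releases use `python -m vllm.entrypoints.openai.api_server`):
#   vllm serve ise-uiuc/Magicoder-S-DS-6.7B --port 8000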
|
|
evalplus.evaluate --model "ise-uiuc/Magicoder-S-DS-6.7B" \ |
|
|
--dataset [humaneval|mbpp] \ |
|
|
--base-url http://localhost:8000/v1 \ |
|
|
--backend openai --greedy |
|
|
|
|
|
```

- `gptqmodel` backend (GPTQ-quantized models; a local backend rather than an OpenAI-compatible server):

```bash
evalplus.evaluate --model "ModelCloud/Llama-3.2-1B-Instruct-gptqmodel-4bit-vortex-v1" \
--dataset [humaneval|mbpp] \
--backend gptqmodel --greedy
```
|
|
|
|
|
### OpenAI models |
|
|
|
|
|
- Access OpenAI APIs from [OpenAI Console](https://platform.openai.com/) |
|
|
|
|
|
```bash |
|
|
export OPENAI_API_KEY="[YOUR_API_KEY]" |
|
|
evalplus.evaluate --model "gpt-4o" \ |
|
|
--dataset [humaneval|mbpp] \ |
|
|
--backend openai \ |
|
|
--greedy |
|
|
``` |
|
|
|
|
|
### Anthropic models |
|
|
|
|
|
- Access Anthropic APIs from [Anthropic Console](https://console.anthropic.com/) |
|
|
|
|
|
```bash |
|
|
export ANTHROPIC_API_KEY="[YOUR_API_KEY]" |
|
|
evalplus.evaluate --model "claude-3-haiku-20240307" \ |
|
|
--dataset [humaneval|mbpp] \ |
|
|
--backend anthropic \ |
|
|
--greedy |
|
|
``` |
|
|
|
|
|
### Google Gemini models |
|
|
|
|
|
- Access Gemini APIs from [Google AI Studio](https://aistudio.google.com/) |
|
|
|
|
|
```bash |
|
|
export GOOGLE_API_KEY="[YOUR_API_KEY]" |
|
|
evalplus.evaluate --model "gemini-1.5-pro" \ |
|
|
--dataset [humaneval|mbpp] \ |
|
|
--backend google \ |
|
|
--greedy |
|
|
``` |
|
|
|
|
|
### Amazon Bedrock models |
|
|
|
|
|
- [Amazon Bedrock](https://aws.amazon.com/bedrock/) |
|
|
|
|
|
```bash |
|
|
export BEDROCK_ROLE_ARN="[BEDROCK_ROLE_ARN]" |
|
|
evalplus.evaluate --model "anthropic.claude-3-5-sonnet-20241022-v2:0" \ |
|
|
--dataset [humaneval|mbpp] \ |
|
|
--backend bedrock \ |
|
|
--greedy |
|
|
``` |
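Note that the Bedrock backend also relies on the standard AWS credential/region resolution of `boto3`; a typical setup might look like the sketch below (example region, standard AWS environment variables):

```bash
# Standard AWS credentials/region for boto3 (example values)
export AWS_DEFAULT_REGION="us-west-2"
export AWS_ACCESS_KEY_ID="[YOUR_ACCESS_KEY]"
export AWS_SECRET_ACCESS_KEY="[YOUR_SECRET_KEY]"
```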
|
|
|
|
|
You can check out the generated samples and evaluation results under `evalplus_results/[humaneval|mbpp]/`.
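For instance, a quick look at what gets produced (assuming the default output directory; exact file names vary by model and settings):

```bash
# Typically the raw samples JSONL, a sanitized variant, and eval summaries
ls evalplus_results/humaneval/
```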
|
|
|
|
|
<details><summary>⏬ Using EvalPlus as a local repo? <i>:: click to expand ::</i></summary>
|
|
<div> |
|
|
|
|
|
```bash |
|
|
git clone https://github.com/evalplus/evalplus.git |
|
|
cd evalplus |
|
|
export PYTHONPATH=$PYTHONPATH:$(pwd) |
|
|
pip install -r requirements.txt |
|
|
``` |
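With `PYTHONPATH` set as above, the CLI can be exercised straight from the checkout; a sketch assuming the modules expose the same `main` entry points as the installed console scripts:

```bash
# Equivalent to the installed `evalplus.evaluate` console script
python -m evalplus.evaluate --model "ise-uiuc/Magicoder-S-DS-6.7B" \
--dataset humaneval --backend vllm --greedy
```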
|
|
|
|
|
</div> |
|
|
</details> |
|
|
|
|
|
## 📚 Documents
|
|
|
|
|
To learn more about how to use EvalPlus, please refer to: |
|
|
|
|
|
- [EvalPlus Commands](./docs/cli.md) |
|
|
- [EvalPerf](./docs/evalperf.md) |
|
|
- [Program Execution](./docs/execution.md) |
|
|
|
|
|
## 📜 Citation
|
|
|
|
|
```bibtex |
|
|
@inproceedings{evalplus, |
|
|
title = {Is Your Code Generated by Chat{GPT} Really Correct? Rigorous Evaluation of Large Language Models for Code Generation}, |
|
|
author = {Liu, Jiawei and Xia, Chunqiu Steven and Wang, Yuyao and Zhang, Lingming}, |
|
|
booktitle = {Thirty-seventh Conference on Neural Information Processing Systems}, |
|
|
year = {2023}, |
|
|
url = {https://openreview.net/forum?id=1qvx610Cu7}, |
|
|
} |
|
|
|
|
|
@inproceedings{evalperf, |
|
|
title = {Evaluating Language Models for Efficient Code Generation}, |
|
|
author = {Liu, Jiawei and Xie, Songrun and Wang, Junhao and Wei, Yuxiang and Ding, Yifeng and Zhang, Lingming}, |
|
|
booktitle = {First Conference on Language Modeling}, |
|
|
year = {2024}, |
|
|
url = {https://openreview.net/forum?id=IBCBMeAhmC}, |
|
|
} |
|
|
``` |
|
|
|
|
|
## 🙏 Acknowledgement
|
|
|
|
|
- [HumanEval](https://github.com/openai/human-eval) |
|
|
- [MBPP](https://github.com/google-research/google-research/tree/master/mbpp) |
|
|
|