# Evaluating Language Models for Efficient Code Generation (COLM'24)

* [Paper](https://www.arxiv.org/abs/2408.06450)
* [Poster](https://jw-liu.xyz/assets/pdf/jiawei-colm-evalperf-poster.pdf)
## Overview

**Code efficiency evaluation** requires:

* **Performance-exercising tasks**:
  * A computationally non-trivial *task*
  * A computationally intensive *test input*
* **A meaningful compound metric**:
  * We need to evaluate on multiple tasks to get a statistical sense of an LLM's code efficiency.
  * Yet the commonly used average speedup is biased towards tasks with larger efficiency gaps.
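As a toy illustration (with made-up numbers), the average speedup across tasks is dominated by whichever task has the larger efficiency gap:

```python
# Hypothetical per-task speedups of a model over a baseline.
# The 10x outlier on task_b dominates the average, masking the
# near-identical performance on task_a.
speedups = {"task_a": 1.1, "task_b": 10.0}

avg = sum(speedups.values()) / len(speedups)
print(avg)  # 5.55 -- almost entirely driven by task_b
```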
Using **Differential Performance Evaluation**, we curate the EvalPerf dataset. The current version (`20240328`) includes:

* 118 performance-exercising tasks
* A computationally challenging test input for each task, generated by the SaS generator
* A differential performance score (DPS) that yields conclusions such as "Your submission can outperform 80% of LLM solutions..."
* Pairwise comparison of LLMs' code efficiency over commonly passing tasks, to ablate the impact of correctness
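As a simplified sketch of the intuition behind DPS (not EvalPerf's exact clustering-based implementation, and with made-up instruction counts), a submission's score can be read as the percentage of reference LLM solutions it matches or outperforms:

```python
def dps_sketch(submission_insns: int, reference_insns: list[int]) -> float:
    """Simplified DPS intuition: the percentage of reference solutions
    that the submission matches or beats, where fewer CPU instructions
    means faster code. (The real EvalPerf groups references into
    performance clusters first.)"""
    matched_or_beaten = sum(ref >= submission_insns for ref in reference_insns)
    return 100.0 * matched_or_beaten / len(reference_insns)

# A submission costing 80 units against references at [50, 70, 90, 120, 200]:
print(dps_sketch(80, [50, 70, 90, 120, 200]))  # 60.0 -> "outperforms 60% of LLM solutions"
```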
## Running EvalPerf

```bash
evalplus.evalperf --model {model_name} --backend [vllm|hf|openai|google|anthropic]
# model_name can be a Hugging Face path such as `ise-uiuc/Magicoder-DS-6.7B`
```
Overall, this script performs four steps:

* **Step 1**: Sample 100 solutions (`n_samples`) from each LLM under evaluation.
* **Step 2**: For tasks with at least 10 passing samples (`min_correct`), perform efficiency evaluation.
* **Step 3**: Produce a `{model_name}_evalperf_v{VERSION}.jsonl` file where each line includes:
  * `task_id` (str)
  * `results` (`List[Dict]`)
    * `solution` (str)
    * `pass` (bool)
    * `profiled` (bool)
    * `matching_cluster_idx` (`Optional[int]`)
    * `_num_cpu_instructions` (`Optional[int]`)
    * `dps` (`Optional[float]`)
    * `dps_norm` (`Optional[float]`)
  * `ref` (`List[Dict]`)
    * `solution` (str)
    * `score` (float; on a 0-100 scale)
    * `_num_cpu_instructions` (`Optional[int]`)
    * `dps` (`Optional[float]`)
    * `dps_norm` (`Optional[float]`)
  * `pass@1` (float; on a 0-100 scale)
  * `n_profiled` (`Optional[int]`)
* **Step 4**: Compute the differential performance score.
## Citation

```bibtex
@inproceedings{liu2024evaluating,
  title = {Evaluating Language Models for Efficient Code Generation},
  author = {Liu, Jiawei and Xie, Songrun and Wang, Junhao and Wei, Yuxiang and Ding, Yifeng and Zhang, Lingming},
  booktitle = {First Conference on Language Modeling},
  year = {2024},
  url = {https://openreview.net/forum?id=IBCBMeAhmC},
}
```