# Evaluating Language Models for Efficient Code Generation (COLM'24)

* [Paper](https://www.arxiv.org/abs/2408.06450)
* [Poster](https://jw-liu.xyz/assets/pdf/jiawei-colm-evalperf-poster.pdf)

## Overview

**Code Efficiency Evaluation** requires:

* **Performance-exercising tasks**:
    * Computationally non-trivial *task*
    * Computationally intensive *test input*
* **Meaningful compound metric**:
    * We need to evaluate on multiple tasks to get a statistical sense of an LLM's code efficiency
    * Yet the commonly used average speedup is biased toward tasks with larger efficiency gaps

Using **Differential Performance Evaluation**, we curate the EvalPerf dataset. The current version (`20240328`) includes:

* 118 performance-exercising tasks
* Each task is equipped with a computationally challenging test input generated by the SaS generator
* A differential performance score (DPS) that supports conclusions like "Your submission can outperform 80% of LLM solutions..."
* Pairwise comparison of LLMs' code efficiency over commonly passing tasks, to factor out the impact of correctness
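
The percentile-style reading of DPS can be sketched as follows. This is a minimal illustration of the idea, not the evalplus implementation: it assumes reference solutions are clustered by a measured cost (e.g., CPU instruction count, lower is better), with each cluster recording how many profiled LLM samples it contains; the cluster numbers below are hypothetical.

```python
# Minimal sketch of the DPS idea (NOT the evalplus implementation).
# Assumption: reference solutions are grouped into performance clusters,
# each with a representative cost and a count of profiled LLM samples.
from typing import List, Tuple

def dps(submission_cost: int, clusters: List[Tuple[int, int]]) -> float:
    """clusters: (cluster_cost, n_samples_in_cluster) pairs, any order.
    DPS = % of profiled LLM samples whose cost the submission matches or beats."""
    total = sum(n for _, n in clusters)
    matched = sum(n for cost, n in clusters if submission_cost <= cost)
    return 100.0 * matched / total

# Hypothetical clusters: fast (20 samples), medium (50), slow (30).
clusters = [(1_000_000, 20), (5_000_000, 50), (20_000_000, 30)]
print(dps(4_000_000, clusters))  # beats medium + slow samples -> 80.0
```

A submission whose cost falls between the fast and medium clusters matches or outperforms 80 of the 100 profiled samples, hence "outperforms 80% of LLM solutions". Because each task's score is bounded in [0, 100], no single task with a huge efficiency gap can dominate the aggregate the way it does with average speedup.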

## Running EvalPerf

```bash
evalplus.evalperf --model {model_name} --backend [vllm|hf|openai|google|anthropic]
# model_name can be a Hugging Face path such as `ise-uiuc/Magicoder-DS-6.7B`
```

Overall, this script performs four steps:

* **Step 1**: We sample 100 solutions (`n_samples`) from each LLM to evaluate
* **Step 2**: For tasks with at least 10 passing samples (`min_correct`), we perform efficiency evaluation
* **Step 3**: Produce a `{model_name}_evalperf_v{VERSION}.jsonl` file where each line includes:
  * `task_id` (str)
  * `results` (`List[Dict]`)
    * `solution` (str)
    * `pass` (bool)
    * `profiled` (bool)
    * `matching_cluster_idx` (`Optional[int]`)
    * `_num_cpu_instructions` (`Optional[int]`)
    * `dps` (`Optional[float]`)
    * `dps_norm` (`Optional[float]`)
  * `ref` (`List[Dict]`)
    * `solution` (str)
    * `score` (float; on a 0-100 scale)
    * `_num_cpu_instructions` (`Optional[int]`)
  * `dps` (`Optional[float]`)
  * `dps_norm` (`Optional[float]`)
  * `pass@1` (float; on a 0-100 scale)
  * `n_profiled` (`Optional[int]`)
* **Step 4**: Compute the differential performance score
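
As a quick sanity check, the per-task JSONL above can be aggregated with a short script. The field names follow the schema listed in Step 3; taking a plain mean across tasks is an assumption for illustration, not necessarily how evalplus weights its final score.

```python
# Minimal sketch: aggregate per-task results from the Step 3 JSONL.
# Field names follow the schema above; simple averaging is an assumption.
import json

def summarize(path: str) -> dict:
    dps, dps_norm, pass1 = [], [], []
    with open(path) as f:
        for line in f:
            rec = json.loads(line)
            if rec.get("dps") is not None:       # tasks skipped by min_correct
                dps.append(rec["dps"])           # have no efficiency scores
            if rec.get("dps_norm") is not None:
                dps_norm.append(rec["dps_norm"])
            pass1.append(rec["pass@1"])
    mean = lambda xs: sum(xs) / len(xs) if xs else float("nan")
    return {"DPS": mean(dps), "DPS_norm": mean(dps_norm), "pass@1": mean(pass1)}
```

Tasks with too few passing samples carry `null` efficiency fields, so they are excluded from the DPS averages but still count toward `pass@1`.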


## Citation

```bibtex
@inproceedings{liu2024evaluating,
  title = {Evaluating Language Models for Efficient Code Generation},
  author = {Liu, Jiawei and Xie, Songrun and Wang, Junhao and Wei, Yuxiang and Ding, Yifeng and Zhang, Lingming},
  booktitle = {First Conference on Language Modeling},
  year = {2024},
  url = {https://openreview.net/forum?id=IBCBMeAhmC},
}
```