Evaluating Language Models for Efficient Code Generation (COLM'24)
Overview
Code efficiency evaluation requires:
- Performance-exercising tasks:
  - Computationally non-trivial tasks
  - Computationally intensive test inputs
- A meaningful compound metric:
  - We need to evaluate on multiple tasks to get a statistical sense of an LLM's code efficiency.
  - Yet the commonly used average speedup is biased towards tasks with larger efficiency gaps.
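To see the bias concretely, here is a tiny sketch with made-up numbers: one task with a huge attainable speedup dominates the average, so a model that wins only on that task looks better overall.

```python
# Illustrative (made-up) per-task speedups for two hypothetical models.
speedups_model_a = {"easy_task": 1.1, "hot_loop_task": 50.0}
speedups_model_b = {"easy_task": 2.0, "hot_loop_task": 40.0}

avg_a = sum(speedups_model_a.values()) / len(speedups_model_a)
avg_b = sum(speedups_model_b.values()) / len(speedups_model_b)

# Model A wins on average purely because of the one large-gap task,
# even though Model B is nearly twice as fast on the other task.
print(f"A: {avg_a:.2f}  B: {avg_b:.2f}")  # A: 25.55  B: 21.00
```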
Using Differential Performance Evaluation, we curate the EvalPerf dataset. The current version (20240328) includes:
- 118 performance-exercising tasks
- A computationally challenging test input for each task, generated by the SaS generator
- A differential performance score (DPS) that supports conclusions like "Your submission can outperform 80% of LLM solutions..."
- Pairwise comparisons of LLMs' code efficiency over commonly passing tasks, to ablate the impact of correctness
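The DPS idea can be sketched as follows. This is a simplified illustration, not EvalPerf's implementation: the real scoring profiles CPU instruction counts and groups reference solutions into clusters, while the function and numbers below are hypothetical.

```python
def dps_sketch(num_instructions: float, reference_instructions: list[float]) -> float:
    """Simplified DPS sketch: the percentage of reference LLM solutions
    that the submission matches or beats (fewer or equal CPU
    instructions; lower is better), on a 0-100 scale."""
    beaten = sum(1 for r in reference_instructions if num_instructions <= r)
    return 100.0 * beaten / len(reference_instructions)

# Made-up instruction counts for 10 reference solutions.
refs = [100, 120, 90, 200, 150, 110, 95, 130, 80, 85]

# A submission cheaper than or equal to 8 of the 10 references scores 80,
# i.e. it "can outperform 80% of LLM solutions".
print(dps_sketch(90, refs))  # 80.0
```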
Running EvalPerf
```bash
evalplus.evalperf --model {model_name} --backend [vllm|hf|openai|google|anthropic]
# model_name can be a Hugging Face path such as `ise-uiuc/Magicoder-DS-6.7B`
```
Overall, this script performs four steps:
- Step 1: Sample 100 solutions (`n_samples`) from each LLM to evaluate
- Step 2: For tasks with at least 10 passing samples (`min_correct`), perform the efficiency evaluation
- Step 3: Produce a `{model_name}_evalperf_v{VERSION}.jsonl` file where each line includes:
  - `task_id` (str)
  - `results` (List[Dict]):
    - `solution` (str)
    - `pass` (bool)
    - `profiled` (bool)
    - `matching_cluster_idx` (Optional[int])
    - `_num_cpu_instructions` (Optional[int])
    - `dps` (Optional[float])
    - `dps_norm` (Optional[float])
  - `ref` (List[Dict]):
    - `solution` (str)
    - `score` (float; 100 based)
    - `_num_cpu_instructions` (Optional[int])
  - `dps` (Optional[float])
  - `dps_norm` (Optional[float])
  - `pass@1` (float; 100 based)
  - `n_profiled` (Optional[int])
- Step 4: Compute the differential performance score
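Since the output is a JSON Lines file with the schema above, it is easy to post-process. A minimal sketch, assuming the field names listed in Step 3 (the `summarize` helper itself is hypothetical, not part of EvalPlus):

```python
import json


def summarize(path: str) -> tuple[float, float]:
    """Average `dps` and `dps_norm` over all profiled samples in a
    `{model_name}_evalperf_v{VERSION}.jsonl` file."""
    dps_vals, dps_norm_vals = [], []
    with open(path) as f:
        for line in f:
            task = json.loads(line)  # one task per line
            for result in task["results"]:
                if result.get("profiled"):  # skip unprofiled samples
                    dps_vals.append(result["dps"])
                    dps_norm_vals.append(result["dps_norm"])
    return (sum(dps_vals) / len(dps_vals),
            sum(dps_norm_vals) / len(dps_norm_vals))
```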
Citation
```bibtex
@inproceedings{liu2024evaluating,
  title     = {Evaluating Language Models for Efficient Code Generation},
  author    = {Liu, Jiawei and Xie, Songrun and Wang, Junhao and Wei, Yuxiang and Ding, Yifeng and Zhang, Lingming},
  booktitle = {First Conference on Language Modeling},
  year      = {2024},
  url       = {https://openreview.net/forum?id=IBCBMeAhmC},
}
```