Evaluating Language Models for Efficient Code Generation (COLM'24)

Overview

Code Efficiency Evaluation requires:

  • Performance-exercising tasks:
    • Computationally non-trivial tasks
    • Computationally intensive test inputs
  • A meaningful compound metric:
    • Evaluation must span multiple tasks to draw statistically meaningful conclusions about an LLM's code efficiency
    • Yet the commonly used average speedup is biased towards tasks with larger efficiency gaps, as the sketch after this list illustrates.
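
A minimal sketch (with synthetic numbers, not EvalPerf data) of that bias: a single task with a huge efficiency gap can dominate a ranking based on the arithmetic mean of speedups, even when the per-task picture says the opposite.

import statistics

speedup_a = [1.2, 1.3, 1.1, 1.2]   # consistently modest gains on every task
speedup_b = [0.9, 0.8, 0.9, 50.0]  # slower on most tasks, one outsized win

print(sum(speedup_a) / len(speedup_a))  # 1.2
print(sum(speedup_b) / len(speedup_b))  # 13.15 -- the one outlier task decides the mean
print(statistics.median(speedup_b))     # 0.9  -- the per-task view disagrees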

Using Differential Performance Evaluation, we curate the EvalPerf dataset -- the current version (20240328) includes:

  • 118 performance-exercising tasks
  • A computationally challenging test input for each task, generated by the SaS generator
  • A differential performance score (DPS) that supports conclusions like "Your submission can outperform 80% of LLM solutions..." (see the sketch after this list)
  • Pairwise comparisons of LLMs' code efficiency over commonly passing tasks, to ablate the impact of correctness
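
Intuitively, DPS ranks a submission against performance clusters of reference LLM solutions: matching cluster i means the submission is at least as efficient as every solution in clusters 0..i. Below is a hedged sketch of that computation; the helper names are illustrative, not part of the EvalPlus API.

def dps(matching_cluster_idx, cluster_sizes):
    # Clusters are assumed ordered from slowest to fastest; a submission
    # matched to cluster i is no slower than all solutions in clusters 0..i.
    covered = sum(cluster_sizes[: matching_cluster_idx + 1])
    return 100.0 * covered / sum(cluster_sizes)

def dps_norm(matching_cluster_idx, num_clusters):
    # Normalized variant: each cluster counts equally, regardless of its size.
    return 100.0 * (matching_cluster_idx + 1) / num_clusters

print(dps(3, [10, 25, 40, 5, 20]))  # 80.0 -> "outperforms 80% of LLM solutions"
print(dps_norm(3, 5))               # 80.0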

Running EvalPerf

evalplus.evalperf --model {model_name} --backend [vllm|hf|openai|google|anthropic]
# model_name can be a Hugging Face path such as `ise-uiuc/Magicoder-DS-6.7B`
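
For reference, a typical setup looks like the following; the package extras and the perf-event step are assumptions based on the EvalPlus documentation, so verify them against the repository for your version.

# Install EvalPlus with the performance and vLLM extras (assumed extras names):
pip install "evalplus[perf,vllm]" --upgrade
# EvalPerf profiles CPU instruction counts via Linux perf events, which
# usually requires relaxing the kernel's perf_event_paranoid setting:
sudo sh -c 'echo 0 > /proc/sys/kernel/perf_event_paranoid'
# Run the evaluation against a Hugging Face model with the vLLM backend:
evalplus.evalperf --model ise-uiuc/Magicoder-DS-6.7B --backend vllm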

Overall, this command performs four steps:

  • Step 1: We sample 100 solutions (n_samples) from the LLM under evaluation
  • Step 2: For each task with at least 10 passing samples (min_correct), we perform the efficiency evaluation
  • Step 3: Produce a {model_name}_evalperf_v{VERSION}.jsonl file where each line includes:
    • task_id (str)
    • results (List[Dict])
      • solution (str)
      • pass (bool)
      • profiled (bool)
      • matching_cluster_idx (Optional[int])
      • _num_cpu_instructions (Optional[int])
      • dps (Optional[float])
      • dps_norm (Optional[float])
    • ref (List[Dict])
      • solution (str)
      • score (float; on a 0-100 scale)
      • _num_cpu_instructions (Optional[int])
    • dps (Optional[float])
    • dps_norm (Optional[float])
    • pass@1 (float; on a 0-100 scale)
    • n_profiled (Optional[int])
  • Step 4: Compute the differential performance score (a sketch for inspecting the Step-3 output follows this list)
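
To make the output schema above concrete, here is a minimal sketch that loads a Step-3 result file and averages the task-level scores. The file name is illustrative; field names follow the schema above.

import json

# Illustrative name; the real file follows {model_name}_evalperf_v{VERSION}.jsonl
path = "Magicoder-DS-6.7B_evalperf_v20240328.jsonl"

dps_vals, dps_norm_vals = [], []
with open(path) as f:
    for line in f:
        record = json.loads(line)
        if record["dps"] is not None:  # None when too few samples passed profiling
            dps_vals.append(record["dps"])
            dps_norm_vals.append(record["dps_norm"])

print(f"DPS:      {sum(dps_vals) / len(dps_vals):.1f}")
print(f"DPS_norm: {sum(dps_norm_vals) / len(dps_norm_vals):.1f}")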

Citation

@inproceedings{liu2024evaluating,
  title = {Evaluating Language Models for Efficient Code Generation},
  author = {Liu, Jiawei and Xie, Songrun and Wang, Junhao and Wei, Yuxiang and Ding, Yifeng and Zhang, Lingming},
  booktitle = {First Conference on Language Modeling},
  year = {2024},
  url = {https://openreview.net/forum?id=IBCBMeAhmC},
}