# Evaluating Language Models for Efficient Code Generation (COLM'24)
* [Paper](https://www.arxiv.org/abs/2408.06450)
* [Poster](https://jw-liu.xyz/assets/pdf/jiawei-colm-evalperf-poster.pdf)
## Overview
**Code Efficiency Evaluation** requires:
* **Performance-exercising tasks**:
* Computationally non-trivial *task*
* Computationally intensive *test input*
* **Meaningful compound metric**:
* We need to evaluate on multiple tasks to get a statistical sense of an LLM's code efficiency
* Yet, the commonly used average speedup is biased towards tasks with larger efficiency gaps
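To see the bias concretely, here is a toy comparison with hypothetical per-task speedup numbers (the values are illustrative only, not taken from the paper):

```python
# Toy illustration of why average speedup over-weights tasks with large
# efficiency gaps. Model A is modestly better on both tasks; Model B is
# worse on task 1 but dominates task 2, where the achievable gap is huge.
speedup_a = [1.2, 1.2]   # hypothetical per-task speedups over a baseline
speedup_b = [0.9, 50.0]

avg_a = sum(speedup_a) / len(speedup_a)  # 1.2
avg_b = sum(speedup_b) / len(speedup_b)  # ~25.45: one outlier task dominates

print(avg_a, avg_b)
```

Under the average-speedup metric, Model B looks vastly better overall even though it is slower on half the tasks; the single large-gap task swamps the mean.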
Using **Differential Performance Evaluation**, we curate the EvalPerf dataset. The current version (`20240328`) includes:
* 118 performance-exercising tasks
* A computationally challenging test input for each task, generated by the SaS generator
* A differential performance score (DPS) that supports conclusions like "Your submission outperforms 80% of LLM solutions..."
* Pairwise comparisons of LLMs' code efficiency over commonly passing tasks, to ablate the impact of correctness
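The DPS idea can be sketched as follows. This is a simplified, illustrative version (it ignores the performance clustering of reference solutions used in the actual implementation, and the cost metric here is just "lower is better", e.g., CPU instruction counts):

```python
def dps(submission_cost: float, ref_costs: list[float]) -> float:
    """Percentage of reference LLM solutions the submission matches or beats.

    submission_cost: measured cost of the submission (lower = faster).
    ref_costs: measured costs of the profiled reference solutions.
    """
    if not ref_costs:
        raise ValueError("need at least one reference solution")
    beaten = sum(1 for c in ref_costs if submission_cost <= c)
    return 100.0 * beaten / len(ref_costs)

# Hypothetical example: a submission costing 120 units against references
# costing 100, 150, 200, and 400 units matches or beats 3 of 4 references.
print(dps(120, [100, 150, 200, 400]))  # → 75.0
```

A score of 75 reads as "this submission outperforms 75% of the reference LLM solutions", which is the kind of conclusion DPS is designed to produce.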
## Running EvalPerf
```bash
evalplus.evalperf --model {model_name} --backend [vllm|hf|openai|google|anthropic]
# model_name can be a Hugging Face model path such as `ise-uiuc/Magicoder-DS-6.7B`
```
Overall, this command performs four steps:
* **Step 1**: We sample 100 solutions (`n_samples`) from the LLM under evaluation
* **Step 2**: For tasks with at least 10 passing samples (`min_correct`), we run the efficiency evaluation
* **Step 3**: Produce a `{model_name}_evalperf_v{VERSION}.jsonl` file where each line includes:
* `task_id` (str)
* `results` (`List[Dict]`)
* `solution` (str)
* `pass` (bool)
* `profiled` (bool)
* `matching_cluster_idx` (`Optional[int]`)
* `_num_cpu_instructions` (`Optional[int]`)
* `dps` (`Optional[float]`)
* `dps_norm` (`Optional[float]`)
* `ref` (`List[Dict]`)
* `solution` (str)
* `score` (float; on a 0-100 scale)
* `_num_cpu_instructions` (`Optional[int]`)
* `dps` (`Optional[float]`)
* `dps_norm` (`Optional[float]`)
* `pass@1` (float; on a 0-100 scale)
* `n_profiled` (`Optional[int]`)
* **Step 4**: Compute the differential performance score
## Citation
```bibtex
@inproceedings{liu2024evaluating,
title = {Evaluating Language Models for Efficient Code Generation},
author = {Liu, Jiawei and Xie, Songrun and Wang, Junhao and Wei, Yuxiang and Ding, Yifeng and Zhang, Lingming},
booktitle = {First Conference on Language Modeling},
year = {2024},
url = {https://openreview.net/forum?id=IBCBMeAhmC},
}
```