Leaderboard
View the LiveCodeBench leaderboard rankings
LiveCodeBench collects coding problems over time from various coding contest platforms, annotating problems with their release dates. Annotations are used to evaluate models on problem sets released in different time windows, allowing an “evaluation over time” strategy that helps detect and prevent contamination. In addition to the usual code generation task, LiveCodeBench also assesses self-repair, test output prediction, and code execution, thus providing a more holistic view of coding capabilities required for the next generation of AI programming agents.
LiveCodeBench problems are curated from coding competition platforms: LeetCode, AtCoder, and CodeForces. These websites periodically host contests containing problems that assess the coding and problem-solving skills of participants. Problems consist of a natural language problem statement along with example input-output examples, and the goal is to write a program that passes a set of hidden tests. Thousands of participants engage in the competitions, which ensures that the problems are vetted for clarity and correctness.
LiveCodeBench uses the collected problems for building its four coding scenarios
assert f(input) == generated_output passes.For each scenario, evaluation is performed using the Pass@1 metric. The metric captures the probability of generating a correct answer and is computed using the ratio of the count of correct answers over the count of total attempts, following Pass@1 = total_correct / total_attempts.
Contamination is one of the major bottlenecks in current LLM evaluations. Even within LLM coding evaluations, there have been evidential reports of contamination and overfitting on standard benchmarks like HumanEval ([1] and [2]).
For this reason, we annotate problems with release dates in LiveCodeBench: that way, for new models with a training-cutoff date D, we can compute scores on problems released after D to measure their generalization on unseen problems.
LiveCodeBench formalizes this with a “scrolling over time” feature, that allows you to select problems within a specific time window. You can try it out in the leaderboard above!
We find that:
GPT-4-Turbo is the best-performing model across most scenarios. Furthermore, its margin grows on self-repair tasks, highlighting its capability to take compiler feedback.Claude-3-Opus overtakes GPT-4-Turbo in the test output prediction scenario, highlighting stronger natural language reasoning capabilities. Mistral-Large performs considerably better on natural language reasoning tasks like test output prediction and code execution.To evaluate your code models on LiveCodeBench, you can follow these steps
git clone https://github.com/LiveCodeBench/LiveCodeBench.git
cd LiveCodeBench
pip install poetry
poetry install
python -m lcb_runner.runner.main --model {model_name} --scenario {scenario_name}
for different scenarios. For new model families, we have implemented an extensible framework and you can support new models by modifying lcb_runner/lm_styles.py and lcb_runner/prompts as described in the github README.
Finally, we are looking for collaborators and suggestions for LiveCodeBench. The dataset and code are available online, so please reach out by submitting an issue or mail.
View the LiveCodeBench leaderboard rankings