Papers
arxiv:2604.03922

ACES: Who Tests the Tests? Leave-One-Out AUC Consistency for Code Generation

Published on Apr 5
Authors:
,
,
,
,
,

Abstract

Researchers address the challenge of selecting correct code candidates from LLM-generated outputs by developing ACES, a method that ranks tests based on their ability to distinguish correct from incorrect code through leave-one-out evaluation and AUC consistency scoring.

AI-generated summary

Selecting LLM-generated code candidates using LLM-generated tests is challenging because the tests themselves may be incorrect. Existing methods either treat all tests equally or rely on ad-hoc heuristics to filter unreliable tests. Yet determining test correctness requires knowing which codes are correct, creating a circular dependency. Our key insight is that we need not determine test correctness at all: test votes should rank, not merely count. What matters is not how many codes pass a test, but whether the test can distinguish correct from incorrect code. We break the circular dependency via leave-one-out evaluation: hold out one test, rank codes by their aggregate scores on all remaining tests, and measure whether the held-out test's pass/fail pattern agrees with this ranking. We formalize this agreement as the leave-one-out AUC~(LOO-AUC) and prove that the expected LOO-AUC is proportional to each test's ability to separate correct code from incorrect code. Building on this, we propose ACES~(AUC ConsistEncy Scoring) with two complementary variants: ACES-C provides closed-form weights that provably approximate the oracle in expectation under a mild assumption on average test quality; ACES-O drops this assumption and iteratively optimizes a differentiable LOO-AUC objective. Both operate solely on the binary pass matrix with negligible overhead, and achieve state-of-the-art Pass@k on multiple code generation benchmarks.

Community

Sign up or log in to comment

Get this paper in your agent:

hf papers read 2604.03922
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2604.03922 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2604.03922 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2604.03922 in a Space README.md to link it from this page.

Collections including this paper 1