Benchmarks Saturate When The Model Gets Smarter Than The Judge
Abstract
A manually audited revision of a mathematical benchmark dataset reduces dataset-induced noise and yields more accurate assessments of model performance.
Benchmarks are important tools for tracking progress in the development of Large Language Models (LLMs), yet inaccuracies in datasets and evaluation methods consistently undermine their effectiveness. Here, we present Omni-MATH-2, a manually revised version of the Omni-MATH dataset comprising a clean, exact-answer subset (n = 4181) and a tagged, non-standard subset (n = 247). Each problem was audited to ensure LaTeX compilability, solvability, and verifiability, which involved adding missing figures or information, labeling problems that require a proof, an estimate, or an image, and removing clutter. This process significantly reduces dataset-induced noise, thereby providing a more precise assessment of model performance. The annotated dataset also allows us to evaluate judge-induced noise by comparing GPT-5 mini with the original Omni-Judge, revealing substantial discrepancies between the judges on both the clean and tagged problem subsets. Expert annotations reveal that Omni-Judge is wrong in 96.4% of the judge disagreements, indicating that it cannot differentiate between models' abilities even well before the benchmark saturates. As problems become more challenging, we find that increasingly competent judges become essential to prevent judge errors from masking genuine differences between models. Finally, neither judge identifies the failure modes present in the tagged subset of problems, demonstrating that dataset quality and judge reliability are both critical for developing accurate benchmarks of model performance.
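To make the judge-comparison protocol concrete, here is a minimal sketch (not the authors' code) of how one might compute the disagreement rate between two judges' verdicts and, on the expert-annotated disagreements, how often Omni-Judge's call contradicts the expert, the statistic behind the 96.4% figure above. The `Verdict` record and `disagreement_stats` helper are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass

@dataclass
class Verdict:
    """Per-problem grading record (hypothetical schema, for illustration)."""
    problem_id: str
    omni_judge_correct: bool        # Omni-Judge's pass/fail call on the model answer
    gpt5_mini_correct: bool         # GPT-5 mini's pass/fail call on the same answer
    expert_correct: bool | None = None  # expert annotation, where available

def disagreement_stats(verdicts: list[Verdict]) -> dict[str, float]:
    """Return the judge disagreement rate and, among annotated
    disagreements, the fraction where Omni-Judge contradicts the expert."""
    disagreements = [v for v in verdicts
                     if v.omni_judge_correct != v.gpt5_mini_correct]
    rate = len(disagreements) / len(verdicts) if verdicts else 0.0

    annotated = [v for v in disagreements if v.expert_correct is not None]
    omni_wrong = [v for v in annotated
                  if v.omni_judge_correct != v.expert_correct]
    omni_wrong_rate = len(omni_wrong) / len(annotated) if annotated else 0.0

    return {"disagreement_rate": rate,
            "omni_judge_wrong_given_disagreement": omni_wrong_rate}

if __name__ == "__main__":
    # Toy example: two graded problems, one judge disagreement,
    # where the expert sides against Omni-Judge.
    sample = [Verdict("p1", True, False, expert_correct=False),
              Verdict("p2", True, True)]
    print(disagreement_stats(sample))
```

Under this framing, judge-induced noise is simply the mass of problems where the verdict depends on which judge you ask; the paper's point is that this mass is large enough to mask real model differences before the benchmark saturates.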
Community
arXivlens breakdown of this paper: https://arxivlens.com/PaperView/Details/benchmarks-saturate-when-the-model-gets-smarter-than-the-judge-5369-da6fc596
- Executive Summary
- Detailed Breakdown
- Practical Applications
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Estimating problem difficulty without ground truth using Large Language Model comparisons (2025)
- Towards Comprehensive Stage-wise Benchmarking of Large Language Models in Fact-Checking (2026)
- DevBench: A Realistic, Developer-Informed Benchmark for Code Generation Models (2026)
- SymPyBench: A Dynamic Benchmark for Scientific Reasoning with Executable Python Code (2025)
- Encyclo-K: Evaluating LLMs with Dynamically Composed Knowledge Statements (2025)
- Idea First, Code Later: Disentangling Problem Solving from Code Generation in Evaluating LLMs for Competitive Programming (2026)
- Self-Improving VLM Judges Without Human Annotations (2025)
