SoundnessBench: Can Your AI Scientist Really Tell Good Research Ideas from Bad Ones?
Abstract
SoundnessBench evaluates large language models' ability to assess the methodological validity of machine learning research proposals, revealing persistent optimism bias in current models.
Autonomous AI research agents aim to accelerate scientific discovery by automating the research pipeline, from hypothesis generation to peer review. However, existing benchmarks rarely test a fundamental bottleneck: whether large language models (LLMs) can judge the methodological viability of a research idea before expending time and computational resources. We introduce SoundnessBench, a curated benchmark of 1,099 machine-learning research proposals reconstructed from ICLR submissions, labeled with reviewer soundness sub-scores, and audited against source papers. SoundnessBench should be interpreted as a benchmark for recoverable proposal-stage soundness rather than exact prediction of full-paper review outcomes. Across 12 frontier LLMs, we find a pervasive optimism bias: under standard prompting, models frequently rate low-soundness proposals as sound, while aggressive prompting largely shifts errors from false positives to false negatives. Additional controls for public-corpus contamination, paper-identifying phrases, surface features, and human audit quality suggest that this behavior is not explained by a single confounder. Our results indicate that current LLMs are not yet reliable as standalone first-gate evaluators for scientific rigor.
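The shift from false positives to false negatives described in the abstract can be illustrated with a small sketch. It assumes the task is reduced to a binary sound/unsound decision; the function name `error_rates`, the `ErrorRates` container, and the toy labels are hypothetical and are not taken from the paper, whose exact metrics are not given here.

```python
from dataclasses import dataclass


@dataclass
class ErrorRates:
    false_positive_rate: float  # share of unsound proposals judged sound
    false_negative_rate: float  # share of sound proposals judged unsound


def error_rates(gold_sound: list[bool], predicted_sound: list[bool]) -> ErrorRates:
    """Compute both error rates for a binary sound/unsound judgment."""
    if len(gold_sound) != len(predicted_sound):
        raise ValueError("gold and predicted labels must have the same length")
    pairs = list(zip(gold_sound, predicted_sound))
    fp = sum(1 for g, p in pairs if not g and p)
    fn = sum(1 for g, p in pairs if g and not p)
    negatives = sum(1 for g in gold_sound if not g)
    positives = len(gold_sound) - negatives
    return ErrorRates(
        false_positive_rate=fp / negatives if negatives else 0.0,
        false_negative_rate=fn / positives if positives else 0.0,
    )


# Toy labels invented for illustration only (not benchmark data).
gold = [True, True, False, False, False, False]
lenient = [True, True, True, True, True, False]     # says "sound" too often
strict = [False, True, False, False, False, False]  # says "unsound" too often

lenient_rates = error_rates(gold, lenient)
strict_rates = error_rates(gold, strict)
print(lenient_rates)
print(strict_rates)
```

In this toy setup the lenient judge makes only false-positive errors and the strict judge makes only false-negative errors, which is the kind of trade-off the abstract reports when prompting becomes more aggressive.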
Community
SoundnessBench: Testing Whether LLMs Can Assess the Scientific Soundness of Research Plans
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Beyond Rating: A Comprehensive Evaluation and Benchmark for AI Reviews (2026)
- Teaching Language Models to Check Grounded Claim Factuality with Human Test-Taking Strategies (2026)
- Beyond Output Correctness: Benchmarking and Evaluating Large Language Model Reasoning in Coding Tasks (2026)
- MixRea: Benchmarking Explicit-Implicit Reasoning in Large Language Models (2026)
- ProjectionBench: Evaluating Scientific Hypothesis Generation in LLMs Under Progressive Information Disclosure (2026)
- E3: Issue-Level Backtesting for Automated Research Critique (2026)
- Spurious Prompts: Can Irrelevant Prompts Steer Large Language Models? (2026)