Verus-SpecGym: An Agentic Environment for Evaluating Specification Autoformalization Paper • 2605.26457 • Published 30 days ago • 7
K-BrowseComp: A Web Browsing Agent Benchmark Grounded in Korean Contexts Paper • 2606.02404 • Published 24 days ago • 57
On the limits and opportunities of AI reviewers: Reviewing the reviews of Nature-family papers with 45 expert scientists Paper • 2605.20668 • Published May 20 • 12
VibeProteinBench: An Evaluation Benchmark for Language-interfaced Vibe Protein Design Paper • 2605.10978 • Published May 13 • 19
Reasoning over mathematical objects: on-policy reward modeling and test time aggregation Paper • 2603.18886 • Published Mar 19 • 6
Soohak: A Mathematician-Curated Benchmark for Evaluating Research-level Math Capabilities of LLMs Paper • 2605.09063 • Published May 9 • 82
Measuring Sycophancy of Language Models in Multi-turn Dialogues Paper • 2505.23840 • Published May 28, 2025 • 3
Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning Paper • 2507.00432 • Published Jul 1, 2025 • 79