What Users Leave Unsaid: Under-Specified Queries Limit Vision-Language Models Paper • 2601.06165 • Published Jan 7 • 16
Reasoning over mathematical objects: on-policy reward modeling and test time aggregation Paper • 2603.18886 • Published Mar 19 • 6
KMMMU: Evaluation of Massive Multi-discipline Multimodal Understanding in Korean Language and Context Paper • 2604.13058 • Published Mar 18 • 1
Soohak: A Mathematician-Curated Benchmark for Evaluating Research-level Math Capabilities of LLMs Paper • 2605.09063 • Published 10 days ago • 77
Soohak: A Mathematician-Curated Benchmark for Evaluating Research-level Math Capabilities of LLMs Paper • 2605.09063 • Published 10 days ago • 77
Soohak: A Mathematician-Curated Benchmark for Evaluating Research-level Math Capabilities of LLMs Paper • 2605.09063 • Published 10 days ago • 77
XL-SafetyBench: A Country-Grounded Cross-Cultural Benchmark for LLM Safety and Cultural Sensitivity Paper • 2605.05662 • Published 12 days ago • 11
XL-SafetyBench: A Country-Grounded Cross-Cultural Benchmark for LLM Safety and Cultural Sensitivity Paper • 2605.05662 • Published 12 days ago • 11
Judging What We Cannot Solve: A Consequence-Based Approach for Oracle-Free Evaluation of Research-Level Math Paper • 2602.06291 • Published Feb 6 • 24
Judging What We Cannot Solve: A Consequence-Based Approach for Oracle-Free Evaluation of Research-Level Math Paper • 2602.06291 • Published Feb 6 • 24
What Users Leave Unsaid: Under-Specified Queries Limit Vision-Language Models Paper • 2601.06165 • Published Jan 7 • 16
What Users Leave Unsaid: Under-Specified Queries Limit Vision-Language Models Paper • 2601.06165 • Published Jan 7 • 16
Global PIQA: Evaluating Physical Commonsense Reasoning Across 100+ Languages and Cultures Paper • 2510.24081 • Published Oct 28, 2025 • 24