Step-level Optimization for Efficient Computer-use Agents Paper • 2604.27151 • Published 8 days ago • 18
Rethinking Reasoning-Intensive Retrieval: Evaluating and Advancing Retrievers in Agentic Search Systems Paper • 2605.04018 • Published 2 days ago • 12
Rethinking Composed Image Retrieval Evaluation: A Fine-Grained Benchmark from Image Editing Paper • 2601.16125 • Published Jan 22 • 13
Rethinking Composed Image Retrieval Evaluation: A Fine-Grained Benchmark from Image Editing Paper • 2601.16125 • Published Jan 22 • 13
MRMR: A Realistic and Expert-Level Multidisciplinary Benchmark for Reasoning-Intensive Multimodal Retrieval Paper • 2510.09510 • Published Oct 10, 2025 • 8
OpenMMReasoner: Pushing the Frontiers for Multimodal Reasoning with an Open and General Recipe Paper • 2511.16334 • Published Nov 20, 2025 • 96
MRMR: A Realistic and Expert-Level Multidisciplinary Benchmark for Reasoning-Intensive Multimodal Retrieval Paper • 2510.09510 • Published Oct 10, 2025 • 8
MRMR: A Realistic and Expert-Level Multidisciplinary Benchmark for Reasoning-Intensive Multimodal Retrieval Paper • 2510.09510 • Published Oct 10, 2025 • 8 • 2
AbGen: Evaluating Large Language Models in Ablation Study Design and Evaluation for Scientific Research Paper • 2507.13300 • Published Jul 17, 2025 • 20
Efficiency-Effectiveness Reranking FLOPs for LLM-based Rerankers Paper • 2507.06223 • Published Jul 8, 2025 • 14
Can LLMs Identify Critical Limitations within Scientific Research? A Systematic Evaluation on AI Research Papers Paper • 2507.02694 • Published Jul 3, 2025 • 19
MLR-Bench: Evaluating AI Agents on Open-Ended Machine Learning Research Paper • 2505.19955 • Published May 26, 2025 • 13
SciArena: An Open Evaluation Platform for Foundation Models in Scientific Literature Tasks Paper • 2507.01001 • Published Jul 1, 2025 • 46