MiroEval: Benchmarking Multimodal Deep Research Agents in Process and Outcome • Paper • 2603.28407 • Published 3 days ago • 49
MLR-Bench: Evaluating AI Agents on Open-Ended Machine Learning Research • Paper • 2505.19955 • Published May 26, 2025 • 14
ConfTuner: Training Large Language Models to Express Their Confidence Verbally • Paper • 2508.18847 • Published Aug 26, 2025 • 2