CHI-Bench: Can AI Agents Automate End-to-End, Long-Horizon, Policy-Rich Healthcare Workflows? Paper • 2605.16679 • Published May 15 • 55
MLR-Bench: Evaluating AI Agents on Open-Ended Machine Learning Research Paper • 2505.19955 • Published May 26, 2025 • 14