MAS-Orchestra: Understanding and Improving Multi-Agent Reasoning Through Holistic Orchestration and Controlled Benchmarks Paper • 2601.14652 • Published 6 days ago • 1
FARE Collection • FARE are Salesforce AI Research's open multi-task evaluator models. • 4 items • Updated Oct 31, 2025 • 2
Foundational Automatic Evaluators: Scaling Multi-Task Generative Evaluator Training for Reasoning-Centric Domains Paper • 2510.17793 • Published Oct 20, 2025 • 3
LiveResearchBench: A Live Benchmark for User-Centric Deep Research in the Wild Paper • 2510.14240 • Published Oct 16, 2025 • 12
Hard2Verify: A Step-Level Verification Benchmark for Open-Ended Frontier Math Paper • 2510.13744 • Published Oct 15, 2025 • 6
J4R: Learning to Judge with Equivalent Initial State Group Relative Policy Optimization Paper • 2505.13346 • Published May 19, 2025 • 2