Beyond Length Scaling: Synergizing Breadth and Depth for Generative Reward Models Paper • 2603.01571 • Published 12 days ago • 33
RubricBench: Aligning Model-Generated Rubrics with Human Standards Paper • 2603.01562 • Published 12 days ago • 57
AOrchestra: Automating Sub-Agent Creation for Agentic Orchestration Paper • 2602.03786 • Published Feb 3 • 87
DeepResearchEval: An Automated Framework for Deep Research Task Construction and Agentic Evaluation Paper • 2601.09688 • Published Jan 14 • 127
AutoEnv: Automated Environments for Measuring Cross-Environment Agent Learning Paper • 2511.19304 • Published Nov 24, 2025 • 91
Scaling Latent Reasoning via Looped Language Models Paper • 2510.25741 • Published Oct 29, 2025 • 229
ReCode: Unify Plan and Action for Universal Granularity Control Paper • 2510.23564 • Published Oct 27, 2025 • 122
ScienceBoard: Evaluating Multimodal Autonomous Agents in Realistic Scientific Workflows Paper • 2505.19897 • Published May 26, 2025 • 104