Workflow-GYM: Towards Long-Horizon Evaluation of Computer-use Agentic tasks in Real-World Professional Fields Paper • 2606.11042 • Published 2 days ago • 14
NL2Repo-Bench: Towards Long-Horizon Repository Generation Evaluation of Coding Agents Paper • 2512.12730 • Published Dec 14, 2025 • 52