ArenaRL: Scaling RL for Open-Ended Agents via Tournament-based Relative Ranking Paper • 2601.06487 • Published Jan 10 • 53
GUI-360: A Comprehensive Dataset and Benchmark for Computer-Using Agents Paper • 2511.04307 • Published Nov 6, 2025 • 15
The Tool Decathlon: Benchmarking Language Agents for Diverse, Realistic, and Long-Horizon Task Execution Paper • 2510.25726 • Published Oct 29, 2025 • 46
MM-HELIX: Boosting Multimodal Long-Chain Reflective Reasoning with Holistic Platform and Adaptive Hybrid Policy Optimization Paper • 2510.08540 • Published Oct 9, 2025 • 109
Introducing smolagents: simple agents that write actions in code. Article • Dec 31, 2024 • 1.18k