Why Multi-Step Tool-Use Reinforcement Learning Collapses and How Supervisory Signals Fix It • Paper • 2606.26027 • Published 3 days ago • 15
GUI vs. CLI: Execution Bottlenecks in Screen-Only and Skill-Mediated Computer-Use Agents • Paper • 2606.24551 • Published 5 days ago • 25
The Verification Horizon: No Silver Bullet for Coding Agent Rewards • Paper • 2606.26300 • Published 3 days ago • 37
OPID: On-Policy Skill Distillation for Agentic Reinforcement Learning • Paper • 2606.26790 • Published 2 days ago • 40
PlanBench-XL: Evaluating Long-Horizon Planning of LLM Tool-Use Agents in Large-Scale Tool Ecosystems • Paper • 2606.22388 • Published 6 days ago • 95
Qwen-AgentWorld: Language World Models for General Agents • Paper • 2606.24597 • Published 4 days ago • 132
MobileForge: Annotation-Free Adaptation for Mobile GUI Agents with Hierarchical Feedback-Guided Policy Optimization • Paper • 2606.19930 • Published 9 days ago • 42
MemGUI-Agent: An End-to-End Long-Horizon Mobile GUI Agent with Proactive Context Management • Paper • 2606.19926 • Published 9 days ago • 42
Memory is Reconstructed, Not Retrieved: Graph Memory for LLM Agents • Paper • 2606.06036 • Published 23 days ago • 75
EvoBrowseComp: Benchmarking Search Agents on Evolving Knowledge • Paper • 2606.13120 • Published 16 days ago • 4
WeaveBench: A Long-Horizon, Real-World Benchmark for Computer-Use Agents with Hybrid Interfaces • Paper • 2606.09426 • Published 19 days ago • 104
FORT-Searcher: Synthesizing Shortcut-Resistant Search Tasks for Training Deep Search Agents • Paper • 2606.12087 • Published 17 days ago • 77
EvoArena: Tracking Memory Evolution for Robust LLM Agents in Dynamic Environments • Paper • 2606.13681 • Published 16 days ago • 142
ResearchClawBench: A Benchmark for End-to-End Autonomous Scientific Research • Paper • 2606.07591 • Published about 1 month ago • 97
Where Do Deep-Research Agents Go Wrong? Span-Level Error Localization in Agent Trajectories • Paper • 2606.02060 • Published 26 days ago • 57
Diagnosing Harmful Continuation in Answer-Correct Long-CoT Training Traces • Paper • 2605.29288 • Published about 1 month ago • 9
CUA-Gym: Scaling Verifiable Training Environments and Tasks for Computer-Use Agents • Paper • 2605.25624 • Published May 25 • 34