PlanBench-XL: Evaluating Long-Horizon Planning of LLM Tool-Use Agents in Large-Scale Tool Ecosystems Paper • 2606.22388 • Published 4 days ago • 87
Qwen-AgentWorld: Language World Models for General Agents Paper • 2606.24597 • Published 2 days ago • 98
MobileForge: Annotation-Free Adaptation for Mobile GUI Agents with Hierarchical Feedback-Guided Policy Optimization Paper • 2606.19930 • Published 7 days ago • 35
MemGUI-Agent: An End-to-End Long-Horizon Mobile GUI Agent with Proactive Context Management Paper • 2606.19926 • Published 7 days ago • 34
Memory is Reconstructed, Not Retrieved: Graph Memory for LLM Agents Paper • 2606.06036 • Published 21 days ago • 73
EvoBrowseComp: Benchmarking Search Agents on Evolving Knowledge Paper • 2606.13120 • Published 14 days ago • 4
WeaveBench: A Long-Horizon, Real-World Benchmark for Computer-Use Agents with Hybrid Interfaces Paper • 2606.09426 • Published 17 days ago • 102
FORT-Searcher: Synthesizing Shortcut-Resistant Search Tasks for Training Deep Search Agents Paper • 2606.12087 • Published 15 days ago • 75
EvoArena: Tracking Memory Evolution for Robust LLM Agents in Dynamic Environments Paper • 2606.13681 • Published 14 days ago • 140
ResearchClawBench: A Benchmark for End-to-End Autonomous Scientific Research Paper • 2606.07591 • Published 28 days ago • 95
Where Do Deep-Research Agents Go Wrong? Span-Level Error Localization in Agent Trajectories Paper • 2606.02060 • Published 24 days ago • 55
Diagnosing Harmful Continuation in Answer-Correct Long-CoT Training Traces Paper • 2605.29288 • Published 28 days ago • 9
CUA-Gym: Scaling Verifiable Training Environments and Tasks for Computer-Use Agents Paper • 2605.25624 • Published May 25 • 34
OpenComputer: Verifiable Software Worlds for Computer-Use Agents Paper • 2605.19769 • Published May 19 • 85
Anti-Self-Distillation for Reasoning RL via Pointwise Mutual Information Paper • 2605.11609 • Published May 12 • 196
AutoResearchClaw: Self-Reinforcing Autonomous Research with Human-AI Collaboration Paper • 2605.20025 • Published May 19 • 190
WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation Paper • 2605.10912 • Published May 11 • 46