CoffeeBench: Benchmarking Long-Horizon LLM Agents in Heterogeneous Multi-Agent Economies Paper • 2606.16613 • Published 14 days ago • 8
The Verification Horizon: No Silver Bullet for Coding Agent Rewards Paper • 2606.26300 • Published 5 days ago • 39
OPID: On-Policy Skill Distillation for Agentic Reinforcement Learning Paper • 2606.26790 • Published 4 days ago • 45
Qwen-AgentWorld: Language World Models for General Agents Paper • 2606.24597 • Published 6 days ago • 136
Learning from Your Own Mistakes: Constructing Learnable Micro-Reflective Trajectories for Self-Distillation Paper • 2606.18844 • Published 12 days ago • 18
EnterpriseClawBench: Benchmarking Agents from Real Workplace Sessions Paper • 2606.23654 • Published 7 days ago • 80
PlanBench-XL: Evaluating Long-Horizon Planning of LLM Tool-Use Agents in Large-Scale Tool Ecosystems Paper • 2606.22388 • Published 8 days ago • 95
EvoEmbedding: Evolvable Representations for Long-Context Retrieval and Agentic Memory Paper • 2606.21649 • Published 10 days ago • 32
SkillHarness: Harnessing Safe Skills for Computer-Use Agents Paper • 2606.20636 • Published 27 days ago • 20
Notes2Skills: From Lab Notebooks to Certainty-Aware Scientific Agent Skills Paper • 2606.11897 • Published 19 days ago • 11
EvoTrainer: Co-Evolving LLM Policies and Training Harnesses for Autonomous Agentic Reinforcement Learning Paper • 2606.03108 • Published 27 days ago • 11
Toward Generalist Autonomous Research via Hypothesis-Tree Refinement Paper • 2606.11926 • Published 19 days ago • 120
Claw-SWE-Bench: A Benchmark for Evaluating OpenClaw-style Agent Harnesses on Coding Tasks Paper • 2606.12344 • Published 19 days ago • 70
EvoArena: Tracking Memory Evolution for Robust LLM Agents in Dynamic Environments Paper • 2606.13681 • Published 18 days ago • 142
HarnessBridge: Learnable Bidirectional Controller for LLM Agent Harness Paper • 2606.12882 • Published 18 days ago • 13