NatureBench: Can Coding Agents Match the Published SOTA of Nature-Family Papers? • Paper • 2606.24530 • Published 1 day ago • 46
Qwen-AgentWorld: Language World Models for General Agents • Paper • 2606.24597 • Published 1 day ago • 74
Deep Research in Physical Sciences: A Multi-Agent Framework and Comprehensive Benchmark • Paper • 2606.18648 • Published 8 days ago • 14
CLI-Universe: Towards Verifiable Task Synthesis Engine for Terminal Agents • Paper • 2606.22883 • Published 3 days ago • 31
ENPIRE: Agentic Robot Policy Self-Improvement in the Real World • Paper • 2606.19980 • Published 7 days ago • 14
iOSWorld: A Benchmark for Personally Intelligent Phone Agents • Paper • 2606.09764 • Published 17 days ago • 3
MyPCBench: A Benchmark for Personally Intelligent Computer-Use Agents • Paper • 2606.16748 • Published 10 days ago • 6
Zone of Proximal Policy Optimization: Teacher in Prompts, Not Gradients • Paper • 2606.18216 • Published 9 days ago • 60
HarnessX: A Composable, Adaptive, and Evolvable Agent Harness Foundry • Paper • 2606.14249 • Published 13 days ago • 47
EurekAgent: Agent Environment Engineering is All You Need For Autonomous Scientific Discovery • Paper • 2606.13662 • Published 14 days ago • 27
EvoArena: Tracking Memory Evolution for Robust LLM Agents in Dynamic Environments • Paper • 2606.13681 • Published 14 days ago • 140
SpatialClaw: Rethinking Action Interface for Agentic Spatial Reasoning • Paper • 2606.13673 • Published 14 days ago • 106
Workflow-GYM: Towards Long-Horizon Evaluation of Computer-use Agentic tasks in Real-World Professional Fields • Paper • 2606.11042 • Published 16 days ago • 21
SWE-Explore: Benchmarking How Coding Agents Explore Repositories • Paper • 2606.07297 • Published 20 days ago • 119