TUA-Bench: A Benchmark for General-Purpose Terminal-Use Agents Paper • 2606.28480 • Published 5 days ago • 40
OSWorld2.0: Benchmarking Computer Use Agents on Long-Horizon Real-World Tasks Paper • 2606.29537 • Published 3 days ago • 12
The Verification Horizon: No Silver Bullet for Coding Agent Rewards Paper • 2606.26300 • Published 7 days ago • 45
Beyond NL2Code: A Structured Survey of Multimodal Code Intelligence Paper • 2606.15932 • Published 15 days ago • 38
Autodata: An agentic data scientist to create high quality synthetic data Paper • 2606.25996 • Published 7 days ago • 18
Qwen-AgentWorld: Language World Models for General Agents Paper • 2606.24597 • Published 8 days ago • 141
EnterpriseClawBench: Benchmarking Agents from Real Workplace Sessions Paper • 2606.23654 • Published 9 days ago • 78
PlanBench-XL: Evaluating Long-Horizon Planning of LLM Tool-Use Agents in Large-Scale Tool Ecosystems Paper • 2606.22388 • Published 10 days ago • 95
CLI-Universe: Towards Verifiable Task Synthesis Engine for Terminal Agents Paper • 2606.22883 • Published 9 days ago • 37
AweAgent Meta-Data Collection Meta-data for AweAgent: https://github.com/AweAI-Team/AweAgent • 5 items • Updated 11 days ago
VibeThinker-3B: Exploring the Frontier of Verifiable Reasoning in Small Language Models Paper • 2606.16140 • Published 16 days ago • 120
BRDFusion: Physics Meets Generation for Urban Scene Inverse Rendering Paper • 2606.17049 • Published 16 days ago • 27
Who Should Lead Decoding Now? Tracking Reliable Trajectories for Ensembling Masked Diffusion Language Models Paper • 2606.16281 • Published 16 days ago • 34