PlanBench-XL: Evaluating Long-Horizon Planning of LLM Tool-Use Agents in Large-Scale Tool Ecosystems Paper • 2606.22388 • Published 3 days ago • 82
AdaPlanBench: Evaluating Adaptive Planning in Large Language Model Agents under World and User Constraints Paper • 2606.05622 • Published 20 days ago • 43
Advancing Creative Physical Intelligence in Large Multimodal Models Paper • 2605.26396 • Published about 1 month ago • 21
Useful Memories Become Faulty When Continuously Updated by LLMs Paper • 2605.12978 • Published May 13 • 19
CreativityBench: Evaluating Agent Creative Reasoning via Affordance-Based Tool Repurposing Paper • 2605.02910 • Published May 6 • 23
PEARL: Self-Evolving Assistant for Time Management with Reinforcement Learning Paper • 2601.11957 • Published Jan 28 • 3
MedSAM3: Delving into Segment Anything with Medical Concepts Paper • 2511.19046 • Published Nov 24, 2025 • 55
Where LLM Agents Fail and How They can Learn From Failures Paper • 2509.25370 • Published Sep 29, 2025 • 12