PlanBench-XL: Evaluating Long-Horizon Planning of LLM Tool-Use Agents in Large-Scale Tool Ecosystems Paper • 2606.22388 • Published 10 days ago • 95 • 3
AdaPlanBench: Evaluating Adaptive Planning in Large Language Model Agents under World and User Constraints Paper • 2606.05622 • Published 27 days ago • 44 • 4
CritiCal: Can Critique Help LLM Uncertainty or Confidence Calibration? Paper • 2510.24505 • Published Oct 28, 2025 • 5 • 2
CostBench: Evaluating Multi-Turn Cost-Optimal Planning and Adaptation in Dynamic Environments for LLM Tool-Use Agents Paper • 2511.02734 • Published Nov 4, 2025 • 23 • 2