PlanBench-XL: Evaluating Long-Horizon Planning of LLM Tool-Use Agents in Large-Scale Tool Ecosystems Paper • 2606.22388 • Published 4 days ago • 85
Parametric Social Identity Injection and Diversification in Public Opinion Simulation Paper • 2603.16142 • Published 24 days ago • 1
Reinforcement Learning from Rich Feedback with Distributional DAgger Paper • 2606.05152 • Published 22 days ago • 3
SubtleMemory: A Benchmark for Fine-Grained Relational Memory Discrimination in Long-Horizon AI Agents Paper • 2606.05761 • Published 21 days ago • 19
AdaPlanBench: Evaluating Adaptive Planning in Large Language Model Agents under World and User Constraints Paper • 2606.05622 • Published 21 days ago • 43
Ψ-Bench: Evaluating Persona-Sensitive Influencing in Persuasive Dialogues Paper • 2606.02754 • Published 23 days ago • 13
GPU Forecasters: Language Models as Selective Surrogates for Kernel Runtime Optimization Paper • 2605.31464 • Published 27 days ago • 2