Trimming the Long-Tail of Visual World Modeling Evaluation Paper • 2606.24256 • Published 8 days ago • 34
GBC: Gradient-Based Connections for Optimizing Multi-Agent Systems Paper • 2606.28187 • Published 5 days ago • 10
BioInsight: Multi-Agent Orchestration for Interactive Biomedical Knowledge Discovery Paper • 2606.20997 • Published 12 days ago • 3
PlanBench-XL: Evaluating Long-Horizon Planning of LLM Tool-Use Agents in Large-Scale Tool Ecosystems Paper • 2606.22388 • Published 10 days ago • 95
GeoBrowse: A Geolocation Benchmark for Agentic Tool Use with Expert-Annotated Reasoning Traces Paper • 2604.04017 • Published Apr 5 • 8
Brick-Composer: Using MLLMs for Assembly with Diverse Bricks Paper • 2606.05445 • Published 28 days ago • 8
AdaPlanBench: Evaluating Adaptive Planning in Large Language Model Agents under World and User Constraints Paper • 2606.05622 • Published 27 days ago • 44
Ψ-Bench: Evaluating Persona-Sensitive Influencing in Persuasive Dialogues Paper • 2606.02754 • Published 29 days ago • 13
AgentDoG 1.5: A Lightweight and Scalable Alignment Framework for AI Agent Safety and Security Paper • 2605.29801 • Published May 28 • 144
Advancing Creative Physical Intelligence in Large Multimodal Models Paper • 2605.26396 • Published May 25 • 21
Model-Adaptive Tool Necessity Reveals the Knowing-Doing Gap in LLM Tool Use Paper • 2605.14038 • Published May 13 • 15
CreativityBench: Evaluating Agent Creative Reasoning via Affordance-Based Tool Repurposing Paper • 2605.02910 • Published May 6 • 23
Code2Math: Can Your Code Agent Effectively Evolve Math Problems Through Exploration? Paper • 2603.03202 • Published Mar 3 • 18
Dancing in Chains: Strategic Persuasion in Academic Rebuttal via Theory of Mind Paper • 2601.15715 • Published Jan 22 • 14
NAACL: Noise-AwAre Verbal Confidence Calibration for LLMs in RAG Systems Paper • 2601.11004 • Published Jan 16 • 31