PlanBench-XL: Evaluating Long-Horizon Planning of LLM Tool-Use Agents in Large-Scale Tool Ecosystems Paper • 2606.22388 • Published 5 days ago • 91
Parametric Social Identity Injection and Diversification in Public Opinion Simulation Paper • 2603.16142 • Published 25 days ago • 1
Reinforcement Learning from Rich Feedback with Distributional DAgger Paper • 2606.05152 • Published 23 days ago • 3
SubtleMemory: A Benchmark for Fine-Grained Relational Memory Discrimination in Long-Horizon AI Agents Paper • 2606.05761 • Published 22 days ago • 19
AdaPlanBench: Evaluating Adaptive Planning in Large Language Model Agents under World and User Constraints Paper • 2606.05622 • Published 22 days ago • 43
Ψ-Bench: Evaluating Persona-Sensitive Influencing in Persuasive Dialogues Paper • 2606.02754 • Published 25 days ago • 13
GPU Forecasters: Language Models as Selective Surrogates for Kernel Runtime Optimization Paper • 2605.31464 • Published 28 days ago • 2
Tool-R0: Self-Evolving LLM Agents for Tool-Learning from Zero Data Paper • 2602.21320 • Published Feb 24 • 12
ARM-Thinker: Reinforcing Multimodal Generative Reward Models with Agentic Tool Use and Visual Reasoning Paper • 2512.05111 • Published Dec 4, 2025 • 50
ExpertQA: Expert-Curated Questions and Attributed Answers Paper • 2309.07852 • Published Sep 14, 2023 • 2
DebugBench: Evaluating Debugging Capability of Large Language Models Paper • 2401.04621 • Published Jan 9, 2024 • 2
Meta Llama 3 Collection • This collection hosts the transformers and original repos of the Meta Llama 3 and Llama Guard 2 releases • 5 items • Updated Dec 6, 2024 • 975
RecurrentGemma: Moving Past Transformers for Efficient Open Language Models Paper • 2404.07839 • Published Apr 11, 2024 • 49