Human Psychometric Questionnaires Mischaracterize LLM Behavior • Paper • 2509.10078 • Published 27 days ago • 36
SearchSwarm: Towards Delegation Intelligence in Agentic LLMs for Long-Horizon Deep Research • Paper • 2606.09730 • Published 17 days ago • 52
Where Do Deep-Research Agents Go Wrong? Span-Level Error Localization in Agent Trajectories • Paper • 2606.02060 • Published 24 days ago • 55
K-BrowseComp: A Web Browsing Agent Benchmark Grounded in Korean Contexts • Paper • 2606.02404 • Published 24 days ago • 57
COLLEAGUE.SKILL: Automated AI Skill Generation via Expert Knowledge Distillation • Paper • 2605.31264 • Published 27 days ago • 118
Harness-1: Reinforcement Learning for Search Agents with State-Externalizing Harnesses • Paper • 2606.02373 • Published 24 days ago • 57
KVarN: Variance-Normalized KV-Cache Quantization Mitigates Error Accumulation in Reasoning Tasks • Paper • 2606.03458 • Published 23 days ago • 65
MathNet: a Global Multimodal Benchmark for Mathematical Reasoning and Retrieval • Paper • 2604.18584 • Published Apr 20 • 15