NGM: A Plug-and-Play Training-Free Memory Module for LLMs Paper • 2605.16893 • Published 4 days ago • 8
TOBench: A Task-Oriented Omni-Modal Benchmark for Real-World Tool-Using Agents Paper • 2605.16909 • Published 4 days ago • 6
DR^{3}-Eval: Towards Realistic and Reproducible Deep Research Evaluation Paper • 2604.14683 • Published Apr 16 • 36
Rethinking Generalization in Reasoning SFT: A Conditional Analysis on Optimization, Data, and Model Capability Paper • 2604.06628 • Published Apr 8 • 324
AgentHallu: Benchmarking Automated Hallucination Attribution of LLM-based Agents Paper • 2601.06818 • Published Jan 11 • 1
AgentDoG: A Diagnostic Guardrail Framework for AI Agent Safety and Security Paper • 2601.18491 • Published Jan 26 • 125
Controlled Self-Evolution for Algorithmic Code Optimization Paper • 2601.07348 • Published Jan 12 • 115