AERO: Autonomous Evolutionary Reasoning Optimization via Endogenous Dual-Loop Feedback Paper • 2602.03084 • Published Feb 3
OdysseyArena: Benchmarking Large Language Models For Long-Horizon, Active and Inductive Interactions Paper • 2602.05843 • Published Feb 5 • 60
TIDE: Trajectory-based Diagnostic Evaluation of Test-Time Improvement in LLM Agents Paper • 2602.02196 • Published Feb 2 • 35
A^3-Bench: Benchmarking Memory-Driven Scientific Reasoning via Anchor and Attractor Activation Paper • 2601.09274 • Published Jan 14 • 85
Deliberation on Priors: Trustworthy Reasoning of Large Language Models on Knowledge Graphs Paper • 2505.15210 • Published May 21, 2025 • 18
Debate on Graph: a Flexible and Reliable Reasoning Framework for Large Language Models Paper • 2409.03155 • Published Sep 5, 2024 • 2
FortisAVQA and MAVEN: a Benchmark Dataset and Debiasing Framework for Robust Multimodal Reasoning Paper • 2504.00487 • Published Apr 1, 2025 • 18
MAPS: A Multi-Agent Framework Based on Big Seven Personality and Socratic Guidance for Multimodal Scientific Problem Solving Paper • 2503.16905 • Published Mar 21, 2025 • 54