FinSearchComp: Towards a Realistic, Expert-Level Evaluation of Financial Search and Reasoning Paper • 2509.13160 • Published Sep 16, 2025 • 29
FutureX: An Advanced Live Benchmark for LLM Agents in Future Prediction Paper • 2508.11987 • Published Aug 16, 2025 • 71
Seeing, Listening, Remembering, and Reasoning: A Multimodal Agent with Long-Term Memory Paper • 2508.09736 • Published Aug 13, 2025 • 58
WideSearch: Benchmarking Agentic Broad Info-Seeking Paper • 2508.07999 • Published Aug 11, 2025 • 110
Efficient Agents: Building Effective Agents While Reducing Cost Paper • 2508.02694 • Published Jul 24, 2025 • 86
Seed-Prover: Deep and Broad Reasoning for Automated Theorem Proving Paper • 2507.23726 • Published Jul 31, 2025 • 115
A Comprehensive Survey on Long Context Language Modeling Paper • 2503.17407 • Published Mar 20, 2025 • 49
Video SimpleQA: Towards Factuality Evaluation in Large Video Language Models Paper • 2503.18923 • Published Mar 24, 2025 • 14
CodeCriticBench: A Holistic Code Critique Benchmark for Large Language Models Paper • 2502.16614 • Published Feb 23, 2025 • 27
OmniEdit: Building Image Editing Generalist Models Through Specialist Supervision Paper • 2411.07199 • Published Nov 11, 2024 • 50
FuzzCoder: Byte-level Fuzzing Test via Large Language Model Paper • 2409.01944 • Published Sep 3, 2024 • 45
TableBench: A Comprehensive and Complex Benchmark for Table Question Answering Paper • 2408.09174 • Published Aug 17, 2024 • 52