P-FOLIO: Evaluating and Improving Logical Reasoning with Abundant Human-Written Reasoning Chains Paper • 2410.09207 • Published Oct 11, 2024
Analyzing Diffusion and Autoregressive Vision Language Models in Multimodal Embedding Space Paper • 2602.06056 • Published Jan 19
TexOCR: Advancing Document OCR Models for Compilable Page-to-LaTeX Reconstruction Paper • 2604.22880 • Published 14 days ago • 9
A Survey of Multimodal Mathematical Reasoning: From Perception, Alignment to Reasoning Paper • 2603.08291 • Published 24 days ago
Step-level Optimization for Efficient Computer-use Agents Paper • 2604.27151 • Published 9 days ago • 18
Rethinking Reasoning-Intensive Retrieval: Evaluating and Advancing Retrievers in Agentic Search Systems Paper • 2605.04018 • Published 3 days ago • 28
RbtAct: Rebuttal as Supervision for Actionable Review Feedback Generation Paper • 2603.09723 • Published Mar 10 • 7
AbGen: Evaluating Large Language Models in Ablation Study Design and Evaluation for Scientific Research Paper • 2507.13300 • Published Jul 17, 2025 • 20
PuzzlePlex: Benchmarking Foundation Models on Reasoning and Planning with Puzzles Paper • 2510.06475 • Published Oct 7, 2025 • 2
MSRS: Evaluating Multi-Source Retrieval-Augmented Generation Paper • 2508.20867 • Published Aug 28, 2025
FinLFQA: Evaluating Attributed Text Generation of LLMs in Financial Long-Form Question Answering Paper • 2510.06426 • Published Oct 7, 2025 • 3
SUCEA: Reasoning-Intensive Retrieval for Adversarial Fact-checking through Claim Decomposition and Editing Paper • 2506.04583 • Published Jun 5, 2025
FinDVer: Explainable Claim Verification over Long and Hybrid-Content Financial Documents Paper • 2411.05764 • Published Nov 8, 2024
MRMR: A Realistic and Expert-Level Multidisciplinary Benchmark for Reasoning-Intensive Multimodal Retrieval Paper • 2510.09510 • Published Oct 10, 2025 • 8
FinTrust: A Comprehensive Benchmark of Trustworthiness Evaluation in Finance Domain Paper • 2510.15232 • Published Oct 17, 2025 • 6
LimRank: Less is More for Reasoning-Intensive Information Reranking Paper • 2510.23544 • Published Oct 27, 2025 • 9
Measuring what Matters: Construct Validity in Large Language Model Benchmarks Paper • 2511.04703 • Published Nov 3, 2025 • 8
AlphaResearch: Accelerating New Algorithm Discovery with Language Models Paper • 2511.08522 • Published Nov 11, 2025 • 18