Rethinking Composed Image Retrieval Evaluation: A Fine-Grained Benchmark from Image Editing Paper • 2601.16125 • Published 13 days ago • 13
Patient-Similarity Cohort Reasoning in Clinical Text-to-SQL Paper • 2601.09876 • Published 21 days ago • 6
LimRank: Less is More for Reasoning-Intensive Information Reranking Paper • 2510.23544 • Published Oct 27, 2025 • 9
MRMR: A Realistic and Expert-Level Multidisciplinary Benchmark for Reasoning-Intensive Multimodal Retrieval Paper • 2510.09510 • Published Oct 10, 2025 • 8
FinTrust: A Comprehensive Benchmark of Trustworthiness Evaluation in Finance Domain Paper • 2510.15232 • Published Oct 17, 2025 • 6
FinLFQA: Evaluating Attributed Text Generation of LLMs in Financial Long-Form Question Answering Paper • 2510.06426 • Published Oct 7, 2025 • 3
KnowledgeMath: Knowledge-Intensive Math Word Problem Solving in Finance Domains Paper • 2311.09797 • Published Nov 16, 2023 • 1
DocMath-Eval: Evaluating Numerical Reasoning Capabilities of LLMs in Understanding Long Documents with Tabular Data Paper • 2311.09805 • Published Nov 16, 2023 • 3
MMVU: Measuring Expert-Level Multi-Discipline Video Understanding Paper • 2501.12380 • Published Jan 21, 2025 • 84
PuzzlePlex: Benchmarking Foundation Models on Reasoning and Planning with Puzzles Paper • 2510.06475 • Published Oct 7, 2025 • 2
Can LLMs Identify Critical Limitations within Scientific Research? A Systematic Evaluation on AI Research Papers Paper • 2507.02694 • Published Jul 3, 2025 • 19
Can Multimodal Foundation Models Understand Schematic Diagrams? An Empirical Study on Information-Seeking QA over Scientific Papers Paper • 2507.10787 • Published Jul 14, 2025 • 13
Efficiency-Effectiveness Reranking FLOPs for LLM-based Rerankers Paper • 2507.06223 • Published Jul 8, 2025 • 14
SciArena: An Open Evaluation Platform for Foundation Models in Scientific Literature Tasks Paper • 2507.01001 • Published Jul 1, 2025 • 46