Scaling Computer-Use Grounding via User Interface Decomposition and Synthesis
Paper • 2505.13227 • Published • 45

facebook/natural_reasoning
Viewer • Updated • 1.15M • 1.49k • 549
Search Arena: Analyzing Search-Augmented LLMs
Paper • 2506.05334 • Published • 17

OneIG-Bench: Omni-dimensional Nuanced Evaluation for Image Generation
Paper • 2506.07977 • Published • 41
LiveCodeBench Pro: How Do Olympiad Medalists Judge LLMs in Competitive Programming?
Paper • 2506.11928 • Published • 24

SciVer: Evaluating Foundation Models for Multimodal Scientific Claim Verification
Paper • 2506.15569 • Published • 12

MultiFinBen: A Multilingual, Multimodal, and Difficulty-Aware Benchmark for Financial LLM Evaluation
Paper • 2506.14028 • Published • 93

DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents
Paper • 2506.11763 • Published • 73

VIKI-R: Coordinating Embodied Multi-Agent Cooperation via Reinforcement Learning
Paper • 2506.09049 • Published • 37
Can LLMs Identify Critical Limitations within Scientific Research? A Systematic Evaluation on AI Research Papers
Paper • 2507.02694 • Published • 19

REST: Stress Testing Large Reasoning Models by Asking Multiple Problems at Once
Paper • 2507.10541 • Published • 30

HuggingFaceTB/SmolLM3-3B-Base
Text Generation • 3B • Updated • 16.3k • 148

AgentsNet: Coordination and Collaborative Reasoning in Multi-Agent LLMs
Paper • 2507.08616 • Published • 15

The Generative Energy Arena (GEA): Incorporating Energy Awareness in Large Language Model (LLM) Human Evaluations
Paper • 2507.13302 • Published • 5
AbGen: Evaluating Large Language Models in Ablation Study Design and Evaluation for Scientific Research
Paper • 2507.13300 • Published • 20

DrafterBench: Benchmarking Large Language Models for Tasks Automation in Civil Engineering
Paper • 2507.11527 • Published • 35

Can Multimodal Foundation Models Understand Schematic Diagrams? An Empirical Study on Information-Seeking QA over Scientific Papers
Paper • 2507.10787 • Published • 13

WideSearch: Benchmarking Agentic Broad Info-Seeking
Paper • 2508.07999 • Published • 110

MM-BrowseComp: A Comprehensive Benchmark for Multimodal Browsing Agents
Paper • 2508.13186 • Published • 19

AetherCode: Evaluating LLMs' Ability to Win In Premier Programming Competitions
Paper • 2508.16402 • Published • 14

MCP-Universe: Benchmarking Large Language Models with Real-World Model Context Protocol Servers
Paper • 2508.14704 • Published • 43

NVIDIA Nemotron Nano 2: An Accurate and Efficient Hybrid Mamba-Transformer Reasoning Model
Paper • 2508.14444 • Published • 40

UQ: Assessing Language Models on Unsolved Questions
Paper • 2508.17580 • Published • 15

T2I-ReasonBench: Benchmarking Reasoning-Informed Text-to-Image Generation
Paper • 2508.17472 • Published • 26

ReportBench: Evaluating Deep Research Agents via Academic Survey Tasks
Paper • 2508.15804 • Published • 15

MCP-Bench: Benchmarking Tool-Using LLM Agents with Complex Real-World Tasks via MCP Servers
Paper • 2508.20453 • Published • 63

DeepResearch Arena: The First Exam of LLMs' Research Abilities via Seminar-Grounded Tasks
Paper • 2509.01396 • Published • 58
On Robustness and Reliability of Benchmark-Based Evaluation of LLMs
Paper • 2509.04013 • Published • 4

SimpleQA Verified: A Reliable Factuality Benchmark to Measure Parametric Knowledge
Paper • 2509.07968 • Published • 14

LoCoBench: A Benchmark for Long-Context Large Language Models in Complex Software Engineering
Paper • 2509.09614 • Published • 7

GenExam: A Multidisciplinary Text-to-Image Exam
Paper • 2509.14232 • Published • 21

ToolComp: A Multi-Tool Reasoning & Process Supervision Benchmark
Paper • 2501.01290 • Published
MCPMark: A Benchmark for Stress-Testing Realistic and Comprehensive MCP Use
Paper • 2509.24002 • Published • 174

OceanGym: A Benchmark Environment for Underwater Embodied Agents
Paper • 2509.26536 • Published • 36

PhysToolBench: Benchmarking Physical Tool Understanding for MLLMs
Paper • 2510.09507 • Published • 11

PICABench: How Far Are We from Physically Realistic Image Editing?
Paper • 2510.17681 • Published • 64

LiveTradeBench: Seeking Real-World Alpha with Large Language Models
Paper • 2511.03628 • Published • 13

Reasoning via Video: The First Evaluation of Video Models' Reasoning Abilities through Maze-Solving Tasks
Paper • 2511.15065 • Published • 77

M3-Bench: Multi-Modal, Multi-Hop, Multi-Threaded Tool-Using MLLM Agent Benchmark
Paper • 2511.17729 • Published • 17

Does Understanding Inform Generation in Unified Multimodal Models? From Analysis to Path Forward
Paper • 2511.20561 • Published • 32

RefineBench: Evaluating Refinement Capability of Language Models via Checklists
Paper • 2511.22173 • Published • 15

DAComp: Benchmarking Data Agents across the Full Data Intelligence Lifecycle
Paper • 2512.04324 • Published • 155

NL2Repo-Bench: Towards Long-Horizon Repository Generation Evaluation of Coding Agents
Paper • 2512.12730 • Published • 44

OpenDataArena: A Fair and Open Arena for Benchmarking Post-Training Dataset Value
Paper • 2512.14051 • Published • 45

MMGR: Multi-Modal Generative Reasoning
Paper • 2512.14691 • Published • 117

MobileWorld: Benchmarking Autonomous Mobile Agents in Agent-User Interactive, and MCP-Augmented Environments
Paper • 2512.19432 • Published • 13

FrontierCS: Evolving Challenges for Evolving Intelligence
Paper • 2512.15699 • Published • 5

GRAN-TED: Generating Robust, Aligned, and Nuanced Text Embedding for Diffusion Models
Paper • 2512.15560 • Published • 25

Benchmark^2: Systematic Evaluation of LLM Benchmarks
Paper • 2601.03986 • Published • 34