XpertBench: Expert-Level Tasks with Rubric-Based Evaluation
Abstract
XpertBench is a comprehensive benchmark for evaluating large language models across professional domains, using expert-curated tasks and a novel LLM-based evaluation approach called ShotJudge.
As Large Language Models (LLMs) exhibit plateauing performance on conventional benchmarks, a pivotal challenge persists: evaluating their proficiency on the complex, open-ended tasks that characterize genuine expert-level work. Existing frameworks suffer from narrow domain coverage, reliance on generalist tasks, or self-evaluation biases. To bridge this gap, we present XpertBench, a high-fidelity benchmark engineered to assess LLMs across authentic professional domains. XpertBench consists of 1,346 meticulously curated tasks across 80 categories, spanning finance, healthcare, legal services, education, and dual-track research (STEM and Humanities). These tasks are derived from over 1,000 submissions by domain experts, including researchers from elite institutions and practitioners with extensive clinical or industrial experience, ensuring superior ecological validity. Each task is paired with a detailed rubric, typically comprising 15-40 weighted checkpoints, to assess professional rigor. To facilitate scalable yet human-aligned assessment, we introduce ShotJudge, a novel evaluation paradigm that employs LLM judges calibrated with expert few-shot exemplars to mitigate self-rewarding biases. Our empirical evaluation of state-of-the-art LLMs reveals a pronounced performance ceiling: even leading models achieve a peak success rate of only ~66%, with a mean score around 55%. Models also exhibit domain-specific divergence, showing non-overlapping strengths in quantitative reasoning versus linguistic synthesis. These findings underscore a significant "expert gap" in current AI systems and establish XpertBench as a critical instrument for navigating the transition from general-purpose assistants to specialized professional collaborators.
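The abstract describes each task's rubric as a set of weighted checkpoints whose verdicts are aggregated into a task score by an LLM judge. As a rough illustration of how such a rubric could be scored, here is a minimal Python sketch; the data structures, field names, and aggregation rule are assumptions made for exposition and are not taken from the XpertBench release.

```python
# Minimal sketch of weighted-checkpoint rubric scoring (illustrative only;
# names and the aggregation rule are assumptions, not XpertBench's code).

from dataclasses import dataclass

@dataclass
class Checkpoint:
    description: str   # what the expert rubric requires, e.g. "cites the correct statute"
    weight: float      # relative importance assigned by the domain expert
    passed: bool       # verdict from an LLM judge (e.g. a ShotJudge-style, few-shot-calibrated judge)

def rubric_score(checkpoints: list[Checkpoint]) -> float:
    """Return the weighted fraction of rubric checkpoints satisfied, in [0.0, 1.0]."""
    total = sum(cp.weight for cp in checkpoints)
    earned = sum(cp.weight for cp in checkpoints if cp.passed)
    return earned / total if total > 0 else 0.0

# Example: a task graded against three of its 15-40 rubric checkpoints.
example = [
    Checkpoint("identifies the governing regulation", 3.0, True),
    Checkpoint("quantifies the financial exposure", 2.0, False),
    Checkpoint("states the key assumption explicitly", 1.0, True),
]
print(f"Task score: {rubric_score(example):.2f}")  # -> 0.67
```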
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- $OneMillion-Bench: How Far are Language Agents from Human Experts? (2026)
- RubricBench: Aligning Model-Generated Rubrics with Human Standards (2026)
- QuarkMedBench: A Real-World Scenario Driven Benchmark for Evaluating Large Language Models (2026)
- LiveMedBench: A Contamination-Free Medical Benchmark for LLMs with Automated Rubric Evaluation (2026)
- Super Research: Answering Highly Complex Questions with Large Language Models through Super Deep and Super Wide Research (2026)
- FIRE: A Comprehensive Benchmark for Financial Intelligence and Reasoning Evaluation (2026)
- VTC-Bench: Evaluating Agentic Multimodal Models via Compositional Visual Tool Chaining (2026)