arxiv:2605.24213

Towards Evaluation Engineering: An Empirical Study of ML Evaluation Harnesses in the Wild

Published on May 22

Submitted by Zhimin Zhao on May 26

Queen's University

Abstract

Evaluation harnesses are software systems that orchestrate model evaluation by managing model invocation, data loading, metric computation, and result reporting. Despite their critical role in machine learning infrastructure, their operational challenges and engineering concerns have received limited attention. We present an empirical study of 57 evaluation harnesses, deriving a five-stage harness model and classifying 16,560 issues by workflow stage and root cause. Operational challenges concentrate most heavily in the Specification stage (41.4% of issues), where harnesses integrate external models, datasets, and scoring judges. The three most frequent root causes of operational challenges are unimplemented features (24.3%), documentation gaps (20.3%), and missing input validation (17.2%), which together account for 61.7% of classified issues, spanning both defects in existing functionality and capability gaps that block intended workflows. Root causes also vary by workflow stage: environment incompatibility and external dependency breakage account for 36.2% of provisioning issues, whereas algorithmic error (25.9%) and validation gap (22.5%) dominate assessment issues. Together, these contributions establish an empirical foundation for treating evaluation engineering as a distinct software engineering concern.
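As a rough illustration of the four responsibilities named in the abstract (data loading, model invocation, metric computation, result reporting), here is a minimal sketch. Everything in it is hypothetical: `load_examples`, `toy_model`, `exact_match`, and `run_harness` are illustrative names, the dataset is a toy in-memory list, and the sketch is not modeled on any of the 57 studied harnesses.

```python
import json
from typing import Callable, Dict, List

Example = Dict[str, str]


def load_examples() -> List[Example]:
    """Data loading: a toy in-memory dataset (hypothetical)."""
    return [
        {"prompt": "2+2", "target": "4"},
        {"prompt": "capital of France", "target": "Paris"},
        {"prompt": "3*3", "target": "9"},
    ]


def toy_model(prompt: str) -> str:
    """Model invocation: a stand-in for a real model call (hypothetical)."""
    answers = {"2+2": "4", "capital of France": "paris"}
    return answers.get(prompt, "")


def exact_match(prediction: str, target: str) -> float:
    """Metric computation: case-insensitive exact match."""
    return float(prediction.strip().lower() == target.strip().lower())


def run_harness(model: Callable[[str], str]) -> Dict[str, object]:
    """Orchestrate the steps and build a small result record for reporting."""
    examples = load_examples()
    scores = [exact_match(model(ex["prompt"]), ex["target"]) for ex in examples]
    return {
        "metric": "exact_match",
        "n": len(scores),
        "score": sum(scores) / len(scores),
    }


if __name__ == "__main__":
    # Result reporting: emit the record as JSON.
    print(json.dumps(run_harness(toy_model)))
```

Real harnesses replace each function with an integration point (a model API, a dataset loader, a scoring judge), which is where the abstract locates most reported issues.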


Community

zhiminy (paper submitter) · about 21 hours ago · edited about 13 hours ago

🔬 We just dropped a new paper: Towards Evaluation Engineering

If you've ever fought with a broken eval harness, you're not alone — and now there's data to back that up.

Building on our earlier work on Leaderboard Operations (LBOps), which studied how foundation model leaderboards operate in the wild, we zoom in on the infrastructure underneath: evaluation harnesses themselves.

In this new paper (arXiv:2605.24213), we empirically studied 57 evaluation harnesses — from LM Eval and HELM to SWE-bench and Ragas — analyzing 19,638 GitHub issues to understand where developers actually struggle.

Some key findings:

1️⃣ We derive a 5-stage harness workflow:
Provisioning → Specification → Execution → Assessment → Reporting

2️⃣ 41.4% of all issues concentrate in the Specification stage, where harnesses integrate models, datasets, and judges

3️⃣ The top 3 root causes, unimplemented features (24.3%), documentation gaps (20.3%), and missing input validation (17.2%), together account for 61.7% of classified issues

4️⃣ Only 8.8% of harnesses support regression alerting, and only 22.8% support uncertainty quantification
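To make the two capabilities in finding 4 concrete, here is a minimal sketch of uncertainty quantification (a percentile bootstrap confidence interval over per-example scores) and a regression alert built on top of it. The function names `bootstrap_ci` and `is_regression` are hypothetical, and the alert rule (flag only when the two intervals do not overlap) is a deliberately conservative assumption for illustration, not a method taken from the paper or from any studied harness.

```python
import random
from statistics import mean
from typing import Sequence, Tuple


def bootstrap_ci(
    scores: Sequence[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap confidence interval for the mean score."""
    rng = random.Random(seed)  # fixed seed keeps the interval reproducible
    n = len(scores)
    # Resample with replacement and record the mean of each resample.
    means = sorted(mean(rng.choices(scores, k=n)) for _ in range(n_resamples))
    lo_idx = max(0, int((alpha / 2) * n_resamples))
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples) - 1)
    return means[lo_idx], means[hi_idx]


def is_regression(
    baseline: Sequence[float],
    candidate: Sequence[float],
    **kwargs,
) -> bool:
    """Flag a regression only if the candidate's upper bound falls below
    the baseline's lower bound (assumed rule, conservative by design)."""
    base_lo, _ = bootstrap_ci(baseline, **kwargs)
    _, cand_hi = bootstrap_ci(candidate, **kwargs)
    return cand_hi < base_lo


if __name__ == "__main__":
    # Toy per-example scores (1.0 = correct, 0.0 = incorrect); not real data.
    baseline = [1.0] * 90 + [0.0] * 10
    candidate = [1.0] * 50 + [0.0] * 50
    print(bootstrap_ci(baseline))
    print(is_regression(baseline, candidate))
```

Comparing intervals rather than point scores keeps small, noise-level drops from triggering alerts, at the cost of missing regressions that are real but smaller than the sampling noise.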

We argue these findings establish the case for Evaluation Engineering (EvalEng) as a distinct software engineering concern: the operational counterpart to benchmark design, analogous to how DevOps sits alongside software development.

📦 We also released the annotated dataset of ~20k classified GitHub issues on Hugging Face: https://huggingface.co/datasets/zhiminy/EvalEng

Would love to hear from harness maintainers and users: what's your biggest pain point?
