arxiv:2605.24213

Towards Evaluation Engineering: An Empirical Study of ML Evaluation Harnesses in the Wild

Published on May 22 · Submitted by Zhimin Zhao on May 26

Abstract

Evaluation harnesses are software systems that orchestrate model evaluation by managing model invocation, data loading, metric computation, and result reporting. Despite their critical role in machine learning infrastructure, their operational challenges and engineering concerns have received limited attention so far. We present an empirical study of 57 evaluation harnesses, deriving a five-stage harness model and classifying 16,560 issues by workflow stage and root cause. Most harness operational challenges concentrate in the Specification stage (41.4% of issues), where harnesses integrate external models, datasets, and scoring judges. The three most frequent root causes of operational challenges are unimplemented features (24.3%), documentation gaps (20.3%), and missing input validation (17.2%), which together account for 61.7% of classified issues, spanning both defects in existing functionality and capability gaps that block intended workflows. Root causes also vary by workflow stage: environment incompatibility and external dependency breakage account for 36.2% of provisioning issues, whereas algorithmic error (25.9%) and validation gap (22.5%) dominate assessment issues. Together, these contributions establish an empirical foundation for treating evaluation engineering as a distinct software engineering concern.

Community

🔬 We just dropped a new paper: Towards Evaluation Engineering

If you've ever fought with a broken eval harness, you're not alone, and now there's data to back that up.

Building on our earlier work on Leaderboard Operations (LBOps), which studied how foundation model leaderboards operate in the wild, we zoom in on the infrastructure underneath: evaluation harnesses themselves.

In this new paper (arXiv:2605.24213), we empirically studied 57 evaluation harnesses (from LM Eval and HELM to SWE-bench and Ragas), analyzing 19,638 GitHub issues to understand where developers actually struggle.

Some key findings:

1️⃣ We derive a 5-stage harness workflow:
Provisioning → Specification → Execution → Assessment → Reporting

2️⃣ 41.4% of all issues concentrate in the Specification stage, where harnesses integrate models, datasets, and judges

3️⃣ The top 3 root causes (unimplemented features at 24.3%, documentation gaps at 20.3%, and missing input validation at 17.2%) account for 61.7% of all operational challenges

4️⃣ Only 8.8% of harnesses support regression alerting, and only 22.8% support uncertainty quantification
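To make the five stages concrete, here is a minimal Python sketch of a toy harness that walks through them in order. It is an illustration under stated assumptions, not code from the paper or from any of the studied harnesses: every name in it (`Spec`, `provision`, `specify`, `execute`, `assess`, `report`, `run_harness`, the toy data) is hypothetical. It also shows, in miniature, three things the findings point at: validating inputs during Specification, a percentile-bootstrap interval as one simple form of uncertainty quantification, and a threshold check as one simple form of regression alerting.

```python
import random
import statistics
from dataclasses import dataclass


@dataclass
class Spec:
    """Hypothetical evaluation spec: model name, dataset, and metric."""
    model_name: str
    dataset: list  # list of (prompt, expected_answer) pairs
    metric: str


def provision(seed):
    # Provisioning: set up the environment (here, only a seeded RNG).
    return random.Random(seed)


def specify(model_name, dataset, metric):
    # Specification: validate inputs up front instead of failing mid-run.
    if metric != "exact_match":
        raise ValueError(f"unsupported metric: {metric!r}")
    if not dataset:
        raise ValueError("dataset must not be empty")
    return Spec(model_name, dataset, metric)


def execute(spec, model_fn):
    # Execution: invoke the model once per example.
    return [(model_fn(prompt), expected) for prompt, expected in spec.dataset]


def assess(outputs, rng, n_boot=1000):
    # Assessment: exact-match accuracy plus a percentile-bootstrap 95% interval.
    scores = [1.0 if got == expected else 0.0 for got, expected in outputs]
    boots = sorted(
        statistics.fmean(rng.choices(scores, k=len(scores)))
        for _ in range(n_boot)
    )
    low = boots[int(0.025 * n_boot)]
    high = boots[int(0.975 * n_boot) - 1]
    return {"score": statistics.fmean(scores), "ci95": (low, high)}


def report(result, baseline, tolerance=0.05):
    # Reporting: flag a regression when the score drops below baseline - tolerance.
    result = dict(result)
    result["regression"] = result["score"] < baseline - tolerance
    return result


def run_harness(model_fn, dataset, baseline, seed=0):
    rng = provision(seed)
    spec = specify("toy-model", dataset, "exact_match")
    outputs = execute(spec, model_fn)
    return report(assess(outputs, rng), baseline)


# Toy usage: a lookup-table "model" that gets one of four answers wrong.
toy_data = [("2+2", "4"), ("3+3", "6"), ("5+5", "10"), ("1+1", "2")]
toy_answers = {"2+2": "4", "3+3": "6", "5+5": "11", "1+1": "2"}
print(run_harness(toy_answers.get, toy_data, baseline=0.9))
```

Real harnesses do far more at each stage (environment setup, dataset and judge integration, batching, caching), so treat the stage boundaries here as a reading aid rather than a reference design.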

We argue these findings establish the case for Evaluation Engineering (EvalEng) as a distinct software engineering concern: the operational counterpart to benchmark design, analogous to how DevOps sits alongside software development.

📦 We also released the annotated dataset of ~20k classified GitHub issues on Hugging Face: https://huggingface.co/datasets/zhiminy/EvalEng

Would love to hear from harness maintainers and users: what's your biggest pain point?
