Towards Evaluation Engineering: An Empirical Study of ML Evaluation Harnesses in the Wild Paper • 2605.24213 • Published May 22 • 14