"""Evaluation harness for the document-extraction agent (build-plan phase 5).

The harness is deliberately split into two phases so that model inference
happens exactly once and threshold tuning is free:

- **predict** (``eval.predict``) runs ``core.process_document`` over a dataset
  slice and caches each result (gold labels, predicted document, confidence,
  validation report) to ``eval/cache/`` keyed by example id. This is the only
  phase that calls a model backend and the only phase that spends API quota.
- **score** (``eval.score``) loads the cache and computes every metric plus the
  threshold sweep purely offline. Re-tuning never re-runs inference.

See ``docs/03_data_and_extraction_spec.md`` section 6 for the evaluation
methodology and ``docs/05_build_plan.md`` phase 5 for the task definition.
"""