"""Evaluation harness for the document-extraction agent (build-plan phase 5).

The harness is deliberately split into two phases so that model inference
happens exactly once and threshold tuning is free:

- **predict** (``eval.predict``) runs ``core.process_document`` over a dataset
  slice and caches each result (gold labels, predicted document, confidence,
  validation report) to ``eval/cache/`` keyed by example id. This is the only
  phase that calls a model backend and the only phase that spends API quota.
- **score** (``eval.score``) loads the cache and computes every metric plus the
  threshold sweep purely offline. Re-tuning never re-runs inference.

See ``docs/03_data_and_extraction_spec.md`` section 6 for the evaluation
methodology and ``docs/05_build_plan.md`` phase 5 for the task definition.
"""
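
The two-phase split described in the docstring can be illustrated with a minimal sketch. This is not the actual ``eval.predict`` or ``eval.score`` code: ``fake_process_document`` is a hypothetical stand-in for ``core.process_document``, the one-JSON-file-per-example cache layout is an assumption, and the coverage/accuracy pair is only an illustrative metric for the threshold sweep.

```python
import json
import tempfile
from pathlib import Path


def fake_process_document(text):
    """Hypothetical stand-in for core.process_document; no model is called."""
    # Toy rule: longer inputs get higher confidence, capped at 1.0.
    return {"total": text.strip(), "confidence": min(1.0, len(text) / 10)}


def predict(examples, cache_dir):
    """Phase 1: run inference once per example and cache the result by id."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    for ex in examples:
        path = cache_dir / f"{ex['id']}.json"  # assumed layout: one file per id
        if path.exists():
            continue  # already cached, so inference is not repeated
        pred = fake_process_document(ex["text"])
        record = {"id": ex["id"], "gold": ex["gold"], "pred": pred}
        path.write_text(json.dumps(record))


def score(cache_dir, thresholds):
    """Phase 2: read the cache and sweep thresholds without any inference."""
    records = [json.loads(p.read_text()) for p in sorted(cache_dir.glob("*.json"))]
    sweep = {}
    for t in thresholds:
        # Keep only predictions whose confidence clears the threshold.
        accepted = [r for r in records if r["pred"]["confidence"] >= t]
        correct = sum(r["pred"]["total"] == r["gold"]["total"] for r in accepted)
        sweep[t] = {
            "coverage": len(accepted) / len(records) if records else 0.0,
            "accuracy": correct / len(accepted) if accepted else 0.0,
        }
    return sweep


examples = [
    {"id": "a", "text": "12.50", "gold": {"total": "12.50"}},
    {"id": "b", "text": "7.0 ", "gold": {"total": "7.00"}},
]
cache = Path(tempfile.mkdtemp()) / "cache"
predict(examples, cache)
sweep = score(cache, thresholds=[0.0, 0.5])
```

Calling ``score`` again with a different list of thresholds only re-reads the cached JSON files, which is the property the docstring refers to when it says re-tuning never re-runs inference.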