phase 5: evaluation harness (SROIE)

	"""Evaluation harness for the document-extraction agent (build-plan phase 5).

	The harness is deliberately split into two phases so that model inference
	runs exactly once per example and threshold tuning costs no API quota:

	- predict (``eval.predict``) runs ``core.process_document`` over a dataset
	  slice and caches each result (gold labels, predicted document, confidence,
	  validation report) to ``eval/cache/`` keyed by example id. This is the only
	  phase that calls a model backend and the only phase that spends API quota.
	- score (``eval.score``) loads the cache and computes every metric plus the
	  threshold sweep purely offline. Re-tuning never re-runs inference.

	See ``docs/03_data_and_extraction_spec.md`` section 6 for the evaluation
	methodology and ``docs/05_build_plan.md`` phase 5 for the task definition.
	"""