"""Evaluation harness for the ScrubData planner.

Measures any planner (`callable(dirty_df) -> plan dict`) against a held-out gold set:
- JSON-schema validity of the plan
- operation-level micro-F1 vs the gold plan
- canonicalization mapping micro-F1 (the fuzzy skill rules can't do)
- end-to-end cell-recovery (executor(dirty, plan) vs known-clean reference)

Two reference systems frame every run:
- HEURISTIC (`scrubdata.mock_plan`) = the baseline a fine-tuned model must beat.
- ORACLE (the gold plan itself) = the goalpost ceiling (~100% by construction).
"""