Spaces:
Running
Running
File size: 565 Bytes
16dc556 | 1 2 3 4 5 6 7 8 9 10 11 12 13 | """Evaluation harness for the ScrubData planner.
Measures any planner (`callable(dirty_df) -> plan dict`) against a held-out gold set:
- JSON-schema validity of the plan
- operation-level micro-F1 vs the gold plan
- canonicalization mapping micro-F1 (the fuzzy skill rules can't do)
- end-to-end cell-recovery (executor(dirty, plan) vs known-clean reference)
Two reference systems frame every run:
- HEURISTIC (`scrubdata.mock_plan`) = the baseline a fine-tuned model must beat.
- ORACLE (the gold plan itself) = the goalpost ceiling (~100% by construction).
"""
|