scrubdata / eval /__init__.py
OpenAI Codex
deploy: add sponsor:openai tag (Best Use of Codex) + Codex-hardened build
16dc556
Raw
History Blame Contribute Delete
565 Bytes
"""Evaluation harness for the ScrubData planner.
Measures any planner (`callable(dirty_df) -> plan dict`) against a held-out gold set:
- JSON-schema validity of the plan
- operation-level micro-F1 vs the gold plan
- canonicalization mapping micro-F1 (the fuzzy skill rules can't do)
- end-to-end cell-recovery (executor(dirty, plan) vs known-clean reference)
Two reference systems frame every run:
- HEURISTIC (`scrubdata.mock_plan`) = the baseline a fine-tuned model must beat.
- ORACLE (the gold plan itself) = the goalpost ceiling (~100% by construction).
"""