Eval harness + goalpost

Measures any planner against a held-out synthetic gold set (seed differs from training, and gold is filtered to oracle-solvable so the ceiling is a clean 1.0).

uv run eval/run_eval.py --n 300 --seed 4242

Adopts the researched tooling: jsonschema for plan validity; set-based micro-F1 for operations and canonicalization mappings; the executor itself for end-to-end cell-recovery (the Raha-style dirty→clean comparison). promptfoo + llm-rubric will wrap the report-quality layer once a model exists.

Metrics

  • json_valid: plan conforms to the schema (eval/metrics.py:PLAN_SCHEMA).
  • op_f1 / op_r: micro-F1 / recall over (column, operation) pairs vs gold.
  • canon_f1 / canon_r: micro-F1 / recall over (column, raw→canonical) mapping pairs. This is the fuzzy skill rules can't do, and the whole reason for the model.
  • recovery: fraction of clean-reference cells recovered by executing the plan.
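
As a rough illustration of the two metric families above, here is a minimal sketch of set-based micro-F1 and cell recovery. The function names and the list-of-rows table representation are assumptions made for this example; the real implementation lives in eval/metrics.py and may differ.

```python
from typing import Iterable, List, Set, Tuple

Pair = Tuple[str, ...]  # e.g. (column, operation) or (column, raw, canonical)


def micro_prf(pred_sets: Iterable[Set[Pair]], gold_sets: Iterable[Set[Pair]]):
    """Pool TP/FP/FN over all examples, then compute precision/recall/F1 once."""
    tp = fp = fn = 0
    for pred, gold in zip(pred_sets, gold_sets):
        tp += len(pred & gold)   # pairs the plan got right
        fp += len(pred - gold)   # pairs the plan invented
        fn += len(gold - pred)   # gold pairs the plan missed
    # Convention assumed here: an empty denominator scores 0.0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def cell_recovery(cleaned: List[List[str]], reference: List[List[str]]) -> float:
    """Fraction of reference cells that the cleaned table reproduces exactly."""
    total = sum(len(row) for row in reference)
    same = sum(
        c == r
        for c_row, r_row in zip(cleaned, reference)
        for c, r in zip(c_row, r_row)
    )
    return same / total if total else 0.0
```

Because counts are pooled before dividing, a table with many mappings weighs more than a table with few, which is the usual meaning of "micro" averaging.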

Baseline (measured) and the goalpost

Two reference systems frame every run:

  • ORACLE = the gold plan → the ceiling.
  • HEURISTIC (scrubdata.mock_plan) = the rule-based baseline the model must beat.

Measured on the frozen 300-example gold set (eval/gold.jsonl, value_counts/aggregation format):

| system | json_valid | op_f1 | canon_f1 | canon_r | recovery |
|---|---|---|---|---|---|
| ORACLE (gold) | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| HEURISTIC (baseline) | 1.000 | 0.932 | 0.189 | 0.129 | 0.637 |

Reading: with case-folding + typo-clustering the heuristic does the easy canonicalization (collapse to most-frequent surface), but it's still ~blind to alias/semantic canonicalization (USA→United States, NYC→New York) — canon_f1 0.19 vs the oracle's 1.0. That gap is the fine-tuned model's job. (Earlier, on the old sample-rows format, a fine-tune reached canon_f1 0.86 vs a big vanilla model's 0.45 — proving small-aligned > big-generic; the v4 retrain re-establishes this on the new format.)

🎯 Goalpost for the fine-tuned Qwen3-4B

| metric | baseline | target | ceiling |
|---|---|---|---|
| json_valid | 1.000 | ≥ 0.99 | 1.000 |
| op_f1 | 0.932 | ≥ 0.98 | 1.000 |
| canon_f1 | 0.189 | ≥ 0.85 | 1.000 |
| recovery | 0.637 | ≥ 0.95 | 1.000 |

A fine-tune that hits these clearly beats the (now stronger) heuristic and approaches the oracle. The headline is canon_f1 0.189 → ≥ 0.85 (alias-level canonicalization) and recovery 0.637 → ≥ 0.95.

Plugging in the model

evaluate(planner, gold) takes any planner(dirty_df, gold_plan) -> plan dict. For the model, wrap inference (build prompt via scrubdata.prompt, parse JSON) and pass it in alongside the two reference systems. Track the table every fine-tune iteration; the per-metric delta vs baseline is the cheap regression signal.
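
A minimal sketch of such a wrapper, assuming the planner signature described above. `build_prompt` is a hypothetical stand-in for the real prompt builder in scrubdata.prompt, and `generate` and `profile_fn` are placeholders for whatever inference call and profiler you use; none of these names come from the codebase.

```python
import json
from typing import Any, Callable, Dict


def build_prompt(profile: Dict[str, Any]) -> str:
    # Hypothetical stand-in for the prompt builder in scrubdata.prompt.
    return "Return a cleaning plan as JSON for this profile:\n" + json.dumps(profile)


def extract_json(text: str) -> Dict[str, Any]:
    """Return the first JSON object found in a model reply, or {} if none parses."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            obj, _end = decoder.raw_decode(text, start)
            if isinstance(obj, dict):
                return obj
        except json.JSONDecodeError:
            pass  # not valid JSON from this brace; try the next one
        start = text.find("{", start + 1)
    return {}


def make_model_planner(
    generate: Callable[[str], str],
    profile_fn: Callable[[Any], Dict[str, Any]],
) -> Callable[..., Dict[str, Any]]:
    """Wrap an inference function as planner(dirty_df, gold_plan) -> plan dict."""

    def planner(dirty_df: Any, gold_plan: Any = None) -> Dict[str, Any]:
        # gold_plan is accepted only to match the signature; the model never sees it.
        reply = generate(build_prompt(profile_fn(dirty_df)))
        return extract_json(reply)

    return planner
```

Returning an empty dict on unparseable output keeps the run going; presumably such a plan then fails the schema check and counts against json_valid rather than crashing the evaluation.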

Layer 2: real out-of-distribution data (uv run eval/run_real.py)

Raha hospital (1000×20, row-aligned dirty/clean). Errors are char-substitution typos (birminghxm→birmingham) — only ~2.5% of cells. Scored with the Raha repair protocol (the right metric when data is already mostly correct):

| system | recovery | repair_recall | repair_prec | broken |
|---|---|---|---|---|
| NO-OP (dirty as-is) | 0.975 | 0.000 | 0.000 | 0 |
| HEURISTIC (baseline) | 0.880 | 0.293 | 0.065 | 2041 |

(Typo-clustering now fixes ~29% of the real char-substitution errors, up from 0. The model should push repair_recall higher and improve repair_prec.)

Reading (honest + important): before typo-clustering was added, the rule heuristic fixed 0 typos. Its 2021 changed cells were convention divergence, not errors: our tool parses 100%→1.0 and reformats phones, which this benchmark stores as raw text. That's product value, so raw recovery/broken understates a standardizing tool on a foreign benchmark. The honest metric here is repair_recall: did we fix the actual char-substitution typos (birminghxm→birmingham)? Plain rules could not (they scored 0); cluster-canonicalization is the model's job. Two takeaways:

  1. The headline real-data metric is repair_recall (error-fixing), not recovery.
  2. Product feature surfaced: offer a "preserve original formats" toggle. Some users want raw representation kept; standardizing is the default but should be reversible (matches PRODUCT.md's trust contract).

🎯 Real-data goalpost (fine-tuned model)

| metric | NO-OP | HEURISTIC (before typo-clustering) | target | note |
|---|---|---|---|---|
| repair_recall | 0.000 | 0.000 | ≥ 0.30 | the real test: fix typos via clustering |
| repair_prec | 0.000 | 0.000 | ≥ 0.70 | of cells changed, fraction that fixed an error |
| recovery | 0.975 | 0.874 | report-only | convention-sensitive; not a pass/fail gate |

The model plugs into _score(dirty, clean, model_output) exactly like the heuristic.
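
For orientation, the repair metrics in the tables above can be sketched as follows. This is not the actual `_score` implementation; the list-of-rows representation, the function name, and the exact definition of `broken` (cells that were already correct and got changed) are assumptions for illustration.

```python
from typing import Dict, List

Table = List[List[str]]


def repair_scores(dirty: Table, clean: Table, repaired: Table) -> Dict[str, float]:
    """Score a repaired table against row-aligned dirty and clean references."""
    total = recovered = errors = changed = fixed = broken = 0
    for d_row, c_row, r_row in zip(dirty, clean, repaired):
        for d, c, r in zip(d_row, c_row, r_row):
            total += 1
            recovered += r == c          # cell matches the clean reference
            errors += d != c             # cell was a real error to begin with
            if r != d:                   # the system touched this cell
                changed += 1
                if d != c and r == c:
                    fixed += 1           # real error, now correct
                elif d == c:
                    broken += 1          # was correct, now differs (assumed meaning)
    return {
        "recovery": recovered / total if total else 0.0,
        "repair_recall": fixed / errors if errors else 0.0,
        "repair_prec": fixed / changed if changed else 0.0,
        "broken": broken,
    }
```

Under these definitions a no-op scores recovery equal to the share of already-correct cells and zero on everything else, which matches the shape of the NO-OP row. A tool that reformats `100%` to `1.0` gets charged in `broken` and `repair_prec` even though nothing was wrong, which is why recovery is report-only here.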

Data auto-fetched to data/real/hospital/ (gitignored). Add Flights/Beers/CleanML the same way for breadth.

Scale: aggregation + agentic batching (validated)

Cleaning large tables doesn't mean bigger prompts; it means reasoning over patterns:

  • Aggregation: the profiler sends per-column value_counts ([value, frequency]), so the prompt size depends on DISTINCT values, not rows. Rare typos sit at the tail next to their dominant canonical (birminghxm:1 vs birmingham:312), visible at any scale.
  • Column batching: scrubdata.model_planner.make_batched_planner plans a wide table in small column-batches, so a 20-column table never blows one prompt.
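
The two ideas above can be sketched in a few lines. This is an illustration only: the function names, the list-of-rows input, and the batch size of 5 are assumptions, and the real profiler and `make_batched_planner` may work differently.

```python
from collections import Counter
from typing import Dict, List


def value_counts_profile(rows: List[List[str]], columns: List[str]) -> Dict[str, list]:
    """Per-column [value, frequency] pairs, most frequent first.

    Output size grows with the number of DISTINCT values per column,
    not with the number of rows.
    """
    profile = {}
    for j, name in enumerate(columns):
        counts = Counter(row[j] for row in rows)
        profile[name] = [[value, n] for value, n in counts.most_common()]
    return profile


def column_batches(columns: List[str], batch_size: int = 5) -> List[List[str]]:
    """Split a wide table's columns into small groups, one prompt per group."""
    return [columns[i:i + batch_size] for i in range(0, len(columns), batch_size)]
```

With 312 rows of `birmingham` and one `birminghxm`, the city column profiles to two entries, and the typo sits right under its canonical form however many rows the table has.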

Validated on the real Raha hospital table (1000×20) with a vanilla model (no retrain): repair_recall 0.509 (fixed 259/509 typos), vs 0.000 for the old one-shot+sample-rows approach. The v4 fine-tune trains on this value_counts format.


The wide suite (current north-star)

The single-dataset hospital metric was retired as north-star (biased: one table, recall-only, convention-sensitive, abstain-blind). The current harness:

  • run_real_multi.py: 65-dataset suite (5 Raha real-error benchmarks + seeded error injection over 15 harvested open-data domains), scored with a churn-neutral metric (pure case/whitespace rewrites that don't restore gold count as nothing) and aggregated as a double macro (error-type × domain, harmonic mean) so no single table or error type dominates. Reports REAL vs INJECTED slices separately, since injected typos are in-distribution for frequency clustering by construction.
  • ablations.py: removes one grounding component at a time (reference, abstain, ambiguity margin, case-match). Caught two metric artifacts (churn inflation, reference-unsafe traps) now fixed and documented in the paper.
  • calibration.py: risk–coverage + ECE for the abstention confidence (AURC 0.120; 90% precision at the default threshold, ≥95% at 0.91).
  • pii_leak.py: masking leak test: 0/360 residual detectable PII.
  • pii_slice.py: OOD PII column typing on Gretel test: 5/5 types, 0/7 FP.
  • inject.py: seeded, self-verifying error injectors (typo/OCR/case/whitespace) that turn any clean table into validation data.
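
To make the churn-neutral rule concrete, here is one way a single cell change could be classified. This is a sketch of the idea as described above, not the code in run_real_multi.py; the category names and the whitespace/case folding are assumptions.

```python
def _fold(value: str) -> str:
    """Collapse whitespace and case so pure formatting rewrites compare equal."""
    return " ".join(value.split()).casefold()


def classify_change(dirty: str, predicted: str, gold: str) -> str:
    """Label one cell as unchanged / fix / churn / break / miss."""
    if predicted == dirty:
        return "unchanged"
    if predicted == gold:
        return "fix"      # the cell changed and now matches gold
    if _fold(predicted) == _fold(dirty):
        return "churn"    # case/whitespace-only rewrite that did not restore gold
    if dirty == gold:
        return "break"    # a correct cell was changed to something else
    return "miss"         # an error was changed, but not to gold


def is_scored(label: str) -> bool:
    # Churn counts as nothing: neither rewarded nor penalized.
    return label not in ("unchanged", "churn")
```

Without the churn category, a tool that lowercases every cell would rack up thousands of "changes" that look like precision failures, which is the churn-inflation artifact the ablations surfaced.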

Baselines include OpenRefine fingerprint + kNN clustering (scrubdata/baselines.py, with blocking, as the real tool uses). Full results & discussion: docs/paper/.
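
For readers unfamiliar with the fingerprint baseline, the key-collision idea behind it looks roughly like this. It is a simplified sketch of the general technique, not the code in scrubdata/baselines.py, and it leaves out the kNN pass and the blocking step.

```python
import re
import unicodedata
from collections import defaultdict
from typing import Dict, List


def fingerprint(value: str) -> str:
    """Normalize a string to a key: fold case, strip accents and punctuation,
    then sort and de-duplicate the whitespace-separated tokens."""
    text = unicodedata.normalize("NFKD", value.strip().casefold())
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^\w\s]", "", text)
    return " ".join(sorted(set(text.split())))


def cluster(values: List[str]) -> List[List[str]]:
    """Group distinct values whose fingerprints collide; singletons are dropped."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for value in dict.fromkeys(values):  # de-duplicate, keep first-seen order
        groups[fingerprint(value)].append(value)
    return [group for group in groups.values() if len(group) > 1]
```

Fingerprinting catches case, punctuation, and token-order variants (`Acme Inc.` vs `inc acme`), but a character substitution such as `birminghxm` produces a different key, so that kind of typo needs the distance-based pass or the model.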