Eval harness + goalpost
Measures any planner against a held-out synthetic gold set (seed differs from training, and gold is filtered to oracle-solvable so the ceiling is a clean 1.0).
uv run eval/run_eval.py --n 300 --seed 4242
Adopts the researched tooling: jsonschema for plan validity; set-based micro-F1 for
operations and canonicalization mappings; the executor itself for end-to-end
cell-recovery (the Raha-style dirty→clean comparison). promptfoo + llm-rubric will
wrap the report-quality layer once a model exists.
Metrics
- json_valid: plan conforms to the schema (eval/metrics.py:PLAN_SCHEMA).
- op_f1 / op_r: micro-F1 / recall over (column, operation) pairs vs gold.
- canon_f1 / canon_r: micro-F1 / recall over (column, raw→canonical) mapping pairs. This is the fuzzy skill rules can't do, and the whole reason for the model.
- recovery: fraction of clean-reference cells recovered by executing the plan.
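As an illustration of the set-based micro-F1 used for op_f1 and canon_f1, here is a minimal sketch. It pools true positives, false positives, and false negatives across all examples before computing precision and recall once. The function name `micro_prf`, the operation names in the toy data, and the choice to return 0.0 when there is nothing to score are assumptions for this example, not the code in eval/metrics.py.

```python
from typing import Iterable, Set, Tuple


def micro_prf(examples: Iterable[Tuple[Set, Set]]) -> Tuple[float, float, float]:
    """Pool TP/FP/FN over all (predicted, gold) set pairs, then compute P/R/F1 once."""
    tp = fp = fn = 0
    for predicted, gold in examples:
        tp += len(predicted & gold)   # pairs the plan got right
        fp += len(predicted - gold)   # pairs the plan invented
        fn += len(gold - predicted)   # gold pairs the plan missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy (column, operation) pairs; the operation names are made up for the example.
examples = [
    ({("country", "canonicalize"), ("phone", "format")},
     {("country", "canonicalize"), ("phone", "format"), ("date", "parse")}),
    ({("city", "canonicalize"), ("zip", "strip")},
     {("city", "canonicalize")}),
]
precision, recall, f1 = micro_prf(examples)
print(precision, recall, f1)
```

The same function would score canonicalization if each set element were a (column, raw, canonical) triple instead of a (column, operation) pair.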
Baseline (measured) and the goalpost
Two reference systems frame every run:
- ORACLE = the gold plan, i.e. the ceiling.
- HEURISTIC (scrubdata.mock_plan) = the rule-based baseline the model must beat.
Measured on the frozen 300-example gold set (eval/gold.jsonl, value_counts/aggregation
format):
| system | json_valid | op_f1 | canon_f1 | canon_r | recovery |
|---|---|---|---|---|---|
| ORACLE (gold) | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| HEURISTIC (baseline) | 1.000 | 0.932 | 0.189 | 0.129 | 0.637 |
Reading: with case-folding + typo-clustering the heuristic does the easy
canonicalization (collapse to most-frequent surface), but it's still ~blind to
alias/semantic canonicalization (USA→United States, NYC→New York): canon_f1
0.19 vs the oracle's 1.0. That gap is the fine-tuned model's job. (Earlier, on the old
sample-rows format, a fine-tune reached canon_f1 0.86 vs a big vanilla model's 0.45,
proving small-aligned > big-generic; the v4 retrain re-establishes this on the new format.)
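To make the "easy" canonicalization concrete, the sketch below collapses values that differ only by case or whitespace onto their most frequent surface form. It is an illustrative stand-in, not scrubdata.mock_plan: the function name `collapse_by_casefold` and the alphabetical tie-break are assumptions, and typo-clustering is left out.

```python
from collections import Counter, defaultdict


def collapse_by_casefold(values):
    """Map each raw value to the most frequent surface form in its case-folded group."""
    counts = Counter(values)
    groups = defaultdict(list)
    for surface in counts:
        # Normalize whitespace and case to build the grouping key.
        key = " ".join(surface.split()).casefold()
        groups[key].append(surface)
    mapping = {}
    for surfaces in groups.values():
        # Sort first so ties on frequency resolve alphabetically (deterministic).
        canonical = max(sorted(surfaces), key=lambda s: counts[s])
        for surface in surfaces:
            mapping[surface] = canonical
    return mapping


values = ["USA", "usa", "USA", "Usa ", "United States", "United States", "united states"]
mapping = collapse_by_casefold(values)
print(mapping)
```

Here "usa" and "Usa " both map to "USA", but "USA" and "United States" stay in separate groups because no surface-level rule links them. That residue is the alias gap canon_f1 measures.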
🎯 Goalpost for the fine-tuned Qwen3-4B
| metric | baseline | target | ceiling |
|---|---|---|---|
| json_valid | 1.000 | ≥ 0.99 | 1.000 |
| op_f1 | 0.932 | ≥ 0.98 | 1.000 |
| canon_f1 | 0.189 | ≥ 0.85 | 1.000 |
| recovery | 0.637 | ≥ 0.95 | 1.000 |
A fine-tune that hits these clearly beats the (now stronger) heuristic and approaches the oracle. The headline is canon_f1 0.189 → ≥0.85 (alias-level canonicalization) and recovery 0.637 → ≥0.95.
Plugging in the model
evaluate(planner, gold) takes any planner(dirty_df, gold_plan) -> plan dict. For
the model, wrap inference (build prompt via scrubdata.prompt, parse JSON) and pass it
in alongside the two reference systems. Track the table every fine-tune iteration; the
per-metric delta vs baseline is the cheap regression signal.
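A model wrapper with the planner(dirty_df, gold_plan) -> plan dict shape might look like the sketch below. Everything here is hypothetical scaffolding: `make_model_planner`, the `generate` and `build_prompt` callables, the "operations" key in the fake output, and the choice to return an empty plan on unparseable output are assumptions, and the real prompt builder lives in scrubdata.prompt with a signature not shown here.

```python
import json
from typing import Any, Callable, Dict


def make_model_planner(generate: Callable[[str], str],
                       build_prompt: Callable[[Any], str]) -> Callable[[Any, Dict], Dict]:
    """Wrap a text-generation callable as planner(dirty_df, gold_plan) -> plan dict."""

    def planner(dirty_df: Any, gold_plan: Dict) -> Dict:
        # gold_plan is accepted only to match the signature; a model must not read it.
        raw = generate(build_prompt(dirty_df))
        # Models often wrap JSON in prose, so keep the outermost {...} span.
        start, end = raw.find("{"), raw.rfind("}")
        if start == -1 or end <= start:
            return {}
        try:
            plan = json.loads(raw[start:end + 1])
        except json.JSONDecodeError:
            return {}
        return plan if isinstance(plan, dict) else {}

    return planner


# Stand-in for real inference: returns a canned response with made-up plan keys.
def fake_generate(prompt: str) -> str:
    return 'Here is the plan:\n{"operations": [{"column": "city", "op": "canonicalize"}]}'


planner = make_model_planner(fake_generate, build_prompt=lambda df: f"Clean this table: {df}")
plan = planner([["birminghxm"]], gold_plan={})
print(plan)
```

Returning an empty dict on a parse failure keeps the evaluation loop running; whether that counts against json_valid depends on the schema, which is not reproduced here.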
Layer 2: real out-of-distribution data (uv run eval/run_real.py)
Raha hospital (1000×20, row-aligned dirty/clean). Errors are char-substitution typos
(birminghxm→birmingham), only ~2.5% of cells. Scored with the Raha repair
protocol (the right metric when data is already mostly correct):
| system | recovery | repair_recall | repair_prec | broken |
|---|---|---|---|---|
| NO-OP (dirty as-is) | 0.975 | 0.000 | 0.000 | 0 |
| HEURISTIC (baseline) | 0.880 | 0.293 | 0.065 | 2041 |
(Typo-clustering now fixes ~29% of the real char-substitution errors, up from 0. The model should push repair_recall higher and improve repair_prec.)
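The four columns in the table can be computed cell by cell from three row-aligned tables. The sketch below is one plausible reading of the repair protocol, not the repo's `_score`: `score_repair` is a hypothetical name, and "broken" is assumed to mean cells that were already correct in the dirty table and were changed anyway.

```python
def score_repair(dirty, clean, output):
    """Score a cleaned table against row-aligned dirty and clean references."""
    total = errors = fixed = changed = broken = recovered = 0
    for d_row, c_row, o_row in zip(dirty, clean, output):
        for d, c, o in zip(d_row, c_row, o_row):
            total += 1
            was_error = d != c
            errors += was_error
            recovered += o == c
            if o != d:
                changed += 1
                if was_error and o == c:
                    fixed += 1          # a real error, repaired exactly
                elif not was_error:
                    broken += 1         # a correct cell that was altered (assumed definition)
    return {
        "recovery": recovered / total if total else 0.0,
        "repair_recall": fixed / errors if errors else 0.0,
        "repair_prec": fixed / changed if changed else 0.0,
        "broken": broken,
    }


dirty = [["birminghxm", "100%"], ["birmingham", "50%"]]
clean = [["birmingham", "100%"], ["birmingham", "50%"]]
standardized = [["birmingham", "1.0"], ["birmingham", "0.5"]]

noop_scores = score_repair(dirty, clean, dirty)
tool_scores = score_repair(dirty, clean, standardized)
print(noop_scores)
print(tool_scores)
```

In this toy case the standardizing output fixes the only typo (repair_recall 1.0) yet scores lower recovery than the no-op, because reformatting percentages counts as damage against a benchmark that stores them as raw text.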
Reading (honest + important): before typo-clustering was added, the rule heuristic fixed 0 typos, and its 2021 changed
cells were convention divergence, not errors: our tool parses 100%→1.0 and
reformats phones, which this benchmark stores as raw text. That's product value, so raw
recovery/broken understates a standardizing tool on a foreign benchmark. The honest
metric here is repair_recall: did we fix the actual char-substitution typos
(birminghxm→birmingham)? Pure rules can't (they scored 0); cluster-canonicalization is
the model's job. Two takeaways:
- The headline real-data metric is repair_recall (error-fixing), not recovery.
- Product feature surfaced: offer a "preserve original formats" toggle. Some users want raw representation kept; standardizing is the default but should be reversible (matches PRODUCT.md's trust contract).
🎯 Real-data goalpost (fine-tuned model)
| metric | NO-OP | HEURISTIC | target | note |
|---|---|---|---|---|
| repair_recall | 0.000 | 0.000 | ≥ 0.30 | the real test: fix typos via clustering |
| repair_prec | 0.000 | 0.000 | ≥ 0.70 | of cells changed, fraction that fixed an error |
| recovery | 0.975 | 0.874 | report-only | convention-sensitive; not a pass/fail gate |
The model plugs into _score(dirty, clean, model_output) exactly like the heuristic.
Data auto-fetched to data/real/hospital/ (gitignored). Add Flights/Beers/CleanML the same way for breadth.
Scale: aggregation + agentic batching (validated)
Cleaning large tables doesn't mean bigger prompts; it means reasoning over patterns:
- Aggregation: the profiler sends per-column value_counts ([value, frequency]), so the prompt size depends on DISTINCT values, not rows. Rare typos sit at the tail next to their dominant canonical (birminghxm:1 vs birmingham:312), visible at any scale.
- Column batching: scrubdata.model_planner.make_batched_planner plans a wide table in small column-batches, so a 20-column table never blows one prompt.
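The two ideas above can be sketched in plain Python. This is an illustration only: `profile_columns` and `column_batches` are hypothetical names, rows are modeled as dicts rather than a DataFrame, and the actual profiler and make_batched_planner may work differently.

```python
from collections import Counter
from typing import Dict, Iterator, List, Tuple

Profile = Dict[str, List[Tuple[str, int]]]


def profile_columns(rows: List[Dict[str, str]]) -> Profile:
    """Per column, list (value, frequency) pairs sorted from most to least frequent."""
    counters: Dict[str, Counter] = {}
    for row in rows:
        for column, value in row.items():
            counters.setdefault(column, Counter())[value] += 1
    return {column: counter.most_common() for column, counter in counters.items()}


def column_batches(profile: Profile, batch_size: int) -> Iterator[Profile]:
    """Yield the profile a few columns at a time so each prompt stays small."""
    columns = list(profile)
    for start in range(0, len(columns), batch_size):
        yield {column: profile[column] for column in columns[start:start + batch_size]}


# 313 rows, but only two distinct city values reach the prompt.
rows = [{"city": "birmingham", "state": "al"}] * 312 + [{"city": "birminghxm", "state": "al"}]
profile = profile_columns(rows)
batches = list(column_batches(profile, batch_size=1))
print(profile["city"])
```

Doubling the row count changes only the frequencies, not the number of entries per column, which is why the prompt size tracks distinct values rather than table length.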
Validated on the real Raha hospital table (1000×20) with a vanilla model (no retrain):
repair_recall 0.509 (fixed 259/509 typos), vs 0.000 for the old one-shot+sample-rows
approach. The v4 fine-tune trains on this value_counts format.
The wide suite (current north-star)
The single-dataset hospital metric was retired as north-star (biased: one table, recall-only, convention-sensitive, abstain-blind). The current harness:
- run_real_multi.py: 65-dataset suite (5 Raha real-error benchmarks + seeded error injection over 15 harvested open-data domains), scored with a churn-neutral metric (pure case/whitespace rewrites that don't restore gold count as nothing) and aggregated as a double macro (error-type × domain, harmonic mean) so no single table or error type dominates. Reports REAL vs INJECTED slices separately; injected typos are in-distribution for frequency clustering by construction.
- ablations.py: removes one grounding component at a time (reference, abstain, ambiguity margin, case-match). Caught two metric artifacts (churn inflation, reference-unsafe traps), now fixed and documented in the paper.
- calibration.py: risk-coverage + ECE for the abstention confidence (AURC 0.120; 90% precision at the default threshold, ≥95% at 0.91).
- pii_leak.py: masking leak test, 0/360 residual detectable PII.
- pii_slice.py: OOD PII column typing on Gretel test, 5/5 types, 0/7 FP.
- inject.py: seeded, self-verifying error injectors (typo/OCR/case/whitespace) that turn any clean table into validation data.
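To show what "seeded, self-verifying" can mean for an injector, here is a minimal sketch covering only the typo case. It is not inject.py: `inject_typos`, the list-of-lists table shape, and the log format are assumptions, and the OCR/case/whitespace injectors are omitted.

```python
import random
import string


def inject_typos(table, rate, seed):
    """Substitute one character in a random subset of cells; return (dirty, log)."""
    rng = random.Random(seed)          # a private RNG makes the output reproducible
    dirty = [list(row) for row in table]
    log = []
    for r, row in enumerate(table):
        for c, value in enumerate(row):
            if not value or rng.random() >= rate:
                continue
            pos = rng.randrange(len(value))
            # Exclude the original letter so the substitution always changes the cell.
            choices = [ch for ch in string.ascii_lowercase if ch != value[pos].lower()]
            dirty[r][c] = value[:pos] + rng.choice(choices) + value[pos + 1:]
            log.append((r, c, value, dirty[r][c]))
    # Self-verification: a cell differs from the clean table iff it was logged.
    logged = {(r, c) for r, c, _, _ in log}
    for r, row in enumerate(table):
        for c, value in enumerate(row):
            assert (dirty[r][c] != value) == ((r, c) in logged)
    return dirty, log


clean = [["birmingham", "al"], ["huntsville", "al"], ["mobile", "al"]]
dirty_a, log_a = inject_typos(clean, rate=0.5, seed=7)
dirty_b, log_b = inject_typos(clean, rate=0.5, seed=7)
all_dirty, all_log = inject_typos(clean, rate=1.0, seed=1)
none_dirty, none_log = inject_typos(clean, rate=0.0, seed=1)
print(log_a)
```

Because the log records the clean value for every corrupted cell, the clean table doubles as gold and the log gives the exact error positions for recall and precision.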
Baselines include OpenRefine fingerprint + kNN clustering (scrubdata/baselines.py,
with blocking, as the real tool uses). Full results & discussion: docs/paper/.
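For readers unfamiliar with the fingerprint baseline, the sketch below approximates OpenRefine's fingerprint keying: trim, lowercase, fold accents to ASCII, strip punctuation, then sort and deduplicate the whitespace-separated tokens. It is an approximation written for illustration, not the code in scrubdata/baselines.py, and it leaves out the kNN clustering and blocking; the function names are assumptions.

```python
import string
import unicodedata
from collections import defaultdict


def fingerprint(value: str) -> str:
    """Approximate OpenRefine-style fingerprint key for a string."""
    text = value.strip().lower()
    # Decompose accented characters, then drop anything outside ASCII.
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Sorting unique tokens makes the key independent of word order and repeats.
    return " ".join(sorted(set(text.split())))


def cluster_by_fingerprint(values):
    """Group distinct values that share a fingerprint; singletons are dropped."""
    clusters = defaultdict(set)
    for value in values:
        clusters[fingerprint(value)].add(value)
    return [sorted(group) for group in clusters.values() if len(group) > 1]


values = ["New York", "new york ", "York, New", "Z\u00fcrich", "Zurich", "Boston", "NYC"]
clusters = cluster_by_fingerprint(values)
print(clusters)
```

Like the rule heuristic, this groups surface variants but leaves "NYC" on its own, since key collision cannot link an alias to its expansion.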