Spaces:
Running
Running
File size: 7,434 Bytes
16dc556 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 | # Eval harness + goalpost
Measures any planner against a **held-out** synthetic gold set (seed differs from
training, and gold is filtered to oracle-solvable so the ceiling is a clean 1.0).
```bash
uv run eval/run_eval.py --n 300 --seed 4242
```
Adopts the researched tooling: `jsonschema` for plan validity; set-based micro-F1 for
operations and canonicalization mappings; the **executor itself** for end-to-end
cell-recovery (the Raha-style dirty→clean comparison). promptfoo + `llm-rubric` will
wrap the report-quality layer once a model exists.
## Metrics
- **json_valid** — plan conforms to the schema (`eval/metrics.py:PLAN_SCHEMA`).
- **op_f1 / op_r** — micro-F1 / recall over `(column, operation)` pairs vs gold.
- **canon_f1 / canon_r** — micro-F1 / recall over `(column, raw→canonical)` mapping
pairs. *This is the fuzzy skill rules can't do — the whole reason for the model.*
- **recovery** — fraction of clean-reference cells recovered by executing the plan.
## Baseline (measured) and the goalpost
Two reference systems frame every run:
- **ORACLE** = the gold plan → the ceiling.
- **HEURISTIC** (`scrubdata.mock_plan`) = the rule-based baseline the model must beat.
Measured on the frozen 300-example gold set (`eval/gold.jsonl`, **value_counts/aggregation
format**):
| system | json_valid | op_f1 | canon_f1 | canon_r | recovery |
|---|---|---|---|---|---|
| ORACLE (gold) | 1.000 | 1.000 | 1.000 | 1.000 | **1.000** |
| HEURISTIC (baseline) | 1.000 | 0.932 | **0.189** | 0.129 | **0.637** |
**Reading:** with case-folding + typo-clustering the heuristic does the *easy*
canonicalization (collapse to most-frequent surface), but it's still ~blind to
**alias/semantic** canonicalization (`USA`→`United States`, `NYC`→`New York`) — canon_f1
0.19 vs the oracle's 1.0. That gap is the fine-tuned model's job. (Earlier, on the old
sample-rows format, a fine-tune reached canon_f1 0.86 vs a big vanilla model's 0.45 —
proving small-aligned > big-generic; the v4 retrain re-establishes this on the new format.)
### 🎯 Goalpost for the fine-tuned Qwen3-4B
| metric | baseline | **target** | ceiling |
|---|---|---|---|
| json_valid | 1.000 | **≥ 0.99** | 1.000 |
| op_f1 | 0.932 | **≥ 0.98** | 1.000 |
| canon_f1 | 0.189 | **≥ 0.85** | 1.000 |
| recovery | 0.637 | **≥ 0.95** | 1.000 |
A fine-tune that hits these clearly beats the (now stronger) heuristic and approaches the
oracle — the headline being **canon_f1 0.133 → ≥0.85** (alias-level canonicalization) and
**recovery 0.627 → ≥0.95**.
## Plugging in the model
`evaluate(planner, gold)` takes any `planner(dirty_df, gold_plan) -> plan dict`. For
the model, wrap inference (build prompt via `scrubdata.prompt`, parse JSON) and pass it
in alongside the two reference systems. Track the table every fine-tune iteration; the
per-metric delta vs baseline is the cheap regression signal.
## Layer 2 — real out-of-distribution data (`uv run eval/run_real.py`)
Raha `hospital` (1000×20, row-aligned dirty/clean). Errors are char-substitution typos
(`birminghxm`→`birmingham`) — only ~2.5% of cells. Scored with the Raha **repair**
protocol (the right metric when data is already mostly correct):
| system | recovery | repair_recall | repair_prec | broken |
|---|---|---|---|---|
| NO-OP (dirty as-is) | 0.975 | 0.000 | 0.000 | 0 |
| HEURISTIC (baseline) | 0.880 | **0.293** | 0.065 | 2041 |
(Typo-clustering now fixes ~29% of the real char-substitution errors — up from 0. The
model should push repair_recall higher and improve repair_prec.)
**Reading (honest + important):** the rule heuristic fixes **0** typos. Its 2021 changed
cells are **convention divergence, not errors** — our tool parses `100%`→`1.0` and
reformats phones, which this benchmark stores as raw text. That's product value, so raw
`recovery`/`broken` *understates* a standardizing tool on a foreign benchmark. The honest
metric here is **`repair_recall`** — did we fix the actual char-substitution typos
(`birminghxm`→`birmingham`)? The heuristic can't (scores 0); cluster-canonicalization is
the model's job. Two takeaways:
1. **The headline real-data metric is `repair_recall`** (error-fixing), not recovery.
2. **Product feature surfaced:** offer a "preserve original formats" toggle — some users
want raw representation kept; standardizing is the default but should be reversible
(matches PRODUCT.md's trust contract).
### 🎯 Real-data goalpost (fine-tuned model)
| metric | NO-OP | HEURISTIC | **target** | note |
|---|---|---|---|---|
| **repair_recall** | 0.000 | 0.000 | **≥ 0.30** | the real test — fix typos via clustering |
| repair_prec | 0.000 | 0.000 | **≥ 0.70** | of cells changed, fraction that fixed an error |
| recovery | 0.975 | 0.874 | report-only | convention-sensitive; not a pass/fail gate |
The model plugs into `_score(dirty, clean, model_output)` exactly like the heuristic.
> Data auto-fetched to `data/real/hospital/` (gitignored). Add Flights/Beers/CleanML the
> same way for breadth.
## Scale: aggregation + agentic batching (validated)
Cleaning *large* tables doesn't mean bigger prompts — it means reasoning over **patterns**:
- **Aggregation** — the profiler sends per-column `value_counts` (`[value, frequency]`), so
the prompt size depends on DISTINCT values, not rows. Rare typos sit at the tail next to
their dominant canonical (`birminghxm`:1 vs `birmingham`:312) — visible at any scale.
- **Column batching** — `scrubdata.model_planner.make_batched_planner` plans a wide table
in small column-batches, so a 20-column table never blows one prompt.
**Validated** on the real Raha hospital table (1000×20) with a *vanilla* model (no retrain):
**repair_recall 0.509** (fixed 259/509 typos), vs **0.000** for the old one-shot+sample-rows
approach. The v4 fine-tune trains on this `value_counts` format.
---
## The wide suite (current north-star)
The single-dataset hospital metric was retired as north-star (biased: one table,
recall-only, convention-sensitive, abstain-blind). The current harness:
- **`run_real_multi.py`** — 65-dataset suite (5 Raha real-error benchmarks + seeded
error injection over 15 harvested open-data domains), scored with a **churn-neutral**
metric (pure case/whitespace rewrites that don't restore gold count as nothing) and
aggregated as a **double macro** (error-type × domain, harmonic mean) so no single
table or error type dominates. Reports REAL vs INJECTED slices separately — injected
typos are in-distribution for frequency clustering by construction.
- **`ablations.py`** — removes one grounding component at a time (reference, abstain,
ambiguity margin, case-match). Caught two metric artifacts (churn inflation,
reference-unsafe traps) now fixed and documented in the paper.
- **`calibration.py`** — risk–coverage + ECE for the abstention confidence
(AURC 0.120; 90% precision at the default threshold, ≥95% at 0.91).
- **`pii_leak.py`** — masking leak test: 0/360 residual detectable PII.
- **`pii_slice.py`** — OOD PII column typing on Gretel test: 5/5 types, 0/7 FP.
- **`inject.py`** — seeded, self-verifying error injectors (typo/OCR/case/whitespace)
that turn any clean table into validation data.
Baselines include OpenRefine fingerprint + kNN clustering (`scrubdata/baselines.py`,
with blocking, as the real tool uses). Full results & discussion: `docs/paper/`.
|