scrubdata / docs /paper /main.tex
OpenAI Codex
deploy: add sponsor:openai tag (Best Use of Codex) + Codex-hardened build
16dc556
Raw
History Blame Contribute Delete
66.7 kB
\documentclass[11pt]{article}
\usepackage[utf8]{inputenc} % no-op on TeXLive >= 2018 (arXiv pdflatex); explicit for safety
\usepackage[margin=1in]{geometry}
\usepackage{amsmath,amssymb}
\usepackage{booktabs}
\usepackage{graphicx}
\usepackage{url}
\usepackage{xcolor}
\usepackage[numbers]{natbib}
\input{numbers}
\title{Verified Cleaning Plans: Plan-Level Selective Prediction Turns Local LLM
Planners into Trustworthy Table Cleaners}
\author{Ricardo Alanis\\ \small{\texttt{ricardo.alanis@gmail.com}}}
\date{June 2026}
\begin{document}
\maketitle
\begin{abstract}
Cleaning messy tabular data---particularly \emph{canonicalization}, the merging of
inconsistent surface forms such as \texttt{USA}/\texttt{U.S.A}/\texttt{united states}
into one canonical value---resists rule-based automation and is routinely done by hand.
We present ScrubData, an architecture built around a trust contract. A local LLM
\emph{planner} reads an aggregated column profile (per-value frequency counts,
invariant to row count) and \emph{proposes} a structured JSON cleaning plan; a
deterministic executor \emph{applies} it, making every change auditable and reversible;
and \emph{plan-level selective prediction} --- a deterministic verifier that scores
every proposed mapping and drops low-confidence entries to review flags --- extends
abstention from cell-level confidence to the plan itself. The verified union of the
gated model plan with a reference-grounded heuristic is the architecture's operating
point: a zero-configuration, zero-label system that repairs 41\% of the hospital
benchmark's 509 real errors at \unionGatePrec{} precision (strongest of three training
seeds; 3-seed mean \unionGateThreeSeedPrec{} at \unionGateThreeSeedCov{} coverage,
$\pm$ = 95\% CI),
with every declined merge surfaced for review. Four deterministic capabilities ---
profile-level \emph{suspect surfacing} for high-cardinality columns, reconciliation
against a pluggable \emph{entity reference} built from open vocabularies,
\emph{cross-row majority voting} over repeated-entity groups, and
\emph{convention-conservatism} gates --- carry the system to never-seen tables:
macro F1 \unseenMacroF{} at \unseenMacroDamage{} damage over the 35 unseen-source
pairs of a \nPairs-pair benchmark, and \emph{zero silent edits} across \nWild{} wild
tables plus a \nTrust-table trust audit, released together as the \textsc{WildClean}
benchmark.
Finally, we report where the capability lives. \emph{Execution-verified} synthetic
supervision --- a training example is kept only if executing its plan provably
recovers the known-clean table --- buys the 4B fine-tune real in-distribution skill
and the most precise gated planner at usable coverage (\modelGatePrec{} precision at
\modelGateCov{} coverage); but five further retrains and a three-arm GRPO pilot leave
held-out generalization statistically bounded (TOST against a pre-registered margin),
while two of three zero-shot 24--31B open-weights planners (devstral-24B, gemma4-31B)
dropped into the \emph{identical} harness exceed the fine-tune's operating point
(\scalePrecBig{} precision at \scaleCovBig{} coverage) with no task training. The
architecture is planner-agnostic: it converts capability gains into trustworthy
operating points without retraining. The shipped system runs entirely locally on commodity hardware;
no data leaves the machine (the scaling-arm planners were measured via hosted
endpoints; all are locally deployable open weights).
\end{abstract}
\section{Introduction}
A large share of practical data work is cleaning: a sales export where the same country
is spelled four ways, a hospital roster where \texttt{birminghxm} should be
\texttt{birmingham}, a CRM dump with mixed date formats and duplicated contacts. The
fuzzy half of this work---recognizing that distinct surface forms denote the same
entity---is exactly what rules do poorly and humans do slowly.
Large language models can do this fuzzy matching, but deploying them as cell editors has
three problems. First, \emph{trust}: a model that edits cells directly can silently
corrupt data, and its errors are unauditable. Second, \emph{cost and privacy}: shipping
every row of a private table to a hosted frontier model is expensive and often
unacceptable. Third, \emph{hallucination}: asked for a canonical form, a generative model
will invent one, and on tail entities it will invent wrong ones.
ScrubData addresses all three with an architecture in which the model never touches
data. A profiler aggregates each column into a value-frequency distribution; a small
local model reads the profile and \emph{proposes} a JSON cleaning plan; a deterministic
pandas executor \emph{applies} it. The plan is the complete, inspectable, reversible
specification of every change---there are no silent edits by construction
(\S\ref{sec:method}). Because the prompt scales with the number of \emph{distinct}
values rather than rows, a million-row table profiles like a hundred-row one.
This paper makes five contributions:
\begin{enumerate}
\item \textbf{A planner/executor decomposition with plan-level selective prediction}:
the model proposes, a deterministic engine executes with full lineage, and a
deterministic verifier gates every proposed mapping, extending abstention to the plan
itself. The verified union of the gated model plan with a reference-grounded
heuristic repairs 41\% of hospital's \hospErrors{} real errors at \unionGatePrec{}
precision with zero configuration and zero labels (\S\ref{sec:method},
\S\ref{sec:verifier}, \S\ref{sec:ws1results}).
\item \textbf{\textsc{WildClean} and an un-gameable evaluation}: a 65-dataset suite
(real-error benchmarks plus seeded error injection over 15 harvested open-data
domains) scored with a churn-neutral, convention-tolerant metric that cannot be
inflated by mass rewriting, with damage and silent edits scored alongside repair F1,
degenerate baselines pinning the metric's floor and ceiling, and the scorer itself
validated against 30 adversarial known-by-construction cases (\S\ref{sec:eval},
\S\ref{sec:degenerate}).
\item \textbf{Four deterministic capabilities that carry never-seen-table
generalization}: bounded suspect surfacing for high-cardinality columns, generic
entity-reference reconciliation with an exact-hit typing floor, cross-row majority
voting with a false-consensus guard, and convention-conservatism gates --- each
motivated by a measured failure regime and gated by the verifier
(\S\ref{sec:capabilities}, \S\ref{sec:wild}).
\item \textbf{Execution-verified synthetic supervision}, the training method behind
the 4B planner instantiation: every training example is validated by running the
executor on the (dirty table, plan) pair and checking that the known-clean table is
recovered; non-recovering examples are discarded (\S\ref{sec:sft}).
\item \textbf{A unified finding on where capability lives in this architecture}: five
further supervised fine-tunes and a three-arm GRPO pilot with the executor as a
verifiable reward leave held-out generalization statistically bounded (TOST), while
two of three zero-shot 24--31B planners dropped into the same harness exceed the
fine-tune's operating point --- deterministic machinery plus plan-level verification carry the
generalization that exists, and raw planner capability, not task fine-tuning, scales
it (\S\ref{sec:negative}, \S\ref{sec:scaling}).
\end{enumerate}
We deliberately report a negative-flavored finding alongside the positive ones: on
\emph{injected} typos, classical frequency clustering remains a strong baseline---by
construction, injection places the canonical form in the column, which is clustering's
ideal regime. The advantage of grounding is concentrated where it matters: real errors,
tail entities absent from the column, and adversarial near-misses where acting at all is
wrong (\S\ref{sec:results}).
\section{Related Work}
\label{sec:related}
\textbf{Error detection and repair.} Raha and Baran~\cite{raha} established
configuration-free error detection and correction benchmarks (hospital, beers, flights,
rayyan), which we adopt as out-of-distribution evaluation. HoloClean~\cite{holoclean}
combines integrity constraints, external reference data, and statistics in probabilistic
repair, demonstrating that external signals can veto statistically plausible but wrong
repairs---an insight our reference-veto inherits. GARF~\cite{garf} learns repair rules
self-supervised from the data itself; it also demonstrates the structural limit we
observe for frequency-only methods: a lone categorical column offers no co-occurring
signal to vote against an error.
\textbf{The 2025--26 landscape.} Post-Cocoon work concentrates on zero-label
\emph{detection}: ZeroED~\cite{zeroed} (cloud-LLM cluster labeling, hospital
detection F1 0.81, collapsing to 0.27 on smaller models), ForestED~\cite{forested}
(LLM-induced decision trees, 0.756), and Auto-Test~\cite{autotest} (corpus-mined
semantic-domain constraints, no LLM at inference) --- none performs zero-label
\emph{repair}. GIDCL~\cite{gidcl} sets the labeled-class repair ceiling
(hospital \gidclHosp{} with 20 labels and a LoRA trained per cleaned table);
Cocoon~\cite{cocoon} remains an unreproduced preprint (15 citing papers, none a
reproduction). Two concurrent results corroborate facets of this paper's central
negative finding that machinery, not weights, carries cleaning generalization: a
study showing even frontier models cannot correct table distortions without
explicit priors~\cite{distort}, and a large multi-agent-debate evaluation in which
LLM self-critique \emph{degrades} repair and only an adversarially separate,
execution-grounded critic helps~\cite{debate} --- the architecture our verifier
instantiates. Spreadsheet-RL~\cite{spreadsheetrl} reports the complementary
positive case: with full-scale RL infrastructure and execution-verified rewards,
a 4B model's spreadsheet-manipulation skill \emph{does} move (12.0\%
$\rightarrow$ 23.4\%) --- consistent with our reading that the gap between our
\$30 pilot and such results is infrastructure scale, a boundary we state rather
than blur (\S\ref{sec:negative}).
\textbf{LLMs for data wrangling.} Narayan et al.~\cite{wrangle} showed frontier
foundation models handle entity matching and imputation few-shot;
Jellyfish~\cite{jellyfish} and Table-GPT~\cite{tablegpt} fine-tune mid-size models for
data tasks. RetClean~\cite{retclean} is closest in spirit: retrieval from data lakes
grounds cell repair, with the key empirical split that parametric knowledge suffices on
world-known head values but collapses on the tail---motivating retrieval. Our work
differs in the planner/executor decomposition (the model emits no cell values, only
plans), in execution-verified supervision, and in the calibrated-abstention contract.
\textbf{Entity linking over tables.} TURL~\cite{turl} and TableLlama~\cite{tablellama}
inject candidate entities into table understanding; Belotti et al.~\cite{belotti}
show retriever coverage is the accuracy ceiling for table entity disambiguation and that
long candidate lists hurt smaller models. RACOON~\cite{racoon} shows inference-time KG
retrieval lifts a frozen model substantially, supporting our choice to ground at
inference rather than bake aliases into weights (TURL's out-of-domain collapse is the
cautionary result). MTab~\cite{mtab} established type-constrained matching with
abstention in semantic table annotation.
\textbf{Clustering-based cleaning tools.} The de-facto practitioner baseline is
OpenRefine: key-collision (fingerprint) clustering plus a nearest-neighbour mode; we
reimplement both faithfully, including blocking, and compare head-to-head.
\textbf{Selective prediction.} Risk--coverage analysis and calibration
metrics~\cite{selective} formalize ``knowing when not to act''; to our knowledge their
application to data-cleaning merge decisions is new.
\textbf{Small specialized models.} OpenMed~\cite{openmed} fine-tunes sub-500M encoders
to state-of-the-art biomedical NER, the sister result to our thesis that small
specialized models beat large generic ones on narrow structured tasks; we adopt their
released PII token classifiers for column typing (\S\ref{sec:pii}).
\section{Method}
\label{sec:method}
\subsection{Planner / executor decomposition}
A \emph{profiler} reduces each column to a typed summary: detected semantic type, missing
counts, issue flags, and a value--frequency distribution capped at 80 distinct values
(high-cardinality columns are summarized by their head). The \emph{planner}---either a
deterministic heuristic or our fine-tuned 4B model---maps the profile (plus three sample
rows) to a JSON plan: a list of per-column operations drawn from a closed vocabulary
(\texttt{canonicalize\_categories} with an explicit mapping, \texttt{parse\_date},
\texttt{standardize\_phone}, \texttt{mask\_pii}, \ldots), table operations, and review
flags. The \emph{executor} applies the plan with pure pandas transforms. The plan is the
only channel through which data changes: every diff is attributable to a named operation
with a rationale, the original table is never mutated, and abstentions are first-class
plan objects. We export per-run decision summaries as OpenTelemetry GenAI spans.
\subsection{Execution-verified synthetic supervision}
\label{sec:sft}
Training pairs are generated by corrupting clean synthetic tables with realistic noise
(casing, aliases, single-character typos with Zipf-distributed long-tail categorical
columns of 30--80 distinct values) while recording the ground-truth plan. The defining
step is \emph{verification by execution}: a candidate example is kept only if
$\textsc{Execute}(\text{dirty}, \text{plan}) = \text{clean}$ cell-for-cell. This closes
the loop between supervision and semantics---a plan that would not actually clean the
table can never become a training label. We augment with real supervision derived from
paired dirty/clean benchmarks by aligning cells and keeping only \emph{learnable}
canonicalizations (a surface form that is a string variant of its target and never a
legitimate value elsewhere), which excludes unlearnable per-cell corrections such as
divergent flight times. The fine-tune is QLoRA (rank 32) over Qwen3-4B-Instruct in
bf16; one practical finding is that the base model's tool-calling prior dominates
free-running generation even after convergent fine-tuning (loss 0.16) and must be
suppressed at decode time by banning the two tool-call tokens.
\subsection{Reference-grounded canonicalization with abstention}
\label{sec:grounding}
For columns whose values reconcile to a known concept type (countries, administrative
regions, cities), canonical forms are never generated: a fuzzy retriever (normalized
edit similarity with first-character blocking and length prefilters) matches each
distinct value against the type-scoped reference (ISO/pycountry; GeoNames cities500,
196k entries). A value maps to a canonical only if (i) similarity clears a threshold
$\tau{=}0.84$, (ii) the best--second-best margin clears $0.03$ (ambiguity veto: a value
equally close to \texttt{Box} and \texttt{Boaz} abstains), and (iii) the canonical is
cast to the column's observed case convention. Near-misses ($0.70{\le}s{<}\tau$) are
surfaced as review flags. The same wrapper grounds the \emph{model} planner: for
reference-typed columns the model's free-generated mapping is replaced by the grounded
one, so the model can add coverage but never invent a canonical for a grounded type.
\subsection{Plan-level selective prediction: the verified union planner}
\label{sec:verifier}
Grounding constrains reference-typed columns, but the planner's \emph{free}
canonicalization mappings on non-grounded columns remain unguarded---and they are where
real-data precision dies (the fine-tune's raw hospital plan: \hospModelPrecVSix{}
precision at \hospModelRecallVSix{} recall). Rather than retrain, we extend abstention
to the plan itself. A deterministic \emph{verifier} scores every proposed mapping entry
$raw{\to}canon$ with contract-preserving evidence (no cell values emitted, no gold
access): three hard gates distilled from the model's measured failure classes---a value
occurring ${\ge}3$ times is data, not a typo (\emph{errors are rare}); the target must
be a frequent column value clearly dominating the source (no mapping one typo onto
another); digit-bearing codes repair only when the letter part is near-identical---then
a confidence combining edit similarity with frequency support. Entries below a
threshold $\tau$ are dropped to review flags; abstention stays first-class. Sweeping
$\tau$ yields a plan-level precision--coverage curve. The shipped composition,
the \emph{verified union planner}, is the verifier-gated model plan ($\tau{=}0.5$)
unioned with the grounded heuristic's mappings (the model wins per surface form);
the same code path is the product default.
\subsection{Visibility and consensus: four deterministic capabilities}
\label{sec:capabilities}
Four further mechanisms, each motivated by a measured failure regime on never-seen
tables, complete the deterministic machinery. \textbf{(a) Suspect surfacing.} The
profile's value-frequency view is capped, so high-cardinality columns hide their
dirty cells from any planner. Every column profile now carries a bounded
\texttt{suspect\_values} section: rare anomalous surfaces with evidence-backed
repair candidates (frequency dominance, edit similarity, reference membership).
The heuristic planner repairs from suspects under a strict verifier bar
($\tau_{hc}{=}0.8$) and flags the rest. \textbf{(b) Generic entity reference.}
Open vocabularies (SemTab ToughTables aliases --- derived excluding our benchmark
tables; MusicBrainz search-hint misspellings; RxNorm; Wikidata; ROR) register as a
pluggable reference type. Because the reference is broad, entity-typing a column
additionally requires that ${\ge}20\%$ of its distinct values match the reference
\emph{exactly} --- fuzzy coverage alone over-fires on name-like columns (measured).
This resolves the regime where every surface in a column is unique (no in-column
frequency signal exists at all): five such benchmark tables go from 0.0 to
\ttFOne{} F1 at \emph{zero} damage. \textbf{(c) Cross-row majority voting.} Tables
that repeat a real-world entity across rows (a flight reported by many sources)
carry their own repair signal. A detection step finds compact-token key columns
with small groups (median multiplicity 3--30) and columns whose groups show
\emph{majority-bearing} disagreement with per-group information; a table-level
operation then resolves thin dissenting minorities to the group majority. A
\emph{false-consensus} guard declines when minority shares look like legitimate
correlated updates rather than reporting errors (mean minority share ${\ge}0.25$)
--- a flat volume cap was measured to destroy the legitimate dense-disagreement
regime and replaced. \textbf{(d) Convention conservatism.} The planner never
re-formats an internally consistent column: date and percent ops are gated on
dominant-shape inconsistency (digit and alpha runs collapsed; 90\% rule),
ZIP/postal-named columns are never typed as phones or dates, and Excel-serial
date typing requires a date-suggestive column name. Suppressed minority values
surface as review flags --- abstention is visible, never silent. The verifier
enforces the same gates on model-emitted plans at the verification boundary.
\subsection{PII as a second task instance}
\label{sec:pii}
The identical contract covers PII: a deterministic tier types columns by checksum and
pattern validators (Luhn, IBAN mod-97, SSN/email/phone) over distinct values; an
optional 44M OpenMed-PII token classifier~\cite{openmed} extends coverage to names and
addresses, gated by a sensitive-type allowlist and a column-level coverage vote; and
masking, salted hashing, and join-stable pseudonymization are deterministic executor
operations. Measured briefly: the classifier, though trained on sentence-level
clinical text, transfers to bare cell values --- \piiNameBare{} detection on
person-name cells and \piiAddrBare{} on address cells ($n{=}40$ sampled cells each);
the validator tier, evaluated out-of-distribution on per-type columns from the Gretel
PII test split, types 5/5 covered PII types correctly with 0/7 false positives on
negative columns drawn from real open data; and after deterministic masking,
re-running all validators over the output finds \piiLeakRate{} residual PII ---
residual PII \emph{detectable by our validators}, a circularity we note explicitly:
the leak test can only see what the validator tier sees.
\section{Evaluation Design}
\label{sec:eval}
\textbf{Suite.} Five real-error benchmarks (Raha) plus seeded error injection
(typo/OCR/case/whitespace) over 15 harvested open-data domains (NYC, Chicago, SF, LA,
Seattle, Texas, WA portals; GitHub) $\approx$ 65 datasets per seed. We aggregate as a
\emph{double macro}---mean over error types of mean over datasets, harmonically combined
with the domain macro---so no single table or error type dominates:
\begin{equation*}
\textsc{north} \;=\; \operatorname{HM}\Biggl(
\underbrace{\frac{1}{|T|}\sum_{t \in T}\frac{1}{|D_t|}\sum_{d \in D_t} F_1(d)}_{\text{error-type macro}},\;
\underbrace{\frac{1}{|G|}\sum_{g \in G}\frac{1}{|D_g|}\sum_{d \in D_g} F_1(d)}_{\text{domain macro}}
\Biggr),
\end{equation*}
where $T$ is the set of error types, $G$ the set of data domains, $D_t$ (resp.\ $D_g$)
the datasets carrying error type $t$ (domain $g$), and $\operatorname{HM}$ the harmonic
mean.
\textbf{Churn-neutral metric.} A cell change that is case/whitespace-equivalent to the
input but does not restore the gold counts as nothing: not a fix, not a change, not
damage. Without this, mass case-rewriting inflates precision (we observed $+0.12$
NORTH from \emph{removing} case matching before the correction); with it, fixing a
case-injected error requires actually acting. We additionally report
\emph{damage}---the rate of semantically corrupting clean cells---and an adversarial
\emph{abstain slice} whose traps are garbage strings (not single-edit variants of any
reference entity; an earlier trap set mis-scored grounding for correctly mapping
\texttt{Boazz}$\to$\texttt{Boaz}). We report both repairs of these metric artifacts as
evidence that gameability must be tested, not assumed.
\textbf{Real vs.\ injected.} Injected typos are in-distribution for frequency
clustering by construction (the canonical is present and dominant in the column), so we
report the real-error and injected slices separately. A TableEG-style audit
quantifies the gap (\texttt{eval/inject\_validity.py}): the injector covers three
of nine error classes (Jensen--Shannon divergence 0.526 bits from the pooled real
distribution over 163{,}607 real errors), and injected-only evaluation would
invert the fingerprint-clustering ranking --- exactly the overstatement the
separate-slice reporting prevents.
\textbf{Scorer validation.} Following GroUSE-style evaluator
testing~\cite{grouse}, the scorer itself is validated against 30 adversarial
known-by-construction cases: a no-op plan must score 0 fixes and 0 damage, an
oracle plan exactly 1.0, vandalizing $k$ of $m$ clean cells must score damage
$k/m$ at precision 0, pure churn (case/whitespace rewrites that do not restore
gold) must count as nothing although a naive scorer would count it, fixes must
require actually acting, and silent edits must trip the audit. All 30 pass
against the shipped scorer unmodified. We additionally cross-score every system
under the \emph{original} Raha/Baran cell-repair protocol side by side with ours
(\texttt{eval/cross\_scoring.py}): rankings agree at Kendall $\tau_b{=}1.0$ on
three of five datasets, and the disagreements cut both ways --- raw string
equality denies credit for numerically-correct serialization restorations (our
movies\_1 repairs), while churn-neutrality charges Baran for load-time
normalizer rewrites its own protocol hides (hospital precision
$0.908\!\to\!0.783$). Neither metric family flatters us uniformly, and our Baran
reproduction calibrates against its published Table~3 within $+0.02$ on three of
the four shared datasets.
\textbf{Contamination.} The Raha-suite benchmarks have been public on GitHub since
2019 and sit inside every modern base model's training window; we treat them as
potentially contaminated and split our claims accordingly. A verbatim-completion
probe makes the concern concrete: prompted with five fields of a gold hospital row,
a frontier-class model reproduces \textbf{25\%} of the held-out cells exactly
(30/120 cells over 30 rows, exact-substring match), versus \textbf{0\%} (0/120) on a
date-stamped post-training-cutoff wild harvest under the identical protocol
(\texttt{eval/contamination\_probe.py}). The rate is an upper bound on memorization
--- some completions are guessable from the given fields --- but it is not zero, so
results on legacy-public benchmarks (including the all-hospital
Table~\ref{tab:scaling}, whose zero-shot planners may partially benefit from
memorized gold) carry this caveat, while the architecture's trust claims
(zero silent edits, damage accounting, abstention) rest on the date-stamped wild
and GitTables slices, where the probe finds nothing to complete.
\section{Results}
\label{sec:results}
\subsection{Plan-level selective prediction on real errors}
\label{sec:ws1results}
On hospital's \hospErrors{} real errors, the verifier transforms the fine-tune from
unshippable to precise (Figure~\ref{fig:pc}): the raw model plan repairs
\hospModelRecallVSix{} of errors at \hospModelPrecVSix{} precision; gated at $\tau{=}0.5$ it
reaches \modelGatePrec{} precision at \modelGateCov{} coverage (146 of 147 committed
changes correct). The union with the grounded heuristic buys coverage back:
\textbf{\unionGatePrec{} precision at \unionGateCov{} coverage} (\unionChanged{}
changes, \unionFixed{} correct). This turns the system's promise into a measured
sentence: \emph{zero-configuration and zero labels, repair 41\% of real errors at
${\ge}0.90$ precision, with every declined merge surfaced for review}. For context,
Baran given oracle error positions and 20 gold-labeled tuples per dataset reaches
\realFBaran{} F1 on the same slice (\S\ref{sec:ws4})---selective prediction does not
close a supervised gap, but it makes the zero-label operating point trustworthy, which
is the regime our user occupies. Precision is flat ($0.89$--$0.91$) for
$\tau\in[0.2,0.8]$, so the operating point is not threshold-brittle, and the result is
seed-robust: across three training seeds of the same data recipe the union operating
point is \unionGateThreeSeedPrec{} precision at \unionGateThreeSeedCov{} coverage
(the shipped adapter is the strongest seed), with every seed clearing the
$0.70$-precision/$0.30$-coverage bar decisively. All 3-seed intervals in this paper
are normal-approximation 95\% CIs ($1.96\,\sigma/\sqrt{3}$); the $t$-based
interval at $n{=}3$ is ${\sim}2.7\times$ wider ($\pm 0.031$ here) and every
qualitative claim survives it --- the weakest seed alone clears the bar.
\textbf{Candidate-constrained planning (negative result).} We also tested constraining
the planner's \emph{inputs}: the profiler emits evidence-backed (variant$\,\to\,$
canonical) candidate pairs (frequency dominance, edit similarity, reference membership)
and the model may only select among them, with a deterministic check dropping
off-candidate mappings to review flags. As a standalone guard it is strong---the raw
plan's precision rises from \hospModelPrecVSix{} to \pairsRawPrec{} with no verifier at
all---but composed with the verifier and union it reaches \pairsUnionPrec{} precision
at \pairsUnionCov{} coverage, slightly \emph{below} the unconstrained pipeline at the
same $\tau$: the candidate cap (top-3 per surface) removes some correct repairs the
verifier would have kept, and the two mechanisms gate the same failure class. We ship
the verifier and keep candidate constraining available but off by default, reporting
this as a measured redundancy rather than a stacked win.
\begin{figure}[t]
\centering
\includegraphics[width=0.62\linewidth]{fig_precision_coverage}
\caption{Plan-level precision--coverage on hospital (509 real errors), sweeping the
verifier threshold $\tau$. The union planner dominates the raw model plan; the shipped
operating point ($\tau{=}0.5$) is annotated.}
\label{fig:pc}
\end{figure}
\subsection{The 4B fine-tune as one planner instantiation}
On frozen synthetic gold, the fine-tuned 4B planner reaches canonicalization micro-F1
\canonFMultiSeed{} --- versus \canonFBig{} for a much larger zero-shot generalist
prompted identically and \canonFHeur{} for the rule heuristic (best single run
\canonFOursBest; operation-F1 \opFOurs, JSON validity \jsonValidOurs). On real hospital
typos the synthetic-only fine-tune scores 0.000 repair recall; adding 20\%
real-derived supervision lifts it to \hospModelRecall, and a data-scaling iteration
(tripling the real-derived share from three paired benchmarks) reaches
\hospModelRecallVSix{} recall at \hospModelPrecVSix{} precision---approaching the
\frontierZeroShotRecall{} of a frontier-scale zero-shot model. The scaling gain is seed-robust: $+0.09$
canonicalization F1 over the base mix under identical protocol, with non-overlapping
3-seed confidence intervals. Real, execution-verified pairs are what transfer:
the same iteration found frequency-derived and algorithm-cleaned labels both
\emph{reduce} quality, consistent with our grounding thesis.
\subsection{Grounding vs.\ clustering}
With the errors-are-rare frequency gates now in both paths, grounding and frequency
clustering are comparable on hospital alone (repairs-only, churn-neutral:
\hospPrecGrounded{} precision at \hospRecallGrounded{} recall grounded vs
\hospPrecFreq{} at \hospRecallFreq{} clustering---hospital's dominant errors are
in-column typos, clustering's best case). Grounding's margin appears where references
matter: across the five-benchmark real-error macro it reaches \ablFullRealF{} versus
\ablNoGroundRealF{} for the frequency-clustering ablation ($+29\%$), and it carries
the behavioral guarantees below.
On the full suite against OpenRefine (Table~\ref{tab:money}), the result splits
cleanly by regime, and we report both. On the \emph{real-error} slice---the regime the
tool exists for---grounded cleaning reaches REAL-F1 \realFGrounded{}, $3.9\times$
OpenRefine kNN (\realFORKnn) and $5.7\times$ fingerprint (\realFORFp), with seed CIs of
$\pm$\northGroundedCI. Provenance: the grounded and OpenRefine rows of
Table~\ref{tab:money} are regenerated at the current system head (2026-06-12,
post-capability, scorer fix in); the dagger rows keep their original capture
provenance. The June-10 freeze system measured REAL-F1 \realFGroundedFreeze{} on the
same protocol --- the $+0.05$ difference is the measured contribution of the four
deterministic capabilities (\S\ref{sec:capabilities}) on the real-error slice. On the \emph{injected} slice, fingerprint clustering wins
(\injFORFp{} vs \injFGrounded) at near-zero damage: our case/whitespace injectors are
exactly the perturbations key-collision normalizes away, so this is its home game and
we say so. kNN clustering---the method that, like us, attempts typo repair---loses on
both slices while incurring the highest damage among baselines (\damageORKnn), the
no-reference over-merging failure the grounding was built to prevent. The shipped
verified-union system's suite row (REAL-F1 \modelRealF, damage \modelDamage) shows the
grounding wrapper and heuristic union carry entity canonicalization on these datasets ---
the model's contribution concentrates on the synthetic regime and hospital repair
(\S\ref{sec:ws1results}), and the verifier cuts its suite damage to \modelDamage,
$6\times$ below the grounded heuristic's \damageGrounded{} (HEAD damage vs the union
row's freeze-time capture --- a disclosed basis mix). Within our own ablations
(June-10 freeze basis throughout), removing grounding cedes $22\%$ of real-error
F1 (\ablNoGroundRealF{} vs \ablFullRealF) and forfeits the behavioral guarantees:
perfect abstention on adversarial traps (\ablFullAbstain) versus
\ablNoAbstainAbstain{} without abstention, and reference-vetoed wrong merges (e.g.\
\texttt{guntxrsvillx}$\to$\texttt{huntsville}).
\begin{table}[t]
\centering
\caption{Wide-suite comparison, 3 injection seeds, churn-neutral metric. NORTH is the
double-macro harmonic mean; REAL-F1 is the real-error slice. Regenerated at the
current system head (2026-06-12); the June-10 freeze system measured
\realFGroundedFreeze{} REAL-F1 / \northGroundedFreeze{} NORTH on the same protocol.}
\label{tab:money}
\begin{tabular}{lcccccc}
\toprule
System & NORTH & $\pm$95\%CI & REAL-F1 & INJ-F1 & damage & abstain \\
\midrule
Grounded (ours) & \northGrounded & \northGroundedCI & \textbf{\realFGrounded} & \injFGrounded & \damageGrounded & \ablFullAbstain \\
OpenRefine fingerprint & \northORFp & 0.000 & \realFORFp & \injFORFp & \damageORFp & 1.000 \\
OpenRefine kNN & \northORKnn & 0.002 & \realFORKnn & 0.148 & \damageORKnn & 1.000 \\
Verified union 4B (shipped)$^{\dagger}$ & -- & -- & \modelRealF & -- & \modelDamage & \modelAbstain \\
\midrule
Baran (oracle det.\ + 20 labels)$^{\ddagger}$ & -- & -- & \realFBaran & -- & \damageBaran & -- \\
Jellyfish-13B (ED+DI)$^{\ddagger}$ & -- & -- & \realFJelly & -- & \damageJelly & -- \\
\bottomrule
\end{tabular}
\smallskip
{\small $^{\dagger}$single seed, REAL + typo-injected slice only (GPU cost); other rows
are 3-seed means. $^{\ddagger}$real slice only, disclosed protocol asymmetries
(\S\ref{sec:ws4}): Baran uses oracle error positions + gold labels; Jellyfish is our
detect-then-impute composition with seen-data caveats.}
\end{table}
\subsection{Generalization to never-seen tables}
\label{sec:wild}
The freeze-version system above was then pointed at data it had never seen, under
three new harnesses (all released with this paper as the \textsc{WildClean} bundle).
\textbf{(1) Paired bench}: \nPairs{} dirty/gold pairs spanning the Raha suite, SemTab
ToughTables, government open-data typo corpora, entity-matching tables, and
LLM-cleaning evaluation sets. On the 35 pairs from sources absent from training ---
a count that coincidentally equals, but is distinct from, the \nWild{} gold-free wild
tables of harness~(2) below --- the
post-freeze system scores \textbf{macro F1 \unseenMacroF{} at damage
\unseenMacroDamage}. The largest single contribution is the regime
\S\ref{sec:capabilities}(b) unlocks: on five all-unique entity tables where no
in-column frequency signal exists, F1 moves from $0.0$ to \ttFOne{} at zero damage.
Cross-row voting (\S\ref{sec:capabilities}c) is the second: flights---many sources
reporting the same flight---goes from \flightsBaseF{} to \flightsVoteF{} F1
heuristic-only, and the heuristic hospital path doubles from \hospBaseHeur{} to
\hospVoteHeur{}. The hospital union gate is invariant under all of this
(\unionGatePrec{} at \unionGateCov). \textbf{(2) Wild bench}: \nWild{} uncurated
in-the-wild tables (open-data portals, GitHub, Kaggle) with no gold; we score seeded
inject--recovery on each table's own data (mean recovery \wildRecovery{} over the 34
tables with inject scores; one table has none) plus a
behavioral audit: every run yields a valid plan, every changed cell is attributable
to a logged operation --- \textbf{zero silent edits across all \nWild{} tables}.
\textbf{(3) Trust audit at scale}: \nTrust{} GitTables tables, same property ---
\nTrust{}/\nTrust{} valid plans, zero crashes, zero silent edits. The held-out-source
generalization metric (train and evaluation drawn from disjoint benchmark sources)
remains low in absolute terms (GEN-F1 \genFTwo{}, variant-recall \genVRTwo{}, damage
\genDamageTwo): cleaning unfamiliar tables is far from solved, and we report the
number to anchor the next section's claim about \emph{where} the capability that does
exist actually lives.
\subsection{Where capability lives: a bounded null for fine-tuning}
\label{sec:negative}
Every attempt to move never-seen-table performance through the model weights failed;
every gain in \S\ref{sec:wild} came from deterministic machinery plus the verifier.
Five further supervised fine-tunes --- adding 109k harvested real-world alias pairs
(ToughTables-derived, MusicBrainz search hints, RxNorm, OpenFlights), error-dense
episode mixes, and a suspects-contract retrain --- left held-out GEN-F1
\emph{statistically bounded}: every retrain's delta is positive but negligible (mean
$+0.003$), never approaching the pre-registered $\delta{=}0.05$. ``Bounded'' is a
tested equivalence claim, not an eyeballed one~\cite{lakens}: across the five-retrain
series the mean held-out GEN-F1 delta (retrain minus champion) is $+0.0028$
(90\% bootstrap CI $[+0.0008, +0.0049]$, strictly positive;
10{,}000 resamples, seed 42; per-dataset granularity, $n{=}15$ over 3 held-out
sources $\times$ 5 retrains --- per-pair deltas do not exist for the retrain
series, so within-retrain deltas are clustered and we add a retrain-level
robustness check, $n{=}5$ macro deltas), and TOST rejects effects beyond the
pre-registered SESOI of $\pm 0.05$ ($p = 8.0\times10^{-16}$; retrain-level check
$p = 8.3\times10^{-8}$). One disclosure sharpens the clustering caveat: two
retrains' held-out rows are \emph{bit-identical} --- mechanically verified as
verifier-collapse, not a data error (their raw plans share zero mapping entries,
9 vs.\ 82 on flights, yet the verifier kills all of both, so each union
degenerates to the same deterministic plan;
\texttt{eval/results/equivalence\_coincidence.json}) --- so the $n{=}15$ rows
carry fewer independent observations than their count suggests, which is exactly
why the $n{=}5$ retrain-level test is the one we lean on. The collapse itself is
the finding in miniature: different weights, same held-out behavior, because the
verifier and the deterministic machinery decide what survives. Two reconciliations make the claim auditable. First, the
basis: the equivalence series is scored against the champion's absolute GEN-F1 of
\genChampionBasis{}, while the \genFTwo{} of \S\ref{sec:wild} is the \emph{shipped
system} at the post-freeze HEAD with all deterministic capabilities --- the
equivalence series scores each retrain's model-union path at its own capture time,
so the two figures share a metric but not a basis. Second, the SESOI: weight
interventions move GEN-F1 by at most $0.005$, while the deterministic machinery of
\S\ref{sec:capabilities} moved the unseen-pair macro from $0.10$ to \unseenMacroF{}
--- $\delta{=}0.05$ sits an order of magnitude above the measured weight effect and
well below the machinery effect, which is exactly the boundary the test is meant to
police. Mixing harvested pairs into the training blend
\emph{diluted} the synthetic skill the executor verifies (a monotonic dilution law
across mix ratios). A GRPO pilot using the executor as a verifiable reward (the
direction RLVR table work~\cite{tabler1} motivates) was negative in all three arms at
4B/LoRA scale: the main arm and a KL-anchored variant degraded plan-format validity,
and a random-reward control arm reproduced the same drift, identifying it as an RL
artifact rather than signal~\cite{spurious}. We state this as a \emph{bounded} null,
not a universal one: at 4B/LoRA scale, under our propose/execute protocol and
training budgets, no weight intervention we ran produced measurable movement in
never-seen-table repair --- profiling visibility, reference grounding, cross-row
consensus, convention conservatism, and plan-level verification carry the capability
that exists. The bound is explicit: results with full-scale RL infrastructure
(execution-verified rewards on multi-GPU RLVR stacks~\cite{spreadsheetrl,tabler1})
show task skill moving at the same parameter scale, so our claim is about what
SFT-and-pilot-RL buy in this protocol class, not about reinforcement learning in
general. A second explicit bound: every weight experiment here uses the Qwen3
family --- and the very work we cite to explain the control arm's drift documents
that random-reward GRPO effects are themselves family-sensitive~\cite{spurious}
--- so the null is stated for Qwen3-class models pending a cross-family
replication. Concurrent evaluations corroborate the mechanism from independent
directions~\cite{distort,debate}. The practical corollary is unusual but actionable:
a contributor who wants to improve a system like this should write a deterministic
capability and gate it with the verifier, not collect more training data.
The null extends to test-time compute --- with one instructive exception that
\emph{confirms} the architecture claim. Self-consistency \emph{voting} over
$N{=}16$ temperature-0.7 samples (cell-edit-level majority, run through the
identical verifier--union pipeline) yields 0.906 precision at 0.454 coverage
versus 0.9055 at 0.4519 for matched greedy decoding on the same local runtime ---
a null at matched precision, the visibility law from the test-time side: voting
cannot surface repairs the profile does not expose, and it actively discards
verified-recoverable coverage. But pooling \emph{every} mapping from all 16
samples and letting the verifier filter the union gives the best operating point
we measure for the 4B: \textbf{0.911 precision at 0.483 coverage} ($+0.6$ points
precision, $+7.1$ points coverage over the shipped gate; an independent $N{=}8$
replication reproduces the \emph{voted} point to $\pm 0.0003$ precision /
$\pm 0.002$ coverage, and the greedy anchor exactly). The lesson is the paper's thesis in miniature: sampling
helps only as a \emph{candidate generator}; consensus adds nothing the verifier
does not already provide --- pool candidates, verify, do not vote. Separately,
the local capture path itself (Q8 quantization with grammar-constrained decoding)
is worth $+3.9$ points of coverage over the original Modal capture at equal
precision.
\subsection{Zero-label capability scaling: the verifier harness is planner-agnostic}
\label{sec:scaling}
The negative result bounds what fine-tuning small weights buys; it says nothing
about raw capability. To separate the two we dropped zero-shot, $\leq$32B
open-weights planners --- with \emph{no} task training --- into the identical
hospital pipeline the 4B fine-tune uses: same prompt contract, same
verify($\tau{=}0.5$), same union with the grounded heuristic
(Table~\ref{tab:scaling}). devstral-small-2-24B and gemma4-31B both reach
\textbf{\scalePrecBig{} precision at \scaleCovBig{} coverage} --- exceeding the
fine-tune's union point of \unionGatePrec{} at \unionGateCov{} --- while
nemotron-30B reaches \scalePrecNemo{} at \scaleCovNemo{} with JSON-plan validity
0.4 (validity is part of the measurement: a planner that cannot reliably emit the
plan schema loses coverage before capability is measured). gpt-oss-20B is
excluded as a serving failure, documented rather than scored as capability: the
hosted proxy returned empty content on every planning call despite full-length
generation. The arm is multi-family (Mistral, Google, NVIDIA), which addresses
the single-family bound of \S\ref{sec:negative} for the inference side; the
weight-training null itself remains Qwen3-scoped. Disclosure: these models were
measured via hosted inference for speed; all are $\leq$32B open weights and
locally deployable in principle. The interpretation we draw is the paper's
sharpest: SFT at 4B does not buy held-out generalization (\S\ref{sec:negative}),
but raw capability at 24--31B does lift the same harness --- the verifier/union
architecture is the portable contribution, converting any sufficiently capable
planner into a trustworthy cleaner.
\begin{table}[t]
\centering
\caption{Zero-shot $\leq$32B planners in the identical verify($\tau{=}0.5$)+union
harness, hospital's \hospErrors{} real errors. Validity = fraction of planning
calls returning schema-valid JSON. Runtime = wall-clock for the planning calls on
hosted endpoints (single capture, no seeds; the 4B row is a prior Modal A100
capture with no comparable local figure). Each scaling row is a single capture;
the primary evidence is the union coverage delta ($+0.07$) at matched-or-better
precision, not any single cell. For context, 16-sample pooling lifts the 4B
fine-tune to $0.911@0.483$ at $16\times$ planning compute
(\S\ref{sec:negative}); the 24--31B planners reach $0.915@0.485$ in a single
greedy pass --- single-pass capability versus test-time compute, both converted
into trustworthy operating points by the same verifier. Bold marks the best union operating point.
gpt-oss-20B excluded (serving failure: empty
proxy responses, not measurable capability).
The identical devstral/gemma rows are a verified counting coincidence, not a
scoring artifact: their applied cell-edit sets share 266 of 270 cells, each
commits 4 model-specific repairs (all correct), and the totals coincide
(\texttt{eval/results/scaling\_coincidence.json}).
}
\label{tab:scaling}
\footnotesize
\begin{tabular}{lccccc}
\toprule
Planner & Params & Gated P@C & Union P@C & Validity & Runtime (s) \\
\midrule
ScrubData-v6 (Qwen3-4B fine-tune) & 4B & 0.993 @ 0.287 & 0.905 @ 0.413 & --- & --- \\
devstral-small-2 (Mistral) & 24B & 0.943 @ 0.426 & \textbf{0.915 @ 0.485} & 1.0 & \runtimeDevstral \\
nemotron-3-nano (NVIDIA) & 30B & 1.000 @ 0.138 & 0.877 @ 0.336 & 0.4 & \runtimeNemo \\
gemma4 (Google) & 31B & 0.943 @ 0.426 & \textbf{0.915 @ 0.485} & 1.0 & \runtimeGemma \\
\bottomrule
\end{tabular}
\end{table}
\subsection{Ablations}
All ablations are 3-seed means (CIs $\le\pm0.003$). Removing abstention costs $-0.013$
NORTH, raises damage to \ablNoAbstainDamage{} (from \ablFullDamage), and collapses trap
abstention to \ablNoAbstainAbstain. Removing the ambiguity margin costs $-0.006$ with
$+0.001$ damage. Removing case matching costs $-0.002$ under the churn-neutral metric
(and \emph{gained} $+0.12$ under the uncorrected metric---the artifact). Replacing
grounding with frequency clustering gains $+0.020$ NORTH, all of it from the injected
slice (\S\ref{sec:eval}), while ceding $-0.039$ real-error F1---the trade the system
refuses by design.
\subsection{Learned-repair baselines under disclosed protocols}
\label{sec:ws4}
We additionally run two learned-repair baselines on the real-error (Raha) slice,
under the identical churn-neutral metric but with honestly disclosed protocol
asymmetries. \textbf{Baran}~\cite{raha} is semi-supervised: we run its reference
configuration---oracle error positions from the dirty/gold diff plus 20 gold-labeled
tuples per dataset (its package default), without the optional Wikipedia-pretrained
value models. It reaches REAL-F1 \realFBaran{}$\,\pm$\realFBaranCI{} (3 label-sampling
seeds) at \damageBaran{} damage---an upper bound under a strictly more informed
protocol than ours (zero labels, no oracle detection); with oracle positions it can
essentially only edit true-error cells, so its near-zero damage is structural.
\textbf{Jellyfish-13B}~\cite{jellyfish} publishes per-cell error detection and
imputation but no repair task; we compose the two (detect, then impute flagged cells
with the attribute masked) --- a pipeline of our construction, not theirs. It scores
REAL-F1 \realFJelly{} at \damageJelly{} damage (single seed, recommended decoding;
note hospital is in its instruction-tuning data and flights/rayyan in its published
evaluation suite, so these numbers may flatter it). Neither baseline is run on the
56-spec injected suite (computationally and methodologically out of scope for
semi-supervised and per-cell-LLM repair); their NORTH/INJ-F1 cells in
Table~\ref{tab:money} are blank by design. The comparison locates our contribution:
zero-config systems (ours, OpenRefine) occupy a different protocol class from
supervised repair, and the verifier (\S\ref{sec:ws1results}) is what makes the
zero-config class precise enough to trust, not what closes the labeled gap.
Table~\ref{tab:perdataset} breaks the real-error slice down per dataset at HEAD.
The verified-union rows are reported with their honest shape: off hospital the
union turns ultra-conservative --- on rayyan it commits 12 changes at 0.001
damage; on beers it holds precision 0.546 at recall 0.018. The gate's precision
premise transfers as \emph{safety} (union damage stays at 0.001--0.080) but not
as coverage. The movies\_1 union cell ($^{q}$: local Q8 capture, the disclosed
quantized protocol) is the instructive worst case: on entity-rich name columns
the quantized planner proposes plausible-but-wrong merges
(\texttt{The Longest Day}$\,\to\,$\texttt{The Longest Yard}); the verifier kills
most, and what leaks through is damage within the disclosed band with zero
credited fixes --- the planner contributes nothing there, and the system's value
is that it \emph{contains} a bad planner rather than amplifying it. This directly
answers the co-adaptation concern: hospital is where the model's learned mappings
live, and elsewhere the system abstains or contains rather than guesses.
\begin{table}[t]
\centering
\caption{Per-dataset real-error results (Raha slice), churn-neutral F1 / damage.
Grounded is the HEAD deterministic system; OR = OpenRefine reimplementations;
Union is the verified union planner ($\tau{=}0.5$) where a captured model plan
exists (movies\_1 capture pending); Baran uses oracle error positions + 20 gold
labels (mean of 3 label-sampling seeds) and is a supervised reference, not a
peer.}
\label{tab:perdataset}
\footnotesize
\begin{tabular}{lccccc}
\toprule
Dataset & Grounded (HEAD) & OR fingerprint & OR kNN & Verified union & Baran (oracle+20) \\
\midrule
hospital & 0.258 / .066 & 0.000 / .000 & 0.189 / .083 & 0.567 / .001 & 0.827 / .004 \\
beers & 0.025 / .005 & 0.194 / .000 & 0.086 / .074 & 0.035 / .001 & 0.918 / .000 \\
flights & 0.127 / .082 & 0.000 / .000 & 0.014 / .065 & 0.035 / .080 & 1.000 / .000 \\
rayyan & 0.000 / .118 & 0.000 / .001 & 0.002 / .008 & 0.000 / .001 & 0.402 / .010 \\
movies\_1 & 0.714 / .025 & 0.002 / .018 & 0.001 / .072 & 0.000 / .025$^{q}$ & 0.909 / .001 \\
\midrule
macro F1 & \realFOursHead & 0.039 & 0.058 & --- & \realFBaran \\
\bottomrule
\end{tabular}
\end{table}
\subsection{A matched label budget separates the supervision regimes}
\label{sec:labelcurve}
The Baran comparison above is two points (zero labels, twenty labels); the
matched-budget curve in Figure~\ref{fig:labelcurve} measures what each label is
worth to each system on the same five-dataset real-error macro. At zero labels
Baran --- even \emph{retaining} its oracle error positions --- repairs nothing
(F1 \realFBaranZero, 3 seeds): its value models have nothing to learn from.
ScrubData operates at \realFOursHead{} with zero configuration. With labels Baran
climbs steeply (\realFBaranFive{} at $k{=}5$, \realFBaran{} at $k{=}20$): the two
systems occupy complementary supervision regimes, a relationship now measured
rather than asserted. ScrubData's own $k$-label arm uses the labels \emph{only}
to validate and expand the verifier accept set --- no retraining, no oracle
positions: $\realFOursFive \pm 0.023$ at $k{=}5$ and $\realFOursTwenty \pm 0.012$
at $k{=}20$ (3 label-sampling seeds). The disclosed asymmetry stands at every
budget: Baran keeps oracle error positions throughout, so the curve is an upper
bound in its favor.
\begin{figure}[t]
\centering
\includegraphics[width=0.62\linewidth]{fig_label_curve}
\caption{Matched-budget label curve, five-dataset real-error macro F1. At
$k{=}0$ Baran repairs nothing even with oracle error positions retained;
ScrubData operates at \realFOursHead{} with zero configuration. With labels
Baran climbs steeply --- complementary supervision regimes, measured. Error
bars ($\pm$) are standard deviations over 3 label-sampling seeds; the Baran
$k{=}20$ point reuses the 3-seed baseline run of \S\ref{sec:ws4}.}
\label{fig:labelcurve}
\end{figure}
\subsection{Degenerate baselines and cost-weighted damage}
\label{sec:degenerate}
Four degenerate policies pin the metric's floor and ceiling on the full 42-pair
bench (Table~\ref{tab:degenerate}). No-op and oracle land exactly at 0 and 1;
abstain-all is score-identical to no-op because the repair metric is flag-blind
by design (abstentions are audited separately); seeded random editing of 5\% of
cells is vandalism the metric must punish. Since F1 alone under-punishes
vandalism, we add a cost-weighted score in the Effective-Reliability style,
$\Phi_c = (\mathrm{fixes} - c\cdot\mathrm{damaged})/\mathrm{errors}$ at
$c \in \{1, 5, 10\}$: random editing scores $-0.49$ to $-4.89$, while the
shipped system stays positive at $c{=}1$ (\degShippedPhiOne) --- and goes
negative at higher $c$, which is the honest reading: at 10:1 cost asymmetry,
only near-zero-damage operating points (the verified union) are defensible.
One disclosure: the oracle acceptance check itself surfaced a scorer artifact
--- 3 cells in 1.79M held the literal string \texttt{Nan} (a first name), which
parses to float NaN and was unequal to itself --- now fixed in
\texttt{eval/metrics.py} with a regression test; published numbers shift by
less than $10^{-4}$.
\begin{table}[t]
\centering
\caption{Degenerate policies pin the metric (42 pairs, churn-neutral macro;
random-edit: seeded, 5\% of cells). $\Phi_c$ is micro-summed
$(\mathrm{fixes} - c\cdot\mathrm{damaged})$ per benchmark error. ``Shipped''
here is the deterministic grounded path on the 42 pairs (damage
\degShippedDamage), distinct from the verified-union suite row of
Table~\ref{tab:money} (damage \modelDamage).}
\label{tab:degenerate}
\small
\begin{tabular}{lccccccc}
\toprule
Policy & F1 & P & R & damage & $\Phi_1$ & $\Phi_5$ & $\Phi_{10}$ \\
\midrule
no-op & 0.000 & 1.000 & 0.000 & 0.000 & $0.00$ & $0.00$ & $0.00$ \\
abstain-all & 0.000 & 1.000 & 0.000 & 0.000 & $0.00$ & $0.00$ & $0.00$ \\
random-edit & 0.000 & 0.001 & 0.001 & 0.049 & $-0.49$ & $-2.45$ & $-4.89$ \\
oracle & 1.000 & 1.000 & 1.000 & 0.000 & $+1.00$ & $+1.00$ & $+1.00$ \\
shipped & \degShippedF & \degShippedP & 0.308 & \degShippedDamage & $+0.13$ & $-1.37$ & $-3.26$ \\
\bottomrule
\end{tabular}
\end{table}
\subsection{Calibration of abstention}
\label{sec:calibration}
On a probe of reference-entity typos plus garbage traps, retrieval confidence is a
usable selective-prediction signal: AURC \aurc, ECE \ece{} (over-confident;
temperature scaling is future work), and
precision rises monotonically with threshold---\precAtDefault{} precision at the default
$\tau{=}0.84$ (coverage \covAtDefault), and $\geq$95\% precision at
$\tau{=}\threshNinetyFive$ (coverage \covNinetyFive). Figure~\ref{fig:rc} shows the
risk--coverage curve.
\begin{figure}[t]
\centering
\includegraphics[width=0.62\linewidth]{fig_risk_coverage}
\caption{Risk--coverage for grounded city reconciliation (650 probes). Operating points
annotated; the confidence supports thresholded abstention.}
\label{fig:rc}
\end{figure}
\section{Limitations}
Reference coverage is the recall ceiling: entities absent from the taxonomy abstain by
design, which is safe but not helpful; coverage work (larger gazetteers, ROR for
organizations) moves recall directly. Our damage metric is convention-tolerant for case
and whitespace but still counts alias expansion (\texttt{NYC}$\to$\texttt{New York}) as
damage when the gold keeps the alias---a value-level convention question we leave open.
The confidence signal is over-confident (ECE \ece); temperature scaling is future
work. The injected half of the suite, while seeded and reproducible, inherits the
injector's error model; we mitigate with the real-error slice and report both. All
weight-training experiments (SFT and GRPO) use a single model family (Qwen3), so
the negative result of \S\ref{sec:negative} is family-scoped until replicated on a
second family. PII
coverage is English-only, and we make no de-identification guarantee. Finally, the
fine-tune headline is reported with multi-seed confidence intervals, but the wide-suite
model row is single-seed for cost reasons and scoped as such.
\section{Conclusion}
A planner/executor decomposition with plan-level selective prediction --- the model
proposes, a deterministic engine executes, a verifier gates every mapping --- turns
LLM data cleaning from a trust liability into an auditable system: every change is a
named, reversible operation; uncertain actions become review flags rather than silent
corruptions; and the evaluation itself is built to resist gaming. The post-freeze
program sharpened the architecture into a finding: across
five further fine-tunes and a three-arm GRPO pilot, the weights never moved
never-seen-table performance --- deterministic visibility, grounding, consensus, and
verification did, at zero silent edits across \nWild{} wild tables and a
\nTrust{}-table trust audit. The scaling arm completes the picture: the bounded null
is about fine-tuning small weights, not about capability --- two of three zero-shot
24--31B planners dropped into the unchanged verifier harness exceed the
fine-tune's operating point (\S\ref{sec:scaling}), so the architecture is
planner-agnostic: capability gains arrive as better operating points without
retraining. The shipped system runs
entirely locally on commodity hardware and no data leaves the machine; the
scaling-arm planners were measured via hosted endpoints, but all are locally
deployable open weights. We believe the recipe---propose/execute decomposition,
verification-by-execution, retrieval-grounded outputs, and selective prediction over
deterministic capabilities---is a template for deploying small specialized models on
other structured tasks.
\section*{Reproducibility}
\begin{sloppypar}
The model weights are public:
\url{https://huggingface.co/ricalanis/scrubdata-qwen3-4b-v6-q8}. Code, evaluation
suite, and result artifacts are released at the project repository,
\url{https://github.com/ricalanis/scrubdata-hackathon} (public upon publication,
available to reviewers from the initial submission). The \textsc{WildClean}
bundle --- redistributable dirty/gold pairs, the GitTables audit slice, open
vocabularies, result JSONs, and license-gated loaders for the non-redistributable
pairs --- is a public Hugging Face dataset
(\url{https://huggingface.co/datasets/ricalanis/wildclean}). The shipped product
planner is the identical code path measured here (\texttt{scrubdata/active.py}).
\end{sloppypar}
\paragraph{Release integrity.} Our own reproducibility QA discovered that the
published Q8\_0 GGUF was corrupted by an export bug (the export declared a wrong
end-of-generation token id inside the Qwen3 vocabulary, degenerating into
tool-call loops on all runtimes; a base-model control isolated the fault to the
export, not the adapter). It has been re-exported from the v6 adapter and
replaced under the same filename, with both sha256 checksums recorded in the
model card's Integrity section. Third-party reproduction of the model-path
numbers additionally requires constrained decoding on long prompts ---
\texttt{format=json} under Ollama, or
\texttt{suppress\_tokens=[151657,151658]} under transformers --- which is now
documented in the model card and \texttt{notebooks/Modelfile}.
\paragraph{Setup.} Clone the repository and run \texttt{uv sync} (Python 3.12;
\texttt{uv} resolves the pinned environment). The non-redistributable benchmark
pairs materialize from their original sources with the \textsc{WildClean}
\texttt{loaders.py}. Model-path results additionally need the released Q8\_0 GGUF
served by a local Ollama (\texttt{SCRUBDATA\_MODEL}); every deterministic-path
number runs with no model at all. Baran runs in the separate pinned environment
documented at the top of \texttt{eval/run\_baran.py}; Jellyfish-13B runs remotely
via Modal.
\paragraph{One command per reported number} (all from the repository root, at the
released revision):
\begin{center}
\footnotesize
\begin{tabular}{@{}ll@{}}
\toprule
Reported result & Command \\
\midrule
Wide-suite comparison (Table~\ref{tab:money}) & \texttt{python -m eval.run\_real\_multi --out eval/results} \\
Precision--coverage curve + gate & \texttt{python -m eval.precision\_curve} \\
\quad (Figure~\ref{fig:pc}, \S\ref{sec:ws1results}) & \texttt{\ \ --plan eval/results/v6\_hospital\_raw\_plan.json --union} \\
Ablations & \texttt{python -m eval.ablations} \\
Calibration (Figure~\ref{fig:rc}) & \texttt{python -m eval.calibration} \\
PII leak test & \texttt{python -m eval.pii\_leak} \\
Baran baseline & \texttt{python eval/run\_baran.py}, then \\
& \texttt{python -m eval.baselines\_learned --score-baran} \\
Jellyfish baseline & \texttt{modal run scripts/modal\_jellyfish.py} \\
\midrule
Paired bench (\S\ref{sec:wild}) & \texttt{python -m eval.paired\_bench} \\
Wild bench (\S\ref{sec:wild}) & \texttt{python -m eval.wild\_bench} \\
GitTables trust audit (\S\ref{sec:wild}) & \texttt{python -m eval.gittables\_audit} \\
Held-out-source generalization & \texttt{python -m eval.generalization} \\
\midrule
Scorer validation (\S\ref{sec:eval}) & \texttt{python -m pytest tests/test\_wildclean\_scorer.py} \\
Degenerate baselines (Table~\ref{tab:degenerate}) & \texttt{python -m eval.degenerate} \\
TOST equivalence (\S\ref{sec:negative}) & \texttt{python -m eval.equivalence} \\
Label curve (Figure~\ref{fig:labelcurve}) & \texttt{python -m eval.label\_curve} \\
Per-dataset table (Table~\ref{tab:perdataset}) & \texttt{python -m eval.raha\_table} \\
Self-consistency vote/pool (\S\ref{sec:negative}) & \texttt{python -m eval.sc\_rerank --model scrubdata-ft --n 16} \\
Scaling arm (Table~\ref{tab:scaling}) & \texttt{python -m eval.scaling\_arm} \\
\bottomrule
\end{tabular}
\end{center}
\begin{thebibliography}{20}
\bibitem{raha} M.~Mahdavi, Z.~Abedjan, R.~Castro Fernandez, S.~Madden, M.~Ouzzani,
M.~Stonebraker, N.~Tang. Raha: A Configuration-Free Error Detection System. SIGMOD
2019; M.~Mahdavi, Z.~Abedjan. Baran: Effective Error Correction via a Unified Context
Representation and Transfer Learning. PVLDB 13(11):1948--1961, 2020.
\bibitem{holoclean} T.~Rekatsinas, X.~Chu, I.~F.~Ilyas, C.~R\'e. HoloClean: Holistic
Data Repairs with Probabilistic Inference. PVLDB 10(11), 2017. arXiv:1702.00820.
\bibitem{garf} J.~Peng, D.~Shen, N.~Tang, T.~Liu, Y.~Kou, T.~Nie, H.~Cui, G.~Yu.
Self-Supervised and Interpretable Data Cleaning with Sequence Generative Adversarial
Networks (GARF). PVLDB 16(3):433--446, 2022.
\bibitem{wrangle} A.~Narayan, I.~Chami, L.~Orr, S.~Arora, C.~R\'e. Can Foundation
Models Wrangle Your Data? PVLDB 16(4):738--746, 2022. arXiv:2205.09911.
\bibitem{jellyfish} H.~Zhang, Y.~Dong, C.~Xiao, M.~Oyamada. Jellyfish:
Instruction-Tuning Local Large Language Models for Data Preprocessing. EMNLP 2024.
arXiv:2312.01678.
\bibitem{cocoon} S.~Zhang, Z.~Huang, E.~Wu. Data Cleaning Using Large Language Models
(Cocoon). arXiv:2410.15547, 2024 (preprint; no published reproduction).
\bibitem{zeroed} W.~Ni, K.~Zhang, X.~Miao, X.~Zhao, Y.~Wu, Y.~Wang, J.~Yin. ZeroED:
Hybrid Zero-Shot Error Detection Through Large Language Model Reasoning. ICDE 2025.
arXiv:2504.05345.
\bibitem{forested} M.~Wang, J.~Wang, Q.~Liu, X.~Xu, Z.~Xing, L.~Zhu, W.~Zhang.
Ensembling LLM-Induced Decision Trees for Explainable and Robust Error Detection.
arXiv:2512.07246, 2025 (preprint).
\bibitem{autotest} Q.~Chen, Y.~He, R.~C.-W.~Wong, W.~Cui, S.~Ge, H.~Zhang, D.~Zhang,
S.~Chaudhuri. Auto-Test: Learning Semantic-Domain Constraints for Unsupervised Error
Detection in Tables. SIGMOD 2025. arXiv:2504.10762.
\bibitem{gidcl} M.~Yan, Y.~Wang, Y.~Wang, X.~Miao, J.~Li. GIDCL: A Graph-Enhanced
Interpretable Data Cleaning Framework with Large Language Models. Proc.\ ACM Manag.\
Data 2(6), Article 236, 2024 (SIGMOD).
\bibitem{spreadsheetrl} B.~Chi, Y.~Xie, M.~Wu, J.~Yang, J.~Jiang, Z.~Li, et al.
Spreadsheet-RL: Advancing Large Language Model Agents on Realistic Spreadsheet Tasks
via Reinforcement Learning. arXiv:2605.22642, 2026.
\bibitem{distort} A.~Dutta, H.~Nigam, H.~Hasanbeig, A.~Radhakrishna, S.~Gulwani.
An Empirical Investigation of Robustness in Large Language Models under Tabular
Distortions. arXiv:2601.05009, 2026.
\bibitem{debate} C.~Parmar, A.~Mehta, H.~Wu, J.~Ramamurthy, S.~Medhekar. When Helping
Hurts and How to Fix It: Multi-Agent Debate for Data Cleaning. arXiv:2606.02866, 2026.
\bibitem{tabler1} Z.~Yang, L.~Chen, A.~Cohan, Y.~Zhao. Table-R1: Inference-Time
Scaling for Table Reasoning. EMNLP 2025. arXiv:2505.23621.
\bibitem{spurious} R.~Shao, S.~S.~Li, R.~Xin, S.~Geng, Y.~Wang, et al. Spurious
Rewards: Rethinking Training Signals in RLVR. arXiv:2506.10947, 2025.
\bibitem{tablegpt} P.~Li, Y.~He, D.~Yashar, W.~Cui, S.~Ge, H.~Zhang, D.~Rifinski
Fainman, D.~Zhang, S.~Chaudhuri. Table-GPT: Table Fine-tuned GPT for Diverse Table
Tasks. Proc.\ ACM Manag.\ Data 2(3), Article 176, 2024 (SIGMOD). arXiv:2310.09263.
\bibitem{retclean} Z.~A.~Naeem, M.~S.~Ahmad, M.~Eltabakh, M.~Ouzzani, N.~Tang.
RetClean: Retrieval-Based Data Cleaning Using LLMs and Data Lakes. PVLDB 17(12), 2024
(demo). arXiv:2303.16909.
\bibitem{turl} X.~Deng, H.~Sun, A.~Lees, Y.~Wu, C.~Yu. TURL: Table Understanding
through Representation Learning. PVLDB 14(3):307--319, 2021. arXiv:2006.14806.
\bibitem{tablellama} T.~Zhang, X.~Yue, Y.~Li, H.~Sun. TableLlama: Towards Open Large
Generalist Models for Tables. NAACL 2024. arXiv:2311.09206.
\bibitem{belotti} F.~Belotti, F.~Dadda, M.~Cremaschi, R.~Avogadro, M.~Palmonari.
Evaluating LLMs on Entity Disambiguation in Tables. arXiv:2408.06423, 2024 (preprint).
\bibitem{racoon} L.~L.~Wei, G.~Xiao, M.~Balazinska. RACOON: An LLM-based Framework for
Retrieval-Augmented Column Type Annotation with a Knowledge Graph. arXiv:2409.14556,
2024 (preprint).
\bibitem{mtab} P.~Nguyen, N.~Kertkeidkachorn, R.~Ichise, H.~Takeda. MTab: Matching
Tabular Data to Knowledge Graph using Probability Models. SemTab/ISWC 2019.
arXiv:1910.00246.
\bibitem{selective} R.~El-Yaniv, Y.~Wiener. On the Foundations of Noise-free Selective
Classification. JMLR 11:1605--1641, 2010; Y.~Geifman, R.~El-Yaniv. Selective
Classification for Deep Neural Networks. NeurIPS 2017.
\bibitem{openmed} M.~Panahi. OpenMed NER: Open-Source, Domain-Adapted State-of-the-Art
Transformers for Biomedical NER Across 12 Public Datasets. arXiv:2508.01630, 2025
(preprint).
\bibitem{lakens} D.~Lakens. Equivalence Tests: A Practical Primer for t Tests,
Correlations, and Meta-Analyses. Social Psychological and Personality Science
8(4):355--362, 2017.
\bibitem{grouse} S.~Muller, A.~Loison, B.~Omrani, G.~Viaud. GroUSE: A Benchmark
to Evaluate Evaluators in Grounded Question Answering. COLING 2025.
arXiv:2409.06595.
\end{thebibliography}
\end{document}