Spaces:

build-small-hackathon
/

scrubdata

Running

OpenAI Codex

deploy: add sponsor:openai tag (Best Use of Codex) + Codex-hardened build

16dc556 15 days ago

66.7 kB

	\documentclass[11pt]{article}
	\usepackage[utf8]{inputenc} % no-op on TeXLive >= 2018 (arXiv pdflatex); explicit for safety
	\usepackage[margin=1in]{geometry}
	\usepackage{amsmath,amssymb}
	\usepackage{booktabs}
	\usepackage{graphicx}
	\usepackage{url}
	\usepackage{xcolor}
	\usepackage[numbers]{natbib}
	\input{numbers}

	\title{Verified Cleaning Plans: Plan-Level Selective Prediction Turns Local LLM
	Planners into Trustworthy Table Cleaners}
	\author{Ricardo Alanis\\ \small{\texttt{ricardo.alanis@gmail.com}}}
	\date{June 2026}

	\begin{document}
	\maketitle

	\begin{abstract}
	Cleaning messy tabular data---particularly \emph{canonicalization}, the merging of
	inconsistent surface forms such as \texttt{USA}/\texttt{U.S.A}/\texttt{united states}
	into one canonical value---resists rule-based automation and is routinely done by hand.
	We present ScrubData, an architecture built around a trust contract. A local LLM
	\emph{planner} reads an aggregated column profile (per-value frequency counts,
	invariant to row count) and \emph{proposes} a structured JSON cleaning plan; a
	deterministic executor \emph{applies} it, making every change auditable and reversible;
	and \emph{plan-level selective prediction} --- a deterministic verifier that scores
	every proposed mapping and drops low-confidence entries to review flags --- extends
	abstention from cell-level confidence to the plan itself. The verified union of the
	gated model plan with a reference-grounded heuristic is the architecture's operating
	point: a zero-configuration, zero-label system that repairs 41\% of the hospital
	benchmark's 509 real errors at \unionGatePrec{} precision (strongest of three training
	seeds; 3-seed mean \unionGateThreeSeedPrec{} at \unionGateThreeSeedCov{} coverage,
	$\pm$ = 95\% CI),
	with every declined merge surfaced for review. Four deterministic capabilities ---
	profile-level \emph{suspect surfacing} for high-cardinality columns, reconciliation
	against a pluggable \emph{entity reference} built from open vocabularies,
	\emph{cross-row majority voting} over repeated-entity groups, and
	\emph{convention-conservatism} gates --- carry the system to never-seen tables:
	macro F1 \unseenMacroF{} at \unseenMacroDamage{} damage over the 35 unseen-source
	pairs of a \nPairs-pair benchmark, and \emph{zero silent edits} across \nWild{} wild
	tables plus a \nTrust-table trust audit, released together as the \textsc{WildClean}
	benchmark.

	Finally, we report where the capability lives. \emph{Execution-verified} synthetic
	supervision --- a training example is kept only if executing its plan provably
	recovers the known-clean table --- buys the 4B fine-tune real in-distribution skill
	and the most precise gated planner at usable coverage (\modelGatePrec{} precision at
	\modelGateCov{} coverage); but five further retrains and a three-arm GRPO pilot leave
	held-out generalization statistically bounded (TOST against a pre-registered margin),
	while two of three zero-shot 24--31B open-weights planners (devstral-24B, gemma4-31B)
	dropped into the \emph{identical} harness exceed the fine-tune's operating point
	(\scalePrecBig{} precision at \scaleCovBig{} coverage) with no task training. The
	architecture is planner-agnostic: it converts capability gains into trustworthy
	operating points without retraining. The shipped system runs entirely locally on commodity hardware;
	no data leaves the machine (the scaling-arm planners were measured via hosted
	endpoints; all are locally deployable open weights).
	\end{abstract}

	\section{Introduction}

	A large share of practical data work is cleaning: a sales export where the same country
	is spelled four ways, a hospital roster where \texttt{birminghxm} should be
	\texttt{birmingham}, a CRM dump with mixed date formats and duplicated contacts. The
	fuzzy half of this work---recognizing that distinct surface forms denote the same
	entity---is exactly what rules do poorly and humans do slowly.

	Large language models can do this fuzzy matching, but deploying them as cell editors has
	three problems. First, \emph{trust}: a model that edits cells directly can silently
	corrupt data, and its errors are unauditable. Second, \emph{cost and privacy}: shipping
	every row of a private table to a hosted frontier model is expensive and often
	unacceptable. Third, \emph{hallucination}: asked for a canonical form, a generative model
	will invent one, and on tail entities it will invent wrong ones.

	ScrubData addresses all three with an architecture in which the model never touches
	data. A profiler aggregates each column into a value-frequency distribution; a small
	local model reads the profile and \emph{proposes} a JSON cleaning plan; a deterministic
	pandas executor \emph{applies} it. The plan is the complete, inspectable, reversible
	specification of every change---there are no silent edits by construction
	(\S\ref{sec:method}). Because the prompt scales with the number of \emph{distinct}
	values rather than rows, a million-row table profiles like a hundred-row one.

	This paper makes five contributions:
	\begin{enumerate}
	\item \textbf{A planner/executor decomposition with plan-level selective prediction}:
	the model proposes, a deterministic engine executes with full lineage, and a
	deterministic verifier gates every proposed mapping, extending abstention to the plan
	itself. The verified union of the gated model plan with a reference-grounded
	heuristic repairs 41\% of hospital's \hospErrors{} real errors at \unionGatePrec{}
	precision with zero configuration and zero labels (\S\ref{sec:method},
	\S\ref{sec:verifier}, \S\ref{sec:ws1results}).
	\item \textbf{\textsc{WildClean} and an un-gameable evaluation}: a 65-dataset suite
	(real-error benchmarks plus seeded error injection over 15 harvested open-data
	domains) scored with a churn-neutral, convention-tolerant metric that cannot be
	inflated by mass rewriting, with damage and silent edits scored alongside repair F1,
	degenerate baselines pinning the metric's floor and ceiling, and the scorer itself
	validated against 30 adversarial known-by-construction cases (\S\ref{sec:eval},
	\S\ref{sec:degenerate}).
	\item \textbf{Four deterministic capabilities that carry never-seen-table
	generalization}: bounded suspect surfacing for high-cardinality columns, generic
	entity-reference reconciliation with an exact-hit typing floor, cross-row majority
	voting with a false-consensus guard, and convention-conservatism gates --- each
	motivated by a measured failure regime and gated by the verifier
	(\S\ref{sec:capabilities}, \S\ref{sec:wild}).
	\item \textbf{Execution-verified synthetic supervision}, the training method behind
	the 4B planner instantiation: every training example is validated by running the
	executor on the (dirty table, plan) pair and checking that the known-clean table is
	recovered; non-recovering examples are discarded (\S\ref{sec:sft}).
	\item \textbf{A unified finding on where capability lives in this architecture}: five
	further supervised fine-tunes and a three-arm GRPO pilot with the executor as a
	verifiable reward leave held-out generalization statistically bounded (TOST), while
	two of three zero-shot 24--31B planners dropped into the same harness exceed the
	fine-tune's operating point --- deterministic machinery plus plan-level verification carry the
	generalization that exists, and raw planner capability, not task fine-tuning, scales
	it (\S\ref{sec:negative}, \S\ref{sec:scaling}).
	\end{enumerate}

	We deliberately report a negative-flavored finding alongside the positive ones: on
	\emph{injected} typos, classical frequency clustering remains a strong baseline---by
	construction, injection places the canonical form in the column, which is clustering's
	ideal regime. The advantage of grounding is concentrated where it matters: real errors,
	tail entities absent from the column, and adversarial near-misses where acting at all is
	wrong (\S\ref{sec:results}).

	\section{Related Work}
	\label{sec:related}

	\textbf{Error detection and repair.} Raha and Baran~\cite{raha} established
	configuration-free error detection and correction benchmarks (hospital, beers, flights,
	rayyan), which we adopt as out-of-distribution evaluation. HoloClean~\cite{holoclean}
	combines integrity constraints, external reference data, and statistics in probabilistic
	repair, demonstrating that external signals can veto statistically plausible but wrong
	repairs---an insight our reference-veto inherits. GARF~\cite{garf} learns repair rules
	self-supervised from the data itself; it also demonstrates the structural limit we
	observe for frequency-only methods: a lone categorical column offers no co-occurring
	signal to vote against an error.

	\textbf{The 2025--26 landscape.} Post-Cocoon work concentrates on zero-label
	\emph{detection}: ZeroED~\cite{zeroed} (cloud-LLM cluster labeling, hospital
	detection F1 0.81, collapsing to 0.27 on smaller models), ForestED~\cite{forested}
	(LLM-induced decision trees, 0.756), and Auto-Test~\cite{autotest} (corpus-mined
	semantic-domain constraints, no LLM at inference) --- none performs zero-label
	\emph{repair}. GIDCL~\cite{gidcl} sets the labeled-class repair ceiling
	(hospital \gidclHosp{} with 20 labels and a LoRA trained per cleaned table);
	Cocoon~\cite{cocoon} remains an unreproduced preprint (15 citing papers, none a
	reproduction). Two concurrent results corroborate facets of this paper's central
	negative finding that machinery, not weights, carries cleaning generalization: a
	study showing even frontier models cannot correct table distortions without
	explicit priors~\cite{distort}, and a large multi-agent-debate evaluation in which
	LLM self-critique \emph{degrades} repair and only an adversarially separate,
	execution-grounded critic helps~\cite{debate} --- the architecture our verifier
	instantiates. Spreadsheet-RL~\cite{spreadsheetrl} reports the complementary
	positive case: with full-scale RL infrastructure and execution-verified rewards,
	a 4B model's spreadsheet-manipulation skill \emph{does} move (12.0\%
	$\rightarrow$ 23.4\%) --- consistent with our reading that the gap between our
	\$30 pilot and such results is infrastructure scale, a boundary we state rather
	than blur (\S\ref{sec:negative}).

	\textbf{LLMs for data wrangling.} Narayan et al.~\cite{wrangle} showed frontier
	foundation models handle entity matching and imputation few-shot;
	Jellyfish~\cite{jellyfish} and Table-GPT~\cite{tablegpt} fine-tune mid-size models for
	data tasks. RetClean~\cite{retclean} is closest in spirit: retrieval from data lakes
	grounds cell repair, with the key empirical split that parametric knowledge suffices on
	world-known head values but collapses on the tail---motivating retrieval. Our work
	differs in the planner/executor decomposition (the model emits no cell values, only
	plans), in execution-verified supervision, and in the calibrated-abstention contract.

	\textbf{Entity linking over tables.} TURL~\cite{turl} and TableLlama~\cite{tablellama}
	inject candidate entities into table understanding; Belotti et al.~\cite{belotti}
	show retriever coverage is the accuracy ceiling for table entity disambiguation and that
	long candidate lists hurt smaller models. RACOON~\cite{racoon} shows inference-time KG
	retrieval lifts a frozen model substantially, supporting our choice to ground at
	inference rather than bake aliases into weights (TURL's out-of-domain collapse is the
	cautionary result). MTab~\cite{mtab} established type-constrained matching with
	abstention in semantic table annotation.

	\textbf{Clustering-based cleaning tools.} The de-facto practitioner baseline is
	OpenRefine: key-collision (fingerprint) clustering plus a nearest-neighbour mode; we
	reimplement both faithfully, including blocking, and compare head-to-head.

	\textbf{Selective prediction.} Risk--coverage analysis and calibration
	metrics~\cite{selective} formalize ``knowing when not to act''; to our knowledge their
	application to data-cleaning merge decisions is new.

	\textbf{Small specialized models.} OpenMed~\cite{openmed} fine-tunes sub-500M encoders
	to state-of-the-art biomedical NER, the sister result to our thesis that small
	specialized models beat large generic ones on narrow structured tasks; we adopt their
	released PII token classifiers for column typing (\S\ref{sec:pii}).

	\section{Method}
	\label{sec:method}

	\subsection{Planner / executor decomposition}
	A \emph{profiler} reduces each column to a typed summary: detected semantic type, missing
	counts, issue flags, and a value--frequency distribution capped at 80 distinct values
	(high-cardinality columns are summarized by their head). The \emph{planner}---either a
	deterministic heuristic or our fine-tuned 4B model---maps the profile (plus three sample
	rows) to a JSON plan: a list of per-column operations drawn from a closed vocabulary
	(\texttt{canonicalize\_categories} with an explicit mapping, \texttt{parse\_date},
	\texttt{standardize\_phone}, \texttt{mask\_pii}, \ldots), table operations, and review
	flags. The \emph{executor} applies the plan with pure pandas transforms. The plan is the
	only channel through which data changes: every diff is attributable to a named operation
	with a rationale, the original table is never mutated, and abstentions are first-class
	plan objects. We export per-run decision summaries as OpenTelemetry GenAI spans.

	\subsection{Execution-verified synthetic supervision}
	\label{sec:sft}
	Training pairs are generated by corrupting clean synthetic tables with realistic noise
	(casing, aliases, single-character typos with Zipf-distributed long-tail categorical
	columns of 30--80 distinct values) while recording the ground-truth plan. The defining
	step is \emph{verification by execution}: a candidate example is kept only if
	$\textsc{Execute}(\text{dirty}, \text{plan}) = \text{clean}$ cell-for-cell. This closes
	the loop between supervision and semantics---a plan that would not actually clean the
	table can never become a training label. We augment with real supervision derived from
	paired dirty/clean benchmarks by aligning cells and keeping only \emph{learnable}
	canonicalizations (a surface form that is a string variant of its target and never a
	legitimate value elsewhere), which excludes unlearnable per-cell corrections such as
	divergent flight times. The fine-tune is QLoRA (rank 32) over Qwen3-4B-Instruct in
	bf16; one practical finding is that the base model's tool-calling prior dominates
	free-running generation even after convergent fine-tuning (loss 0.16) and must be
	suppressed at decode time by banning the two tool-call tokens.

	\subsection{Reference-grounded canonicalization with abstention}
	\label{sec:grounding}
	For columns whose values reconcile to a known concept type (countries, administrative
	regions, cities), canonical forms are never generated: a fuzzy retriever (normalized
	edit similarity with first-character blocking and length prefilters) matches each
	distinct value against the type-scoped reference (ISO/pycountry; GeoNames cities500,
	196k entries). A value maps to a canonical only if (i) similarity clears a threshold
	$\tau{=}0.84$, (ii) the best--second-best margin clears $0.03$ (ambiguity veto: a value
	equally close to \texttt{Box} and \texttt{Boaz} abstains), and (iii) the canonical is
	cast to the column's observed case convention. Near-misses ($0.70{\le}s{<}\tau$) are
	surfaced as review flags. The same wrapper grounds the \emph{model} planner: for
	reference-typed columns the model's free-generated mapping is replaced by the grounded
	one, so the model can add coverage but never invent a canonical for a grounded type.

	\subsection{Plan-level selective prediction: the verified union planner}
	\label{sec:verifier}
	Grounding constrains reference-typed columns, but the planner's \emph{free}
	canonicalization mappings on non-grounded columns remain unguarded---and they are where
	real-data precision dies (the fine-tune's raw hospital plan: \hospModelPrecVSix{}
	precision at \hospModelRecallVSix{} recall). Rather than retrain, we extend abstention
	to the plan itself. A deterministic \emph{verifier} scores every proposed mapping entry
	$raw{\to}canon$ with contract-preserving evidence (no cell values emitted, no gold
	access): three hard gates distilled from the model's measured failure classes---a value
	occurring ${\ge}3$ times is data, not a typo (\emph{errors are rare}); the target must
	be a frequent column value clearly dominating the source (no mapping one typo onto
	another); digit-bearing codes repair only when the letter part is near-identical---then
	a confidence combining edit similarity with frequency support. Entries below a
	threshold $\tau$ are dropped to review flags; abstention stays first-class. Sweeping
	$\tau$ yields a plan-level precision--coverage curve. The shipped composition,
	the \emph{verified union planner}, is the verifier-gated model plan ($\tau{=}0.5$)
	unioned with the grounded heuristic's mappings (the model wins per surface form);
	the same code path is the product default.

	\subsection{Visibility and consensus: four deterministic capabilities}
	\label{sec:capabilities}
	Four further mechanisms, each motivated by a measured failure regime on never-seen
	tables, complete the deterministic machinery. \textbf{(a) Suspect surfacing.} The
	profile's value-frequency view is capped, so high-cardinality columns hide their
	dirty cells from any planner. Every column profile now carries a bounded
	\texttt{suspect\_values} section: rare anomalous surfaces with evidence-backed
	repair candidates (frequency dominance, edit similarity, reference membership).
	The heuristic planner repairs from suspects under a strict verifier bar
	($\tau_{hc}{=}0.8$) and flags the rest. \textbf{(b) Generic entity reference.}
	Open vocabularies (SemTab ToughTables aliases --- derived excluding our benchmark
	tables; MusicBrainz search-hint misspellings; RxNorm; Wikidata; ROR) register as a
	pluggable reference type. Because the reference is broad, entity-typing a column
	additionally requires that ${\ge}20\%$ of its distinct values match the reference
	\emph{exactly} --- fuzzy coverage alone over-fires on name-like columns (measured).
	This resolves the regime where every surface in a column is unique (no in-column
	frequency signal exists at all): five such benchmark tables go from 0.0 to
	\ttFOne{} F1 at \emph{zero} damage. \textbf{(c) Cross-row majority voting.} Tables
	that repeat a real-world entity across rows (a flight reported by many sources)
	carry their own repair signal. A detection step finds compact-token key columns
	with small groups (median multiplicity 3--30) and columns whose groups show
	\emph{majority-bearing} disagreement with per-group information; a table-level
	operation then resolves thin dissenting minorities to the group majority. A
	\emph{false-consensus} guard declines when minority shares look like legitimate
	correlated updates rather than reporting errors (mean minority share ${\ge}0.25$)
	--- a flat volume cap was measured to destroy the legitimate dense-disagreement
	regime and replaced. \textbf{(d) Convention conservatism.} The planner never
	re-formats an internally consistent column: date and percent ops are gated on
	dominant-shape inconsistency (digit and alpha runs collapsed; 90\% rule),
	ZIP/postal-named columns are never typed as phones or dates, and Excel-serial
	date typing requires a date-suggestive column name. Suppressed minority values
	surface as review flags --- abstention is visible, never silent. The verifier
	enforces the same gates on model-emitted plans at the verification boundary.

	\subsection{PII as a second task instance}
	\label{sec:pii}
	The identical contract covers PII: a deterministic tier types columns by checksum and
	pattern validators (Luhn, IBAN mod-97, SSN/email/phone) over distinct values; an
	optional 44M OpenMed-PII token classifier~\cite{openmed} extends coverage to names and
	addresses, gated by a sensitive-type allowlist and a column-level coverage vote; and
	masking, salted hashing, and join-stable pseudonymization are deterministic executor
	operations. Measured briefly: the classifier, though trained on sentence-level
	clinical text, transfers to bare cell values --- \piiNameBare{} detection on
	person-name cells and \piiAddrBare{} on address cells ($n{=}40$ sampled cells each);
	the validator tier, evaluated out-of-distribution on per-type columns from the Gretel
	PII test split, types 5/5 covered PII types correctly with 0/7 false positives on
	negative columns drawn from real open data; and after deterministic masking,
	re-running all validators over the output finds \piiLeakRate{} residual PII ---
	residual PII \emph{detectable by our validators}, a circularity we note explicitly:
	the leak test can only see what the validator tier sees.

	\section{Evaluation Design}
	\label{sec:eval}
	\textbf{Suite.} Five real-error benchmarks (Raha) plus seeded error injection
	(typo/OCR/case/whitespace) over 15 harvested open-data domains (NYC, Chicago, SF, LA,
	Seattle, Texas, WA portals; GitHub) $\approx$ 65 datasets per seed. We aggregate as a
	\emph{double macro}---mean over error types of mean over datasets, harmonically combined
	with the domain macro---so no single table or error type dominates:
	\begin{equation*}
	\textsc{north} \;=\; \operatorname{HM}\Biggl(
	\underbrace{\frac{1}{\|T\|}\sum_{t \in T}\frac{1}{\|D_t\|}\sum_{d \in D_t} F_1(d)}_{\text{error-type macro}},\;
	\underbrace{\frac{1}{\|G\|}\sum_{g \in G}\frac{1}{\|D_g\|}\sum_{d \in D_g} F_1(d)}_{\text{domain macro}}
	\Biggr),
	\end{equation*}
	where $T$ is the set of error types, $G$ the set of data domains, $D_t$ (resp.\ $D_g$)
	the datasets carrying error type $t$ (domain $g$), and $\operatorname{HM}$ the harmonic
	mean.

	\textbf{Churn-neutral metric.} A cell change that is case/whitespace-equivalent to the
	input but does not restore the gold counts as nothing: not a fix, not a change, not
	damage. Without this, mass case-rewriting inflates precision (we observed $+0.12$
	NORTH from \emph{removing} case matching before the correction); with it, fixing a
	case-injected error requires actually acting. We additionally report
	\emph{damage}---the rate of semantically corrupting clean cells---and an adversarial
	\emph{abstain slice} whose traps are garbage strings (not single-edit variants of any
	reference entity; an earlier trap set mis-scored grounding for correctly mapping
	\texttt{Boazz}$\to$\texttt{Boaz}). We report both repairs of these metric artifacts as
	evidence that gameability must be tested, not assumed.

	\textbf{Real vs.\ injected.} Injected typos are in-distribution for frequency
	clustering by construction (the canonical is present and dominant in the column), so we
	report the real-error and injected slices separately. A TableEG-style audit
	quantifies the gap (\texttt{eval/inject\_validity.py}): the injector covers three
	of nine error classes (Jensen--Shannon divergence 0.526 bits from the pooled real
	distribution over 163{,}607 real errors), and injected-only evaluation would
	invert the fingerprint-clustering ranking --- exactly the overstatement the
	separate-slice reporting prevents.

	\textbf{Scorer validation.} Following GroUSE-style evaluator
	testing~\cite{grouse}, the scorer itself is validated against 30 adversarial
	known-by-construction cases: a no-op plan must score 0 fixes and 0 damage, an
	oracle plan exactly 1.0, vandalizing $k$ of $m$ clean cells must score damage
	$k/m$ at precision 0, pure churn (case/whitespace rewrites that do not restore
	gold) must count as nothing although a naive scorer would count it, fixes must
	require actually acting, and silent edits must trip the audit. All 30 pass
	against the shipped scorer unmodified. We additionally cross-score every system
	under the \emph{original} Raha/Baran cell-repair protocol side by side with ours
	(\texttt{eval/cross\_scoring.py}): rankings agree at Kendall $\tau_b{=}1.0$ on
	three of five datasets, and the disagreements cut both ways --- raw string
	equality denies credit for numerically-correct serialization restorations (our
	movies\_1 repairs), while churn-neutrality charges Baran for load-time
	normalizer rewrites its own protocol hides (hospital precision
	$0.908\!\to\!0.783$). Neither metric family flatters us uniformly, and our Baran
	reproduction calibrates against its published Table~3 within $+0.02$ on three of
	the four shared datasets.

	\textbf{Contamination.} The Raha-suite benchmarks have been public on GitHub since
	2019 and sit inside every modern base model's training window; we treat them as
	potentially contaminated and split our claims accordingly. A verbatim-completion
	probe makes the concern concrete: prompted with five fields of a gold hospital row,
	a frontier-class model reproduces \textbf{25\%} of the held-out cells exactly
	(30/120 cells over 30 rows, exact-substring match), versus \textbf{0\%} (0/120) on a
	date-stamped post-training-cutoff wild harvest under the identical protocol
	(\texttt{eval/contamination\_probe.py}). The rate is an upper bound on memorization
	--- some completions are guessable from the given fields --- but it is not zero, so
	results on legacy-public benchmarks (including the all-hospital
	Table~\ref{tab:scaling}, whose zero-shot planners may partially benefit from
	memorized gold) carry this caveat, while the architecture's trust claims
	(zero silent edits, damage accounting, abstention) rest on the date-stamped wild
	and GitTables slices, where the probe finds nothing to complete.

	\section{Results}
	\label{sec:results}

	\subsection{Plan-level selective prediction on real errors}
	\label{sec:ws1results}
	On hospital's \hospErrors{} real errors, the verifier transforms the fine-tune from
	unshippable to precise (Figure~\ref{fig:pc}): the raw model plan repairs
	\hospModelRecallVSix{} of errors at \hospModelPrecVSix{} precision; gated at $\tau{=}0.5$ it
	reaches \modelGatePrec{} precision at \modelGateCov{} coverage (146 of 147 committed
	changes correct). The union with the grounded heuristic buys coverage back:
	\textbf{\unionGatePrec{} precision at \unionGateCov{} coverage} (\unionChanged{}
	changes, \unionFixed{} correct). This turns the system's promise into a measured
	sentence: \emph{zero-configuration and zero labels, repair 41\% of real errors at
	${\ge}0.90$ precision, with every declined merge surfaced for review}. For context,
	Baran given oracle error positions and 20 gold-labeled tuples per dataset reaches
	\realFBaran{} F1 on the same slice (\S\ref{sec:ws4})---selective prediction does not
	close a supervised gap, but it makes the zero-label operating point trustworthy, which
	is the regime our user occupies. Precision is flat ($0.89$--$0.91$) for
	$\tau\in[0.2,0.8]$, so the operating point is not threshold-brittle, and the result is
	seed-robust: across three training seeds of the same data recipe the union operating
	point is \unionGateThreeSeedPrec{} precision at \unionGateThreeSeedCov{} coverage
	(the shipped adapter is the strongest seed), with every seed clearing the
	$0.70$-precision/$0.30$-coverage bar decisively. All 3-seed intervals in this paper
	are normal-approximation 95\% CIs ($1.96\,\sigma/\sqrt{3}$); the $t$-based
	interval at $n{=}3$ is ${\sim}2.7\times$ wider ($\pm 0.031$ here) and every
	qualitative claim survives it --- the weakest seed alone clears the bar.

	\textbf{Candidate-constrained planning (negative result).} We also tested constraining
	the planner's \emph{inputs}: the profiler emits evidence-backed (variant$\,\to\,$
	canonical) candidate pairs (frequency dominance, edit similarity, reference membership)
	and the model may only select among them, with a deterministic check dropping
	off-candidate mappings to review flags. As a standalone guard it is strong---the raw
	plan's precision rises from \hospModelPrecVSix{} to \pairsRawPrec{} with no verifier at
	all---but composed with the verifier and union it reaches \pairsUnionPrec{} precision
	at \pairsUnionCov{} coverage, slightly \emph{below} the unconstrained pipeline at the
	same $\tau$: the candidate cap (top-3 per surface) removes some correct repairs the
	verifier would have kept, and the two mechanisms gate the same failure class. We ship
	the verifier and keep candidate constraining available but off by default, reporting
	this as a measured redundancy rather than a stacked win.

	\begin{figure}[t]
	\centering
	\includegraphics[width=0.62\linewidth]{fig_precision_coverage}
	\caption{Plan-level precision--coverage on hospital (509 real errors), sweeping the
	verifier threshold $\tau$. The union planner dominates the raw model plan; the shipped
	operating point ($\tau{=}0.5$) is annotated.}
	\label{fig:pc}
	\end{figure}

	\subsection{The 4B fine-tune as one planner instantiation}
	On frozen synthetic gold, the fine-tuned 4B planner reaches canonicalization micro-F1
	\canonFMultiSeed{} --- versus \canonFBig{} for a much larger zero-shot generalist
	prompted identically and \canonFHeur{} for the rule heuristic (best single run
	\canonFOursBest; operation-F1 \opFOurs, JSON validity \jsonValidOurs). On real hospital
	typos the synthetic-only fine-tune scores 0.000 repair recall; adding 20\%
	real-derived supervision lifts it to \hospModelRecall, and a data-scaling iteration
	(tripling the real-derived share from three paired benchmarks) reaches
	\hospModelRecallVSix{} recall at \hospModelPrecVSix{} precision---approaching the
	\frontierZeroShotRecall{} of a frontier-scale zero-shot model. The scaling gain is seed-robust: $+0.09$
	canonicalization F1 over the base mix under identical protocol, with non-overlapping
	3-seed confidence intervals. Real, execution-verified pairs are what transfer:
	the same iteration found frequency-derived and algorithm-cleaned labels both
	\emph{reduce} quality, consistent with our grounding thesis.

	\subsection{Grounding vs.\ clustering}
	With the errors-are-rare frequency gates now in both paths, grounding and frequency
	clustering are comparable on hospital alone (repairs-only, churn-neutral:
	\hospPrecGrounded{} precision at \hospRecallGrounded{} recall grounded vs
	\hospPrecFreq{} at \hospRecallFreq{} clustering---hospital's dominant errors are
	in-column typos, clustering's best case). Grounding's margin appears where references
	matter: across the five-benchmark real-error macro it reaches \ablFullRealF{} versus
	\ablNoGroundRealF{} for the frequency-clustering ablation ($+29\%$), and it carries
	the behavioral guarantees below.
	On the full suite against OpenRefine (Table~\ref{tab:money}), the result splits
	cleanly by regime, and we report both. On the \emph{real-error} slice---the regime the
	tool exists for---grounded cleaning reaches REAL-F1 \realFGrounded{}, $3.9\times$
	OpenRefine kNN (\realFORKnn) and $5.7\times$ fingerprint (\realFORFp), with seed CIs of
	$\pm$\northGroundedCI. Provenance: the grounded and OpenRefine rows of
	Table~\ref{tab:money} are regenerated at the current system head (2026-06-12,
	post-capability, scorer fix in); the dagger rows keep their original capture
	provenance. The June-10 freeze system measured REAL-F1 \realFGroundedFreeze{} on the
	same protocol --- the $+0.05$ difference is the measured contribution of the four
	deterministic capabilities (\S\ref{sec:capabilities}) on the real-error slice. On the \emph{injected} slice, fingerprint clustering wins
	(\injFORFp{} vs \injFGrounded) at near-zero damage: our case/whitespace injectors are
	exactly the perturbations key-collision normalizes away, so this is its home game and
	we say so. kNN clustering---the method that, like us, attempts typo repair---loses on
	both slices while incurring the highest damage among baselines (\damageORKnn), the
	no-reference over-merging failure the grounding was built to prevent. The shipped
	verified-union system's suite row (REAL-F1 \modelRealF, damage \modelDamage) shows the
	grounding wrapper and heuristic union carry entity canonicalization on these datasets ---
	the model's contribution concentrates on the synthetic regime and hospital repair
	(\S\ref{sec:ws1results}), and the verifier cuts its suite damage to \modelDamage,
	$6\times$ below the grounded heuristic's \damageGrounded{} (HEAD damage vs the union
	row's freeze-time capture --- a disclosed basis mix). Within our own ablations
	(June-10 freeze basis throughout), removing grounding cedes $22\%$ of real-error
	F1 (\ablNoGroundRealF{} vs \ablFullRealF) and forfeits the behavioral guarantees:
	perfect abstention on adversarial traps (\ablFullAbstain) versus
	\ablNoAbstainAbstain{} without abstention, and reference-vetoed wrong merges (e.g.\
	\texttt{guntxrsvillx}$\to$\texttt{huntsville}).

	\begin{table}[t]
	\centering
	\caption{Wide-suite comparison, 3 injection seeds, churn-neutral metric. NORTH is the
	double-macro harmonic mean; REAL-F1 is the real-error slice. Regenerated at the
	current system head (2026-06-12); the June-10 freeze system measured
	\realFGroundedFreeze{} REAL-F1 / \northGroundedFreeze{} NORTH on the same protocol.}
	\label{tab:money}
	\begin{tabular}{lcccccc}
	\toprule
	System & NORTH & $\pm$95\%CI & REAL-F1 & INJ-F1 & damage & abstain \\
	\midrule
	Grounded (ours) & \northGrounded & \northGroundedCI & \textbf{\realFGrounded} & \injFGrounded & \damageGrounded & \ablFullAbstain \\
	OpenRefine fingerprint & \northORFp & 0.000 & \realFORFp & \injFORFp & \damageORFp & 1.000 \\
	OpenRefine kNN & \northORKnn & 0.002 & \realFORKnn & 0.148 & \damageORKnn & 1.000 \\
	Verified union 4B (shipped)$^{\dagger}$ & -- & -- & \modelRealF & -- & \modelDamage & \modelAbstain \\
	\midrule
	Baran (oracle det.\ + 20 labels)$^{\ddagger}$ & -- & -- & \realFBaran & -- & \damageBaran & -- \\
	Jellyfish-13B (ED+DI)$^{\ddagger}$ & -- & -- & \realFJelly & -- & \damageJelly & -- \\
	\bottomrule
	\end{tabular}

	\smallskip
	{\small $^{\dagger}$single seed, REAL + typo-injected slice only (GPU cost); other rows
	are 3-seed means. $^{\ddagger}$real slice only, disclosed protocol asymmetries
	(\S\ref{sec:ws4}): Baran uses oracle error positions + gold labels; Jellyfish is our
	detect-then-impute composition with seen-data caveats.}
	\end{table}

	\subsection{Generalization to never-seen tables}
	\label{sec:wild}
	The freeze-version system above was then pointed at data it had never seen, under
	three new harnesses (all released with this paper as the \textsc{WildClean} bundle).
	\textbf{(1) Paired bench}: \nPairs{} dirty/gold pairs spanning the Raha suite, SemTab
	ToughTables, government open-data typo corpora, entity-matching tables, and
	LLM-cleaning evaluation sets. On the 35 pairs from sources absent from training ---
	a count that coincidentally equals, but is distinct from, the \nWild{} gold-free wild
	tables of harness~(2) below --- the
	post-freeze system scores \textbf{macro F1 \unseenMacroF{} at damage
	\unseenMacroDamage}. The largest single contribution is the regime
	\S\ref{sec:capabilities}(b) unlocks: on five all-unique entity tables where no
	in-column frequency signal exists, F1 moves from $0.0$ to \ttFOne{} at zero damage.
	Cross-row voting (\S\ref{sec:capabilities}c) is the second: flights---many sources
	reporting the same flight---goes from \flightsBaseF{} to \flightsVoteF{} F1
	heuristic-only, and the heuristic hospital path doubles from \hospBaseHeur{} to
	\hospVoteHeur{}. The hospital union gate is invariant under all of this
	(\unionGatePrec{} at \unionGateCov). \textbf{(2) Wild bench}: \nWild{} uncurated
	in-the-wild tables (open-data portals, GitHub, Kaggle) with no gold; we score seeded
	inject--recovery on each table's own data (mean recovery \wildRecovery{} over the 34
	tables with inject scores; one table has none) plus a
	behavioral audit: every run yields a valid plan, every changed cell is attributable
	to a logged operation --- \textbf{zero silent edits across all \nWild{} tables}.
	\textbf{(3) Trust audit at scale}: \nTrust{} GitTables tables, same property ---
	\nTrust{}/\nTrust{} valid plans, zero crashes, zero silent edits. The held-out-source
	generalization metric (train and evaluation drawn from disjoint benchmark sources)
	remains low in absolute terms (GEN-F1 \genFTwo{}, variant-recall \genVRTwo{}, damage
	\genDamageTwo): cleaning unfamiliar tables is far from solved, and we report the
	number to anchor the next section's claim about \emph{where} the capability that does
	exist actually lives.

	\subsection{Where capability lives: a bounded null for fine-tuning}
	\label{sec:negative}
	Every attempt to move never-seen-table performance through the model weights failed;
	every gain in \S\ref{sec:wild} came from deterministic machinery plus the verifier.
	Five further supervised fine-tunes --- adding 109k harvested real-world alias pairs
	(ToughTables-derived, MusicBrainz search hints, RxNorm, OpenFlights), error-dense
	episode mixes, and a suspects-contract retrain --- left held-out GEN-F1
	\emph{statistically bounded}: every retrain's delta is positive but negligible (mean
	$+0.003$), never approaching the pre-registered $\delta{=}0.05$. ``Bounded'' is a
	tested equivalence claim, not an eyeballed one~\cite{lakens}: across the five-retrain
	series the mean held-out GEN-F1 delta (retrain minus champion) is $+0.0028$
	(90\% bootstrap CI $[+0.0008, +0.0049]$, strictly positive;
	10{,}000 resamples, seed 42; per-dataset granularity, $n{=}15$ over 3 held-out
	sources $\times$ 5 retrains --- per-pair deltas do not exist for the retrain
	series, so within-retrain deltas are clustered and we add a retrain-level
	robustness check, $n{=}5$ macro deltas), and TOST rejects effects beyond the
	pre-registered SESOI of $\pm 0.05$ ($p = 8.0\times10^{-16}$; retrain-level check
	$p = 8.3\times10^{-8}$). One disclosure sharpens the clustering caveat: two
	retrains' held-out rows are \emph{bit-identical} --- mechanically verified as
	verifier-collapse, not a data error (their raw plans share zero mapping entries,
	9 vs.\ 82 on flights, yet the verifier kills all of both, so each union
	degenerates to the same deterministic plan;
	\texttt{eval/results/equivalence\_coincidence.json}) --- so the $n{=}15$ rows
	carry fewer independent observations than their count suggests, which is exactly
	why the $n{=}5$ retrain-level test is the one we lean on. The collapse itself is
	the finding in miniature: different weights, same held-out behavior, because the
	verifier and the deterministic machinery decide what survives. Two reconciliations make the claim auditable. First, the
	basis: the equivalence series is scored against the champion's absolute GEN-F1 of
	\genChampionBasis{}, while the \genFTwo{} of \S\ref{sec:wild} is the \emph{shipped
	system} at the post-freeze HEAD with all deterministic capabilities --- the
	equivalence series scores each retrain's model-union path at its own capture time,
	so the two figures share a metric but not a basis. Second, the SESOI: weight
	interventions move GEN-F1 by at most $0.005$, while the deterministic machinery of
	\S\ref{sec:capabilities} moved the unseen-pair macro from $0.10$ to \unseenMacroF{}
	--- $\delta{=}0.05$ sits an order of magnitude above the measured weight effect and
	well below the machinery effect, which is exactly the boundary the test is meant to
	police. Mixing harvested pairs into the training blend
	\emph{diluted} the synthetic skill the executor verifies (a monotonic dilution law
	across mix ratios). A GRPO pilot using the executor as a verifiable reward (the
	direction RLVR table work~\cite{tabler1} motivates) was negative in all three arms at
	4B/LoRA scale: the main arm and a KL-anchored variant degraded plan-format validity,
	and a random-reward control arm reproduced the same drift, identifying it as an RL
	artifact rather than signal~\cite{spurious}. We state this as a \emph{bounded} null,
	not a universal one: at 4B/LoRA scale, under our propose/execute protocol and
	training budgets, no weight intervention we ran produced measurable movement in
	never-seen-table repair --- profiling visibility, reference grounding, cross-row
	consensus, convention conservatism, and plan-level verification carry the capability
	that exists. The bound is explicit: results with full-scale RL infrastructure
	(execution-verified rewards on multi-GPU RLVR stacks~\cite{spreadsheetrl,tabler1})
	show task skill moving at the same parameter scale, so our claim is about what
	SFT-and-pilot-RL buy in this protocol class, not about reinforcement learning in
	general. A second explicit bound: every weight experiment here uses the Qwen3
	family --- and the very work we cite to explain the control arm's drift documents
	that random-reward GRPO effects are themselves family-sensitive~\cite{spurious}
	--- so the null is stated for Qwen3-class models pending a cross-family
	replication. Concurrent evaluations corroborate the mechanism from independent
	directions~\cite{distort,debate}. The practical corollary is unusual but actionable:
	a contributor who wants to improve a system like this should write a deterministic
	capability and gate it with the verifier, not collect more training data.

	The null extends to test-time compute --- with one instructive exception that
	\emph{confirms} the architecture claim. Self-consistency \emph{voting} over
	$N{=}16$ temperature-0.7 samples (cell-edit-level majority, run through the
	identical verifier--union pipeline) yields 0.906 precision at 0.454 coverage
	versus 0.9055 at 0.4519 for matched greedy decoding on the same local runtime ---
	a null at matched precision, the visibility law from the test-time side: voting
	cannot surface repairs the profile does not expose, and it actively discards
	verified-recoverable coverage. But pooling \emph{every} mapping from all 16
	samples and letting the verifier filter the union gives the best operating point
	we measure for the 4B: \textbf{0.911 precision at 0.483 coverage} ($+0.6$ points
	precision, $+7.1$ points coverage over the shipped gate; an independent $N{=}8$
	replication reproduces the \emph{voted} point to $\pm 0.0003$ precision /
	$\pm 0.002$ coverage, and the greedy anchor exactly). The lesson is the paper's thesis in miniature: sampling
	helps only as a \emph{candidate generator}; consensus adds nothing the verifier
	does not already provide --- pool candidates, verify, do not vote. Separately,
	the local capture path itself (Q8 quantization with grammar-constrained decoding)
	is worth $+3.9$ points of coverage over the original Modal capture at equal
	precision.

	\subsection{Zero-label capability scaling: the verifier harness is planner-agnostic}
	\label{sec:scaling}
	The negative result bounds what fine-tuning small weights buys; it says nothing
	about raw capability. To separate the two we dropped zero-shot, $\leq$32B
	open-weights planners --- with \emph{no} task training --- into the identical
	hospital pipeline the 4B fine-tune uses: same prompt contract, same
	verify($\tau{=}0.5$), same union with the grounded heuristic
	(Table~\ref{tab:scaling}). devstral-small-2-24B and gemma4-31B both reach
	\textbf{\scalePrecBig{} precision at \scaleCovBig{} coverage} --- exceeding the
	fine-tune's union point of \unionGatePrec{} at \unionGateCov{} --- while
	nemotron-30B reaches \scalePrecNemo{} at \scaleCovNemo{} with JSON-plan validity
	0.4 (validity is part of the measurement: a planner that cannot reliably emit the
	plan schema loses coverage before capability is measured). gpt-oss-20B is
	excluded as a serving failure, documented rather than scored as capability: the
	hosted proxy returned empty content on every planning call despite full-length
	generation. The arm is multi-family (Mistral, Google, NVIDIA), which addresses
	the single-family bound of \S\ref{sec:negative} for the inference side; the
	weight-training null itself remains Qwen3-scoped. Disclosure: these models were
	measured via hosted inference for speed; all are $\leq$32B open weights and
	locally deployable in principle. The interpretation we draw is the paper's
	sharpest: SFT at 4B does not buy held-out generalization (\S\ref{sec:negative}),
	but raw capability at 24--31B does lift the same harness --- the verifier/union
	architecture is the portable contribution, converting any sufficiently capable
	planner into a trustworthy cleaner.

	\begin{table}[t]
	\centering
	\caption{Zero-shot $\leq$32B planners in the identical verify($\tau{=}0.5$)+union
	harness, hospital's \hospErrors{} real errors. Validity = fraction of planning
	calls returning schema-valid JSON. Runtime = wall-clock for the planning calls on
	hosted endpoints (single capture, no seeds; the 4B row is a prior Modal A100
	capture with no comparable local figure). Each scaling row is a single capture;
	the primary evidence is the union coverage delta ($+0.07$) at matched-or-better
	precision, not any single cell. For context, 16-sample pooling lifts the 4B
	fine-tune to $0.911@0.483$ at $16\times$ planning compute
	(\S\ref{sec:negative}); the 24--31B planners reach $0.915@0.485$ in a single
	greedy pass --- single-pass capability versus test-time compute, both converted
	into trustworthy operating points by the same verifier. Bold marks the best union operating point.
	gpt-oss-20B excluded (serving failure: empty
	proxy responses, not measurable capability).
	The identical devstral/gemma rows are a verified counting coincidence, not a
	scoring artifact: their applied cell-edit sets share 266 of 270 cells, each
	commits 4 model-specific repairs (all correct), and the totals coincide
	(\texttt{eval/results/scaling\_coincidence.json}).
	}
	\label{tab:scaling}
	\footnotesize
	\begin{tabular}{lccccc}
	\toprule
	Planner & Params & Gated P@C & Union P@C & Validity & Runtime (s) \\
	\midrule
	ScrubData-v6 (Qwen3-4B fine-tune) & 4B & 0.993 @ 0.287 & 0.905 @ 0.413 & --- & --- \\
	devstral-small-2 (Mistral) & 24B & 0.943 @ 0.426 & \textbf{0.915 @ 0.485} & 1.0 & \runtimeDevstral \\
	nemotron-3-nano (NVIDIA) & 30B & 1.000 @ 0.138 & 0.877 @ 0.336 & 0.4 & \runtimeNemo \\
	gemma4 (Google) & 31B & 0.943 @ 0.426 & \textbf{0.915 @ 0.485} & 1.0 & \runtimeGemma \\
	\bottomrule
	\end{tabular}
	\end{table}

	\subsection{Ablations}
	All ablations are 3-seed means (CIs $\le\pm0.003$). Removing abstention costs $-0.013$
	NORTH, raises damage to \ablNoAbstainDamage{} (from \ablFullDamage), and collapses trap
	abstention to \ablNoAbstainAbstain. Removing the ambiguity margin costs $-0.006$ with
	$+0.001$ damage. Removing case matching costs $-0.002$ under the churn-neutral metric
	(and \emph{gained} $+0.12$ under the uncorrected metric---the artifact). Replacing
	grounding with frequency clustering gains $+0.020$ NORTH, all of it from the injected
	slice (\S\ref{sec:eval}), while ceding $-0.039$ real-error F1---the trade the system
	refuses by design.

	\subsection{Learned-repair baselines under disclosed protocols}
	\label{sec:ws4}
	We additionally run two learned-repair baselines on the real-error (Raha) slice,
	under the identical churn-neutral metric but with honestly disclosed protocol
	asymmetries. \textbf{Baran}~\cite{raha} is semi-supervised: we run its reference
	configuration---oracle error positions from the dirty/gold diff plus 20 gold-labeled
	tuples per dataset (its package default), without the optional Wikipedia-pretrained
	value models. It reaches REAL-F1 \realFBaran{}$\,\pm$\realFBaranCI{} (3 label-sampling
	seeds) at \damageBaran{} damage---an upper bound under a strictly more informed
	protocol than ours (zero labels, no oracle detection); with oracle positions it can
	essentially only edit true-error cells, so its near-zero damage is structural.
	\textbf{Jellyfish-13B}~\cite{jellyfish} publishes per-cell error detection and
	imputation but no repair task; we compose the two (detect, then impute flagged cells
	with the attribute masked) --- a pipeline of our construction, not theirs. It scores
	REAL-F1 \realFJelly{} at \damageJelly{} damage (single seed, recommended decoding;
	note hospital is in its instruction-tuning data and flights/rayyan in its published
	evaluation suite, so these numbers may flatter it). Neither baseline is run on the
	56-spec injected suite (computationally and methodologically out of scope for
	semi-supervised and per-cell-LLM repair); their NORTH/INJ-F1 cells in
	Table~\ref{tab:money} are blank by design. The comparison locates our contribution:
	zero-config systems (ours, OpenRefine) occupy a different protocol class from
	supervised repair, and the verifier (\S\ref{sec:ws1results}) is what makes the
	zero-config class precise enough to trust, not what closes the labeled gap.

	Table~\ref{tab:perdataset} breaks the real-error slice down per dataset at HEAD.
	The verified-union rows are reported with their honest shape: off hospital the
	union turns ultra-conservative --- on rayyan it commits 12 changes at 0.001
	damage; on beers it holds precision 0.546 at recall 0.018. The gate's precision
	premise transfers as \emph{safety} (union damage stays at 0.001--0.080) but not
	as coverage. The movies\_1 union cell ($^{q}$: local Q8 capture, the disclosed
	quantized protocol) is the instructive worst case: on entity-rich name columns
	the quantized planner proposes plausible-but-wrong merges
	(\texttt{The Longest Day}$\,\to\,$\texttt{The Longest Yard}); the verifier kills
	most, and what leaks through is damage within the disclosed band with zero
	credited fixes --- the planner contributes nothing there, and the system's value
	is that it \emph{contains} a bad planner rather than amplifying it. This directly
	answers the co-adaptation concern: hospital is where the model's learned mappings
	live, and elsewhere the system abstains or contains rather than guesses.

	\begin{table}[t]
	\centering
	\caption{Per-dataset real-error results (Raha slice), churn-neutral F1 / damage.
	Grounded is the HEAD deterministic system; OR = OpenRefine reimplementations;
	Union is the verified union planner ($\tau{=}0.5$) where a captured model plan
	exists (movies\_1 capture pending); Baran uses oracle error positions + 20 gold
	labels (mean of 3 label-sampling seeds) and is a supervised reference, not a
	peer.}
	\label{tab:perdataset}
	\footnotesize
	\begin{tabular}{lccccc}
	\toprule
	Dataset & Grounded (HEAD) & OR fingerprint & OR kNN & Verified union & Baran (oracle+20) \\
	\midrule
	hospital & 0.258 / .066 & 0.000 / .000 & 0.189 / .083 & 0.567 / .001 & 0.827 / .004 \\
	beers & 0.025 / .005 & 0.194 / .000 & 0.086 / .074 & 0.035 / .001 & 0.918 / .000 \\
	flights & 0.127 / .082 & 0.000 / .000 & 0.014 / .065 & 0.035 / .080 & 1.000 / .000 \\
	rayyan & 0.000 / .118 & 0.000 / .001 & 0.002 / .008 & 0.000 / .001 & 0.402 / .010 \\
	movies\_1 & 0.714 / .025 & 0.002 / .018 & 0.001 / .072 & 0.000 / .025$^{q}$ & 0.909 / .001 \\
	\midrule
	macro F1 & \realFOursHead & 0.039 & 0.058 & --- & \realFBaran \\
	\bottomrule
	\end{tabular}
	\end{table}

	\subsection{A matched label budget separates the supervision regimes}
	\label{sec:labelcurve}
	The Baran comparison above is two points (zero labels, twenty labels); the
	matched-budget curve in Figure~\ref{fig:labelcurve} measures what each label is
	worth to each system on the same five-dataset real-error macro. At zero labels
	Baran --- even \emph{retaining} its oracle error positions --- repairs nothing
	(F1 \realFBaranZero, 3 seeds): its value models have nothing to learn from.
	ScrubData operates at \realFOursHead{} with zero configuration. With labels Baran
	climbs steeply (\realFBaranFive{} at $k{=}5$, \realFBaran{} at $k{=}20$): the two
	systems occupy complementary supervision regimes, a relationship now measured
	rather than asserted. ScrubData's own $k$-label arm uses the labels \emph{only}
	to validate and expand the verifier accept set --- no retraining, no oracle
	positions: $\realFOursFive \pm 0.023$ at $k{=}5$ and $\realFOursTwenty \pm 0.012$
	at $k{=}20$ (3 label-sampling seeds). The disclosed asymmetry stands at every
	budget: Baran keeps oracle error positions throughout, so the curve is an upper
	bound in its favor.

	\begin{figure}[t]
	\centering
	\includegraphics[width=0.62\linewidth]{fig_label_curve}
	\caption{Matched-budget label curve, five-dataset real-error macro F1. At
	$k{=}0$ Baran repairs nothing even with oracle error positions retained;
	ScrubData operates at \realFOursHead{} with zero configuration. With labels
	Baran climbs steeply --- complementary supervision regimes, measured. Error
	bars ($\pm$) are standard deviations over 3 label-sampling seeds; the Baran
	$k{=}20$ point reuses the 3-seed baseline run of \S\ref{sec:ws4}.}
	\label{fig:labelcurve}
	\end{figure}

	\subsection{Degenerate baselines and cost-weighted damage}
	\label{sec:degenerate}
	Four degenerate policies pin the metric's floor and ceiling on the full 42-pair
	bench (Table~\ref{tab:degenerate}). No-op and oracle land exactly at 0 and 1;
	abstain-all is score-identical to no-op because the repair metric is flag-blind
	by design (abstentions are audited separately); seeded random editing of 5\% of
	cells is vandalism the metric must punish. Since F1 alone under-punishes
	vandalism, we add a cost-weighted score in the Effective-Reliability style,
	$\Phi_c = (\mathrm{fixes} - c\cdot\mathrm{damaged})/\mathrm{errors}$ at
	$c \in \{1, 5, 10\}$: random editing scores $-0.49$ to $-4.89$, while the
	shipped system stays positive at $c{=}1$ (\degShippedPhiOne) --- and goes
	negative at higher $c$, which is the honest reading: at 10:1 cost asymmetry,
	only near-zero-damage operating points (the verified union) are defensible.

	One disclosure: the oracle acceptance check itself surfaced a scorer artifact
	--- 3 cells in 1.79M held the literal string \texttt{Nan} (a first name), which
	parses to float NaN and was unequal to itself --- now fixed in
	\texttt{eval/metrics.py} with a regression test; published numbers shift by
	less than $10^{-4}$.

	\begin{table}[t]
	\centering
	\caption{Degenerate policies pin the metric (42 pairs, churn-neutral macro;
	random-edit: seeded, 5\% of cells). $\Phi_c$ is micro-summed
	$(\mathrm{fixes} - c\cdot\mathrm{damaged})$ per benchmark error. ``Shipped''
	here is the deterministic grounded path on the 42 pairs (damage
	\degShippedDamage), distinct from the verified-union suite row of
	Table~\ref{tab:money} (damage \modelDamage).}
	\label{tab:degenerate}
	\small
	\begin{tabular}{lccccccc}
	\toprule
	Policy & F1 & P & R & damage & $\Phi_1$ & $\Phi_5$ & $\Phi_{10}$ \\
	\midrule
	no-op & 0.000 & 1.000 & 0.000 & 0.000 & $0.00$ & $0.00$ & $0.00$ \\
	abstain-all & 0.000 & 1.000 & 0.000 & 0.000 & $0.00$ & $0.00$ & $0.00$ \\
	random-edit & 0.000 & 0.001 & 0.001 & 0.049 & $-0.49$ & $-2.45$ & $-4.89$ \\
	oracle & 1.000 & 1.000 & 1.000 & 0.000 & $+1.00$ & $+1.00$ & $+1.00$ \\
	shipped & \degShippedF & \degShippedP & 0.308 & \degShippedDamage & $+0.13$ & $-1.37$ & $-3.26$ \\
	\bottomrule
	\end{tabular}
	\end{table}

	\subsection{Calibration of abstention}
	\label{sec:calibration}
	On a probe of reference-entity typos plus garbage traps, retrieval confidence is a
	usable selective-prediction signal: AURC \aurc, ECE \ece{} (over-confident;
	temperature scaling is future work), and
	precision rises monotonically with threshold---\precAtDefault{} precision at the default
	$\tau{=}0.84$ (coverage \covAtDefault), and $\geq$95\% precision at
	$\tau{=}\threshNinetyFive$ (coverage \covNinetyFive). Figure~\ref{fig:rc} shows the
	risk--coverage curve.

	\begin{figure}[t]
	\centering
	\includegraphics[width=0.62\linewidth]{fig_risk_coverage}
	\caption{Risk--coverage for grounded city reconciliation (650 probes). Operating points
	annotated; the confidence supports thresholded abstention.}
	\label{fig:rc}
	\end{figure}

	\section{Limitations}
	Reference coverage is the recall ceiling: entities absent from the taxonomy abstain by
	design, which is safe but not helpful; coverage work (larger gazetteers, ROR for
	organizations) moves recall directly. Our damage metric is convention-tolerant for case
	and whitespace but still counts alias expansion (\texttt{NYC}$\to$\texttt{New York}) as
	damage when the gold keeps the alias---a value-level convention question we leave open.
	The confidence signal is over-confident (ECE \ece); temperature scaling is future
	work. The injected half of the suite, while seeded and reproducible, inherits the
	injector's error model; we mitigate with the real-error slice and report both. All
	weight-training experiments (SFT and GRPO) use a single model family (Qwen3), so
	the negative result of \S\ref{sec:negative} is family-scoped until replicated on a
	second family. PII
	coverage is English-only, and we make no de-identification guarantee. Finally, the
	fine-tune headline is reported with multi-seed confidence intervals, but the wide-suite
	model row is single-seed for cost reasons and scoped as such.

	\section{Conclusion}
	A planner/executor decomposition with plan-level selective prediction --- the model
	proposes, a deterministic engine executes, a verifier gates every mapping --- turns
	LLM data cleaning from a trust liability into an auditable system: every change is a
	named, reversible operation; uncertain actions become review flags rather than silent
	corruptions; and the evaluation itself is built to resist gaming. The post-freeze
	program sharpened the architecture into a finding: across
	five further fine-tunes and a three-arm GRPO pilot, the weights never moved
	never-seen-table performance --- deterministic visibility, grounding, consensus, and
	verification did, at zero silent edits across \nWild{} wild tables and a
	\nTrust{}-table trust audit. The scaling arm completes the picture: the bounded null
	is about fine-tuning small weights, not about capability --- two of three zero-shot
	24--31B planners dropped into the unchanged verifier harness exceed the
	fine-tune's operating point (\S\ref{sec:scaling}), so the architecture is
	planner-agnostic: capability gains arrive as better operating points without
	retraining. The shipped system runs
	entirely locally on commodity hardware and no data leaves the machine; the
	scaling-arm planners were measured via hosted endpoints, but all are locally
	deployable open weights. We believe the recipe---propose/execute decomposition,
	verification-by-execution, retrieval-grounded outputs, and selective prediction over
	deterministic capabilities---is a template for deploying small specialized models on
	other structured tasks.

	\section*{Reproducibility}
	\begin{sloppypar}
	The model weights are public:
	\url{https://huggingface.co/ricalanis/scrubdata-qwen3-4b-v6-q8}. Code, evaluation
	suite, and result artifacts are released at the project repository,
	\url{https://github.com/ricalanis/scrubdata-hackathon} (public upon publication,
	available to reviewers from the initial submission). The \textsc{WildClean}
	bundle --- redistributable dirty/gold pairs, the GitTables audit slice, open
	vocabularies, result JSONs, and license-gated loaders for the non-redistributable
	pairs --- is a public Hugging Face dataset
	(\url{https://huggingface.co/datasets/ricalanis/wildclean}). The shipped product
	planner is the identical code path measured here (\texttt{scrubdata/active.py}).
	\end{sloppypar}

	\paragraph{Release integrity.} Our own reproducibility QA discovered that the
	published Q8\_0 GGUF was corrupted by an export bug (the export declared a wrong
	end-of-generation token id inside the Qwen3 vocabulary, degenerating into
	tool-call loops on all runtimes; a base-model control isolated the fault to the
	export, not the adapter). It has been re-exported from the v6 adapter and
	replaced under the same filename, with both sha256 checksums recorded in the
	model card's Integrity section. Third-party reproduction of the model-path
	numbers additionally requires constrained decoding on long prompts ---
	\texttt{format=json} under Ollama, or
	\texttt{suppress\_tokens=[151657,151658]} under transformers --- which is now
	documented in the model card and \texttt{notebooks/Modelfile}.

	\paragraph{Setup.} Clone the repository and run \texttt{uv sync} (Python 3.12;
	\texttt{uv} resolves the pinned environment). The non-redistributable benchmark
	pairs materialize from their original sources with the \textsc{WildClean}
	\texttt{loaders.py}. Model-path results additionally need the released Q8\_0 GGUF
	served by a local Ollama (\texttt{SCRUBDATA\_MODEL}); every deterministic-path
	number runs with no model at all. Baran runs in the separate pinned environment
	documented at the top of \texttt{eval/run\_baran.py}; Jellyfish-13B runs remotely
	via Modal.

	\paragraph{One command per reported number} (all from the repository root, at the
	released revision):

	\begin{center}
	\footnotesize
	\begin{tabular}{@{}ll@{}}
	\toprule
	Reported result & Command \\
	\midrule
	Wide-suite comparison (Table~\ref{tab:money}) & \texttt{python -m eval.run\_real\_multi --out eval/results} \\
	Precision--coverage curve + gate & \texttt{python -m eval.precision\_curve} \\
	\quad (Figure~\ref{fig:pc}, \S\ref{sec:ws1results}) & \texttt{\ \ --plan eval/results/v6\_hospital\_raw\_plan.json --union} \\
	Ablations & \texttt{python -m eval.ablations} \\
	Calibration (Figure~\ref{fig:rc}) & \texttt{python -m eval.calibration} \\
	PII leak test & \texttt{python -m eval.pii\_leak} \\
	Baran baseline & \texttt{python eval/run\_baran.py}, then \\
	& \texttt{python -m eval.baselines\_learned --score-baran} \\
	Jellyfish baseline & \texttt{modal run scripts/modal\_jellyfish.py} \\
	\midrule
	Paired bench (\S\ref{sec:wild}) & \texttt{python -m eval.paired\_bench} \\
	Wild bench (\S\ref{sec:wild}) & \texttt{python -m eval.wild\_bench} \\
	GitTables trust audit (\S\ref{sec:wild}) & \texttt{python -m eval.gittables\_audit} \\
	Held-out-source generalization & \texttt{python -m eval.generalization} \\
	\midrule
	Scorer validation (\S\ref{sec:eval}) & \texttt{python -m pytest tests/test\_wildclean\_scorer.py} \\
	Degenerate baselines (Table~\ref{tab:degenerate}) & \texttt{python -m eval.degenerate} \\
	TOST equivalence (\S\ref{sec:negative}) & \texttt{python -m eval.equivalence} \\
	Label curve (Figure~\ref{fig:labelcurve}) & \texttt{python -m eval.label\_curve} \\
	Per-dataset table (Table~\ref{tab:perdataset}) & \texttt{python -m eval.raha\_table} \\
	Self-consistency vote/pool (\S\ref{sec:negative}) & \texttt{python -m eval.sc\_rerank --model scrubdata-ft --n 16} \\
	Scaling arm (Table~\ref{tab:scaling}) & \texttt{python -m eval.scaling\_arm} \\
	\bottomrule
	\end{tabular}
	\end{center}

	\begin{thebibliography}{20}
	\bibitem{raha} M.~Mahdavi, Z.~Abedjan, R.~Castro Fernandez, S.~Madden, M.~Ouzzani,
	M.~Stonebraker, N.~Tang. Raha: A Configuration-Free Error Detection System. SIGMOD
	2019; M.~Mahdavi, Z.~Abedjan. Baran: Effective Error Correction via a Unified Context
	Representation and Transfer Learning. PVLDB 13(11):1948--1961, 2020.
	\bibitem{holoclean} T.~Rekatsinas, X.~Chu, I.~F.~Ilyas, C.~R\'e. HoloClean: Holistic
	Data Repairs with Probabilistic Inference. PVLDB 10(11), 2017. arXiv:1702.00820.
	\bibitem{garf} J.~Peng, D.~Shen, N.~Tang, T.~Liu, Y.~Kou, T.~Nie, H.~Cui, G.~Yu.
	Self-Supervised and Interpretable Data Cleaning with Sequence Generative Adversarial
	Networks (GARF). PVLDB 16(3):433--446, 2022.
	\bibitem{wrangle} A.~Narayan, I.~Chami, L.~Orr, S.~Arora, C.~R\'e. Can Foundation
	Models Wrangle Your Data? PVLDB 16(4):738--746, 2022. arXiv:2205.09911.
	\bibitem{jellyfish} H.~Zhang, Y.~Dong, C.~Xiao, M.~Oyamada. Jellyfish:
	Instruction-Tuning Local Large Language Models for Data Preprocessing. EMNLP 2024.
	arXiv:2312.01678.
	\bibitem{cocoon} S.~Zhang, Z.~Huang, E.~Wu. Data Cleaning Using Large Language Models
	(Cocoon). arXiv:2410.15547, 2024 (preprint; no published reproduction).
	\bibitem{zeroed} W.~Ni, K.~Zhang, X.~Miao, X.~Zhao, Y.~Wu, Y.~Wang, J.~Yin. ZeroED:
	Hybrid Zero-Shot Error Detection Through Large Language Model Reasoning. ICDE 2025.
	arXiv:2504.05345.
	\bibitem{forested} M.~Wang, J.~Wang, Q.~Liu, X.~Xu, Z.~Xing, L.~Zhu, W.~Zhang.
	Ensembling LLM-Induced Decision Trees for Explainable and Robust Error Detection.
	arXiv:2512.07246, 2025 (preprint).
	\bibitem{autotest} Q.~Chen, Y.~He, R.~C.-W.~Wong, W.~Cui, S.~Ge, H.~Zhang, D.~Zhang,
	S.~Chaudhuri. Auto-Test: Learning Semantic-Domain Constraints for Unsupervised Error
	Detection in Tables. SIGMOD 2025. arXiv:2504.10762.
	\bibitem{gidcl} M.~Yan, Y.~Wang, Y.~Wang, X.~Miao, J.~Li. GIDCL: A Graph-Enhanced
	Interpretable Data Cleaning Framework with Large Language Models. Proc.\ ACM Manag.\
	Data 2(6), Article 236, 2024 (SIGMOD).
	\bibitem{spreadsheetrl} B.~Chi, Y.~Xie, M.~Wu, J.~Yang, J.~Jiang, Z.~Li, et al.
	Spreadsheet-RL: Advancing Large Language Model Agents on Realistic Spreadsheet Tasks
	via Reinforcement Learning. arXiv:2605.22642, 2026.
	\bibitem{distort} A.~Dutta, H.~Nigam, H.~Hasanbeig, A.~Radhakrishna, S.~Gulwani.
	An Empirical Investigation of Robustness in Large Language Models under Tabular
	Distortions. arXiv:2601.05009, 2026.
	\bibitem{debate} C.~Parmar, A.~Mehta, H.~Wu, J.~Ramamurthy, S.~Medhekar. When Helping
	Hurts and How to Fix It: Multi-Agent Debate for Data Cleaning. arXiv:2606.02866, 2026.
	\bibitem{tabler1} Z.~Yang, L.~Chen, A.~Cohan, Y.~Zhao. Table-R1: Inference-Time
	Scaling for Table Reasoning. EMNLP 2025. arXiv:2505.23621.
	\bibitem{spurious} R.~Shao, S.~S.~Li, R.~Xin, S.~Geng, Y.~Wang, et al. Spurious
	Rewards: Rethinking Training Signals in RLVR. arXiv:2506.10947, 2025.
	\bibitem{tablegpt} P.~Li, Y.~He, D.~Yashar, W.~Cui, S.~Ge, H.~Zhang, D.~Rifinski
	Fainman, D.~Zhang, S.~Chaudhuri. Table-GPT: Table Fine-tuned GPT for Diverse Table
	Tasks. Proc.\ ACM Manag.\ Data 2(3), Article 176, 2024 (SIGMOD). arXiv:2310.09263.
	\bibitem{retclean} Z.~A.~Naeem, M.~S.~Ahmad, M.~Eltabakh, M.~Ouzzani, N.~Tang.
	RetClean: Retrieval-Based Data Cleaning Using LLMs and Data Lakes. PVLDB 17(12), 2024
	(demo). arXiv:2303.16909.
	\bibitem{turl} X.~Deng, H.~Sun, A.~Lees, Y.~Wu, C.~Yu. TURL: Table Understanding
	through Representation Learning. PVLDB 14(3):307--319, 2021. arXiv:2006.14806.
	\bibitem{tablellama} T.~Zhang, X.~Yue, Y.~Li, H.~Sun. TableLlama: Towards Open Large
	Generalist Models for Tables. NAACL 2024. arXiv:2311.09206.
	\bibitem{belotti} F.~Belotti, F.~Dadda, M.~Cremaschi, R.~Avogadro, M.~Palmonari.
	Evaluating LLMs on Entity Disambiguation in Tables. arXiv:2408.06423, 2024 (preprint).
	\bibitem{racoon} L.~L.~Wei, G.~Xiao, M.~Balazinska. RACOON: An LLM-based Framework for
	Retrieval-Augmented Column Type Annotation with a Knowledge Graph. arXiv:2409.14556,
	2024 (preprint).
	\bibitem{mtab} P.~Nguyen, N.~Kertkeidkachorn, R.~Ichise, H.~Takeda. MTab: Matching
	Tabular Data to Knowledge Graph using Probability Models. SemTab/ISWC 2019.
	arXiv:1910.00246.
	\bibitem{selective} R.~El-Yaniv, Y.~Wiener. On the Foundations of Noise-free Selective
	Classification. JMLR 11:1605--1641, 2010; Y.~Geifman, R.~El-Yaniv. Selective
	Classification for Deep Neural Networks. NeurIPS 2017.
	\bibitem{openmed} M.~Panahi. OpenMed NER: Open-Source, Domain-Adapted State-of-the-Art
	Transformers for Biomedical NER Across 12 Public Datasets. arXiv:2508.01630, 2025
	(preprint).
	\bibitem{lakens} D.~Lakens. Equivalence Tests: A Practical Primer for t Tests,
	Correlations, and Meta-Analyses. Social Psychological and Personality Science
	8(4):355--362, 2017.
	\bibitem{grouse} S.~Muller, A.~Loison, B.~Omrani, G.~Viaud. GroUSE: A Benchmark
	to Evaluate Evaluators in Grounded Question Answering. COLING 2025.
	arXiv:2409.06595.
	\end{thebibliography}

	\end{document}