Spaces:
Running
Running
| \documentclass[11pt]{article} | |
| \usepackage[utf8]{inputenc} % no-op on TeXLive >= 2018 (arXiv pdflatex); explicit for safety | |
| \usepackage[margin=1in]{geometry} | |
| \usepackage{amsmath,amssymb} | |
| \usepackage{booktabs} | |
| \usepackage{graphicx} | |
| \usepackage{url} | |
| \usepackage{xcolor} | |
| \usepackage[numbers]{natbib} | |
| \input{numbers} | |
| \title{Verified Cleaning Plans: Plan-Level Selective Prediction Turns Local LLM | |
| Planners into Trustworthy Table Cleaners} | |
| \author{Ricardo Alanis\\ \small{\texttt{ricardo.alanis@gmail.com}}} | |
| \date{June 2026} | |
| \begin{document} | |
| \maketitle | |
| \begin{abstract} | |
| Cleaning messy tabular data---particularly \emph{canonicalization}, the merging of | |
| inconsistent surface forms such as \texttt{USA}/\texttt{U.S.A}/\texttt{united states} | |
| into one canonical value---resists rule-based automation and is routinely done by hand. | |
| We present ScrubData, an architecture built around a trust contract. A local LLM | |
| \emph{planner} reads an aggregated column profile (per-value frequency counts, | |
| invariant to row count) and \emph{proposes} a structured JSON cleaning plan; a | |
| deterministic executor \emph{applies} it, making every change auditable and reversible; | |
| and \emph{plan-level selective prediction} --- a deterministic verifier that scores | |
| every proposed mapping and drops low-confidence entries to review flags --- extends | |
| abstention from cell-level confidence to the plan itself. The verified union of the | |
| gated model plan with a reference-grounded heuristic is the architecture's operating | |
| point: a zero-configuration, zero-label system that repairs 41\% of the hospital | |
| benchmark's 509 real errors at \unionGatePrec{} precision (strongest of three training | |
| seeds; 3-seed mean \unionGateThreeSeedPrec{} at \unionGateThreeSeedCov{} coverage, | |
| $\pm$ = 95\% CI), | |
| with every declined merge surfaced for review. Four deterministic capabilities --- | |
| profile-level \emph{suspect surfacing} for high-cardinality columns, reconciliation | |
| against a pluggable \emph{entity reference} built from open vocabularies, | |
| \emph{cross-row majority voting} over repeated-entity groups, and | |
| \emph{convention-conservatism} gates --- carry the system to never-seen tables: | |
| macro F1 \unseenMacroF{} at \unseenMacroDamage{} damage over the 35 unseen-source | |
| pairs of a \nPairs-pair benchmark, and \emph{zero silent edits} across \nWild{} wild | |
| tables plus a \nTrust-table trust audit, released together as the \textsc{WildClean} | |
| benchmark. | |
| Finally, we report where the capability lives. \emph{Execution-verified} synthetic | |
| supervision --- a training example is kept only if executing its plan provably | |
| recovers the known-clean table --- buys the 4B fine-tune real in-distribution skill | |
| and the most precise gated planner at usable coverage (\modelGatePrec{} precision at | |
| \modelGateCov{} coverage); but five further retrains and a three-arm GRPO pilot leave | |
| held-out generalization statistically bounded (TOST against a pre-registered margin), | |
| while two of three zero-shot 24--31B open-weights planners (devstral-24B, gemma4-31B) | |
| dropped into the \emph{identical} harness exceed the fine-tune's operating point | |
| (\scalePrecBig{} precision at \scaleCovBig{} coverage) with no task training. The | |
| architecture is planner-agnostic: it converts capability gains into trustworthy | |
| operating points without retraining. The shipped system runs entirely locally on commodity hardware; | |
| no data leaves the machine (the scaling-arm planners were measured via hosted | |
| endpoints; all are locally deployable open weights). | |
| \end{abstract} | |
| \section{Introduction} | |
| A large share of practical data work is cleaning: a sales export where the same country | |
| is spelled four ways, a hospital roster where \texttt{birminghxm} should be | |
| \texttt{birmingham}, a CRM dump with mixed date formats and duplicated contacts. The | |
| fuzzy half of this work---recognizing that distinct surface forms denote the same | |
| entity---is exactly what rules do poorly and humans do slowly. | |
| Large language models can do this fuzzy matching, but deploying them as cell editors has | |
| three problems. First, \emph{trust}: a model that edits cells directly can silently | |
| corrupt data, and its errors are unauditable. Second, \emph{cost and privacy}: shipping | |
| every row of a private table to a hosted frontier model is expensive and often | |
| unacceptable. Third, \emph{hallucination}: asked for a canonical form, a generative model | |
| will invent one, and on tail entities it will invent wrong ones. | |
| ScrubData addresses all three with an architecture in which the model never touches | |
| data. A profiler aggregates each column into a value-frequency distribution; a small | |
| local model reads the profile and \emph{proposes} a JSON cleaning plan; a deterministic | |
| pandas executor \emph{applies} it. The plan is the complete, inspectable, reversible | |
| specification of every change---there are no silent edits by construction | |
| (\S\ref{sec:method}). Because the prompt scales with the number of \emph{distinct} | |
| values rather than rows, a million-row table profiles like a hundred-row one. | |
| This paper makes five contributions: | |
| \begin{enumerate} | |
| \item \textbf{A planner/executor decomposition with plan-level selective prediction}: | |
| the model proposes, a deterministic engine executes with full lineage, and a | |
| deterministic verifier gates every proposed mapping, extending abstention to the plan | |
| itself. The verified union of the gated model plan with a reference-grounded | |
| heuristic repairs 41\% of hospital's \hospErrors{} real errors at \unionGatePrec{} | |
| precision with zero configuration and zero labels (\S\ref{sec:method}, | |
| \S\ref{sec:verifier}, \S\ref{sec:ws1results}). | |
| \item \textbf{\textsc{WildClean} and an un-gameable evaluation}: a 65-dataset suite | |
| (real-error benchmarks plus seeded error injection over 15 harvested open-data | |
| domains) scored with a churn-neutral, convention-tolerant metric that cannot be | |
| inflated by mass rewriting, with damage and silent edits scored alongside repair F1, | |
| degenerate baselines pinning the metric's floor and ceiling, and the scorer itself | |
| validated against 30 adversarial known-by-construction cases (\S\ref{sec:eval}, | |
| \S\ref{sec:degenerate}). | |
| \item \textbf{Four deterministic capabilities that carry never-seen-table | |
| generalization}: bounded suspect surfacing for high-cardinality columns, generic | |
| entity-reference reconciliation with an exact-hit typing floor, cross-row majority | |
| voting with a false-consensus guard, and convention-conservatism gates --- each | |
| motivated by a measured failure regime and gated by the verifier | |
| (\S\ref{sec:capabilities}, \S\ref{sec:wild}). | |
| \item \textbf{Execution-verified synthetic supervision}, the training method behind | |
| the 4B planner instantiation: every training example is validated by running the | |
| executor on the (dirty table, plan) pair and checking that the known-clean table is | |
| recovered; non-recovering examples are discarded (\S\ref{sec:sft}). | |
| \item \textbf{A unified finding on where capability lives in this architecture}: five | |
| further supervised fine-tunes and a three-arm GRPO pilot with the executor as a | |
| verifiable reward leave held-out generalization statistically bounded (TOST), while | |
| two of three zero-shot 24--31B planners dropped into the same harness exceed the | |
| fine-tune's operating point --- deterministic machinery plus plan-level verification carry the | |
| generalization that exists, and raw planner capability, not task fine-tuning, scales | |
| it (\S\ref{sec:negative}, \S\ref{sec:scaling}). | |
| \end{enumerate} | |
| We deliberately report a negative-flavored finding alongside the positive ones: on | |
| \emph{injected} typos, classical frequency clustering remains a strong baseline---by | |
| construction, injection places the canonical form in the column, which is clustering's | |
| ideal regime. The advantage of grounding is concentrated where it matters: real errors, | |
| tail entities absent from the column, and adversarial near-misses where acting at all is | |
| wrong (\S\ref{sec:results}). | |
| \section{Related Work} | |
| \label{sec:related} | |
| \textbf{Error detection and repair.} Raha and Baran~\cite{raha} established | |
| configuration-free error detection and correction benchmarks (hospital, beers, flights, | |
| rayyan), which we adopt as out-of-distribution evaluation. HoloClean~\cite{holoclean} | |
| combines integrity constraints, external reference data, and statistics in probabilistic | |
| repair, demonstrating that external signals can veto statistically plausible but wrong | |
| repairs---an insight our reference-veto inherits. GARF~\cite{garf} learns repair rules | |
| self-supervised from the data itself; it also demonstrates the structural limit we | |
| observe for frequency-only methods: a lone categorical column offers no co-occurring | |
| signal to vote against an error. | |
| \textbf{The 2025--26 landscape.} Post-Cocoon work concentrates on zero-label | |
| \emph{detection}: ZeroED~\cite{zeroed} (cloud-LLM cluster labeling, hospital | |
| detection F1 0.81, collapsing to 0.27 on smaller models), ForestED~\cite{forested} | |
| (LLM-induced decision trees, 0.756), and Auto-Test~\cite{autotest} (corpus-mined | |
| semantic-domain constraints, no LLM at inference) --- none performs zero-label | |
| \emph{repair}. GIDCL~\cite{gidcl} sets the labeled-class repair ceiling | |
| (hospital \gidclHosp{} with 20 labels and a LoRA trained per cleaned table); | |
| Cocoon~\cite{cocoon} remains an unreproduced preprint (15 citing papers, none a | |
| reproduction). Two concurrent results corroborate facets of this paper's central | |
| negative finding that machinery, not weights, carries cleaning generalization: a | |
| study showing even frontier models cannot correct table distortions without | |
| explicit priors~\cite{distort}, and a large multi-agent-debate evaluation in which | |
| LLM self-critique \emph{degrades} repair and only an adversarially separate, | |
| execution-grounded critic helps~\cite{debate} --- the architecture our verifier | |
| instantiates. Spreadsheet-RL~\cite{spreadsheetrl} reports the complementary | |
| positive case: with full-scale RL infrastructure and execution-verified rewards, | |
| a 4B model's spreadsheet-manipulation skill \emph{does} move (12.0\% | |
| $\rightarrow$ 23.4\%) --- consistent with our reading that the gap between our | |
| \$30 pilot and such results is infrastructure scale, a boundary we state rather | |
| than blur (\S\ref{sec:negative}). | |
| \textbf{LLMs for data wrangling.} Narayan et al.~\cite{wrangle} showed frontier | |
| foundation models handle entity matching and imputation few-shot; | |
| Jellyfish~\cite{jellyfish} and Table-GPT~\cite{tablegpt} fine-tune mid-size models for | |
| data tasks. RetClean~\cite{retclean} is closest in spirit: retrieval from data lakes | |
| grounds cell repair, with the key empirical split that parametric knowledge suffices on | |
| world-known head values but collapses on the tail---motivating retrieval. Our work | |
| differs in the planner/executor decomposition (the model emits no cell values, only | |
| plans), in execution-verified supervision, and in the calibrated-abstention contract. | |
| \textbf{Entity linking over tables.} TURL~\cite{turl} and TableLlama~\cite{tablellama} | |
| inject candidate entities into table understanding; Belotti et al.~\cite{belotti} | |
| show retriever coverage is the accuracy ceiling for table entity disambiguation and that | |
| long candidate lists hurt smaller models. RACOON~\cite{racoon} shows inference-time KG | |
| retrieval lifts a frozen model substantially, supporting our choice to ground at | |
| inference rather than bake aliases into weights (TURL's out-of-domain collapse is the | |
| cautionary result). MTab~\cite{mtab} established type-constrained matching with | |
| abstention in semantic table annotation. | |
| \textbf{Clustering-based cleaning tools.} The de-facto practitioner baseline is | |
| OpenRefine: key-collision (fingerprint) clustering plus a nearest-neighbour mode; we | |
| reimplement both faithfully, including blocking, and compare head-to-head. | |
| \textbf{Selective prediction.} Risk--coverage analysis and calibration | |
| metrics~\cite{selective} formalize ``knowing when not to act''; to our knowledge their | |
| application to data-cleaning merge decisions is new. | |
| \textbf{Small specialized models.} OpenMed~\cite{openmed} fine-tunes sub-500M encoders | |
| to state-of-the-art biomedical NER, the sister result to our thesis that small | |
| specialized models beat large generic ones on narrow structured tasks; we adopt their | |
| released PII token classifiers for column typing (\S\ref{sec:pii}). | |
| \section{Method} | |
| \label{sec:method} | |
| \subsection{Planner / executor decomposition} | |
| A \emph{profiler} reduces each column to a typed summary: detected semantic type, missing | |
| counts, issue flags, and a value--frequency distribution capped at 80 distinct values | |
| (high-cardinality columns are summarized by their head). The \emph{planner}---either a | |
| deterministic heuristic or our fine-tuned 4B model---maps the profile (plus three sample | |
| rows) to a JSON plan: a list of per-column operations drawn from a closed vocabulary | |
| (\texttt{canonicalize\_categories} with an explicit mapping, \texttt{parse\_date}, | |
| \texttt{standardize\_phone}, \texttt{mask\_pii}, \ldots), table operations, and review | |
| flags. The \emph{executor} applies the plan with pure pandas transforms. The plan is the | |
| only channel through which data changes: every diff is attributable to a named operation | |
| with a rationale, the original table is never mutated, and abstentions are first-class | |
| plan objects. We export per-run decision summaries as OpenTelemetry GenAI spans. | |
| \subsection{Execution-verified synthetic supervision} | |
| \label{sec:sft} | |
| Training pairs are generated by corrupting clean synthetic tables with realistic noise | |
| (casing, aliases, single-character typos with Zipf-distributed long-tail categorical | |
| columns of 30--80 distinct values) while recording the ground-truth plan. The defining | |
| step is \emph{verification by execution}: a candidate example is kept only if | |
| $\textsc{Execute}(\text{dirty}, \text{plan}) = \text{clean}$ cell-for-cell. This closes | |
| the loop between supervision and semantics---a plan that would not actually clean the | |
| table can never become a training label. We augment with real supervision derived from | |
| paired dirty/clean benchmarks by aligning cells and keeping only \emph{learnable} | |
| canonicalizations (a surface form that is a string variant of its target and never a | |
| legitimate value elsewhere), which excludes unlearnable per-cell corrections such as | |
| divergent flight times. The fine-tune is QLoRA (rank 32) over Qwen3-4B-Instruct in | |
| bf16; one practical finding is that the base model's tool-calling prior dominates | |
| free-running generation even after convergent fine-tuning (loss 0.16) and must be | |
| suppressed at decode time by banning the two tool-call tokens. | |
| \subsection{Reference-grounded canonicalization with abstention} | |
| \label{sec:grounding} | |
| For columns whose values reconcile to a known concept type (countries, administrative | |
| regions, cities), canonical forms are never generated: a fuzzy retriever (normalized | |
| edit similarity with first-character blocking and length prefilters) matches each | |
| distinct value against the type-scoped reference (ISO/pycountry; GeoNames cities500, | |
| 196k entries). A value maps to a canonical only if (i) similarity clears a threshold | |
| $\tau{=}0.84$, (ii) the best--second-best margin clears $0.03$ (ambiguity veto: a value | |
| equally close to \texttt{Box} and \texttt{Boaz} abstains), and (iii) the canonical is | |
| cast to the column's observed case convention. Near-misses ($0.70{\le}s{<}\tau$) are | |
| surfaced as review flags. The same wrapper grounds the \emph{model} planner: for | |
| reference-typed columns the model's free-generated mapping is replaced by the grounded | |
| one, so the model can add coverage but never invent a canonical for a grounded type. | |
| \subsection{Plan-level selective prediction: the verified union planner} | |
| \label{sec:verifier} | |
| Grounding constrains reference-typed columns, but the planner's \emph{free} | |
| canonicalization mappings on non-grounded columns remain unguarded---and they are where | |
| real-data precision dies (the fine-tune's raw hospital plan: \hospModelPrecVSix{} | |
| precision at \hospModelRecallVSix{} recall). Rather than retrain, we extend abstention | |
| to the plan itself. A deterministic \emph{verifier} scores every proposed mapping entry | |
| $raw{\to}canon$ with contract-preserving evidence (no cell values emitted, no gold | |
| access): three hard gates distilled from the model's measured failure classes---a value | |
| occurring ${\ge}3$ times is data, not a typo (\emph{errors are rare}); the target must | |
| be a frequent column value clearly dominating the source (no mapping one typo onto | |
| another); digit-bearing codes repair only when the letter part is near-identical---then | |
| a confidence combining edit similarity with frequency support. Entries below a | |
| threshold $\tau$ are dropped to review flags; abstention stays first-class. Sweeping | |
| $\tau$ yields a plan-level precision--coverage curve. The shipped composition, | |
| the \emph{verified union planner}, is the verifier-gated model plan ($\tau{=}0.5$) | |
| unioned with the grounded heuristic's mappings (the model wins per surface form); | |
| the same code path is the product default. | |
| \subsection{Visibility and consensus: four deterministic capabilities} | |
| \label{sec:capabilities} | |
| Four further mechanisms, each motivated by a measured failure regime on never-seen | |
| tables, complete the deterministic machinery. \textbf{(a) Suspect surfacing.} The | |
| profile's value-frequency view is capped, so high-cardinality columns hide their | |
| dirty cells from any planner. Every column profile now carries a bounded | |
| \texttt{suspect\_values} section: rare anomalous surfaces with evidence-backed | |
| repair candidates (frequency dominance, edit similarity, reference membership). | |
| The heuristic planner repairs from suspects under a strict verifier bar | |
| ($\tau_{hc}{=}0.8$) and flags the rest. \textbf{(b) Generic entity reference.} | |
| Open vocabularies (SemTab ToughTables aliases --- derived excluding our benchmark | |
| tables; MusicBrainz search-hint misspellings; RxNorm; Wikidata; ROR) register as a | |
| pluggable reference type. Because the reference is broad, entity-typing a column | |
| additionally requires that ${\ge}20\%$ of its distinct values match the reference | |
| \emph{exactly} --- fuzzy coverage alone over-fires on name-like columns (measured). | |
| This resolves the regime where every surface in a column is unique (no in-column | |
| frequency signal exists at all): five such benchmark tables go from 0.0 to | |
| \ttFOne{} F1 at \emph{zero} damage. \textbf{(c) Cross-row majority voting.} Tables | |
| that repeat a real-world entity across rows (a flight reported by many sources) | |
| carry their own repair signal. A detection step finds compact-token key columns | |
| with small groups (median multiplicity 3--30) and columns whose groups show | |
| \emph{majority-bearing} disagreement with per-group information; a table-level | |
| operation then resolves thin dissenting minorities to the group majority. A | |
| \emph{false-consensus} guard declines when minority shares look like legitimate | |
| correlated updates rather than reporting errors (mean minority share ${\ge}0.25$) | |
| --- a flat volume cap was measured to destroy the legitimate dense-disagreement | |
| regime and replaced. \textbf{(d) Convention conservatism.} The planner never | |
| re-formats an internally consistent column: date and percent ops are gated on | |
| dominant-shape inconsistency (digit and alpha runs collapsed; 90\% rule), | |
| ZIP/postal-named columns are never typed as phones or dates, and Excel-serial | |
| date typing requires a date-suggestive column name. Suppressed minority values | |
| surface as review flags --- abstention is visible, never silent. The verifier | |
| enforces the same gates on model-emitted plans at the verification boundary. | |
| \subsection{PII as a second task instance} | |
| \label{sec:pii} | |
| The identical contract covers PII: a deterministic tier types columns by checksum and | |
| pattern validators (Luhn, IBAN mod-97, SSN/email/phone) over distinct values; an | |
| optional 44M OpenMed-PII token classifier~\cite{openmed} extends coverage to names and | |
| addresses, gated by a sensitive-type allowlist and a column-level coverage vote; and | |
| masking, salted hashing, and join-stable pseudonymization are deterministic executor | |
| operations. Measured briefly: the classifier, though trained on sentence-level | |
| clinical text, transfers to bare cell values --- \piiNameBare{} detection on | |
| person-name cells and \piiAddrBare{} on address cells ($n{=}40$ sampled cells each); | |
| the validator tier, evaluated out-of-distribution on per-type columns from the Gretel | |
| PII test split, types 5/5 covered PII types correctly with 0/7 false positives on | |
| negative columns drawn from real open data; and after deterministic masking, | |
| re-running all validators over the output finds \piiLeakRate{} residual PII --- | |
| residual PII \emph{detectable by our validators}, a circularity we note explicitly: | |
| the leak test can only see what the validator tier sees. | |
| \section{Evaluation Design} | |
| \label{sec:eval} | |
| \textbf{Suite.} Five real-error benchmarks (Raha) plus seeded error injection | |
| (typo/OCR/case/whitespace) over 15 harvested open-data domains (NYC, Chicago, SF, LA, | |
| Seattle, Texas, WA portals; GitHub) $\approx$ 65 datasets per seed. We aggregate as a | |
| \emph{double macro}---mean over error types of mean over datasets, harmonically combined | |
| with the domain macro---so no single table or error type dominates: | |
| \begin{equation*} | |
| \textsc{north} \;=\; \operatorname{HM}\Biggl( | |
| \underbrace{\frac{1}{|T|}\sum_{t \in T}\frac{1}{|D_t|}\sum_{d \in D_t} F_1(d)}_{\text{error-type macro}},\; | |
| \underbrace{\frac{1}{|G|}\sum_{g \in G}\frac{1}{|D_g|}\sum_{d \in D_g} F_1(d)}_{\text{domain macro}} | |
| \Biggr), | |
| \end{equation*} | |
| where $T$ is the set of error types, $G$ the set of data domains, $D_t$ (resp.\ $D_g$) | |
| the datasets carrying error type $t$ (domain $g$), and $\operatorname{HM}$ the harmonic | |
| mean. | |
| \textbf{Churn-neutral metric.} A cell change that is case/whitespace-equivalent to the | |
| input but does not restore the gold counts as nothing: not a fix, not a change, not | |
| damage. Without this, mass case-rewriting inflates precision (we observed $+0.12$ | |
| NORTH from \emph{removing} case matching before the correction); with it, fixing a | |
| case-injected error requires actually acting. We additionally report | |
| \emph{damage}---the rate of semantically corrupting clean cells---and an adversarial | |
| \emph{abstain slice} whose traps are garbage strings (not single-edit variants of any | |
| reference entity; an earlier trap set mis-scored grounding for correctly mapping | |
| \texttt{Boazz}$\to$\texttt{Boaz}). We report both repairs of these metric artifacts as | |
| evidence that gameability must be tested, not assumed. | |
| \textbf{Real vs.\ injected.} Injected typos are in-distribution for frequency | |
| clustering by construction (the canonical is present and dominant in the column), so we | |
| report the real-error and injected slices separately. A TableEG-style audit | |
| quantifies the gap (\texttt{eval/inject\_validity.py}): the injector covers three | |
| of nine error classes (Jensen--Shannon divergence 0.526 bits from the pooled real | |
| distribution over 163{,}607 real errors), and injected-only evaluation would | |
| invert the fingerprint-clustering ranking --- exactly the overstatement the | |
| separate-slice reporting prevents. | |
| \textbf{Scorer validation.} Following GroUSE-style evaluator | |
| testing~\cite{grouse}, the scorer itself is validated against 30 adversarial | |
| known-by-construction cases: a no-op plan must score 0 fixes and 0 damage, an | |
| oracle plan exactly 1.0, vandalizing $k$ of $m$ clean cells must score damage | |
| $k/m$ at precision 0, pure churn (case/whitespace rewrites that do not restore | |
| gold) must count as nothing although a naive scorer would count it, fixes must | |
| require actually acting, and silent edits must trip the audit. All 30 pass | |
| against the shipped scorer unmodified. We additionally cross-score every system | |
| under the \emph{original} Raha/Baran cell-repair protocol side by side with ours | |
| (\texttt{eval/cross\_scoring.py}): rankings agree at Kendall $\tau_b{=}1.0$ on | |
| three of five datasets, and the disagreements cut both ways --- raw string | |
| equality denies credit for numerically-correct serialization restorations (our | |
| movies\_1 repairs), while churn-neutrality charges Baran for load-time | |
| normalizer rewrites its own protocol hides (hospital precision | |
| $0.908\!\to\!0.783$). Neither metric family flatters us uniformly, and our Baran | |
| reproduction calibrates against its published Table~3 within $+0.02$ on three of | |
| the four shared datasets. | |
| \textbf{Contamination.} The Raha-suite benchmarks have been public on GitHub since | |
| 2019 and sit inside every modern base model's training window; we treat them as | |
| potentially contaminated and split our claims accordingly. A verbatim-completion | |
| probe makes the concern concrete: prompted with five fields of a gold hospital row, | |
| a frontier-class model reproduces \textbf{25\%} of the held-out cells exactly | |
| (30/120 cells over 30 rows, exact-substring match), versus \textbf{0\%} (0/120) on a | |
| date-stamped post-training-cutoff wild harvest under the identical protocol | |
| (\texttt{eval/contamination\_probe.py}). The rate is an upper bound on memorization | |
| --- some completions are guessable from the given fields --- but it is not zero, so | |
| results on legacy-public benchmarks (including the all-hospital | |
| Table~\ref{tab:scaling}, whose zero-shot planners may partially benefit from | |
| memorized gold) carry this caveat, while the architecture's trust claims | |
| (zero silent edits, damage accounting, abstention) rest on the date-stamped wild | |
| and GitTables slices, where the probe finds nothing to complete. | |
| \section{Results} | |
| \label{sec:results} | |
| \subsection{Plan-level selective prediction on real errors} | |
| \label{sec:ws1results} | |
| On hospital's \hospErrors{} real errors, the verifier transforms the fine-tune from | |
| unshippable to precise (Figure~\ref{fig:pc}): the raw model plan repairs | |
| \hospModelRecallVSix{} of errors at \hospModelPrecVSix{} precision; gated at $\tau{=}0.5$ it | |
| reaches \modelGatePrec{} precision at \modelGateCov{} coverage (146 of 147 committed | |
| changes correct). The union with the grounded heuristic buys coverage back: | |
| \textbf{\unionGatePrec{} precision at \unionGateCov{} coverage} (\unionChanged{} | |
| changes, \unionFixed{} correct). This turns the system's promise into a measured | |
| sentence: \emph{zero-configuration and zero labels, repair 41\% of real errors at | |
| ${\ge}0.90$ precision, with every declined merge surfaced for review}. For context, | |
| Baran given oracle error positions and 20 gold-labeled tuples per dataset reaches | |
| \realFBaran{} F1 on the same slice (\S\ref{sec:ws4})---selective prediction does not | |
| close a supervised gap, but it makes the zero-label operating point trustworthy, which | |
| is the regime our user occupies. Precision is flat ($0.89$--$0.91$) for | |
| $\tau\in[0.2,0.8]$, so the operating point is not threshold-brittle, and the result is | |
| seed-robust: across three training seeds of the same data recipe the union operating | |
| point is \unionGateThreeSeedPrec{} precision at \unionGateThreeSeedCov{} coverage | |
| (the shipped adapter is the strongest seed), with every seed clearing the | |
| $0.70$-precision/$0.30$-coverage bar decisively. All 3-seed intervals in this paper | |
| are normal-approximation 95\% CIs ($1.96\,\sigma/\sqrt{3}$); the $t$-based | |
| interval at $n{=}3$ is ${\sim}2.7\times$ wider ($\pm 0.031$ here) and every | |
| qualitative claim survives it --- the weakest seed alone clears the bar. | |
| \textbf{Candidate-constrained planning (negative result).} We also tested constraining | |
| the planner's \emph{inputs}: the profiler emits evidence-backed (variant$\,\to\,$ | |
| canonical) candidate pairs (frequency dominance, edit similarity, reference membership) | |
| and the model may only select among them, with a deterministic check dropping | |
| off-candidate mappings to review flags. As a standalone guard it is strong---the raw | |
| plan's precision rises from \hospModelPrecVSix{} to \pairsRawPrec{} with no verifier at | |
| all---but composed with the verifier and union it reaches \pairsUnionPrec{} precision | |
| at \pairsUnionCov{} coverage, slightly \emph{below} the unconstrained pipeline at the | |
| same $\tau$: the candidate cap (top-3 per surface) removes some correct repairs the | |
| verifier would have kept, and the two mechanisms gate the same failure class. We ship | |
| the verifier and keep candidate constraining available but off by default, reporting | |
| this as a measured redundancy rather than a stacked win. | |
| \begin{figure}[t] | |
| \centering | |
| \includegraphics[width=0.62\linewidth]{fig_precision_coverage} | |
| \caption{Plan-level precision--coverage on hospital (509 real errors), sweeping the | |
| verifier threshold $\tau$. The union planner dominates the raw model plan; the shipped | |
| operating point ($\tau{=}0.5$) is annotated.} | |
| \label{fig:pc} | |
| \end{figure} | |
| \subsection{The 4B fine-tune as one planner instantiation} | |
| On frozen synthetic gold, the fine-tuned 4B planner reaches canonicalization micro-F1 | |
| \canonFMultiSeed{} --- versus \canonFBig{} for a much larger zero-shot generalist | |
| prompted identically and \canonFHeur{} for the rule heuristic (best single run | |
| \canonFOursBest; operation-F1 \opFOurs, JSON validity \jsonValidOurs). On real hospital | |
| typos the synthetic-only fine-tune scores 0.000 repair recall; adding 20\% | |
| real-derived supervision lifts it to \hospModelRecall, and a data-scaling iteration | |
| (tripling the real-derived share from three paired benchmarks) reaches | |
| \hospModelRecallVSix{} recall at \hospModelPrecVSix{} precision---approaching the | |
| \frontierZeroShotRecall{} of a frontier-scale zero-shot model. The scaling gain is seed-robust: $+0.09$ | |
| canonicalization F1 over the base mix under identical protocol, with non-overlapping | |
| 3-seed confidence intervals. Real, execution-verified pairs are what transfer: | |
| the same iteration found frequency-derived and algorithm-cleaned labels both | |
| \emph{reduce} quality, consistent with our grounding thesis. | |
| \subsection{Grounding vs.\ clustering} | |
| With the errors-are-rare frequency gates now in both paths, grounding and frequency | |
| clustering are comparable on hospital alone (repairs-only, churn-neutral: | |
| \hospPrecGrounded{} precision at \hospRecallGrounded{} recall grounded vs | |
| \hospPrecFreq{} at \hospRecallFreq{} clustering---hospital's dominant errors are | |
| in-column typos, clustering's best case). Grounding's margin appears where references | |
| matter: across the five-benchmark real-error macro it reaches \ablFullRealF{} versus | |
| \ablNoGroundRealF{} for the frequency-clustering ablation ($+29\%$), and it carries | |
| the behavioral guarantees below. | |
| On the full suite against OpenRefine (Table~\ref{tab:money}), the result splits | |
| cleanly by regime, and we report both. On the \emph{real-error} slice---the regime the | |
| tool exists for---grounded cleaning reaches REAL-F1 \realFGrounded{}, $3.9\times$ | |
| OpenRefine kNN (\realFORKnn) and $5.7\times$ fingerprint (\realFORFp), with seed CIs of | |
| $\pm$\northGroundedCI. Provenance: the grounded and OpenRefine rows of | |
| Table~\ref{tab:money} are regenerated at the current system head (2026-06-12, | |
| post-capability, scorer fix in); the dagger rows keep their original capture | |
| provenance. The June-10 freeze system measured REAL-F1 \realFGroundedFreeze{} on the | |
| same protocol --- the $+0.05$ difference is the measured contribution of the four | |
| deterministic capabilities (\S\ref{sec:capabilities}) on the real-error slice. On the \emph{injected} slice, fingerprint clustering wins | |
| (\injFORFp{} vs \injFGrounded) at near-zero damage: our case/whitespace injectors are | |
| exactly the perturbations key-collision normalizes away, so this is its home game and | |
| we say so. kNN clustering---the method that, like us, attempts typo repair---loses on | |
| both slices while incurring the highest damage among baselines (\damageORKnn), the | |
| no-reference over-merging failure the grounding was built to prevent. The shipped | |
| verified-union system's suite row (REAL-F1 \modelRealF, damage \modelDamage) shows the | |
| grounding wrapper and heuristic union carry entity canonicalization on these datasets --- | |
| the model's contribution concentrates on the synthetic regime and hospital repair | |
| (\S\ref{sec:ws1results}), and the verifier cuts its suite damage to \modelDamage, | |
| $6\times$ below the grounded heuristic's \damageGrounded{} (HEAD damage vs the union | |
| row's freeze-time capture --- a disclosed basis mix). Within our own ablations | |
| (June-10 freeze basis throughout), removing grounding cedes $22\%$ of real-error | |
| F1 (\ablNoGroundRealF{} vs \ablFullRealF) and forfeits the behavioral guarantees: | |
| perfect abstention on adversarial traps (\ablFullAbstain) versus | |
| \ablNoAbstainAbstain{} without abstention, and reference-vetoed wrong merges (e.g.\ | |
| \texttt{guntxrsvillx}$\to$\texttt{huntsville}). | |
| \begin{table}[t] | |
| \centering | |
| \caption{Wide-suite comparison, 3 injection seeds, churn-neutral metric. NORTH is the | |
| double-macro harmonic mean; REAL-F1 is the real-error slice. Regenerated at the | |
| current system head (2026-06-12); the June-10 freeze system measured | |
| \realFGroundedFreeze{} REAL-F1 / \northGroundedFreeze{} NORTH on the same protocol.} | |
| \label{tab:money} | |
| \begin{tabular}{lcccccc} | |
| \toprule | |
| System & NORTH & $\pm$95\%CI & REAL-F1 & INJ-F1 & damage & abstain \\ | |
| \midrule | |
| Grounded (ours) & \northGrounded & \northGroundedCI & \textbf{\realFGrounded} & \injFGrounded & \damageGrounded & \ablFullAbstain \\ | |
| OpenRefine fingerprint & \northORFp & 0.000 & \realFORFp & \injFORFp & \damageORFp & 1.000 \\ | |
| OpenRefine kNN & \northORKnn & 0.002 & \realFORKnn & 0.148 & \damageORKnn & 1.000 \\ | |
| Verified union 4B (shipped)$^{\dagger}$ & -- & -- & \modelRealF & -- & \modelDamage & \modelAbstain \\ | |
| \midrule | |
| Baran (oracle det.\ + 20 labels)$^{\ddagger}$ & -- & -- & \realFBaran & -- & \damageBaran & -- \\ | |
| Jellyfish-13B (ED+DI)$^{\ddagger}$ & -- & -- & \realFJelly & -- & \damageJelly & -- \\ | |
| \bottomrule | |
| \end{tabular} | |
| \smallskip | |
| {\small $^{\dagger}$single seed, REAL + typo-injected slice only (GPU cost); other rows | |
| are 3-seed means. $^{\ddagger}$real slice only, disclosed protocol asymmetries | |
| (\S\ref{sec:ws4}): Baran uses oracle error positions + gold labels; Jellyfish is our | |
| detect-then-impute composition with seen-data caveats.} | |
| \end{table} | |
| \subsection{Generalization to never-seen tables} | |
| \label{sec:wild} | |
| The freeze-version system above was then pointed at data it had never seen, under | |
| three new harnesses (all released with this paper as the \textsc{WildClean} bundle). | |
| \textbf{(1) Paired bench}: \nPairs{} dirty/gold pairs spanning the Raha suite, SemTab | |
| ToughTables, government open-data typo corpora, entity-matching tables, and | |
| LLM-cleaning evaluation sets. On the 35 pairs from sources absent from training --- | |
| a count that coincidentally equals, but is distinct from, the \nWild{} gold-free wild | |
| tables of harness~(2) below --- the | |
| post-freeze system scores \textbf{macro F1 \unseenMacroF{} at damage | |
| \unseenMacroDamage}. The largest single contribution is the regime | |
| \S\ref{sec:capabilities}(b) unlocks: on five all-unique entity tables where no | |
| in-column frequency signal exists, F1 moves from $0.0$ to \ttFOne{} at zero damage. | |
| Cross-row voting (\S\ref{sec:capabilities}c) is the second: flights---many sources | |
| reporting the same flight---goes from \flightsBaseF{} to \flightsVoteF{} F1 | |
| heuristic-only, and the heuristic hospital path doubles from \hospBaseHeur{} to | |
| \hospVoteHeur{}. The hospital union gate is invariant under all of this | |
| (\unionGatePrec{} at \unionGateCov). \textbf{(2) Wild bench}: \nWild{} uncurated | |
| in-the-wild tables (open-data portals, GitHub, Kaggle) with no gold; we score seeded | |
| inject--recovery on each table's own data (mean recovery \wildRecovery{} over the 34 | |
| tables with inject scores; one table has none) plus a | |
| behavioral audit: every run yields a valid plan, every changed cell is attributable | |
| to a logged operation --- \textbf{zero silent edits across all \nWild{} tables}. | |
| \textbf{(3) Trust audit at scale}: \nTrust{} GitTables tables, same property --- | |
| \nTrust{}/\nTrust{} valid plans, zero crashes, zero silent edits. The held-out-source | |
| generalization metric (train and evaluation drawn from disjoint benchmark sources) | |
| remains low in absolute terms (GEN-F1 \genFTwo{}, variant-recall \genVRTwo{}, damage | |
| \genDamageTwo): cleaning unfamiliar tables is far from solved, and we report the | |
| number to anchor the next section's claim about \emph{where} the capability that does | |
| exist actually lives. | |
| \subsection{Where capability lives: a bounded null for fine-tuning} | |
| \label{sec:negative} | |
| Every attempt to move never-seen-table performance through the model weights failed; | |
| every gain in \S\ref{sec:wild} came from deterministic machinery plus the verifier. | |
| Five further supervised fine-tunes --- adding 109k harvested real-world alias pairs | |
| (ToughTables-derived, MusicBrainz search hints, RxNorm, OpenFlights), error-dense | |
| episode mixes, and a suspects-contract retrain --- left held-out GEN-F1 | |
| \emph{statistically bounded}: every retrain's delta is positive but negligible (mean | |
| $+0.003$), never approaching the pre-registered $\delta{=}0.05$. ``Bounded'' is a | |
| tested equivalence claim, not an eyeballed one~\cite{lakens}: across the five-retrain | |
| series the mean held-out GEN-F1 delta (retrain minus champion) is $+0.0028$ | |
| (90\% bootstrap CI $[+0.0008, +0.0049]$, strictly positive; | |
| 10{,}000 resamples, seed 42; per-dataset granularity, $n{=}15$ over 3 held-out | |
| sources $\times$ 5 retrains --- per-pair deltas do not exist for the retrain | |
| series, so within-retrain deltas are clustered and we add a retrain-level | |
| robustness check, $n{=}5$ macro deltas), and TOST rejects effects beyond the | |
| pre-registered SESOI of $\pm 0.05$ ($p = 8.0\times10^{-16}$; retrain-level check | |
| $p = 8.3\times10^{-8}$). One disclosure sharpens the clustering caveat: two | |
| retrains' held-out rows are \emph{bit-identical} --- mechanically verified as | |
| verifier-collapse, not a data error (their raw plans share zero mapping entries, | |
| 9 vs.\ 82 on flights, yet the verifier kills all of both, so each union | |
| degenerates to the same deterministic plan; | |
| \texttt{eval/results/equivalence\_coincidence.json}) --- so the $n{=}15$ rows | |
| carry fewer independent observations than their count suggests, which is exactly | |
| why the $n{=}5$ retrain-level test is the one we lean on. The collapse itself is | |
| the finding in miniature: different weights, same held-out behavior, because the | |
| verifier and the deterministic machinery decide what survives. Two reconciliations make the claim auditable. First, the | |
| basis: the equivalence series is scored against the champion's absolute GEN-F1 of | |
| \genChampionBasis{}, while the \genFTwo{} of \S\ref{sec:wild} is the \emph{shipped | |
| system} at the post-freeze HEAD with all deterministic capabilities --- the | |
| equivalence series scores each retrain's model-union path at its own capture time, | |
| so the two figures share a metric but not a basis. Second, the SESOI: weight | |
| interventions move GEN-F1 by at most $0.005$, while the deterministic machinery of | |
| \S\ref{sec:capabilities} moved the unseen-pair macro from $0.10$ to \unseenMacroF{} | |
| --- $\delta{=}0.05$ sits an order of magnitude above the measured weight effect and | |
| well below the machinery effect, which is exactly the boundary the test is meant to | |
| police. Mixing harvested pairs into the training blend | |
| \emph{diluted} the synthetic skill the executor verifies (a monotonic dilution law | |
| across mix ratios). A GRPO pilot using the executor as a verifiable reward (the | |
| direction RLVR table work~\cite{tabler1} motivates) was negative in all three arms at | |
| 4B/LoRA scale: the main arm and a KL-anchored variant degraded plan-format validity, | |
| and a random-reward control arm reproduced the same drift, identifying it as an RL | |
| artifact rather than signal~\cite{spurious}. We state this as a \emph{bounded} null, | |
| not a universal one: at 4B/LoRA scale, under our propose/execute protocol and | |
| training budgets, no weight intervention we ran produced measurable movement in | |
| never-seen-table repair --- profiling visibility, reference grounding, cross-row | |
| consensus, convention conservatism, and plan-level verification carry the capability | |
| that exists. The bound is explicit: results with full-scale RL infrastructure | |
| (execution-verified rewards on multi-GPU RLVR stacks~\cite{spreadsheetrl,tabler1}) | |
| show task skill moving at the same parameter scale, so our claim is about what | |
| SFT-and-pilot-RL buy in this protocol class, not about reinforcement learning in | |
| general. A second explicit bound: every weight experiment here uses the Qwen3 | |
| family --- and the very work we cite to explain the control arm's drift documents | |
| that random-reward GRPO effects are themselves family-sensitive~\cite{spurious} | |
| --- so the null is stated for Qwen3-class models pending a cross-family | |
| replication. Concurrent evaluations corroborate the mechanism from independent | |
| directions~\cite{distort,debate}. The practical corollary is unusual but actionable: | |
| a contributor who wants to improve a system like this should write a deterministic | |
| capability and gate it with the verifier, not collect more training data. | |
| The null extends to test-time compute --- with one instructive exception that | |
| \emph{confirms} the architecture claim. Self-consistency \emph{voting} over | |
| $N{=}16$ temperature-0.7 samples (cell-edit-level majority, run through the | |
| identical verifier--union pipeline) yields 0.906 precision at 0.454 coverage | |
| versus 0.9055 at 0.4519 for matched greedy decoding on the same local runtime --- | |
| a null at matched precision, the visibility law from the test-time side: voting | |
| cannot surface repairs the profile does not expose, and it actively discards | |
| verified-recoverable coverage. But pooling \emph{every} mapping from all 16 | |
| samples and letting the verifier filter the union gives the best operating point | |
| we measure for the 4B: \textbf{0.911 precision at 0.483 coverage} ($+0.6$ points | |
| precision, $+7.1$ points coverage over the shipped gate; an independent $N{=}8$ | |
| replication reproduces the \emph{voted} point to $\pm 0.0003$ precision / | |
| $\pm 0.002$ coverage, and the greedy anchor exactly). The lesson is the paper's thesis in miniature: sampling | |
| helps only as a \emph{candidate generator}; consensus adds nothing the verifier | |
| does not already provide --- pool candidates, verify, do not vote. Separately, | |
| the local capture path itself (Q8 quantization with grammar-constrained decoding) | |
| is worth $+3.9$ points of coverage over the original Modal capture at equal | |
| precision. | |
| \subsection{Zero-label capability scaling: the verifier harness is planner-agnostic} | |
| \label{sec:scaling} | |
| The negative result bounds what fine-tuning small weights buys; it says nothing | |
| about raw capability. To separate the two we dropped zero-shot, $\leq$32B | |
| open-weights planners --- with \emph{no} task training --- into the identical | |
| hospital pipeline the 4B fine-tune uses: same prompt contract, same | |
| verify($\tau{=}0.5$), same union with the grounded heuristic | |
| (Table~\ref{tab:scaling}). devstral-small-2-24B and gemma4-31B both reach | |
| \textbf{\scalePrecBig{} precision at \scaleCovBig{} coverage} --- exceeding the | |
| fine-tune's union point of \unionGatePrec{} at \unionGateCov{} --- while | |
| nemotron-30B reaches \scalePrecNemo{} at \scaleCovNemo{} with JSON-plan validity | |
| 0.4 (validity is part of the measurement: a planner that cannot reliably emit the | |
| plan schema loses coverage before capability is measured). gpt-oss-20B is | |
| excluded as a serving failure, documented rather than scored as capability: the | |
| hosted proxy returned empty content on every planning call despite full-length | |
| generation. The arm is multi-family (Mistral, Google, NVIDIA), which addresses | |
| the single-family bound of \S\ref{sec:negative} for the inference side; the | |
| weight-training null itself remains Qwen3-scoped. Disclosure: these models were | |
| measured via hosted inference for speed; all are $\leq$32B open weights and | |
| locally deployable in principle. The interpretation we draw is the paper's | |
| sharpest: SFT at 4B does not buy held-out generalization (\S\ref{sec:negative}), | |
| but raw capability at 24--31B does lift the same harness --- the verifier/union | |
| architecture is the portable contribution, converting any sufficiently capable | |
| planner into a trustworthy cleaner. | |
| \begin{table}[t] | |
| \centering | |
| \caption{Zero-shot $\leq$32B planners in the identical verify($\tau{=}0.5$)+union | |
| harness, hospital's \hospErrors{} real errors. Validity = fraction of planning | |
| calls returning schema-valid JSON. Runtime = wall-clock for the planning calls on | |
| hosted endpoints (single capture, no seeds; the 4B row is a prior Modal A100 | |
| capture with no comparable local figure). Each scaling row is a single capture; | |
| the primary evidence is the union coverage delta ($+0.07$) at matched-or-better | |
| precision, not any single cell. For context, 16-sample pooling lifts the 4B | |
| fine-tune to $0.911@0.483$ at $16\times$ planning compute | |
| (\S\ref{sec:negative}); the 24--31B planners reach $0.915@0.485$ in a single | |
| greedy pass --- single-pass capability versus test-time compute, both converted | |
| into trustworthy operating points by the same verifier. Bold marks the best union operating point. | |
| gpt-oss-20B excluded (serving failure: empty | |
| proxy responses, not measurable capability). | |
| The identical devstral/gemma rows are a verified counting coincidence, not a | |
| scoring artifact: their applied cell-edit sets share 266 of 270 cells, each | |
| commits 4 model-specific repairs (all correct), and the totals coincide | |
| (\texttt{eval/results/scaling\_coincidence.json}). | |
| } | |
| \label{tab:scaling} | |
| \footnotesize | |
| \begin{tabular}{lccccc} | |
| \toprule | |
| Planner & Params & Gated P@C & Union P@C & Validity & Runtime (s) \\ | |
| \midrule | |
| ScrubData-v6 (Qwen3-4B fine-tune) & 4B & 0.993 @ 0.287 & 0.905 @ 0.413 & --- & --- \\ | |
| devstral-small-2 (Mistral) & 24B & 0.943 @ 0.426 & \textbf{0.915 @ 0.485} & 1.0 & \runtimeDevstral \\ | |
| nemotron-3-nano (NVIDIA) & 30B & 1.000 @ 0.138 & 0.877 @ 0.336 & 0.4 & \runtimeNemo \\ | |
| gemma4 (Google) & 31B & 0.943 @ 0.426 & \textbf{0.915 @ 0.485} & 1.0 & \runtimeGemma \\ | |
| \bottomrule | |
| \end{tabular} | |
| \end{table} | |
| \subsection{Ablations} | |
| All ablations are 3-seed means (CIs $\le\pm0.003$). Removing abstention costs $-0.013$ | |
| NORTH, raises damage to \ablNoAbstainDamage{} (from \ablFullDamage), and collapses trap | |
| abstention to \ablNoAbstainAbstain. Removing the ambiguity margin costs $-0.006$ with | |
| $+0.001$ damage. Removing case matching costs $-0.002$ under the churn-neutral metric | |
| (and \emph{gained} $+0.12$ under the uncorrected metric---the artifact). Replacing | |
| grounding with frequency clustering gains $+0.020$ NORTH, all of it from the injected | |
| slice (\S\ref{sec:eval}), while ceding $-0.039$ real-error F1---the trade the system | |
| refuses by design. | |
| \subsection{Learned-repair baselines under disclosed protocols} | |
| \label{sec:ws4} | |
| We additionally run two learned-repair baselines on the real-error (Raha) slice, | |
| under the identical churn-neutral metric but with honestly disclosed protocol | |
| asymmetries. \textbf{Baran}~\cite{raha} is semi-supervised: we run its reference | |
| configuration---oracle error positions from the dirty/gold diff plus 20 gold-labeled | |
| tuples per dataset (its package default), without the optional Wikipedia-pretrained | |
| value models. It reaches REAL-F1 \realFBaran{}$\,\pm$\realFBaranCI{} (3 label-sampling | |
| seeds) at \damageBaran{} damage---an upper bound under a strictly more informed | |
| protocol than ours (zero labels, no oracle detection); with oracle positions it can | |
| essentially only edit true-error cells, so its near-zero damage is structural. | |
| \textbf{Jellyfish-13B}~\cite{jellyfish} publishes per-cell error detection and | |
| imputation but no repair task; we compose the two (detect, then impute flagged cells | |
| with the attribute masked) --- a pipeline of our construction, not theirs. It scores | |
| REAL-F1 \realFJelly{} at \damageJelly{} damage (single seed, recommended decoding; | |
| note hospital is in its instruction-tuning data and flights/rayyan in its published | |
| evaluation suite, so these numbers may flatter it). Neither baseline is run on the | |
| 56-spec injected suite (computationally and methodologically out of scope for | |
| semi-supervised and per-cell-LLM repair); their NORTH/INJ-F1 cells in | |
| Table~\ref{tab:money} are blank by design. The comparison locates our contribution: | |
| zero-config systems (ours, OpenRefine) occupy a different protocol class from | |
| supervised repair, and the verifier (\S\ref{sec:ws1results}) is what makes the | |
| zero-config class precise enough to trust, not what closes the labeled gap. | |
| Table~\ref{tab:perdataset} breaks the real-error slice down per dataset at HEAD. | |
| The verified-union rows are reported with their honest shape: off hospital the | |
| union turns ultra-conservative --- on rayyan it commits 12 changes at 0.001 | |
| damage; on beers it holds precision 0.546 at recall 0.018. The gate's precision | |
| premise transfers as \emph{safety} (union damage stays at 0.001--0.080) but not | |
| as coverage. The movies\_1 union cell ($^{q}$: local Q8 capture, the disclosed | |
| quantized protocol) is the instructive worst case: on entity-rich name columns | |
| the quantized planner proposes plausible-but-wrong merges | |
| (\texttt{The Longest Day}$\,\to\,$\texttt{The Longest Yard}); the verifier kills | |
| most, and what leaks through is damage within the disclosed band with zero | |
| credited fixes --- the planner contributes nothing there, and the system's value | |
| is that it \emph{contains} a bad planner rather than amplifying it. This directly | |
| answers the co-adaptation concern: hospital is where the model's learned mappings | |
| live, and elsewhere the system abstains or contains rather than guesses. | |
| \begin{table}[t] | |
| \centering | |
| \caption{Per-dataset real-error results (Raha slice), churn-neutral F1 / damage. | |
| Grounded is the HEAD deterministic system; OR = OpenRefine reimplementations; | |
| Union is the verified union planner ($\tau{=}0.5$) where a captured model plan | |
| exists (movies\_1 capture pending); Baran uses oracle error positions + 20 gold | |
| labels (mean of 3 label-sampling seeds) and is a supervised reference, not a | |
| peer.} | |
| \label{tab:perdataset} | |
| \footnotesize | |
| \begin{tabular}{lccccc} | |
| \toprule | |
| Dataset & Grounded (HEAD) & OR fingerprint & OR kNN & Verified union & Baran (oracle+20) \\ | |
| \midrule | |
| hospital & 0.258 / .066 & 0.000 / .000 & 0.189 / .083 & 0.567 / .001 & 0.827 / .004 \\ | |
| beers & 0.025 / .005 & 0.194 / .000 & 0.086 / .074 & 0.035 / .001 & 0.918 / .000 \\ | |
| flights & 0.127 / .082 & 0.000 / .000 & 0.014 / .065 & 0.035 / .080 & 1.000 / .000 \\ | |
| rayyan & 0.000 / .118 & 0.000 / .001 & 0.002 / .008 & 0.000 / .001 & 0.402 / .010 \\ | |
| movies\_1 & 0.714 / .025 & 0.002 / .018 & 0.001 / .072 & 0.000 / .025$^{q}$ & 0.909 / .001 \\ | |
| \midrule | |
| macro F1 & \realFOursHead & 0.039 & 0.058 & --- & \realFBaran \\ | |
| \bottomrule | |
| \end{tabular} | |
| \end{table} | |
| \subsection{A matched label budget separates the supervision regimes} | |
| \label{sec:labelcurve} | |
| The Baran comparison above is two points (zero labels, twenty labels); the | |
| matched-budget curve in Figure~\ref{fig:labelcurve} measures what each label is | |
| worth to each system on the same five-dataset real-error macro. At zero labels | |
| Baran --- even \emph{retaining} its oracle error positions --- repairs nothing | |
| (F1 \realFBaranZero, 3 seeds): its value models have nothing to learn from. | |
| ScrubData operates at \realFOursHead{} with zero configuration. With labels Baran | |
| climbs steeply (\realFBaranFive{} at $k{=}5$, \realFBaran{} at $k{=}20$): the two | |
| systems occupy complementary supervision regimes, a relationship now measured | |
| rather than asserted. ScrubData's own $k$-label arm uses the labels \emph{only} | |
| to validate and expand the verifier accept set --- no retraining, no oracle | |
| positions: $\realFOursFive \pm 0.023$ at $k{=}5$ and $\realFOursTwenty \pm 0.012$ | |
| at $k{=}20$ (3 label-sampling seeds). The disclosed asymmetry stands at every | |
| budget: Baran keeps oracle error positions throughout, so the curve is an upper | |
| bound in its favor. | |
| \begin{figure}[t] | |
| \centering | |
| \includegraphics[width=0.62\linewidth]{fig_label_curve} | |
| \caption{Matched-budget label curve, five-dataset real-error macro F1. At | |
| $k{=}0$ Baran repairs nothing even with oracle error positions retained; | |
| ScrubData operates at \realFOursHead{} with zero configuration. With labels | |
| Baran climbs steeply --- complementary supervision regimes, measured. Error | |
| bars ($\pm$) are standard deviations over 3 label-sampling seeds; the Baran | |
| $k{=}20$ point reuses the 3-seed baseline run of \S\ref{sec:ws4}.} | |
| \label{fig:labelcurve} | |
| \end{figure} | |
| \subsection{Degenerate baselines and cost-weighted damage} | |
| \label{sec:degenerate} | |
| Four degenerate policies pin the metric's floor and ceiling on the full 42-pair | |
| bench (Table~\ref{tab:degenerate}). No-op and oracle land exactly at 0 and 1; | |
| abstain-all is score-identical to no-op because the repair metric is flag-blind | |
| by design (abstentions are audited separately); seeded random editing of 5\% of | |
| cells is vandalism the metric must punish. Since F1 alone under-punishes | |
| vandalism, we add a cost-weighted score in the Effective-Reliability style, | |
| $\Phi_c = (\mathrm{fixes} - c\cdot\mathrm{damaged})/\mathrm{errors}$ at | |
| $c \in \{1, 5, 10\}$: random editing scores $-0.49$ to $-4.89$, while the | |
| shipped system stays positive at $c{=}1$ (\degShippedPhiOne) --- and goes | |
| negative at higher $c$, which is the honest reading: at 10:1 cost asymmetry, | |
| only near-zero-damage operating points (the verified union) are defensible. | |
| One disclosure: the oracle acceptance check itself surfaced a scorer artifact | |
| --- 3 cells in 1.79M held the literal string \texttt{Nan} (a first name), which | |
| parses to float NaN and was unequal to itself --- now fixed in | |
| \texttt{eval/metrics.py} with a regression test; published numbers shift by | |
| less than $10^{-4}$. | |
| \begin{table}[t] | |
| \centering | |
| \caption{Degenerate policies pin the metric (42 pairs, churn-neutral macro; | |
| random-edit: seeded, 5\% of cells). $\Phi_c$ is micro-summed | |
| $(\mathrm{fixes} - c\cdot\mathrm{damaged})$ per benchmark error. ``Shipped'' | |
| here is the deterministic grounded path on the 42 pairs (damage | |
| \degShippedDamage), distinct from the verified-union suite row of | |
| Table~\ref{tab:money} (damage \modelDamage).} | |
| \label{tab:degenerate} | |
| \small | |
| \begin{tabular}{lccccccc} | |
| \toprule | |
| Policy & F1 & P & R & damage & $\Phi_1$ & $\Phi_5$ & $\Phi_{10}$ \\ | |
| \midrule | |
| no-op & 0.000 & 1.000 & 0.000 & 0.000 & $0.00$ & $0.00$ & $0.00$ \\ | |
| abstain-all & 0.000 & 1.000 & 0.000 & 0.000 & $0.00$ & $0.00$ & $0.00$ \\ | |
| random-edit & 0.000 & 0.001 & 0.001 & 0.049 & $-0.49$ & $-2.45$ & $-4.89$ \\ | |
| oracle & 1.000 & 1.000 & 1.000 & 0.000 & $+1.00$ & $+1.00$ & $+1.00$ \\ | |
| shipped & \degShippedF & \degShippedP & 0.308 & \degShippedDamage & $+0.13$ & $-1.37$ & $-3.26$ \\ | |
| \bottomrule | |
| \end{tabular} | |
| \end{table} | |
| \subsection{Calibration of abstention} | |
| \label{sec:calibration} | |
| On a probe of reference-entity typos plus garbage traps, retrieval confidence is a | |
| usable selective-prediction signal: AURC \aurc, ECE \ece{} (over-confident; | |
| temperature scaling is future work), and | |
| precision rises monotonically with threshold---\precAtDefault{} precision at the default | |
| $\tau{=}0.84$ (coverage \covAtDefault), and $\geq$95\% precision at | |
| $\tau{=}\threshNinetyFive$ (coverage \covNinetyFive). Figure~\ref{fig:rc} shows the | |
| risk--coverage curve. | |
| \begin{figure}[t] | |
| \centering | |
| \includegraphics[width=0.62\linewidth]{fig_risk_coverage} | |
| \caption{Risk--coverage for grounded city reconciliation (650 probes). Operating points | |
| annotated; the confidence supports thresholded abstention.} | |
| \label{fig:rc} | |
| \end{figure} | |
| \section{Limitations} | |
| Reference coverage is the recall ceiling: entities absent from the taxonomy abstain by | |
| design, which is safe but not helpful; coverage work (larger gazetteers, ROR for | |
| organizations) moves recall directly. Our damage metric is convention-tolerant for case | |
| and whitespace but still counts alias expansion (\texttt{NYC}$\to$\texttt{New York}) as | |
| damage when the gold keeps the alias---a value-level convention question we leave open. | |
| The confidence signal is over-confident (ECE \ece); temperature scaling is future | |
| work. The injected half of the suite, while seeded and reproducible, inherits the | |
| injector's error model; we mitigate with the real-error slice and report both. All | |
| weight-training experiments (SFT and GRPO) use a single model family (Qwen3), so | |
| the negative result of \S\ref{sec:negative} is family-scoped until replicated on a | |
| second family. PII | |
| coverage is English-only, and we make no de-identification guarantee. Finally, the | |
| fine-tune headline is reported with multi-seed confidence intervals, but the wide-suite | |
| model row is single-seed for cost reasons and scoped as such. | |
| \section{Conclusion} | |
| A planner/executor decomposition with plan-level selective prediction --- the model | |
| proposes, a deterministic engine executes, a verifier gates every mapping --- turns | |
| LLM data cleaning from a trust liability into an auditable system: every change is a | |
| named, reversible operation; uncertain actions become review flags rather than silent | |
| corruptions; and the evaluation itself is built to resist gaming. The post-freeze | |
| program sharpened the architecture into a finding: across | |
| five further fine-tunes and a three-arm GRPO pilot, the weights never moved | |
| never-seen-table performance --- deterministic visibility, grounding, consensus, and | |
| verification did, at zero silent edits across \nWild{} wild tables and a | |
| \nTrust{}-table trust audit. The scaling arm completes the picture: the bounded null | |
| is about fine-tuning small weights, not about capability --- two of three zero-shot | |
| 24--31B planners dropped into the unchanged verifier harness exceed the | |
| fine-tune's operating point (\S\ref{sec:scaling}), so the architecture is | |
| planner-agnostic: capability gains arrive as better operating points without | |
| retraining. The shipped system runs | |
| entirely locally on commodity hardware and no data leaves the machine; the | |
| scaling-arm planners were measured via hosted endpoints, but all are locally | |
| deployable open weights. We believe the recipe---propose/execute decomposition, | |
| verification-by-execution, retrieval-grounded outputs, and selective prediction over | |
| deterministic capabilities---is a template for deploying small specialized models on | |
| other structured tasks. | |
| \section*{Reproducibility} | |
| \begin{sloppypar} | |
| The model weights are public: | |
| \url{https://huggingface.co/ricalanis/scrubdata-qwen3-4b-v6-q8}. Code, evaluation | |
| suite, and result artifacts are released at the project repository, | |
| \url{https://github.com/ricalanis/scrubdata-hackathon} (public upon publication, | |
| available to reviewers from the initial submission). The \textsc{WildClean} | |
| bundle --- redistributable dirty/gold pairs, the GitTables audit slice, open | |
| vocabularies, result JSONs, and license-gated loaders for the non-redistributable | |
| pairs --- is a public Hugging Face dataset | |
| (\url{https://huggingface.co/datasets/ricalanis/wildclean}). The shipped product | |
| planner is the identical code path measured here (\texttt{scrubdata/active.py}). | |
| \end{sloppypar} | |
| \paragraph{Release integrity.} Our own reproducibility QA discovered that the | |
| published Q8\_0 GGUF was corrupted by an export bug (the export declared a wrong | |
| end-of-generation token id inside the Qwen3 vocabulary, degenerating into | |
| tool-call loops on all runtimes; a base-model control isolated the fault to the | |
| export, not the adapter). It has been re-exported from the v6 adapter and | |
| replaced under the same filename, with both sha256 checksums recorded in the | |
| model card's Integrity section. Third-party reproduction of the model-path | |
| numbers additionally requires constrained decoding on long prompts --- | |
| \texttt{format=json} under Ollama, or | |
| \texttt{suppress\_tokens=[151657,151658]} under transformers --- which is now | |
| documented in the model card and \texttt{notebooks/Modelfile}. | |
| \paragraph{Setup.} Clone the repository and run \texttt{uv sync} (Python 3.12; | |
| \texttt{uv} resolves the pinned environment). The non-redistributable benchmark | |
| pairs materialize from their original sources with the \textsc{WildClean} | |
| \texttt{loaders.py}. Model-path results additionally need the released Q8\_0 GGUF | |
| served by a local Ollama (\texttt{SCRUBDATA\_MODEL}); every deterministic-path | |
| number runs with no model at all. Baran runs in the separate pinned environment | |
| documented at the top of \texttt{eval/run\_baran.py}; Jellyfish-13B runs remotely | |
| via Modal. | |
| \paragraph{One command per reported number} (all from the repository root, at the | |
| released revision): | |
| \begin{center} | |
| \footnotesize | |
| \begin{tabular}{@{}ll@{}} | |
| \toprule | |
| Reported result & Command \\ | |
| \midrule | |
| Wide-suite comparison (Table~\ref{tab:money}) & \texttt{python -m eval.run\_real\_multi --out eval/results} \\ | |
| Precision--coverage curve + gate & \texttt{python -m eval.precision\_curve} \\ | |
| \quad (Figure~\ref{fig:pc}, \S\ref{sec:ws1results}) & \texttt{\ \ --plan eval/results/v6\_hospital\_raw\_plan.json --union} \\ | |
| Ablations & \texttt{python -m eval.ablations} \\ | |
| Calibration (Figure~\ref{fig:rc}) & \texttt{python -m eval.calibration} \\ | |
| PII leak test & \texttt{python -m eval.pii\_leak} \\ | |
| Baran baseline & \texttt{python eval/run\_baran.py}, then \\ | |
| & \texttt{python -m eval.baselines\_learned --score-baran} \\ | |
| Jellyfish baseline & \texttt{modal run scripts/modal\_jellyfish.py} \\ | |
| \midrule | |
| Paired bench (\S\ref{sec:wild}) & \texttt{python -m eval.paired\_bench} \\ | |
| Wild bench (\S\ref{sec:wild}) & \texttt{python -m eval.wild\_bench} \\ | |
| GitTables trust audit (\S\ref{sec:wild}) & \texttt{python -m eval.gittables\_audit} \\ | |
| Held-out-source generalization & \texttt{python -m eval.generalization} \\ | |
| \midrule | |
| Scorer validation (\S\ref{sec:eval}) & \texttt{python -m pytest tests/test\_wildclean\_scorer.py} \\ | |
| Degenerate baselines (Table~\ref{tab:degenerate}) & \texttt{python -m eval.degenerate} \\ | |
| TOST equivalence (\S\ref{sec:negative}) & \texttt{python -m eval.equivalence} \\ | |
| Label curve (Figure~\ref{fig:labelcurve}) & \texttt{python -m eval.label\_curve} \\ | |
| Per-dataset table (Table~\ref{tab:perdataset}) & \texttt{python -m eval.raha\_table} \\ | |
| Self-consistency vote/pool (\S\ref{sec:negative}) & \texttt{python -m eval.sc\_rerank --model scrubdata-ft --n 16} \\ | |
| Scaling arm (Table~\ref{tab:scaling}) & \texttt{python -m eval.scaling\_arm} \\ | |
| \bottomrule | |
| \end{tabular} | |
| \end{center} | |
| \begin{thebibliography}{20} | |
| \bibitem{raha} M.~Mahdavi, Z.~Abedjan, R.~Castro Fernandez, S.~Madden, M.~Ouzzani, | |
| M.~Stonebraker, N.~Tang. Raha: A Configuration-Free Error Detection System. SIGMOD | |
| 2019; M.~Mahdavi, Z.~Abedjan. Baran: Effective Error Correction via a Unified Context | |
| Representation and Transfer Learning. PVLDB 13(11):1948--1961, 2020. | |
| \bibitem{holoclean} T.~Rekatsinas, X.~Chu, I.~F.~Ilyas, C.~R\'e. HoloClean: Holistic | |
| Data Repairs with Probabilistic Inference. PVLDB 10(11), 2017. arXiv:1702.00820. | |
| \bibitem{garf} J.~Peng, D.~Shen, N.~Tang, T.~Liu, Y.~Kou, T.~Nie, H.~Cui, G.~Yu. | |
| Self-Supervised and Interpretable Data Cleaning with Sequence Generative Adversarial | |
| Networks (GARF). PVLDB 16(3):433--446, 2022. | |
| \bibitem{wrangle} A.~Narayan, I.~Chami, L.~Orr, S.~Arora, C.~R\'e. Can Foundation | |
| Models Wrangle Your Data? PVLDB 16(4):738--746, 2022. arXiv:2205.09911. | |
| \bibitem{jellyfish} H.~Zhang, Y.~Dong, C.~Xiao, M.~Oyamada. Jellyfish: | |
| Instruction-Tuning Local Large Language Models for Data Preprocessing. EMNLP 2024. | |
| arXiv:2312.01678. | |
| \bibitem{cocoon} S.~Zhang, Z.~Huang, E.~Wu. Data Cleaning Using Large Language Models | |
| (Cocoon). arXiv:2410.15547, 2024 (preprint; no published reproduction). | |
| \bibitem{zeroed} W.~Ni, K.~Zhang, X.~Miao, X.~Zhao, Y.~Wu, Y.~Wang, J.~Yin. ZeroED: | |
| Hybrid Zero-Shot Error Detection Through Large Language Model Reasoning. ICDE 2025. | |
| arXiv:2504.05345. | |
| \bibitem{forested} M.~Wang, J.~Wang, Q.~Liu, X.~Xu, Z.~Xing, L.~Zhu, W.~Zhang. | |
| Ensembling LLM-Induced Decision Trees for Explainable and Robust Error Detection. | |
| arXiv:2512.07246, 2025 (preprint). | |
| \bibitem{autotest} Q.~Chen, Y.~He, R.~C.-W.~Wong, W.~Cui, S.~Ge, H.~Zhang, D.~Zhang, | |
| S.~Chaudhuri. Auto-Test: Learning Semantic-Domain Constraints for Unsupervised Error | |
| Detection in Tables. SIGMOD 2025. arXiv:2504.10762. | |
| \bibitem{gidcl} M.~Yan, Y.~Wang, Y.~Wang, X.~Miao, J.~Li. GIDCL: A Graph-Enhanced | |
| Interpretable Data Cleaning Framework with Large Language Models. Proc.\ ACM Manag.\ | |
| Data 2(6), Article 236, 2024 (SIGMOD). | |
| \bibitem{spreadsheetrl} B.~Chi, Y.~Xie, M.~Wu, J.~Yang, J.~Jiang, Z.~Li, et al. | |
| Spreadsheet-RL: Advancing Large Language Model Agents on Realistic Spreadsheet Tasks | |
| via Reinforcement Learning. arXiv:2605.22642, 2026. | |
| \bibitem{distort} A.~Dutta, H.~Nigam, H.~Hasanbeig, A.~Radhakrishna, S.~Gulwani. | |
| An Empirical Investigation of Robustness in Large Language Models under Tabular | |
| Distortions. arXiv:2601.05009, 2026. | |
| \bibitem{debate} C.~Parmar, A.~Mehta, H.~Wu, J.~Ramamurthy, S.~Medhekar. When Helping | |
| Hurts and How to Fix It: Multi-Agent Debate for Data Cleaning. arXiv:2606.02866, 2026. | |
| \bibitem{tabler1} Z.~Yang, L.~Chen, A.~Cohan, Y.~Zhao. Table-R1: Inference-Time | |
| Scaling for Table Reasoning. EMNLP 2025. arXiv:2505.23621. | |
| \bibitem{spurious} R.~Shao, S.~S.~Li, R.~Xin, S.~Geng, Y.~Wang, et al. Spurious | |
| Rewards: Rethinking Training Signals in RLVR. arXiv:2506.10947, 2025. | |
| \bibitem{tablegpt} P.~Li, Y.~He, D.~Yashar, W.~Cui, S.~Ge, H.~Zhang, D.~Rifinski | |
| Fainman, D.~Zhang, S.~Chaudhuri. Table-GPT: Table Fine-tuned GPT for Diverse Table | |
| Tasks. Proc.\ ACM Manag.\ Data 2(3), Article 176, 2024 (SIGMOD). arXiv:2310.09263. | |
| \bibitem{retclean} Z.~A.~Naeem, M.~S.~Ahmad, M.~Eltabakh, M.~Ouzzani, N.~Tang. | |
| RetClean: Retrieval-Based Data Cleaning Using LLMs and Data Lakes. PVLDB 17(12), 2024 | |
| (demo). arXiv:2303.16909. | |
| \bibitem{turl} X.~Deng, H.~Sun, A.~Lees, Y.~Wu, C.~Yu. TURL: Table Understanding | |
| through Representation Learning. PVLDB 14(3):307--319, 2021. arXiv:2006.14806. | |
| \bibitem{tablellama} T.~Zhang, X.~Yue, Y.~Li, H.~Sun. TableLlama: Towards Open Large | |
| Generalist Models for Tables. NAACL 2024. arXiv:2311.09206. | |
| \bibitem{belotti} F.~Belotti, F.~Dadda, M.~Cremaschi, R.~Avogadro, M.~Palmonari. | |
| Evaluating LLMs on Entity Disambiguation in Tables. arXiv:2408.06423, 2024 (preprint). | |
| \bibitem{racoon} L.~L.~Wei, G.~Xiao, M.~Balazinska. RACOON: An LLM-based Framework for | |
| Retrieval-Augmented Column Type Annotation with a Knowledge Graph. arXiv:2409.14556, | |
| 2024 (preprint). | |
| \bibitem{mtab} P.~Nguyen, N.~Kertkeidkachorn, R.~Ichise, H.~Takeda. MTab: Matching | |
| Tabular Data to Knowledge Graph using Probability Models. SemTab/ISWC 2019. | |
| arXiv:1910.00246. | |
| \bibitem{selective} R.~El-Yaniv, Y.~Wiener. On the Foundations of Noise-free Selective | |
| Classification. JMLR 11:1605--1641, 2010; Y.~Geifman, R.~El-Yaniv. Selective | |
| Classification for Deep Neural Networks. NeurIPS 2017. | |
| \bibitem{openmed} M.~Panahi. OpenMed NER: Open-Source, Domain-Adapted State-of-the-Art | |
| Transformers for Biomedical NER Across 12 Public Datasets. arXiv:2508.01630, 2025 | |
| (preprint). | |
| \bibitem{lakens} D.~Lakens. Equivalence Tests: A Practical Primer for t Tests, | |
| Correlations, and Meta-Analyses. Social Psychological and Personality Science | |
| 8(4):355--362, 2017. | |
| \bibitem{grouse} S.~Muller, A.~Loison, B.~Omrani, G.~Viaud. GroUSE: A Benchmark | |
| to Evaluate Evaluators in Grounded Question Answering. COLING 2025. | |
| arXiv:2409.06595. | |
| \end{thebibliography} | |
| \end{document} | |