	% Auto-generated by eval/inject_validity.py — do not edit by hand.
	\subsection{Validity of the Injected Slice}\label{app:inject-validity}
	Following the TableEG-style audit, we classify every error cell (dirty vs.\ gold)
	with a deterministic taxonomy and compare the suite's injected errors (money-table
	seeds 7/17/27, $n=43{,}011$) against the $163{,}607$ real errors across the 42 paired sources (including the 509 from hospital).
	\begin{table}[t]\centering\small
	\caption{Error-type distributions (fraction of error cells), real vs.\ injected (pooled).}
	\label{tab:inject-validity}
	\begin{tabular}{lrr}\toprule
	error type & real & injected \\ \midrule
	typo & 0.386 & 0.454 \\
	case & 0.009 & 0.214 \\
	whitespace & 0.009 & 0.333 \\
	encoding & 0.004 & 0.000 \\
	numeric & 0.061 & 0.000 \\
	date-format & 0.000 & 0.000 \\
	token-swap & 0.000 & 0.000 \\
	missing & 0.032 & 0.000 \\
	other & 0.500 & 0.000 \\
	\bottomrule\end{tabular}\end{table}
	The injector covers only the recoverable surface classes it targets by design
	(typo/case/whitespace; injector--taxonomy agreement 0.997). Real errors, by contrast,
	are dominated by substitutions beyond edit distance~2 (other, 0.500) and short typos (0.386), and they include numeric (0.061), missing-value (0.032), and encoding (0.004) classes that the injector never produces.
	Pooled Jensen--Shannon divergence is 0.526~bits (per-source median 0.398, range 0.212--1.000; hospital 0.398), so the two slices are \emph{not}
	interchangeable; the paper therefore reports them separately and restricts
	the grounding claim to the real slice. Ranking preservation is partial: Kendall
	$\tau_b$ between system rankings on the injected vs.\ real F1 slices is $0.33$ over the four cross-system rows and $0.80$ with the degenerate anchors
	(abstain-all, random-edit, oracle) included. The injected slice preserves the
	floor/ceiling ordering but ranks OpenRefine fingerprint above both our system
	and OpenRefine kNN, the reverse of the real slice --- frequency clustering looks
	strong exactly where the canonical form is present and dominant by construction.
	Injected-only evaluation would therefore overstate frequency-clustering
	baselines.
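	\par For reference, a sketch of the two statistics quoted above, assuming the
	standard definitions (the formulas used by the generating script are not shown in
	this appendix; the symbols $P$, $Q$, $M$, $C$, $D$ are introduced here only for
	illustration). Let $P$ and $Q$ be the real and injected columns of
	Table~\ref{tab:inject-validity} and $M=\tfrac12(P+Q)$. The base-2
	Jensen--Shannon divergence is
	\[
	\mathrm{JSD}(P\,\|\,Q)
	=\tfrac12\sum_i P_i\log_2\frac{P_i}{M_i}
	+\tfrac12\sum_i Q_i\log_2\frac{Q_i}{M_i},
	\qquad 0\log_2 0:=0,
	\]
	which is bounded by 1~bit, attained when the two supports are disjoint (the upper
	end of the per-source range). A class that appears in only one slice contributes
	half its mass: \emph{other} alone adds $\tfrac12\cdot 0.500\cdot\log_2 2=0.250$~bits,
	nearly half of the pooled value, and evaluating the sum on the rounded table
	entries gives approximately $0.526$~bits.

	\par For the ranking comparison, if no two systems tie on either slice, Kendall's
	$\tau_b$ reduces to
	\[
	\tau=\frac{C-D}{n(n-1)/2},
	\]
	where $C$ and $D$ count concordant and discordant system pairs. With $n=4$ there
	are six pairs; under that no-tie assumption, the two reversals described above
	(OpenRefine fingerprint against our system and against OpenRefine kNN) would give
	$C=4$, $D=2$, hence $\tau=(4-2)/6\approx 0.33$, consistent with the reported value.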