% Auto-generated by eval/inject_validity.py — do not edit by hand.
\subsection{Validity of the Injected Slice}\label{app:inject-validity}
Following the TableEG-style audit, we classify every error cell (a cell whose dirty value differs from its gold value)
with a deterministic taxonomy and compare the suite's injected errors (money-table
seeds 7, 17, and 27; $n=43{,}011$) against the $163{,}607$ real errors across the 42 paired sources (including the 509 from hospital).
\begin{table}[t]\centering\small
\caption{Error-type distributions, real vs.\ injected (pooled).}
\label{tab:inject-validity}
\begin{tabular}{lrr}\toprule
error type & real & injected \\ \midrule
typo & 0.386 & 0.454 \\
case & 0.009 & 0.214 \\
whitespace & 0.009 & 0.333 \\
encoding & 0.004 & 0.000 \\
numeric & 0.061 & 0.000 \\
date-format & 0.000 & 0.000 \\
token-swap & 0.000 & 0.000 \\
missing & 0.032 & 0.000 \\
other & 0.500 & 0.000 \\
\bottomrule\end{tabular}\end{table}
The injector covers only the recoverable surface classes it targets by design
(typo, case, whitespace; injector--taxonomy agreement 0.997), whereas real errors
are dominated by substitutions beyond edit distance~2 (other, 0.500) and short typos (0.386), together with numeric (0.061), missing-value (0.032), and encoding (0.004) classes that the injector never produces.
Pooled Jensen--Shannon divergence is 0.526~bits (per-source median 0.398, range 0.212--1.000; hospital 0.398): the two slices are \emph{not}
interchangeable, which is why the paper reports them separately and confines
the grounding claim to the real slice. Ranking preservation is partial: Kendall's
$\tau_b$ between system rankings on the injected and real F1 slices is $0.33$ over the four cross-system rows and $0.80$ when the degenerate anchors
(abstain-all, random-edit, oracle) are included. The injected slice preserves the
floor/ceiling ordering but ranks OpenRefine fingerprint above both our system
and OpenRefine kNN, the reverse of the real slice: frequency clustering looks
strong exactly where the canonical form is present and dominant by construction.
Injected-only evaluation would therefore overstate frequency-clustering
baselines.
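
For reference, the two statistics quoted in this subsection can be written out explicitly. The forms below are the standard textbook definitions; that they match what \texttt{eval/inject\_validity.py} computes is an assumption, not something this appendix confirms. Let $P$ and $Q$ denote the real and injected columns of Table~\ref{tab:inject-validity}, and let $M=\frac{1}{2}(P+Q)$ be their mixture. Then
\[
\mathrm{JSD}(P,Q) \;=\; \frac{1}{2}\sum_{k} P_k \log_2\frac{P_k}{M_k} \;+\; \frac{1}{2}\sum_{k} Q_k \log_2\frac{Q_k}{M_k},
\]
where terms with $P_k=0$ (respectively $Q_k=0$) contribute zero. Base-2 logarithms give the value in bits and bound it above by $1$. Any class with $Q_k=0$ has $P_k/M_k=2$ and so contributes $\frac{1}{2}P_k$; the numeric, missing, encoding, and other classes therefore account for $\frac{1}{2}(0.061+0.032+0.004+0.500)\approx 0.30$~bits, more than half of the pooled divergence. Evaluating the full sum on the rounded entries of the table gives roughly $0.53$~bits, in line with the reported $0.526$.

For two rankings of $n$ systems, Kendall's $\tau_b$ is
\[
\tau_b \;=\; \frac{n_c - n_d}{\sqrt{(n_0 - n_1)(n_0 - n_2)}}, \qquad n_0 = \frac{n(n-1)}{2},
\]
where $n_c$ and $n_d$ count concordant and discordant pairs, and $n_1$ and $n_2$ count the pairs tied in the first and second ranking, respectively. Assuming no ties among the four cross-system rows, $n_0=6$, so $\tau_b=0.33$ corresponds to four concordant and two discordant pairs, consistent with the two inversions involving OpenRefine fingerprint described above.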