% Auto-generated by eval/inject_validity.py; do not edit by hand.
\subsection{Validity of the Injected Slice}\label{app:inject-validity}
Following the TableEG-style audit, we classify every error cell (dirty vs.\ gold)
with a deterministic taxonomy and compare the suite's injected errors (money-table
seeds 7/17/27, $n=43{,}011$) against the $163{,}607$ real errors across the 42 paired sources (including the 509 from hospital).
\begin{table}[t]\centering\small
\caption{Error-type distributions (proportion of error cells), real vs.\ injected (pooled). Columns may not sum to exactly 1 because of rounding.}
\label{tab:inject-validity}
\begin{tabular}{lrr}\toprule
error type & real & injected \\ \midrule
typo & 0.386 & 0.454 \\
case & 0.009 & 0.214 \\
whitespace & 0.009 & 0.333 \\
encoding & 0.004 & 0.000 \\
numeric & 0.061 & 0.000 \\
date-format & 0.000 & 0.000 \\
token-swap & 0.000 & 0.000 \\
missing & 0.032 & 0.000 \\
other & 0.500 & 0.000 \\
\bottomrule\end{tabular}\end{table}
The injector covers only the recoverable surface classes it targets by design
(typo/case/whitespace; injector--taxonomy agreement 0.997), whereas real errors
are dominated by substitutions beyond edit distance~2 (other, 0.500) and short typos (0.386), and also include numeric (0.061), missing-value (0.032), and encoding (0.004) classes that the injector never produces.
Pooled Jensen--Shannon divergence is 0.526~bits (per-source median 0.398, range 0.212--1.000; hospital 0.398): the two slices are \emph{not}
interchangeable, which is why the paper reports them separately and localizes
the grounding claim in the real slice. Ranking preservation is partial: Kendall
$\tau_b$ between system rankings by F1 on the injected vs.\ real slices is $0.33$ over the four cross-system rows and $0.80$ with the degenerate anchors
(abstain-all, random-edit, oracle) included. The injected slice preserves the
floor/ceiling ordering but ranks OpenRefine fingerprint above both our system
and OpenRefine kNN, the reverse of the real slice: frequency clustering looks
strong exactly where the canonical form is present and dominant by construction.
Injected-only evaluation would therefore overstate frequency-clustering
baselines.
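
As a worked check on the two statistics above, the following sketch makes two assumptions that this appendix does not state explicitly: that the divergence is the standard base-2 Jensen--Shannon divergence (as the unit ``bits'' suggests), and that the four cross-system rows are untied. Writing $P$ and $Q$ for the real and injected columns of Table~\ref{tab:inject-validity} and $M=\tfrac12(P+Q)$,
\[
\mathrm{JSD}(P\,\|\,Q)
  = \tfrac12\sum_i P_i\log_2\frac{P_i}{M_i}
  + \tfrac12\sum_i Q_i\log_2\frac{Q_i}{M_i}
  \;\le\; 1~\mathrm{bit}.
\]
A class with $Q_i=0$ has $M_i=P_i/2$ and therefore contributes exactly $P_i/2$ bits. Under this definition, the four classes the injector never produces account for $(0.004+0.061+0.032+0.500)/2\approx 0.30$ bits, more than half of the pooled $0.526$; evaluating the full sum on the rounded table entries gives approximately $0.526$. For $n$ untied systems with $C$ concordant and $D$ discordant pairs,
\[
\tau_b=\frac{C-D}{\binom{n}{2}},
\]
so if the two inversions involving OpenRefine fingerprint are the only discordant pairs among the four rows, then $C=4$, $D=2$, and $\tau_b=2/6\approx 0.33$, which matches the reported value.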