File size: 16,934 Bytes
16dc556
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
# ScrubData β€” The Profound Tool Reference

> The single local document that explains the whole system: what it is, why every
> piece exists, where every number comes from, and what we learned building it.
> Written at the close of the research domain (2026-06-12). The paper
> (`docs/paper/main.tex`) is the citable account; THIS file is the operational one.

---

## 1. What ScrubData is

ScrubData is a **zero-config, zero-label, local** tabular data-cleaning system built
around one architectural commitment: **the model never touches data**.

A profiler aggregates each column into a bounded value-frequency profile; a small
(≀4B, locally-run) fine-tuned planner *proposes* a JSON cleaning plan; a
deterministic pandas executor *applies* it. The plan is the complete, inspectable,
reversible specification of every change. Three consequences define the product:

1. **No silent edits by construction** β€” every changed cell traces to a named,
   logged operation (verified at scale: 0 silent edits across 35 wild tables and a
   239-table GitTables trust audit).
2. **Abstention is first-class** β€” anything below confidence becomes a review flag
   ("YOUR CALL" card in the UI), never a quiet skip and never a guess.
3. **Profile-not-rows scaling** β€” the prompt scales with *distinct values*, not
   rows; a million-row table profiles like a hundred-row one, and no cell values
   leave the machine.

### The central finding (load-bearing, repeatedly measured)

**Model weights contribute approximately nothing to never-seen-table
generalization in this protocol class.** Five SFT retrains (v7–v10 + mixes, 109k
harvested real alias pairs) and a three-arm GRPO pilot (executor as verifiable
reward, including a random-reward control that reproduced the same format drift)
all failed to move held-out generalization. Every measured gain came from
**deterministic machinery gated by the plan-level verifier** (Β§5). Corroborated
independently by Spreadsheet-RL, arXiv:2601.05009, and arXiv:2606.02866.
Practical corollary: *to improve ScrubData, write a deterministic capability and
gate it with the verifier; do not collect more training data.*

---

## 2. The shipped pipeline (`scrubdata/active.py::get_planner`)

```
                       β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
 df ──► profiler ──►   β”‚  model path (only if SCRUBDATA_MODEL is set) β”‚
        (bounded       β”‚  batched (4 cols/call) local Ollama planner  β”‚
        profile incl.  β”‚  β†’ per-batch fallback to heuristic on error  β”‚
        suspects)      β”‚  β†’ grounded (reference taxonomies, RACOON)   β”‚
                       β”‚  β†’ verify_plan(tau=SCRUBDATA_TAU, def 0.5)   β”‚
                       β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                       β”‚  union_plans (model wins per surface;
                                       β”‚  inherits deterministic ops + table ops)
        heuristic mock_plan β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                       β–Ό
                        executor.apply_plan β†’ (clean_df, change_log)
                                       β–Ό
                 report.render_report Β· trace.log_run Β· observability
```

- **No model configured** β†’ `mock_plan` (grounded deterministic heuristic) alone.
  The app always produces a plan; the model is an upgrade, never a dependency.
- **Measured operating point** (hospital, 509 real errors): union **0.905
  precision @ 0.413 coverage**; gated model alone 0.993 @ 0.287; 3-seed
  0.891Β±0.012 @ 0.396Β±0.025. Precision flat 0.89–0.91 for Ο„βˆˆ[0.2,0.8].

Entry points: `uv run server.py` (FastAPI + UI), `app.py` (HF Space/Gradio),
`scrubdata/cli.py` (`scrubdata <file.csv> -o out.csv --report r.md --plan p.json`).

### Environment variables

| Var | Default | Meaning |
|---|---|---|
| `SCRUBDATA_MODEL` | unset | local Ollama model id (e.g. `scrubdata-ft-v6`); unset = heuristic only |
| `SCRUBDATA_TAU` | `0.5` | per-entry verifier threshold on model mappings |
| `SCRUBDATA_HC_TAU` | `0.8` | stricter bar for heuristic suspect-mappings (no model cross-check there) |
| `SCRUBDATA_PAIR_PROFILES` | off | WS2 candidate-constrained planning (measured redundant with verifier; off by default) |
| `SCRUBDATA_PII_NER` | off | OpenMed-PII 44M NER tier on top of deterministic validators |

---

## 3. Module map (`scrubdata/`)

| Module | Role | Key facts |
|---|---|---|
| `profiler.py` | column β†’ bounded profile | `VALUE_COUNTS_CAP=80` (high-card cols: top-8 only) + `suspect_values` section (the visibility fix); `truncated_values` count keeps honesty about what's hidden |
| `detect.py` | typing + issue predicates | `detect_semantic_type` (zip/ZCTA/Excel-serial guards), `date_formats_consistent` (collapses digit AND alpha runs; 90% dominant-shape), `percent_formats_consistent` (90%), `has_mojibake`, `is_missing` |
| `planner.py` | deterministic heuristic planner | `mock_plan`, `_column_operations`, `_suspect_canonicalize` (Ο„_hc=0.8), `detect_entity_groups` (cross-row voting detection), emits `fix_encoding` BEFORE `strip_whitespace` (order-critical), `off_convention_dates` visible-abstention flags |
| `executor.py` | the only thing that touches cells | op dispatch (Β§4); unknown ops are no-ops (forward-compatible); returns `(df, change_log)`; `resolve_by_majority` table op lives here |
| `verifier.py` | WS1 selective prediction | `entry_confidence` (3 hard gates, Β§5.0), `verify_plan` (also enforces convention gates on MODEL-emitted parse_date/parse_percent β€” the model path otherwise bypasses them), `union_plans` (order-preserving op inheritance via `reversed(inherit)`) |
| `reconcile.py` | reference grounding | `ReferenceIndex`, `default_index()` loads toughtables_ref (contamination-guarded: excludes the 8 benchmark tables) + MusicBrainz hints + Wikidata companies + ROR; `infer_reference_type` needs **β‰₯20% exact entity hits** (over-fire guard); falls back to `training/harvests/` for Space/clone parity |
| `grounded.py` | RACOON wrapper | model never free-generates a canonical for a reference-typed column |
| `pair_profile.py` | suspects + WS2 candidates | `suspects_for_column` (≀25/col, bounded: 4k rare cap + cheap prefilters before SequenceMatcher β€” 40minβ†’24s fix), `candidate_pairs`, `constrain_plan` |
| `model_planner.py` | Ollama backends | `make_local_ollama_planner`, `make_batched_planner(batch_size=4)`, JSON extraction |
| `prompt.py` | prompt/training contract | `_profile_for_prompt` (compact suspects), `build_chat_example` (training-data side of the same contract β€” change one, regenerate the other) |
| `pii.py` | PII second task | deterministic validators (Luhn, IBAN, phone) + allowlist + coverage vote; optional 44M NER; `mask/hash/pseudonymize` |
| `active.py` | THE composition | `get_planner()` β€” Β§2 |
| `cli.py` / `report.py` / `trace.py` / `observability.py` | UX + audit | CLI, markdown report, JSONL traces, monitor summary/OTel span |
| `baselines.py` | OpenRefine kNN/fingerprint reimplementations | the zero-config comparison class |
| `refdata/cities.txt` | seed gazetteer | plus everything in `training/harvests/*.jsonl` |

---

## 4. Operation vocabulary (the executor's closed set)

**Column ops** (`_apply_column_op`): `strip_whitespace`, `normalize_punctuation`,
`fix_encoding` (lossless cp1252/latin-1↔utf8 round-trip, mojibake-marker-reduction
gated), `normalize_disguised_nulls`, `parse_currency`, `parse_number`,
`parse_percent` (abstains on bare values β€” no /100 corruption),
`parse_date`, `standardize_boolean`, `standardize_phone` (7-digit β†’ `DDD-DDDD`),
`normalize_email`, `standardize_case`, `canonicalize_categories` (mapping-driven;
the verifier's subject), `flag_pii` (log-only), `mask_pii`, `hash_pii`,
`pseudonymize_pii`. Unknown op β†’ no-op.

**Table ops**: `drop_empty_columns`, `drop_empty_rows`, `drop_exact_duplicates`,
`resolve_by_majority` (Β§5.3).

Op-order invariant: **`fix_encoding` must precede whitespace/punctuation ops** β€”
they destroy the UTF-8 byte patterns repair needs (grader-reproduced bug; fixed in
both heuristic emission and union inheritance).

---

## 5. The five deterministic capabilities (what actually generalizes)

### 5.0 Plan-level verifier (WS1) β€” `verifier.entry_confidence`
Every non-grounded `canonicalize_categories` entry `raw→canon` is scored with
three HARD gates, each killing a measured hospital failure class:
- **errors are rare**: `freq(raw) β‰₯ 3` β†’ 0.0 (frequent = legit data; "de kalb"Γ—92)
- **repair to dominance only**: `freq(canon) < max(2, 2Β·freq(raw))` β†’ 0.0
  ("yex→yexu", typo mapped to a worse typo)
- **code discipline**: digit-bearing values repair only if letter-part similarity
  β‰₯0.85 AND digits identical (allows `amix-2β†’ami-2`, blocks `ak_β†’al_`)
Survivors score `sim Γ— (0.5 + 0.5Β·support)`; below-Ο„ entries become review flags.

### 5.1 Suspect surfacing (visibility) β€” `pair_profile.suspects_for_column`
The 80-value profile cap structurally hides high-cardinality dirty cells from ANY
planner (proved by the v8/v9 retrains: more data couldn't fix what the model
couldn't see). Every text-ish column profile now carries ≀25 `suspect_values`:
rare surfaces + evidence-backed candidates (frequency dominance, edit similarity,
reference membership). The heuristic maps suspects clearing `entry_confidence β‰₯
SCRUBDATA_HC_TAU=0.8`; the rest become flags.

### 5.2 Generic entity reference β€” `reconcile.default_index`
Open vocabularies (ToughTables-derived ref [8 bench tables excluded], MusicBrainz
search-hint misspellings, RxNorm, Wikidata companies, ROR, GeoNames, OpenFlights,
O*NET, nicknames) as a pluggable reference type. Typing requires **β‰₯20% exact
hits** of distinct values (fuzzy coverage alone over-fires on name-like columns β€”
measured). Cracked the all-unique regime: 5 ToughTables tables **0 β†’ 0.955–0.957
F1 at 0.0000 damage** (~62k corrections) β€” where no in-column frequency signal
exists at all.

### 5.3 Cross-row majority voting β€” `planner.detect_entity_groups` + `resolve_by_majority`
Tables repeating a real-world entity across rows (flights reported by many
sources) carry their own repair signal. Detection: compact-token key columns,
median multiplicity 3–30, β‰₯2 votable string columns with majority-bearing
disagreement + β‰₯2 distinct majorities, date-share ≀0.3 guard. Execution: resolve
thin dissenting minorities to group majority; skips missing-like keys;
min_share/min_group clamped. **False-consensus guard**: mean minority share β‰₯0.25
β†’ decline (legitimate correlated updates, not reporting errors β€” a flat volume cap
was measured to destroy the legitimate regime and replaced). Measured: flights
heuristic 0.044β†’**0.164** F1; hospital heuristic 0.092β†’**0.186**.

### 5.4 Convention conservatism β€” `detect.*_formats_consistent` + `verify_plan`
Never re-format an internally consistent column: date/percent ops gated on
dominant-shape inconsistency (digit+alpha runs collapsed, 90% rule); zip/postal
names never typed phone/date; Excel-serial typing needs a date-suggestive name.
Suppressed minorities surface as `off_convention_dates` flags. The verifier
enforces the same gates on model plans at the verification boundary (the model
path otherwise bypasses heuristic emission gates entirely).

---

## 6. Evaluation (how every number regenerates)

One scoring contract β€” `eval/run_real_multi.py::score()` β€” **churn-neutral,
convention-tolerant**: sem-equal = numeric-tolerant OR strip+casefold equal; pure
case/whitespace churn counts as nothing; a fix requires acting; **damage** =
clean cells corrupted / clean cells; **silent edits** = changed columns minus
log-attributed columns (must be 0).

| Harness | Command | What it measures | Current numbers |
|---|---|---|---|
| Money table | `python -m eval.run_real_multi` | 65-set suite, 3 seeds | grounded NORTH 0.224Β±0.004; REAL-F1 0.225 vs OR-kNN 0.058 (HEAD 2026-06-12 regen; freeze was 0.203/0.174) |
| WS1 gate | `python -m eval.precision_curve --plan eval/results/v6_hospital_raw_plan.json --union` | precision–coverage curve | **0.905 @ 0.413** (Ο„=0.5) |
| Paired bench | `python -m eval.paired_bench` | 42 dirty/gold pairs | unseen-35 macro F1 **0.363** @ dmg **0.0219** |
| Wild bench | `python -m eval.wild_bench` | 35 uncurated tables, behavioral + inject-recovery | recovery 0.207; **0 silent edits** |
| Trust audit | `python -m eval.gittables_audit` | 239 GitTables clean-lake | 239/239 valid, 0 crashes, 0 silent edits |
| Generalization | `python -m eval.generalization` | held-out-source (train: hospital/beers/movies_1 Β· eval: flights/rayyan/ed2) | GEN-F1 0.058, VR 0.108, dmg 0.036 |
| RADAR board | `python -m eval.radar_bench` | regime boundaries by artifact type | abstains on missingness βœ“; reasoning-class = frontier territory |
| Baselines | `eval/run_baran.py`, `modal run scripts/modal_jellyfish.py` | disclosed-protocol comparisons | Baran (oracle+20 labels) 0.811; Jellyfish-13B 0.074 |
| Calibration / PII | `eval.calibration`, `eval.pii_leak` | abstention quality / leak test | AURC 0.120, ECE 0.169; 0/360 residual PII |

**Eval-source discipline**: TRAIN_SOURCES["v6"]={hospital,beers,movies_1};
EVAL_SOURCES={flights,rayyan,ed2_restaurants}. Never crossed.

---

## 7. Model & artifacts

| Artifact | Where | Notes |
|---|---|---|
| Champion adapter | Modal volume `scrubdata-v5-adapter` `/v5_seed21` (= "v6") | survived v7–v10 challenges + GRPO |
| Merged model | `hf.co/ricalanis/scrubdata-qwen3-4b` | card carries the v2 finding |
| Q8 GGUF | `hf.co/ricalanis/scrubdata-qwen3-4b-v6-q8` | **Q8_0 only β€” Q4_K_M corrupts** (Unsloth 2026.6.x); non-thinking Modelfile required (`notebooks/Modelfile`); suppress tokens 151657/151658 under transformers |
| Benchmark | `hf.co/datasets/ricalanis/wildclean` | 33 redistributable pairs + loaders.py for 9 license-gated + gittables250 + 10 vocabs + frozen results; first cleaning bench with damage + silent-edit accounting |
| Demo | `hf.co/spaces/build-small-hackathon/scrubdata` | deploy = `HfApi.upload_folder` of `git archive HEAD` β€” **NO GitHub auto-sync** |
| Paper | `docs/paper/main.tex` + `numbers.tex` | compile: `~/.local/bin/tectonic main.tex` (no pdflatex on this machine) |
| Vocabs | `training/harvests/*.jsonl` (15MB, 13 files) | loader falls back here for clone parity |

Modal patterns: `--detach` for anything long; results land in Modal Dicts
(`scrubdata-train-results`, `scrubdata-eval-v5-results`, `scrubdata-suite-results`).
**Budget status at domain close: ~$187 of $212 ceiling β€” Modal HALTED.**

---

## 8. Negative results ledger (measured, do not re-litigate)

1. **v7–v10 SFT retrains**: 109k harvested alias pairs, episode mixes, suspects
   contract β€” GEN flat/worse. Mixing harvested pairs **dilutes** executor-verified
   synthetic skill (monotonic dilution law across mix ratios; mixH 0.677).
2. **GRPO pilot, 3 arms** (main, KL-anchored v2, random-reward control): all
   degrade format at 4B/LoRA/$30 scale; the control proved the drift is an RL
   artifact (cf. "Spurious Rewards"). Published RLVR wins used real infra
   (verl, 4Γ—H100Γ—40h). Episodes corpus (600, `training/build_grpo_episodes.py`) +
   hand-rolled loop (`scripts/modal_grpo.py`) committed for a future attempt.
3. **Uniform verification of existing low-card mappings** (A1 per-class
   thresholds): 0.905β†’0.890 β€” reverted.
4. **Strict entity-typing thresholds** (0.90/0.05): cost more than bought β€” reverted.
5. **WS2 candidate constraining composed with verifier**: 0.876 @ 0.387 < union at
   same Ο„ β€” redundant gating of the same failure class; available, off by default.
6. **Flat volume cap on cross-row voting**: destroyed the legitimate
   dense-disagreement regime β€” replaced by the false-consensus guard.
7. **Frozen-gold synthetic yardstick predates the suspects prompt contract** β€”
   regenerate gold before ever quoting synthetic canon_f1 again.

## 9. Known-open (graded non-blocking)

`_parse_date` per-value dayfirst; i18n name guards; mojibake fixpoint /
sequence-plausibility; backlog sources: CMS API, NHTSA, Canada contracts, Matelda
~6,670 pairs, GLEIF/USDA vocabs, WDVC-16. Reasoning-class artifacts (RADAR) are
explicitly out of protocol class β€” frontier-model territory.

## 10. Where deeper detail lives

`docs/PRODUCT.md` (trust contract) Β· `docs/SOTA.md` + `docs/ROADMAP_SOTA2.md`
(position + research map) Β· `docs/CAPABILITY_GRADES.md` (12-agent adversarial
grading + must-fix ledger) Β· `docs/WILD_BENCH.md` / `docs/PAIRED_BENCH.md` /
`docs/GITTABLES_AUDIT.md` / `docs/DATASETS.md` (per-bench detail + licenses) Β·
`docs/NIGHT_LOG.md` (stage-3 timeline) Β· `project-memory/` (agent memory snapshot).