Spaces:
Running
Running
File size: 16,934 Bytes
16dc556 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 | # ScrubData β The Profound Tool Reference
> The single local document that explains the whole system: what it is, why every
> piece exists, where every number comes from, and what we learned building it.
> Written at the close of the research domain (2026-06-12). The paper
> (`docs/paper/main.tex`) is the citable account; THIS file is the operational one.
---
## 1. What ScrubData is
ScrubData is a **zero-config, zero-label, local** tabular data-cleaning system built
around one architectural commitment: **the model never touches data**.
A profiler aggregates each column into a bounded value-frequency profile; a small
(β€4B, locally-run) fine-tuned planner *proposes* a JSON cleaning plan; a
deterministic pandas executor *applies* it. The plan is the complete, inspectable,
reversible specification of every change. Three consequences define the product:
1. **No silent edits by construction** β every changed cell traces to a named,
logged operation (verified at scale: 0 silent edits across 35 wild tables and a
239-table GitTables trust audit).
2. **Abstention is first-class** β anything below confidence becomes a review flag
("YOUR CALL" card in the UI), never a quiet skip and never a guess.
3. **Profile-not-rows scaling** β the prompt scales with *distinct values*, not
rows; a million-row table profiles like a hundred-row one, and no cell values
leave the machine.
### The central finding (load-bearing, repeatedly measured)
**Model weights contribute approximately nothing to never-seen-table
generalization in this protocol class.** Five SFT retrains (v7βv10 + mixes, 109k
harvested real alias pairs) and a three-arm GRPO pilot (executor as verifiable
reward, including a random-reward control that reproduced the same format drift)
all failed to move held-out generalization. Every measured gain came from
**deterministic machinery gated by the plan-level verifier** (Β§5). Corroborated
independently by Spreadsheet-RL, arXiv:2601.05009, and arXiv:2606.02866.
Practical corollary: *to improve ScrubData, write a deterministic capability and
gate it with the verifier; do not collect more training data.*
---
## 2. The shipped pipeline (`scrubdata/active.py::get_planner`)
```
ββββββββββββββββββββββββββββββββββββββββββββββββ
df βββΊ profiler βββΊ β model path (only if SCRUBDATA_MODEL is set) β
(bounded β batched (4 cols/call) local Ollama planner β
profile incl. β β per-batch fallback to heuristic on error β
suspects) β β grounded (reference taxonomies, RACOON) β
β β verify_plan(tau=SCRUBDATA_TAU, def 0.5) β
βββββββββββββββββ¬βββββββββββββββββββββββββββββββ
β union_plans (model wins per surface;
β inherits deterministic ops + table ops)
heuristic mock_plan ββββββββββββ
βΌ
executor.apply_plan β (clean_df, change_log)
βΌ
report.render_report Β· trace.log_run Β· observability
```
- **No model configured** β `mock_plan` (grounded deterministic heuristic) alone.
The app always produces a plan; the model is an upgrade, never a dependency.
- **Measured operating point** (hospital, 509 real errors): union **0.905
precision @ 0.413 coverage**; gated model alone 0.993 @ 0.287; 3-seed
0.891Β±0.012 @ 0.396Β±0.025. Precision flat 0.89β0.91 for Οβ[0.2,0.8].
Entry points: `uv run server.py` (FastAPI + UI), `app.py` (HF Space/Gradio),
`scrubdata/cli.py` (`scrubdata <file.csv> -o out.csv --report r.md --plan p.json`).
### Environment variables
| Var | Default | Meaning |
|---|---|---|
| `SCRUBDATA_MODEL` | unset | local Ollama model id (e.g. `scrubdata-ft-v6`); unset = heuristic only |
| `SCRUBDATA_TAU` | `0.5` | per-entry verifier threshold on model mappings |
| `SCRUBDATA_HC_TAU` | `0.8` | stricter bar for heuristic suspect-mappings (no model cross-check there) |
| `SCRUBDATA_PAIR_PROFILES` | off | WS2 candidate-constrained planning (measured redundant with verifier; off by default) |
| `SCRUBDATA_PII_NER` | off | OpenMed-PII 44M NER tier on top of deterministic validators |
---
## 3. Module map (`scrubdata/`)
| Module | Role | Key facts |
|---|---|---|
| `profiler.py` | column β bounded profile | `VALUE_COUNTS_CAP=80` (high-card cols: top-8 only) + `suspect_values` section (the visibility fix); `truncated_values` count keeps honesty about what's hidden |
| `detect.py` | typing + issue predicates | `detect_semantic_type` (zip/ZCTA/Excel-serial guards), `date_formats_consistent` (collapses digit AND alpha runs; 90% dominant-shape), `percent_formats_consistent` (90%), `has_mojibake`, `is_missing` |
| `planner.py` | deterministic heuristic planner | `mock_plan`, `_column_operations`, `_suspect_canonicalize` (Ο_hc=0.8), `detect_entity_groups` (cross-row voting detection), emits `fix_encoding` BEFORE `strip_whitespace` (order-critical), `off_convention_dates` visible-abstention flags |
| `executor.py` | the only thing that touches cells | op dispatch (Β§4); unknown ops are no-ops (forward-compatible); returns `(df, change_log)`; `resolve_by_majority` table op lives here |
| `verifier.py` | WS1 selective prediction | `entry_confidence` (3 hard gates, Β§5.0), `verify_plan` (also enforces convention gates on MODEL-emitted parse_date/parse_percent β the model path otherwise bypasses them), `union_plans` (order-preserving op inheritance via `reversed(inherit)`) |
| `reconcile.py` | reference grounding | `ReferenceIndex`, `default_index()` loads toughtables_ref (contamination-guarded: excludes the 8 benchmark tables) + MusicBrainz hints + Wikidata companies + ROR; `infer_reference_type` needs **β₯20% exact entity hits** (over-fire guard); falls back to `training/harvests/` for Space/clone parity |
| `grounded.py` | RACOON wrapper | model never free-generates a canonical for a reference-typed column |
| `pair_profile.py` | suspects + WS2 candidates | `suspects_for_column` (β€25/col, bounded: 4k rare cap + cheap prefilters before SequenceMatcher β 40minβ24s fix), `candidate_pairs`, `constrain_plan` |
| `model_planner.py` | Ollama backends | `make_local_ollama_planner`, `make_batched_planner(batch_size=4)`, JSON extraction |
| `prompt.py` | prompt/training contract | `_profile_for_prompt` (compact suspects), `build_chat_example` (training-data side of the same contract β change one, regenerate the other) |
| `pii.py` | PII second task | deterministic validators (Luhn, IBAN, phone) + allowlist + coverage vote; optional 44M NER; `mask/hash/pseudonymize` |
| `active.py` | THE composition | `get_planner()` β Β§2 |
| `cli.py` / `report.py` / `trace.py` / `observability.py` | UX + audit | CLI, markdown report, JSONL traces, monitor summary/OTel span |
| `baselines.py` | OpenRefine kNN/fingerprint reimplementations | the zero-config comparison class |
| `refdata/cities.txt` | seed gazetteer | plus everything in `training/harvests/*.jsonl` |
---
## 4. Operation vocabulary (the executor's closed set)
**Column ops** (`_apply_column_op`): `strip_whitespace`, `normalize_punctuation`,
`fix_encoding` (lossless cp1252/latin-1βutf8 round-trip, mojibake-marker-reduction
gated), `normalize_disguised_nulls`, `parse_currency`, `parse_number`,
`parse_percent` (abstains on bare values β no /100 corruption),
`parse_date`, `standardize_boolean`, `standardize_phone` (7-digit β `DDD-DDDD`),
`normalize_email`, `standardize_case`, `canonicalize_categories` (mapping-driven;
the verifier's subject), `flag_pii` (log-only), `mask_pii`, `hash_pii`,
`pseudonymize_pii`. Unknown op β no-op.
**Table ops**: `drop_empty_columns`, `drop_empty_rows`, `drop_exact_duplicates`,
`resolve_by_majority` (Β§5.3).
Op-order invariant: **`fix_encoding` must precede whitespace/punctuation ops** β
they destroy the UTF-8 byte patterns repair needs (grader-reproduced bug; fixed in
both heuristic emission and union inheritance).
---
## 5. The five deterministic capabilities (what actually generalizes)
### 5.0 Plan-level verifier (WS1) β `verifier.entry_confidence`
Every non-grounded `canonicalize_categories` entry `rawβcanon` is scored with
three HARD gates, each killing a measured hospital failure class:
- **errors are rare**: `freq(raw) β₯ 3` β 0.0 (frequent = legit data; "de kalb"Γ92)
- **repair to dominance only**: `freq(canon) < max(2, 2Β·freq(raw))` β 0.0
("yexβyexu", typo mapped to a worse typo)
- **code discipline**: digit-bearing values repair only if letter-part similarity
β₯0.85 AND digits identical (allows `amix-2βami-2`, blocks `ak_βal_`)
Survivors score `sim Γ (0.5 + 0.5Β·support)`; below-Ο entries become review flags.
### 5.1 Suspect surfacing (visibility) β `pair_profile.suspects_for_column`
The 80-value profile cap structurally hides high-cardinality dirty cells from ANY
planner (proved by the v8/v9 retrains: more data couldn't fix what the model
couldn't see). Every text-ish column profile now carries β€25 `suspect_values`:
rare surfaces + evidence-backed candidates (frequency dominance, edit similarity,
reference membership). The heuristic maps suspects clearing `entry_confidence β₯
SCRUBDATA_HC_TAU=0.8`; the rest become flags.
### 5.2 Generic entity reference β `reconcile.default_index`
Open vocabularies (ToughTables-derived ref [8 bench tables excluded], MusicBrainz
search-hint misspellings, RxNorm, Wikidata companies, ROR, GeoNames, OpenFlights,
O*NET, nicknames) as a pluggable reference type. Typing requires **β₯20% exact
hits** of distinct values (fuzzy coverage alone over-fires on name-like columns β
measured). Cracked the all-unique regime: 5 ToughTables tables **0 β 0.955β0.957
F1 at 0.0000 damage** (~62k corrections) β where no in-column frequency signal
exists at all.
### 5.3 Cross-row majority voting β `planner.detect_entity_groups` + `resolve_by_majority`
Tables repeating a real-world entity across rows (flights reported by many
sources) carry their own repair signal. Detection: compact-token key columns,
median multiplicity 3β30, β₯2 votable string columns with majority-bearing
disagreement + β₯2 distinct majorities, date-share β€0.3 guard. Execution: resolve
thin dissenting minorities to group majority; skips missing-like keys;
min_share/min_group clamped. **False-consensus guard**: mean minority share β₯0.25
β decline (legitimate correlated updates, not reporting errors β a flat volume cap
was measured to destroy the legitimate regime and replaced). Measured: flights
heuristic 0.044β**0.164** F1; hospital heuristic 0.092β**0.186**.
### 5.4 Convention conservatism β `detect.*_formats_consistent` + `verify_plan`
Never re-format an internally consistent column: date/percent ops gated on
dominant-shape inconsistency (digit+alpha runs collapsed, 90% rule); zip/postal
names never typed phone/date; Excel-serial typing needs a date-suggestive name.
Suppressed minorities surface as `off_convention_dates` flags. The verifier
enforces the same gates on model plans at the verification boundary (the model
path otherwise bypasses heuristic emission gates entirely).
---
## 6. Evaluation (how every number regenerates)
One scoring contract β `eval/run_real_multi.py::score()` β **churn-neutral,
convention-tolerant**: sem-equal = numeric-tolerant OR strip+casefold equal; pure
case/whitespace churn counts as nothing; a fix requires acting; **damage** =
clean cells corrupted / clean cells; **silent edits** = changed columns minus
log-attributed columns (must be 0).
| Harness | Command | What it measures | Current numbers |
|---|---|---|---|
| Money table | `python -m eval.run_real_multi` | 65-set suite, 3 seeds | grounded NORTH 0.224Β±0.004; REAL-F1 0.225 vs OR-kNN 0.058 (HEAD 2026-06-12 regen; freeze was 0.203/0.174) |
| WS1 gate | `python -m eval.precision_curve --plan eval/results/v6_hospital_raw_plan.json --union` | precisionβcoverage curve | **0.905 @ 0.413** (Ο=0.5) |
| Paired bench | `python -m eval.paired_bench` | 42 dirty/gold pairs | unseen-35 macro F1 **0.363** @ dmg **0.0219** |
| Wild bench | `python -m eval.wild_bench` | 35 uncurated tables, behavioral + inject-recovery | recovery 0.207; **0 silent edits** |
| Trust audit | `python -m eval.gittables_audit` | 239 GitTables clean-lake | 239/239 valid, 0 crashes, 0 silent edits |
| Generalization | `python -m eval.generalization` | held-out-source (train: hospital/beers/movies_1 Β· eval: flights/rayyan/ed2) | GEN-F1 0.058, VR 0.108, dmg 0.036 |
| RADAR board | `python -m eval.radar_bench` | regime boundaries by artifact type | abstains on missingness β; reasoning-class = frontier territory |
| Baselines | `eval/run_baran.py`, `modal run scripts/modal_jellyfish.py` | disclosed-protocol comparisons | Baran (oracle+20 labels) 0.811; Jellyfish-13B 0.074 |
| Calibration / PII | `eval.calibration`, `eval.pii_leak` | abstention quality / leak test | AURC 0.120, ECE 0.169; 0/360 residual PII |
**Eval-source discipline**: TRAIN_SOURCES["v6"]={hospital,beers,movies_1};
EVAL_SOURCES={flights,rayyan,ed2_restaurants}. Never crossed.
---
## 7. Model & artifacts
| Artifact | Where | Notes |
|---|---|---|
| Champion adapter | Modal volume `scrubdata-v5-adapter` `/v5_seed21` (= "v6") | survived v7βv10 challenges + GRPO |
| Merged model | `hf.co/ricalanis/scrubdata-qwen3-4b` | card carries the v2 finding |
| Q8 GGUF | `hf.co/ricalanis/scrubdata-qwen3-4b-v6-q8` | **Q8_0 only β Q4_K_M corrupts** (Unsloth 2026.6.x); non-thinking Modelfile required (`notebooks/Modelfile`); suppress tokens 151657/151658 under transformers |
| Benchmark | `hf.co/datasets/ricalanis/wildclean` | 33 redistributable pairs + loaders.py for 9 license-gated + gittables250 + 10 vocabs + frozen results; first cleaning bench with damage + silent-edit accounting |
| Demo | `hf.co/spaces/build-small-hackathon/scrubdata` | deploy = `HfApi.upload_folder` of `git archive HEAD` β **NO GitHub auto-sync** |
| Paper | `docs/paper/main.tex` + `numbers.tex` | compile: `~/.local/bin/tectonic main.tex` (no pdflatex on this machine) |
| Vocabs | `training/harvests/*.jsonl` (15MB, 13 files) | loader falls back here for clone parity |
Modal patterns: `--detach` for anything long; results land in Modal Dicts
(`scrubdata-train-results`, `scrubdata-eval-v5-results`, `scrubdata-suite-results`).
**Budget status at domain close: ~$187 of $212 ceiling β Modal HALTED.**
---
## 8. Negative results ledger (measured, do not re-litigate)
1. **v7βv10 SFT retrains**: 109k harvested alias pairs, episode mixes, suspects
contract β GEN flat/worse. Mixing harvested pairs **dilutes** executor-verified
synthetic skill (monotonic dilution law across mix ratios; mixH 0.677).
2. **GRPO pilot, 3 arms** (main, KL-anchored v2, random-reward control): all
degrade format at 4B/LoRA/$30 scale; the control proved the drift is an RL
artifact (cf. "Spurious Rewards"). Published RLVR wins used real infra
(verl, 4ΓH100Γ40h). Episodes corpus (600, `training/build_grpo_episodes.py`) +
hand-rolled loop (`scripts/modal_grpo.py`) committed for a future attempt.
3. **Uniform verification of existing low-card mappings** (A1 per-class
thresholds): 0.905β0.890 β reverted.
4. **Strict entity-typing thresholds** (0.90/0.05): cost more than bought β reverted.
5. **WS2 candidate constraining composed with verifier**: 0.876 @ 0.387 < union at
same Ο β redundant gating of the same failure class; available, off by default.
6. **Flat volume cap on cross-row voting**: destroyed the legitimate
dense-disagreement regime β replaced by the false-consensus guard.
7. **Frozen-gold synthetic yardstick predates the suspects prompt contract** β
regenerate gold before ever quoting synthetic canon_f1 again.
## 9. Known-open (graded non-blocking)
`_parse_date` per-value dayfirst; i18n name guards; mojibake fixpoint /
sequence-plausibility; backlog sources: CMS API, NHTSA, Canada contracts, Matelda
~6,670 pairs, GLEIF/USDA vocabs, WDVC-16. Reasoning-class artifacts (RADAR) are
explicitly out of protocol class β frontier-model territory.
## 10. Where deeper detail lives
`docs/PRODUCT.md` (trust contract) Β· `docs/SOTA.md` + `docs/ROADMAP_SOTA2.md`
(position + research map) Β· `docs/CAPABILITY_GRADES.md` (12-agent adversarial
grading + must-fix ledger) Β· `docs/WILD_BENCH.md` / `docs/PAIRED_BENCH.md` /
`docs/GITTABLES_AUDIT.md` / `docs/DATASETS.md` (per-bench detail + licenses) Β·
`docs/NIGHT_LOG.md` (stage-3 timeline) Β· `project-memory/` (agent memory snapshot).
|