Spaces:

build-small-hackathon
/

scrubdata

Running

File size: 16,934 Bytes

16dc556

# ScrubData — The Profound Tool Reference

> The single local document that explains the whole system: what it is, why every
> piece exists, where every number comes from, and what we learned building it.
> Written at the close of the research domain (2026-06-12). The paper
> (`docs/paper/main.tex`) is the citable account; THIS file is the operational one.

---

## 1. What ScrubData is

ScrubData is a **zero-config, zero-label, local** tabular data-cleaning system built
around one architectural commitment: **the model never touches data**.

A profiler aggregates each column into a bounded value-frequency profile; a small
(≤4B, locally-run) fine-tuned planner *proposes* a JSON cleaning plan; a
deterministic pandas executor *applies* it. The plan is the complete, inspectable,
reversible specification of every change. Three consequences define the product:

1. **No silent edits by construction** — every changed cell traces to a named,
   logged operation (verified at scale: 0 silent edits across 35 wild tables and a
   239-table GitTables trust audit).
2. **Abstention is first-class** — anything below confidence becomes a review flag
   ("YOUR CALL" card in the UI), never a quiet skip and never a guess.
3. **Profile-not-rows scaling** — the prompt scales with *distinct values*, not
   rows; a million-row table profiles like a hundred-row one, and no cell values
   leave the machine.

### The central finding (load-bearing, repeatedly measured)

**Model weights contribute approximately nothing to never-seen-table
generalization in this protocol class.** Five SFT retrains (v7–v10 + mixes, 109k
harvested real alias pairs) and a three-arm GRPO pilot (executor as verifiable
reward, including a random-reward control that reproduced the same format drift)
all failed to move held-out generalization. Every measured gain came from
**deterministic machinery gated by the plan-level verifier** (§5). Corroborated
independently by Spreadsheet-RL, arXiv:2601.05009, and arXiv:2606.02866.
Practical corollary: *to improve ScrubData, write a deterministic capability and
gate it with the verifier; do not collect more training data.*

---

## 2. The shipped pipeline (`scrubdata/active.py::get_planner`)

```
                       ┌──────────────────────────────────────────────┐
 df ──► profiler ──►   │  model path (only if SCRUBDATA_MODEL is set) │
        (bounded       │  batched (4 cols/call) local Ollama planner  │
        profile incl.  │  → per-batch fallback to heuristic on error  │
        suspects)      │  → grounded (reference taxonomies, RACOON)   │
                       │  → verify_plan(tau=SCRUBDATA_TAU, def 0.5)   │
                       └───────────────┬──────────────────────────────┘
                                       │  union_plans (model wins per surface;
                                       │  inherits deterministic ops + table ops)
        heuristic mock_plan ───────────┘
                                       ▼
                        executor.apply_plan → (clean_df, change_log)
                                       ▼
                 report.render_report · trace.log_run · observability
```

- **No model configured** → `mock_plan` (grounded deterministic heuristic) alone.
  The app always produces a plan; the model is an upgrade, never a dependency.
- **Measured operating point** (hospital, 509 real errors): union **0.905
  precision @ 0.413 coverage**; gated model alone 0.993 @ 0.287; 3-seed
  0.891±0.012 @ 0.396±0.025. Precision flat 0.89–0.91 for τ∈[0.2,0.8].

Entry points: `uv run server.py` (FastAPI + UI), `app.py` (HF Space/Gradio),
`scrubdata/cli.py` (`scrubdata <file.csv> -o out.csv --report r.md --plan p.json`).

### Environment variables

| Var | Default | Meaning |
|---|---|---|
| `SCRUBDATA_MODEL` | unset | local Ollama model id (e.g. `scrubdata-ft-v6`); unset = heuristic only |
| `SCRUBDATA_TAU` | `0.5` | per-entry verifier threshold on model mappings |
| `SCRUBDATA_HC_TAU` | `0.8` | stricter bar for heuristic suspect-mappings (no model cross-check there) |
| `SCRUBDATA_PAIR_PROFILES` | off | WS2 candidate-constrained planning (measured redundant with verifier; off by default) |
| `SCRUBDATA_PII_NER` | off | OpenMed-PII 44M NER tier on top of deterministic validators |

---

## 3. Module map (`scrubdata/`)

| Module | Role | Key facts |
|---|---|---|
| `profiler.py` | column → bounded profile | `VALUE_COUNTS_CAP=80` (high-card cols: top-8 only) + `suspect_values` section (the visibility fix); `truncated_values` count keeps honesty about what's hidden |
| `detect.py` | typing + issue predicates | `detect_semantic_type` (zip/ZCTA/Excel-serial guards), `date_formats_consistent` (collapses digit AND alpha runs; 90% dominant-shape), `percent_formats_consistent` (90%), `has_mojibake`, `is_missing` |
| `planner.py` | deterministic heuristic planner | `mock_plan`, `_column_operations`, `_suspect_canonicalize` (τ_hc=0.8), `detect_entity_groups` (cross-row voting detection), emits `fix_encoding` BEFORE `strip_whitespace` (order-critical), `off_convention_dates` visible-abstention flags |
| `executor.py` | the only thing that touches cells | op dispatch (§4); unknown ops are no-ops (forward-compatible); returns `(df, change_log)`; `resolve_by_majority` table op lives here |
| `verifier.py` | WS1 selective prediction | `entry_confidence` (3 hard gates, §5.0), `verify_plan` (also enforces convention gates on MODEL-emitted parse_date/parse_percent — the model path otherwise bypasses them), `union_plans` (order-preserving op inheritance via `reversed(inherit)`) |
| `reconcile.py` | reference grounding | `ReferenceIndex`, `default_index()` loads toughtables_ref (contamination-guarded: excludes the 8 benchmark tables) + MusicBrainz hints + Wikidata companies + ROR; `infer_reference_type` needs **≥20% exact entity hits** (over-fire guard); falls back to `training/harvests/` for Space/clone parity |
| `grounded.py` | RACOON wrapper | model never free-generates a canonical for a reference-typed column |
| `pair_profile.py` | suspects + WS2 candidates | `suspects_for_column` (≤25/col, bounded: 4k rare cap + cheap prefilters before SequenceMatcher — 40min→24s fix), `candidate_pairs`, `constrain_plan` |
| `model_planner.py` | Ollama backends | `make_local_ollama_planner`, `make_batched_planner(batch_size=4)`, JSON extraction |
| `prompt.py` | prompt/training contract | `_profile_for_prompt` (compact suspects), `build_chat_example` (training-data side of the same contract — change one, regenerate the other) |
| `pii.py` | PII second task | deterministic validators (Luhn, IBAN, phone) + allowlist + coverage vote; optional 44M NER; `mask/hash/pseudonymize` |
| `active.py` | THE composition | `get_planner()` — §2 |
| `cli.py` / `report.py` / `trace.py` / `observability.py` | UX + audit | CLI, markdown report, JSONL traces, monitor summary/OTel span |
| `baselines.py` | OpenRefine kNN/fingerprint reimplementations | the zero-config comparison class |
| `refdata/cities.txt` | seed gazetteer | plus everything in `training/harvests/*.jsonl` |

---

## 4. Operation vocabulary (the executor's closed set)

**Column ops** (`_apply_column_op`): `strip_whitespace`, `normalize_punctuation`,
`fix_encoding` (lossless cp1252/latin-1↔utf8 round-trip, mojibake-marker-reduction
gated), `normalize_disguised_nulls`, `parse_currency`, `parse_number`,
`parse_percent` (abstains on bare values — no /100 corruption),
`parse_date`, `standardize_boolean`, `standardize_phone` (7-digit → `DDD-DDDD`),
`normalize_email`, `standardize_case`, `canonicalize_categories` (mapping-driven;
the verifier's subject), `flag_pii` (log-only), `mask_pii`, `hash_pii`,
`pseudonymize_pii`. Unknown op → no-op.

**Table ops**: `drop_empty_columns`, `drop_empty_rows`, `drop_exact_duplicates`,
`resolve_by_majority` (§5.3).

Op-order invariant: **`fix_encoding` must precede whitespace/punctuation ops** —
they destroy the UTF-8 byte patterns repair needs (grader-reproduced bug; fixed in
both heuristic emission and union inheritance).

---

## 5. The five deterministic capabilities (what actually generalizes)

### 5.0 Plan-level verifier (WS1) — `verifier.entry_confidence`
Every non-grounded `canonicalize_categories` entry `raw→canon` is scored with
three HARD gates, each killing a measured hospital failure class:
- **errors are rare**: `freq(raw) ≥ 3` → 0.0 (frequent = legit data; "de kalb"×92)
- **repair to dominance only**: `freq(canon) < max(2, 2·freq(raw))` → 0.0
  ("yex→yexu", typo mapped to a worse typo)
- **code discipline**: digit-bearing values repair only if letter-part similarity
  ≥0.85 AND digits identical (allows `amix-2→ami-2`, blocks `ak_→al_`)
Survivors score `sim × (0.5 + 0.5·support)`; below-τ entries become review flags.

### 5.1 Suspect surfacing (visibility) — `pair_profile.suspects_for_column`
The 80-value profile cap structurally hides high-cardinality dirty cells from ANY
planner (proved by the v8/v9 retrains: more data couldn't fix what the model
couldn't see). Every text-ish column profile now carries ≤25 `suspect_values`:
rare surfaces + evidence-backed candidates (frequency dominance, edit similarity,
reference membership). The heuristic maps suspects clearing `entry_confidence ≥
SCRUBDATA_HC_TAU=0.8`; the rest become flags.

### 5.2 Generic entity reference — `reconcile.default_index`
Open vocabularies (ToughTables-derived ref [8 bench tables excluded], MusicBrainz
search-hint misspellings, RxNorm, Wikidata companies, ROR, GeoNames, OpenFlights,
O*NET, nicknames) as a pluggable reference type. Typing requires **≥20% exact
hits** of distinct values (fuzzy coverage alone over-fires on name-like columns —
measured). Cracked the all-unique regime: 5 ToughTables tables **0 → 0.955–0.957
F1 at 0.0000 damage** (~62k corrections) — where no in-column frequency signal
exists at all.

### 5.3 Cross-row majority voting — `planner.detect_entity_groups` + `resolve_by_majority`
Tables repeating a real-world entity across rows (flights reported by many
sources) carry their own repair signal. Detection: compact-token key columns,
median multiplicity 3–30, ≥2 votable string columns with majority-bearing
disagreement + ≥2 distinct majorities, date-share ≤0.3 guard. Execution: resolve
thin dissenting minorities to group majority; skips missing-like keys;
min_share/min_group clamped. **False-consensus guard**: mean minority share ≥0.25
→ decline (legitimate correlated updates, not reporting errors — a flat volume cap
was measured to destroy the legitimate regime and replaced). Measured: flights
heuristic 0.044→**0.164** F1; hospital heuristic 0.092→**0.186**.

### 5.4 Convention conservatism — `detect.*_formats_consistent` + `verify_plan`
Never re-format an internally consistent column: date/percent ops gated on
dominant-shape inconsistency (digit+alpha runs collapsed, 90% rule); zip/postal
names never typed phone/date; Excel-serial typing needs a date-suggestive name.
Suppressed minorities surface as `off_convention_dates` flags. The verifier
enforces the same gates on model plans at the verification boundary (the model
path otherwise bypasses heuristic emission gates entirely).

---

## 6. Evaluation (how every number regenerates)

One scoring contract — `eval/run_real_multi.py::score()` — **churn-neutral,
convention-tolerant**: sem-equal = numeric-tolerant OR strip+casefold equal; pure
case/whitespace churn counts as nothing; a fix requires acting; **damage** =
clean cells corrupted / clean cells; **silent edits** = changed columns minus
log-attributed columns (must be 0).

| Harness | Command | What it measures | Current numbers |
|---|---|---|---|
| Money table | `python -m eval.run_real_multi` | 65-set suite, 3 seeds | grounded NORTH 0.224±0.004; REAL-F1 0.225 vs OR-kNN 0.058 (HEAD 2026-06-12 regen; freeze was 0.203/0.174) |
| WS1 gate | `python -m eval.precision_curve --plan eval/results/v6_hospital_raw_plan.json --union` | precision–coverage curve | **0.905 @ 0.413** (τ=0.5) |
| Paired bench | `python -m eval.paired_bench` | 42 dirty/gold pairs | unseen-35 macro F1 **0.363** @ dmg **0.0219** |
| Wild bench | `python -m eval.wild_bench` | 35 uncurated tables, behavioral + inject-recovery | recovery 0.207; **0 silent edits** |
| Trust audit | `python -m eval.gittables_audit` | 239 GitTables clean-lake | 239/239 valid, 0 crashes, 0 silent edits |
| Generalization | `python -m eval.generalization` | held-out-source (train: hospital/beers/movies_1 · eval: flights/rayyan/ed2) | GEN-F1 0.058, VR 0.108, dmg 0.036 |
| RADAR board | `python -m eval.radar_bench` | regime boundaries by artifact type | abstains on missingness ✓; reasoning-class = frontier territory |
| Baselines | `eval/run_baran.py`, `modal run scripts/modal_jellyfish.py` | disclosed-protocol comparisons | Baran (oracle+20 labels) 0.811; Jellyfish-13B 0.074 |
| Calibration / PII | `eval.calibration`, `eval.pii_leak` | abstention quality / leak test | AURC 0.120, ECE 0.169; 0/360 residual PII |

**Eval-source discipline**: TRAIN_SOURCES["v6"]={hospital,beers,movies_1};
EVAL_SOURCES={flights,rayyan,ed2_restaurants}. Never crossed.

---

## 7. Model & artifacts

| Artifact | Where | Notes |
|---|---|---|
| Champion adapter | Modal volume `scrubdata-v5-adapter` `/v5_seed21` (= "v6") | survived v7–v10 challenges + GRPO |
| Merged model | `hf.co/ricalanis/scrubdata-qwen3-4b` | card carries the v2 finding |
| Q8 GGUF | `hf.co/ricalanis/scrubdata-qwen3-4b-v6-q8` | **Q8_0 only — Q4_K_M corrupts** (Unsloth 2026.6.x); non-thinking Modelfile required (`notebooks/Modelfile`); suppress tokens 151657/151658 under transformers |
| Benchmark | `hf.co/datasets/ricalanis/wildclean` | 33 redistributable pairs + loaders.py for 9 license-gated + gittables250 + 10 vocabs + frozen results; first cleaning bench with damage + silent-edit accounting |
| Demo | `hf.co/spaces/build-small-hackathon/scrubdata` | deploy = `HfApi.upload_folder` of `git archive HEAD` — **NO GitHub auto-sync** |
| Paper | `docs/paper/main.tex` + `numbers.tex` | compile: `~/.local/bin/tectonic main.tex` (no pdflatex on this machine) |
| Vocabs | `training/harvests/*.jsonl` (15MB, 13 files) | loader falls back here for clone parity |

Modal patterns: `--detach` for anything long; results land in Modal Dicts
(`scrubdata-train-results`, `scrubdata-eval-v5-results`, `scrubdata-suite-results`).
**Budget status at domain close: ~$187 of $212 ceiling — Modal HALTED.**

---

## 8. Negative results ledger (measured, do not re-litigate)

1. **v7–v10 SFT retrains**: 109k harvested alias pairs, episode mixes, suspects
   contract — GEN flat/worse. Mixing harvested pairs **dilutes** executor-verified
   synthetic skill (monotonic dilution law across mix ratios; mixH 0.677).
2. **GRPO pilot, 3 arms** (main, KL-anchored v2, random-reward control): all
   degrade format at 4B/LoRA/$30 scale; the control proved the drift is an RL
   artifact (cf. "Spurious Rewards"). Published RLVR wins used real infra
   (verl, 4×H100×40h). Episodes corpus (600, `training/build_grpo_episodes.py`) +
   hand-rolled loop (`scripts/modal_grpo.py`) committed for a future attempt.
3. **Uniform verification of existing low-card mappings** (A1 per-class
   thresholds): 0.905→0.890 — reverted.
4. **Strict entity-typing thresholds** (0.90/0.05): cost more than bought — reverted.
5. **WS2 candidate constraining composed with verifier**: 0.876 @ 0.387 < union at
   same τ — redundant gating of the same failure class; available, off by default.
6. **Flat volume cap on cross-row voting**: destroyed the legitimate
   dense-disagreement regime — replaced by the false-consensus guard.
7. **Frozen-gold synthetic yardstick predates the suspects prompt contract** —
   regenerate gold before ever quoting synthetic canon_f1 again.

## 9. Known-open (graded non-blocking)

`_parse_date` per-value dayfirst; i18n name guards; mojibake fixpoint /
sequence-plausibility; backlog sources: CMS API, NHTSA, Canada contracts, Matelda
~6,670 pairs, GLEIF/USDA vocabs, WDVC-16. Reasoning-class artifacts (RADAR) are
explicitly out of protocol class — frontier-model territory.

## 10. Where deeper detail lives

`docs/PRODUCT.md` (trust contract) · `docs/SOTA.md` + `docs/ROADMAP_SOTA2.md`
(position + research map) · `docs/CAPABILITY_GRADES.md` (12-agent adversarial
grading + must-fix ledger) · `docs/WILD_BENCH.md` / `docs/PAIRED_BENCH.md` /
`docs/GITTABLES_AUDIT.md` / `docs/DATASETS.md` (per-bench detail + licenses) ·
`docs/NIGHT_LOG.md` (stage-3 timeline) · `project-memory/` (agent memory snapshot).