Spaces:
Running
A newer version of the Gradio SDK is available: 6.19.0
ScrubData β The Profound Tool Reference
The single local document that explains the whole system: what it is, why every piece exists, where every number comes from, and what we learned building it. Written at the close of the research domain (2026-06-12). The paper (
docs/paper/main.tex) is the citable account; THIS file is the operational one.
1. What ScrubData is
ScrubData is a zero-config, zero-label, local tabular data-cleaning system built around one architectural commitment: the model never touches data.
A profiler aggregates each column into a bounded value-frequency profile; a small (β€4B, locally-run) fine-tuned planner proposes a JSON cleaning plan; a deterministic pandas executor applies it. The plan is the complete, inspectable, reversible specification of every change. Three consequences define the product:
- No silent edits by construction β every changed cell traces to a named, logged operation (verified at scale: 0 silent edits across 35 wild tables and a 239-table GitTables trust audit).
- Abstention is first-class β anything below confidence becomes a review flag ("YOUR CALL" card in the UI), never a quiet skip and never a guess.
- Profile-not-rows scaling β the prompt scales with distinct values, not rows; a million-row table profiles like a hundred-row one, and no cell values leave the machine.
The central finding (load-bearing, repeatedly measured)
Model weights contribute approximately nothing to never-seen-table generalization in this protocol class. Five SFT retrains (v7βv10 + mixes, 109k harvested real alias pairs) and a three-arm GRPO pilot (executor as verifiable reward, including a random-reward control that reproduced the same format drift) all failed to move held-out generalization. Every measured gain came from deterministic machinery gated by the plan-level verifier (Β§5). Corroborated independently by Spreadsheet-RL, arXiv:2601.05009, and arXiv:2606.02866. Practical corollary: to improve ScrubData, write a deterministic capability and gate it with the verifier; do not collect more training data.
2. The shipped pipeline (scrubdata/active.py::get_planner)
ββββββββββββββββββββββββββββββββββββββββββββββββ
df βββΊ profiler βββΊ β model path (only if SCRUBDATA_MODEL is set) β
(bounded β batched (4 cols/call) local Ollama planner β
profile incl. β β per-batch fallback to heuristic on error β
suspects) β β grounded (reference taxonomies, RACOON) β
β β verify_plan(tau=SCRUBDATA_TAU, def 0.5) β
βββββββββββββββββ¬βββββββββββββββββββββββββββββββ
β union_plans (model wins per surface;
β inherits deterministic ops + table ops)
heuristic mock_plan ββββββββββββ
βΌ
executor.apply_plan β (clean_df, change_log)
βΌ
report.render_report Β· trace.log_run Β· observability
- No model configured β
mock_plan(grounded deterministic heuristic) alone. The app always produces a plan; the model is an upgrade, never a dependency. - Measured operating point (hospital, 509 real errors): union 0.905 precision @ 0.413 coverage; gated model alone 0.993 @ 0.287; 3-seed 0.891Β±0.012 @ 0.396Β±0.025. Precision flat 0.89β0.91 for Οβ[0.2,0.8].
Entry points: uv run server.py (FastAPI + UI), app.py (HF Space/Gradio),
scrubdata/cli.py (scrubdata <file.csv> -o out.csv --report r.md --plan p.json).
Environment variables
| Var | Default | Meaning |
|---|---|---|
SCRUBDATA_MODEL |
unset | local Ollama model id (e.g. scrubdata-ft-v6); unset = heuristic only |
SCRUBDATA_TAU |
0.5 |
per-entry verifier threshold on model mappings |
SCRUBDATA_HC_TAU |
0.8 |
stricter bar for heuristic suspect-mappings (no model cross-check there) |
SCRUBDATA_PAIR_PROFILES |
off | WS2 candidate-constrained planning (measured redundant with verifier; off by default) |
SCRUBDATA_PII_NER |
off | OpenMed-PII 44M NER tier on top of deterministic validators |
3. Module map (scrubdata/)
| Module | Role | Key facts |
|---|---|---|
profiler.py |
column β bounded profile | VALUE_COUNTS_CAP=80 (high-card cols: top-8 only) + suspect_values section (the visibility fix); truncated_values count keeps honesty about what's hidden |
detect.py |
typing + issue predicates | detect_semantic_type (zip/ZCTA/Excel-serial guards), date_formats_consistent (collapses digit AND alpha runs; 90% dominant-shape), percent_formats_consistent (90%), has_mojibake, is_missing |
planner.py |
deterministic heuristic planner | mock_plan, _column_operations, _suspect_canonicalize (Ο_hc=0.8), detect_entity_groups (cross-row voting detection), emits fix_encoding BEFORE strip_whitespace (order-critical), off_convention_dates visible-abstention flags |
executor.py |
the only thing that touches cells | op dispatch (Β§4); unknown ops are no-ops (forward-compatible); returns (df, change_log); resolve_by_majority table op lives here |
verifier.py |
WS1 selective prediction | entry_confidence (3 hard gates, Β§5.0), verify_plan (also enforces convention gates on MODEL-emitted parse_date/parse_percent β the model path otherwise bypasses them), union_plans (order-preserving op inheritance via reversed(inherit)) |
reconcile.py |
reference grounding | ReferenceIndex, default_index() loads toughtables_ref (contamination-guarded: excludes the 8 benchmark tables) + MusicBrainz hints + Wikidata companies + ROR; infer_reference_type needs β₯20% exact entity hits (over-fire guard); falls back to training/harvests/ for Space/clone parity |
grounded.py |
RACOON wrapper | model never free-generates a canonical for a reference-typed column |
pair_profile.py |
suspects + WS2 candidates | suspects_for_column (β€25/col, bounded: 4k rare cap + cheap prefilters before SequenceMatcher β 40minβ24s fix), candidate_pairs, constrain_plan |
model_planner.py |
Ollama backends | make_local_ollama_planner, make_batched_planner(batch_size=4), JSON extraction |
prompt.py |
prompt/training contract | _profile_for_prompt (compact suspects), build_chat_example (training-data side of the same contract β change one, regenerate the other) |
pii.py |
PII second task | deterministic validators (Luhn, IBAN, phone) + allowlist + coverage vote; optional 44M NER; mask/hash/pseudonymize |
active.py |
THE composition | get_planner() β Β§2 |
cli.py / report.py / trace.py / observability.py |
UX + audit | CLI, markdown report, JSONL traces, monitor summary/OTel span |
baselines.py |
OpenRefine kNN/fingerprint reimplementations | the zero-config comparison class |
refdata/cities.txt |
seed gazetteer | plus everything in training/harvests/*.jsonl |
4. Operation vocabulary (the executor's closed set)
Column ops (_apply_column_op): strip_whitespace, normalize_punctuation,
fix_encoding (lossless cp1252/latin-1βutf8 round-trip, mojibake-marker-reduction
gated), normalize_disguised_nulls, parse_currency, parse_number,
parse_percent (abstains on bare values β no /100 corruption),
parse_date, standardize_boolean, standardize_phone (7-digit β DDD-DDDD),
normalize_email, standardize_case, canonicalize_categories (mapping-driven;
the verifier's subject), flag_pii (log-only), mask_pii, hash_pii,
pseudonymize_pii. Unknown op β no-op.
Table ops: drop_empty_columns, drop_empty_rows, drop_exact_duplicates,
resolve_by_majority (Β§5.3).
Op-order invariant: fix_encoding must precede whitespace/punctuation ops β
they destroy the UTF-8 byte patterns repair needs (grader-reproduced bug; fixed in
both heuristic emission and union inheritance).
5. The five deterministic capabilities (what actually generalizes)
5.0 Plan-level verifier (WS1) β verifier.entry_confidence
Every non-grounded canonicalize_categories entry rawβcanon is scored with
three HARD gates, each killing a measured hospital failure class:
- errors are rare:
freq(raw) β₯ 3β 0.0 (frequent = legit data; "de kalb"Γ92) - repair to dominance only:
freq(canon) < max(2, 2Β·freq(raw))β 0.0 ("yexβyexu", typo mapped to a worse typo) - code discipline: digit-bearing values repair only if letter-part similarity
β₯0.85 AND digits identical (allows
amix-2βami-2, blocksak_βal_) Survivors scoresim Γ (0.5 + 0.5Β·support); below-Ο entries become review flags.
5.1 Suspect surfacing (visibility) β pair_profile.suspects_for_column
The 80-value profile cap structurally hides high-cardinality dirty cells from ANY
planner (proved by the v8/v9 retrains: more data couldn't fix what the model
couldn't see). Every text-ish column profile now carries β€25 suspect_values:
rare surfaces + evidence-backed candidates (frequency dominance, edit similarity,
reference membership). The heuristic maps suspects clearing entry_confidence β₯ SCRUBDATA_HC_TAU=0.8; the rest become flags.
5.2 Generic entity reference β reconcile.default_index
Open vocabularies (ToughTables-derived ref [8 bench tables excluded], MusicBrainz search-hint misspellings, RxNorm, Wikidata companies, ROR, GeoNames, OpenFlights, O*NET, nicknames) as a pluggable reference type. Typing requires β₯20% exact hits of distinct values (fuzzy coverage alone over-fires on name-like columns β measured). Cracked the all-unique regime: 5 ToughTables tables 0 β 0.955β0.957 F1 at 0.0000 damage (~62k corrections) β where no in-column frequency signal exists at all.
5.3 Cross-row majority voting β planner.detect_entity_groups + resolve_by_majority
Tables repeating a real-world entity across rows (flights reported by many sources) carry their own repair signal. Detection: compact-token key columns, median multiplicity 3β30, β₯2 votable string columns with majority-bearing disagreement + β₯2 distinct majorities, date-share β€0.3 guard. Execution: resolve thin dissenting minorities to group majority; skips missing-like keys; min_share/min_group clamped. False-consensus guard: mean minority share β₯0.25 β decline (legitimate correlated updates, not reporting errors β a flat volume cap was measured to destroy the legitimate regime and replaced). Measured: flights heuristic 0.044β0.164 F1; hospital heuristic 0.092β0.186.
5.4 Convention conservatism β detect.*_formats_consistent + verify_plan
Never re-format an internally consistent column: date/percent ops gated on
dominant-shape inconsistency (digit+alpha runs collapsed, 90% rule); zip/postal
names never typed phone/date; Excel-serial typing needs a date-suggestive name.
Suppressed minorities surface as off_convention_dates flags. The verifier
enforces the same gates on model plans at the verification boundary (the model
path otherwise bypasses heuristic emission gates entirely).
6. Evaluation (how every number regenerates)
One scoring contract β eval/run_real_multi.py::score() β churn-neutral,
convention-tolerant: sem-equal = numeric-tolerant OR strip+casefold equal; pure
case/whitespace churn counts as nothing; a fix requires acting; damage =
clean cells corrupted / clean cells; silent edits = changed columns minus
log-attributed columns (must be 0).
| Harness | Command | What it measures | Current numbers |
|---|---|---|---|
| Money table | python -m eval.run_real_multi |
65-set suite, 3 seeds | grounded NORTH 0.224Β±0.004; REAL-F1 0.225 vs OR-kNN 0.058 (HEAD 2026-06-12 regen; freeze was 0.203/0.174) |
| WS1 gate | python -m eval.precision_curve --plan eval/results/v6_hospital_raw_plan.json --union |
precisionβcoverage curve | 0.905 @ 0.413 (Ο=0.5) |
| Paired bench | python -m eval.paired_bench |
42 dirty/gold pairs | unseen-35 macro F1 0.363 @ dmg 0.0219 |
| Wild bench | python -m eval.wild_bench |
35 uncurated tables, behavioral + inject-recovery | recovery 0.207; 0 silent edits |
| Trust audit | python -m eval.gittables_audit |
239 GitTables clean-lake | 239/239 valid, 0 crashes, 0 silent edits |
| Generalization | python -m eval.generalization |
held-out-source (train: hospital/beers/movies_1 Β· eval: flights/rayyan/ed2) | GEN-F1 0.058, VR 0.108, dmg 0.036 |
| RADAR board | python -m eval.radar_bench |
regime boundaries by artifact type | abstains on missingness β; reasoning-class = frontier territory |
| Baselines | eval/run_baran.py, modal run scripts/modal_jellyfish.py |
disclosed-protocol comparisons | Baran (oracle+20 labels) 0.811; Jellyfish-13B 0.074 |
| Calibration / PII | eval.calibration, eval.pii_leak |
abstention quality / leak test | AURC 0.120, ECE 0.169; 0/360 residual PII |
Eval-source discipline: TRAIN_SOURCES["v6"]={hospital,beers,movies_1}; EVAL_SOURCES={flights,rayyan,ed2_restaurants}. Never crossed.
7. Model & artifacts
| Artifact | Where | Notes |
|---|---|---|
| Champion adapter | Modal volume scrubdata-v5-adapter /v5_seed21 (= "v6") |
survived v7βv10 challenges + GRPO |
| Merged model | hf.co/ricalanis/scrubdata-qwen3-4b |
card carries the v2 finding |
| Q8 GGUF | hf.co/ricalanis/scrubdata-qwen3-4b-v6-q8 |
Q8_0 only β Q4_K_M corrupts (Unsloth 2026.6.x); non-thinking Modelfile required (notebooks/Modelfile); suppress tokens 151657/151658 under transformers |
| Benchmark | hf.co/datasets/ricalanis/wildclean |
33 redistributable pairs + loaders.py for 9 license-gated + gittables250 + 10 vocabs + frozen results; first cleaning bench with damage + silent-edit accounting |
| Demo | hf.co/spaces/build-small-hackathon/scrubdata |
deploy = HfApi.upload_folder of git archive HEAD β NO GitHub auto-sync |
| Paper | docs/paper/main.tex + numbers.tex |
compile: ~/.local/bin/tectonic main.tex (no pdflatex on this machine) |
| Vocabs | training/harvests/*.jsonl (15MB, 13 files) |
loader falls back here for clone parity |
Modal patterns: --detach for anything long; results land in Modal Dicts
(scrubdata-train-results, scrubdata-eval-v5-results, scrubdata-suite-results).
Budget status at domain close: ~$187 of $212 ceiling β Modal HALTED.
8. Negative results ledger (measured, do not re-litigate)
- v7βv10 SFT retrains: 109k harvested alias pairs, episode mixes, suspects contract β GEN flat/worse. Mixing harvested pairs dilutes executor-verified synthetic skill (monotonic dilution law across mix ratios; mixH 0.677).
- GRPO pilot, 3 arms (main, KL-anchored v2, random-reward control): all
degrade format at 4B/LoRA/$30 scale; the control proved the drift is an RL
artifact (cf. "Spurious Rewards"). Published RLVR wins used real infra
(verl, 4ΓH100Γ40h). Episodes corpus (600,
training/build_grpo_episodes.py) + hand-rolled loop (scripts/modal_grpo.py) committed for a future attempt. - Uniform verification of existing low-card mappings (A1 per-class thresholds): 0.905β0.890 β reverted.
- Strict entity-typing thresholds (0.90/0.05): cost more than bought β reverted.
- WS2 candidate constraining composed with verifier: 0.876 @ 0.387 < union at same Ο β redundant gating of the same failure class; available, off by default.
- Flat volume cap on cross-row voting: destroyed the legitimate dense-disagreement regime β replaced by the false-consensus guard.
- Frozen-gold synthetic yardstick predates the suspects prompt contract β regenerate gold before ever quoting synthetic canon_f1 again.
9. Known-open (graded non-blocking)
_parse_date per-value dayfirst; i18n name guards; mojibake fixpoint /
sequence-plausibility; backlog sources: CMS API, NHTSA, Canada contracts, Matelda
~6,670 pairs, GLEIF/USDA vocabs, WDVC-16. Reasoning-class artifacts (RADAR) are
explicitly out of protocol class β frontier-model territory.
10. Where deeper detail lives
docs/PRODUCT.md (trust contract) Β· docs/SOTA.md + docs/ROADMAP_SOTA2.md
(position + research map) Β· docs/CAPABILITY_GRADES.md (12-agent adversarial
grading + must-fix ledger) Β· docs/WILD_BENCH.md / docs/PAIRED_BENCH.md /
docs/GITTABLES_AUDIT.md / docs/DATASETS.md (per-bench detail + licenses) Β·
docs/NIGHT_LOG.md (stage-3 timeline) Β· project-memory/ (agent memory snapshot).