Spaces:

build-small-hackathon
/

scrubdata

Running

App Files Files Community

scrubdata / docs /TOOL_REFERENCE.md

OpenAI Codex

deploy: add sponsor:openai tag (Best Use of Codex) + Codex-hardened build

16dc556 15 days ago

preview code

Raw

History Blame Contribute Delete

16.9 kB

A newer version of the Gradio SDK is available: 6.19.0

Upgrade

ScrubData — The Profound Tool Reference

The single local document that explains the whole system: what it is, why every piece exists, where every number comes from, and what we learned building it. Written at the close of the research domain (2026-06-12). The paper (docs/paper/main.tex) is the citable account; THIS file is the operational one.

1. What ScrubData is

ScrubData is a zero-config, zero-label, local tabular data-cleaning system built around one architectural commitment: the model never touches data.

A profiler aggregates each column into a bounded value-frequency profile; a small (≤4B, locally-run) fine-tuned planner proposes a JSON cleaning plan; a deterministic pandas executor applies it. The plan is the complete, inspectable, reversible specification of every change. Three consequences define the product:

No silent edits by construction — every changed cell traces to a named, logged operation (verified at scale: 0 silent edits across 35 wild tables and a 239-table GitTables trust audit).
Abstention is first-class — anything below confidence becomes a review flag ("YOUR CALL" card in the UI), never a quiet skip and never a guess.
Profile-not-rows scaling — the prompt scales with distinct values, not rows; a million-row table profiles like a hundred-row one, and no cell values leave the machine.

The central finding (load-bearing, repeatedly measured)

Model weights contribute approximately nothing to never-seen-table generalization in this protocol class. Five SFT retrains (v7–v10 + mixes, 109k harvested real alias pairs) and a three-arm GRPO pilot (executor as verifiable reward, including a random-reward control that reproduced the same format drift) all failed to move held-out generalization. Every measured gain came from deterministic machinery gated by the plan-level verifier (§5). Corroborated independently by Spreadsheet-RL, arXiv:2601.05009, and arXiv:2606.02866. Practical corollary: to improve ScrubData, write a deterministic capability and gate it with the verifier; do not collect more training data.

2. The shipped pipeline (`scrubdata/active.py::get_planner`)

                       ┌──────────────────────────────────────────────┐
 df ──► profiler ──►   │  model path (only if SCRUBDATA_MODEL is set) │
        (bounded       │  batched (4 cols/call) local Ollama planner  │
        profile incl.  │  → per-batch fallback to heuristic on error  │
        suspects)      │  → grounded (reference taxonomies, RACOON)   │
                       │  → verify_plan(tau=SCRUBDATA_TAU, def 0.5)   │
                       └───────────────┬──────────────────────────────┘
                                       │  union_plans (model wins per surface;
                                       │  inherits deterministic ops + table ops)
        heuristic mock_plan ───────────┘
                                       ▼
                        executor.apply_plan → (clean_df, change_log)
                                       ▼
                 report.render_report · trace.log_run · observability

No model configured → mock_plan (grounded deterministic heuristic) alone. The app always produces a plan; the model is an upgrade, never a dependency.
Measured operating point (hospital, 509 real errors): union 0.905 precision @ 0.413 coverage; gated model alone 0.993 @ 0.287; 3-seed 0.891±0.012 @ 0.396±0.025. Precision flat 0.89–0.91 for τ∈[0.2,0.8].

Entry points: uv run server.py (FastAPI + UI), app.py (HF Space/Gradio), scrubdata/cli.py (scrubdata <file.csv> -o out.csv --report r.md --plan p.json).

Environment variables

Var	Default	Meaning
`SCRUBDATA_MODEL`	unset	local Ollama model id (e.g. `scrubdata-ft-v6`); unset = heuristic only
`SCRUBDATA_TAU`	`0.5`	per-entry verifier threshold on model mappings
`SCRUBDATA_HC_TAU`	`0.8`	stricter bar for heuristic suspect-mappings (no model cross-check there)
`SCRUBDATA_PAIR_PROFILES`	off	WS2 candidate-constrained planning (measured redundant with verifier; off by default)
`SCRUBDATA_PII_NER`	off	OpenMed-PII 44M NER tier on top of deterministic validators

3. Module map (`scrubdata/`)

Module	Role	Key facts
`profiler.py`	column → bounded profile	`VALUE_COUNTS_CAP=80` (high-card cols: top-8 only) + `suspect_values` section (the visibility fix); `truncated_values` count keeps honesty about what's hidden
`detect.py`	typing + issue predicates	`detect_semantic_type` (zip/ZCTA/Excel-serial guards), `date_formats_consistent` (collapses digit AND alpha runs; 90% dominant-shape), `percent_formats_consistent` (90%), `has_mojibake`, `is_missing`
`planner.py`	deterministic heuristic planner	`mock_plan`, `_column_operations`, `_suspect_canonicalize` (τ_hc=0.8), `detect_entity_groups` (cross-row voting detection), emits `fix_encoding` BEFORE `strip_whitespace` (order-critical), `off_convention_dates` visible-abstention flags
`executor.py`	the only thing that touches cells	op dispatch (§4); unknown ops are no-ops (forward-compatible); returns `(df, change_log)`; `resolve_by_majority` table op lives here
`verifier.py`	WS1 selective prediction	`entry_confidence` (3 hard gates, §5.0), `verify_plan` (also enforces convention gates on MODEL-emitted parse_date/parse_percent — the model path otherwise bypasses them), `union_plans` (order-preserving op inheritance via `reversed(inherit)`)
`reconcile.py`	reference grounding	`ReferenceIndex`, `default_index()` loads toughtables_ref (contamination-guarded: excludes the 8 benchmark tables) + MusicBrainz hints + Wikidata companies + ROR; `infer_reference_type` needs ≥20% exact entity hits (over-fire guard); falls back to `training/harvests/` for Space/clone parity
`grounded.py`	RACOON wrapper	model never free-generates a canonical for a reference-typed column
`pair_profile.py`	suspects + WS2 candidates	`suspects_for_column` (≤25/col, bounded: 4k rare cap + cheap prefilters before SequenceMatcher — 40min→24s fix), `candidate_pairs`, `constrain_plan`
`model_planner.py`	Ollama backends	`make_local_ollama_planner`, `make_batched_planner(batch_size=4)`, JSON extraction
`prompt.py`	prompt/training contract	`_profile_for_prompt` (compact suspects), `build_chat_example` (training-data side of the same contract — change one, regenerate the other)
`pii.py`	PII second task	deterministic validators (Luhn, IBAN, phone) + allowlist + coverage vote; optional 44M NER; `mask/hash/pseudonymize`
`active.py`	THE composition	`get_planner()` — §2
`cli.py` / `report.py` / `trace.py` / `observability.py`	UX + audit	CLI, markdown report, JSONL traces, monitor summary/OTel span
`baselines.py`	OpenRefine kNN/fingerprint reimplementations	the zero-config comparison class
`refdata/cities.txt`	seed gazetteer	plus everything in `training/harvests/*.jsonl`

4. Operation vocabulary (the executor's closed set)

Column ops (_apply_column_op): strip_whitespace, normalize_punctuation, fix_encoding (lossless cp1252/latin-1↔utf8 round-trip, mojibake-marker-reduction gated), normalize_disguised_nulls, parse_currency, parse_number, parse_percent (abstains on bare values — no /100 corruption), parse_date, standardize_boolean, standardize_phone (7-digit → DDD-DDDD), normalize_email, standardize_case, canonicalize_categories (mapping-driven; the verifier's subject), flag_pii (log-only), mask_pii, hash_pii, pseudonymize_pii. Unknown op → no-op.

Table ops: drop_empty_columns, drop_empty_rows, drop_exact_duplicates, resolve_by_majority (§5.3).

Op-order invariant: fix_encoding must precede whitespace/punctuation ops — they destroy the UTF-8 byte patterns repair needs (grader-reproduced bug; fixed in both heuristic emission and union inheritance).

5. The five deterministic capabilities (what actually generalizes)

5.0 Plan-level verifier (WS1) — `verifier.entry_confidence`

Every non-grounded canonicalize_categories entry raw→canon is scored with three HARD gates, each killing a measured hospital failure class:

errors are rare: freq(raw) ≥ 3 → 0.0 (frequent = legit data; "de kalb"×92)
repair to dominance only: freq(canon) < max(2, 2·freq(raw)) → 0.0 ("yex→yexu", typo mapped to a worse typo)
code discipline: digit-bearing values repair only if letter-part similarity ≥0.85 AND digits identical (allows amix-2→ami-2, blocks ak_→al_) Survivors score sim × (0.5 + 0.5·support); below-τ entries become review flags.

5.1 Suspect surfacing (visibility) — `pair_profile.suspects_for_column`

The 80-value profile cap structurally hides high-cardinality dirty cells from ANY planner (proved by the v8/v9 retrains: more data couldn't fix what the model couldn't see). Every text-ish column profile now carries ≤25 suspect_values: rare surfaces + evidence-backed candidates (frequency dominance, edit similarity, reference membership). The heuristic maps suspects clearing entry_confidence ≥ SCRUBDATA_HC_TAU=0.8; the rest become flags.

5.2 Generic entity reference — `reconcile.default_index`

Open vocabularies (ToughTables-derived ref [8 bench tables excluded], MusicBrainz search-hint misspellings, RxNorm, Wikidata companies, ROR, GeoNames, OpenFlights, O*NET, nicknames) as a pluggable reference type. Typing requires ≥20% exact hits of distinct values (fuzzy coverage alone over-fires on name-like columns — measured). Cracked the all-unique regime: 5 ToughTables tables 0 → 0.955–0.957 F1 at 0.0000 damage (~62k corrections) — where no in-column frequency signal exists at all.

5.3 Cross-row majority voting — `planner.detect_entity_groups` + `resolve_by_majority`

Tables repeating a real-world entity across rows (flights reported by many sources) carry their own repair signal. Detection: compact-token key columns, median multiplicity 3–30, ≥2 votable string columns with majority-bearing disagreement + ≥2 distinct majorities, date-share ≤0.3 guard. Execution: resolve thin dissenting minorities to group majority; skips missing-like keys; min_share/min_group clamped. False-consensus guard: mean minority share ≥0.25 → decline (legitimate correlated updates, not reporting errors — a flat volume cap was measured to destroy the legitimate regime and replaced). Measured: flights heuristic 0.044→0.164 F1; hospital heuristic 0.092→0.186.

5.4 Convention conservatism — `detect.*_formats_consistent` + `verify_plan`

Never re-format an internally consistent column: date/percent ops gated on dominant-shape inconsistency (digit+alpha runs collapsed, 90% rule); zip/postal names never typed phone/date; Excel-serial typing needs a date-suggestive name. Suppressed minorities surface as off_convention_dates flags. The verifier enforces the same gates on model plans at the verification boundary (the model path otherwise bypasses heuristic emission gates entirely).

6. Evaluation (how every number regenerates)

One scoring contract — eval/run_real_multi.py::score() — churn-neutral, convention-tolerant: sem-equal = numeric-tolerant OR strip+casefold equal; pure case/whitespace churn counts as nothing; a fix requires acting; damage = clean cells corrupted / clean cells; silent edits = changed columns minus log-attributed columns (must be 0).

Harness	Command	What it measures	Current numbers
Money table	`python -m eval.run_real_multi`	65-set suite, 3 seeds	grounded NORTH 0.224±0.004; REAL-F1 0.225 vs OR-kNN 0.058 (HEAD 2026-06-12 regen; freeze was 0.203/0.174)
WS1 gate	`python -m eval.precision_curve --plan eval/results/v6_hospital_raw_plan.json --union`	precision–coverage curve	0.905 @ 0.413 (τ=0.5)
Paired bench	`python -m eval.paired_bench`	42 dirty/gold pairs	unseen-35 macro F1 0.363 @ dmg 0.0219
Wild bench	`python -m eval.wild_bench`	35 uncurated tables, behavioral + inject-recovery	recovery 0.207; 0 silent edits
Trust audit	`python -m eval.gittables_audit`	239 GitTables clean-lake	239/239 valid, 0 crashes, 0 silent edits
Generalization	`python -m eval.generalization`	held-out-source (train: hospital/beers/movies_1 · eval: flights/rayyan/ed2)	GEN-F1 0.058, VR 0.108, dmg 0.036
RADAR board	`python -m eval.radar_bench`	regime boundaries by artifact type	abstains on missingness ✓; reasoning-class = frontier territory
Baselines	`eval/run_baran.py`, `modal run scripts/modal_jellyfish.py`	disclosed-protocol comparisons	Baran (oracle+20 labels) 0.811; Jellyfish-13B 0.074
Calibration / PII	`eval.calibration`, `eval.pii_leak`	abstention quality / leak test	AURC 0.120, ECE 0.169; 0/360 residual PII

Eval-source discipline: TRAIN_SOURCES["v6"]={hospital,beers,movies_1}; EVAL_SOURCES={flights,rayyan,ed2_restaurants}. Never crossed.

7. Model & artifacts

Artifact	Where	Notes
Champion adapter	Modal volume `scrubdata-v5-adapter` `/v5_seed21` (= "v6")	survived v7–v10 challenges + GRPO
Merged model	`hf.co/ricalanis/scrubdata-qwen3-4b`	card carries the v2 finding
Q8 GGUF	`hf.co/ricalanis/scrubdata-qwen3-4b-v6-q8`	Q8_0 only — Q4_K_M corrupts (Unsloth 2026.6.x); non-thinking Modelfile required (`notebooks/Modelfile`); suppress tokens 151657/151658 under transformers
Benchmark	`hf.co/datasets/ricalanis/wildclean`	33 redistributable pairs + loaders.py for 9 license-gated + gittables250 + 10 vocabs + frozen results; first cleaning bench with damage + silent-edit accounting
Demo	`hf.co/spaces/build-small-hackathon/scrubdata`	deploy = `HfApi.upload_folder` of `git archive HEAD` — NO GitHub auto-sync
Paper	`docs/paper/main.tex` + `numbers.tex`	compile: `~/.local/bin/tectonic main.tex` (no pdflatex on this machine)
Vocabs	`training/harvests/*.jsonl` (15MB, 13 files)	loader falls back here for clone parity

Modal patterns: --detach for anything long; results land in Modal Dicts (scrubdata-train-results, scrubdata-eval-v5-results, scrubdata-suite-results). Budget status at domain close: ~$187 of $212 ceiling — Modal HALTED.

8. Negative results ledger (measured, do not re-litigate)

v7–v10 SFT retrains: 109k harvested alias pairs, episode mixes, suspects contract — GEN flat/worse. Mixing harvested pairs dilutes executor-verified synthetic skill (monotonic dilution law across mix ratios; mixH 0.677).
GRPO pilot, 3 arms (main, KL-anchored v2, random-reward control): all degrade format at 4B/LoRA/$30 scale; the control proved the drift is an RL artifact (cf. "Spurious Rewards"). Published RLVR wins used real infra (verl, 4×H100×40h). Episodes corpus (600, training/build_grpo_episodes.py) + hand-rolled loop (scripts/modal_grpo.py) committed for a future attempt.
Uniform verification of existing low-card mappings (A1 per-class thresholds): 0.905→0.890 — reverted.
Strict entity-typing thresholds (0.90/0.05): cost more than bought — reverted.
WS2 candidate constraining composed with verifier: 0.876 @ 0.387 < union at same τ — redundant gating of the same failure class; available, off by default.
Flat volume cap on cross-row voting: destroyed the legitimate dense-disagreement regime — replaced by the false-consensus guard.
Frozen-gold synthetic yardstick predates the suspects prompt contract — regenerate gold before ever quoting synthetic canon_f1 again.

9. Known-open (graded non-blocking)

_parse_date per-value dayfirst; i18n name guards; mojibake fixpoint / sequence-plausibility; backlog sources: CMS API, NHTSA, Canada contracts, Matelda ~6,670 pairs, GLEIF/USDA vocabs, WDVC-16. Reasoning-class artifacts (RADAR) are explicitly out of protocol class — frontier-model territory.

10. Where deeper detail lives

docs/PRODUCT.md (trust contract) · docs/SOTA.md + docs/ROADMAP_SOTA2.md (position + research map) · docs/CAPABILITY_GRADES.md (12-agent adversarial grading + must-fix ledger) · docs/WILD_BENCH.md / docs/PAIRED_BENCH.md / docs/GITTABLES_AUDIT.md / docs/DATASETS.md (per-bench detail + licenses) · docs/NIGHT_LOG.md (stage-3 timeline) · project-memory/ (agent memory snapshot).