scrubdata / docs /TOOL_REFERENCE.md
OpenAI Codex
deploy: add sponsor:openai tag (Best Use of Codex) + Codex-hardened build
16dc556
|
Raw
History Blame Contribute Delete
16.9 kB

A newer version of the Gradio SDK is available: 6.19.0

Upgrade

ScrubData β€” The Profound Tool Reference

The single local document that explains the whole system: what it is, why every piece exists, where every number comes from, and what we learned building it. Written at the close of the research domain (2026-06-12). The paper (docs/paper/main.tex) is the citable account; THIS file is the operational one.


1. What ScrubData is

ScrubData is a zero-config, zero-label, local tabular data-cleaning system built around one architectural commitment: the model never touches data.

A profiler aggregates each column into a bounded value-frequency profile; a small (≀4B, locally-run) fine-tuned planner proposes a JSON cleaning plan; a deterministic pandas executor applies it. The plan is the complete, inspectable, reversible specification of every change. Three consequences define the product:

  1. No silent edits by construction β€” every changed cell traces to a named, logged operation (verified at scale: 0 silent edits across 35 wild tables and a 239-table GitTables trust audit).
  2. Abstention is first-class β€” anything below confidence becomes a review flag ("YOUR CALL" card in the UI), never a quiet skip and never a guess.
  3. Profile-not-rows scaling β€” the prompt scales with distinct values, not rows; a million-row table profiles like a hundred-row one, and no cell values leave the machine.

The central finding (load-bearing, repeatedly measured)

Model weights contribute approximately nothing to never-seen-table generalization in this protocol class. Five SFT retrains (v7–v10 + mixes, 109k harvested real alias pairs) and a three-arm GRPO pilot (executor as verifiable reward, including a random-reward control that reproduced the same format drift) all failed to move held-out generalization. Every measured gain came from deterministic machinery gated by the plan-level verifier (Β§5). Corroborated independently by Spreadsheet-RL, arXiv:2601.05009, and arXiv:2606.02866. Practical corollary: to improve ScrubData, write a deterministic capability and gate it with the verifier; do not collect more training data.


2. The shipped pipeline (scrubdata/active.py::get_planner)

                       β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
 df ──► profiler ──►   β”‚  model path (only if SCRUBDATA_MODEL is set) β”‚
        (bounded       β”‚  batched (4 cols/call) local Ollama planner  β”‚
        profile incl.  β”‚  β†’ per-batch fallback to heuristic on error  β”‚
        suspects)      β”‚  β†’ grounded (reference taxonomies, RACOON)   β”‚
                       β”‚  β†’ verify_plan(tau=SCRUBDATA_TAU, def 0.5)   β”‚
                       β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                       β”‚  union_plans (model wins per surface;
                                       β”‚  inherits deterministic ops + table ops)
        heuristic mock_plan β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                       β–Ό
                        executor.apply_plan β†’ (clean_df, change_log)
                                       β–Ό
                 report.render_report Β· trace.log_run Β· observability
  • No model configured β†’ mock_plan (grounded deterministic heuristic) alone. The app always produces a plan; the model is an upgrade, never a dependency.
  • Measured operating point (hospital, 509 real errors): union 0.905 precision @ 0.413 coverage; gated model alone 0.993 @ 0.287; 3-seed 0.891Β±0.012 @ 0.396Β±0.025. Precision flat 0.89–0.91 for Ο„βˆˆ[0.2,0.8].

Entry points: uv run server.py (FastAPI + UI), app.py (HF Space/Gradio), scrubdata/cli.py (scrubdata <file.csv> -o out.csv --report r.md --plan p.json).

Environment variables

Var Default Meaning
SCRUBDATA_MODEL unset local Ollama model id (e.g. scrubdata-ft-v6); unset = heuristic only
SCRUBDATA_TAU 0.5 per-entry verifier threshold on model mappings
SCRUBDATA_HC_TAU 0.8 stricter bar for heuristic suspect-mappings (no model cross-check there)
SCRUBDATA_PAIR_PROFILES off WS2 candidate-constrained planning (measured redundant with verifier; off by default)
SCRUBDATA_PII_NER off OpenMed-PII 44M NER tier on top of deterministic validators

3. Module map (scrubdata/)

Module Role Key facts
profiler.py column β†’ bounded profile VALUE_COUNTS_CAP=80 (high-card cols: top-8 only) + suspect_values section (the visibility fix); truncated_values count keeps honesty about what's hidden
detect.py typing + issue predicates detect_semantic_type (zip/ZCTA/Excel-serial guards), date_formats_consistent (collapses digit AND alpha runs; 90% dominant-shape), percent_formats_consistent (90%), has_mojibake, is_missing
planner.py deterministic heuristic planner mock_plan, _column_operations, _suspect_canonicalize (Ο„_hc=0.8), detect_entity_groups (cross-row voting detection), emits fix_encoding BEFORE strip_whitespace (order-critical), off_convention_dates visible-abstention flags
executor.py the only thing that touches cells op dispatch (Β§4); unknown ops are no-ops (forward-compatible); returns (df, change_log); resolve_by_majority table op lives here
verifier.py WS1 selective prediction entry_confidence (3 hard gates, Β§5.0), verify_plan (also enforces convention gates on MODEL-emitted parse_date/parse_percent β€” the model path otherwise bypasses them), union_plans (order-preserving op inheritance via reversed(inherit))
reconcile.py reference grounding ReferenceIndex, default_index() loads toughtables_ref (contamination-guarded: excludes the 8 benchmark tables) + MusicBrainz hints + Wikidata companies + ROR; infer_reference_type needs β‰₯20% exact entity hits (over-fire guard); falls back to training/harvests/ for Space/clone parity
grounded.py RACOON wrapper model never free-generates a canonical for a reference-typed column
pair_profile.py suspects + WS2 candidates suspects_for_column (≀25/col, bounded: 4k rare cap + cheap prefilters before SequenceMatcher β€” 40minβ†’24s fix), candidate_pairs, constrain_plan
model_planner.py Ollama backends make_local_ollama_planner, make_batched_planner(batch_size=4), JSON extraction
prompt.py prompt/training contract _profile_for_prompt (compact suspects), build_chat_example (training-data side of the same contract β€” change one, regenerate the other)
pii.py PII second task deterministic validators (Luhn, IBAN, phone) + allowlist + coverage vote; optional 44M NER; mask/hash/pseudonymize
active.py THE composition get_planner() β€” Β§2
cli.py / report.py / trace.py / observability.py UX + audit CLI, markdown report, JSONL traces, monitor summary/OTel span
baselines.py OpenRefine kNN/fingerprint reimplementations the zero-config comparison class
refdata/cities.txt seed gazetteer plus everything in training/harvests/*.jsonl

4. Operation vocabulary (the executor's closed set)

Column ops (_apply_column_op): strip_whitespace, normalize_punctuation, fix_encoding (lossless cp1252/latin-1↔utf8 round-trip, mojibake-marker-reduction gated), normalize_disguised_nulls, parse_currency, parse_number, parse_percent (abstains on bare values β€” no /100 corruption), parse_date, standardize_boolean, standardize_phone (7-digit β†’ DDD-DDDD), normalize_email, standardize_case, canonicalize_categories (mapping-driven; the verifier's subject), flag_pii (log-only), mask_pii, hash_pii, pseudonymize_pii. Unknown op β†’ no-op.

Table ops: drop_empty_columns, drop_empty_rows, drop_exact_duplicates, resolve_by_majority (Β§5.3).

Op-order invariant: fix_encoding must precede whitespace/punctuation ops β€” they destroy the UTF-8 byte patterns repair needs (grader-reproduced bug; fixed in both heuristic emission and union inheritance).


5. The five deterministic capabilities (what actually generalizes)

5.0 Plan-level verifier (WS1) β€” verifier.entry_confidence

Every non-grounded canonicalize_categories entry raw→canon is scored with three HARD gates, each killing a measured hospital failure class:

  • errors are rare: freq(raw) β‰₯ 3 β†’ 0.0 (frequent = legit data; "de kalb"Γ—92)
  • repair to dominance only: freq(canon) < max(2, 2Β·freq(raw)) β†’ 0.0 ("yexβ†’yexu", typo mapped to a worse typo)
  • code discipline: digit-bearing values repair only if letter-part similarity β‰₯0.85 AND digits identical (allows amix-2β†’ami-2, blocks ak_β†’al_) Survivors score sim Γ— (0.5 + 0.5Β·support); below-Ο„ entries become review flags.

5.1 Suspect surfacing (visibility) β€” pair_profile.suspects_for_column

The 80-value profile cap structurally hides high-cardinality dirty cells from ANY planner (proved by the v8/v9 retrains: more data couldn't fix what the model couldn't see). Every text-ish column profile now carries ≀25 suspect_values: rare surfaces + evidence-backed candidates (frequency dominance, edit similarity, reference membership). The heuristic maps suspects clearing entry_confidence β‰₯ SCRUBDATA_HC_TAU=0.8; the rest become flags.

5.2 Generic entity reference β€” reconcile.default_index

Open vocabularies (ToughTables-derived ref [8 bench tables excluded], MusicBrainz search-hint misspellings, RxNorm, Wikidata companies, ROR, GeoNames, OpenFlights, O*NET, nicknames) as a pluggable reference type. Typing requires β‰₯20% exact hits of distinct values (fuzzy coverage alone over-fires on name-like columns β€” measured). Cracked the all-unique regime: 5 ToughTables tables 0 β†’ 0.955–0.957 F1 at 0.0000 damage (~62k corrections) β€” where no in-column frequency signal exists at all.

5.3 Cross-row majority voting β€” planner.detect_entity_groups + resolve_by_majority

Tables repeating a real-world entity across rows (flights reported by many sources) carry their own repair signal. Detection: compact-token key columns, median multiplicity 3–30, β‰₯2 votable string columns with majority-bearing disagreement + β‰₯2 distinct majorities, date-share ≀0.3 guard. Execution: resolve thin dissenting minorities to group majority; skips missing-like keys; min_share/min_group clamped. False-consensus guard: mean minority share β‰₯0.25 β†’ decline (legitimate correlated updates, not reporting errors β€” a flat volume cap was measured to destroy the legitimate regime and replaced). Measured: flights heuristic 0.044β†’0.164 F1; hospital heuristic 0.092β†’0.186.

5.4 Convention conservatism β€” detect.*_formats_consistent + verify_plan

Never re-format an internally consistent column: date/percent ops gated on dominant-shape inconsistency (digit+alpha runs collapsed, 90% rule); zip/postal names never typed phone/date; Excel-serial typing needs a date-suggestive name. Suppressed minorities surface as off_convention_dates flags. The verifier enforces the same gates on model plans at the verification boundary (the model path otherwise bypasses heuristic emission gates entirely).


6. Evaluation (how every number regenerates)

One scoring contract β€” eval/run_real_multi.py::score() β€” churn-neutral, convention-tolerant: sem-equal = numeric-tolerant OR strip+casefold equal; pure case/whitespace churn counts as nothing; a fix requires acting; damage = clean cells corrupted / clean cells; silent edits = changed columns minus log-attributed columns (must be 0).

Harness Command What it measures Current numbers
Money table python -m eval.run_real_multi 65-set suite, 3 seeds grounded NORTH 0.224Β±0.004; REAL-F1 0.225 vs OR-kNN 0.058 (HEAD 2026-06-12 regen; freeze was 0.203/0.174)
WS1 gate python -m eval.precision_curve --plan eval/results/v6_hospital_raw_plan.json --union precision–coverage curve 0.905 @ 0.413 (Ο„=0.5)
Paired bench python -m eval.paired_bench 42 dirty/gold pairs unseen-35 macro F1 0.363 @ dmg 0.0219
Wild bench python -m eval.wild_bench 35 uncurated tables, behavioral + inject-recovery recovery 0.207; 0 silent edits
Trust audit python -m eval.gittables_audit 239 GitTables clean-lake 239/239 valid, 0 crashes, 0 silent edits
Generalization python -m eval.generalization held-out-source (train: hospital/beers/movies_1 Β· eval: flights/rayyan/ed2) GEN-F1 0.058, VR 0.108, dmg 0.036
RADAR board python -m eval.radar_bench regime boundaries by artifact type abstains on missingness βœ“; reasoning-class = frontier territory
Baselines eval/run_baran.py, modal run scripts/modal_jellyfish.py disclosed-protocol comparisons Baran (oracle+20 labels) 0.811; Jellyfish-13B 0.074
Calibration / PII eval.calibration, eval.pii_leak abstention quality / leak test AURC 0.120, ECE 0.169; 0/360 residual PII

Eval-source discipline: TRAIN_SOURCES["v6"]={hospital,beers,movies_1}; EVAL_SOURCES={flights,rayyan,ed2_restaurants}. Never crossed.


7. Model & artifacts

Artifact Where Notes
Champion adapter Modal volume scrubdata-v5-adapter /v5_seed21 (= "v6") survived v7–v10 challenges + GRPO
Merged model hf.co/ricalanis/scrubdata-qwen3-4b card carries the v2 finding
Q8 GGUF hf.co/ricalanis/scrubdata-qwen3-4b-v6-q8 Q8_0 only β€” Q4_K_M corrupts (Unsloth 2026.6.x); non-thinking Modelfile required (notebooks/Modelfile); suppress tokens 151657/151658 under transformers
Benchmark hf.co/datasets/ricalanis/wildclean 33 redistributable pairs + loaders.py for 9 license-gated + gittables250 + 10 vocabs + frozen results; first cleaning bench with damage + silent-edit accounting
Demo hf.co/spaces/build-small-hackathon/scrubdata deploy = HfApi.upload_folder of git archive HEAD β€” NO GitHub auto-sync
Paper docs/paper/main.tex + numbers.tex compile: ~/.local/bin/tectonic main.tex (no pdflatex on this machine)
Vocabs training/harvests/*.jsonl (15MB, 13 files) loader falls back here for clone parity

Modal patterns: --detach for anything long; results land in Modal Dicts (scrubdata-train-results, scrubdata-eval-v5-results, scrubdata-suite-results). Budget status at domain close: ~$187 of $212 ceiling β€” Modal HALTED.


8. Negative results ledger (measured, do not re-litigate)

  1. v7–v10 SFT retrains: 109k harvested alias pairs, episode mixes, suspects contract β€” GEN flat/worse. Mixing harvested pairs dilutes executor-verified synthetic skill (monotonic dilution law across mix ratios; mixH 0.677).
  2. GRPO pilot, 3 arms (main, KL-anchored v2, random-reward control): all degrade format at 4B/LoRA/$30 scale; the control proved the drift is an RL artifact (cf. "Spurious Rewards"). Published RLVR wins used real infra (verl, 4Γ—H100Γ—40h). Episodes corpus (600, training/build_grpo_episodes.py) + hand-rolled loop (scripts/modal_grpo.py) committed for a future attempt.
  3. Uniform verification of existing low-card mappings (A1 per-class thresholds): 0.905β†’0.890 β€” reverted.
  4. Strict entity-typing thresholds (0.90/0.05): cost more than bought β€” reverted.
  5. WS2 candidate constraining composed with verifier: 0.876 @ 0.387 < union at same Ο„ β€” redundant gating of the same failure class; available, off by default.
  6. Flat volume cap on cross-row voting: destroyed the legitimate dense-disagreement regime β€” replaced by the false-consensus guard.
  7. Frozen-gold synthetic yardstick predates the suspects prompt contract β€” regenerate gold before ever quoting synthetic canon_f1 again.

9. Known-open (graded non-blocking)

_parse_date per-value dayfirst; i18n name guards; mojibake fixpoint / sequence-plausibility; backlog sources: CMS API, NHTSA, Canada contracts, Matelda ~6,670 pairs, GLEIF/USDA vocabs, WDVC-16. Reasoning-class artifacts (RADAR) are explicitly out of protocol class β€” frontier-model territory.

10. Where deeper detail lives

docs/PRODUCT.md (trust contract) Β· docs/SOTA.md + docs/ROADMAP_SOTA2.md (position + research map) Β· docs/CAPABILITY_GRADES.md (12-agent adversarial grading + must-fix ledger) Β· docs/WILD_BENCH.md / docs/PAIRED_BENCH.md / docs/GITTABLES_AUDIT.md / docs/DATASETS.md (per-bench detail + licenses) Β· docs/NIGHT_LOG.md (stage-3 timeline) Β· project-memory/ (agent memory snapshot).