ScrubData planner — Qwen3-4B fine-tuned for tabular cleaning plans

A ≤4B-parameter planner for hands-off data cleaning: it reads an aggregated column profile (per-value frequency counts) and emits a structured JSON cleaning plan that a deterministic pandas executor applies. Built for the Build Small Hackathon (🏡 Backyard AI · Tiny Titan · Well-Tuned).
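As a rough illustration of what an aggregated column profile could look like, here is a minimal sketch using only the standard library. The function name `profile_column` and the field names `n_rows`, `n_distinct`, and `value_counts` are assumptions made for this example; the actual profile schema lives in the Space repo and may differ.

```python
from collections import Counter


def profile_column(values, top_k=50):
    """Aggregate a column into per-value frequency counts (hypothetical schema)."""
    # Missing cells are counted under the empty string so they stay visible.
    counts = Counter("" if v is None else str(v).strip() for v in values)
    return {
        "n_rows": len(values),
        "n_distinct": len(counts),
        # Most frequent values first; the long tail is where typos tend to sit.
        "value_counts": counts.most_common(top_k),
    }


city = ["Boston", "Boston", "boston", "Bostn", "Chicago", "Chicago", None]
profile = profile_column(city)
```

The point of profiling is that the planner sees counts rather than raw rows, so the prompt size depends on the number of distinct values instead of the table length.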

Live demo: https://huggingface.co/spaces/build-small-hackathon/scrubdata · Code/paper: see the Space repo (docs/paper/) · Traces: build-small-hackathon/scrubdata-traces

What's special about the training data

Every training example is execution-verified: a candidate (dirty table, plan) pair is kept only if running the executor on it provably recovers the known-clean table. Mix: synthetic high-cardinality categorical tables (Zipf long-tail + realistic typos) + 20% real-derived pairs from the Raha benchmarks (cell-aligned, learnable canonicalizations only).
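The execution-verification filter can be sketched as follows. This is a toy stand-in, not the project's executor: the `map_values` op name, the plan layout, and the helper names `apply_plan` and `keep_verified` are all hypothetical, and tables are plain lists of dicts instead of pandas frames.

```python
def apply_plan(rows, plan):
    """Toy executor: apply per-column value mappings (hypothetical op format)."""
    cleaned = []
    for row in rows:
        new_row = dict(row)
        for step in plan["ops"]:
            if step["op"] == "map_values":
                col = step["column"]
                # Values absent from the mapping are left unchanged.
                new_row[col] = step["mapping"].get(new_row[col], new_row[col])
        cleaned.append(new_row)
    return cleaned


def keep_verified(candidates, clean):
    """Keep only (dirty, plan) pairs whose execution reproduces the clean table."""
    return [(dirty, plan) for dirty, plan in candidates if apply_plan(dirty, plan) == clean]


clean = [{"city": "Boston"}, {"city": "Boston"}, {"city": "Chicago"}]
dirty = [{"city": "Boston"}, {"city": "Bostn"}, {"city": "Chicago"}]
good_plan = {"ops": [{"op": "map_values", "column": "city", "mapping": {"Bostn": "Boston"}}]}
bad_plan = {"ops": [{"op": "map_values", "column": "city", "mapping": {"Chicago": "Boston"}}]}

kept = keep_verified([(dirty, good_plan), (dirty, bad_plan)], clean)
```

Only the pair with `good_plan` survives, because the other plan leaves the typo in place and also damages a correct value.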

Shipped composition (WS1, verified union planner): in the product, every model-proposed mapping is scored by a deterministic verifier (errors-are-rare frequency gates, variant similarity, reference agreement; threshold SCRUBDATA_TAU, default 0.5), and the mappings that pass are unioned with the grounded heuristic. Measured on the 509 real errors in the hospital dataset: 0.905 precision at 0.413 coverage (gated model plan alone: 0.993 at 0.287, i.e. 146/147 committed changes correct; seed-robust: 0.891 ± 0.012 at 0.396 ± 0.025 over 3 training seeds). Dropped merges become review flags, never silent skips.
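A minimal sketch of the gate-then-union step, under stated assumptions: the real verifier combines frequency gates, variant similarity, and reference agreement, whereas the `similarity_score` below uses string similarity alone as a stand-in. The function names, and the choice to let the heuristic win when both sources map the same variant, are assumptions for illustration; only the `SCRUBDATA_TAU` variable and its 0.5 default come from this card.

```python
import os
from difflib import SequenceMatcher

# Threshold named in the card; default 0.5.
TAU = float(os.environ.get("SCRUBDATA_TAU", "0.5"))


def similarity_score(variant, canonical):
    """Stand-in verifier: string similarity only (the real one uses more signals)."""
    return SequenceMatcher(None, variant, canonical).ratio()


def verified_union(model_map, heuristic_map, score=similarity_score, tau=TAU):
    """Union heuristic mappings with model mappings that pass the gate."""
    committed = dict(heuristic_map)
    review_flags = []
    for variant, canonical in model_map.items():
        if variant in committed:
            continue  # assumption: the grounded heuristic takes precedence
        if score(variant, canonical) >= tau:
            committed[variant] = canonical
        else:
            # Dropped merges are surfaced for review, never silently skipped.
            review_flags.append((variant, canonical))
    return committed, review_flags


heuristic_map = {"boston": "Boston"}
model_map = {"Bostn": "Boston", "Chicago": "Boston"}
committed, review_flags = verified_union(model_map, heuristic_map, tau=0.5)
```

Here `Bostn` passes the gate and is committed, while the implausible `Chicago` to `Boston` merge is routed to the review list.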

Measured

  • Canonicalization micro-F1 0.90 (vs 0.45 for a much larger zero-shot generic model, 0.13 for a rule heuristic) on frozen held-out gold.
  • Real hospital typos (Raha, OOD): repair recall rises from 0.00 to 0.42 once the 20% real-derived pairs are added (synthetic-only training fails to transfer, documented honestly).
  • In production the model is wrapped with reference grounding + calibrated abstention (it never free-generates a canonical for a grounded column type).
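For readers who want to relate the reported figures, the sketch below shows how micro-averaged precision, recall, and F1 are computed by pooling counts before dividing. The per-column counts in `columns` are invented for illustration. The last call uses numbers from this card (146 of 147 committed changes correct, 509 known errors) and assumes that coverage means correct committed changes divided by known errors, which reproduces the reported 0.993 and 0.287.

```python
def micro_prf(counts):
    """Pool TP/FP/FN over all items, then compute precision, recall, F1 once."""
    tp = sum(c["tp"] for c in counts)
    fp = sum(c["fp"] for c in counts)
    fn = sum(c["fn"] for c in counts)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return precision, recall, f1


# Hypothetical per-column counts, for illustration only.
columns = [
    {"tp": 8, "fp": 1, "fn": 1},
    {"tp": 1, "fp": 0, "fn": 1},
]
precision, recall, f1 = micro_prf(columns)

# Card figures: 146 correct out of 147 committed changes, 509 known errors.
gated_precision, gated_coverage, _ = micro_prf([{"tp": 146, "fp": 1, "fn": 509 - 146}])
```

Micro-averaging weights every item equally, so a large column with many errors influences the score more than a small one.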

How to run

Ollama / llama.cpp (recommended): use the non-thinking Modelfile from the Space repo (notebooks/Modelfile). Q8_0 GGUF: ricalanis/scrubdata-qwen3-4b-v4-q8 (Q4_K_M corrupts this model on Unsloth 2026.6.x exports — use Q8_0).

Transformers (bf16 + adapter): suppress the tool-call tokens at decode time or the base model's tool-calling prior dominates:

model.generate(..., suppress_tokens=[151657, 151658])  # <tool_call>, </tool_call>

Limitations

English-centric; plans use a closed op vocabulary; canonicalization quality on entity columns depends on the reference taxonomy's coverage; not a de-identification guarantee.
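Because plans use a closed op vocabulary, a consumer can reject anything outside it before execution. The following is a sketch only: the op names in `ALLOWED_OPS`, the `ops` key, and the `parse_plan` helper are hypothetical and do not reflect the actual vocabulary, which is defined in the Space repo.

```python
import json

# Hypothetical vocabulary; the real op names are defined by the executor.
ALLOWED_OPS = {"map_values", "strip_whitespace", "flag_for_review"}


def parse_plan(raw):
    """Parse a JSON plan and refuse any op outside the closed vocabulary."""
    plan = json.loads(raw)
    unknown = sorted({step["op"] for step in plan["ops"]} - ALLOWED_OPS)
    if unknown:
        raise ValueError(f"ops outside the closed vocabulary: {unknown}")
    return plan


raw_plan = '{"ops": [{"op": "map_values", "column": "city", "mapping": {"Bostn": "Boston"}}]}'
plan = parse_plan(raw_plan)
```

Validating up front keeps the executor deterministic: an unexpected op fails loudly instead of being ignored or improvised.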
