---
title: ScrubData
emoji: 🏔️
colorFrom: green
colorTo: indigo
sdk: gradio
sdk_version: 6.16.0
app_file: server.py
pinned: true
license: mit
tags:
- track:backyard
- sponsor:openai
- sponsor:modal
- achievement:offgrid
- achievement:welltuned
- achievement:offbrand
- achievement:llama
- achievement:sharing
- achievement:fieldnotes
---
# ScrubData — hands-off data cleaning, with the receipts
Entry for the **Build Small Hackathon** (Gradio · Hugging Face), 🏡 Backyard AI track.
Runs a ≤4B model — a local-runnable GGUF, no third-party AI APIs → also in the running for
**Tiny Titan**, **Off-Brand**, **Best Demo**, **Best Agent**, and **Bonus Quest Champion**
(all six quests claimed above).
<!-- SUBMISSION LINKS (all set for June 15):
Demo video: https://www.loom.com/share/2fa868147527496e8097d82dd546d663 [DONE]
Social post: https://x.com/ric_alanis/status/2066598533738692983 [DONE]
These links + this write-up are required by the build-small-hackathon /submit tool. -->
> **Hosted demo vs. local — read this.** This Space is a **no-install demo** that cleans with
> the real **Qwen3-4B fine-tune** by default (served on an A100 GPU, ~1 min/clean warm; first
> run after idle ~2 min on cold start) — the whole point
> is the small model doing the work. Your file is processed on Hugging Face / the GPU endpoint
> (sent to no third-party API, not stored); untick the box for an instant deterministic pass.
> The **privacy story is a property of running it yourself**: `SCRUBDATA_MODEL=scrubdata-ft uv
> run server.py` reads and cleans your file on-device with the same fine-tune — nothing leaves
> your machine. The app labels its own mode honestly (the ribbon says which one you're using).
> Same auditable plan→verify→execute pipeline either way.
> **Modal** (`sponsor:modal`): the hosted Space cleans with the Qwen3-4B fine-tune served from a
> **scale-to-zero Modal GPU endpoint** (`scripts/modal_serve.py`, Ollama on an A100; $0 when idle,
> pre-warmed on page load to hide the cold start). Modal also drove the headless training +
> evaluation loop behind the published model. The deterministic planner is the silent fallback
> if the GPU is cold or down, so the demo never hard-fails.
> **Drop a messy export. Get clean data back — every change named, reversible, and
> explained. Anything sensitive is protected locally. The judgment calls stay yours.**
>
> For the office/ops person trying to do their job while their data is a mess.
**Built by:** [@ricalanis](https://huggingface.co/ricalanis) (solo) · 🤗 Hugging Face: `ricalanis`
**Live Space:** https://huggingface.co/spaces/build-small-hackathon/scrubdata
**Code (open source):** https://github.com/ricalanis/scrubdata-hackathon
**Demo video:** https://www.loom.com/share/2fa868147527496e8097d82dd546d663
**Write-up / post:** https://x.com/ric_alanis/status/2066598533738692983
## How it works
A small local model is the **planner**, never a row-by-row editor:
1. **Profile** — pandas aggregates each column into a value–frequency distribution
(scale-invariant: a million rows profile like a hundred).
2. **Plan** — the model reads the profile and emits a structured JSON cleaning plan:
canonicalization mappings, format fixes, dedup, anomaly flags.
3. **Ground** — canonical forms are never invented: values reconcile against reference
taxonomies (GeoNames 196k cities, ISO countries/states, and a pluggable **entity
reference** built from harvested vocabularies — ToughTables/MusicBrainz/Wikidata/ROR,
~100k entities) with fuzzy retrieval; ambiguous matches **abstain** and surface for
human review (calibrated: 90% precision at the default threshold, ≥95% at 0.91).
Profiles carry **suspect_values** — rare anomalous surfaces with evidence-backed
candidates — so high-cardinality columns are no longer invisible to the planner
(measured: five all-unique-surface benchmark tables went 0.0 → 0.96 F1 at zero damage).
4. **Verify** — every model-proposed mapping is scored by deterministic evidence
(errors-are-rare frequency gates, variant similarity, reference agreement); entries
below the confidence threshold (`SCRUBDATA_TAU`, default 0.5) become review flags
instead of edits. The shipped **verified union planner** (gated model plan ∪ grounded
heuristic) measures **0.905 precision @ 0.413 coverage** on hospital's 509 real errors
— the gated model plan alone is 0.993 @ 0.287.
5. **Protect** — PII is detected locally (Luhn/IBAN checksums + a 44M OpenMed-PII
classifier): cards/SSNs masked format-preservingly, contacts flagged, **0/360 residual
PII** after masking in our leak test.
6. **Execute** — deterministic pandas applies the plan. No silent edits, by construction;
every run exports an audit trail (OpenTelemetry-GenAI spans + open traces).
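
The Profile step above can be sketched in a few lines of pandas. This is a minimal illustration, not the repository's profiler: the function names, the `top_k` cutoff, and the output keys are all assumptions. It shows why a profile built from relative frequencies stays the same size whether the table has a hundred rows or a million.

```python
import pandas as pd


def profile_column(series: pd.Series, top_k: int = 20) -> dict:
    """Summarize one column as a value-frequency distribution (hypothetical helper)."""
    values = series.dropna().astype(str).str.strip()
    counts = values.value_counts()  # sorted by descending frequency
    total = int(counts.sum())
    return {
        "n_distinct": int(counts.size),
        "null_fraction": float(series.isna().mean()),
        # Relative shares, so the profile does not grow with the row count.
        "top_values": [
            {"value": str(value), "share": round(float(count) / total, 4)}
            for value, count in counts.head(top_k).items()
        ],
    }


def profile_table(df: pd.DataFrame) -> dict:
    """Profile every column of a table (hypothetical helper)."""
    return {col: profile_column(df[col]) for col in df.columns}


sample = pd.DataFrame({"country": ["Nigeria", "Nigeria", "nigeia", None]})
profile = profile_table(sample)
print(profile["country"]["top_values"])
```

A rare surface such as `nigeia` next to a dominant `Nigeria` is the kind of signal a planner can act on without ever reading individual rows.
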
**Model:** `Qwen3-4B-Instruct-2507` (Tiny Titan), QLoRA fine-tuned on **execution-verified**
synthetic + real-derived data (every training plan provably recovers the clean table),
runnable via llama.cpp GGUF.
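
For the Protect step, the sketch below shows one way a Luhn checksum can gate format-preserving masking of card numbers. It is an assumption-laden illustration: the candidate regex, the `keep_last=4` policy, and the function names are hypothetical, and it covers cards only (the IBAN, SSN, and classifier tiers described above are not shown).

```python
import re

# Assumed candidate pattern: 13 to 19 digits, optionally separated by
# single spaces or hyphens. The real detector may use a different pattern.
CARD_CANDIDATE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def luhn_valid(digits: str) -> bool:
    """Return True if a string of digits passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def mask_cards(text: str, keep_last: int = 4) -> str:
    """Mask Luhn-valid card numbers, keeping separators and the last digits."""

    def _mask(match: re.Match) -> str:
        raw = match.group(0)
        digits = re.sub(r"\D", "", raw)
        if not luhn_valid(digits):
            return raw  # fails the checksum: leave the text untouched
        to_hide = len(digits) - keep_last
        out = []
        for ch in raw:
            if ch.isdigit() and to_hide > 0:
                out.append("*")
                to_hide -= 1
            else:
                out.append(ch)  # separators and trailing digits survive
        return "".join(out)

    return CARD_CANDIDATE.sub(_mask, text)


print(mask_cards("card 4111-1111-1111-1111 on file"))
```

Because the checksum runs before any masking, digit runs that merely look like cards (order numbers, phone-like strings) are left alone, and the masked value keeps its original length and separators.
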
## The app (what judges see)
A custom `gr.Server` frontend (no default Gradio chrome — the **Off-Brand** quest), built
around the trust story:
- **YOUR CALL cards** — when the model is genuinely torn (e.g. *Slovia → Slovakia 86% vs
Slovenia 86%*) it abstains and hands you the tie with both candidates; pick the right one
and **stage several decisions**, then "✓ Clean now" replays them as one plan.
- **Named, reversible receipts** — every edit shows as a row in the audit grid with its op +
rationale and a before/after diff; nothing is silent.
- **PII review cards** — embedded cards/SSNs (Luhn/strict-regex) flagged and masked
format-preservingly, on-device.
- **Save / replay recipe** — export the cleaning plan as JSON and re-apply it to next week's
export in one click (the "Monday ritual").
- **Honest, self-aware copy** — the app injects its own runtime state and the ribbon says
exactly which planner ran and where your data was processed.
- **A fun, size-aware ETA timer** + cold-start readiness gate + page-load GPU pre-warm, so
the model path feels responsive and never lies about progress.
- Drag-and-drop, two bundled sample exports, mobile-responsive layout.
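
The save/replay recipe and the audit grid can be pictured with a small sketch. The plan layout below (`steps`, `op`, `mapping`, `reason`) is a hypothetical schema, not the one the app exports, and plain dictionaries stand in for the pandas executor so the example has no dependencies.

```python
import json

# Hypothetical plan layout; the schema exported by the app may differ.
plan = {
    "steps": [
        {"op": "map_values", "column": "country",
         "mapping": {"nigeia": "Nigeria"}, "reason": "rare misspelling"},
        {"op": "lowercase", "column": "email", "reason": "normalize case"},
    ]
}


def apply_plan(rows, plan):
    """Apply a plan to a list of row dicts; return (cleaned_rows, audit_log)."""
    cleaned = [dict(row) for row in rows]  # copy, so the input is never mutated
    audit = []
    for step in plan["steps"]:
        col = step["column"]
        for i, row in enumerate(cleaned):
            before = row.get(col)
            if before is None:
                continue
            if step["op"] == "map_values":
                after = step["mapping"].get(before, before)
            elif step["op"] == "lowercase":
                after = before.lower()
            else:
                raise ValueError(f"unknown op: {step['op']}")
            if after != before:
                row[col] = after
                # One receipt per changed cell: enough to explain or undo it.
                audit.append({"row": i, "column": col, "op": step["op"],
                              "before": before, "after": after,
                              "reason": step["reason"]})
    return cleaned, audit


recipe = json.dumps(plan)      # "save recipe"
replayed = json.loads(recipe)  # "replay" against a later export
next_week = [{"country": "nigeia", "email": "Ana@GMAIL.com"},
             {"country": "Nigeria", "email": "bo@example.com"}]
cleaned, audit = apply_plan(next_week, replayed)
print(cleaned[0], len(audit))
```

Since the plan is data rather than code, the same JSON can be reviewed, versioned, and re-applied, and each audit entry carries the `before` value needed to reverse the edit.
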
## What real users told us (and what we changed)
Before submission we put the live Space in front of people who **aren't** data folks — the
exact audience the tool is for — and sent the link with one line: *"if you have a messy
spreadsheet, try it."* The most useful finding wasn't a bug. It was that the word
**"cleaning" didn't land**:
- One tester read "clean my Excel" as *deleting* data:
*"¿Te refieres a que elimine algo de algún archivo?"* — "You mean it removes something
from the file?"
- Another didn't know where to begin:
*"¿eso del Excel te lo subimos ahí o cómo?"* — "the Excel thing, do we upload it there,
or how?"
- The clearest explanation in the whole thread was one we had to type by hand in chat:
*"it fixes text errors — names, phones, emails, cities."* That sentence wasn't anywhere
in the product.
So we changed the product to **show** what cleaning means instead of naming it:
- the hero now leads with a literal before→after strip
(`nigeia → Nigeria`, `Calfornia → California`, `Ana@GMAIL.com → ana@gmail.com`,
`415.555.0192 → (415) 555-0192`) so the value is obvious *before* any upload;
- the headline is the sentence that worked in chat — **"Fix the messy text in your
spreadsheet"** — and the copy says plainly **"I never delete your data"** (killing the
"does it erase things?" misread);
- a one-click **"watch it run on a sample file"** path removes the "where do I start?" wall;
- jargon labels are gone ("HR payroll (with PII)" → "an HR file with sensitive data").
The sample is small and informal (friends and network, ~3 people), so this isn't a usability
*study*. But the feedback was real, it pointed at a failure of the *framing* rather than the
engine, and it changed the build. The persona "Maria" below is the controlled walk-through; the
quotes above are verbatim from people we know.
## Measured (not vibes)
- **Canonicalization micro-F1 0.90 (best single run; 0.80 ± 0.01 over 3 training seeds)** for the 4B
fine-tune vs **0.45** for a much larger generic model vs **0.15** for rules.
- Real errors (5-benchmark macro): grounded cleaning reaches REAL-F1 **0.225**, 3.9×
OpenRefine kNN (0.058) and 5.7× fingerprint (0.039); the verified-union gate repairs
41% of hospital's 509 real errors at **0.905 precision**, every declined merge
surfaced for review.
- Evaluated on a **65-dataset suite** (Raha benchmarks + seeded error injection over 15
open-data domains) with a churn-neutral metric that can't be gamed by mass rewriting.
- Full write-up: `docs/paper/` (preprint draft) · details in `eval/README.md`.
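
To make the "precision @ coverage" figures concrete, here is a minimal sketch of a threshold gate and the two numbers it trades off. The definitions are assumptions for illustration (precision as correct applied edits over applied edits, coverage as correct applied edits over all real errors); the evaluation code in `eval/` may define them differently. The data is a toy example, not the benchmark.

```python
from dataclasses import dataclass


@dataclass
class Proposal:
    cell: tuple        # (row, column) the edit targets
    new_value: str
    confidence: float  # evidence score in [0, 1]


def gate(proposals, tau=0.5):
    """Split proposals into applied edits and review flags at threshold tau."""
    applied = [p for p in proposals if p.confidence >= tau]
    flagged = [p for p in proposals if p.confidence < tau]
    return applied, flagged


def precision_coverage(applied, gold):
    """gold maps each truly erroneous cell to its correct value."""
    correct = sum(1 for p in applied if gold.get(p.cell) == p.new_value)
    precision = correct / len(applied) if applied else 1.0
    coverage = correct / len(gold) if gold else 0.0
    return precision, coverage


# Toy data for illustration only.
gold = {(0, "city"): "Boston", (1, "city"): "Austin",
        (2, "city"): "Denver", (3, "city"): "Dallas"}
proposals = [
    Proposal((0, "city"), "Boston", 0.92),
    Proposal((1, "city"), "Austin", 0.71),
    Proposal((2, "city"), "Dover", 0.55),   # wrong, but above the default tau
    Proposal((3, "city"), "Dallas", 0.30),  # right, but below the default tau
]

applied, flagged = gate(proposals)  # default tau mirrors SCRUBDATA_TAU=0.5
precision, coverage = precision_coverage(applied, gold)
print(round(precision, 3), coverage, len(flagged))
```

Raising `tau` in this toy case to 0.6 drops the wrong `Dover` edit into the review queue, lifting precision to 1.0 while coverage stays at 0.5; sweeping the threshold this way traces the kind of curve the calibration figures summarize.
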
## Run it
```bash
uv sync
uv run server.py # gr.Server + custom UI (grounded heuristic)
# fine-tuned model as planner (needs Ollama + the GGUF, see notebooks/Modelfile):
ollama pull hf.co/ricalanis/scrubdata-qwen3-4b-v6-q8:Q8_0
ollama create scrubdata-ft -f notebooks/Modelfile
SCRUBDATA_MODEL=scrubdata-ft uv run server.py # model planner, heuristic fallback (on-device)
SCRUBDATA_PII_NER=1 uv run server.py # +44M NER for name/address columns
uv run python -m scrubdata.cli messy.csv -o clean.csv --plan plan.json
uv run pytest tests/ # engine + scorer tests (69)
```
The hosted Space serves the same fine-tune from a scale-to-zero **Modal A100**
(`scripts/modal_serve.py`) and the planner adds `format=json` on that path
(`SCRUBDATA_OLLAMA_FORMAT_JSON=1`) to grammar-constrain the GGUF on the A100's kernels.
`scripts/modal_warm.py on|off` pins/un-pins a warm container (no cold start) without a
redeploy — leave it `off` (scale-to-zero, $0 idle), flip `on` for a live judging window.
## Repo map
- `scrubdata/` — `profiler` · `planner` · `reconcile` (reference grounding + abstain) ·
`grounded` (RACOON wrapper) · `verifier` (selective prediction + union planner) ·
`pair_profile` (candidate-constrained canonicalization, opt-in) · `pii` (checksum +
NER tiers, mask/hash/pseudonymize) · `executor` · `observability` · `trace` ·
`baselines` (OpenRefine) · `cli`.
- `training/` — execution-verified synthetic generator + real-data derivation
(`real_data.py`: paired benchmarks + frequency-derived unpaired open data).
- `eval/` — frozen gold · wide suite + double-macro north-star (`run_real_multi.py`) ·
ablations · calibration (risk–coverage) · PII leak test.
- `docs/paper/` — preprint: *Verified Cleaning Plans: Plan-Level Selective Prediction
Turns Local LLM Planners into Trustworthy Table Cleaners*.
- `scripts/` — Modal train/eval (headless GPU loop), trace publishing.
## Research & resources
Everything behind the demo is public:
- 🚀 **Live Space** — https://huggingface.co/spaces/build-small-hackathon/scrubdata
- 💻 **Code (open source)** — https://github.com/ricalanis/scrubdata-hackathon
- 🧠 **Fine-tuned model** — https://huggingface.co/ricalanis/scrubdata-qwen3-4b
(Q8_0 GGUF: https://huggingface.co/ricalanis/scrubdata-qwen3-4b-v6-q8)
- 📊 **WildClean dataset** (real-world dirty tables + injected-error benches) —
https://huggingface.co/datasets/ricalanis/wildclean
- 🔍 **Agent traces** (OpenTelemetry-GenAI spans from real runs) —
https://huggingface.co/datasets/build-small-hackathon/scrubdata-traces
- 📄 **Preprint** — *Verified Cleaning Plans: Plan-Level Selective Prediction Turns Local
LLM Planners into Trustworthy Table Cleaners* (`docs/paper/main.pdf`)
- 📓 **Field notes** (the build story, failures included) — `docs/FIELD_NOTES.md`
- 🛠️ **Tool reference** (the whole system, end to end) — `docs/TOOL_REFERENCE.md`
## Built with Codex
The final review-and-refine pass used **OpenAI Codex** (gpt-5.5) as a reviewer / last
refiner — not to write the product, but to harden it. It added the executor's
never-corrupt-clean-data regression tests, made column sanitization collision-proof,
did the accessibility pass (ARIA + keyboard + reduced-motion + focus-visible), and wrote
characterization tests for the reference matcher. Every change was human-reviewed and
verified green (84 tests, golden behavior unchanged) before commit; the commits are
attributed to `@codex` in the git history.
## Submission checklist (verified against the build-small-hackathon `/submit` tool)
- [x] Public Gradio Space in the `build-small-hackathon` org
- [x] Every model ≤ 32B (here ≤ 4B → **Tiny Titan**-eligible): `Qwen3-4B-Instruct-2507`
- [x] README `tags:` set — `track:backyard` + all six `achievement:*` quests (above)
- [x] **Off the Grid** (`offgrid`) — no third-party AI APIs; the planner is a local-runnable GGUF (Qwen3-4B). Self-hosted = fully on-device (zero external egress); the hosted demo serves the *same* model from a self-managed Modal GPU, not a SaaS API
- [x] **Well-Tuned** (`welltuned`) — fine-tune published: `ricalanis/scrubdata-qwen3-4b` (+ `-v6-q8` GGUF)
- [x] **Off-Brand** (`offbrand`) — custom `gr.Server` HTML/CSS frontend, not default Gradio
- [x] **Llama Champion** (`llama`) — runs through llama.cpp (Q8_0 GGUF)
- [x] **Sharing is Caring** (`sharing`) — agent traces on the Hub: `build-small-hackathon/scrubdata-traces`
- [x] **Field Notes** (`fieldnotes`) — build report: `docs/FIELD_NOTES.md`
- [x] Write-up in this README (idea + tech)
- [x] **Demo video** link in README: https://www.loom.com/share/2fa868147527496e8097d82dd546d663
- [x] **Social post** link in README: https://x.com/ric_alanis/status/2066598533738692983
- [x] Confirm deadline time/timezone — **June 15 2026, 23:59 UTC** (confirmed on the hackathon page)
Judged (no tag needed, just qualify): Tiny Titan · Off-Brand prize · Best Demo · Best Agent · Bonus Quest Champion.