Spaces:

build-small-hackathon
/

scrubdata

Running

App Files Files Community

scrubdata / docs /FIELD_NOTES.md

OpenAI Codex

deploy: add sponsor:openai tag (Best Use of Codex) + Codex-hardened build

16dc556 16 days ago

preview code

Raw

History Blame Contribute Delete

8.49 kB

	# Field notes — building ScrubData small, on purpose

	*Build Small Hackathon, June 2026. A ≤4B model, a Gradio Space, and two weeks of
	finding out what "small but honest" actually costs.*

	## The bet

	The person who most needs data cleaning — the ops coordinator with a messy CRM export
	and a Monday deadline — will never write a pandas script, and shouldn't have to ship
	her customer data to a frontier API either. The bet: a 4B model running locally is
	enough, if you stop asking it to edit data and start asking it to plan.

	So the model never touches a cell. It reads an aggregated profile (per-value frequency
	counts — so the model sees a bounded, fixed-size summary whether the table has a hundred
	rows or a million) and emits a JSON plan; deterministic pandas executes it. Every change is named, reversible, and logged. Silent edits are
	impossible by construction. That decomposition turned out to be the whole project.

	## Things that broke, in order

	The fine-tune that aced the test and failed the job. v4 hit canonicalization F1
	0.90 on held-out synthetic data — and scored exactly 0.000 on real hospital typos. It
	had never seen a high-cardinality real column. Fix: derive training pairs from real
	dirty/clean benchmark tables by cell alignment, keeping only learnable
	canonicalizations (a surface form that's a string variant of its target and never a
	legitimate value elsewhere). Real repair recall: 0.00 → 0.42. Synthetic data teaches
	the format; real data teaches the job.

	The GGUF that lobotomized the model. Same adapter, two exports: Q8_0 worked
	perfectly, Q4_K_M degenerated into `<tool_call>` loops. Hours of template debugging
	later: the quantization itself was corrupting the export. Then the bf16 path had its
	own version — training converged (loss 0.16) but free-running generation still
	emitted tool-call loops, because Qwen3's tool-calling prior dominates the first token.
	The fix is two tokens long: `suppress_tokens=[151657, 151658]`.

	The model that invented cities. Asked for canonical forms, a generative model
	generates — including `guntxrsvillx → huntsville` (wrong town). Frequency clustering
	can't fix this either: a lone column has no signal to vote against the error (GARF
	proves this structurally). The fix came from the literature: never free-generate a
	canonical. Retrieve candidates from a reference taxonomy (GeoNames, ISO), require a
	similarity threshold and an ambiguity margin, and abstain when unsure. `boxz` is
	equally close to `Box` and `Boaz` — so the system declines and asks. We measured the
	abstention: precision rises monotonically with the threshold (90% at the default, 95%
	at 0.91). Knowing when not to act turned out to be the most valuable feature.

	The eval that graded itself too kindly — twice. Our own ablations caught two metric
	artifacts: (1) convention-tolerant scoring counted bulk case-rewrites as "good
	changes," inflating precision — removing case-matching gained +0.12 until we made
	the metric churn-neutral; (2) our adversarial traps included `Boazz`, which grounding
	correctly maps to the real city Boaz — the trap was punishing correct behavior. Both
	fixes are reported in the paper as results, because an eval you haven't tried to break
	is an eval you can't trust.

	The honest negative result. On injected typos, classical frequency clustering
	remains a strong baseline — by construction: injection puts the canonical in the
	column, which is clustering's ideal regime. Grounding's edge is real errors, tail
	entities, and not wrong-merging. We report both slices separately rather than
	averaging the difference away.

	The verifier that made the model shippable. The fine-tune's hospital numbers told
	an awkward story: recall 0.475 (best we'd measured for a local model) at precision
	0.185 — it fixed errors and invented merges. Instead of retraining, we scored every
	proposed mapping with three deterministic gates distilled from its actual failures: a
	value occurring ≥3 times is data, not a typo (errors are rare); a repair target must
	dominate its source in frequency (no mapping one typo onto another); digit-bearing
	codes only repair when the letter part is near-identical (`amix-2 → ami-2` yes,
	`ak_ → al_` no). The gated model plan alone: 0.993 precision at 0.287 coverage —
	146 of 147 changes correct. Union it with the grounded heuristic and you get **0.905
	precision at 0.413 coverage** on hospital's 509 real errors. Every dropped mapping
	becomes a review flag, not a silent skip. That composition — verify the model's
	output, never trust it — is what the app now ships as its default planner.

	## The PII turn

	A friend pointed at the OpenMed project (small Apache-2.0 token classifiers; their
	paper is the sister result to our thesis — small specialized beats big generic). Their
	44M PII model, trained on clinical sentences, turned out to transfer perfectly to
	bare CSV cells: 100% on names and addresses, no prompt template needed. We put it
	behind a sensitive-type allowlist and a column-level vote, added a deterministic
	checksum tier (Luhn, IBAN mod-97 — math, not vibes), and made masking an executor
	operation. Leak test: 0/360 residual detectable PII after masking. OOD type detection:
	5/5 with 0/7 false positives. The privacy ribbon at the top of the app — "nothing
	leaves this machine" — now describes the PII handling too, not just the inference.

	## The word that broke the demo

	We shipped the engine, then sent the live Space to people who actually have messy
	spreadsheets and aren't data people. The most useful feedback wasn't a bug report — it
	was that the word "cleaning" didn't mean anything to them. One tester read "clean my
	Excel" as deleting data: "¿Te refieres a que elimine algo de algún archivo?" ("you
	mean it removes something from the file?"). Another didn't know where to start: *"¿eso
	del Excel te lo subimos ahí o cómo?"* ("the Excel thing — do we upload it there, or
	how?"). The clearest explanation of the whole product turned out to be a sentence we
	typed by hand in a chat reply — "it fixes text errors: names, phones, emails, cities" —
	and that sentence was nowhere in the app.

	The engine was fine. The framing was the failure. So we changed the product to show
	what cleaning is instead of naming it: the hero now opens with a literal before→after
	strip (`nigeia → Nigeria`, `Calfornia → California`) before any upload, the headline is
	the sentence that worked in chat ("Fix the messy text in your spreadsheet"), the copy
	says plainly "I never delete your data," jargon labels are gone ("with PII" → "with
	sensitive data"), and a one-click "watch it run on a sample" path removes the "where do I
	even start" wall. One honesty footnote from the rewrite: our first before→after example
	added a `+52` country code to a phone number — which the executor doesn't actually do — so
	we cut it. The demo strip can only show what the engine truly does.

	n was small and informal (~3 people we know), so this isn't a usability study. But you
	only need to watch one person mistake your tool for a delete button to learn the lesson:
	the people who most need the tool don't share your vocabulary, and the demo has to teach
	the concept before it can show the feature.

	## What we'd tell the next person

	1. Planner/executor is the trust unlock. Auditability isn't a feature you add;
	it's a decomposition you choose.
	2. Verify supervision by executing it. Every training example we kept provably
	recovers the clean table. Bad plans can't become labels.
	3. Ground generation in references and budget for abstention. A small model that
	declines correctly beats a big model that guesses confidently.
	4. Attack your own eval before reviewers do. Both of our metric bugs were found by
	ablations we almost didn't run.
	5. Small models are enough more often than you think — and roughly $35 of GPU
	credit covers an embarrassing number of mistakes if each one teaches you something.
	6. Test the framing on someone outside your vocabulary. The engine can be correct and
	the product still unusable if the first screen assumes a word — "cleaning" — that your
	user doesn't have. Show the concept before you name the feature.

	— Built with a ≤4B planner, a 44M PII classifier, checksums, and a reference gazetteer.
	Total model weight: under 4.1B parameters. Total cloud spend: about $35.