Spaces:
Running
A newer version of the Gradio SDK is available: 6.19.0
Training-data generation
Self-verified synthetic SFT data for the ScrubData planner: (dirty profile → JSON cleaning plan) pairs. We generate clean tables, inject controlled mess, and
because we created the mess the ground-truth plan is known. Every example is then
verified by running scrubdata.executor (dirty + plan must recover the clean
original) — only perfectly-recovered examples are kept.
Run
uv run training/build_dataset.py --n 2000 --out data/train.jsonl --seed 0
Output is chat-format JSONL (messages: system/user/assistant) using the shared
scrubdata/prompt.py serialization, so training === inference.
Why it's hard (not just heuristic imitation)
The early version drew from toy pools (4 countries, 3 category sets) — the target
plans were exactly what the deterministic mock_planner already produces, so a
fine-tune would just clone a free heuristic. The generator is now backed by real
vocabularies the heuristic has no knowledge of:
vocab.py— countries / US states / currencies (offline viapycountry), ~460 cities (curated aliases +cities.txt, open world-cities data), departments, job titles, status sets. Each canonical has realistic surface variants (aliases, ISO codes, casing, punctuation, single-char typos).- Columns stay low-cardinality (a few canonicals each) so every dirty surface appears in the profile sample the model sees — the task is learnable and the executor can recover it.
Latest 2000-example build: 844 distinct canonical targets, 33k surface→canonical pairs, all 10 semantic types, plus anomaly flag-only examples (teach surfacing implausible values without changing them). ~93% of attempts verify; the rest are dropped (the quality gate).
Files
fields.py— field archetypes (clean generator + matched corruptor).vocab.py— real vocabularies + the surface-corruption engine.cities.txt— 400 extra cities derived from open world-cities data.generate.py— assembles one example (columns + table corruptions + anomaly flags).build_dataset.py— loops, verifies via the executor, writes JSONL.