Spaces:
Running
Running
| # Training-data generation | |
| Self-verified synthetic SFT data for the ScrubData planner: `(dirty profile → JSON | |
| cleaning plan)` pairs. We generate **clean** tables, inject controlled mess, and | |
| because *we* created the mess the ground-truth plan is known. Every example is then | |
| **verified by running `scrubdata.executor`** (dirty + plan must recover the clean | |
| original) — only perfectly-recovered examples are kept. | |
| ## Run | |
| ```bash | |
| uv run training/build_dataset.py --n 2000 --out data/train.jsonl --seed 0 | |
| ``` | |
| Output is chat-format JSONL (`messages`: system/user/assistant) using the shared | |
| `scrubdata/prompt.py` serialization, so **training === inference**. | |
| ## Why it's hard (not just heuristic imitation) | |
| The early version drew from toy pools (4 countries, 3 category sets) — the target | |
| plans were exactly what the deterministic `mock_planner` already produces, so a | |
| fine-tune would just clone a free heuristic. The generator is now backed by **real | |
| vocabularies** the heuristic has no knowledge of: | |
| - **`vocab.py`** — countries / US states / currencies (offline via `pycountry`), | |
| ~460 cities (curated aliases + `cities.txt`, open world-cities data), departments, | |
| job titles, status sets. Each canonical has realistic surface variants (aliases, | |
| ISO codes, casing, punctuation, single-char typos). | |
| - Columns stay **low-cardinality** (a few canonicals each) so every dirty surface | |
| appears in the profile sample the model sees — the task is learnable *and* the | |
| executor can recover it. | |
| Latest 2000-example build: **844 distinct canonical targets**, **33k surface→canonical | |
| pairs**, all 10 semantic types, plus **anomaly flag-only** examples (teach surfacing | |
| implausible values without changing them). ~93% of attempts verify; the rest are | |
| dropped (the quality gate). | |
| ## Files | |
| - `fields.py` — field archetypes (clean generator + matched corruptor). | |
| - `vocab.py` — real vocabularies + the surface-corruption engine. | |
| - `cities.txt` — 400 extra cities derived from open world-cities data. | |
| - `generate.py` — assembles one example (columns + table corruptions + anomaly flags). | |
| - `build_dataset.py` — loops, **verifies via the executor**, writes JSONL. | |