Spaces:

build-small-hackathon
/

scrubdata

Running

App Files Files Community

scrubdata / training /README.md

OpenAI Codex

deploy: add sponsor:openai tag (Best Use of Codex) + Codex-hardened build

16dc556 18 days ago

preview code

Raw

History Blame Contribute Delete

2.18 kB

	# Training-data generation

	Self-verified synthetic SFT data for the ScrubData planner: `(dirty profile → JSON
	cleaning plan)` pairs. We generate clean tables, inject controlled mess, and
	because we created the mess the ground-truth plan is known. Every example is then
	verified by running `scrubdata.executor` (dirty + plan must recover the clean
	original) — only perfectly-recovered examples are kept.

	## Run

	```bash
	uv run training/build_dataset.py --n 2000 --out data/train.jsonl --seed 0
	```

	Output is chat-format JSONL (`messages`: system/user/assistant) using the shared
	`scrubdata/prompt.py` serialization, so training === inference.

	## Why it's hard (not just heuristic imitation)

	The early version drew from toy pools (4 countries, 3 category sets) — the target
	plans were exactly what the deterministic `mock_planner` already produces, so a
	fine-tune would just clone a free heuristic. The generator is now backed by **real
	vocabularies** the heuristic has no knowledge of:

	- `vocab.py` — countries / US states / currencies (offline via `pycountry`),
	~460 cities (curated aliases + `cities.txt`, open world-cities data), departments,
	job titles, status sets. Each canonical has realistic surface variants (aliases,
	ISO codes, casing, punctuation, single-char typos).
	- Columns stay low-cardinality (a few canonicals each) so every dirty surface
	appears in the profile sample the model sees — the task is learnable and the
	executor can recover it.

	Latest 2000-example build: 844 distinct canonical targets, **33k surface→canonical
	pairs, all 10 semantic types, plus anomaly flag-only** examples (teach surfacing
	implausible values without changing them). ~93% of attempts verify; the rest are
	dropped (the quality gate).

	## Files
	- `fields.py` — field archetypes (clean generator + matched corruptor).
	- `vocab.py` — real vocabularies + the surface-corruption engine.
	- `cities.txt` — 400 extra cities derived from open world-cities data.
	- `generate.py` — assembles one example (columns + table corruptions + anomaly flags).
	- `build_dataset.py` — loops, verifies via the executor, writes JSONL.