scrubdata / training /README.md
OpenAI Codex
deploy: add sponsor:openai tag (Best Use of Codex) + Codex-hardened build
16dc556
|
Raw
History Blame Contribute Delete
2.18 kB
# Training-data generation
Self-verified synthetic SFT data for the ScrubData planner: `(dirty profile → JSON
cleaning plan)` pairs. We generate **clean** tables, inject controlled mess, and
because *we* created the mess the ground-truth plan is known. Every example is then
**verified by running `scrubdata.executor`** (dirty + plan must recover the clean
original) — only perfectly-recovered examples are kept.
## Run
```bash
uv run training/build_dataset.py --n 2000 --out data/train.jsonl --seed 0
```
Output is chat-format JSONL (`messages`: system/user/assistant) using the shared
`scrubdata/prompt.py` serialization, so **training === inference**.
## Why it's hard (not just heuristic imitation)
The early version drew from toy pools (4 countries, 3 category sets) — the target
plans were exactly what the deterministic `mock_planner` already produces, so a
fine-tune would just clone a free heuristic. The generator is now backed by **real
vocabularies** the heuristic has no knowledge of:
- **`vocab.py`** — countries / US states / currencies (offline via `pycountry`),
~460 cities (curated aliases + `cities.txt`, open world-cities data), departments,
job titles, status sets. Each canonical has realistic surface variants (aliases,
ISO codes, casing, punctuation, single-char typos).
- Columns stay **low-cardinality** (a few canonicals each) so every dirty surface
appears in the profile sample the model sees — the task is learnable *and* the
executor can recover it.
Latest 2000-example build: **844 distinct canonical targets**, **33k surface→canonical
pairs**, all 10 semantic types, plus **anomaly flag-only** examples (teach surfacing
implausible values without changing them). ~93% of attempts verify; the rest are
dropped (the quality gate).
## Files
- `fields.py` — field archetypes (clean generator + matched corruptor).
- `vocab.py` — real vocabularies + the surface-corruption engine.
- `cities.txt` — 400 extra cities derived from open world-cities data.
- `generate.py` — assembles one example (columns + table corruptions + anomaly flags).
- `build_dataset.py` — loops, **verifies via the executor**, writes JSONL.