scrubdata / training /README.md
OpenAI Codex
deploy: add sponsor:openai tag (Best Use of Codex) + Codex-hardened build
16dc556
|
Raw
History Blame Contribute Delete
2.18 kB

A newer version of the Gradio SDK is available: 6.19.0

Upgrade

Training-data generation

Self-verified synthetic SFT data for the ScrubData planner: (dirty profile → JSON cleaning plan) pairs. We generate clean tables, inject controlled mess, and because we created the mess the ground-truth plan is known. Every example is then verified by running scrubdata.executor (dirty + plan must recover the clean original) — only perfectly-recovered examples are kept.

Run

uv run training/build_dataset.py --n 2000 --out data/train.jsonl --seed 0

Output is chat-format JSONL (messages: system/user/assistant) using the shared scrubdata/prompt.py serialization, so training === inference.

Why it's hard (not just heuristic imitation)

The early version drew from toy pools (4 countries, 3 category sets) — the target plans were exactly what the deterministic mock_planner already produces, so a fine-tune would just clone a free heuristic. The generator is now backed by real vocabularies the heuristic has no knowledge of:

  • vocab.py — countries / US states / currencies (offline via pycountry), ~460 cities (curated aliases + cities.txt, open world-cities data), departments, job titles, status sets. Each canonical has realistic surface variants (aliases, ISO codes, casing, punctuation, single-char typos).
  • Columns stay low-cardinality (a few canonicals each) so every dirty surface appears in the profile sample the model sees — the task is learnable and the executor can recover it.

Latest 2000-example build: 844 distinct canonical targets, 33k surface→canonical pairs, all 10 semantic types, plus anomaly flag-only examples (teach surfacing implausible values without changing them). ~93% of attempts verify; the rest are dropped (the quality gate).

Files

  • fields.py — field archetypes (clean generator + matched corruptor).
  • vocab.py — real vocabularies + the surface-corruption engine.
  • cities.txt — 400 extra cities derived from open world-cities data.
  • generate.py — assembles one example (columns + table corruptions + anomaly flags).
  • build_dataset.py — loops, verifies via the executor, writes JSONL.