scrubdata / PRODUCT.md
OpenAI Codex
deploy: add sponsor:openai tag (Best Use of Codex) + Codex-hardened build
16dc556
|
Raw
History Blame Contribute Delete
8.15 kB
# ScrubData β€” Product Research & Spec
> What does an office worker actually mean by "just clean my data"? This doc
> pins down the expectations so the cleaning-plan schema and UX aren't guesses.
> (Living doc β€” refine when the deep-research workflows land.)
## 1. The user & the moment
**Who:** an operations / sales-ops / finance / admin person. Lives in
spreadsheets exported from a CRM, an ERP, a Google Form, a POS, a bank portal.
Not a pandas user. Competent with Excel but doesn't want to write `=PROPER()`
across 40 columns or learn Power Query.
**The moment of pain:** they exported a file to do their actual job β€”
build a report, upload to another system, send a mail-merge, reconcile numbers β€”
and the file is dirty enough that the next step breaks or lies. The import fails,
the pivot double-counts, the vlookup misses, the "total revenue" is wrong because
amounts are text.
**What they want:** drop the file in, get a *trustworthy* clean file back, and
a plain sentence telling them what was wrong so they can vouch for it to their
boss. They do **not** want 30 config toggles. Hands-off is the whole pitch.
**What they fear (must design against):** that the tool silently changed
something it shouldn't have. Trust is the product. Every change must be
**visible, explained, and reversible**.
## 2. Taxonomy of "dirty" β€” what we must detect & fix
Grouped by how an office worker would describe it. This list *is* the operation
set the planner emits and the executor implements.
### A. Structural / table-level
- **Exact duplicate rows** β€” "this person is in here 3 times."
- **Near-duplicate rows** β€” same entity, trivial differences (later/stretch).
- **Empty rows & empty columns** β€” junk from the export.
- **Header problems** β€” header not in row 1, merged cells, `Unnamed: 0`,
duplicated column names, units baked into headers (`Amount (USD)`).
- **Inconsistent column naming** β€” `First Name` vs `first_name` (normalize to
snake_case as an option, off by default β€” it's a rename, higher-trust-risk).
### B. Whitespace & casing (the silent killers behind failed joins)
- Leading/trailing whitespace; doubled internal spaces; non-breaking spaces.
- Inconsistent casing (`ACME`, `Acme`, `acme corp`).
- Invisible characters (zero-width, BOM), smart quotes.
### C. Missing values, disguised
- Real blanks **plus** disguised nulls: `N/A`, `na`, `-`, `--`, `null`, `None`,
`#N/A`, `TBD`, `?`, `0` (context-dependent β€” risky, don't auto-assume).
- Decision: normalize disguised nulls β†’ true missing; **imputation is opt-in**,
never silent (filling values is a claim about reality).
### D. Type & format inconsistency (where the model earns its keep)
- **Numbers stored as text:** `"$1,200.50"`, `"1.200,50"` (EU), `"(500)"`
(accounting negative), `"12%"`, `"1,2k"`.
- **Dates in mixed formats:** `2023-01-05`, `01/05/2023`, `5 Jan 2023`,
`Jan-23`, Excel serial `44931`. Ambiguous DMY vs MDY must be detected, not
guessed blindly β€” infer from the column's evidence, flag if undecidable.
- **Booleans:** `Yes/No`, `Y/N`, `TRUE/FALSE`, `1/0`, `T/F`, `βœ“`.
- **Phone numbers:** wildly inconsistent; standardize to E.164-ish where region
is inferable, else just strip to digits + canonical format.
- **Emails:** casing, whitespace, obvious typos (`@gmial.com`), trailing junk.
### E. Categorical canonicalization (the headline AI feature)
- Inconsistent labels for the same thing: `USA / U.S.A. / United States / us`,
`M/F vs Male/Female`, `NY / New York / new york`, status fields, product
names. Rules can't enumerate these β€” **the small model proposes the mapping**,
the executor applies it, the report shows the mapping for approval.
### F. Validity / anomaly flags (flag, don't auto-delete)
- Out-of-range numbers (age 999, negative price), impossible dates (1899-12-31
Excel epoch), malformed emails/phones, values that don't match the column's
inferred type. Default action = **flag in the report**, not silent edit.
## 3. The trust contract (design principles)
1. **Visible** β€” every operation appears in a before/after diff and the report.
2. **Explained** β€” plain-English rationale per operation ("standardized 4 date
formats into ISO `YYYY-MM-DD`").
3. **Conservative by default** β€” destructive/assumptive ops (imputation, row
deletion beyond exact dups, renames) are surfaced as suggestions, applied
only if the user keeps them on. Safe ops (trim whitespace, normalize disguised
nulls, parse types) are on by default.
4. **Reversible** β€” original file untouched; output is a new file + a machine-
readable plan the user could replay or undo.
5. **No config to start** β€” sensible defaults run immediately on upload; the
plan is editable *after* the user sees it, not a wall of options before.
## 4. Competitive landscape (what to learn / what to beat)
| Tool | What it does well | Why an office worker bounces |
|------|-------------------|------------------------------|
| **Excel / Power Query** | Ubiquitous, trusted | Manual; canonicalization is hand-built; steep |
| **OpenRefine** | Powerful clustering/canonicalization (key-collision, kNN) | Intimidating UI, GREL expressions, local Java app |
| **ydata-profiling / pandas-profiling** | Great *profiling* report | Diagnoses, doesn't *fix* |
| **Trifacta / Tableau Prep / Alteryx** | Visual prep pipelines | Enterprise, paid, config-heavy |
| **OpenRefine reconciliation** | Entity canonicalization | Manual, needs setup |
**Our wedge:** OpenRefine's clustering *automated and explained by a small
model*, with zero config and a one-screen trust-preserving UX. We borrow
OpenRefine's clustering idea but the model proposes the clusters/mappings and
narrates them, so the user never learns a tool β€” they just approve sentences.
## 5. Cleaning-plan schema (v0 β€” drives the mock & later the model)
The model outputs this JSON; the executor consumes it. Designed so the model
only does *semantic/fuzzy* judgment, and all execution is deterministic.
```json
{
"dataset_summary": "Contacts export, 38 rows Γ— 9 cols; sales-lead data.",
"table_operations": [
{"op": "drop_exact_duplicates", "rationale": "5 identical rows."},
{"op": "drop_empty_rows"},
{"op": "drop_empty_columns", "columns": ["notes2"]}
],
"columns": [
{
"name": "country",
"detected_semantic_type": "country",
"issues": ["inconsistent_categories", "whitespace", "casing"],
"operations": [
{"op": "strip_whitespace"},
{"op": "canonicalize_categories",
"mapping": {"usa": "United States", "u.s.a.": "United States",
"us": "United States", "uk": "United Kingdom"},
"rationale": "Unified 4 spellings into 2 canonical country names."}
],
"confidence": 0.93
},
{
"name": "amount",
"detected_semantic_type": "currency",
"issues": ["numeric_stored_as_text", "currency_symbols"],
"operations": [
{"op": "parse_currency", "rationale": "Stripped $ and thousands separators; β†’ float."}
],
"confidence": 0.97
}
],
"flags": [
{"column": "age", "row_hint": "value 999", "issue": "out_of_range",
"action": "flag_only", "rationale": "Likely placeholder; left for human review."}
]
}
```
### Operation vocabulary (executor must implement)
Safe-by-default: `strip_whitespace`, `collapse_internal_whitespace`,
`normalize_disguised_nulls`, `standardize_case`, `parse_currency`,
`parse_number`, `parse_percent`, `parse_date`, `standardize_boolean`,
`standardize_phone`, `normalize_email`, `drop_exact_duplicates`,
`drop_empty_rows`, `drop_empty_columns`, `canonicalize_categories`.
Opt-in (assumptive): `impute_missing`, `drop_near_duplicates`,
`rename_columns_snake_case`, `coerce_outliers`.
Flag-only: `flag_out_of_range`, `flag_invalid_format`, `flag_type_mismatch`.
## 6. Success metric for the demo (Backyard AI judging)
A real office person uploads a real ugly export, clicks one button, and says
"oh thank god" β€” then trusts the result enough to use it, because the report
told them exactly what changed. That sentence is the bar.