# Data-Quality Loop ## Objective Keep a recurring dataset or knowledge base trustworthy by validating each refresh against explicit quality rules before it is published, instead of discovering bad data downstream. ## Trigger - Schedule: after each dataset or knowledge-base refresh, or nightly. - Event: upstream load completes, source schema changes, or a freshness SLA is breached. - Manual bootstrap/debug command: "validate the latest refresh and quarantine on failure." ## Intake - The new dataset version, its schema, row counts, and the prior accepted version. - Quality rules: schema constraints, null thresholds, range checks, uniqueness, referential integrity, and freshness. - Known accepted exceptions and prior false positives. ## Context - Required files: data contract or schema, quality-rule definitions, ownership map. - Runtime sources: current load metrics, profiling output, diff against the last accepted version. ## Agents - Profiler: computes row counts, distributions, null rates, and freshness for the new version. - Validator: checks the profile against quality rules and the prior accepted baseline. - Investigator: classifies each failure as a real defect, an accepted shift, or a rule bug. - Reporter: records pass/fail, quarantines bad versions, and proposes rule fixes. ## Workspace And Permissions - Read access to the dataset, schema, profiling tools, and prior versions. - Allowed to run validation queries, write a quality report, and quarantine a failing version. - Disallowed from editing source data, publishing to production, or relaxing rules without review. - Promoting a version to production is a human or policy-gated decision. ## Durable State - Dataset version, profile metrics, rule results, quarantine status, and accepted exceptions. - A quality ledger so recurring shifts are recognized rather than re-investigated every run. ## Loop Steps 1. Load the new version, its schema, and the last accepted baseline. 1. Profile the data: counts, distributions, nulls, ranges, uniqueness, and freshness. 1. Run quality rules and diff the profile against the baseline. 1. Classify each failure as a defect, an accepted shift, or a rule bug. 1. Quarantine the version if a hard rule fails; otherwise mark it ready for promotion. 1. Persist the profile, results, and any new accepted exception. 1. Stop when the version is validated and ready, quarantined, or blocked on owner judgment. ## Verification Gates - Hard rules (schema, referential integrity, freshness) pass before a version is promotable. - Distribution shifts beyond threshold are flagged with the metric and the delta. - Every quarantine cites the failing rule and the offending sample. - Accepted exceptions are recorded so the same shift is not re-flagged forever. ## Budget And Exit - Max retries: 2 re-profiling attempts per failing rule before escalation. - Max runtime: 30-90 minutes depending on dataset size. - Stop when the version is validated, quarantined with evidence, or needs owner judgment. ## Escalation Escalate for ambiguous distribution shifts, schema changes that need a data-contract update, repeated upstream failures, regulated or PII-bearing fields, or a rule that may itself be wrong. ## Loop Instruction ```text Validate the latest refresh against its data contract and quality rules. Profile counts, distributions, nulls, ranges, uniqueness, and freshness, then diff against the last accepted version. Quarantine on any hard-rule failure and cite the failing rule and a sample. Record accepted shifts so they are not re-flagged, and never publish to production yourself. Escalate ambiguous shifts, schema changes, or anything touching regulated fields. ``` Example automation: run after each upstream load or nightly, and quarantine failing versions before any downstream job consumes them. ## Failure Modes - Validating against a moving baseline so real regressions look normal. - Treating every distribution shift as a defect and drowning owners in false positives. - Passing a version on row count alone while a key column silently went null. - Editing or "cleaning" source data instead of quarantining and escalating. ## Safety Notes - Never copy raw PII or regulated values into reports; reference row keys or hashes instead. - Quarantine is reversible; publishing bad data downstream often is not, so fail closed. - Keep the prior accepted version immutable as the comparison baseline. ## Example Contract - [`examples/data-quality-loop.json`](../examples/data-quality-loop.json) ## References - [Effective Context Engineering for AI Agents](https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents) - Treating durable data and state as managed runtime inputs rather than ad hoc dumps. - [Durable Execution for Agentic Workflows](https://arizenai.com/durable-execution/) - Checkpointing and replay patterns for recurring data jobs.