Data-Quality Loop
Objective
Keep a recurring dataset or knowledge base trustworthy by validating each refresh against explicit quality rules before it is published, instead of discovering bad data downstream.
Trigger
- Schedule: after each dataset or knowledge-base refresh, or nightly.
- Event: upstream load completes, source schema changes, or a freshness SLA is breached.
- Manual bootstrap/debug command: "validate the latest refresh and quarantine on failure."
Intake
- The new dataset version, its schema, row counts, and the prior accepted version.
- Quality rules: schema constraints, null thresholds, range checks, uniqueness, referential integrity, and freshness.
- Known accepted exceptions and prior false positives.
Context
- Required files: data contract or schema, quality-rule definitions, ownership map.
- Runtime sources: current load metrics, profiling output, diff against the last accepted version.
Agents
- Profiler: computes row counts, distributions, null rates, and freshness for the new version.
- Validator: checks the profile against quality rules and the prior accepted baseline.
- Investigator: classifies each failure as a real defect, an accepted shift, or a rule bug.
- Reporter: records pass/fail, quarantines bad versions, and proposes rule fixes.
Workspace And Permissions
- Read access to the dataset, schema, profiling tools, and prior versions.
- Allowed to run validation queries, write a quality report, and quarantine a failing version.
- Disallowed from editing source data, publishing to production, or relaxing rules without review.
- Promoting a version to production is a human or policy-gated decision.
Durable State
- Dataset version, profile metrics, rule results, quarantine status, and accepted exceptions.
- A quality ledger so recurring shifts are recognized rather than re-investigated every run.
Loop Steps
- Load the new version, its schema, and the last accepted baseline.
- Profile the data: counts, distributions, nulls, ranges, uniqueness, and freshness.
- Run quality rules and diff the profile against the baseline.
- Classify each failure as a defect, an accepted shift, or a rule bug.
- Quarantine the version if a hard rule fails; otherwise mark it ready for promotion.
- Persist the profile, results, and any new accepted exception.
- Stop when the version is validated and ready, quarantined, or blocked on owner judgment.
Verification Gates
- Hard rules (schema, referential integrity, freshness) pass before a version is promotable.
- Distribution shifts beyond threshold are flagged with the metric and the delta.
- Every quarantine cites the failing rule and the offending sample.
- Accepted exceptions are recorded so the same shift is not re-flagged forever.
Budget And Exit
- Max retries: 2 re-profiling attempts per failing rule before escalation.
- Max runtime: 30-90 minutes depending on dataset size.
- Stop when the version is validated, quarantined with evidence, or needs owner judgment.
Escalation
Escalate for ambiguous distribution shifts, schema changes that need a data-contract update, repeated upstream failures, regulated or PII-bearing fields, or a rule that may itself be wrong.
Loop Instruction
Validate the latest <dataset> refresh against its data contract and quality rules.
Profile counts, distributions, nulls, ranges, uniqueness, and freshness, then diff against the last accepted version.
Quarantine on any hard-rule failure and cite the failing rule and a sample.
Record accepted shifts so they are not re-flagged, and never publish to production yourself.
Escalate ambiguous shifts, schema changes, or anything touching regulated fields.
Example automation: run after each upstream load or nightly, and quarantine failing versions before any downstream job consumes them.
Failure Modes
- Validating against a moving baseline so real regressions look normal.
- Treating every distribution shift as a defect and drowning owners in false positives.
- Passing a version on row count alone while a key column silently went null.
- Editing or "cleaning" source data instead of quarantining and escalating.
Safety Notes
- Never copy raw PII or regulated values into reports; reference row keys or hashes instead.
- Quarantine is reversible; publishing bad data downstream often is not, so fail closed.
- Keep the prior accepted version immutable as the comparison baseline.
Example Contract
References
- Effective Context Engineering for AI Agents - Treating durable data and state as managed runtime inputs rather than ad hoc dumps.
- Durable Execution for Agentic Workflows - Checkpointing and replay patterns for recurring data jobs.