data-quality-env / SQL_AGENT_MIND.md
Hemanth Kunta
Meta hackathon submission
91e7690

SQL Agent Mind Guide

This document is a practical SQL reference used by the agent to reason deeply about data quality tasks.

Core SQL command pattern

  • Allowed: SELECT, WITH (CTEs)
  • Blocked: destructive statements (DROP, DELETE, UPDATE, etc.)

Most important SQL functions in this environment

Aggregation

  • COUNT(*)
  • SUM(...)
  • AVG(...)
  • MIN(...), MAX(...)

Data quality checks

  • CASE WHEN ... THEN ... ELSE ... END
  • IS NULL
  • TRY_CAST(...)
  • REPLACE(...)

Deduplication logic

  • GROUP BY ... HAVING COUNT(*) > 1
  • SUM(c - 1) where c is duplicate group count

Drift analysis

  • Baseline vs current mean comparison with subqueries
  • LEFT JOIN ... WHERE right_col IS NULL for novelty/referential drift
  • Distribution checks with GROUP BY

Task-specific deep probe examples

Task 1: Nulls + duplicates

SELECT SUM(CASE WHEN email IS NULL THEN 1 ELSE 0 END) AS null_email,
       SUM(CASE WHEN customer_id IS NULL THEN 1 ELSE 0 END) AS null_customer_id
FROM customers;
SELECT COALESCE(SUM(c - 1), 0) AS duplicate_rows
FROM (
  SELECT customer_id, email, name, signup_date, country, COUNT(*) AS c
  FROM customers
  GROUP BY 1,2,3,4,5
  HAVING COUNT(*) > 1
) t;

Task 2: Schema and range violations

SELECT SUM(CASE WHEN quantity < 0 THEN 1 ELSE 0 END) AS negative_quantity_rows
FROM orders;
SELECT SUM(CASE WHEN try_cast(replace(amount, '$', '') AS DOUBLE) IS NULL THEN 1 ELSE 0 END) AS unparseable_amount_rows
FROM orders;

Task 3: Silent drift

SELECT
  (SELECT AVG(amount) FROM transactions_baseline) AS baseline_mean,
  (SELECT AVG(amount) FROM transactions_current) AS current_mean;
SELECT DISTINCT c.category
FROM transactions_current c
LEFT JOIN (SELECT DISTINCT category FROM transactions_baseline) b
  ON c.category = b.category
WHERE b.category IS NULL
ORDER BY c.category;
SELECT AVG(CASE WHEN user_id >= 1000 THEN 1.0 ELSE 0.0 END) AS new_user_row_pct
FROM transactions_current;

Deeper testing strategy

  1. Run sample + aggregate checks first.
  2. Validate each scoring dimension with one explicit probe.
  3. Add distribution probes to avoid blind spots.
  4. Submit report only after all dimensions are covered.