Spaces:

kumar6591
/

data-quality-env

Sleeping

App Files Files Community

data-quality-env / SQL_AGENT_MIND.md

Hemanth Kunta

Meta hackathon submission

91e7690 about 2 months ago

preview code

raw

history blame contribute delete

2.28 kB

SQL Agent Mind Guide

This document is a practical SQL reference used by the agent to reason deeply about data quality tasks.

Core SQL command pattern

Allowed: SELECT, WITH (CTEs)
Blocked: destructive statements (DROP, DELETE, UPDATE, etc.)

Most important SQL functions in this environment

Aggregation

COUNT(*)
SUM(...)
AVG(...)
MIN(...), MAX(...)

Data quality checks

CASE WHEN ... THEN ... ELSE ... END
IS NULL
TRY_CAST(...)
REPLACE(...)

Deduplication logic

GROUP BY ... HAVING COUNT(*) > 1
SUM(c - 1) where c is duplicate group count

Drift analysis

Baseline vs current mean comparison with subqueries
LEFT JOIN ... WHERE right_col IS NULL for novelty/referential drift
Distribution checks with GROUP BY

Task-specific deep probe examples

Task 1: Nulls + duplicates

SELECT SUM(CASE WHEN email IS NULL THEN 1 ELSE 0 END) AS null_email,
       SUM(CASE WHEN customer_id IS NULL THEN 1 ELSE 0 END) AS null_customer_id
FROM customers;

SELECT COALESCE(SUM(c - 1), 0) AS duplicate_rows
FROM (
  SELECT customer_id, email, name, signup_date, country, COUNT(*) AS c
  FROM customers
  GROUP BY 1,2,3,4,5
  HAVING COUNT(*) > 1
) t;

Task 2: Schema and range violations

SELECT SUM(CASE WHEN quantity < 0 THEN 1 ELSE 0 END) AS negative_quantity_rows
FROM orders;

SELECT SUM(CASE WHEN try_cast(replace(amount, '$', '') AS DOUBLE) IS NULL THEN 1 ELSE 0 END) AS unparseable_amount_rows
FROM orders;

Task 3: Silent drift

SELECT
  (SELECT AVG(amount) FROM transactions_baseline) AS baseline_mean,
  (SELECT AVG(amount) FROM transactions_current) AS current_mean;

SELECT DISTINCT c.category
FROM transactions_current c
LEFT JOIN (SELECT DISTINCT category FROM transactions_baseline) b
  ON c.category = b.category
WHERE b.category IS NULL
ORDER BY c.category;

SELECT AVG(CASE WHEN user_id >= 1000 THEN 1.0 ELSE 0.0 END) AS new_user_row_pct
FROM transactions_current;

Deeper testing strategy

Run sample + aggregate checks first.
Validate each scoring dimension with one explicit probe.
Add distribution probes to avoid blind spots.
Submit report only after all dimensions are covered.