data-quality-env / SQL_AGENT_MIND.md
Hemanth Kunta
Meta hackathon submission
91e7690
# SQL Agent Mind Guide
This document is a practical SQL reference used by the agent to reason deeply about data quality tasks.
## Core SQL command pattern
- Allowed: `SELECT`, `WITH` (CTEs)
- Blocked: destructive statements (`DROP`, `DELETE`, `UPDATE`, etc.)
## Most important SQL functions in this environment
### Aggregation
- `COUNT(*)`
- `SUM(...)`
- `AVG(...)`
- `MIN(...)`, `MAX(...)`
### Data quality checks
- `CASE WHEN ... THEN ... ELSE ... END`
- `IS NULL`
- `TRY_CAST(...)`
- `REPLACE(...)`
### Deduplication logic
- `GROUP BY ... HAVING COUNT(*) > 1`
- `SUM(c - 1)` where `c` is duplicate group count
### Drift analysis
- Baseline vs current mean comparison with subqueries
- `LEFT JOIN ... WHERE right_col IS NULL` for novelty/referential drift
- Distribution checks with `GROUP BY`
## Task-specific deep probe examples
### Task 1: Nulls + duplicates
```sql
SELECT SUM(CASE WHEN email IS NULL THEN 1 ELSE 0 END) AS null_email,
SUM(CASE WHEN customer_id IS NULL THEN 1 ELSE 0 END) AS null_customer_id
FROM customers;
```
```sql
SELECT COALESCE(SUM(c - 1), 0) AS duplicate_rows
FROM (
SELECT customer_id, email, name, signup_date, country, COUNT(*) AS c
FROM customers
GROUP BY 1,2,3,4,5
HAVING COUNT(*) > 1
) t;
```
### Task 2: Schema and range violations
```sql
SELECT SUM(CASE WHEN quantity < 0 THEN 1 ELSE 0 END) AS negative_quantity_rows
FROM orders;
```
```sql
SELECT SUM(CASE WHEN try_cast(replace(amount, '$', '') AS DOUBLE) IS NULL THEN 1 ELSE 0 END) AS unparseable_amount_rows
FROM orders;
```
### Task 3: Silent drift
```sql
SELECT
(SELECT AVG(amount) FROM transactions_baseline) AS baseline_mean,
(SELECT AVG(amount) FROM transactions_current) AS current_mean;
```
```sql
SELECT DISTINCT c.category
FROM transactions_current c
LEFT JOIN (SELECT DISTINCT category FROM transactions_baseline) b
ON c.category = b.category
WHERE b.category IS NULL
ORDER BY c.category;
```
```sql
SELECT AVG(CASE WHEN user_id >= 1000 THEN 1.0 ELSE 0.0 END) AS new_user_row_pct
FROM transactions_current;
```
## Deeper testing strategy
1. Run sample + aggregate checks first.
2. Validate each scoring dimension with one explicit probe.
3. Add distribution probes to avoid blind spots.
4. Submit report only after all dimensions are covered.