Spaces:

kumar6591
/

data-quality-env

Sleeping

File size: 2,280 Bytes

91e7690

# SQL Agent Mind Guide

This document is a practical SQL reference used by the agent to reason deeply about data quality tasks.

## Core SQL command pattern
- Allowed: `SELECT`, `WITH` (CTEs)
- Blocked: destructive statements (`DROP`, `DELETE`, `UPDATE`, etc.)

## Most important SQL functions in this environment

### Aggregation
- `COUNT(*)`
- `SUM(...)`
- `AVG(...)`
- `MIN(...)`, `MAX(...)`

### Data quality checks
- `CASE WHEN ... THEN ... ELSE ... END`
- `IS NULL`
- `TRY_CAST(...)`
- `REPLACE(...)`

### Deduplication logic
- `GROUP BY ... HAVING COUNT(*) > 1`
- `SUM(c - 1)` where `c` is duplicate group count

### Drift analysis
- Baseline vs current mean comparison with subqueries
- `LEFT JOIN ... WHERE right_col IS NULL` for novelty/referential drift
- Distribution checks with `GROUP BY`

## Task-specific deep probe examples

### Task 1: Nulls + duplicates
```sql
SELECT SUM(CASE WHEN email IS NULL THEN 1 ELSE 0 END) AS null_email,
       SUM(CASE WHEN customer_id IS NULL THEN 1 ELSE 0 END) AS null_customer_id
FROM customers;
```

```sql
SELECT COALESCE(SUM(c - 1), 0) AS duplicate_rows
FROM (
  SELECT customer_id, email, name, signup_date, country, COUNT(*) AS c
  FROM customers
  GROUP BY 1,2,3,4,5
  HAVING COUNT(*) > 1
) t;
```

### Task 2: Schema and range violations
```sql
SELECT SUM(CASE WHEN quantity < 0 THEN 1 ELSE 0 END) AS negative_quantity_rows
FROM orders;
```

```sql
SELECT SUM(CASE WHEN try_cast(replace(amount, '$', '') AS DOUBLE) IS NULL THEN 1 ELSE 0 END) AS unparseable_amount_rows
FROM orders;
```

### Task 3: Silent drift
```sql
SELECT
  (SELECT AVG(amount) FROM transactions_baseline) AS baseline_mean,
  (SELECT AVG(amount) FROM transactions_current) AS current_mean;
```

```sql
SELECT DISTINCT c.category
FROM transactions_current c
LEFT JOIN (SELECT DISTINCT category FROM transactions_baseline) b
  ON c.category = b.category
WHERE b.category IS NULL
ORDER BY c.category;
```

```sql
SELECT AVG(CASE WHEN user_id >= 1000 THEN 1.0 ELSE 0.0 END) AS new_user_row_pct
FROM transactions_current;
```

## Deeper testing strategy
1. Run sample + aggregate checks first.
2. Validate each scoring dimension with one explicit probe.
3. Add distribution probes to avoid blind spots.
4. Submit report only after all dimensions are covered.