---
title: SQLSherlock Env
emoji: 🔍
colorFrom: indigo
colorTo: blue
sdk: docker
app_port: 7860
tags:
  - openenv
  - reinforcement-learning
  - data-quality
pinned: false
---

# SQLSherlock-Env

An RL environment where an AI agent performs complete data-quality audits on real-world datasets.

Data cleaning is commonly estimated to consume ~80% of a data scientist's time. SQLSherlock trains and evaluates AI agents to do it automatically: discover issues through statistical investigation, fix them with the right strategy per column type, validate the result, and export a clean dataset.

Key design principles:

- **Intent-driven cleaning** — 3 intents (Visualization / ML Training / Business Analytics), each with 3 escalating difficulty levels
- **Real dataset scanning** — the environment scans the real dataset at `reset()` and builds a ground-truth issue catalogue dynamically
- **Any dataset** — HuggingFace datasets, local CSV / JSON / Parquet / XLSX, or raw CSV text
- **Dense reward signal** — every action produces a training signal, not just end-of-episode binary feedback
- **The Trap** — hard tasks plant a deliberately suspicious-looking but correct value; touching it costs −0.40

## Task Architecture: 3 Intents × 3 Difficulties = 9 Tasks

The user selects one intent; the agent automatically runs Easy → Medium → Hard for that intent.

| Intent | Easy | Medium | Hard |
|---|---|---|---|
| Visualization | `viz_easy` | `viz_medium` | `viz_hard` |
| ML Training | `ml_easy` | `ml_medium` | `ml_hard` |
| Business Analytics | `bq_easy` | `bq_medium` | `bq_hard` |

### Issue Coverage per Difficulty

| Difficulty | Issues Covered | Max Steps |
|---|---|---|
| Easy | nulls, type errors, whitespace, inconsistent categories | 30 |
| Medium | + constraint violations, statistical outliers | 40 |
| Hard | + duplicate rows, FK violations, plus the trap | 50 |

## Episode Flow

```
User selects: dataset + intent + output format
                   │
                   ▼
     ┌──────────────────────────────┐
     │  Easy Task   (30 steps max)  │
     │  Medium Task (40 steps max)  │──► per-step reward signals
     │  Hard Task   (50 steps max)  │
     └──────────────────────────────┘
                   │
                   ▼
         Grader scores [0.0 – 1.0]
         Cleaned file exported
         User downloads result
```

## Inside Each Episode

```
reset(dataset, task_id, intent)
        │
        ▼
┌──────────────────────────────────────────┐
│  DatabaseEngine                          │
│                                          │
│  1. load(source)      CSV/JSON/Parquet   │
│  2. records_to_sqlite() in-memory SQLite │
│  3. deep_copy(originals) clean snapshot  │
│  4. profile_table()   stats per column   │
│  5. detect_issues()   8 issue types      │
│  6. Validator(baseline) 6 checks         │
│  7. detect_trap()     hard tasks only    │
└──────────────────────────────────────────┘
        │
        ▼
  Observation → Agent Step Loop
        │
  investigate → fix → validate → export
        │
        ▼
  Grader.score() → [0.0 – 1.0]
```

## Scoring Formulas

Easy counts most: basic cleaning is the foundation, and its score carries the largest weight in every cascade.

### Easy

```
score = resolution(null + type_error + whitespace + category) × 0.80
      + validation × 0.20
      − fp_penalty
```

### Medium (easy = 70% of the score)

```
score = easy_score × 0.70
      + avg(constraint_resolved + outlier_resolved + duplicate_resolved) × 0.30
      − fp_penalty
```

### Hard (medium = 70% of the score)

```
# With FK violations:
score = medium_score × 0.70 + fk_resolved × 0.30 + reasoning_bonus − trap_penalty

# Single-table dataset (no FK violations to find):
score = medium_score + reasoning_bonus − trap_penalty
```
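The cascade above can be sketched in Python. The function names and argument shapes here are illustrative only, not the graders' actual API:

```python
from typing import Optional

def easy_score(resolution: float, validation: float, fp_penalty: float) -> float:
    # resolution covers nulls + type errors + whitespace + categories, each in [0, 1]
    return resolution * 0.80 + validation * 0.20 - fp_penalty

def medium_score(easy: float, constraint: float, outlier: float,
                 duplicate: float, fp_penalty: float) -> float:
    # Medium adds constraint/outlier/duplicate resolution at 30% weight
    extra = (constraint + outlier + duplicate) / 3
    return easy * 0.70 + extra * 0.30 - fp_penalty

def hard_score(medium: float, fk_resolved: Optional[float],
               reasoning_bonus: float, trap_penalty: float) -> float:
    if fk_resolved is not None:   # multi-table dataset with FK violations to find
        base = medium * 0.70 + fk_resolved * 0.30
    else:                         # single table: medium keeps full weight
        base = medium
    # Final scores are clamped to [0.0, 1.0]
    return max(0.0, min(1.0, base + reasoning_bonus - trap_penalty))
```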

## Penalties & Bonuses

| Component | Value |
|---|---|
| False-positive penalty | −0.05 per clean cell changed (capped at −0.35) |
| Skipping `validate()` | validation component × 0.50 |
| Trap hit | −0.40 |
| Reasoning bonus | +0.05 (hard tasks, statistical terms used in reasons) |

## Grading Pipeline (7 Steps)

1. **Zero-change guard** — if nothing changed → 0.0
2. **Resolution score** — per issue, confidence-weighted
3. **False-positive penalty** — −0.05 per clean cell touched (cap −0.35)
4. **Trap penalty** — −0.40 if the trap cell was modified (hard only)
5. **Validation score** — checks_passed / total × 0.30
6. **Reasoning bonus** — +0.05 for statistical reasoning terms in reasons
7. **Final weighted sum** — clamped to [0.0, 1.0]
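Two of these steps are simple enough to sketch directly; the helper names below are hypothetical, not the grader's real functions:

```python
def false_positive_penalty(clean_cells_touched: int) -> float:
    # Step 3: −0.05 per clean cell changed, capped at −0.35
    return min(0.05 * clean_cells_touched, 0.35)

def final_score(changed_anything: bool, weighted_sum: float) -> float:
    # Step 1: zero-change guard; step 7: clamp to [0.0, 1.0]
    if not changed_anything:
        return 0.0
    return max(0.0, min(1.0, weighted_sum))
```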

## Data Cleaning Capabilities

| Issue Type | Detection | Fix Strategy | Difficulty |
|---|---|---|---|
| Null values | IS NULL or empty string | Numeric → median. String → mode | All |
| Type errors | Text in a ≥80% numeric column | Column median | All |
| Whitespace | Leading/trailing/extra spaces | Trimmed string | All |
| Inconsistent categories | Case variants ("male"/"Male"/"MALE") | Dominant form | All |
| Constraint violations | Negative values in must-be-positive columns | ABS(value) | Medium+ |
| Statistical outliers | IQR-based: outside Q1 − 3×IQR or Q3 + 3×IQR | Column median | Medium+ |
| Duplicates | Same natural key appearing twice | `delete_row` | Hard |
| FK violations | Orphan references across tables | `delete_row` | Hard |
| Trap (hard only) | Planted 2× value with z < 3 (looks normal) | DO NOT TOUCH (−0.40) | Hard |

Smart imputation: `profile_column` returns median, mode, mean, null_rate, dtype, and z_scores. `fix_column` bulk-fixes all nulls, type errors, and negatives in one step.
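A minimal sketch of the imputation and outlier rules above. The actual logic lives in `schema_profiler.py` and `issue_detector.py`; these helpers, and the crude index-based quartiles, are for illustration only:

```python
from collections import Counter
from statistics import median

def impute_value(values, dtype):
    """Numeric columns get the median; string columns get the mode."""
    if dtype == "numeric":
        return median(values)
    return Counter(values).most_common(1)[0][0]

def iqr_outlier_bounds(values, k=3.0):
    """Outliers fall outside Q1 − k·IQR or Q3 + k·IQR (k = 3 here).
    Quartiles are computed by simple index lookup for brevity."""
    s = sorted(values)
    q1 = s[len(s) // 4]
    q3 = s[(3 * len(s)) // 4]
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr
```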


## Action Space

| `action_type` | Required fields | Description |
|---|---|---|
| `inspect` | `table` | View all current rows |
| `profile_column` | `table`, `column` | Stats: median, mode, mean, std, null_count, null_rate, dtype, z_scores |
| `run_sql` | `sql` | Read-only SELECT (max 50 rows) |
| `fix_cell` | `table`, `row_id`, `column`, `value`, `reason` | Fix one specific cell |
| `fix_column` | `table`, `column`, `value`, `reason` | Bulk fix: all nulls + type errors + negatives |
| `delete_row` | `table`, `row_id`, `reason` | Remove a duplicate or FK-violation row |
| `validate` | — | Run the 6-check validator on the current state |
| `submit` | — | Score and end the episode |
| `export` | — | Write the cleaned file, score, and end the episode |
| `classify_intent` | `value` | Declare the inferred intent (visualization / ml_training / business_query) |
| `select_tables` | `tables` | Declare active tables for multi-table analysis |
| `join_tables` | `table`, `table2`, `key` | LEFT JOIN two tables on a key column |
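The required-fields column can double as a client-side sanity check. The mapping below mirrors the table, but the helper itself is a sketch, not part of the shipped client:

```python
# Required fields per action_type, taken from the action-space table.
REQUIRED_FIELDS = {
    "inspect": {"table"},
    "profile_column": {"table", "column"},
    "run_sql": {"sql"},
    "fix_cell": {"table", "row_id", "column", "value", "reason"},
    "fix_column": {"table", "column", "value", "reason"},
    "delete_row": {"table", "row_id", "reason"},
    "validate": set(),
    "submit": set(),
    "export": set(),
    "classify_intent": {"value"},
    "select_tables": {"tables"},
    "join_tables": {"table", "table2", "key"},
}

def missing_fields(action: dict) -> set:
    """Return the fields an action payload still needs before sending it."""
    required = REQUIRED_FIELDS[action["action_type"]]
    return required - action.keys()
```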

## Reward System

Dense per-step rewards — every action returns a signal:

| Action | Reward | Cap |
|---|---|---|
| `inspect` | +0.02 | 3 rewarded |
| `profile_column` | +0.03 | 3 rewarded |
| `run_sql` | +0.03 | 3 rewarded |
| `select_tables` | +0.02 | 2 rewarded (only if ≥2 tables exist) |
| `validate` | +0.05 × (checks_passed / 6) | 2 rewarded |
| `fix_cell` — correct | +0.15 | — |
| `fix_cell` — false positive | −0.20 | — |
| `fix_cell` — trap cell | −0.40 | — |
| `fix_column` — has issues | +0.15 to +0.45 (scales with fraction of total issues) | — |
| `fix_column` — no issues | −0.10 | — |
| `delete_row` — valid | +0.15 | — |
| `delete_row` — false positive | −0.20 | — |
| `classify_intent` — correct | +0.10 | — |
| `classify_intent` — wrong | −0.10 | — |
| `join_tables` — valid join | +0.20 | — |
| `join_tables` — invalid | −0.20 | — |
| `submit` — all resolved | +0.10 | — |
| `submit` — issues remain | −0.10 | — |
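The Cap column means investigation actions stop paying out after a few uses, so the agent cannot farm reward by inspecting forever. A minimal sketch of that counter logic (illustrative, not `reward.py`'s actual code):

```python
# Per-episode caps on investigation rewards, from the table above.
INVESTIGATION_REWARDS = {"inspect": 0.02, "profile_column": 0.03, "run_sql": 0.03}
CAP = 3

def investigation_reward(action_type: str, counts: dict) -> float:
    """Pay the flat reward only for the first CAP uses of each action type."""
    counts[action_type] = counts.get(action_type, 0) + 1
    if counts[action_type] > CAP:
        return 0.0
    return INVESTIGATION_REWARDS.get(action_type, 0.0)
```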

## Validation (6 Checks)

| Check | Passes when |
|---|---|
| `null_check` | Null issues resolved (confidence-weighted) |
| `type_check` | Type errors castable to the correct type |
| `range_check` | No negatives in must-be-positive columns |
| `distribution_check` | Column mean drift < 20% from baseline |
| `duplicate_check` | Duplicate count reduced |
| `outlier_check` | Flagged outlier rows within an acceptable range of the median |
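As one concrete example, `distribution_check` compares a column's mean before and after cleaning. A sketch assuming the straightforward relative-drift reading of the rule:

```python
def distribution_check(baseline_mean: float, current_mean: float,
                       max_drift: float = 0.20) -> bool:
    """Pass if the cleaned column's mean drifted less than 20% from baseline."""
    if baseline_mean == 0:
        return current_mean == 0   # avoid dividing by zero
    drift = abs(current_mean - baseline_mean) / abs(baseline_mean)
    return drift < max_drift
```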

## Multiple Tables

Upload 2–3 files in the Agent Demo tab. They are merged into a multi-sheet XLSX — each sheet becomes a separate SQLite table. The agent will:

1. `select_tables` — list all available tables
2. Inspect schemas and attempt `join_tables` with the best matching key column
3. Profile and bulk-fix each table independently

Supported formats per file: CSV, JSON, JSONL, Parquet, XLSX.

Alternatively, upload a single XLSX with multiple sheets — each sheet loads as a separate table automatically.
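The "best matching key column" in step 2 could be approximated with a name-matching heuristic like the one below; the real agent is LLM-driven, so this is only an illustrative sketch:

```python
from typing import List, Optional

def guess_join_key(columns_a: List[str], columns_b: List[str]) -> Optional[str]:
    """Pick a column name shared by both tables, preferring *_id-style keys."""
    names_b = {c.lower() for c in columns_b}
    shared = [c for c in columns_a if c.lower() in names_b]
    if not shared:
        return None
    # Sort so that columns ending in "id" come first
    shared.sort(key=lambda c: (not c.lower().endswith("id"), c))
    return shared[0]
```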


## Business Query Feature

For the Business Analytics intent, enter a SQL SELECT query in the UI. After the agent finishes cleaning (Hard task), the query runs on the cleaned in-memory database and the results are displayed in the UI.

```sql
-- Example
SELECT "Pclass", COUNT(*) AS passengers, AVG("Fare") AS avg_fare
FROM dataset
GROUP BY "Pclass"
ORDER BY avg_fare DESC
```

This works with any intent — not just business_query.


## Quick Start

### Docker (recommended)

```bash
docker build -t sqlsherlock-env:latest .
docker run -p 7860:7860 -e HF_TOKEN=hf_your_token sqlsherlock-env:latest
curl http://localhost:7860/health
```

### Local Python

```bash
pip install -r sqlsherlock_env/server/requirements.txt
cd sqlsherlock_env
PYTHONPATH=. uvicorn server.app:app --host 0.0.0.0 --port 7860
```

## Run Baseline Inference

```bash
export API_BASE_URL="https://api.openai.com/v1"
export MODEL_NAME="gpt-4.1-mini"
export HF_TOKEN="hf_your_token"
export SPACE_URL="http://localhost:7860"
export DEMO_INTENT="visualization"   # or ml_training / business_query
python inference.py
```

Inference output format (judges parse this exactly):

```
=== SQLSherlock-Env  intent=visualization  model=gpt-4.1-mini ===

[START] task=viz_easy env=sqlsherlock_env model=gpt-4.1-mini
[STEP]  step=1 action=inspect reward=0.02 done=false error=null
[STEP]  step=2 action=profile_column(Age) reward=0.03 done=false error=null
[STEP]  step=7 action=fix_column(Age,29.5) reward=0.22 done=false error=null
...
[END]   success=true steps=18 rewards=0.02,0.03,...

[START] task=viz_medium env=sqlsherlock_env model=gpt-4.1-mini
...
[END]   success=true steps=28 rewards=...

[START] task=viz_hard env=sqlsherlock_env model=gpt-4.1-mini
...
[END]   success=true steps=35 rewards=...
```

## Using Any Dataset

```python
from sqlsherlock_env.client import SQLSherlockEnv

env = SQLSherlockEnv(base_url="http://localhost:7860")

# HuggingFace dataset
obs = env.reset(dataset="phihung/titanic", task_id="viz_easy", intent="visualization")

# Local file (CSV, JSON, Parquet, XLSX)
obs = env.reset(dataset="/path/to/data.csv", task_id="ml_medium", intent="ml_training")

# Upload via API, then use:
# POST /upload_dataset  →  returns {"dataset_id": "..."}
# obs = env.reset(dataset="upload://dataset_id", task_id="bq_hard")
```

## API Reference

| Method | Path | Description |
|---|---|---|
| WS | `/ws` | Persistent WebSocket session |
| POST | `/reset` | Start an episode (`dataset`, `task_id`, `intent`, `output_format`) |
| POST | `/step` | Execute one action |
| GET | `/state` | Current episode state |
| GET | `/health` | Health check + task list |
| GET | `/tasks` | All 9 task definitions |
| POST | `/upload_dataset` | Upload CSV / JSON / Parquet / XLSX |
| GET | `/download/{file_id}` | Download the cleaned output |
| GET | `/docs` | OpenAPI Swagger UI |

## Environment Variables

| Variable | Default | Description |
|---|---|---|
| `HF_TOKEN` | (required) | HuggingFace / OpenAI API key |
| `API_BASE_URL` | `https://api.openai.com/v1` | LLM API endpoint |
| `MODEL_NAME` | `gpt-4.1-mini` | Model identifier |
| `SPACE_URL` | `http://localhost:7860` | Environment server URL |
| `DEMO_INTENT` | `visualization` | Intent for inference.py (visualization / ml_training / business_query) |

## Project Structure

```
SQLSherlock-env/
├── Dockerfile                     ← HF Spaces Docker entrypoint
├── README.md                      ← this file
├── inference.py                   ← hackathon baseline ([START]/[STEP]/[END])
├── pyproject.toml
├── sqlsherlock_env/
│   ├── client.py                  ← sync WebSocket/HTTP client
│   ├── models.py                  ← Action / Observation / State (Pydantic)
│   └── server/
│       ├── app.py                 ← FastAPI + Gradio mount
│       ├── environment.py         ← RL core: reset() / step() — 9 tasks
│       ├── ui.py                  ← Gradio UI (intent selector, live log, download)
│       ├── database.py            ← In-memory SQLite engine per episode
│       ├── dataset_loader.py      ← CSV / JSON / JSONL / Parquet / XLSX / HF
│       ├── schema_profiler.py     ← Column stats: median, mode, std, IQR
│       ├── issue_detector.py      ← 8 issue types + trap planting
│       ├── validator.py           ← 6-check before/after validator
│       ├── reward.py              ← Dense per-step reward calculator
│       ├── exporter.py            ← Format-preserving output writer
│       ├── requirements.txt
│       └── graders/
│           ├── __init__.py        ← Routes 9 task IDs to 3 graders
│           ├── universal.py       ← 7-step scoring pipeline (shared)
│           ├── task1.py           ← Easy grader
│           ├── task2.py           ← Medium grader
│           └── task3.py           ← Hard grader (trap + FK + reasoning)
└── tests/
    └── test_environment.py        ← 58 tests
```
## GRPO Training

```bash
pip install trl transformers torch
export SPACE_URL="http://localhost:7860"
export MODEL_ID="Qwen/Qwen2.5-1.5B-Instruct"
python train.py
```

## Run Tests

```bash
PYTHONPATH=sqlsherlock_env pytest tests/test_environment.py -q
# 58 tests, all pass
```