---
title: SQLSherlock Env
emoji: π
colorFrom: indigo
colorTo: blue
sdk: docker
app_port: 7860
tags:
  - openenv
  - reinforcement-learning
  - data-quality
pinned: false
---
# SQLSherlock-Env

An RL environment where an AI agent performs complete data quality audits on real-world datasets.

Data cleaning consumes roughly 80% of a data scientist's time. SQLSherlock trains and evaluates AI agents to do it automatically: discover issues through statistical investigation, fix them with the right strategy per column type, validate the result, and export a clean dataset.
Key design principles:

- Intent-driven cleaning: 3 intents (Visualization / ML Training / Business Analytics), each with 3 escalating difficulty levels
- Real dataset scanning: the environment scans the real dataset at `reset()` and builds a ground-truth issue catalogue dynamically
- Any dataset: HuggingFace datasets, local CSV / JSON / Parquet / XLSX, or raw CSV text
- Dense reward signal: every action produces a training signal, not just end-of-episode binary feedback
- The Trap: hard tasks plant a deliberately suspicious-looking but correct value; touching it costs −0.40
## Task Architecture: 3 Intents × 3 Difficulties = 9 Tasks

The user selects one intent; the agent automatically runs Easy → Medium → Hard for that intent.
| Intent | Easy | Medium | Hard |
|---|---|---|---|
| Visualization | `viz_easy` | `viz_medium` | `viz_hard` |
| ML Training | `ml_easy` | `ml_medium` | `ml_hard` |
| Business Analytics | `bq_easy` | `bq_medium` | `bq_hard` |
### Issue Coverage per Difficulty

| Difficulty | Issues Covered | Max Steps |
|---|---|---|
| Easy | nulls, type errors, whitespace, inconsistent categories | 30 |
| Medium | + constraint violations, statistical outliers | 40 |
| Hard | + duplicate rows, FK violations, plus the trap | 50 |
## Episode Flow

```
User selects: dataset + intent + output format
                │
                ▼
┌──────────────────────────────┐
│ Easy Task   (30 steps max)   │
│ Medium Task (40 steps max)   │──► per-step reward signals
│ Hard Task   (50 steps max)   │
└──────────────────────────────┘
                │
                ▼
Grader scores [0.0 – 1.0]
Cleaned file exported
User downloads result
```
### Inside Each Episode

```
reset(dataset, task_id, intent)
                │
                ▼
┌────────────────────────────────────────────┐
│ DatabaseEngine                             │
│                                            │
│ 1. load(source)         CSV/JSON/Parquet   │
│ 2. records_to_sqlite()  in-memory SQLite   │
│ 3. deep_copy(originals) clean snapshot     │
│ 4. profile_table()      stats per column   │
│ 5. detect_issues()      8 issue types      │
│ 6. Validator(baseline)  6 checks           │
│ 7. detect_trap()        hard tasks only    │
└────────────────────────────────────────────┘
                │
                ▼
Observation → Agent Step Loop
                │
    investigate → fix → validate → export
                │
                ▼
Grader.score() → [0.0 – 1.0]
```
## Scoring Formulas

Easy matters most: the basic cleaning foundation carries the largest weight in every cascade.

### Easy

```
score = resolution(null + type_error + whitespace + category) × 0.80
      + validation × 0.20
      − fp_penalty
```

### Medium (easy = 70% of the score)

```
score = easy_score × 0.70
      + avg(constraint_resolved + outlier_resolved + duplicate_resolved) × 0.30
      − fp_penalty
```

### Hard (medium = 70% of the score)

```
# With FK violations:
score = medium_score × 0.70 + fk_resolved × 0.30 + reasoning_bonus − trap_penalty

# Single-table dataset (no FK violations to find):
score = medium_score + reasoning_bonus − trap_penalty
```
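As a rough Python sketch, the cascade above composes like this (weights are taken from this README; the project's actual grader lives elsewhere and may differ):

```python
# Illustrative sketch of the cascading score formulas; weights come
# from this README, not from the project's grader source.

def clamp(x):
    """Clamp a raw score into the documented [0.0, 1.0] range."""
    return max(0.0, min(1.0, x))

def easy_score(resolution, validation, fp_penalty=0.0):
    # resolution and validation are fractions in [0, 1]
    return clamp(resolution * 0.80 + validation * 0.20 - fp_penalty)

def medium_score(easy, extra_resolved, fp_penalty=0.0):
    # extra_resolved = avg(constraint + outlier + duplicate resolution)
    return clamp(easy * 0.70 + extra_resolved * 0.30 - fp_penalty)

def hard_score(medium, fk_resolved=None, reasoning_bonus=0.0, trap_penalty=0.0):
    if fk_resolved is None:  # single-table dataset: no FK component
        raw = medium + reasoning_bonus - trap_penalty
    else:
        raw = medium * 0.70 + fk_resolved * 0.30 + reasoning_bonus - trap_penalty
    return clamp(raw)
```

Note how a perfect easy score still contributes only 70% of medium, which in turn is only 70% of hard, so weak basics cap the final result.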
## Penalties & Bonuses

| Component | Value |
|---|---|
| False-positive penalty | −0.05 per clean cell changed (capped at −0.35) |
| Skipping `validate()` | validation component × 0.50 |
| Trap hit | −0.40 |
| Reasoning bonus | +0.05 (hard tasks, statistical terms used in reasons) |
## Grading Pipeline (7 Steps)

1. Zero-change guard: if nothing changed → 0.0
2. Resolution score: per issue, confidence-weighted
3. False-positive penalty: −0.05 per clean cell touched (capped at −0.35)
4. Trap penalty: −0.40 if the trap cell was modified (hard only)
5. Validation score: checks_passed / total × 0.30
6. Reasoning bonus: +0.05 for statistical reasoning terms in reasons
7. Final weighted sum: clamped to [0.0, 1.0]
## Data Cleaning Capabilities

| Issue Type | Detection | Fix Strategy | Difficulty |
|---|---|---|---|
| Null values | `IS NULL` or empty string | Numeric → median; string → mode | All |
| Type errors | Text in a ≥80%-numeric column | Column median | All |
| Whitespace | Leading/trailing/extra spaces | Trimmed string | All |
| Inconsistent categories | Case variants ("male"/"Male"/"MALE") | Dominant form | All |
| Constraint violations | Negative values in must-be-positive columns | ABS(value) | Medium+ |
| Statistical outliers | IQR-based: outside Q1 − 3×IQR or Q3 + 3×IQR | Column median | Medium+ |
| Duplicates | Same natural key appearing twice | `delete_row` | Hard |
| FK violations | Orphan references across tables | `delete_row` | Hard |
| Trap (hard only) | Planted 2× value with z < 3 (looks normal) | DO NOT TOUCH (−0.40) | Hard |

Smart imputation: `profile_column` returns median, mode, mean, null_rate, dtype, and z_scores. `fix_column` bulk-fixes all nulls, type errors, and negatives in one step.
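The outlier rule from the table (3×IQR fences, median replacement) can be sketched with Python's standard library; this is a minimal illustration of the documented rule, not the environment's own detector:

```python
# Sketch of the IQR outlier rule from the table above: values outside
# Q1 - 3*IQR or Q3 + 3*IQR are flagged and replaced with the median.
from statistics import median, quantiles

def iqr_outliers(values, k=3.0):
    q1, _, q3 = quantiles(values, n=4)       # quartiles (exclusive method)
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr      # the 3xIQR fences
    return [v for v in values if v < lo or v > hi]

def fix_outliers(values, k=3.0):
    flagged = set(iqr_outliers(values, k))
    med = median(values)
    return [med if v in flagged else v for v in values]
```

For example, `iqr_outliers([10, 11, 12, 13, 14, 15, 16, 1000])` flags only `1000`, which `fix_outliers` replaces with the column median.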
## Action Space

| `action_type` | Required fields | Description |
|---|---|---|
| `inspect` | `table` | View all current rows |
| `profile_column` | `table`, `column` | Stats: median, mode, mean, std, null_count, null_rate, dtype, z_scores |
| `run_sql` | `sql` | Read-only SELECT (max 50 rows) |
| `fix_cell` | `table`, `row_id`, `column`, `value`, `reason` | Fix one specific cell |
| `fix_column` | `table`, `column`, `value`, `reason` | Bulk fix: all nulls + type errors + negatives |
| `delete_row` | `table`, `row_id`, `reason` | Remove a duplicate or FK-violation row |
| `validate` | (none) | Run the 6-check validator on the current state |
| `submit` | (none) | Score and end the episode |
| `export` | (none) | Write the cleaned file, score, and end the episode |
| `classify_intent` | `value` | Declare the inferred intent (visualization / ml_training / business_query) |
| `select_tables` | `tables` | Declare active tables for multi-table analysis |
| `join_tables` | `table`, `table2`, `key` | LEFT JOIN two tables on a key column |
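The exact wire format accepted by the client is not shown here, so the dict shape below is an assumption; only the field names come from the table above. A small sketch of building and sanity-checking action payloads:

```python
# Hypothetical action-payload builder. Required field names are taken
# from the Action Space table; the "action_type" dict shape is assumed.
REQUIRED = {
    "inspect": {"table"},
    "profile_column": {"table", "column"},
    "run_sql": {"sql"},
    "fix_cell": {"table", "row_id", "column", "value", "reason"},
    "fix_column": {"table", "column", "value", "reason"},
    "delete_row": {"table", "row_id", "reason"},
    "validate": set(),
    "submit": set(),
    "export": set(),
}

def make_action(action_type, **fields):
    missing = REQUIRED[action_type] - fields.keys()
    if missing:
        raise ValueError(f"{action_type} missing fields: {sorted(missing)}")
    return {"action_type": action_type, **fields}

action = make_action("fix_cell", table="dataset", row_id=7,
                     column="Age", value=29.5,
                     reason="null imputed with column median")
```

Always supply a `reason` on fixing actions; the hard-task reasoning bonus is scored from these strings.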
## Reward System

Dense per-step rewards: every action returns a signal.

| Action | Reward | Cap |
|---|---|---|
| `inspect` | +0.02 | 3 rewarded |
| `profile_column` | +0.03 | 3 rewarded |
| `run_sql` | +0.03 | 3 rewarded |
| `select_tables` | +0.02 | 2 rewarded (only if ≥2 tables exist) |
| `validate` | +0.05 × (checks_passed / 6) | 2 rewarded |
| `fix_cell` (correct) | +0.15 | (none) |
| `fix_cell` (false positive) | −0.20 | (none) |
| `fix_cell` (trap cell) | −0.40 | (none) |
| `fix_column` (has issues) | +0.15 to +0.45 (scales with fraction of total issues) | (none) |
| `fix_column` (no issues) | −0.10 | (none) |
| `delete_row` (valid) | +0.15 | (none) |
| `delete_row` (false positive) | −0.20 | (none) |
| `classify_intent` (correct) | +0.10 | (none) |
| `classify_intent` (wrong) | −0.10 | (none) |
| `join_tables` (valid join) | +0.20 | (none) |
| `join_tables` (invalid) | −0.20 | (none) |
| `submit` (all resolved) | +0.10 | (none) |
| `submit` (issues remain) | −0.10 | (none) |
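The caps on investigation actions mean the bonus only pays out a limited number of times per episode. A minimal sketch of that bookkeeping, using the values from the table (illustrative only, not the environment's reward code):

```python
# Sketch of capped investigation rewards: each capped action type pays
# its bonus only the first N times it is used in an episode.
from collections import Counter

CAPPED = {
    "inspect": (0.02, 3),
    "profile_column": (0.03, 3),
    "run_sql": (0.03, 3),
    "select_tables": (0.02, 2),
}

class RewardTracker:
    def __init__(self):
        self.counts = Counter()

    def step_reward(self, action_type):
        value, cap = CAPPED[action_type]
        self.counts[action_type] += 1
        return value if self.counts[action_type] <= cap else 0.0
```

After the cap, further investigation is free but unrewarded, nudging the agent toward fixing rather than endless profiling.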
## Validation (6 Checks)

| Check | Passes when |
|---|---|
| `null_check` | Null issues resolved (confidence-weighted) |
| `type_check` | Type errors castable to the correct type |
| `range_check` | No negatives in must-be-positive columns |
| `distribution_check` | Column mean drift < 20% from baseline |
| `duplicate_check` | Duplicate count reduced |
| `outlier_check` | Flagged outlier rows within an acceptable range of the median |
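The `distribution_check` rule can be sketched directly from its definition above (a hedged illustration; the environment's validator may compute it differently):

```python
# Sketch of distribution_check: passes when the cleaned column's mean
# drifts less than 20% from the pre-cleaning baseline mean.
from statistics import mean

def distribution_check(baseline_values, cleaned_values, tolerance=0.20):
    base = mean(baseline_values)
    if base == 0:
        return mean(cleaned_values) == 0  # avoid division by zero
    drift = abs(mean(cleaned_values) - base) / abs(base)
    return drift < tolerance
```

This is why blanket fixes that shift a column's distribution (for example imputing every value with 0) fail validation even if they remove nulls.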
## Multiple Tables

Upload 2–3 files in the Agent Demo tab. They are merged into a multi-sheet XLSX, and each sheet becomes a separate SQLite table. The agent will:

1. `select_tables`: list all available tables
2. Inspect schemas and attempt `join_tables` with the best matching key column
3. Profile and bulk-fix each table independently

Supported formats per file: CSV, JSON, JSONL, Parquet, XLSX.

Alternatively, upload a single XLSX with multiple sheets; each sheet loads as a separate table automatically.
## Business Query Feature

For the Business Analytics intent, enter a SQL SELECT query in the UI. After the agent finishes cleaning (Hard task), the query runs on the cleaned in-memory database and the results are displayed in the UI.

```sql
-- Example
SELECT "Pclass", COUNT(*) AS passengers, AVG("Fare") AS avg_fare
FROM dataset
GROUP BY "Pclass"
ORDER BY avg_fare DESC
```

This works on any intent, not just business_query.
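Since the cleaned database is plain SQLite, the example query can be reproduced locally with Python's `sqlite3`. The toy table below stands in for a cleaned single-table dataset with Titanic-style columns:

```python
# Reproducing the example business query against an in-memory SQLite
# table. The table name "dataset" and columns mirror the SQL example.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE dataset ("Pclass" INTEGER, "Fare" REAL)')
conn.executemany("INSERT INTO dataset VALUES (?, ?)",
                 [(1, 80.0), (1, 60.0), (3, 8.0), (3, 7.0), (2, 20.0)])

rows = conn.execute('''
    SELECT "Pclass", COUNT(*) AS passengers, AVG("Fare") AS avg_fare
    FROM dataset
    GROUP BY "Pclass"
    ORDER BY avg_fare DESC
''').fetchall()
# rows -> [(1, 2, 70.0), (2, 1, 20.0), (3, 2, 7.5)]
```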
## Quick Start

### Docker (recommended)

```bash
docker build -t sqlsherlock-env:latest .
docker run -p 7860:7860 -e HF_TOKEN=hf_your_token sqlsherlock-env:latest
curl http://localhost:7860/health
```

### Local Python

```bash
pip install -r sqlsherlock_env/server/requirements.txt
cd sqlsherlock_env
PYTHONPATH=. uvicorn server.app:app --host 0.0.0.0 --port 7860
```
## Run Baseline Inference

```bash
export API_BASE_URL="https://api.openai.com/v1"
export MODEL_NAME="gpt-4.1-mini"
export HF_TOKEN="hf_your_token"
export SPACE_URL="http://localhost:7860"
export DEMO_INTENT="visualization"   # or ml_training / business_query
python inference.py
```
Inference output format (judges parse this exactly):

```
=== SQLSherlock-Env intent=visualization model=gpt-4.1-mini ===
[START] task=viz_easy env=sqlsherlock_env model=gpt-4.1-mini
[STEP] step=1 action=inspect reward=0.02 done=false error=null
[STEP] step=2 action=profile_column(Age) reward=0.03 done=false error=null
[STEP] step=7 action=fix_column(Age,29.5) reward=0.22 done=false error=null
...
[END] success=true steps=18 rewards=0.02,0.03,...
[START] task=viz_medium env=sqlsherlock_env model=gpt-4.1-mini
...
[END] success=true steps=28 rewards=...
[START] task=viz_hard env=sqlsherlock_env model=gpt-4.1-mini
...
[END] success=true steps=35 rewards=...
```
## Using Any Dataset

```python
from sqlsherlock_env.client import SQLSherlockEnv

env = SQLSherlockEnv(base_url="http://localhost:7860")

# HuggingFace dataset
obs = env.reset(dataset="phihung/titanic", task_id="viz_easy", intent="visualization")

# Local file (CSV, JSON, Parquet, XLSX)
obs = env.reset(dataset="/path/to/data.csv", task_id="ml_medium", intent="ml_training")

# Upload via API, then use:
# POST /upload_dataset -> returns {"dataset_id": "..."}
# obs = env.reset(dataset="upload://dataset_id", task_id="bq_hard")
```
## API Reference

| Method | Path | Description |
|---|---|---|
| WS | `/ws` | Persistent WebSocket session |
| POST | `/reset` | Start an episode (dataset, task_id, intent, output_format) |
| POST | `/step` | Execute one action |
| GET | `/state` | Current episode state |
| GET | `/health` | Health check + task list |
| GET | `/tasks` | All 9 task definitions |
| POST | `/upload_dataset` | Upload CSV / JSON / Parquet / XLSX |
| GET | `/download/{file_id}` | Download the cleaned output |
| GET | `/docs` | OpenAPI Swagger UI |
## Environment Variables

| Variable | Default | Description |
|---|---|---|
| `HF_TOKEN` | (required) | HuggingFace / OpenAI API key |
| `API_BASE_URL` | `https://api.openai.com/v1` | LLM API endpoint |
| `MODEL_NAME` | `gpt-4.1-mini` | Model identifier |
| `SPACE_URL` | `http://localhost:7860` | Environment server URL |
| `DEMO_INTENT` | `visualization` | Intent for inference.py (visualization / ml_training / business_query) |
## Project Structure

```
SQLSherlock-env/
├── Dockerfile               ← HF Spaces Docker entrypoint
├── README.md                ← this file
├── inference.py             ← hackathon baseline ([START]/[STEP]/[END])
├── pyproject.toml
├── sqlsherlock_env/
│   ├── client.py            ← sync WebSocket/HTTP client
│   ├── models.py            ← Action / Observation / State (Pydantic)
│   └── server/
│       ├── app.py           ← FastAPI + Gradio mount
│       ├── environment.py   ← RL core: reset() / step(), 9 tasks
│       ├── ui.py            ← Gradio UI (intent selector, live log, download)
│       ├── database.py      ← In-memory SQLite engine per episode
│       ├── dataset_loader.py ← CSV / JSON / JSONL / Parquet / XLSX / HF
│       ├── schema_profiler.py ← Column stats: median, mode, std, IQR
│       ├── issue_detector.py ← 8 issue types + trap planting
│       ├── validator.py     ← 6-check before/after validator
│       ├── reward.py        ← Dense per-step reward calculator
│       ├── exporter.py      ← Format-preserving output writer
│       ├── requirements.txt
│       └── graders/
│           ├── __init__.py  ← Routes 9 task IDs to 3 graders
│           ├── universal.py ← 7-step scoring pipeline (shared)
│           ├── task1.py     ← Easy grader
│           ├── task2.py     ← Medium grader
│           └── task3.py     ← Hard grader (trap + FK + reasoning)
└── tests/
    └── test_environment.py  ← 58 tests
```
## GRPO Training

```bash
pip install trl transformers torch
export SPACE_URL="http://localhost:7860"
export MODEL_ID="Qwen/Qwen2.5-1.5B-Instruct"
python train.py
```
## Run Tests

```bash
PYTHONPATH=sqlsherlock_env pytest tests/test_environment.py -q
# 58 tests, all pass
```