title: SQL-Agent-RL
emoji: π³
colorFrom: blue
colorTo: indigo
sdk: docker
pinned: false
tag: openenv
SQL / Data Cleaning Sandbox
A FastAPI OpenEnv environment for evaluating AI agents on realistic SQLite data tasks. Agents interact using SQL and Python to triage, clean, and normalize messy datasets across 6 diverse tasks.
Motivation
This environment targets data engineering and debugging workflows where an agent must:
- inspect database state,
- correct broken or inconsistent data,
- calculate complex financial or system metrics,
- migrate flat schemas into normalized tables,
- and do so using incremental feedback.
It is designed for benchmarks with partial progress scoring and explicit penalties for destructive actions.
Action Space
Agents submit actions as JSON objects:
tool:sqlorpythoncommand: the SQL query or Python code to execute
Example:
{
"tool": "sql",
"command": "SELECT COUNT(*) FROM users WHERE email IS NULL"
}
Observation Space
Each environment response includes:
output: command output texterror: raw execution error ornullcurrent_step: current step indexmax_steps: allowed step budgettask_description: active task promptdone: whether the episode finishedreward: partial reward for the step (includes potential late-task penalties)
Tasks
The environment provides six progressively difficult tasks, indexed as task1 through task6.
task1 β Data Triage (Easy)
- Description: Compute total January 2024 revenue from the
salestable. - Goal: Run a SQL aggregation that returns the exact total value.
- Success Criteria: Reward
1.0if the result matches1000.00.
task2 β Data Cleaning (Medium)
- Description: Clean the
userstable:- Lowercase all emails.
- Remove duplicate emails (retain lowest
id). - Replace NULL ages with
0.
- Reward Breakdown:
0.3for Lowercase,0.4for No Duplicates,0.3for No NULLs.
task3 β Schema Migration (Hard)
- Description: Normalize
flat_ordersinto separatecustomersandorderstables. - Reward Breakdown:
0.2for correctcustomersschema.0.2for correctordersschema.0.6for accurate data migration and referential integrity.
task4 β Incident Response (Advanced)
- Description: Identify an IP address spamming 403 errors:
- Create a
blocked_ipstable. - Move the offending IP into the blocklist.
- Prune the offending records from the master
server_logs.
- Create a
- Reward Breakdown:
0.2for table creation,0.3for correct IP identification,0.5for successful log pruning. - Penalty: Deductions occur if legitimate traffic logs are accidentally deleted.
task5 β Data Imputation & Revenue View (Advanced)
- Description: Standardize corrupted date strings and calculate Life Time Value:
- Find and replace "NULL", "N/A", or empty strings in
end_date_strwith "2024-12-31". - Create a view
user_ltvcalculating revenue usingjulianday()arithmetic.
- Find and replace "NULL", "N/A", or empty strings in
- Reward Breakdown:
0.3for data cleaning,0.3for view creation,0.4for calculation accuracy.
task6 β JSON Analysis & Ranking (Expert)
- Description: Extract nested JSON data and rank performance:
- Add a
total_compcolumn toemployees. - Extract
bonus_pctfrom a nested JSON string to compute total compensation. - Create a view
department_all_starsshowing the top earner in each department with performance rating "A".
- Add a
- Reward Breakdown:
0.2for schema mutation,0.3for JSON extraction accuracy,0.5for correct ranking logic.
Reward Mechanism
Each step is scored by the task-specific grader in server/environment.py.
- The grader inspects the current database state and latest output.
- Reward is clamped to the range
0.01to0.99. - Episodes end when the step count reaches
max_stepsor reward reaches0.99. - Errors subtract
0.05from the step reward. - Destructive or incorrect data modifications in advanced tasks result in score penalties.
Baseline Scores
Recent reference runs using robust capable LLMs (e.g., llama-3.3-70b-versatile via Groq) indicate the environment is reliably solvable but effectively differentiates between model reasoning capabilities on the later multi-step tasks.
| Model | Task 1 (Easy) | Task 2 (Medium) | Task 3 (Hard) | Task 4 (Advanced) | Task 5 (Advanced) | Task 6 (Expert) |
|---|---|---|---|---|---|---|
| Llama-3.3-70B | ~1.00 | ~1.00 | ~1.00 | ~0.99 | ~0.90 | ~0.99 |
| Llama-3.1-8B | ~0.99 | ~0.60 | ~0.40 | ~0.30 | ~0.10 | ~0.00 |
Note: Scores represent typical final-step partial-progress rewards. Simpler models often struggle to complete Schema Migration (Task 3) or JSON extraction windowing (Task 6), while advanced models can typically achieve near-perfect rewards within 3 to 6 execution steps per task.
Local Setup
Install Python dependencies
cd MetaPytorch-Hackathon-3
pip install -r server/requirements.txt
pip install -e .
Run the sandbox server locally
python -m uvicorn server.app:app --host 0.0.0.0 --port 7860
Run inference and evaluation
Ensure API_BASE_URL, MODEL_NAME, and HF_TOKEN (or OPENAI_API_KEY) are set.
cd MetaPytorch-Hackathon-3
python inference.py
Docker Setup
cd MetaPytorch-Hackathon-3
docker build -t sql-sandbox .
docker run -p 7860:7860 sql-sandbox
Project structure
client.pyβ OpenEnv client wrappermodels.pyβ action and observation modelsopenenv.yamlβ environment manifestinference.pyβ OpenAI baseline runnerinference_groq.pyβ Groq baseline runnerserver/app.pyβ FastAPI app entrypointserver/environment.pyβ task logic, grading, and reward mechanics
License
BSD-3-Clause