shreyas231219's picture
Fix directory pointers in README instructions
eed7fd0
metadata
title: SQL-Agent-RL
emoji: 🐳
colorFrom: blue
colorTo: indigo
sdk: docker
pinned: false
tag: openenv

SQL / Data Cleaning Sandbox

A FastAPI OpenEnv environment for evaluating AI agents on realistic SQLite data tasks. Agents interact using SQL and Python to triage, clean, and normalize messy datasets across 6 diverse tasks.

Motivation

This environment targets data engineering and debugging workflows where an agent must:

  • inspect database state,
  • correct broken or inconsistent data,
  • calculate complex financial or system metrics,
  • migrate flat schemas into normalized tables,
  • and do so using incremental feedback.

It is designed for benchmarks with partial progress scoring and explicit penalties for destructive actions.

Action Space

Agents submit actions as JSON objects:

  • tool: sql or python
  • command: the SQL query or Python code to execute

Example:

{
  "tool": "sql",
  "command": "SELECT COUNT(*) FROM users WHERE email IS NULL"
}

Observation Space

Each environment response includes:

  • output: command output text
  • error: raw execution error or null
  • current_step: current step index
  • max_steps: allowed step budget
  • task_description: active task prompt
  • done: whether the episode finished
  • reward: partial reward for the step (includes potential late-task penalties)

Tasks

The environment provides six progressively difficult tasks, indexed as task1 through task6.

task1 β€” Data Triage (Easy)

  • Description: Compute total January 2024 revenue from the sales table.
  • Goal: Run a SQL aggregation that returns the exact total value.
  • Success Criteria: Reward 1.0 if the result matches 1000.00.

task2 β€” Data Cleaning (Medium)

  • Description: Clean the users table:
    • Lowercase all emails.
    • Remove duplicate emails (retain lowest id).
    • Replace NULL ages with 0.
  • Reward Breakdown: 0.3 for Lowercase, 0.4 for No Duplicates, 0.3 for No NULLs.

task3 β€” Schema Migration (Hard)

  • Description: Normalize flat_orders into separate customers and orders tables.
  • Reward Breakdown:
    • 0.2 for correct customers schema.
    • 0.2 for correct orders schema.
    • 0.6 for accurate data migration and referential integrity.

task4 β€” Incident Response (Advanced)

  • Description: Identify an IP address spamming 403 errors:
    • Create a blocked_ips table.
    • Move the offending IP into the blocklist.
    • Prune the offending records from the master server_logs.
  • Reward Breakdown: 0.2 for table creation, 0.3 for correct IP identification, 0.5 for successful log pruning.
  • Penalty: Deductions occur if legitimate traffic logs are accidentally deleted.

task5 β€” Data Imputation & Revenue View (Advanced)

  • Description: Standardize corrupted date strings and calculate Life Time Value:
    • Find and replace "NULL", "N/A", or empty strings in end_date_str with "2024-12-31".
    • Create a view user_ltv calculating revenue using julianday() arithmetic.
  • Reward Breakdown: 0.3 for data cleaning, 0.3 for view creation, 0.4 for calculation accuracy.

task6 β€” JSON Analysis & Ranking (Expert)

  • Description: Extract nested JSON data and rank performance:
    • Add a total_comp column to employees.
    • Extract bonus_pct from a nested JSON string to compute total compensation.
    • Create a view department_all_stars showing the top earner in each department with performance rating "A".
  • Reward Breakdown: 0.2 for schema mutation, 0.3 for JSON extraction accuracy, 0.5 for correct ranking logic.

Reward Mechanism

Each step is scored by the task-specific grader in server/environment.py.

  • The grader inspects the current database state and latest output.
  • Reward is clamped to the range 0.01 to 0.99.
  • Episodes end when the step count reaches max_steps or reward reaches 0.99.
  • Errors subtract 0.05 from the step reward.
  • Destructive or incorrect data modifications in advanced tasks result in score penalties.

Baseline Scores

Recent reference runs using robust capable LLMs (e.g., llama-3.3-70b-versatile via Groq) indicate the environment is reliably solvable but effectively differentiates between model reasoning capabilities on the later multi-step tasks.

Model Task 1 (Easy) Task 2 (Medium) Task 3 (Hard) Task 4 (Advanced) Task 5 (Advanced) Task 6 (Expert)
Llama-3.3-70B ~1.00 ~1.00 ~1.00 ~0.99 ~0.90 ~0.99
Llama-3.1-8B ~0.99 ~0.60 ~0.40 ~0.30 ~0.10 ~0.00

Note: Scores represent typical final-step partial-progress rewards. Simpler models often struggle to complete Schema Migration (Task 3) or JSON extraction windowing (Task 6), while advanced models can typically achieve near-perfect rewards within 3 to 6 execution steps per task.

Local Setup

Install Python dependencies

cd MetaPytorch-Hackathon-3
pip install -r server/requirements.txt
pip install -e .

Run the sandbox server locally

python -m uvicorn server.app:app --host 0.0.0.0 --port 7860

Run inference and evaluation

Ensure API_BASE_URL, MODEL_NAME, and HF_TOKEN (or OPENAI_API_KEY) are set.

cd MetaPytorch-Hackathon-3
python inference.py

Docker Setup

cd MetaPytorch-Hackathon-3

docker build -t sql-sandbox .
docker run -p 7860:7860 sql-sandbox

Project structure

  • client.py β€” OpenEnv client wrapper
  • models.py β€” action and observation models
  • openenv.yaml β€” environment manifest
  • inference.py β€” OpenAI baseline runner
  • inference_groq.py β€” Groq baseline runner
  • server/app.py β€” FastAPI app entrypoint
  • server/environment.py β€” task logic, grading, and reward mechanics

License

BSD-3-Clause