shreyas231219's picture
Fix directory pointers in README instructions
eed7fd0
---
title: SQL-Agent-RL
emoji: 🐳
colorFrom: blue
colorTo: indigo
sdk: docker
pinned: false
tag: openenv
---
# SQL / Data Cleaning Sandbox
A FastAPI OpenEnv environment for evaluating AI agents on realistic SQLite data tasks.
Agents interact using SQL and Python to triage, clean, and normalize messy datasets across 6 diverse tasks.
## Motivation
This environment targets data engineering and debugging workflows where an agent must:
- inspect database state,
- correct broken or inconsistent data,
- calculate complex financial or system metrics,
- migrate flat schemas into normalized tables,
- and do so using incremental feedback.
It is designed for benchmarks with partial progress scoring and explicit penalties for destructive actions.
## Action Space
Agents submit actions as JSON objects:
- `tool`: `sql` or `python`
- `command`: the SQL query or Python code to execute
Example:
```json
{
"tool": "sql",
"command": "SELECT COUNT(*) FROM users WHERE email IS NULL"
}
```
## Observation Space
Each environment response includes:
- `output`: command output text
- `error`: raw execution error or `null`
- `current_step`: current step index
- `max_steps`: allowed step budget
- `task_description`: active task prompt
- `done`: whether the episode finished
- `reward`: partial reward for the step (includes potential late-task penalties)
## Tasks
The environment provides six progressively difficult tasks, indexed as `task1` through `task6`.
### task1 β€” Data Triage (Easy)
- **Description**: Compute total January 2024 revenue from the `sales` table.
- **Goal**: Run a SQL aggregation that returns the exact total value.
- **Success Criteria**: Reward `1.0` if the result matches `1000.00`.
### task2 β€” Data Cleaning (Medium)
- **Description**: Clean the `users` table:
- Lowercase all emails.
- Remove duplicate emails (retain lowest `id`).
- Replace NULL ages with `0`.
- **Reward Breakdown**: `0.3` for Lowercase, `0.4` for No Duplicates, `0.3` for No NULLs.
### task3 β€” Schema Migration (Hard)
- **Description**: Normalize `flat_orders` into separate `customers` and `orders` tables.
- **Reward Breakdown**:
- `0.2` for correct `customers` schema.
- `0.2` for correct `orders` schema.
- `0.6` for accurate data migration and referential integrity.
### task4 β€” Incident Response (Advanced)
- **Description**: Identify an IP address spamming 403 errors:
- Create a `blocked_ips` table.
- Move the offending IP into the blocklist.
- Prune the offending records from the master `server_logs`.
- **Reward Breakdown**: `0.2` for table creation, `0.3` for correct IP identification, `0.5` for successful log pruning.
- **Penalty**: Deductions occur if legitimate traffic logs are accidentally deleted.
### task5 β€” Data Imputation & Revenue View (Advanced)
- **Description**: Standardize corrupted date strings and calculate Life Time Value:
- Find and replace "NULL", "N/A", or empty strings in `end_date_str` with "2024-12-31".
- Create a view `user_ltv` calculating revenue using `julianday()` arithmetic.
- **Reward Breakdown**: `0.3` for data cleaning, `0.3` for view creation, `0.4` for calculation accuracy.
### task6 β€” JSON Analysis & Ranking (Expert)
- **Description**: Extract nested JSON data and rank performance:
- Add a `total_comp` column to `employees`.
- Extract `bonus_pct` from a nested JSON string to compute total compensation.
- Create a view `department_all_stars` showing the top earner in each department with performance rating "A".
- **Reward Breakdown**: `0.2` for schema mutation, `0.3` for JSON extraction accuracy, `0.5` for correct ranking logic.
## Reward Mechanism
Each step is scored by the task-specific grader in `server/environment.py`.
- The grader inspects the current database state and latest output.
- Reward is clamped to the range `0.01` to `0.99`.
- Episodes end when the step count reaches `max_steps` or reward reaches `0.99`.
- Errors subtract `0.05` from the step reward.
- Destructive or incorrect data modifications in advanced tasks result in score penalties.
## Baseline Scores
Recent reference runs using robust capable LLMs (e.g., `llama-3.3-70b-versatile` via Groq) indicate the environment is reliably solvable but effectively differentiates between model reasoning capabilities on the later multi-step tasks.
| Model | Task 1 (Easy) | Task 2 (Medium) | Task 3 (Hard) | Task 4 (Advanced) | Task 5 (Advanced) | Task 6 (Expert) |
|---|---|---|---|---|---|---|
| Llama-3.3-70B | ~1.00 | ~1.00 | ~1.00 | ~0.99 | ~0.90 | ~0.99 |
| Llama-3.1-8B | ~0.99 | ~0.60 | ~0.40 | ~0.30 | ~0.10 | ~0.00 |
*Note: Scores represent typical final-step partial-progress rewards. Simpler models often struggle to complete Schema Migration (Task 3) or JSON extraction windowing (Task 6), while advanced models can typically achieve near-perfect rewards within 3 to 6 execution steps per task.*
## Local Setup
### Install Python dependencies
```bash
cd MetaPytorch-Hackathon-3
pip install -r server/requirements.txt
pip install -e .
```
### Run the sandbox server locally
```bash
python -m uvicorn server.app:app --host 0.0.0.0 --port 7860
```
## Run inference and evaluation
Ensure `API_BASE_URL`, `MODEL_NAME`, and `HF_TOKEN` (or `OPENAI_API_KEY`) are set.
```bash
cd MetaPytorch-Hackathon-3
python inference.py
```
## Docker Setup
```bash
cd MetaPytorch-Hackathon-3
docker build -t sql-sandbox .
docker run -p 7860:7860 sql-sandbox
```
## Project structure
- `client.py` β€” OpenEnv client wrapper
- `models.py` β€” action and observation models
- `openenv.yaml` β€” environment manifest
- `inference.py` β€” OpenAI baseline runner
- `inference_groq.py` β€” Groq baseline runner
- `server/app.py` β€” FastAPI app entrypoint
- `server/environment.py` β€” task logic, grading, and reward mechanics
## License
BSD-3-Clause