| --- |
| title: SQL-Agent-RL |
| emoji: π³ |
| colorFrom: blue |
| colorTo: indigo |
| sdk: docker |
| pinned: false |
| tag: openenv |
| --- |
| # SQL / Data Cleaning Sandbox |
|
|
| A FastAPI OpenEnv environment for evaluating AI agents on realistic SQLite data tasks. |
| Agents interact using SQL and Python to triage, clean, and normalize messy datasets across 6 diverse tasks. |
|
|
| ## Motivation |
|
|
| This environment targets data engineering and debugging workflows where an agent must: |
| - inspect database state, |
| - correct broken or inconsistent data, |
| - calculate complex financial or system metrics, |
| - migrate flat schemas into normalized tables, |
| - and do so using incremental feedback. |
|
|
| It is designed for benchmarks with partial progress scoring and explicit penalties for destructive actions. |
|
|
| ## Action Space |
|
|
| Agents submit actions as JSON objects: |
| - `tool`: `sql` or `python` |
| - `command`: the SQL query or Python code to execute |
|
|
| Example: |
|
|
| ```json |
| { |
| "tool": "sql", |
| "command": "SELECT COUNT(*) FROM users WHERE email IS NULL" |
| } |
| ``` |
|
|
| ## Observation Space |
|
|
| Each environment response includes: |
| - `output`: command output text |
| - `error`: raw execution error or `null` |
| - `current_step`: current step index |
| - `max_steps`: allowed step budget |
| - `task_description`: active task prompt |
| - `done`: whether the episode finished |
| - `reward`: partial reward for the step (includes potential late-task penalties) |
|
|
| ## Tasks |
|
|
| The environment provides six progressively difficult tasks, indexed as `task1` through `task6`. |
|
|
| ### task1 β Data Triage (Easy) |
| - **Description**: Compute total January 2024 revenue from the `sales` table. |
| - **Goal**: Run a SQL aggregation that returns the exact total value. |
| - **Success Criteria**: Reward `1.0` if the result matches `1000.00`. |
|
|
| ### task2 β Data Cleaning (Medium) |
| - **Description**: Clean the `users` table: |
| - Lowercase all emails. |
| - Remove duplicate emails (retain lowest `id`). |
| - Replace NULL ages with `0`. |
| - **Reward Breakdown**: `0.3` for Lowercase, `0.4` for No Duplicates, `0.3` for No NULLs. |
|
|
| ### task3 β Schema Migration (Hard) |
| - **Description**: Normalize `flat_orders` into separate `customers` and `orders` tables. |
| - **Reward Breakdown**: |
| - `0.2` for correct `customers` schema. |
| - `0.2` for correct `orders` schema. |
| - `0.6` for accurate data migration and referential integrity. |
|
|
| ### task4 β Incident Response (Advanced) |
| - **Description**: Identify an IP address spamming 403 errors: |
| - Create a `blocked_ips` table. |
| - Move the offending IP into the blocklist. |
| - Prune the offending records from the master `server_logs`. |
| - **Reward Breakdown**: `0.2` for table creation, `0.3` for correct IP identification, `0.5` for successful log pruning. |
| - **Penalty**: Deductions occur if legitimate traffic logs are accidentally deleted. |
|
|
| ### task5 β Data Imputation & Revenue View (Advanced) |
| - **Description**: Standardize corrupted date strings and calculate Life Time Value: |
| - Find and replace "NULL", "N/A", or empty strings in `end_date_str` with "2024-12-31". |
| - Create a view `user_ltv` calculating revenue using `julianday()` arithmetic. |
| - **Reward Breakdown**: `0.3` for data cleaning, `0.3` for view creation, `0.4` for calculation accuracy. |
|
|
| ### task6 β JSON Analysis & Ranking (Expert) |
| - **Description**: Extract nested JSON data and rank performance: |
| - Add a `total_comp` column to `employees`. |
| - Extract `bonus_pct` from a nested JSON string to compute total compensation. |
| - Create a view `department_all_stars` showing the top earner in each department with performance rating "A". |
| - **Reward Breakdown**: `0.2` for schema mutation, `0.3` for JSON extraction accuracy, `0.5` for correct ranking logic. |
|
|
| ## Reward Mechanism |
|
|
| Each step is scored by the task-specific grader in `server/environment.py`. |
| - The grader inspects the current database state and latest output. |
| - Reward is clamped to the range `0.01` to `0.99`. |
| - Episodes end when the step count reaches `max_steps` or reward reaches `0.99`. |
| - Errors subtract `0.05` from the step reward. |
| - Destructive or incorrect data modifications in advanced tasks result in score penalties. |
|
|
| ## Baseline Scores |
|
|
| Recent reference runs using robust capable LLMs (e.g., `llama-3.3-70b-versatile` via Groq) indicate the environment is reliably solvable but effectively differentiates between model reasoning capabilities on the later multi-step tasks. |
|
|
| | Model | Task 1 (Easy) | Task 2 (Medium) | Task 3 (Hard) | Task 4 (Advanced) | Task 5 (Advanced) | Task 6 (Expert) | |
| |---|---|---|---|---|---|---| |
| | Llama-3.3-70B | ~1.00 | ~1.00 | ~1.00 | ~0.99 | ~0.90 | ~0.99 | |
| | Llama-3.1-8B | ~0.99 | ~0.60 | ~0.40 | ~0.30 | ~0.10 | ~0.00 | |
|
|
| *Note: Scores represent typical final-step partial-progress rewards. Simpler models often struggle to complete Schema Migration (Task 3) or JSON extraction windowing (Task 6), while advanced models can typically achieve near-perfect rewards within 3 to 6 execution steps per task.* |
|
|
| ## Local Setup |
|
|
| ### Install Python dependencies |
|
|
| ```bash |
| cd MetaPytorch-Hackathon-3 |
| pip install -r server/requirements.txt |
| pip install -e . |
| ``` |
|
|
| ### Run the sandbox server locally |
|
|
| ```bash |
| python -m uvicorn server.app:app --host 0.0.0.0 --port 7860 |
| ``` |
|
|
| ## Run inference and evaluation |
|
|
| Ensure `API_BASE_URL`, `MODEL_NAME`, and `HF_TOKEN` (or `OPENAI_API_KEY`) are set. |
|
|
| ```bash |
| cd MetaPytorch-Hackathon-3 |
| python inference.py |
| ``` |
|
|
| ## Docker Setup |
|
|
| ```bash |
| cd MetaPytorch-Hackathon-3 |
| |
| docker build -t sql-sandbox . |
| docker run -p 7860:7860 sql-sandbox |
| ``` |
|
|
| ## Project structure |
|
|
| - `client.py` β OpenEnv client wrapper |
| - `models.py` β action and observation models |
| - `openenv.yaml` β environment manifest |
| - `inference.py` β OpenAI baseline runner |
| - `inference_groq.py` β Groq baseline runner |
| - `server/app.py` β FastAPI app entrypoint |
| - `server/environment.py` β task logic, grading, and reward mechanics |
|
|
| ## License |
|
|
| BSD-3-Clause |
|
|