--- title: SQL-Agent-RL emoji: 🐳 colorFrom: blue colorTo: indigo sdk: docker pinned: false tag: openenv --- # SQL / Data Cleaning Sandbox A FastAPI OpenEnv environment for evaluating AI agents on realistic SQLite data tasks. Agents interact using SQL and Python to triage, clean, and normalize messy datasets across 6 diverse tasks. ## Motivation This environment targets data engineering and debugging workflows where an agent must: - inspect database state, - correct broken or inconsistent data, - calculate complex financial or system metrics, - migrate flat schemas into normalized tables, - and do so using incremental feedback. It is designed for benchmarks with partial progress scoring and explicit penalties for destructive actions. ## Action Space Agents submit actions as JSON objects: - `tool`: `sql` or `python` - `command`: the SQL query or Python code to execute Example: ```json { "tool": "sql", "command": "SELECT COUNT(*) FROM users WHERE email IS NULL" } ``` ## Observation Space Each environment response includes: - `output`: command output text - `error`: raw execution error or `null` - `current_step`: current step index - `max_steps`: allowed step budget - `task_description`: active task prompt - `done`: whether the episode finished - `reward`: partial reward for the step (includes potential late-task penalties) ## Tasks The environment provides six progressively difficult tasks, indexed as `task1` through `task6`. ### task1 — Data Triage (Easy) - **Description**: Compute total January 2024 revenue from the `sales` table. - **Goal**: Run a SQL aggregation that returns the exact total value. - **Success Criteria**: Reward `1.0` if the result matches `1000.00`. ### task2 — Data Cleaning (Medium) - **Description**: Clean the `users` table: - Lowercase all emails. - Remove duplicate emails (retain lowest `id`). - Replace NULL ages with `0`. - **Reward Breakdown**: `0.3` for Lowercase, `0.4` for No Duplicates, `0.3` for No NULLs. ### task3 — Schema Migration (Hard) - **Description**: Normalize `flat_orders` into separate `customers` and `orders` tables. - **Reward Breakdown**: - `0.2` for correct `customers` schema. - `0.2` for correct `orders` schema. - `0.6` for accurate data migration and referential integrity. ### task4 — Incident Response (Advanced) - **Description**: Identify an IP address spamming 403 errors: - Create a `blocked_ips` table. - Move the offending IP into the blocklist. - Prune the offending records from the master `server_logs`. - **Reward Breakdown**: `0.2` for table creation, `0.3` for correct IP identification, `0.5` for successful log pruning. - **Penalty**: Deductions occur if legitimate traffic logs are accidentally deleted. ### task5 — Data Imputation & Revenue View (Advanced) - **Description**: Standardize corrupted date strings and calculate Life Time Value: - Find and replace "NULL", "N/A", or empty strings in `end_date_str` with "2024-12-31". - Create a view `user_ltv` calculating revenue using `julianday()` arithmetic. - **Reward Breakdown**: `0.3` for data cleaning, `0.3` for view creation, `0.4` for calculation accuracy. ### task6 — JSON Analysis & Ranking (Expert) - **Description**: Extract nested JSON data and rank performance: - Add a `total_comp` column to `employees`. - Extract `bonus_pct` from a nested JSON string to compute total compensation. - Create a view `department_all_stars` showing the top earner in each department with performance rating "A". - **Reward Breakdown**: `0.2` for schema mutation, `0.3` for JSON extraction accuracy, `0.5` for correct ranking logic. ## Reward Mechanism Each step is scored by the task-specific grader in `server/environment.py`. - The grader inspects the current database state and latest output. - Reward is clamped to the range `0.01` to `0.99`. - Episodes end when the step count reaches `max_steps` or reward reaches `0.99`. - Errors subtract `0.05` from the step reward. - Destructive or incorrect data modifications in advanced tasks result in score penalties. ## Baseline Scores Recent reference runs using robust capable LLMs (e.g., `llama-3.3-70b-versatile` via Groq) indicate the environment is reliably solvable but effectively differentiates between model reasoning capabilities on the later multi-step tasks. | Model | Task 1 (Easy) | Task 2 (Medium) | Task 3 (Hard) | Task 4 (Advanced) | Task 5 (Advanced) | Task 6 (Expert) | |---|---|---|---|---|---|---| | Llama-3.3-70B | ~1.00 | ~1.00 | ~1.00 | ~0.99 | ~0.90 | ~0.99 | | Llama-3.1-8B | ~0.99 | ~0.60 | ~0.40 | ~0.30 | ~0.10 | ~0.00 | *Note: Scores represent typical final-step partial-progress rewards. Simpler models often struggle to complete Schema Migration (Task 3) or JSON extraction windowing (Task 6), while advanced models can typically achieve near-perfect rewards within 3 to 6 execution steps per task.* ## Local Setup ### Install Python dependencies ```bash cd MetaPytorch-Hackathon-3 pip install -r server/requirements.txt pip install -e . ``` ### Run the sandbox server locally ```bash python -m uvicorn server.app:app --host 0.0.0.0 --port 7860 ``` ## Run inference and evaluation Ensure `API_BASE_URL`, `MODEL_NAME`, and `HF_TOKEN` (or `OPENAI_API_KEY`) are set. ```bash cd MetaPytorch-Hackathon-3 python inference.py ``` ## Docker Setup ```bash cd MetaPytorch-Hackathon-3 docker build -t sql-sandbox . docker run -p 7860:7860 sql-sandbox ``` ## Project structure - `client.py` — OpenEnv client wrapper - `models.py` — action and observation models - `openenv.yaml` — environment manifest - `inference.py` — OpenAI baseline runner - `inference_groq.py` — Groq baseline runner - `server/app.py` — FastAPI app entrypoint - `server/environment.py` — task logic, grading, and reward mechanics ## License BSD-3-Clause