File size: 5,844 Bytes
569613d
 
 
 
 
 
 
 
 
 
 
 
eed7fd0
569613d
 
 
 
 
 
eed7fd0
569613d
 
 
eed7fd0
569613d
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
eed7fd0
569613d
 
 
eed7fd0
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
569613d
 
 
 
 
 
 
eed7fd0
 
569613d
eed7fd0
 
 
 
 
 
 
 
 
 
569613d
 
 
 
 
 
eed7fd0
569613d
 
 
 
 
 
 
 
 
 
 
 
eed7fd0
569613d
 
eed7fd0
569613d
 
 
 
 
 
eed7fd0
569613d
eed7fd0
569613d
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
---
title: SQL-Agent-RL
emoji: 🐳
colorFrom: blue
colorTo: indigo
sdk: docker
pinned: false
tag: openenv
---
# SQL / Data Cleaning Sandbox

A FastAPI OpenEnv environment for evaluating AI agents on realistic SQLite data tasks.
Agents interact using SQL and Python to triage, clean, and normalize messy datasets across 6 diverse tasks.

## Motivation

This environment targets data engineering and debugging workflows where an agent must:
- inspect database state,
- correct broken or inconsistent data,
- calculate complex financial or system metrics,
- migrate flat schemas into normalized tables,
- and do so using incremental feedback.

It is designed for benchmarks with partial progress scoring and explicit penalties for destructive actions.

## Action Space

Agents submit actions as JSON objects:
- `tool`: `sql` or `python`
- `command`: the SQL query or Python code to execute

Example:

```json
{
  "tool": "sql",
  "command": "SELECT COUNT(*) FROM users WHERE email IS NULL"
}
```

## Observation Space

Each environment response includes:
- `output`: command output text
- `error`: raw execution error or `null`
- `current_step`: current step index
- `max_steps`: allowed step budget
- `task_description`: active task prompt
- `done`: whether the episode finished
- `reward`: partial reward for the step (includes potential late-task penalties)

## Tasks

The environment provides six progressively difficult tasks, indexed as `task1` through `task6`.

### task1 β€” Data Triage (Easy)
- **Description**: Compute total January 2024 revenue from the `sales` table.
- **Goal**: Run a SQL aggregation that returns the exact total value.
- **Success Criteria**: Reward `1.0` if the result matches `1000.00`.

### task2 β€” Data Cleaning (Medium)
- **Description**: Clean the `users` table:
  - Lowercase all emails.
  - Remove duplicate emails (retain lowest `id`).
  - Replace NULL ages with `0`.
- **Reward Breakdown**: `0.3` for Lowercase, `0.4` for No Duplicates, `0.3` for No NULLs.

### task3 β€” Schema Migration (Hard)
- **Description**: Normalize `flat_orders` into separate `customers` and `orders` tables.
- **Reward Breakdown**:
  - `0.2` for correct `customers` schema.
  - `0.2` for correct `orders` schema.
  - `0.6` for accurate data migration and referential integrity.

### task4 β€” Incident Response (Advanced)
- **Description**: Identify an IP address spamming 403 errors:
  - Create a `blocked_ips` table.
  - Move the offending IP into the blocklist.
  - Prune the offending records from the master `server_logs`.
- **Reward Breakdown**: `0.2` for table creation, `0.3` for correct IP identification, `0.5` for successful log pruning.
- **Penalty**: Deductions occur if legitimate traffic logs are accidentally deleted.

### task5 β€” Data Imputation & Revenue View (Advanced)
- **Description**: Standardize corrupted date strings and calculate Life Time Value:
  - Find and replace "NULL", "N/A", or empty strings in `end_date_str` with "2024-12-31".
  - Create a view `user_ltv` calculating revenue using `julianday()` arithmetic.
- **Reward Breakdown**: `0.3` for data cleaning, `0.3` for view creation, `0.4` for calculation accuracy.

### task6 β€” JSON Analysis & Ranking (Expert)
- **Description**: Extract nested JSON data and rank performance:
  - Add a `total_comp` column to `employees`.
  - Extract `bonus_pct` from a nested JSON string to compute total compensation.
  - Create a view `department_all_stars` showing the top earner in each department with performance rating "A".
- **Reward Breakdown**: `0.2` for schema mutation, `0.3` for JSON extraction accuracy, `0.5` for correct ranking logic.

## Reward Mechanism

Each step is scored by the task-specific grader in `server/environment.py`.
- The grader inspects the current database state and latest output.
- Reward is clamped to the range `0.01` to `0.99`.
- Episodes end when the step count reaches `max_steps` or reward reaches `0.99`.
- Errors subtract `0.05` from the step reward.
- Destructive or incorrect data modifications in advanced tasks result in score penalties.

## Baseline Scores

Recent reference runs using robust capable LLMs (e.g., `llama-3.3-70b-versatile` via Groq) indicate the environment is reliably solvable but effectively differentiates between model reasoning capabilities on the later multi-step tasks.

| Model | Task 1 (Easy) | Task 2 (Medium) | Task 3 (Hard) | Task 4 (Advanced) | Task 5 (Advanced) | Task 6 (Expert) |
|---|---|---|---|---|---|---|
| Llama-3.3-70B | ~1.00 | ~1.00 | ~1.00 | ~0.99 | ~0.90 | ~0.99 |
| Llama-3.1-8B | ~0.99 | ~0.60 | ~0.40 | ~0.30 | ~0.10 | ~0.00 |

*Note: Scores represent typical final-step partial-progress rewards. Simpler models often struggle to complete Schema Migration (Task 3) or JSON extraction windowing (Task 6), while advanced models can typically achieve near-perfect rewards within 3 to 6 execution steps per task.*

## Local Setup

### Install Python dependencies

```bash
cd MetaPytorch-Hackathon-3
pip install -r server/requirements.txt
pip install -e .
```

### Run the sandbox server locally

```bash
python -m uvicorn server.app:app --host 0.0.0.0 --port 7860
```

## Run inference and evaluation

Ensure `API_BASE_URL`, `MODEL_NAME`, and `HF_TOKEN` (or `OPENAI_API_KEY`) are set.

```bash
cd MetaPytorch-Hackathon-3
python inference.py
```

## Docker Setup

```bash
cd MetaPytorch-Hackathon-3

docker build -t sql-sandbox .
docker run -p 7860:7860 sql-sandbox
```

## Project structure

- `client.py` β€” OpenEnv client wrapper
- `models.py` β€” action and observation models
- `openenv.yaml` β€” environment manifest
- `inference.py` β€” OpenAI baseline runner
- `inference_groq.py` β€” Groq baseline runner
- `server/app.py` β€” FastAPI app entrypoint
- `server/environment.py` β€” task logic, grading, and reward mechanics

## License

BSD-3-Clause