File size: 6,266 Bytes
f10b032
41cae03
f10b032
 
41cae03
f10b032
41cae03
f10b032
 
1b42f19
41cae03
 
1b42f19
41cae03
 
 
1b42f19
5f32203
1b42f19
41cae03
1b42f19
41cae03
 
5f32203
41cae03
5f32203
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
41cae03
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
6a32325
1b42f19
41cae03
82a6e6c
41cae03
6a32325
5f32203
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
6a32325
41cae03
1b42f19
5f32203
41cae03
5f32203
 
 
 
 
 
 
 
 
41cae03
 
 
5f32203
1b42f19
5f32203
 
 
 
 
 
 
 
 
 
 
 
41cae03
 
 
5f32203
41cae03
5f32203
6a32325
5f32203
 
6a32325
1b42f19
5f32203
6a32325
5f32203
 
 
1b42f19
 
5f32203
 
 
 
 
1b42f19
41cae03
 
5f32203
 
1b42f19
41cae03
5f32203
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
---
title: SQL Migration Agent Benchmark
emoji: πŸ—„οΈ
colorFrom: blue
colorTo: purple
sdk: docker
app_file: server/app.py
pinned: false
---

# SQL Migration Agent Benchmark (OpenEnv)
> **A Production-Grade Evaluation Suite for Database Engineering Agents.**

[![OpenEnv Compliant](https://img.shields.io/badge/OpenEnv-Compliant-success)](https://github.com/openenv/core)
[![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](https://opensource.org/licenses/MIT)
[![Hugging Face Space](https://img.shields.io/badge/HF%20Space-Deployed-orange)](https://huggingface.co/spaces/Eishaan/sql-migration-env)

An OpenEnv-compatible environment for evaluating AI agents on autonomous SQLite database migration tasks. The agent receives a broken/drifted schema and must write SQL to transform it to a target state without losing data.

---

## πŸ—οΈ Architecture Overview

The suite combines formal sequence modeling with a modular local engine.

### System Mapping
```
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  inference.py (Baseline Agent)  β”‚
β”‚  - LLM API calls (OpenAI fmt)  β”‚
β”‚  - JSON mode + fallback parser β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
          β”‚ MigrationAction
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  environment.py (OpenEnv Env)   β”‚
β”‚  - SQLite execution engine      β”‚
β”‚  - ERD & Schema Diff generator  β”‚
β”‚  - SQL timeout & Blacklist      β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
          β”‚ score()
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  grader.py (Golden DB Engine)   β”‚
β”‚  - Dynamic golden reference DB  β”‚
β”‚  - Schema + data + FK scoring   β”‚
β”‚  - Anti-exploit checks          β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
```

### Protocol Flow
```mermaid
sequenceDiagram
    participant Agent
    participant Env as MigrationEnv Server
    participant DB as SQLite (:memory:)
    participant Grader as Dynamic Golden Grader

    Agent->>Env: POST /reset (task_name)
    Env->>DB: Seed Source Data
    Env->>Grader: Build Golden Reference
    Grader-->>Env: Initial Score
    Env-->>Agent: Observation (DDL, Schema Diff, ERD)

    loop Migration Steps
        Agent->>Env: POST /step (SQL, Reasoning)
        Env->>DB: Execute SQL (w/ Timeout & Blacklist)
        Env->>Grader: Compute Delta Reward
        Grader-->>Env: current_score, reward
        Env-->>Agent: New Observation + ERD (Visualization)
    end

    Agent->>Env: submit_final = True
    Env->>Grader: Final Integrity & FK Check
    Env-->>Agent: Final Episode Summary (Trajectory)
```

---

## 🎯 Benchmark Tasks

| # | Task | Difficulty | Challenge |
|---|------|-----------|-----------|
| 1 | `column-restructure` | 🟒 Easy | Merge first_name + last_name β†’ full_name (with apostrophes) |
| 2 | `soft-delete-restoration` | 🟒 Easy | Restore deleted products from `deletion_log` |
| 3 | `table-normalization` | 🟑 Medium | Normalize `purchases` β†’ `customers` + `orders` + FK |
| 4 | `schema-version-merge` | 🟑 Medium | Merge v1/v2 product tables with price coercion |
| 5 | `multi-entity-extraction` | 🟑 Medium | 3NF decomposition with invalid data routing |
| 6 | `cascade-migration` | πŸ”΄ Hard | 4-table FK cascade, type coercion, orphan audit |
| 7 | `dual-source-consolidation`| πŸ”΄ Hard | 6β†’4 table merge, cross-system email dedup |

### πŸ› οΈ Adversarial Edge Cases (The "Stress Tests")
- **O'Brien**: Apostrophe in data β€” tests SQL escaping and string literal handling.
- **$90,000 salary**: TEXT→INTEGER coercion — tests complex string parsing and casting.
- **Empty string emails**: NOT NULL vs Empty β€” tests data quality validation logic.
- **Leading whitespace**: ` alice@company.com` β€” tests TRIM awareness.
- **ID conflicts**: Overlapping IDs in dual sources β€” tests intelligent merge logic.
- **Orphaned FKs**: References to deleted entities β€” tests environment's audit logging.
- **NULL currency**: Must default to 'USD' β€” tests COALESCE usage.

---

## βš–οΈ Evaluation Baselines

| Task | Qwen 32B Score | GPT-OSS 120B |
|------|--------------|--------------|
| `column-restructure` | 0.99 | 0.99 |
| `soft-delete-restoration` | 0.99 | 0.99 |
| `table-normalization` | 0.94 | 0.99 |
| `schema-version-merge` | 0.93 | 0.98 |
| `multi-entity-extraction` | 0.35 | 0.65 |
| `cascade-migration` | 0.61 | 0.83 |
| `dual-source-consolidation`| 0.28 | 0.38 |

---

## πŸ›‘οΈ Security & Reward Function

### The Reward Formula
Rewards are calculated as progress deltas: $R_t = P_t - P_{t-1}$.
Progress $P_t$ is a weighted sum (0.01 to 0.99):
- **Schema Match (30%)**: Tables exist with correct `(name, type)` signatures.
- **Data Match (40%)**: Row content matches golden DB (order-independent).
- **FK & Integrity (20%)**: Foreign keys enforced, `integrity_check` passes.
- **Anti-Exploit (10%)**: Penalty for empty tables or schema pollution.

### Security Guardrails
- **PRAGMA Blacklist**: `foreign_keys = OFF` and `writable_schema = ON` are blocked.
- **Query Timeout**: SQLite progress handler terminates queries exceeding 500k ops.
- **Dangerous SQL**: `ATTACH`, `DETACH`, and `LOAD_EXTENSION` are filtered.

---

## πŸš€ Setup & Usage

### Local Deployment
```bash
pip install -r requirements.txt
python -m server.app  # Starts OpenEnv server on port 7860
```

### Environment Variables
```bash
export HF_TOKEN=your_token
export API_BASE_URL=https://router.huggingface.co/v1
export MODEL_NAME=Qwen/Qwen2.5-72B-Instruct
```

### API Endpoints
- `POST /reset`: Initialize migration episode.
- `POST /step`: Execute SQL and reasoning.
- `GET /tasks`: List all available scenarios.
- `POST /grader`: Run deep comparison against Golden DB.

---

## πŸ–ΌοΈ Observations
Each observation includes `erd_visualization` (Mermaid.js) and `schema_diff` to assist agents in understanding the current drift.

## πŸ“„ License
MIT. Built for the **OpenEnv Hackathon 2026**.