Spaces:
Sleeping
Sleeping
File size: 6,266 Bytes
f10b032 41cae03 f10b032 41cae03 f10b032 41cae03 f10b032 1b42f19 41cae03 1b42f19 41cae03 1b42f19 5f32203 1b42f19 41cae03 1b42f19 41cae03 5f32203 41cae03 5f32203 41cae03 6a32325 1b42f19 41cae03 82a6e6c 41cae03 6a32325 5f32203 6a32325 41cae03 1b42f19 5f32203 41cae03 5f32203 41cae03 5f32203 1b42f19 5f32203 41cae03 5f32203 41cae03 5f32203 6a32325 5f32203 6a32325 1b42f19 5f32203 6a32325 5f32203 1b42f19 5f32203 1b42f19 41cae03 5f32203 1b42f19 41cae03 5f32203 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 | ---
title: SQL Migration Agent Benchmark
emoji: ποΈ
colorFrom: blue
colorTo: purple
sdk: docker
app_file: server/app.py
pinned: false
---
# SQL Migration Agent Benchmark (OpenEnv)
> **A Production-Grade Evaluation Suite for Database Engineering Agents.**
[](https://github.com/openenv/core)
[](https://opensource.org/licenses/MIT)
[](https://huggingface.co/spaces/Eishaan/sql-migration-env)
An OpenEnv-compatible environment for evaluating AI agents on autonomous SQLite database migration tasks. The agent receives a broken/drifted schema and must write SQL to transform it to a target state without losing data.
---
## ποΈ Architecture Overview
The suite combines formal sequence modeling with a modular local engine.
### System Mapping
```
βββββββββββββββββββββββββββββββββββ
β inference.py (Baseline Agent) β
β - LLM API calls (OpenAI fmt) β
β - JSON mode + fallback parser β
βββββββββββ¬ββββββββββββββββββββββββ
β MigrationAction
βββββββββββΌββββββββββββββββββββββββ
β environment.py (OpenEnv Env) β
β - SQLite execution engine β
β - ERD & Schema Diff generator β
β - SQL timeout & Blacklist β
βββββββββββ¬ββββββββββββββββββββββββ
β score()
βββββββββββΌββββββββββββββββββββββββ
β grader.py (Golden DB Engine) β
β - Dynamic golden reference DB β
β - Schema + data + FK scoring β
β - Anti-exploit checks β
βββββββββββββββββββββββββββββββββββ
```
### Protocol Flow
```mermaid
sequenceDiagram
participant Agent
participant Env as MigrationEnv Server
participant DB as SQLite (:memory:)
participant Grader as Dynamic Golden Grader
Agent->>Env: POST /reset (task_name)
Env->>DB: Seed Source Data
Env->>Grader: Build Golden Reference
Grader-->>Env: Initial Score
Env-->>Agent: Observation (DDL, Schema Diff, ERD)
loop Migration Steps
Agent->>Env: POST /step (SQL, Reasoning)
Env->>DB: Execute SQL (w/ Timeout & Blacklist)
Env->>Grader: Compute Delta Reward
Grader-->>Env: current_score, reward
Env-->>Agent: New Observation + ERD (Visualization)
end
Agent->>Env: submit_final = True
Env->>Grader: Final Integrity & FK Check
Env-->>Agent: Final Episode Summary (Trajectory)
```
---
## π― Benchmark Tasks
| # | Task | Difficulty | Challenge |
|---|------|-----------|-----------|
| 1 | `column-restructure` | π’ Easy | Merge first_name + last_name β full_name (with apostrophes) |
| 2 | `soft-delete-restoration` | π’ Easy | Restore deleted products from `deletion_log` |
| 3 | `table-normalization` | π‘ Medium | Normalize `purchases` β `customers` + `orders` + FK |
| 4 | `schema-version-merge` | π‘ Medium | Merge v1/v2 product tables with price coercion |
| 5 | `multi-entity-extraction` | π‘ Medium | 3NF decomposition with invalid data routing |
| 6 | `cascade-migration` | π΄ Hard | 4-table FK cascade, type coercion, orphan audit |
| 7 | `dual-source-consolidation`| π΄ Hard | 6β4 table merge, cross-system email dedup |
### π οΈ Adversarial Edge Cases (The "Stress Tests")
- **O'Brien**: Apostrophe in data β tests SQL escaping and string literal handling.
- **$90,000 salary**: TEXTβINTEGER coercion β tests complex string parsing and casting.
- **Empty string emails**: NOT NULL vs Empty β tests data quality validation logic.
- **Leading whitespace**: ` alice@company.com` β tests TRIM awareness.
- **ID conflicts**: Overlapping IDs in dual sources β tests intelligent merge logic.
- **Orphaned FKs**: References to deleted entities β tests environment's audit logging.
- **NULL currency**: Must default to 'USD' β tests COALESCE usage.
---
## βοΈ Evaluation Baselines
| Task | Qwen 32B Score | GPT-OSS 120B |
|------|--------------|--------------|
| `column-restructure` | 0.99 | 0.99 |
| `soft-delete-restoration` | 0.99 | 0.99 |
| `table-normalization` | 0.94 | 0.99 |
| `schema-version-merge` | 0.93 | 0.98 |
| `multi-entity-extraction` | 0.35 | 0.65 |
| `cascade-migration` | 0.61 | 0.83 |
| `dual-source-consolidation`| 0.28 | 0.38 |
---
## π‘οΈ Security & Reward Function
### The Reward Formula
Rewards are calculated as progress deltas: $R_t = P_t - P_{t-1}$.
Progress $P_t$ is a weighted sum (0.01 to 0.99):
- **Schema Match (30%)**: Tables exist with correct `(name, type)` signatures.
- **Data Match (40%)**: Row content matches golden DB (order-independent).
- **FK & Integrity (20%)**: Foreign keys enforced, `integrity_check` passes.
- **Anti-Exploit (10%)**: Penalty for empty tables or schema pollution.
### Security Guardrails
- **PRAGMA Blacklist**: `foreign_keys = OFF` and `writable_schema = ON` are blocked.
- **Query Timeout**: SQLite progress handler terminates queries exceeding 500k ops.
- **Dangerous SQL**: `ATTACH`, `DETACH`, and `LOAD_EXTENSION` are filtered.
---
## π Setup & Usage
### Local Deployment
```bash
pip install -r requirements.txt
python -m server.app # Starts OpenEnv server on port 7860
```
### Environment Variables
```bash
export HF_TOKEN=your_token
export API_BASE_URL=https://router.huggingface.co/v1
export MODEL_NAME=Qwen/Qwen2.5-72B-Instruct
```
### API Endpoints
- `POST /reset`: Initialize migration episode.
- `POST /step`: Execute SQL and reasoning.
- `GET /tasks`: List all available scenarios.
- `POST /grader`: Run deep comparison against Golden DB.
---
## πΌοΈ Observations
Each observation includes `erd_visualization` (Mermaid.js) and `schema_diff` to assist agents in understanding the current drift.
## π License
MIT. Built for the **OpenEnv Hackathon 2026**.
|