sql-migration-env / README.md
Eishaan's picture
docs: merge professional visuals with detailed technical specs for ultimate README
5f32203
---
title: SQL Migration Agent Benchmark
emoji: πŸ—„οΈ
colorFrom: blue
colorTo: purple
sdk: docker
app_file: server/app.py
pinned: false
---
# SQL Migration Agent Benchmark (OpenEnv)
> **A Production-Grade Evaluation Suite for Database Engineering Agents.**
[![OpenEnv Compliant](https://img.shields.io/badge/OpenEnv-Compliant-success)](https://github.com/openenv/core)
[![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](https://opensource.org/licenses/MIT)
[![Hugging Face Space](https://img.shields.io/badge/HF%20Space-Deployed-orange)](https://huggingface.co/spaces/Eishaan/sql-migration-env)
An OpenEnv-compatible environment for evaluating AI agents on autonomous SQLite database migration tasks. The agent receives a broken/drifted schema and must write SQL to transform it to a target state without losing data.
---
## πŸ—οΈ Architecture Overview
The suite combines formal sequence modeling with a modular local engine.
### System Mapping
```
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ inference.py (Baseline Agent) β”‚
β”‚ - LLM API calls (OpenAI fmt) β”‚
β”‚ - JSON mode + fallback parser β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
β”‚ MigrationAction
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ environment.py (OpenEnv Env) β”‚
β”‚ - SQLite execution engine β”‚
β”‚ - ERD & Schema Diff generator β”‚
β”‚ - SQL timeout & Blacklist β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
β”‚ score()
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ grader.py (Golden DB Engine) β”‚
β”‚ - Dynamic golden reference DB β”‚
β”‚ - Schema + data + FK scoring β”‚
β”‚ - Anti-exploit checks β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
```
### Protocol Flow
```mermaid
sequenceDiagram
participant Agent
participant Env as MigrationEnv Server
participant DB as SQLite (:memory:)
participant Grader as Dynamic Golden Grader
Agent->>Env: POST /reset (task_name)
Env->>DB: Seed Source Data
Env->>Grader: Build Golden Reference
Grader-->>Env: Initial Score
Env-->>Agent: Observation (DDL, Schema Diff, ERD)
loop Migration Steps
Agent->>Env: POST /step (SQL, Reasoning)
Env->>DB: Execute SQL (w/ Timeout & Blacklist)
Env->>Grader: Compute Delta Reward
Grader-->>Env: current_score, reward
Env-->>Agent: New Observation + ERD (Visualization)
end
Agent->>Env: submit_final = True
Env->>Grader: Final Integrity & FK Check
Env-->>Agent: Final Episode Summary (Trajectory)
```
---
## 🎯 Benchmark Tasks
| # | Task | Difficulty | Challenge |
|---|------|-----------|-----------|
| 1 | `column-restructure` | 🟒 Easy | Merge first_name + last_name β†’ full_name (with apostrophes) |
| 2 | `soft-delete-restoration` | 🟒 Easy | Restore deleted products from `deletion_log` |
| 3 | `table-normalization` | 🟑 Medium | Normalize `purchases` β†’ `customers` + `orders` + FK |
| 4 | `schema-version-merge` | 🟑 Medium | Merge v1/v2 product tables with price coercion |
| 5 | `multi-entity-extraction` | 🟑 Medium | 3NF decomposition with invalid data routing |
| 6 | `cascade-migration` | πŸ”΄ Hard | 4-table FK cascade, type coercion, orphan audit |
| 7 | `dual-source-consolidation`| πŸ”΄ Hard | 6β†’4 table merge, cross-system email dedup |
### πŸ› οΈ Adversarial Edge Cases (The "Stress Tests")
- **O'Brien**: Apostrophe in data β€” tests SQL escaping and string literal handling.
- **$90,000 salary**: TEXT→INTEGER coercion — tests complex string parsing and casting.
- **Empty string emails**: NOT NULL vs Empty β€” tests data quality validation logic.
- **Leading whitespace**: ` alice@company.com` β€” tests TRIM awareness.
- **ID conflicts**: Overlapping IDs in dual sources β€” tests intelligent merge logic.
- **Orphaned FKs**: References to deleted entities β€” tests environment's audit logging.
- **NULL currency**: Must default to 'USD' β€” tests COALESCE usage.
---
## βš–οΈ Evaluation Baselines
| Task | Qwen 32B Score | GPT-OSS 120B |
|------|--------------|--------------|
| `column-restructure` | 0.99 | 0.99 |
| `soft-delete-restoration` | 0.99 | 0.99 |
| `table-normalization` | 0.94 | 0.99 |
| `schema-version-merge` | 0.93 | 0.98 |
| `multi-entity-extraction` | 0.35 | 0.65 |
| `cascade-migration` | 0.61 | 0.83 |
| `dual-source-consolidation`| 0.28 | 0.38 |
---
## πŸ›‘οΈ Security & Reward Function
### The Reward Formula
Rewards are calculated as progress deltas: $R_t = P_t - P_{t-1}$.
Progress $P_t$ is a weighted sum (0.01 to 0.99):
- **Schema Match (30%)**: Tables exist with correct `(name, type)` signatures.
- **Data Match (40%)**: Row content matches golden DB (order-independent).
- **FK & Integrity (20%)**: Foreign keys enforced, `integrity_check` passes.
- **Anti-Exploit (10%)**: Penalty for empty tables or schema pollution.
### Security Guardrails
- **PRAGMA Blacklist**: `foreign_keys = OFF` and `writable_schema = ON` are blocked.
- **Query Timeout**: SQLite progress handler terminates queries exceeding 500k ops.
- **Dangerous SQL**: `ATTACH`, `DETACH`, and `LOAD_EXTENSION` are filtered.
---
## πŸš€ Setup & Usage
### Local Deployment
```bash
pip install -r requirements.txt
python -m server.app # Starts OpenEnv server on port 7860
```
### Environment Variables
```bash
export HF_TOKEN=your_token
export API_BASE_URL=https://router.huggingface.co/v1
export MODEL_NAME=Qwen/Qwen2.5-72B-Instruct
```
### API Endpoints
- `POST /reset`: Initialize migration episode.
- `POST /step`: Execute SQL and reasoning.
- `GET /tasks`: List all available scenarios.
- `POST /grader`: Run deep comparison against Golden DB.
---
## πŸ–ΌοΈ Observations
Each observation includes `erd_visualization` (Mermaid.js) and `schema_diff` to assist agents in understanding the current drift.
## πŸ“„ License
MIT. Built for the **OpenEnv Hackathon 2026**.