Spaces:
Sleeping
Sleeping
| title: SQL Migration Agent Benchmark | |
| emoji: ποΈ | |
| colorFrom: blue | |
| colorTo: purple | |
| sdk: docker | |
| app_file: server/app.py | |
| pinned: false | |
| # SQL Migration Agent Benchmark (OpenEnv) | |
| > **A Production-Grade Evaluation Suite for Database Engineering Agents.** | |
| [](https://github.com/openenv/core) | |
| [](https://opensource.org/licenses/MIT) | |
| [](https://huggingface.co/spaces/Eishaan/sql-migration-env) | |
| An OpenEnv-compatible environment for evaluating AI agents on autonomous SQLite database migration tasks. The agent receives a broken/drifted schema and must write SQL to transform it to a target state without losing data. | |
| --- | |
| ## ποΈ Architecture Overview | |
| The suite combines formal sequence modeling with a modular local engine. | |
| ### System Mapping | |
| ``` | |
| βββββββββββββββββββββββββββββββββββ | |
| β inference.py (Baseline Agent) β | |
| β - LLM API calls (OpenAI fmt) β | |
| β - JSON mode + fallback parser β | |
| βββββββββββ¬ββββββββββββββββββββββββ | |
| β MigrationAction | |
| βββββββββββΌββββββββββββββββββββββββ | |
| β environment.py (OpenEnv Env) β | |
| β - SQLite execution engine β | |
| β - ERD & Schema Diff generator β | |
| β - SQL timeout & Blacklist β | |
| βββββββββββ¬ββββββββββββββββββββββββ | |
| β score() | |
| βββββββββββΌββββββββββββββββββββββββ | |
| β grader.py (Golden DB Engine) β | |
| β - Dynamic golden reference DB β | |
| β - Schema + data + FK scoring β | |
| β - Anti-exploit checks β | |
| βββββββββββββββββββββββββββββββββββ | |
| ``` | |
| ### Protocol Flow | |
| ```mermaid | |
| sequenceDiagram | |
| participant Agent | |
| participant Env as MigrationEnv Server | |
| participant DB as SQLite (:memory:) | |
| participant Grader as Dynamic Golden Grader | |
| Agent->>Env: POST /reset (task_name) | |
| Env->>DB: Seed Source Data | |
| Env->>Grader: Build Golden Reference | |
| Grader-->>Env: Initial Score | |
| Env-->>Agent: Observation (DDL, Schema Diff, ERD) | |
| loop Migration Steps | |
| Agent->>Env: POST /step (SQL, Reasoning) | |
| Env->>DB: Execute SQL (w/ Timeout & Blacklist) | |
| Env->>Grader: Compute Delta Reward | |
| Grader-->>Env: current_score, reward | |
| Env-->>Agent: New Observation + ERD (Visualization) | |
| end | |
| Agent->>Env: submit_final = True | |
| Env->>Grader: Final Integrity & FK Check | |
| Env-->>Agent: Final Episode Summary (Trajectory) | |
| ``` | |
| --- | |
| ## π― Benchmark Tasks | |
| | # | Task | Difficulty | Challenge | | |
| |---|------|-----------|-----------| | |
| | 1 | `column-restructure` | π’ Easy | Merge first_name + last_name β full_name (with apostrophes) | | |
| | 2 | `soft-delete-restoration` | π’ Easy | Restore deleted products from `deletion_log` | | |
| | 3 | `table-normalization` | π‘ Medium | Normalize `purchases` β `customers` + `orders` + FK | | |
| | 4 | `schema-version-merge` | π‘ Medium | Merge v1/v2 product tables with price coercion | | |
| | 5 | `multi-entity-extraction` | π‘ Medium | 3NF decomposition with invalid data routing | | |
| | 6 | `cascade-migration` | π΄ Hard | 4-table FK cascade, type coercion, orphan audit | | |
| | 7 | `dual-source-consolidation`| π΄ Hard | 6β4 table merge, cross-system email dedup | | |
| ### π οΈ Adversarial Edge Cases (The "Stress Tests") | |
| - **O'Brien**: Apostrophe in data β tests SQL escaping and string literal handling. | |
| - **$90,000 salary**: TEXTβINTEGER coercion β tests complex string parsing and casting. | |
| - **Empty string emails**: NOT NULL vs Empty β tests data quality validation logic. | |
| - **Leading whitespace**: ` alice@company.com` β tests TRIM awareness. | |
| - **ID conflicts**: Overlapping IDs in dual sources β tests intelligent merge logic. | |
| - **Orphaned FKs**: References to deleted entities β tests environment's audit logging. | |
| - **NULL currency**: Must default to 'USD' β tests COALESCE usage. | |
| --- | |
| ## βοΈ Evaluation Baselines | |
| | Task | Qwen 32B Score | GPT-OSS 120B | | |
| |------|--------------|--------------| | |
| | `column-restructure` | 0.99 | 0.99 | | |
| | `soft-delete-restoration` | 0.99 | 0.99 | | |
| | `table-normalization` | 0.94 | 0.99 | | |
| | `schema-version-merge` | 0.93 | 0.98 | | |
| | `multi-entity-extraction` | 0.35 | 0.65 | | |
| | `cascade-migration` | 0.61 | 0.83 | | |
| | `dual-source-consolidation`| 0.28 | 0.38 | | |
| --- | |
| ## π‘οΈ Security & Reward Function | |
| ### The Reward Formula | |
| Rewards are calculated as progress deltas: $R_t = P_t - P_{t-1}$. | |
| Progress $P_t$ is a weighted sum (0.01 to 0.99): | |
| - **Schema Match (30%)**: Tables exist with correct `(name, type)` signatures. | |
| - **Data Match (40%)**: Row content matches golden DB (order-independent). | |
| - **FK & Integrity (20%)**: Foreign keys enforced, `integrity_check` passes. | |
| - **Anti-Exploit (10%)**: Penalty for empty tables or schema pollution. | |
| ### Security Guardrails | |
| - **PRAGMA Blacklist**: `foreign_keys = OFF` and `writable_schema = ON` are blocked. | |
| - **Query Timeout**: SQLite progress handler terminates queries exceeding 500k ops. | |
| - **Dangerous SQL**: `ATTACH`, `DETACH`, and `LOAD_EXTENSION` are filtered. | |
| --- | |
| ## π Setup & Usage | |
| ### Local Deployment | |
| ```bash | |
| pip install -r requirements.txt | |
| python -m server.app # Starts OpenEnv server on port 7860 | |
| ``` | |
| ### Environment Variables | |
| ```bash | |
| export HF_TOKEN=your_token | |
| export API_BASE_URL=https://router.huggingface.co/v1 | |
| export MODEL_NAME=Qwen/Qwen2.5-72B-Instruct | |
| ``` | |
| ### API Endpoints | |
| - `POST /reset`: Initialize migration episode. | |
| - `POST /step`: Execute SQL and reasoning. | |
| - `GET /tasks`: List all available scenarios. | |
| - `POST /grader`: Run deep comparison against Golden DB. | |
| --- | |
| ## πΌοΈ Observations | |
| Each observation includes `erd_visualization` (Mermaid.js) and `schema_diff` to assist agents in understanding the current drift. | |
| ## π License | |
| MIT. Built for the **OpenEnv Hackathon 2026**. | |