--- title: SQL Migration Agent Benchmark emoji: πŸ—„οΈ colorFrom: blue colorTo: purple sdk: docker app_file: server/app.py pinned: false --- # SQL Migration Agent Benchmark (OpenEnv) > **A Production-Grade Evaluation Suite for Database Engineering Agents.** [![OpenEnv Compliant](https://img.shields.io/badge/OpenEnv-Compliant-success)](https://github.com/openenv/core) [![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](https://opensource.org/licenses/MIT) [![Hugging Face Space](https://img.shields.io/badge/HF%20Space-Deployed-orange)](https://huggingface.co/spaces/Eishaan/sql-migration-env) An OpenEnv-compatible environment for evaluating AI agents on autonomous SQLite database migration tasks. The agent receives a broken/drifted schema and must write SQL to transform it to a target state without losing data. --- ## πŸ—οΈ Architecture Overview The suite combines formal sequence modeling with a modular local engine. ### System Mapping ``` β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ inference.py (Baseline Agent) β”‚ β”‚ - LLM API calls (OpenAI fmt) β”‚ β”‚ - JSON mode + fallback parser β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ MigrationAction β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ environment.py (OpenEnv Env) β”‚ β”‚ - SQLite execution engine β”‚ β”‚ - ERD & Schema Diff generator β”‚ β”‚ - SQL timeout & Blacklist β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ score() β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ grader.py (Golden DB Engine) β”‚ β”‚ - Dynamic golden reference DB β”‚ β”‚ - Schema + data + FK scoring β”‚ β”‚ - Anti-exploit checks β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ ``` ### Protocol Flow ```mermaid sequenceDiagram participant Agent participant Env as MigrationEnv Server participant DB as SQLite (:memory:) participant Grader as Dynamic Golden Grader Agent->>Env: POST /reset (task_name) Env->>DB: Seed Source Data Env->>Grader: Build Golden Reference Grader-->>Env: Initial Score Env-->>Agent: Observation (DDL, Schema Diff, ERD) loop Migration Steps Agent->>Env: POST /step (SQL, Reasoning) Env->>DB: Execute SQL (w/ Timeout & Blacklist) Env->>Grader: Compute Delta Reward Grader-->>Env: current_score, reward Env-->>Agent: New Observation + ERD (Visualization) end Agent->>Env: submit_final = True Env->>Grader: Final Integrity & FK Check Env-->>Agent: Final Episode Summary (Trajectory) ``` --- ## 🎯 Benchmark Tasks | # | Task | Difficulty | Challenge | |---|------|-----------|-----------| | 1 | `column-restructure` | 🟒 Easy | Merge first_name + last_name β†’ full_name (with apostrophes) | | 2 | `soft-delete-restoration` | 🟒 Easy | Restore deleted products from `deletion_log` | | 3 | `table-normalization` | 🟑 Medium | Normalize `purchases` β†’ `customers` + `orders` + FK | | 4 | `schema-version-merge` | 🟑 Medium | Merge v1/v2 product tables with price coercion | | 5 | `multi-entity-extraction` | 🟑 Medium | 3NF decomposition with invalid data routing | | 6 | `cascade-migration` | πŸ”΄ Hard | 4-table FK cascade, type coercion, orphan audit | | 7 | `dual-source-consolidation`| πŸ”΄ Hard | 6β†’4 table merge, cross-system email dedup | ### πŸ› οΈ Adversarial Edge Cases (The "Stress Tests") - **O'Brien**: Apostrophe in data β€” tests SQL escaping and string literal handling. - **$90,000 salary**: TEXTβ†’INTEGER coercion β€” tests complex string parsing and casting. - **Empty string emails**: NOT NULL vs Empty β€” tests data quality validation logic. - **Leading whitespace**: ` alice@company.com` β€” tests TRIM awareness. - **ID conflicts**: Overlapping IDs in dual sources β€” tests intelligent merge logic. - **Orphaned FKs**: References to deleted entities β€” tests environment's audit logging. - **NULL currency**: Must default to 'USD' β€” tests COALESCE usage. --- ## βš–οΈ Evaluation Baselines | Task | Qwen 32B Score | GPT-OSS 120B | |------|--------------|--------------| | `column-restructure` | 0.99 | 0.99 | | `soft-delete-restoration` | 0.99 | 0.99 | | `table-normalization` | 0.94 | 0.99 | | `schema-version-merge` | 0.93 | 0.98 | | `multi-entity-extraction` | 0.35 | 0.65 | | `cascade-migration` | 0.61 | 0.83 | | `dual-source-consolidation`| 0.28 | 0.38 | --- ## πŸ›‘οΈ Security & Reward Function ### The Reward Formula Rewards are calculated as progress deltas: $R_t = P_t - P_{t-1}$. Progress $P_t$ is a weighted sum (0.01 to 0.99): - **Schema Match (30%)**: Tables exist with correct `(name, type)` signatures. - **Data Match (40%)**: Row content matches golden DB (order-independent). - **FK & Integrity (20%)**: Foreign keys enforced, `integrity_check` passes. - **Anti-Exploit (10%)**: Penalty for empty tables or schema pollution. ### Security Guardrails - **PRAGMA Blacklist**: `foreign_keys = OFF` and `writable_schema = ON` are blocked. - **Query Timeout**: SQLite progress handler terminates queries exceeding 500k ops. - **Dangerous SQL**: `ATTACH`, `DETACH`, and `LOAD_EXTENSION` are filtered. --- ## πŸš€ Setup & Usage ### Local Deployment ```bash pip install -r requirements.txt python -m server.app # Starts OpenEnv server on port 7860 ``` ### Environment Variables ```bash export HF_TOKEN=your_token export API_BASE_URL=https://router.huggingface.co/v1 export MODEL_NAME=Qwen/Qwen2.5-72B-Instruct ``` ### API Endpoints - `POST /reset`: Initialize migration episode. - `POST /step`: Execute SQL and reasoning. - `GET /tasks`: List all available scenarios. - `POST /grader`: Run deep comparison against Golden DB. --- ## πŸ–ΌοΈ Observations Each observation includes `erd_visualization` (Mermaid.js) and `schema_diff` to assist agents in understanding the current drift. ## πŸ“„ License MIT. Built for the **OpenEnv Hackathon 2026**.