sql-migration-env / README.md
Eishaan's picture
docs: merge professional visuals with detailed technical specs for ultimate README
5f32203
metadata
title: SQL Migration Agent Benchmark
emoji: πŸ—„οΈ
colorFrom: blue
colorTo: purple
sdk: docker
app_file: server/app.py
pinned: false

SQL Migration Agent Benchmark (OpenEnv)

A Production-Grade Evaluation Suite for Database Engineering Agents.

OpenEnv Compliant License: MIT Hugging Face Space

An OpenEnv-compatible environment for evaluating AI agents on autonomous SQLite database migration tasks. The agent receives a broken/drifted schema and must write SQL to transform it to a target state without losing data.


πŸ—οΈ Architecture Overview

The suite combines formal sequence modeling with a modular local engine.

System Mapping

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  inference.py (Baseline Agent)  β”‚
β”‚  - LLM API calls (OpenAI fmt)  β”‚
β”‚  - JSON mode + fallback parser β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
          β”‚ MigrationAction
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  environment.py (OpenEnv Env)   β”‚
β”‚  - SQLite execution engine      β”‚
β”‚  - ERD & Schema Diff generator  β”‚
β”‚  - SQL timeout & Blacklist      β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
          β”‚ score()
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  grader.py (Golden DB Engine)   β”‚
β”‚  - Dynamic golden reference DB  β”‚
β”‚  - Schema + data + FK scoring   β”‚
β”‚  - Anti-exploit checks          β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Protocol Flow

sequenceDiagram
    participant Agent
    participant Env as MigrationEnv Server
    participant DB as SQLite (:memory:)
    participant Grader as Dynamic Golden Grader

    Agent->>Env: POST /reset (task_name)
    Env->>DB: Seed Source Data
    Env->>Grader: Build Golden Reference
    Grader-->>Env: Initial Score
    Env-->>Agent: Observation (DDL, Schema Diff, ERD)

    loop Migration Steps
        Agent->>Env: POST /step (SQL, Reasoning)
        Env->>DB: Execute SQL (w/ Timeout & Blacklist)
        Env->>Grader: Compute Delta Reward
        Grader-->>Env: current_score, reward
        Env-->>Agent: New Observation + ERD (Visualization)
    end

    Agent->>Env: submit_final = True
    Env->>Grader: Final Integrity & FK Check
    Env-->>Agent: Final Episode Summary (Trajectory)

🎯 Benchmark Tasks

# Task Difficulty Challenge
1 column-restructure 🟒 Easy Merge first_name + last_name β†’ full_name (with apostrophes)
2 soft-delete-restoration 🟒 Easy Restore deleted products from deletion_log
3 table-normalization 🟑 Medium Normalize purchases β†’ customers + orders + FK
4 schema-version-merge 🟑 Medium Merge v1/v2 product tables with price coercion
5 multi-entity-extraction 🟑 Medium 3NF decomposition with invalid data routing
6 cascade-migration πŸ”΄ Hard 4-table FK cascade, type coercion, orphan audit
7 dual-source-consolidation πŸ”΄ Hard 6β†’4 table merge, cross-system email dedup

πŸ› οΈ Adversarial Edge Cases (The "Stress Tests")

  • O'Brien: Apostrophe in data β€” tests SQL escaping and string literal handling.
  • $90,000 salary: TEXTβ†’INTEGER coercion β€” tests complex string parsing and casting.
  • Empty string emails: NOT NULL vs Empty β€” tests data quality validation logic.
  • Leading whitespace: alice@company.com β€” tests TRIM awareness.
  • ID conflicts: Overlapping IDs in dual sources β€” tests intelligent merge logic.
  • Orphaned FKs: References to deleted entities β€” tests environment's audit logging.
  • NULL currency: Must default to 'USD' β€” tests COALESCE usage.

βš–οΈ Evaluation Baselines

Task Qwen 32B Score GPT-OSS 120B
column-restructure 0.99 0.99
soft-delete-restoration 0.99 0.99
table-normalization 0.94 0.99
schema-version-merge 0.93 0.98
multi-entity-extraction 0.35 0.65
cascade-migration 0.61 0.83
dual-source-consolidation 0.28 0.38

πŸ›‘οΈ Security & Reward Function

The Reward Formula

Rewards are calculated as progress deltas: $R_t = P_t - P_{t-1}$. Progress $P_t$ is a weighted sum (0.01 to 0.99):

  • Schema Match (30%): Tables exist with correct (name, type) signatures.
  • Data Match (40%): Row content matches golden DB (order-independent).
  • FK & Integrity (20%): Foreign keys enforced, integrity_check passes.
  • Anti-Exploit (10%): Penalty for empty tables or schema pollution.

Security Guardrails

  • PRAGMA Blacklist: foreign_keys = OFF and writable_schema = ON are blocked.
  • Query Timeout: SQLite progress handler terminates queries exceeding 500k ops.
  • Dangerous SQL: ATTACH, DETACH, and LOAD_EXTENSION are filtered.

πŸš€ Setup & Usage

Local Deployment

pip install -r requirements.txt
python -m server.app  # Starts OpenEnv server on port 7860

Environment Variables

export HF_TOKEN=your_token
export API_BASE_URL=https://router.huggingface.co/v1
export MODEL_NAME=Qwen/Qwen2.5-72B-Instruct

API Endpoints

  • POST /reset: Initialize migration episode.
  • POST /step: Execute SQL and reasoning.
  • GET /tasks: List all available scenarios.
  • POST /grader: Run deep comparison against Golden DB.

πŸ–ΌοΈ Observations

Each observation includes erd_visualization (Mermaid.js) and schema_diff to assist agents in understanding the current drift.

πŸ“„ License

MIT. Built for the OpenEnv Hackathon 2026.