Spaces:

Eishaan
/

sql-migration-env

Sleeping

App Files Files Community

sql-migration-env / README.md

Eishaan

docs: merge professional visuals with detailed technical specs for ultimate README

5f32203 about 1 month ago

preview code

raw

history blame contribute delete

6.27 kB

	---
	title: SQL Migration Agent Benchmark
	emoji: 🗄️
	colorFrom: blue
	colorTo: purple
	sdk: docker
	app_file: server/app.py
	pinned: false
	---

	# SQL Migration Agent Benchmark (OpenEnv)
	> A Production-Grade Evaluation Suite for Database Engineering Agents.

	[![OpenEnv Compliant](https://img.shields.io/badge/OpenEnv-Compliant-success)](https://github.com/openenv/core)
	[![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](https://opensource.org/licenses/MIT)
	[![Hugging Face Space](https://img.shields.io/badge/HF%20Space-Deployed-orange)](https://huggingface.co/spaces/Eishaan/sql-migration-env)

	An OpenEnv-compatible environment for evaluating AI agents on autonomous SQLite database migration tasks. The agent receives a broken/drifted schema and must write SQL to transform it to a target state without losing data.

	---

	## 🏗️ Architecture Overview

	The suite combines formal sequence modeling with a modular local engine.

	### System Mapping
	```
	┌─────────────────────────────────┐
	│ inference.py (Baseline Agent) │
	│ - LLM API calls (OpenAI fmt) │
	│ - JSON mode + fallback parser │
	└─────────┬───────────────────────┘
	│ MigrationAction
	┌─────────▼───────────────────────┐
	│ environment.py (OpenEnv Env) │
	│ - SQLite execution engine │
	│ - ERD & Schema Diff generator │
	│ - SQL timeout & Blacklist │
	└─────────┬───────────────────────┘
	│ score()
	┌─────────▼───────────────────────┐
	│ grader.py (Golden DB Engine) │
	│ - Dynamic golden reference DB │
	│ - Schema + data + FK scoring │
	│ - Anti-exploit checks │
	└─────────────────────────────────┘
	```

	### Protocol Flow
	```mermaid
	sequenceDiagram
	participant Agent
	participant Env as MigrationEnv Server
	participant DB as SQLite (:memory:)
	participant Grader as Dynamic Golden Grader

	Agent->>Env: POST /reset (task_name)
	Env->>DB: Seed Source Data
	Env->>Grader: Build Golden Reference
	Grader-->>Env: Initial Score
	Env-->>Agent: Observation (DDL, Schema Diff, ERD)

	loop Migration Steps
	Agent->>Env: POST /step (SQL, Reasoning)
	Env->>DB: Execute SQL (w/ Timeout & Blacklist)
	Env->>Grader: Compute Delta Reward
	Grader-->>Env: current_score, reward
	Env-->>Agent: New Observation + ERD (Visualization)
	end

	Agent->>Env: submit_final = True
	Env->>Grader: Final Integrity & FK Check
	Env-->>Agent: Final Episode Summary (Trajectory)
	```

	---

	## 🎯 Benchmark Tasks

	\| # \| Task \| Difficulty \| Challenge \|
	\|---\|------\|-----------\|-----------\|
	\| 1 \| `column-restructure` \| 🟢 Easy \| Merge first_name + last_name → full_name (with apostrophes) \|
	\| 2 \| `soft-delete-restoration` \| 🟢 Easy \| Restore deleted products from `deletion_log` \|
	\| 3 \| `table-normalization` \| 🟡 Medium \| Normalize `purchases` → `customers` + `orders` + FK \|
	\| 4 \| `schema-version-merge` \| 🟡 Medium \| Merge v1/v2 product tables with price coercion \|
	\| 5 \| `multi-entity-extraction` \| 🟡 Medium \| 3NF decomposition with invalid data routing \|
	\| 6 \| `cascade-migration` \| 🔴 Hard \| 4-table FK cascade, type coercion, orphan audit \|
	\| 7 \| `dual-source-consolidation`\| 🔴 Hard \| 6→4 table merge, cross-system email dedup \|

	### 🛠️ Adversarial Edge Cases (The "Stress Tests")
	- O'Brien: Apostrophe in data — tests SQL escaping and string literal handling.
	- $90,000 salary: TEXT→INTEGER coercion — tests complex string parsing and casting.
	- Empty string emails: NOT NULL vs Empty — tests data quality validation logic.
	- Leading whitespace: ` alice@company.com` — tests TRIM awareness.
	- ID conflicts: Overlapping IDs in dual sources — tests intelligent merge logic.
	- Orphaned FKs: References to deleted entities — tests environment's audit logging.
	- NULL currency: Must default to 'USD' — tests COALESCE usage.

	---

	## ⚖️ Evaluation Baselines

	\| Task \| Qwen 32B Score \| GPT-OSS 120B \|
	\|------\|--------------\|--------------\|
	\| `column-restructure` \| 0.99 \| 0.99 \|
	\| `soft-delete-restoration` \| 0.99 \| 0.99 \|
	\| `table-normalization` \| 0.94 \| 0.99 \|
	\| `schema-version-merge` \| 0.93 \| 0.98 \|
	\| `multi-entity-extraction` \| 0.35 \| 0.65 \|
	\| `cascade-migration` \| 0.61 \| 0.83 \|
	\| `dual-source-consolidation`\| 0.28 \| 0.38 \|

	---

	## 🛡️ Security & Reward Function

	### The Reward Formula
	Rewards are calculated as progress deltas: $R_t = P_t - P_{t-1}$.
	Progress $P_t$ is a weighted sum (0.01 to 0.99):
	- Schema Match (30%): Tables exist with correct `(name, type)` signatures.
	- Data Match (40%): Row content matches golden DB (order-independent).
	- FK & Integrity (20%): Foreign keys enforced, `integrity_check` passes.
	- Anti-Exploit (10%): Penalty for empty tables or schema pollution.

	### Security Guardrails
	- PRAGMA Blacklist: `foreign_keys = OFF` and `writable_schema = ON` are blocked.
	- Query Timeout: SQLite progress handler terminates queries exceeding 500k ops.
	- Dangerous SQL: `ATTACH`, `DETACH`, and `LOAD_EXTENSION` are filtered.

	---

	## 🚀 Setup & Usage

	### Local Deployment
	```bash
	pip install -r requirements.txt
	python -m server.app # Starts OpenEnv server on port 7860
	```

	### Environment Variables
	```bash
	export HF_TOKEN=your_token
	export API_BASE_URL=https://router.huggingface.co/v1
	export MODEL_NAME=Qwen/Qwen2.5-72B-Instruct
	```

	### API Endpoints
	- `POST /reset`: Initialize migration episode.
	- `POST /step`: Execute SQL and reasoning.
	- `GET /tasks`: List all available scenarios.
	- `POST /grader`: Run deep comparison against Golden DB.

	---

	## 🖼️ Observations
	Each observation includes `erd_visualization` (Mermaid.js) and `schema_diff` to assist agents in understanding the current drift.

	## 📄 License
	MIT. Built for the OpenEnv Hackathon 2026.