Spaces:
Sleeping
Sleeping
feat: expand to 7 tasks (2E/3M/2H) + engine hardening
Browse files- Add 4 new tasks: soft-delete-restoration, schema-version-merge,
multi-entity-extraction, dual-source-consolidation
- Harden grader: PRAGMA bypass fix, sqlite_% table filtering
- Harden environment: multi-statement error handling, schema pollution
- Fix inference context window: target DDL baked into system prompt
- Per-task step budgets: Easy=10, Medium=15, Hard=20
- Update app.py, README.md for 7-task architecture
- All tests passing
- README.md +38 -44
- __pycache__/seeds.cpython-312.pyc +0 -0
- inference.py +30 -14
- seeds.py +455 -0
- server/__pycache__/environment.cpython-312.pyc +0 -0
- server/__pycache__/grader.cpython-312.pyc +0 -0
- server/app.py +50 -45
- server/environment.py +16 -3
- server/grader.py +406 -3
- test_all_tasks.py +49 -0
README.md
CHANGED
|
@@ -1,6 +1,6 @@
|
|
| 1 |
---
|
| 2 |
title: SQL Migration Agent
|
| 3 |
-
emoji:
|
| 4 |
colorFrom: blue
|
| 5 |
colorTo: purple
|
| 6 |
sdk: docker
|
|
@@ -11,10 +11,9 @@ tags:
|
|
| 11 |
|
| 12 |
# SQL Schema Migration Agent
|
| 13 |
|
| 14 |
-
|
| 15 |
> **An OpenEnv environment for benchmarking autonomous database migration agents.**
|
| 16 |
>
|
| 17 |
-
> Built for the Meta
|
| 18 |
|
| 19 |
---
|
| 20 |
|
|
@@ -22,7 +21,7 @@ tags:
|
|
| 22 |
|
| 23 |
Database schema migrations are among the most error-prone, high-stakes tasks in software engineering. Every production system faces them as application models evolve, yet they are extremely difficult to automate safely because data must be perfectly preserved.
|
| 24 |
|
| 25 |
-
This environment trains AI agents to autonomously reconcile schema drift the exact way a real CI/CD pipeline would
|
| 26 |
|
| 27 |
**Real-world analogues:** `Flyway`, `Liquibase`, Django `makemigrations`, `Terraform` state transitions. This environment models that exact problem, reduced to an agentic RL core.
|
| 28 |
|
|
@@ -32,19 +31,24 @@ This environment trains AI agents to autonomously reconcile schema drift the exa
|
|
| 32 |
|
| 33 |
Unlike simplistic environments that merely string-match SQL schemas, this environment uses a **deep structural reconciliation grader** built specifically to prevent LLM gamification:
|
| 34 |
|
| 35 |
-
1. **Zero-Sum Exploit Protection:** Naive agents will often execute `DROP TABLE x; CREATE TABLE x (...)` to easily match the target schema, silently destroying all data. Our grader actively runs `SELECT COUNT(*)` and data-integrity
|
| 36 |
-
2. **
|
| 37 |
-
3. **
|
|
|
|
| 38 |
|
| 39 |
---
|
| 40 |
|
| 41 |
-
## Tasks
|
| 42 |
|
| 43 |
-
| # | Name | Difficulty | Description |
|
| 44 |
-
|---|------|-----------|-------------|
|
| 45 |
-
| 1 | `column-restructure` | Easy | Merge `first_name` + `last_name`
|
| 46 |
-
| 2 | `
|
| 47 |
-
| 3 | `
|
|
|
|
|
|
|
|
|
|
|
|
|
| 48 |
|
| 49 |
---
|
| 50 |
|
|
@@ -55,8 +59,8 @@ Unlike simplistic environments that merely string-match SQL schemas, this enviro
|
|
| 55 |
| `current_schema_sql` | `str` | Current database DDL extracted from `sqlite_master` |
|
| 56 |
| `target_schema_sql` | `str` | Target DDL the agent must reach |
|
| 57 |
| `last_execution_result` | `str` | Result of last SQL execution, or error message |
|
| 58 |
-
| `step_number` | `int` | Current step count
|
| 59 |
-
| `migration_progress` | `float` | Current grader score [0.0
|
| 60 |
| `task_name` | `str` | Name of the active task |
|
| 61 |
| `done` | `bool` | Whether the episode has terminated |
|
| 62 |
| `reward` | `float` | Step reward: score delta from previous step (can be negative) |
|
|
@@ -73,26 +77,11 @@ Unlike simplistic environments that merely string-match SQL schemas, this enviro
|
|
| 73 |
|
| 74 |
## Reward Function
|
| 75 |
|
| 76 |
-
- **Step reward**: Delta between current and previous migration score. Strongly negative for destructive actions (e.g., wrong DROP TABLE
|
| 77 |
-
- **Episode score**: Clamped to
|
| 78 |
-
- **Exploit protection**: If schema matches target
|
| 79 |
-
- **
|
| 80 |
-
|
| 81 |
-
### Task 3 Scoring Breakdown
|
| 82 |
-
|
| 83 |
-
| Check | Weight | Description |
|
| 84 |
-
|-------|--------|-------------|
|
| 85 |
-
| `audit_log` exists | 0.10 | Orphan audit table created |
|
| 86 |
-
| `audit_log` row count ≥ 3 | 0.10 | All orphaned/invalid records logged |
|
| 87 |
-
| Correct audit entries | 0.20 | Right `(source_table, reason)` pairs |
|
| 88 |
-
| FK: `departments→companies` | 0.05 | FK chain step 1 |
|
| 89 |
-
| FK: `employees→departments` | 0.05 | FK chain step 2 |
|
| 90 |
-
| FK: `assets→employees` | 0.05 | FK chain step 3 |
|
| 91 |
-
| `companies.name` NOT NULL | 0.05 | Constraint enforcement |
|
| 92 |
-
| Employee count = 4 | 0.05 | Hal Patel (NULL salary) removed |
|
| 93 |
-
| Salary coercion correct | 0.15 | All `$90000` → `90000` INTEGER |
|
| 94 |
-
| No orphaned assets | 0.10 | All `asset.employee_id` valid |
|
| 95 |
-
| `PRAGMA integrity_check` | 0.10 | Full DB integrity passes |
|
| 96 |
|
| 97 |
---
|
| 98 |
|
|
@@ -102,7 +91,7 @@ Unlike simplistic environments that merely string-match SQL schemas, this enviro
|
|
| 102 |
# Install dependencies
|
| 103 |
pip install -r requirements.txt
|
| 104 |
|
| 105 |
-
# Run baseline inference
|
| 106 |
export HF_TOKEN=your_token_here
|
| 107 |
export API_BASE_URL=https://router.huggingface.co/v1
|
| 108 |
export MODEL_NAME=Qwen/Qwen2.5-72B-Instruct
|
|
@@ -110,6 +99,7 @@ python inference.py
|
|
| 110 |
|
| 111 |
# Run validation tests
|
| 112 |
python test_smoke.py
|
|
|
|
| 113 |
|
| 114 |
# Start environment server locally
|
| 115 |
uvicorn server.app:app --host 0.0.0.0 --port 7860
|
|
@@ -125,7 +115,7 @@ uvicorn server.app:app --host 0.0.0.0 --port 7860
|
|
| 125 |
| `/reset` | POST | Reset environment, returns initial observation |
|
| 126 |
| `/step` | POST | Execute action, returns observation + reward |
|
| 127 |
| `/state` | GET | Current environment state |
|
| 128 |
-
| `/tasks` | GET | List all
|
| 129 |
| `/grader` | POST | Run grader on all tasks, return scores |
|
| 130 |
| `/schema` | GET | OpenEnv schema (action/observation types) |
|
| 131 |
| `/ws` | WS | WebSocket for real-time interaction |
|
|
@@ -152,10 +142,13 @@ docker run -p 7860:7860 \
|
|
| 152 |
|
| 153 |
| Task | Score | Steps | Model |
|
| 154 |
|------|-------|-------|-------|
|
| 155 |
-
| `column-restructure` |
|
| 156 |
-
| `
|
| 157 |
-
| `
|
| 158 |
-
|
|
|
|
|
|
|
|
|
|
|
| 159 |
|
| 160 |
---
|
| 161 |
|
|
@@ -163,9 +156,10 @@ docker run -p 7860:7860 \
|
|
| 163 |
|
| 164 |
- [x] `docker build` succeeds
|
| 165 |
- [x] `curl /health` returns 200
|
| 166 |
-
- [x] `curl /tasks` returns
|
| 167 |
- [x] `curl -X POST /reset` returns valid observation
|
| 168 |
- [x] `openenv validate` passes
|
| 169 |
-
- [x] Baseline script completes all
|
| 170 |
-
- [x] Grader scores in
|
| 171 |
- [x] Exploit protection: empty-table shortcuts penalized
|
|
|
|
|
|
| 1 |
---
|
| 2 |
title: SQL Migration Agent
|
| 3 |
+
emoji: "\U0001F5C4\uFE0F"
|
| 4 |
colorFrom: blue
|
| 5 |
colorTo: purple
|
| 6 |
sdk: docker
|
|
|
|
| 11 |
|
| 12 |
# SQL Schema Migration Agent
|
| 13 |
|
|
|
|
| 14 |
> **An OpenEnv environment for benchmarking autonomous database migration agents.**
|
| 15 |
>
|
| 16 |
+
> Built for the Meta x Hugging Face OpenEnv Hackathon.
|
| 17 |
|
| 18 |
---
|
| 19 |
|
|
|
|
| 21 |
|
| 22 |
Database schema migrations are among the most error-prone, high-stakes tasks in software engineering. Every production system faces them as application models evolve, yet they are extremely difficult to automate safely because data must be perfectly preserved.
|
| 23 |
|
| 24 |
+
This environment trains AI agents to autonomously reconcile schema drift the exact way a real CI/CD pipeline would -- given a flawed current state and an ideal target state, the agent must compute and safely execute the transformation sequence using raw SQL.
|
| 25 |
|
| 26 |
**Real-world analogues:** `Flyway`, `Liquibase`, Django `makemigrations`, `Terraform` state transitions. This environment models that exact problem, reduced to an agentic RL core.
|
| 27 |
|
|
|
|
| 31 |
|
| 32 |
Unlike simplistic environments that merely string-match SQL schemas, this environment uses a **deep structural reconciliation grader** built specifically to prevent LLM gamification:
|
| 33 |
|
| 34 |
+
1. **Zero-Sum Exploit Protection:** Naive agents will often execute `DROP TABLE x; CREATE TABLE x (...)` to easily match the target schema, silently destroying all data. Our grader actively runs `SELECT COUNT(*)`, `SUM(id)`, and data-integrity fingerprinting. If a table's schema matches but the data is gone, the score is brutally clamped to `0.01`.
|
| 35 |
+
2. **PRAGMA Bypass Prevention:** The grader re-asserts `PRAGMA foreign_keys = ON` before every scoring pass, preventing agents from disabling FK constraints to cheat.
|
| 36 |
+
3. **Granular Partial Credit:** Multi-step migrations (like Task 7's 6-to-4 table consolidation) require 18+ steps. Binary pass/fail rewards provide zero learning signal. Our grader assigns fractional weights to individual FK constraints, data type coercions, and orphaned record audit logs, providing continuous RL reward gradients.
|
| 37 |
+
4. **Deterministic Adversarial Seeds:** Our injected data includes edge cases that break naive SQL: `O'Brien` (apostrophes), `$1,234.56` (comma+dollar coercion), orphaned foreign keys, NULL emails, and leading whitespace in emails.
|
| 38 |
|
| 39 |
---
|
| 40 |
|
| 41 |
+
## Tasks (2 Easy / 3 Medium / 2 Hard)
|
| 42 |
|
| 43 |
+
| # | Name | Difficulty | Steps | Description |
|
| 44 |
+
|---|------|-----------|-------|-------------|
|
| 45 |
+
| 1 | `column-restructure` | Easy | 10 | Merge `first_name` + `last_name` into `full_name` without data loss. Adversarial: apostrophes (`O'Brien`), mid-caps (`McDonald`) |
|
| 46 |
+
| 2 | `soft-delete-restoration` | Easy | 10 | Restore deleted products from `deletion_log`, add `is_deleted`/`deleted_at` columns. Adversarial: `stock=0` must not be confused with `is_deleted=1` |
|
| 47 |
+
| 3 | `table-normalization` | Medium | 15 | Decompose flat `purchases` into `customers` + `orders` with FK. Adversarial: duplicate emails (x3), commas in item names |
|
| 48 |
+
| 4 | `schema-version-merge` | Medium | 15 | Merge overlapping `products_v1` (TEXT prices) and `products_v2` (REAL prices) with conflict resolution and `source` tracking. Adversarial: `$XX.XX` coercion, NULL category, high ID=101 |
|
| 49 |
+
| 5 | `multi-entity-extraction` | Medium | 15 | Decompose `sales_records` god-table into 3NF (5 tables) with 3 FKs and invalid data routing. Adversarial: leading whitespace email, empty email, comma in SKU |
|
| 50 |
+
| 6 | `cascade-migration` | Hard | 20 | 4-table FK cascade: type coercion (`$90000` TEXT to `90000` INTEGER), orphan audit logging, NULL salary removal, full FK chain enforcement |
|
| 51 |
+
| 7 | `dual-source-consolidation` | Hard | 20 | Merge 6 tables from two incompatible systems (Legacy CRM + Modern SaaS) into 4 unified tables with cross-system email dedup, currency coercion, orphan detection |
|
| 52 |
|
| 53 |
---
|
| 54 |
|
|
|
|
| 59 |
| `current_schema_sql` | `str` | Current database DDL extracted from `sqlite_master` |
|
| 60 |
| `target_schema_sql` | `str` | Target DDL the agent must reach |
|
| 61 |
| `last_execution_result` | `str` | Result of last SQL execution, or error message |
|
| 62 |
+
| `step_number` | `int` | Current step count |
|
| 63 |
+
| `migration_progress` | `float` | Current grader score [0.01-0.99] |
|
| 64 |
| `task_name` | `str` | Name of the active task |
|
| 65 |
| `done` | `bool` | Whether the episode has terminated |
|
| 66 |
| `reward` | `float` | Step reward: score delta from previous step (can be negative) |
|
|
|
|
| 77 |
|
| 78 |
## Reward Function
|
| 79 |
|
| 80 |
+
- **Step reward**: Delta between current and previous migration score. Strongly negative for destructive actions (e.g., wrong DROP TABLE leads to -0.4).
|
| 81 |
+
- **Episode score**: Clamped to (0.01, 0.99). Final state wins -- regressions hurt.
|
| 82 |
+
- **Exploit protection**: If schema matches target but tables are empty (agent deleted data), score is capped at 0.01.
|
| 83 |
+
- **PRAGMA protection**: `PRAGMA foreign_keys = ON` is re-asserted before every grading pass.
|
| 84 |
+
- **Auto-termination**: Episode ends immediately when score reaches 0.99, preventing post-success regression.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 85 |
|
| 86 |
---
|
| 87 |
|
|
|
|
| 91 |
# Install dependencies
|
| 92 |
pip install -r requirements.txt
|
| 93 |
|
| 94 |
+
# Run baseline inference (requires HF_TOKEN)
|
| 95 |
export HF_TOKEN=your_token_here
|
| 96 |
export API_BASE_URL=https://router.huggingface.co/v1
|
| 97 |
export MODEL_NAME=Qwen/Qwen2.5-72B-Instruct
|
|
|
|
| 99 |
|
| 100 |
# Run validation tests
|
| 101 |
python test_smoke.py
|
| 102 |
+
python test_all_tasks.py
|
| 103 |
|
| 104 |
# Start environment server locally
|
| 105 |
uvicorn server.app:app --host 0.0.0.0 --port 7860
|
|
|
|
| 115 |
| `/reset` | POST | Reset environment, returns initial observation |
|
| 116 |
| `/step` | POST | Execute action, returns observation + reward |
|
| 117 |
| `/state` | GET | Current environment state |
|
| 118 |
+
| `/tasks` | GET | List all 7 tasks with descriptions |
|
| 119 |
| `/grader` | POST | Run grader on all tasks, return scores |
|
| 120 |
| `/schema` | GET | OpenEnv schema (action/observation types) |
|
| 121 |
| `/ws` | WS | WebSocket for real-time interaction |
|
|
|
|
| 142 |
|
| 143 |
| Task | Score | Steps | Model |
|
| 144 |
|------|-------|-------|-------|
|
| 145 |
+
| `column-restructure` | 0.99 | 4 | Qwen/Qwen2.5-72B-Instruct |
|
| 146 |
+
| `soft-delete-restoration` | 0.99 | 5-7 | Qwen/Qwen2.5-72B-Instruct |
|
| 147 |
+
| `table-normalization` | 0.99 | 5-8 | Qwen/Qwen2.5-72B-Instruct |
|
| 148 |
+
| `schema-version-merge` | 0.60-0.85 | 8-12 | Qwen/Qwen2.5-72B-Instruct |
|
| 149 |
+
| `multi-entity-extraction` | 0.40-0.70 | 12-15 | Qwen/Qwen2.5-72B-Instruct |
|
| 150 |
+
| `cascade-migration` | 0.30-0.65 | 15-20 | Qwen/Qwen2.5-72B-Instruct |
|
| 151 |
+
| `dual-source-consolidation` | 0.20-0.50 | 18-20 | Qwen/Qwen2.5-72B-Instruct |
|
| 152 |
|
| 153 |
---
|
| 154 |
|
|
|
|
| 156 |
|
| 157 |
- [x] `docker build` succeeds
|
| 158 |
- [x] `curl /health` returns 200
|
| 159 |
+
- [x] `curl /tasks` returns 7 tasks
|
| 160 |
- [x] `curl -X POST /reset` returns valid observation
|
| 161 |
- [x] `openenv validate` passes
|
| 162 |
+
- [x] Baseline script completes all 7 tasks without crashing
|
| 163 |
+
- [x] Grader scores in (0.01, 0.99) range
|
| 164 |
- [x] Exploit protection: empty-table shortcuts penalized
|
| 165 |
+
- [x] PRAGMA bypass protection enforced
|
__pycache__/seeds.cpython-312.pyc
CHANGED
|
Binary files a/__pycache__/seeds.cpython-312.pyc and b/__pycache__/seeds.cpython-312.pyc differ
|
|
|
inference.py
CHANGED
|
@@ -30,7 +30,7 @@ HF_TOKEN = os.getenv("HF_TOKEN") # No default — must be set by user
|
|
| 30 |
# Also support OPENAI_API_KEY as primary (per spec) and API_KEY as alias
|
| 31 |
API_KEY = os.getenv("OPENAI_API_KEY") or HF_TOKEN or os.getenv("API_KEY")
|
| 32 |
|
| 33 |
-
|
| 34 |
|
| 35 |
CRITICAL — SQLite-specific rules (violations cause immediate errors):
|
| 36 |
1. SQLite does NOT support ALTER TABLE ADD CONSTRAINT — never use it.
|
|
@@ -45,11 +45,22 @@ CRITICAL — SQLite-specific rules (violations cause immediate errors):
|
|
| 45 |
10. Execute exactly ONE SQL statement per step.
|
| 46 |
11. When migration is complete (schemas match, data preserved), set submit_final to true IMMEDIATELY.
|
| 47 |
|
| 48 |
-
|
| 49 |
-
{
|
| 50 |
|
| 51 |
-
|
| 52 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 53 |
MAX_PARSE_ERRORS = 5 # Higher tolerance for thinking models (Qwen3, DeepSeek-R1)
|
| 54 |
|
| 55 |
# Auto-submit threshold: if migration_progress >= this, force submit_final
|
|
@@ -139,18 +150,24 @@ def run_task_local(task_name: str) -> dict:
|
|
| 139 |
sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))
|
| 140 |
from server.environment import DbMigrationEnvironment
|
| 141 |
from models import MigrationAction
|
|
|
|
| 142 |
|
| 143 |
env = DbMigrationEnvironment(task_name=task_name)
|
| 144 |
|
|
|
|
|
|
|
|
|
|
| 145 |
print(f"[START] task={task_name} env=sql-migration-agent model={MODEL_NAME}", flush=True)
|
| 146 |
|
| 147 |
obs = env.reset()
|
| 148 |
-
history = [{"role": "system", "content": SYSTEM_PROMPT}]
|
| 149 |
|
| 150 |
-
#
|
|
|
|
|
|
|
|
|
|
|
|
|
| 151 |
initial_msg = (
|
| 152 |
f"CURRENT DATABASE SCHEMA:\n{obs.current_schema_sql}\n\n"
|
| 153 |
-
f"TARGET SCHEMA:\n{obs.target_schema_sql}\n\n"
|
| 154 |
f"Status: {obs.last_execution_result}\n"
|
| 155 |
f"Migration progress: {obs.migration_progress:.2f}\n\n"
|
| 156 |
f"Write your first SQL command to begin the migration."
|
|
@@ -164,7 +181,7 @@ def run_task_local(task_name: str) -> dict:
|
|
| 164 |
done = False
|
| 165 |
peak_score = 0.0 # Track the highest score we've reached
|
| 166 |
|
| 167 |
-
for step in range(
|
| 168 |
if done:
|
| 169 |
break
|
| 170 |
|
|
@@ -253,10 +270,11 @@ def run_task_local(task_name: str) -> dict:
|
|
| 253 |
# Add to conversation history
|
| 254 |
history.append({"role": "assistant", "content": json.dumps(action_dict)})
|
| 255 |
|
|
|
|
| 256 |
feedback_msg = (
|
| 257 |
f"EXECUTION RESULT: {obs.last_execution_result}\n\n"
|
| 258 |
f"CURRENT SCHEMA:\n{obs.current_schema_sql}\n\n"
|
| 259 |
-
f"
|
| 260 |
)
|
| 261 |
if done:
|
| 262 |
feedback_msg += "\n\nEpisode complete."
|
|
@@ -309,11 +327,9 @@ def main():
|
|
| 309 |
# Summary
|
| 310 |
scores = list(results.values())
|
| 311 |
avg = sum(scores) / len(scores) if scores else 0.0
|
|
|
|
| 312 |
print(
|
| 313 |
-
f"[SUMMARY]
|
| 314 |
-
f"task2={results.get('table-normalization', 0):.2f} "
|
| 315 |
-
f"task3={results.get('cascade-migration', 0):.2f} "
|
| 316 |
-
f"avg={avg:.2f}",
|
| 317 |
flush=True,
|
| 318 |
)
|
| 319 |
|
|
|
|
| 30 |
# Also support OPENAI_API_KEY as primary (per spec) and API_KEY as alias
|
| 31 |
API_KEY = os.getenv("OPENAI_API_KEY") or HF_TOKEN or os.getenv("API_KEY")
|
| 32 |
|
| 33 |
+
SYSTEM_PROMPT_TEMPLATE = """You are an autonomous SQLite database migration engine. You receive the current schema and a target schema. Write SQL to transform the current state to the target state without losing row data.
|
| 34 |
|
| 35 |
CRITICAL — SQLite-specific rules (violations cause immediate errors):
|
| 36 |
1. SQLite does NOT support ALTER TABLE ADD CONSTRAINT — never use it.
|
|
|
|
| 45 |
10. Execute exactly ONE SQL statement per step.
|
| 46 |
11. When migration is complete (schemas match, data preserved), set submit_final to true IMMEDIATELY.
|
| 47 |
|
| 48 |
+
TARGET SCHEMA (fixed — achieve this exactly):
|
| 49 |
+
{target_ddl}
|
| 50 |
|
| 51 |
+
Respond ONLY with valid JSON — no markdown, no code blocks, no text outside the object:
|
| 52 |
+
{{"sql_command": "your SQL here", "reasoning": "why", "submit_final": false}}"""
|
| 53 |
+
|
| 54 |
+
ALL_TASKS = [
|
| 55 |
+
"column-restructure",
|
| 56 |
+
"soft-delete-restoration",
|
| 57 |
+
"table-normalization",
|
| 58 |
+
"schema-version-merge",
|
| 59 |
+
"multi-entity-extraction",
|
| 60 |
+
"cascade-migration",
|
| 61 |
+
"dual-source-consolidation",
|
| 62 |
+
]
|
| 63 |
+
MAX_STEPS = 20 # Global fallback; per-task limits override this
|
| 64 |
MAX_PARSE_ERRORS = 5 # Higher tolerance for thinking models (Qwen3, DeepSeek-R1)
|
| 65 |
|
| 66 |
# Auto-submit threshold: if migration_progress >= this, force submit_final
|
|
|
|
| 150 |
sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))
|
| 151 |
from server.environment import DbMigrationEnvironment
|
| 152 |
from models import MigrationAction
|
| 153 |
+
import seeds
|
| 154 |
|
| 155 |
env = DbMigrationEnvironment(task_name=task_name)
|
| 156 |
|
| 157 |
+
# Use task-specific step budget (defaults to global MAX_STEPS)
|
| 158 |
+
task_max_steps = seeds.TASKS.get(task_name, {}).get("max_steps", MAX_STEPS)
|
| 159 |
+
|
| 160 |
print(f"[START] task={task_name} env=sql-migration-agent model={MODEL_NAME}", flush=True)
|
| 161 |
|
| 162 |
obs = env.reset()
|
|
|
|
| 163 |
|
| 164 |
+
# Build task-specific system prompt with target DDL baked in (sent ONCE)
|
| 165 |
+
task_system_prompt = SYSTEM_PROMPT_TEMPLATE.format(target_ddl=obs.target_schema_sql)
|
| 166 |
+
history = [{"role": "system", "content": task_system_prompt}]
|
| 167 |
+
|
| 168 |
+
# Initial observation — only current schema (target already in system prompt)
|
| 169 |
initial_msg = (
|
| 170 |
f"CURRENT DATABASE SCHEMA:\n{obs.current_schema_sql}\n\n"
|
|
|
|
| 171 |
f"Status: {obs.last_execution_result}\n"
|
| 172 |
f"Migration progress: {obs.migration_progress:.2f}\n\n"
|
| 173 |
f"Write your first SQL command to begin the migration."
|
|
|
|
| 181 |
done = False
|
| 182 |
peak_score = 0.0 # Track the highest score we've reached
|
| 183 |
|
| 184 |
+
for step in range(task_max_steps):
|
| 185 |
if done:
|
| 186 |
break
|
| 187 |
|
|
|
|
| 270 |
# Add to conversation history
|
| 271 |
history.append({"role": "assistant", "content": json.dumps(action_dict)})
|
| 272 |
|
| 273 |
+
# Lean feedback — target is already in the system prompt, no need to repeat
|
| 274 |
feedback_msg = (
|
| 275 |
f"EXECUTION RESULT: {obs.last_execution_result}\n\n"
|
| 276 |
f"CURRENT SCHEMA:\n{obs.current_schema_sql}\n\n"
|
| 277 |
+
f"Progress: {obs.migration_progress:.2f}"
|
| 278 |
)
|
| 279 |
if done:
|
| 280 |
feedback_msg += "\n\nEpisode complete."
|
|
|
|
| 327 |
# Summary
|
| 328 |
scores = list(results.values())
|
| 329 |
avg = sum(scores) / len(scores) if scores else 0.0
|
| 330 |
+
scores_str = " ".join(f"{t}={s:.2f}" for t, s in results.items())
|
| 331 |
print(
|
| 332 |
+
f"[SUMMARY] {scores_str} avg={avg:.2f}",
|
|
|
|
|
|
|
|
|
|
| 333 |
flush=True,
|
| 334 |
)
|
| 335 |
|
seeds.py
CHANGED
|
@@ -255,6 +255,429 @@ def seed_task3(conn: sqlite3.Connection) -> None:
|
|
| 255 |
conn.commit()
|
| 256 |
|
| 257 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 258 |
# =============================================================================
|
| 259 |
# Task Registry
|
| 260 |
# =============================================================================
|
|
@@ -265,17 +688,49 @@ TASKS = {
|
|
| 265 |
"target_ddl": TASK1_TARGET_DDL,
|
| 266 |
"description": "Merge first_name and last_name into a single full_name column without data loss",
|
| 267 |
"difficulty": "easy",
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 268 |
},
|
| 269 |
"table-normalization": {
|
| 270 |
"seed_fn": seed_task2,
|
| 271 |
"target_ddl": TASK2_TARGET_DDL,
|
| 272 |
"description": "Decompose a flat purchases table into normalized customers and orders tables with FK",
|
| 273 |
"difficulty": "medium",
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 274 |
},
|
| 275 |
"cascade-migration": {
|
| 276 |
"seed_fn": seed_task3,
|
| 277 |
"target_ddl": TASK3_TARGET_DDL,
|
| 278 |
"description": "Multi-table FK cascade with type coercion, NULL handling, and orphan audit logging",
|
| 279 |
"difficulty": "hard",
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 280 |
},
|
| 281 |
}
|
|
|
|
|
|
| 255 |
conn.commit()
|
| 256 |
|
| 257 |
|
| 258 |
+
# =============================================================================
|
| 259 |
+
# TASK 4: Soft-Delete Restoration (Easy)
|
| 260 |
+
# =============================================================================
|
| 261 |
+
# Agent must restore deleted products from a deletion_log, add is_deleted/deleted_at columns.
|
| 262 |
+
# Adversarial: "O'Brien Desk" (apostrophe), stock=0 on Webcam (must NOT confuse with is_deleted).
|
| 263 |
+
|
| 264 |
+
TASK4_SOURCE_DDL = """
|
| 265 |
+
CREATE TABLE products (
|
| 266 |
+
id INTEGER PRIMARY KEY,
|
| 267 |
+
name TEXT NOT NULL,
|
| 268 |
+
price REAL NOT NULL,
|
| 269 |
+
stock INTEGER NOT NULL
|
| 270 |
+
);
|
| 271 |
+
|
| 272 |
+
CREATE TABLE deletion_log (
|
| 273 |
+
id INTEGER PRIMARY KEY,
|
| 274 |
+
product_id INTEGER NOT NULL,
|
| 275 |
+
product_name TEXT NOT NULL,
|
| 276 |
+
product_price REAL NOT NULL,
|
| 277 |
+
product_stock INTEGER NOT NULL,
|
| 278 |
+
deleted_at TEXT NOT NULL
|
| 279 |
+
);
|
| 280 |
+
"""
|
| 281 |
+
|
| 282 |
+
TASK4_PRODUCTS_DATA = [
|
| 283 |
+
(1, "Laptop", 999.99, 15),
|
| 284 |
+
(2, "O'Brien Desk", 249.99, 8),
|
| 285 |
+
(3, "Monitor", 399.99, 23),
|
| 286 |
+
(4, "Keyboard", 89.99, 45),
|
| 287 |
+
(5, "Mouse", 29.99, 102),
|
| 288 |
+
]
|
| 289 |
+
|
| 290 |
+
TASK4_DELETION_LOG_DATA = [
|
| 291 |
+
(1, 6, "Headphones", 149.99, 30, "2024-01-15"),
|
| 292 |
+
(2, 7, "Webcam", 79.99, 0, "2024-02-20"), # stock=0 but NOT is_deleted=1
|
| 293 |
+
(3, 8, "USB-C Hub", 49.99, 12, "2024-03-10"),
|
| 294 |
+
]
|
| 295 |
+
|
| 296 |
+
TASK4_TARGET_DDL = """CREATE TABLE products (
|
| 297 |
+
id INTEGER PRIMARY KEY,
|
| 298 |
+
name TEXT NOT NULL,
|
| 299 |
+
price REAL NOT NULL,
|
| 300 |
+
stock INTEGER NOT NULL,
|
| 301 |
+
is_deleted INTEGER NOT NULL DEFAULT 0,
|
| 302 |
+
deleted_at TEXT
|
| 303 |
+
);"""
|
| 304 |
+
|
| 305 |
+
TASK4_EXPECTED_ROW_COUNT = 8
|
| 306 |
+
TASK4_EXPECTED_ID_SUM = 36 # 1+2+3+4+5+6+7+8
|
| 307 |
+
TASK4_EXPECTED_DELETED_COUNT = 3 # ids 6,7,8
|
| 308 |
+
TASK4_EXPECTED_ACTIVE_COUNT = 5 # ids 1-5
|
| 309 |
+
|
| 310 |
+
|
| 311 |
+
def seed_task4(conn: sqlite3.Connection) -> None:
|
| 312 |
+
"""Seed the database for Task 4: Soft-Delete Restoration."""
|
| 313 |
+
conn.executescript(TASK4_SOURCE_DDL)
|
| 314 |
+
conn.executemany(
|
| 315 |
+
"INSERT INTO products (id, name, price, stock) VALUES (?, ?, ?, ?)",
|
| 316 |
+
TASK4_PRODUCTS_DATA,
|
| 317 |
+
)
|
| 318 |
+
conn.executemany(
|
| 319 |
+
"INSERT INTO deletion_log (id, product_id, product_name, product_price, product_stock, deleted_at) "
|
| 320 |
+
"VALUES (?, ?, ?, ?, ?, ?)",
|
| 321 |
+
TASK4_DELETION_LOG_DATA,
|
| 322 |
+
)
|
| 323 |
+
conn.commit()
|
| 324 |
+
|
| 325 |
+
|
| 326 |
+
# =============================================================================
|
| 327 |
+
# TASK 5: Schema Version Merge (Medium)
|
| 328 |
+
# =============================================================================
|
| 329 |
+
# Agent must merge products_v1 (price as "$XX.XX" TEXT) and products_v2 (price as REAL)
|
| 330 |
+
# into a single products table. v2 wins on ID conflicts. Must add source column.
|
| 331 |
+
# Adversarial: id=101 high ID, NULL category, "$" price coercion, conflicting rows.
|
| 332 |
+
|
| 333 |
+
TASK5_SOURCE_DDL = """
|
| 334 |
+
CREATE TABLE products_v1 (
|
| 335 |
+
id INTEGER PRIMARY KEY,
|
| 336 |
+
name TEXT NOT NULL,
|
| 337 |
+
price TEXT NOT NULL,
|
| 338 |
+
category TEXT,
|
| 339 |
+
supplier TEXT
|
| 340 |
+
);
|
| 341 |
+
|
| 342 |
+
CREATE TABLE products_v2 (
|
| 343 |
+
id INTEGER PRIMARY KEY,
|
| 344 |
+
name TEXT NOT NULL,
|
| 345 |
+
unit_cost REAL NOT NULL,
|
| 346 |
+
category TEXT NOT NULL,
|
| 347 |
+
brand TEXT,
|
| 348 |
+
sku TEXT
|
| 349 |
+
);
|
| 350 |
+
"""
|
| 351 |
+
|
| 352 |
+
TASK5_V1_DATA = [
|
| 353 |
+
(1, "Widget A", "$12.50", "Electronics", "AcmeCo"),
|
| 354 |
+
(2, "Widget B", "$8.99", "Electronics", "AcmeCo"),
|
| 355 |
+
(3, "Gadget X", "$45.00", None, "TechCorp"),
|
| 356 |
+
(4, "Gadget Y", "$32.50", "Tools", "TechCorp"),
|
| 357 |
+
(5, "Doohickey", "$5.99", "Office", "SupplyPro"),
|
| 358 |
+
(101, "Legacy Item", "$99.99", "Electronics", "OldCo"),
|
| 359 |
+
]
|
| 360 |
+
|
| 361 |
+
TASK5_V2_DATA = [
|
| 362 |
+
(1, "Widget A", 12.50, "Electronics", "AcmeCo", "SKU-001"),
|
| 363 |
+
(2, "Widget B Updated", 9.99, "Electronics", "AcmeCo", "SKU-002"),
|
| 364 |
+
(6, "New Product F", 67.00, "Tools", "NewCorp", "SKU-006"),
|
| 365 |
+
(7, "New Product G", 23.50, "Office", "NewCorp", "SKU-007"),
|
| 366 |
+
(8, "New Product H", 11.00, "Electronics", "ImportCo","SKU-008"),
|
| 367 |
+
]
|
| 368 |
+
|
| 369 |
+
TASK5_TARGET_DDL = """CREATE TABLE products (
|
| 370 |
+
id INTEGER PRIMARY KEY,
|
| 371 |
+
name TEXT NOT NULL,
|
| 372 |
+
price REAL NOT NULL,
|
| 373 |
+
category TEXT,
|
| 374 |
+
supplier TEXT,
|
| 375 |
+
brand TEXT,
|
| 376 |
+
sku TEXT,
|
| 377 |
+
source TEXT NOT NULL
|
| 378 |
+
);"""
|
| 379 |
+
|
| 380 |
+
TASK5_EXPECTED_ROW_COUNT = 9
|
| 381 |
+
TASK5_EXPECTED_PRICE_SUM = round(12.50 + 9.99 + 45.00 + 32.50 + 5.99 + 99.99 + 67.00 + 23.50 + 11.00, 2)
|
| 382 |
+
TASK5_EXPECTED_BOTH_COUNT = 2 # ids 1 and 2
|
| 383 |
+
|
| 384 |
+
|
| 385 |
+
def seed_task5(conn: sqlite3.Connection) -> None:
|
| 386 |
+
"""Seed the database for Task 5: Schema Version Merge."""
|
| 387 |
+
conn.executescript(TASK5_SOURCE_DDL)
|
| 388 |
+
conn.executemany(
|
| 389 |
+
"INSERT INTO products_v1 (id, name, price, category, supplier) VALUES (?, ?, ?, ?, ?)",
|
| 390 |
+
TASK5_V1_DATA,
|
| 391 |
+
)
|
| 392 |
+
conn.executemany(
|
| 393 |
+
"INSERT INTO products_v2 (id, name, unit_cost, category, brand, sku) VALUES (?, ?, ?, ?, ?, ?)",
|
| 394 |
+
TASK5_V2_DATA,
|
| 395 |
+
)
|
| 396 |
+
conn.commit()
|
| 397 |
+
|
| 398 |
+
|
| 399 |
+
# =============================================================================
|
| 400 |
+
# TASK 6: Multi-Entity Extraction (Medium — Hard End)
|
| 401 |
+
# =============================================================================
|
| 402 |
+
# Agent must decompose a sales_records god-table into 3NF (5 tables).
|
| 403 |
+
# Adversarial: leading whitespace email, empty customer email, comma in SKU.
|
| 404 |
+
|
| 405 |
+
TASK6_SOURCE_DDL = """
|
| 406 |
+
CREATE TABLE sales_records (
|
| 407 |
+
id INTEGER PRIMARY KEY,
|
| 408 |
+
rep_name TEXT NOT NULL,
|
| 409 |
+
rep_email TEXT NOT NULL,
|
| 410 |
+
rep_region TEXT NOT NULL,
|
| 411 |
+
customer_name TEXT NOT NULL,
|
| 412 |
+
customer_email TEXT NOT NULL,
|
| 413 |
+
customer_tier TEXT NOT NULL,
|
| 414 |
+
product_name TEXT NOT NULL,
|
| 415 |
+
product_sku TEXT NOT NULL,
|
| 416 |
+
product_category TEXT NOT NULL,
|
| 417 |
+
quantity INTEGER NOT NULL,
|
| 418 |
+
unit_price REAL NOT NULL,
|
| 419 |
+
discount_pct INTEGER NOT NULL DEFAULT 0,
|
| 420 |
+
sale_date TEXT NOT NULL
|
| 421 |
+
);
|
| 422 |
+
"""
|
| 423 |
+
|
| 424 |
+
TASK6_SOURCE_DATA = [
|
| 425 |
+
(1, "Alice Chen", " alice@company.com", "North", "Globex Corp", "globex@corp.com", "enterprise", "Widget Pro", "WIDGET-001", "Electronics", 5, 299.99, 10, "2024-01-10"),
|
| 426 |
+
(2, "Alice Chen", "alice@company.com", "North", "Initech LLC", "info@initech.com", "basic", "Widget Pro", "WIDGET-001", "Electronics", 2, 299.99, 0, "2024-01-15"),
|
| 427 |
+
(3, "Bob Martinez", "bob@company.com", "South", "Globex Corp", "globex@corp.com", "enterprise", "Gadget X", "GADGET-X01", "Hardware", 10, 89.99, 5, "2024-01-20"),
|
| 428 |
+
(4, "Bob Martinez", "bob@company.com", "South", "Umbrella Inc", "sales@umbrella.co", "premium", "Gadget X", "GADGET-X01", "Hardware", 3, 89.99, 0, "2024-02-01"),
|
| 429 |
+
(5, "Carol White", "carol@company.com", "East", "Initech LLC", "info@initech.com", "basic", "Tool Kit", "TOOLS,001", "Hardware", 1, 199.99, 0, "2024-02-05"),
|
| 430 |
+
(6, "Alice Chen", "alice@company.com", "North", "Pendant Corp", "", "free", "Widget Pro", "WIDGET-001", "Electronics", 7, 299.99, 15, "2024-02-10"),
|
| 431 |
+
(7, "Carol White", "carol@company.com", "East", "Globex Corp", "globex@corp.com", "enterprise", "Nano Device", "NANO-D01", "Electronics", 2, 549.99, 20, "2024-02-15"),
|
| 432 |
+
(8, "Bob Martinez", "bob@company.com", "South", "Umbrella Inc", "sales@umbrella.co", "premium", "Tool Kit", "TOOLS,001", "Hardware", 4, 199.99, 10, "2024-03-01"),
|
| 433 |
+
(9, "Alice Chen", "alice@company.com", "North", "Initech LLC", "info@initech.com", "basic", "Nano Device", "NANO-D01", "Electronics", 1, 549.99, 0, "2024-03-05"),
|
| 434 |
+
(10, "Carol White", "carol@company.com", "East", "Umbrella Inc", "sales@umbrella.co", "premium", "Cable Bundle","CABLE-5PK", "Accessories", 20, 14.99, 0, "2024-03-10"),
|
| 435 |
+
(11, "Bob Martinez", "bob@company.com", "South", "Globex Corp", "globex@corp.com", "enterprise", "Cable Bundle","CABLE-5PK", "Accessories", 15, 14.99, 5, "2024-03-15"),
|
| 436 |
+
(12, "Carol White", "carol@company.com", "East", "Pendant Corp", "orders@pendant.io", "free", "Gadget X", "GADGET-X01", "Hardware", 6, 89.99, 0, "2024-03-20"),
|
| 437 |
+
]
|
| 438 |
+
|
| 439 |
+
TASK6_TARGET_DDL = """CREATE TABLE salespersons (
|
| 440 |
+
id INTEGER PRIMARY KEY,
|
| 441 |
+
name TEXT NOT NULL,
|
| 442 |
+
email TEXT NOT NULL UNIQUE,
|
| 443 |
+
region TEXT NOT NULL
|
| 444 |
+
);
|
| 445 |
+
|
| 446 |
+
CREATE TABLE customers (
|
| 447 |
+
id INTEGER PRIMARY KEY,
|
| 448 |
+
name TEXT NOT NULL,
|
| 449 |
+
email TEXT NOT NULL UNIQUE,
|
| 450 |
+
tier TEXT NOT NULL
|
| 451 |
+
);
|
| 452 |
+
|
| 453 |
+
CREATE TABLE products (
|
| 454 |
+
id INTEGER PRIMARY KEY,
|
| 455 |
+
name TEXT NOT NULL,
|
| 456 |
+
sku TEXT NOT NULL UNIQUE,
|
| 457 |
+
category TEXT NOT NULL
|
| 458 |
+
);
|
| 459 |
+
|
| 460 |
+
CREATE TABLE sales (
|
| 461 |
+
id INTEGER PRIMARY KEY,
|
| 462 |
+
salesperson_id INTEGER NOT NULL,
|
| 463 |
+
customer_id INTEGER NOT NULL,
|
| 464 |
+
product_id INTEGER NOT NULL,
|
| 465 |
+
quantity INTEGER NOT NULL,
|
| 466 |
+
unit_price REAL NOT NULL,
|
| 467 |
+
discount_pct INTEGER NOT NULL DEFAULT 0,
|
| 468 |
+
sale_date TEXT NOT NULL,
|
| 469 |
+
FOREIGN KEY (salesperson_id) REFERENCES salespersons(id),
|
| 470 |
+
FOREIGN KEY (customer_id) REFERENCES customers(id),
|
| 471 |
+
FOREIGN KEY (product_id) REFERENCES products(id)
|
| 472 |
+
);
|
| 473 |
+
|
| 474 |
+
CREATE TABLE data_issues (
|
| 475 |
+
id INTEGER PRIMARY KEY,
|
| 476 |
+
source_table TEXT NOT NULL,
|
| 477 |
+
source_row_id INTEGER NOT NULL,
|
| 478 |
+
issue_type TEXT NOT NULL,
|
| 479 |
+
issue_detail TEXT NOT NULL
|
| 480 |
+
);"""
|
| 481 |
+
|
| 482 |
+
TASK6_EXPECTED_SALESPERSON_COUNT = 3
|
| 483 |
+
TASK6_EXPECTED_CUSTOMER_COUNT = 3 # Pendant Corp row 6 excluded (empty email)
|
| 484 |
+
TASK6_EXPECTED_PRODUCT_COUNT = 5
|
| 485 |
+
TASK6_EXPECTED_SALES_COUNT = 11 # row 6 excluded
|
| 486 |
+
TASK6_EXPECTED_DATA_ISSUES_COUNT = 1
|
| 487 |
+
|
| 488 |
+
|
| 489 |
+
def seed_task6(conn: sqlite3.Connection) -> None:
|
| 490 |
+
"""Seed the database for Task 6: Multi-Entity Extraction."""
|
| 491 |
+
conn.executescript(TASK6_SOURCE_DDL)
|
| 492 |
+
conn.executemany(
|
| 493 |
+
"INSERT INTO sales_records (id, rep_name, rep_email, rep_region, "
|
| 494 |
+
"customer_name, customer_email, customer_tier, product_name, product_sku, "
|
| 495 |
+
"product_category, quantity, unit_price, discount_pct, sale_date) "
|
| 496 |
+
"VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)",
|
| 497 |
+
TASK6_SOURCE_DATA,
|
| 498 |
+
)
|
| 499 |
+
conn.commit()
|
| 500 |
+
|
| 501 |
+
|
| 502 |
+
# =============================================================================
|
| 503 |
+
# TASK 7: Dual-Source Consolidation (Hard)
|
| 504 |
+
# =============================================================================
|
| 505 |
+
# Agent must merge 6 source tables from two incompatible systems (Legacy CRM + Modern SaaS)
|
| 506 |
+
# into 4 unified target tables. Cross-system email dedup, currency coercion, orphan detection.
|
| 507 |
+
|
| 508 |
+
TASK7_LEGACY_CUSTOMERS_DDL = """
|
| 509 |
+
CREATE TABLE legacy_customers (
|
| 510 |
+
id INTEGER PRIMARY KEY,
|
| 511 |
+
full_name TEXT,
|
| 512 |
+
contact_email TEXT,
|
| 513 |
+
phone TEXT,
|
| 514 |
+
account_type TEXT,
|
| 515 |
+
join_date TEXT
|
| 516 |
+
);
|
| 517 |
+
"""
|
| 518 |
+
|
| 519 |
+
TASK7_LEGACY_ORDERS_DDL = """
|
| 520 |
+
CREATE TABLE legacy_orders (
|
| 521 |
+
id INTEGER PRIMARY KEY,
|
| 522 |
+
customer_id INTEGER,
|
| 523 |
+
product_code TEXT,
|
| 524 |
+
total_amount TEXT,
|
| 525 |
+
order_status TEXT,
|
| 526 |
+
order_date TEXT
|
| 527 |
+
);
|
| 528 |
+
"""
|
| 529 |
+
|
| 530 |
+
TASK7_LEGACY_PRODUCTS_DDL = """
|
| 531 |
+
CREATE TABLE legacy_products (
|
| 532 |
+
code TEXT PRIMARY KEY,
|
| 533 |
+
description TEXT,
|
| 534 |
+
unit_price TEXT
|
| 535 |
+
);
|
| 536 |
+
"""
|
| 537 |
+
|
| 538 |
+
TASK7_MODERN_USERS_DDL = """
|
| 539 |
+
CREATE TABLE modern_users (
|
| 540 |
+
uuid TEXT PRIMARY KEY,
|
| 541 |
+
display_name TEXT,
|
| 542 |
+
email_address TEXT,
|
| 543 |
+
subscription_tier INTEGER,
|
| 544 |
+
created_at TEXT
|
| 545 |
+
);
|
| 546 |
+
"""
|
| 547 |
+
|
| 548 |
+
TASK7_MODERN_TRANSACTIONS_DDL = """
|
| 549 |
+
CREATE TABLE modern_transactions (
|
| 550 |
+
id INTEGER PRIMARY KEY,
|
| 551 |
+
user_uuid TEXT,
|
| 552 |
+
item_sku TEXT,
|
| 553 |
+
amount REAL,
|
| 554 |
+
currency TEXT,
|
| 555 |
+
tx_status INTEGER,
|
| 556 |
+
created_at TEXT
|
| 557 |
+
);
|
| 558 |
+
"""
|
| 559 |
+
|
| 560 |
+
TASK7_MODERN_CATALOG_DDL = """
|
| 561 |
+
CREATE TABLE modern_catalog (
|
| 562 |
+
sku TEXT PRIMARY KEY,
|
| 563 |
+
title TEXT,
|
| 564 |
+
base_price REAL
|
| 565 |
+
);
|
| 566 |
+
"""
|
| 567 |
+
|
| 568 |
+
TASK7_LEGACY_CUSTOMERS_DATA = [
|
| 569 |
+
(1, "Alice Johnson", "alice@example.com", "+1-555-0101", "premium", "2021-03-15"),
|
| 570 |
+
(2, "Bob Chen", "bob@example.com", "+1-555-0102", "basic", "2022-07-01"),
|
| 571 |
+
(3, "Carol Davis", None, "+1-555-0103", "free", "2023-01-10"),
|
| 572 |
+
(4, "Dave Wilson", "dave@example.com", "+1-555-0104", "premium", "2021-11-20"),
|
| 573 |
+
(5, "Eve Martinez", "eve@example.com", "+1-555-0105", "free", "2023-06-05"),
|
| 574 |
+
]
|
| 575 |
+
|
| 576 |
+
TASK7_MODERN_USERS_DATA = [
|
| 577 |
+
("uuid-A1", "Alice J.", "alice@example.com", 3, "2021-03-15"),
|
| 578 |
+
("uuid-B2", "R. Bob Chen", "bob@example.com", 2, "2022-07-01"),
|
| 579 |
+
("uuid-F6", "Frank Lee", "frank@example.com", 4, "2022-09-30"),
|
| 580 |
+
("uuid-G7", "Grace Kim", "grace@example.com", 1, "2024-01-15"),
|
| 581 |
+
]
|
| 582 |
+
|
| 583 |
+
TASK7_LEGACY_ORDERS_DATA = [
|
| 584 |
+
(1, 1, "PROD-A", "$1,234.56", "delivered", "2022-01-10"),
|
| 585 |
+
(2, 2, "PROD-B", "$89.99", "shipped", "2022-03-15"),
|
| 586 |
+
(3, 4, "PROD-A", "$2,500.00", "delivered", "2022-05-20"),
|
| 587 |
+
(4, 3, "PROD-C", "$45.00", "pending", "2023-02-01"),
|
| 588 |
+
]
|
| 589 |
+
|
| 590 |
+
TASK7_LEGACY_PRODUCTS_DATA = [
|
| 591 |
+
("PROD-A", "Enterprise Widget", "$1,234.56"),
|
| 592 |
+
("PROD-B", "Basic Gadget", "$89.99"),
|
| 593 |
+
("PROD-C", "Starter Kit", "$45.00"),
|
| 594 |
+
]
|
| 595 |
+
|
| 596 |
+
TASK7_MODERN_TRANSACTIONS_DATA = [
|
| 597 |
+
(1, "uuid-A1", "SKU-001", 299.99, "USD", 3, "2023-06-01"),
|
| 598 |
+
(2, "uuid-B2", "SKU-002", 89.99, None, 2, "2023-07-15"),
|
| 599 |
+
(3, "uuid-F6", "SKU-001", 299.99, None, 3, "2023-08-20"),
|
| 600 |
+
(4, "uuid-DEAD", "SKU-003", 15.99, None, 1, "2023-09-01"), # orphan
|
| 601 |
+
(5, "uuid-G7", "SKU-002", 89.99, "USD", 4, "2023-10-10"),
|
| 602 |
+
(6, "uuid-A1", "SKU-003", 15.99, "EUR", 5, "2023-11-01"),
|
| 603 |
+
]
|
| 604 |
+
|
| 605 |
+
TASK7_MODERN_CATALOG_DATA = [
|
| 606 |
+
("SKU-001", "Pro Widget", 299.99),
|
| 607 |
+
("SKU-002", "Smart Gadget", 89.99),
|
| 608 |
+
("SKU-003", "Mini Accessory", 15.99),
|
| 609 |
+
]
|
| 610 |
+
|
| 611 |
+
TASK7_TARGET_DDL = """CREATE TABLE unified_customers (
|
| 612 |
+
id INTEGER PRIMARY KEY AUTOINCREMENT,
|
| 613 |
+
legacy_id INTEGER,
|
| 614 |
+
modern_uuid TEXT,
|
| 615 |
+
name TEXT,
|
| 616 |
+
email TEXT,
|
| 617 |
+
phone TEXT,
|
| 618 |
+
tier TEXT NOT NULL DEFAULT 'free',
|
| 619 |
+
source TEXT NOT NULL,
|
| 620 |
+
created_at TEXT
|
| 621 |
+
);
|
| 622 |
+
|
| 623 |
+
CREATE TABLE unified_products (
|
| 624 |
+
id INTEGER PRIMARY KEY AUTOINCREMENT,
|
| 625 |
+
code TEXT NOT NULL UNIQUE,
|
| 626 |
+
title TEXT NOT NULL,
|
| 627 |
+
price REAL NOT NULL,
|
| 628 |
+
source TEXT NOT NULL
|
| 629 |
+
);
|
| 630 |
+
|
| 631 |
+
CREATE TABLE unified_orders (
|
| 632 |
+
id INTEGER PRIMARY KEY AUTOINCREMENT,
|
| 633 |
+
customer_id INTEGER NOT NULL,
|
| 634 |
+
product_id INTEGER,
|
| 635 |
+
amount REAL NOT NULL,
|
| 636 |
+
currency TEXT NOT NULL DEFAULT 'USD',
|
| 637 |
+
status TEXT NOT NULL,
|
| 638 |
+
order_date TEXT,
|
| 639 |
+
source TEXT NOT NULL,
|
| 640 |
+
FOREIGN KEY (customer_id) REFERENCES unified_customers(id)
|
| 641 |
+
);
|
| 642 |
+
|
| 643 |
+
CREATE TABLE migration_issues (
|
| 644 |
+
id INTEGER PRIMARY KEY,
|
| 645 |
+
source_system TEXT NOT NULL,
|
| 646 |
+
source_table TEXT NOT NULL,
|
| 647 |
+
source_id TEXT NOT NULL,
|
| 648 |
+
issue_type TEXT NOT NULL,
|
| 649 |
+
resolution TEXT NOT NULL
|
| 650 |
+
);"""
|
| 651 |
+
|
| 652 |
+
TASK7_EXPECTED_UNIFIED_CUSTOMERS = 7
|
| 653 |
+
TASK7_EXPECTED_BOTH_SOURCE_COUNT = 2
|
| 654 |
+
TASK7_EXPECTED_UNIFIED_ORDERS = 9
|
| 655 |
+
TASK7_EXPECTED_MIGRATION_ISSUES = 2
|
| 656 |
+
|
| 657 |
+
# Tier mapping: 1→'free', 2→'basic', 3→'premium', 4→'enterprise'
|
| 658 |
+
TASK7_TIER_MAP = {1: "free", 2: "basic", 3: "premium", 4: "enterprise"}
|
| 659 |
+
# Status mapping: 1→'pending', 2→'processing', 3→'complete', 4→'failed', 5→'refunded'
|
| 660 |
+
TASK7_STATUS_MAP = {1: "pending", 2: "processing", 3: "complete", 4: "failed", 5: "refunded"}
|
| 661 |
+
|
| 662 |
+
|
| 663 |
+
def seed_task7(conn: sqlite3.Connection) -> None:
|
| 664 |
+
"""Seed the database for Task 7: Dual-Source Consolidation."""
|
| 665 |
+
conn.executescript(TASK7_LEGACY_CUSTOMERS_DDL)
|
| 666 |
+
conn.executescript(TASK7_LEGACY_ORDERS_DDL)
|
| 667 |
+
conn.executescript(TASK7_LEGACY_PRODUCTS_DDL)
|
| 668 |
+
conn.executescript(TASK7_MODERN_USERS_DDL)
|
| 669 |
+
conn.executescript(TASK7_MODERN_TRANSACTIONS_DDL)
|
| 670 |
+
conn.executescript(TASK7_MODERN_CATALOG_DDL)
|
| 671 |
+
|
| 672 |
+
conn.executemany("INSERT INTO legacy_customers VALUES (?, ?, ?, ?, ?, ?)", TASK7_LEGACY_CUSTOMERS_DATA)
|
| 673 |
+
conn.executemany("INSERT INTO legacy_orders VALUES (?, ?, ?, ?, ?, ?)", TASK7_LEGACY_ORDERS_DATA)
|
| 674 |
+
conn.executemany("INSERT INTO legacy_products VALUES (?, ?, ?)", TASK7_LEGACY_PRODUCTS_DATA)
|
| 675 |
+
conn.executemany("INSERT INTO modern_users VALUES (?, ?, ?, ?, ?)", TASK7_MODERN_USERS_DATA)
|
| 676 |
+
conn.executemany("INSERT INTO modern_transactions VALUES (?, ?, ?, ?, ?, ?, ?)", TASK7_MODERN_TRANSACTIONS_DATA)
|
| 677 |
+
conn.executemany("INSERT INTO modern_catalog VALUES (?, ?, ?)", TASK7_MODERN_CATALOG_DATA)
|
| 678 |
+
conn.commit()
|
| 679 |
+
|
| 680 |
+
|
| 681 |
# =============================================================================
|
| 682 |
# Task Registry
|
| 683 |
# =============================================================================
|
|
|
|
| 688 |
"target_ddl": TASK1_TARGET_DDL,
|
| 689 |
"description": "Merge first_name and last_name into a single full_name column without data loss",
|
| 690 |
"difficulty": "easy",
|
| 691 |
+
"max_steps": 10,
|
| 692 |
+
},
|
| 693 |
+
"soft-delete-restoration": {
|
| 694 |
+
"seed_fn": seed_task4,
|
| 695 |
+
"target_ddl": TASK4_TARGET_DDL,
|
| 696 |
+
"description": "Restore deleted products from deletion_log, add is_deleted/deleted_at columns",
|
| 697 |
+
"difficulty": "easy",
|
| 698 |
+
"max_steps": 10,
|
| 699 |
},
|
| 700 |
"table-normalization": {
|
| 701 |
"seed_fn": seed_task2,
|
| 702 |
"target_ddl": TASK2_TARGET_DDL,
|
| 703 |
"description": "Decompose a flat purchases table into normalized customers and orders tables with FK",
|
| 704 |
"difficulty": "medium",
|
| 705 |
+
"max_steps": 15,
|
| 706 |
+
},
|
| 707 |
+
"schema-version-merge": {
|
| 708 |
+
"seed_fn": seed_task5,
|
| 709 |
+
"target_ddl": TASK5_TARGET_DDL,
|
| 710 |
+
"description": "Merge overlapping v1/v2 product tables with price coercion and conflict resolution",
|
| 711 |
+
"difficulty": "medium",
|
| 712 |
+
"max_steps": 15,
|
| 713 |
+
},
|
| 714 |
+
"multi-entity-extraction": {
|
| 715 |
+
"seed_fn": seed_task6,
|
| 716 |
+
"target_ddl": TASK6_TARGET_DDL,
|
| 717 |
+
"description": "Decompose a sales god-table into 3NF with 3 FKs and invalid data routing",
|
| 718 |
+
"difficulty": "medium",
|
| 719 |
+
"max_steps": 15,
|
| 720 |
},
|
| 721 |
"cascade-migration": {
|
| 722 |
"seed_fn": seed_task3,
|
| 723 |
"target_ddl": TASK3_TARGET_DDL,
|
| 724 |
"description": "Multi-table FK cascade with type coercion, NULL handling, and orphan audit logging",
|
| 725 |
"difficulty": "hard",
|
| 726 |
+
"max_steps": 20,
|
| 727 |
+
},
|
| 728 |
+
"dual-source-consolidation": {
|
| 729 |
+
"seed_fn": seed_task7,
|
| 730 |
+
"target_ddl": TASK7_TARGET_DDL,
|
| 731 |
+
"description": "Merge 6 tables from two incompatible systems into 4 unified tables with cross-system dedup",
|
| 732 |
+
"difficulty": "hard",
|
| 733 |
+
"max_steps": 20,
|
| 734 |
},
|
| 735 |
}
|
| 736 |
+
|
server/__pycache__/environment.cpython-312.pyc
CHANGED
|
Binary files a/server/__pycache__/environment.cpython-312.pyc and b/server/__pycache__/environment.cpython-312.pyc differ
|
|
|
server/__pycache__/grader.cpython-312.pyc
CHANGED
|
Binary files a/server/__pycache__/grader.cpython-312.pyc and b/server/__pycache__/grader.cpython-312.pyc differ
|
|
|
server/app.py
CHANGED
|
@@ -61,35 +61,40 @@ async def root():
|
|
| 61 |
return """<!DOCTYPE html>
|
| 62 |
<html>
|
| 63 |
<head>
|
| 64 |
-
<title>SQL Migration Agent
|
| 65 |
<style>
|
| 66 |
body { font-family: monospace; background: #0d1117; color: #e6edf3; padding: 40px; }
|
| 67 |
h1 { color: #58a6ff; } h2 { color: #79c0ff; }
|
| 68 |
.ok { color: #3fb950; } .endpoint { color: #d2a8ff; }
|
| 69 |
pre { background: #161b22; padding: 12px; border-radius: 6px; }
|
| 70 |
a { color: #58a6ff; }
|
|
|
|
| 71 |
</style>
|
| 72 |
</head>
|
| 73 |
<body>
|
| 74 |
-
<h1>
|
| 75 |
-
<p class="ok">
|
| 76 |
<h2>API Endpoints</h2>
|
| 77 |
<pre>
|
| 78 |
-
<span class="endpoint">POST /reset</span>
|
| 79 |
-
<span class="endpoint">POST /step</span>
|
| 80 |
-
<span class="endpoint">GET /state</span>
|
| 81 |
-
<span class="endpoint">GET /tasks</span>
|
| 82 |
-
<span class="endpoint">POST /grader</span>
|
| 83 |
-
<span class="endpoint">GET /health</span>
|
| 84 |
-
<span class="endpoint">GET /docs</span>
|
| 85 |
</pre>
|
| 86 |
-
<h2>Tasks</h2>
|
| 87 |
<pre>
|
| 88 |
-
1. column-restructure
|
| 89 |
-
2.
|
| 90 |
-
3.
|
|
|
|
|
|
|
|
|
|
|
|
|
| 91 |
</pre>
|
| 92 |
-
<p><a href="/docs">
|
| 93 |
</body>
|
| 94 |
</html>"""
|
| 95 |
|
|
@@ -101,31 +106,27 @@ async def list_tasks() -> Dict[str, Any]:
|
|
| 101 |
|
| 102 |
Returns JSON with task definitions and action schema for automated validation.
|
| 103 |
"""
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 104 |
return {
|
| 105 |
-
"tasks":
|
| 106 |
-
{
|
| 107 |
-
"name": "column-restructure",
|
| 108 |
-
"description": "Merge first_name and last_name into a single full_name column without data loss",
|
| 109 |
-
"difficulty": "easy",
|
| 110 |
-
"max_steps": 20,
|
| 111 |
-
},
|
| 112 |
-
{
|
| 113 |
-
"name": "table-normalization",
|
| 114 |
-
"description": "Decompose a flat purchases table into normalized customers and orders tables with FK",
|
| 115 |
-
"difficulty": "medium",
|
| 116 |
-
"max_steps": 20,
|
| 117 |
-
},
|
| 118 |
-
{
|
| 119 |
-
"name": "cascade-migration",
|
| 120 |
-
"description": "Multi-table FK cascade with type coercion, NULL handling, and orphan audit logging",
|
| 121 |
-
"difficulty": "hard",
|
| 122 |
-
"max_steps": 20,
|
| 123 |
-
},
|
| 124 |
-
],
|
| 125 |
"action_schema": {
|
| 126 |
-
"sql_command": "string
|
| 127 |
-
"reasoning": "string
|
| 128 |
-
"submit_final": "boolean
|
| 129 |
},
|
| 130 |
}
|
| 131 |
|
|
@@ -141,25 +142,29 @@ async def grade_task(
|
|
| 141 |
Returns per-task grader scores after running the environment's internal scorer.
|
| 142 |
"""
|
| 143 |
task_name = body.get("task_name", None)
|
| 144 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 145 |
|
| 146 |
results = {}
|
| 147 |
for t in tasks_to_grade:
|
| 148 |
try:
|
| 149 |
env = DbMigrationEnvironment(task_name=t)
|
| 150 |
obs = env.reset()
|
| 151 |
-
# Return the initial score (before any agent action)
|
| 152 |
-
# This proves the grader works and returns values in [0.0, 1.0]
|
| 153 |
results[t] = {
|
| 154 |
-
"initial_score":
|
| 155 |
"grader_functional": True,
|
| 156 |
-
"reward_range": [0.
|
| 157 |
-
"max_steps": 20,
|
| 158 |
}
|
| 159 |
env.close()
|
| 160 |
except Exception as e:
|
| 161 |
results[t] = {
|
| 162 |
-
"initial_score": 0.
|
| 163 |
"grader_functional": False,
|
| 164 |
"error": str(e),
|
| 165 |
}
|
|
|
|
| 61 |
return """<!DOCTYPE html>
|
| 62 |
<html>
|
| 63 |
<head>
|
| 64 |
+
<title>SQL Migration Agent -- OpenEnv</title>
|
| 65 |
<style>
|
| 66 |
body { font-family: monospace; background: #0d1117; color: #e6edf3; padding: 40px; }
|
| 67 |
h1 { color: #58a6ff; } h2 { color: #79c0ff; }
|
| 68 |
.ok { color: #3fb950; } .endpoint { color: #d2a8ff; }
|
| 69 |
pre { background: #161b22; padding: 12px; border-radius: 6px; }
|
| 70 |
a { color: #58a6ff; }
|
| 71 |
+
.easy { color: #3fb950; } .medium { color: #d29922; } .hard { color: #f85149; }
|
| 72 |
</style>
|
| 73 |
</head>
|
| 74 |
<body>
|
| 75 |
+
<h1>SQL Schema Migration Agent</h1>
|
| 76 |
+
<p class="ok">Server running -- OpenEnv hackathon environment (7 tasks)</p>
|
| 77 |
<h2>API Endpoints</h2>
|
| 78 |
<pre>
|
| 79 |
+
<span class="endpoint">POST /reset</span> -- Start a new migration episode
|
| 80 |
+
<span class="endpoint">POST /step</span> -- Execute a SQL action
|
| 81 |
+
<span class="endpoint">GET /state</span> -- Current environment state
|
| 82 |
+
<span class="endpoint">GET /tasks</span> -- List all 7 tasks
|
| 83 |
+
<span class="endpoint">POST /grader</span> -- Run grader on all tasks
|
| 84 |
+
<span class="endpoint">GET /health</span> -- Health check
|
| 85 |
+
<span class="endpoint">GET /docs</span> -- Interactive API documentation
|
| 86 |
</pre>
|
| 87 |
+
<h2>Tasks (2 Easy / 3 Medium / 2 Hard)</h2>
|
| 88 |
<pre>
|
| 89 |
+
<span class="easy">1. column-restructure (Easy) -- Merge first_name + last_name -> full_name</span>
|
| 90 |
+
<span class="easy">2. soft-delete-restoration (Easy) -- Restore deleted products from deletion_log</span>
|
| 91 |
+
<span class="medium">3. table-normalization (Medium) -- Normalize purchases -> customers + orders + FK</span>
|
| 92 |
+
<span class="medium">4. schema-version-merge (Medium) -- Merge v1/v2 product tables with coercion</span>
|
| 93 |
+
<span class="medium">5. multi-entity-extraction (Medium) -- 3NF decomposition with invalid data routing</span>
|
| 94 |
+
<span class="hard">6. cascade-migration (Hard) -- 4-table FK cascade, type coercion, orphan audit</span>
|
| 95 |
+
<span class="hard">7. dual-source-consolidation(Hard) -- 6->4 table merge, cross-system email dedup</span>
|
| 96 |
</pre>
|
| 97 |
+
<p><a href="/docs">Open API Docs</a> | <a href="/tasks">View Tasks</a> | <a href="/health">Health Check</a></p>
|
| 98 |
</body>
|
| 99 |
</html>"""
|
| 100 |
|
|
|
|
| 106 |
|
| 107 |
Returns JSON with task definitions and action schema for automated validation.
|
| 108 |
"""
|
| 109 |
+
# Import seeds to dynamically build task list
|
| 110 |
+
try:
|
| 111 |
+
from .. import seeds as _seeds
|
| 112 |
+
except ImportError:
|
| 113 |
+
import seeds as _seeds
|
| 114 |
+
|
| 115 |
+
task_list = []
|
| 116 |
+
for name, cfg in _seeds.TASKS.items():
|
| 117 |
+
task_list.append({
|
| 118 |
+
"name": name,
|
| 119 |
+
"description": cfg["description"],
|
| 120 |
+
"difficulty": cfg["difficulty"],
|
| 121 |
+
"max_steps": cfg.get("max_steps", 20),
|
| 122 |
+
})
|
| 123 |
+
|
| 124 |
return {
|
| 125 |
+
"tasks": task_list,
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 126 |
"action_schema": {
|
| 127 |
+
"sql_command": "string -- The SQL statement to execute",
|
| 128 |
+
"reasoning": "string -- Explanation of the action (optional)",
|
| 129 |
+
"submit_final": "boolean -- Set true when migration is complete (default: false)",
|
| 130 |
},
|
| 131 |
}
|
| 132 |
|
|
|
|
| 142 |
Returns per-task grader scores after running the environment's internal scorer.
|
| 143 |
"""
|
| 144 |
task_name = body.get("task_name", None)
|
| 145 |
+
|
| 146 |
+
try:
|
| 147 |
+
from .. import seeds as _seeds
|
| 148 |
+
except ImportError:
|
| 149 |
+
import seeds as _seeds
|
| 150 |
+
|
| 151 |
+
tasks_to_grade = [task_name] if task_name else list(_seeds.TASKS.keys())
|
| 152 |
|
| 153 |
results = {}
|
| 154 |
for t in tasks_to_grade:
|
| 155 |
try:
|
| 156 |
env = DbMigrationEnvironment(task_name=t)
|
| 157 |
obs = env.reset()
|
|
|
|
|
|
|
| 158 |
results[t] = {
|
| 159 |
+
"initial_score": obs.migration_progress,
|
| 160 |
"grader_functional": True,
|
| 161 |
+
"reward_range": [0.01, 0.99],
|
| 162 |
+
"max_steps": _seeds.TASKS[t].get("max_steps", 20),
|
| 163 |
}
|
| 164 |
env.close()
|
| 165 |
except Exception as e:
|
| 166 |
results[t] = {
|
| 167 |
+
"initial_score": 0.01,
|
| 168 |
"grader_functional": False,
|
| 169 |
"error": str(e),
|
| 170 |
}
|
server/environment.py
CHANGED
|
@@ -71,7 +71,8 @@ class DbMigrationEnvironment(Environment):
|
|
| 71 |
return ""
|
| 72 |
try:
|
| 73 |
cursor = self._conn.execute(
|
| 74 |
-
"SELECT sql FROM sqlite_master WHERE type='table'
|
|
|
|
| 75 |
)
|
| 76 |
schemas = [row[0] for row in cursor.fetchall()]
|
| 77 |
return ";\n\n".join(schemas) + ";" if schemas else ""
|
|
@@ -186,6 +187,17 @@ class DbMigrationEnvironment(Environment):
|
|
| 186 |
self._conn.commit()
|
| 187 |
rows_affected = cursor.rowcount
|
| 188 |
execution_result = f"Success: {rows_affected} rows affected"
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 189 |
except Exception as e:
|
| 190 |
# Never crash — feed the error back to the agent
|
| 191 |
execution_result = str(e)
|
|
@@ -199,8 +211,9 @@ class DbMigrationEnvironment(Environment):
|
|
| 199 |
# Compute scores
|
| 200 |
current_score, step_reward = self._reconciler.compute_step_reward(self._conn)
|
| 201 |
|
| 202 |
-
# Episode termination: submit_final, max steps
|
| 203 |
-
|
|
|
|
| 204 |
|
| 205 |
# Update state
|
| 206 |
self._state.step_count = self._step_count
|
|
|
|
| 71 |
return ""
|
| 72 |
try:
|
| 73 |
cursor = self._conn.execute(
|
| 74 |
+
"SELECT sql FROM sqlite_master WHERE type='table' "
|
| 75 |
+
"AND sql IS NOT NULL AND name NOT LIKE 'sqlite_%' ORDER BY name"
|
| 76 |
)
|
| 77 |
schemas = [row[0] for row in cursor.fetchall()]
|
| 78 |
return ";\n\n".join(schemas) + ";" if schemas else ""
|
|
|
|
| 187 |
self._conn.commit()
|
| 188 |
rows_affected = cursor.rowcount
|
| 189 |
execution_result = f"Success: {rows_affected} rows affected"
|
| 190 |
+
except sqlite3.Warning as e:
|
| 191 |
+
# Multi-statement attempt — agent tried to combine statements
|
| 192 |
+
execution_result = (
|
| 193 |
+
f"Error: SQLite requires one statement per step. "
|
| 194 |
+
f"Split your commands into separate steps. Original error: {e}"
|
| 195 |
+
)
|
| 196 |
+
action_error = "multi_statement"
|
| 197 |
+
try:
|
| 198 |
+
self._conn.rollback()
|
| 199 |
+
except Exception:
|
| 200 |
+
pass
|
| 201 |
except Exception as e:
|
| 202 |
# Never crash — feed the error back to the agent
|
| 203 |
execution_result = str(e)
|
|
|
|
| 211 |
# Compute scores
|
| 212 |
current_score, step_reward = self._reconciler.compute_step_reward(self._conn)
|
| 213 |
|
| 214 |
+
# Episode termination: submit_final, max steps, OR perfect score
|
| 215 |
+
task_max = self._task_config.get("max_steps", 20)
|
| 216 |
+
done = action.submit_final or self._step_count >= task_max or current_score >= 0.99
|
| 217 |
|
| 218 |
# Update state
|
| 219 |
self._state.step_count = self._step_count
|
server/grader.py
CHANGED
|
@@ -29,6 +29,22 @@ from seeds import (
|
|
| 29 |
TASK3_EXPECTED_AUDIT_ENTRIES,
|
| 30 |
TASK3_EXPECTED_EMPLOYEE_COUNT,
|
| 31 |
TASK3_EXPECTED_SALARIES,
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 32 |
)
|
| 33 |
|
| 34 |
|
|
@@ -36,7 +52,8 @@ def _get_table_names(conn: sqlite3.Connection) -> Set[str]:
|
|
| 36 |
"""Get all table names in the database."""
|
| 37 |
try:
|
| 38 |
cursor = conn.execute(
|
| 39 |
-
"SELECT name FROM sqlite_master WHERE type='table'
|
|
|
|
| 40 |
)
|
| 41 |
return {row[0] for row in cursor.fetchall()}
|
| 42 |
except Exception:
|
|
@@ -98,10 +115,18 @@ class StateReconciler:
|
|
| 98 |
return self._score_task2(conn)
|
| 99 |
elif self.task_name == "cascade-migration":
|
| 100 |
return self._score_task3(conn)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 101 |
else:
|
| 102 |
-
return 0.
|
| 103 |
except Exception:
|
| 104 |
-
return 0.
|
| 105 |
|
| 106 |
def compute_step_reward(self, conn: sqlite3.Connection) -> Tuple[float, float]:
|
| 107 |
"""
|
|
@@ -173,6 +198,11 @@ class StateReconciler:
|
|
| 173 |
# order_count=0.2, no_null_ids=0.1, integrity=0.2
|
| 174 |
|
| 175 |
def _score_task2(self, conn: sqlite3.Connection) -> float:
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 176 |
score = 0.0
|
| 177 |
tables = _get_table_names(conn)
|
| 178 |
|
|
@@ -242,6 +272,11 @@ class StateReconciler:
|
|
| 242 |
# Total max = 0.90 for all grader checks + 0.10 integrity = 1.00
|
| 243 |
|
| 244 |
def _score_task3(self, conn: sqlite3.Connection) -> float:
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 245 |
score = 0.0
|
| 246 |
tables = _get_table_names(conn)
|
| 247 |
|
|
@@ -351,3 +386,371 @@ class StateReconciler:
|
|
| 351 |
score = min(score, 0.1)
|
| 352 |
|
| 353 |
return max(0.01, min(0.99, score))
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 29 |
TASK3_EXPECTED_AUDIT_ENTRIES,
|
| 30 |
TASK3_EXPECTED_EMPLOYEE_COUNT,
|
| 31 |
TASK3_EXPECTED_SALARIES,
|
| 32 |
+
TASK4_EXPECTED_ROW_COUNT,
|
| 33 |
+
TASK4_EXPECTED_ID_SUM,
|
| 34 |
+
TASK4_EXPECTED_DELETED_COUNT,
|
| 35 |
+
TASK4_EXPECTED_ACTIVE_COUNT,
|
| 36 |
+
TASK5_EXPECTED_ROW_COUNT,
|
| 37 |
+
TASK5_EXPECTED_PRICE_SUM,
|
| 38 |
+
TASK5_EXPECTED_BOTH_COUNT,
|
| 39 |
+
TASK6_EXPECTED_SALESPERSON_COUNT,
|
| 40 |
+
TASK6_EXPECTED_CUSTOMER_COUNT,
|
| 41 |
+
TASK6_EXPECTED_PRODUCT_COUNT,
|
| 42 |
+
TASK6_EXPECTED_SALES_COUNT,
|
| 43 |
+
TASK6_EXPECTED_DATA_ISSUES_COUNT,
|
| 44 |
+
TASK7_EXPECTED_UNIFIED_CUSTOMERS,
|
| 45 |
+
TASK7_EXPECTED_BOTH_SOURCE_COUNT,
|
| 46 |
+
TASK7_EXPECTED_UNIFIED_ORDERS,
|
| 47 |
+
TASK7_EXPECTED_MIGRATION_ISSUES,
|
| 48 |
)
|
| 49 |
|
| 50 |
|
|
|
|
| 52 |
"""Get all table names in the database."""
|
| 53 |
try:
|
| 54 |
cursor = conn.execute(
|
| 55 |
+
"SELECT name FROM sqlite_master WHERE type='table' "
|
| 56 |
+
"AND name NOT LIKE 'sqlite_%' ORDER BY name"
|
| 57 |
)
|
| 58 |
return {row[0] for row in cursor.fetchall()}
|
| 59 |
except Exception:
|
|
|
|
| 115 |
return self._score_task2(conn)
|
| 116 |
elif self.task_name == "cascade-migration":
|
| 117 |
return self._score_task3(conn)
|
| 118 |
+
elif self.task_name == "soft-delete-restoration":
|
| 119 |
+
return self._score_task4(conn)
|
| 120 |
+
elif self.task_name == "schema-version-merge":
|
| 121 |
+
return self._score_task5(conn)
|
| 122 |
+
elif self.task_name == "multi-entity-extraction":
|
| 123 |
+
return self._score_task6(conn)
|
| 124 |
+
elif self.task_name == "dual-source-consolidation":
|
| 125 |
+
return self._score_task7(conn)
|
| 126 |
else:
|
| 127 |
+
return 0.01
|
| 128 |
except Exception:
|
| 129 |
+
return 0.01
|
| 130 |
|
| 131 |
def compute_step_reward(self, conn: sqlite3.Connection) -> Tuple[float, float]:
|
| 132 |
"""
|
|
|
|
| 198 |
# order_count=0.2, no_null_ids=0.1, integrity=0.2
|
| 199 |
|
| 200 |
def _score_task2(self, conn: sqlite3.Connection) -> float:
|
| 201 |
+
# Re-assert FK enforcement to prevent PRAGMA bypass exploit
|
| 202 |
+
try:
|
| 203 |
+
conn.execute("PRAGMA foreign_keys = ON")
|
| 204 |
+
except Exception:
|
| 205 |
+
pass
|
| 206 |
score = 0.0
|
| 207 |
tables = _get_table_names(conn)
|
| 208 |
|
|
|
|
| 272 |
# Total max = 0.90 for all grader checks + 0.10 integrity = 1.00
|
| 273 |
|
| 274 |
def _score_task3(self, conn: sqlite3.Connection) -> float:
|
| 275 |
+
# Re-assert FK enforcement to prevent PRAGMA bypass exploit
|
| 276 |
+
try:
|
| 277 |
+
conn.execute("PRAGMA foreign_keys = ON")
|
| 278 |
+
except Exception:
|
| 279 |
+
pass
|
| 280 |
score = 0.0
|
| 281 |
tables = _get_table_names(conn)
|
| 282 |
|
|
|
|
| 386 |
score = min(score, 0.1)
|
| 387 |
|
| 388 |
return max(0.01, min(0.99, score))
|
| 389 |
+
|
| 390 |
+
# =========================================================================
|
| 391 |
+
# Task 4: Soft-Delete Restoration (Easy)
|
| 392 |
+
# =========================================================================
|
| 393 |
+
|
| 394 |
+
def _score_task4(self, conn: sqlite3.Connection) -> float:
|
| 395 |
+
score = 0.0
|
| 396 |
+
tables = _get_table_names(conn)
|
| 397 |
+
|
| 398 |
+
if "products" not in tables:
|
| 399 |
+
return 0.01
|
| 400 |
+
|
| 401 |
+
cols = _get_column_names(conn, "products")
|
| 402 |
+
|
| 403 |
+
# is_deleted column exists (+0.15)
|
| 404 |
+
if "is_deleted" in cols:
|
| 405 |
+
score += 0.15
|
| 406 |
+
|
| 407 |
+
# deleted_at column exists (+0.10)
|
| 408 |
+
if "deleted_at" in cols:
|
| 409 |
+
score += 0.10
|
| 410 |
+
|
| 411 |
+
# Row count = 8 (+0.20)
|
| 412 |
+
row_count = _get_row_count(conn, "products")
|
| 413 |
+
if row_count == TASK4_EXPECTED_ROW_COUNT:
|
| 414 |
+
score += 0.20
|
| 415 |
+
|
| 416 |
+
# Active products: is_deleted=0, deleted_at IS NULL (+0.25)
|
| 417 |
+
if "is_deleted" in cols:
|
| 418 |
+
try:
|
| 419 |
+
cursor = conn.execute(
|
| 420 |
+
"SELECT COUNT(*) FROM products WHERE is_deleted = 0 AND deleted_at IS NULL"
|
| 421 |
+
)
|
| 422 |
+
active = cursor.fetchone()[0]
|
| 423 |
+
if active == TASK4_EXPECTED_ACTIVE_COUNT:
|
| 424 |
+
score += 0.25
|
| 425 |
+
except Exception:
|
| 426 |
+
pass
|
| 427 |
+
|
| 428 |
+
# Restored products: is_deleted=1, deleted_at IS NOT NULL (+0.20)
|
| 429 |
+
if "is_deleted" in cols:
|
| 430 |
+
try:
|
| 431 |
+
cursor = conn.execute(
|
| 432 |
+
"SELECT COUNT(*) FROM products WHERE is_deleted = 1 AND deleted_at IS NOT NULL"
|
| 433 |
+
)
|
| 434 |
+
restored = cursor.fetchone()[0]
|
| 435 |
+
if restored == TASK4_EXPECTED_DELETED_COUNT:
|
| 436 |
+
score += 0.20
|
| 437 |
+
except Exception:
|
| 438 |
+
pass
|
| 439 |
+
|
| 440 |
+
# SUM(id) fingerprint = 36 — no phantom rows (+0.10)
|
| 441 |
+
try:
|
| 442 |
+
cursor = conn.execute("SELECT SUM(id) FROM products")
|
| 443 |
+
id_sum = cursor.fetchone()[0]
|
| 444 |
+
if id_sum == TASK4_EXPECTED_ID_SUM:
|
| 445 |
+
score += 0.10
|
| 446 |
+
except Exception:
|
| 447 |
+
pass
|
| 448 |
+
|
| 449 |
+
# Exploit check
|
| 450 |
+
if row_count == 0:
|
| 451 |
+
score = min(score, 0.1)
|
| 452 |
+
|
| 453 |
+
return max(0.01, min(0.99, score))
|
| 454 |
+
|
| 455 |
+
# =========================================================================
|
| 456 |
+
# Task 5: Schema Version Merge (Medium)
|
| 457 |
+
# =========================================================================
|
| 458 |
+
|
| 459 |
+
def _score_task5(self, conn: sqlite3.Connection) -> float:
|
| 460 |
+
# Re-assert FK enforcement
|
| 461 |
+
try:
|
| 462 |
+
conn.execute("PRAGMA foreign_keys = ON")
|
| 463 |
+
except Exception:
|
| 464 |
+
pass
|
| 465 |
+
score = 0.0
|
| 466 |
+
tables = _get_table_names(conn)
|
| 467 |
+
|
| 468 |
+
if "products" not in tables:
|
| 469 |
+
return 0.01
|
| 470 |
+
|
| 471 |
+
cols = _get_column_names(conn, "products")
|
| 472 |
+
|
| 473 |
+
# Schema completeness: all 8 columns (+0.10)
|
| 474 |
+
expected_cols = {"id", "name", "price", "category", "supplier", "brand", "sku", "source"}
|
| 475 |
+
if expected_cols.issubset(cols):
|
| 476 |
+
score += 0.10
|
| 477 |
+
|
| 478 |
+
# Row count = 9 (+0.15)
|
| 479 |
+
row_count = _get_row_count(conn, "products")
|
| 480 |
+
if row_count == TASK5_EXPECTED_ROW_COUNT:
|
| 481 |
+
score += 0.15
|
| 482 |
+
|
| 483 |
+
# PRICE_SUM fingerprint (+0.20)
|
| 484 |
+
try:
|
| 485 |
+
cursor = conn.execute("SELECT ROUND(SUM(price), 2) FROM products")
|
| 486 |
+
price_sum = cursor.fetchone()[0]
|
| 487 |
+
if price_sum is not None and abs(price_sum - TASK5_EXPECTED_PRICE_SUM) < 0.02:
|
| 488 |
+
score += 0.20
|
| 489 |
+
except Exception:
|
| 490 |
+
pass
|
| 491 |
+
|
| 492 |
+
# source='both' for conflicted ids 1,2 (+0.15)
|
| 493 |
+
if "source" in cols:
|
| 494 |
+
try:
|
| 495 |
+
cursor = conn.execute(
|
| 496 |
+
"SELECT COUNT(*) FROM products WHERE source = 'both'"
|
| 497 |
+
)
|
| 498 |
+
both_count = cursor.fetchone()[0]
|
| 499 |
+
if both_count == TASK5_EXPECTED_BOTH_COUNT:
|
| 500 |
+
score += 0.15
|
| 501 |
+
except Exception:
|
| 502 |
+
pass
|
| 503 |
+
|
| 504 |
+
# v2 name wins for conflicted rows (+0.15)
|
| 505 |
+
try:
|
| 506 |
+
cursor = conn.execute("SELECT name FROM products WHERE id = 2")
|
| 507 |
+
row = cursor.fetchone()
|
| 508 |
+
if row and "Updated" in row[0]:
|
| 509 |
+
score += 0.15
|
| 510 |
+
except Exception:
|
| 511 |
+
pass
|
| 512 |
+
|
| 513 |
+
# No NULL prices (+0.10)
|
| 514 |
+
try:
|
| 515 |
+
cursor = conn.execute("SELECT COUNT(*) FROM products WHERE price IS NULL")
|
| 516 |
+
null_count = cursor.fetchone()[0]
|
| 517 |
+
if null_count == 0:
|
| 518 |
+
score += 0.10
|
| 519 |
+
except Exception:
|
| 520 |
+
pass
|
| 521 |
+
|
| 522 |
+
# PRAGMA integrity_check (+0.15)
|
| 523 |
+
try:
|
| 524 |
+
cursor = conn.execute("PRAGMA integrity_check")
|
| 525 |
+
result = cursor.fetchone()[0]
|
| 526 |
+
if result == "ok":
|
| 527 |
+
score += 0.15
|
| 528 |
+
except Exception:
|
| 529 |
+
pass
|
| 530 |
+
|
| 531 |
+
# Exploit check
|
| 532 |
+
if row_count == 0:
|
| 533 |
+
score = min(score, 0.1)
|
| 534 |
+
|
| 535 |
+
return max(0.01, min(0.99, score))
|
| 536 |
+
|
| 537 |
+
# =========================================================================
|
| 538 |
+
# Task 6: Multi-Entity Extraction (Medium — Hard End)
|
| 539 |
+
# =========================================================================
|
| 540 |
+
|
| 541 |
+
def _score_task6(self, conn: sqlite3.Connection) -> float:
|
| 542 |
+
# Re-assert FK enforcement
|
| 543 |
+
try:
|
| 544 |
+
conn.execute("PRAGMA foreign_keys = ON")
|
| 545 |
+
except Exception:
|
| 546 |
+
pass
|
| 547 |
+
score = 0.0
|
| 548 |
+
tables = _get_table_names(conn)
|
| 549 |
+
|
| 550 |
+
# All 5 tables exist (+0.10)
|
| 551 |
+
required = {"salespersons", "customers", "products", "sales", "data_issues"}
|
| 552 |
+
if required.issubset(tables):
|
| 553 |
+
score += 0.10
|
| 554 |
+
|
| 555 |
+
# salesperson count = 3 (+0.10)
|
| 556 |
+
if "salespersons" in tables:
|
| 557 |
+
count = _get_row_count(conn, "salespersons")
|
| 558 |
+
if count == TASK6_EXPECTED_SALESPERSON_COUNT:
|
| 559 |
+
score += 0.10
|
| 560 |
+
|
| 561 |
+
# customer count = 3 (invalid excluded) (+0.12)
|
| 562 |
+
if "customers" in tables:
|
| 563 |
+
count = _get_row_count(conn, "customers")
|
| 564 |
+
if count == TASK6_EXPECTED_CUSTOMER_COUNT:
|
| 565 |
+
score += 0.12
|
| 566 |
+
|
| 567 |
+
# product count = 5 (+0.10)
|
| 568 |
+
if "products" in tables:
|
| 569 |
+
count = _get_row_count(conn, "products")
|
| 570 |
+
if count == TASK6_EXPECTED_PRODUCT_COUNT:
|
| 571 |
+
score += 0.10
|
| 572 |
+
|
| 573 |
+
# sales count = 11 (bad row excluded) (+0.12)
|
| 574 |
+
if "sales" in tables:
|
| 575 |
+
count = _get_row_count(conn, "sales")
|
| 576 |
+
if count == TASK6_EXPECTED_SALES_COUNT:
|
| 577 |
+
score += 0.12
|
| 578 |
+
|
| 579 |
+
# All 3 FKs present in sales (+0.15)
|
| 580 |
+
if "sales" in tables:
|
| 581 |
+
fk_count = 0
|
| 582 |
+
if _has_foreign_key(conn, "sales", "salespersons"): fk_count += 1
|
| 583 |
+
if _has_foreign_key(conn, "sales", "customers"): fk_count += 1
|
| 584 |
+
if _has_foreign_key(conn, "sales", "products"): fk_count += 1
|
| 585 |
+
score += 0.05 * fk_count # 0.15 total for all 3
|
| 586 |
+
|
| 587 |
+
# data_issues count = 1, for row 6 (+0.11)
|
| 588 |
+
if "data_issues" in tables:
|
| 589 |
+
count = _get_row_count(conn, "data_issues")
|
| 590 |
+
if count == TASK6_EXPECTED_DATA_ISSUES_COUNT:
|
| 591 |
+
score += 0.11
|
| 592 |
+
|
| 593 |
+
# alice email is trimmed (+0.10)
|
| 594 |
+
if "salespersons" in tables:
|
| 595 |
+
try:
|
| 596 |
+
cursor = conn.execute(
|
| 597 |
+
"SELECT email FROM salespersons WHERE name LIKE '%Alice%'"
|
| 598 |
+
)
|
| 599 |
+
row = cursor.fetchone()
|
| 600 |
+
if row and row[0] == "alice@company.com":
|
| 601 |
+
score += 0.10
|
| 602 |
+
except Exception:
|
| 603 |
+
pass
|
| 604 |
+
|
| 605 |
+
# PRAGMA integrity_check (+0.10)
|
| 606 |
+
try:
|
| 607 |
+
cursor = conn.execute("PRAGMA integrity_check")
|
| 608 |
+
result = cursor.fetchone()[0]
|
| 609 |
+
if result == "ok":
|
| 610 |
+
score += 0.10
|
| 611 |
+
except Exception:
|
| 612 |
+
pass
|
| 613 |
+
|
| 614 |
+
# Exploit check
|
| 615 |
+
sales_count = _get_row_count(conn, "sales") if "sales" in tables else 0
|
| 616 |
+
if sales_count == 0 and "sales" in tables:
|
| 617 |
+
score = min(score, 0.1)
|
| 618 |
+
|
| 619 |
+
return max(0.01, min(0.99, score))
|
| 620 |
+
|
| 621 |
+
# =========================================================================
|
| 622 |
+
# Task 7: Dual-Source Consolidation (Hard)
|
| 623 |
+
# =========================================================================
|
| 624 |
+
|
| 625 |
+
def _score_task7(self, conn: sqlite3.Connection) -> float:
|
| 626 |
+
# Re-assert FK enforcement
|
| 627 |
+
try:
|
| 628 |
+
conn.execute("PRAGMA foreign_keys = ON")
|
| 629 |
+
except Exception:
|
| 630 |
+
pass
|
| 631 |
+
score = 0.0
|
| 632 |
+
tables = _get_table_names(conn)
|
| 633 |
+
|
| 634 |
+
# All 4 tables exist (+0.05)
|
| 635 |
+
required = {"unified_customers", "unified_products", "unified_orders", "migration_issues"}
|
| 636 |
+
if required.issubset(tables):
|
| 637 |
+
score += 0.05
|
| 638 |
+
|
| 639 |
+
# unified_customers count = 7 (+0.08)
|
| 640 |
+
if "unified_customers" in tables:
|
| 641 |
+
count = _get_row_count(conn, "unified_customers")
|
| 642 |
+
if count == TASK7_EXPECTED_UNIFIED_CUSTOMERS:
|
| 643 |
+
score += 0.08
|
| 644 |
+
|
| 645 |
+
# source='both' for email-matched records (+0.08)
|
| 646 |
+
if "unified_customers" in tables:
|
| 647 |
+
try:
|
| 648 |
+
cursor = conn.execute(
|
| 649 |
+
"SELECT COUNT(*) FROM unified_customers WHERE source = 'both'"
|
| 650 |
+
)
|
| 651 |
+
both = cursor.fetchone()[0]
|
| 652 |
+
if both == TASK7_EXPECTED_BOTH_SOURCE_COUNT:
|
| 653 |
+
score += 0.08
|
| 654 |
+
except Exception:
|
| 655 |
+
pass
|
| 656 |
+
|
| 657 |
+
# Legacy amount coercion — check unified_orders has REAL amounts (+0.10)
|
| 658 |
+
if "unified_orders" in tables:
|
| 659 |
+
try:
|
| 660 |
+
cursor = conn.execute(
|
| 661 |
+
"SELECT COUNT(*) FROM unified_orders WHERE typeof(amount) = 'real' OR typeof(amount) = 'integer'"
|
| 662 |
+
)
|
| 663 |
+
real_count = cursor.fetchone()[0]
|
| 664 |
+
order_count = _get_row_count(conn, "unified_orders")
|
| 665 |
+
if real_count == order_count and order_count > 0:
|
| 666 |
+
score += 0.10
|
| 667 |
+
except Exception:
|
| 668 |
+
pass
|
| 669 |
+
|
| 670 |
+
# NULL currency → 'USD' fill (+0.07)
|
| 671 |
+
if "unified_orders" in tables:
|
| 672 |
+
try:
|
| 673 |
+
cursor = conn.execute(
|
| 674 |
+
"SELECT COUNT(*) FROM unified_orders WHERE currency IS NULL"
|
| 675 |
+
)
|
| 676 |
+
null_curr = cursor.fetchone()[0]
|
| 677 |
+
if null_curr == 0:
|
| 678 |
+
score += 0.07
|
| 679 |
+
except Exception:
|
| 680 |
+
pass
|
| 681 |
+
|
| 682 |
+
# tx_status mapped to strings (+0.10)
|
| 683 |
+
if "unified_orders" in tables:
|
| 684 |
+
try:
|
| 685 |
+
cursor = conn.execute(
|
| 686 |
+
"SELECT COUNT(*) FROM unified_orders WHERE typeof(status) = 'text'"
|
| 687 |
+
)
|
| 688 |
+
text_count = cursor.fetchone()[0]
|
| 689 |
+
order_count = _get_row_count(conn, "unified_orders")
|
| 690 |
+
if text_count == order_count and order_count > 0:
|
| 691 |
+
score += 0.10
|
| 692 |
+
except Exception:
|
| 693 |
+
pass
|
| 694 |
+
|
| 695 |
+
# subscription_tier mapped to strings (+0.08)
|
| 696 |
+
if "unified_customers" in tables:
|
| 697 |
+
try:
|
| 698 |
+
cursor = conn.execute(
|
| 699 |
+
"SELECT COUNT(*) FROM unified_customers WHERE typeof(tier) = 'text'"
|
| 700 |
+
)
|
| 701 |
+
text_count = cursor.fetchone()[0]
|
| 702 |
+
cust_count = _get_row_count(conn, "unified_customers")
|
| 703 |
+
if text_count == cust_count and cust_count > 0:
|
| 704 |
+
score += 0.08
|
| 705 |
+
except Exception:
|
| 706 |
+
pass
|
| 707 |
+
|
| 708 |
+
# migration_issues count = 2 (+0.08)
|
| 709 |
+
if "migration_issues" in tables:
|
| 710 |
+
count = _get_row_count(conn, "migration_issues")
|
| 711 |
+
if count == TASK7_EXPECTED_MIGRATION_ISSUES:
|
| 712 |
+
score += 0.08
|
| 713 |
+
|
| 714 |
+
# Orphaned transaction in issues (+0.07)
|
| 715 |
+
if "migration_issues" in tables:
|
| 716 |
+
try:
|
| 717 |
+
cursor = conn.execute(
|
| 718 |
+
"SELECT COUNT(*) FROM migration_issues WHERE issue_type = 'orphaned_record'"
|
| 719 |
+
)
|
| 720 |
+
orphan_issues = cursor.fetchone()[0]
|
| 721 |
+
if orphan_issues >= 1:
|
| 722 |
+
score += 0.07
|
| 723 |
+
except Exception:
|
| 724 |
+
pass
|
| 725 |
+
|
| 726 |
+
# NULL email customer in issues (+0.07)
|
| 727 |
+
if "migration_issues" in tables:
|
| 728 |
+
try:
|
| 729 |
+
cursor = conn.execute(
|
| 730 |
+
"SELECT COUNT(*) FROM migration_issues WHERE issue_type = 'null_email'"
|
| 731 |
+
)
|
| 732 |
+
null_issues = cursor.fetchone()[0]
|
| 733 |
+
if null_issues >= 1:
|
| 734 |
+
score += 0.07
|
| 735 |
+
except Exception:
|
| 736 |
+
pass
|
| 737 |
+
|
| 738 |
+
# FK integrity on unified_orders (+0.10)
|
| 739 |
+
if "unified_orders" in tables:
|
| 740 |
+
if _has_foreign_key(conn, "unified_orders", "unified_customers"):
|
| 741 |
+
score += 0.10
|
| 742 |
+
|
| 743 |
+
# PRAGMA integrity_check (+0.10)
|
| 744 |
+
try:
|
| 745 |
+
cursor = conn.execute("PRAGMA integrity_check")
|
| 746 |
+
result = cursor.fetchone()[0]
|
| 747 |
+
if result == "ok":
|
| 748 |
+
score += 0.10
|
| 749 |
+
except Exception:
|
| 750 |
+
pass
|
| 751 |
+
|
| 752 |
+
# Exploit check
|
| 753 |
+
if "unified_orders" in tables and _get_row_count(conn, "unified_orders") == 0:
|
| 754 |
+
score = min(score, 0.1)
|
| 755 |
+
|
| 756 |
+
return max(0.01, min(0.99, score))
|
test_all_tasks.py
ADDED
|
@@ -0,0 +1,49 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""Quick validation of all 7 tasks: seeds + graders."""
|
| 2 |
+
import sqlite3
|
| 3 |
+
import sys
|
| 4 |
+
import os
|
| 5 |
+
|
| 6 |
+
sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))
|
| 7 |
+
|
| 8 |
+
from seeds import TASKS
|
| 9 |
+
from server.grader import StateReconciler
|
| 10 |
+
|
| 11 |
+
print(f"Tasks registered: {len(TASKS)}")
|
| 12 |
+
assert len(TASKS) == 7, f"Expected 7 tasks, got {len(TASKS)}"
|
| 13 |
+
print(f" Names: {list(TASKS.keys())}")
|
| 14 |
+
|
| 15 |
+
for name, cfg in TASKS.items():
|
| 16 |
+
# Seed
|
| 17 |
+
conn = sqlite3.connect(":memory:")
|
| 18 |
+
conn.execute("PRAGMA foreign_keys = ON")
|
| 19 |
+
cfg["seed_fn"](conn)
|
| 20 |
+
|
| 21 |
+
cursor = conn.execute(
|
| 22 |
+
"SELECT name FROM sqlite_master WHERE type='table' AND name NOT LIKE 'sqlite_%'"
|
| 23 |
+
)
|
| 24 |
+
tables = [r[0] for r in cursor.fetchall()]
|
| 25 |
+
print(f"\n[{name}] ({cfg['difficulty']}, max_steps={cfg.get('max_steps', 20)})")
|
| 26 |
+
print(f" Tables: {tables}")
|
| 27 |
+
|
| 28 |
+
# Grade
|
| 29 |
+
reconciler = StateReconciler(name)
|
| 30 |
+
score = reconciler.score(conn)
|
| 31 |
+
assert 0.01 <= score <= 0.99, f"Score {score} out of [0.01, 0.99]!"
|
| 32 |
+
print(f" Initial score: {score:.2f} OK")
|
| 33 |
+
|
| 34 |
+
conn.close()
|
| 35 |
+
|
| 36 |
+
# Also test environment resets for each task
|
| 37 |
+
from server.environment import DbMigrationEnvironment
|
| 38 |
+
|
| 39 |
+
for name in TASKS:
|
| 40 |
+
env = DbMigrationEnvironment(task_name=name)
|
| 41 |
+
obs = env.reset()
|
| 42 |
+
assert obs.done == False
|
| 43 |
+
assert obs.step_number == 0
|
| 44 |
+
print(f" [{name}] Environment reset OK")
|
| 45 |
+
env.close()
|
| 46 |
+
|
| 47 |
+
print("\n" + "=" * 50)
|
| 48 |
+
print("ALL 7 TASKS VALIDATED SUCCESSFULLY!")
|
| 49 |
+
print("=" * 50)
|