db-schema-migration / README.md
hissterical's picture
Upload 9 files
a5c89a3 verified

DB Schema Migration β€” OpenEnv Environment

OpenEnv

Motivation

Database schema migrations are one of the most error-prone, high-stakes tasks in software engineering. A wrong migration can corrupt production data, break foreign key constraints, or leave systems in inconsistent states. Yet there is no RL benchmark that trains agents to perform this task.

This environment simulates a real engineering workflow: an agent receives a legacy schema riddled with bad names, wrong types, and no normalization β€” and must transform it into a clean, production-grade design through a sequence of atomic migration operations. The environment is deterministic, programmatically gradeable, and scales from trivial renaming tasks to full third-normal-form (3NF) normalization.

Who benefits: Teams building AI coding assistants, database tools, or automated refactoring systems can use this benchmark to evaluate and train models on structured reasoning over relational data.


Action Space

Actions are Pydantic models sent as JSON to POST /step.

Field Type Description
operation enum One of: rename_table, rename_column, add_column, drop_column, change_type, add_foreign_key, normalize_table, done
table str Current name of the target table
column str? Current column name (for column ops)
new_name str? New name (for rename/normalize ops)
data_type str? SQL type string (e.g. DECIMAL(10,2), TIMESTAMP)
reference_table str? For FK: referenced table name
reference_column str? For FK: referenced column name
reason str? Free-text explanation (ignored by grader, useful for chain-of-thought)

Example:

{
  "operation": "rename_column",
  "table": "tbl_usr",
  "column": "usr_nm",
  "new_name": "username",
  "reason": "usr_nm is an abbreviation; username is the clean convention"
}

Observation Space

Observations are returned from POST /reset and POST /step.

Field Type Description
current_schema List[TableInfo] All tables with columns, types, PKs, FKs
target_requirements List[str] Natural language list of what must be achieved
steps_taken List[dict] Last 10 actions with rewards
violations List[str] Current FK or integrity violations
hints List[str] Optional hints for the current task
step_count int Steps used so far
max_steps int Steps allowed before forced termination

Tasks

Easy β€” Column Renaming

Difficulty: Easy | Max steps: 10

A single tbl_usr table with aggressively abbreviated column names (usr_id, usr_nm, usr_eml, dt_crt, stat_cd). The agent must rename the table to users and all 5 columns to human-readable names following snake_case convention.

All required operations are rename_table or rename_column. An optimal agent completes this in exactly 6 steps.

Grader: Exact match on table name and all column names. Efficiency penalty (-0.02/step) for steps beyond the minimum 6.


Medium β€” Multi-table Fix

Difficulty: Medium | Max steps: 20

Two tables (order_tbl, prod_tbl) with abbreviated names, wrong data types (money stored as VARCHAR, dates as VARCHAR), and missing columns (stock_quantity, status). Agent must rename both tables and all columns, fix types to DECIMAL(10,2) and TIMESTAMP, and add the missing columns.

Grader: Partial credit per table. Full score requires all 14 requirements satisfied. Efficiency penalty for steps > 14.


Hard β€” Full 3NF Normalization

Difficulty: Hard | Max steps: 30

A single everything table with 9 columns mixing customer, product, and order data (classic 1NF violation). Agent must:

  1. Extract 4 normalized tables: customers, products, orders, order_items
  2. Set correct column types throughout
  3. Establish 3 foreign key relationships
  4. Remove the original everything table

Grader: Weighted β€” 25% tables present, 50% correct columns/types, 25% correct foreign keys. Penalty if everything still exists at done.


Reward Function

Event Reward
Correct rename/type/add operation +0.10 to +0.15
Correct normalize_table extraction +0.20
Correct foreign key +0.15
Wrong/unexpected operation -0.05
Referencing non-existent table -0.10 to -0.15
Dropping a primary key -0.20
Exceeding max steps -0.20
done action Final grader score (0.0 – 1.0)

Rewards are dense: the agent receives a signal after every step, not just at the end.


Setup & Running

1. Build the Docker image

docker build -t db-schema-migration .
docker run -p 7860:7860 db-schema-migration

Verify it's live:

curl -X POST http://localhost:7860/reset -H "Content-Type: application/json" -d '{"task":"easy"}'

2. Environment variables for inference.py

export ENV_URL=http://localhost:7860          # or your HF Space URL
export API_BASE_URL=https://api-inference.huggingface.co/v1
export MODEL_NAME=meta-llama/Llama-3.1-8B-Instruct
export HF_TOKEN=hf_your_token_here
export TASK=easy                              # easy | medium | hard

3. Run the inference script

python inference.py easy
python inference.py medium
python inference.py hard

4. Run the OpenEnv validation

openenv validate

Baseline Scores

Scores from meta-llama/Llama-3.1-8B-Instruct via HF inference:

Task Score Steps Notes
Easy 0.92 7 Near-optimal, one extra step
Medium 0.74 18 Misses one type change
Hard 0.48 28 Tables correct, FKs partially missed

API Reference

Endpoint Method Body Returns
/reset POST {"task": "easy"} ResetResult
/step POST Action JSON StepResult
/state GET β€” StateResult
/tasks GET β€” Task descriptions
/health GET β€” {"status": "ok"}

Project Structure

.
β”œβ”€β”€ server/
β”‚   β”œβ”€β”€ __init__.py
β”‚   β”œβ”€β”€ main.py          # FastAPI app
β”‚   β”œβ”€β”€ env.py           # Environment logic + graders
β”‚   └── schemas.py       # Pydantic models
β”œβ”€β”€ Dockerfile
β”œβ”€β”€ requirements.txt
β”œβ”€β”€ openenv.yaml
β”œβ”€β”€ inference.py
└── README.md