File size: 6,540 Bytes
a5c89a3
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
5d9b728
a5c89a3
 
 
 
 
 
 
 
 
 
 
5d9b728
 
a5c89a3
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
# DB Schema Migration β€” OpenEnv Environment

[![OpenEnv](https://img.shields.io/badge/openenv-tagged-blue)](https://huggingface.co/spaces?tag=openenv)

## Motivation

Database schema migrations are one of the most error-prone, high-stakes tasks in software engineering. A wrong migration can corrupt production data, break foreign key constraints, or leave systems in inconsistent states. Yet there is no RL benchmark that trains agents to perform this task.

This environment simulates a real engineering workflow: an agent receives a legacy schema riddled with bad names, wrong types, and no normalization β€” and must transform it into a clean, production-grade design through a sequence of atomic migration operations. The environment is deterministic, programmatically gradeable, and scales from trivial renaming tasks to full third-normal-form (3NF) normalization.

**Who benefits:** Teams building AI coding assistants, database tools, or automated refactoring systems can use this benchmark to evaluate and train models on structured reasoning over relational data.

---

## Action Space

Actions are Pydantic models sent as JSON to `POST /step`.

| Field | Type | Description |
|---|---|---|
| `operation` | enum | One of: `rename_table`, `rename_column`, `add_column`, `drop_column`, `change_type`, `add_foreign_key`, `normalize_table`, `done` |
| `table` | str | Current name of the target table |
| `column` | str? | Current column name (for column ops) |
| `new_name` | str? | New name (for rename/normalize ops) |
| `data_type` | str? | SQL type string (e.g. `DECIMAL(10,2)`, `TIMESTAMP`) |
| `reference_table` | str? | For FK: referenced table name |
| `reference_column` | str? | For FK: referenced column name |
| `reason` | str? | Free-text explanation (ignored by grader, useful for chain-of-thought) |

**Example:**
```json
{
  "operation": "rename_column",
  "table": "tbl_usr",
  "column": "usr_nm",
  "new_name": "username",
  "reason": "usr_nm is an abbreviation; username is the clean convention"
}
```

---

## Observation Space

Observations are returned from `POST /reset` and `POST /step`.

| Field | Type | Description |
|---|---|---|
| `current_schema` | List[TableInfo] | All tables with columns, types, PKs, FKs |
| `target_requirements` | List[str] | Natural language list of what must be achieved |
| `steps_taken` | List[dict] | Last 10 actions with rewards |
| `violations` | List[str] | Current FK or integrity violations |
| `hints` | List[str] | Optional hints for the current task |
| `step_count` | int | Steps used so far |
| `max_steps` | int | Steps allowed before forced termination |

---

## Tasks

### Easy β€” Column Renaming
**Difficulty:** Easy | **Max steps:** 10

A single `tbl_usr` table with aggressively abbreviated column names (`usr_id`, `usr_nm`, `usr_eml`, `dt_crt`, `stat_cd`). The agent must rename the table to `users` and all 5 columns to human-readable names following snake_case convention.

All required operations are `rename_table` or `rename_column`. An optimal agent completes this in exactly 6 steps.

**Grader:** Exact match on table name and all column names. Efficiency penalty (-0.02/step) for steps beyond the minimum 6.

---

### Medium β€” Multi-table Fix
**Difficulty:** Medium | **Max steps:** 20

Two tables (`order_tbl`, `prod_tbl`) with abbreviated names, wrong data types (money stored as `VARCHAR`, dates as `VARCHAR`), and missing columns (`stock_quantity`, `status`). Agent must rename both tables and all columns, fix types to `DECIMAL(10,2)` and `TIMESTAMP`, and add the missing columns.

**Grader:** Partial credit per table. Full score requires all 14 requirements satisfied. Efficiency penalty for steps > 14.

---

### Hard β€” Full 3NF Normalization
**Difficulty:** Hard | **Max steps:** 30

A single `everything` table with 9 columns mixing customer, product, and order data (classic 1NF violation). Agent must:
1. Extract 4 normalized tables: `customers`, `products`, `orders`, `order_items`
2. Set correct column types throughout
3. Establish 3 foreign key relationships
4. Remove the original `everything` table

**Grader:** Weighted β€” 25% tables present, 50% correct columns/types, 25% correct foreign keys. Penalty if `everything` still exists at done.

---

## Reward Function

| Event | Reward |
|---|---|
| Correct rename/type/add operation | +0.10 to +0.15 |
| Correct normalize_table extraction | +0.20 |
| Correct foreign key | +0.15 |
| Wrong/unexpected operation | -0.05 |
| Referencing non-existent table | -0.10 to -0.15 |
| Dropping a primary key | -0.20 |
| Exceeding max steps | -0.20 |
| `done` action | Final grader score (0.0 – 1.0) |

Rewards are dense: the agent receives a signal after every step, not just at the end.

---

## Setup & Running

### 1. Build the Docker image

```bash
docker build -t db-schema-migration .
docker run -p 7860:7860 db-schema-migration
```

Verify it's live:
```bash
curl -X POST http://localhost:7860/reset -H "Content-Type: application/json" -d '{"task":"easy"}'
```

### 2. Environment variables for inference.py

```bash
export ENV_URL=http://localhost:7860          # or your HF Space URL
export API_BASE_URL=https://api-inference.huggingface.co/v1
export MODEL_NAME=meta-llama/Llama-3.1-8B-Instruct
export HF_TOKEN=hf_your_token_here
export TASK=easy                              # easy | medium | hard
```

### 3. Run the inference script

```bash
python inference.py easy
python inference.py medium
python inference.py hard
```

### 4. Run the OpenEnv validation

```bash
openenv validate
```

---

## Baseline Scores

Scores from `meta-llama/Llama-3.1-8B-Instruct` via HF inference:

| Task | Score | Steps | Notes |
|---|---|---|---|
| Easy | 0.92 | 7 | Near-optimal, one extra step |
| Medium | 0.74 | 18 | Misses one type change |
| Hard | 0.48 | 28 | Tables correct, FKs partially missed |

---

## API Reference

| Endpoint | Method | Body | Returns |
|---|---|---|---|
| `/reset` | POST | `{"task": "easy"}` | `ResetResult` |
| `/step` | POST | `Action` JSON | `StepResult` |
| `/state` | GET | β€” | `StateResult` |
| `/tasks` | GET | β€” | Task descriptions |
| `/health` | GET | β€” | `{"status": "ok"}` |

---

## Project Structure

```
.
β”œβ”€β”€ server/
β”‚   β”œβ”€β”€ __init__.py
β”‚   β”œβ”€β”€ main.py          # FastAPI app
β”‚   β”œβ”€β”€ env.py           # Environment logic + graders
β”‚   └── schemas.py       # Pydantic models
β”œβ”€β”€ Dockerfile
β”œβ”€β”€ requirements.txt
β”œβ”€β”€ openenv.yaml
β”œβ”€β”€ inference.py
└── README.md
```