shreyas231219 committed on
Commit
1f008d6
·
verified ·
1 Parent(s): 7cbfaf0

Upload folder using huggingface_hub

Files changed (2)
  1. README.md +104 -107
  2. inference.py +200 -400
README.md CHANGED
@@ -1,107 +1,104 @@
- ---
- title: Meta-Pytorch-Openenv
- emoji: 🦀
- colorFrom: blue
- colorTo: green
- sdk: docker
- app_port: 7860
- base_path: /web
- ---
- # SQL / Data Cleaning Sandbox
-
- An **OpenEnv**-compliant environment where AI agents clean messy SQLite databases
- using SQL queries and Python code.
-
- ## Overview
-
- | Feature | Details |
- |---|---|
- | **Interface** | `step()` / `reset()` / `state()` |
- | **Action space** | `{ tool: "sql" \| "python", command: "..." }` |
- | **Observation** | `{ output, error, current_step, max_steps, task_description }` |
- | **Reward** | 0.0 - 1.0 with **partial progress signals** |
- | **Tasks** | 3 (easy, medium, hard) |
-
- ## Tasks
-
- ### Easy - Data Triage
- > Find the total revenue from the `sales` table for January 2024.
-
- **Grader**: Checks whether the computed total matches the expected float value (1000.00).
-
- ### Medium - Data Cleaning
- > Fix duplicate emails, NULL ages, and uppercase emails in the `users` table.
-
- **Grader**: Partial scoring:
- - 0.3 for all emails lowercase
- - 0.4 for no duplicate emails
- - 0.3 for no NULL ages
-
- ### Hard - Schema Migration
- > Normalize `flat_orders` into `customers` + `orders` tables with foreign keys.
-
- **Grader**: Partial scoring:
- - 0.2 for correct `customers` schema
- - 0.2 for correct `orders` schema
- - 0.2 for 4 unique customers
- - 0.2 for 6 orders migrated
- - 0.2 for valid FK integrity
-
- ## Quick Start
-
- ### Docker (Hugging Face Spaces Ready)
-
- ```bash
- # Build
- docker build -t sql-sandbox:latest .
-
- # Run on the HF Spaces default port 7860
- docker run -p 7860:7860 sql-sandbox:latest
- ```
-
- ## Baseline Inference
-
- Runs GPT-4o on all three tasks and prints reproducible scores:
-
- ```bash
- export OPENAI_API_KEY=sk-...
- export MODEL_NAME=gpt-4o
- python inference.py --url http://localhost:7860
- ```
-
- ### Local Development
-
- ```bash
- # Install dependencies
- pip install openenv-core
-
- # Run the server (defaults to the 'easy' task)
- cd sql_sandbox
- TASK_ID=easy python -m server.app
-
- # Switch tasks via env var
- TASK_ID=medium python -m server.app
- TASK_ID=hard python -m server.app
- ```
-
- ## Project Structure
-
- ```
- sql_sandbox/
- ├── __init__.py         # Package exports
- ├── models.py           # Action & Observation Pydantic models
- ├── client.py           # EnvClient subclass
- ├── openenv.yaml        # OpenEnv manifest
- ├── pyproject.toml      # Dependencies
- ├── inference.py        # GPT-4o baseline script
- ├── inference_groq.py   # Groq baseline inference
- ├── README.md           # This file
- └── server/
-     ├── __init__.py
-     ├── app.py          # FastAPI application
-     ├── environment.py  # Core environment logic + graders
-     ├── requirements.txt
-     └── Dockerfile
- ```
 
+ ---
+ title: Meta-Pytorch-Openenv
+ emoji: 🦀
+ colorFrom: blue
+ colorTo: green
+ sdk: docker
+ app_port: 7860
+ base_path: /web
+ ---
+ # SQL / Data Cleaning Sandbox
+
+ An **OpenEnv**-compliant environment where AI agents clean messy SQLite databases
+ using SQL queries and Python code.
+
+ ## Overview
+
+ | Feature | Details |
+ |---|---|
+ | **Interface** | `step()` / `reset()` / `state()` |
+ | **Action space** | `{ tool: "sql" \| "python", command: "..." }` |
+ | **Observation** | `{ output, error, current_step, max_steps, task_description }` |
+ | **Reward** | 0.0 - 1.0 with **partial progress signals** |
+ | **Tasks** | 3 (easy, medium, hard) |
+
+ ## Tasks
+
+ ### Easy - Data Triage
+ > Find the total revenue from the `sales` table for January 2024.
+
+ **Grader**: Checks whether the computed total matches the expected float value (1000.00).
+
+ ### Medium - Data Cleaning
+ > Fix duplicate emails, NULL ages, and uppercase emails in the `users` table.
+
+ **Grader**: Partial scoring:
+ - 0.3 for all emails lowercase
+ - 0.4 for no duplicate emails
+ - 0.3 for no NULL ages
+
+ ### Hard - Schema Migration
+ > Normalize `flat_orders` into `customers` + `orders` tables with foreign keys.
+
+ **Grader**: Partial scoring:
+ - 0.2 for correct `customers` schema
+ - 0.2 for correct `orders` schema
+ - 0.2 for 4 unique customers
+ - 0.2 for 6 orders migrated
+ - 0.2 for valid FK integrity
+
+ ## Quick Start
+
+ ### Local Development
+
+ ```bash
+ # Install dependencies
+ pip install openenv-core
+
+ # Run the server (defaults to the 'easy' task)
+ cd sql_sandbox
+ TASK_ID=easy python -m server.app
+
+ # Switch tasks via env var
+ TASK_ID=medium python -m server.app
+ TASK_ID=hard python -m server.app
+ ```
+
+ ### Docker (Hugging Face Spaces Ready)
+
+ ```bash
+ # Build
+ docker build -t sql-sandbox:latest .
+
+ # Run on the HF Spaces default port 7860
+ docker run -p 7860:7860 sql-sandbox:latest
+ ```
+
+ ## Baseline Inference
+
+ Runs GPT-4o on all three tasks and prints reproducible scores:
+
+ ```bash
+ export HF_TOKEN=hf_...
+ export MODEL_NAME=gpt-4o
+ python inference.py --url http://localhost:7860
+ ```
+
+ ## Project Structure
+
+ ```
+ sql_sandbox/
+ ├── __init__.py         # Package exports
+ ├── models.py           # Action & Observation Pydantic models
+ ├── client.py           # EnvClient subclass
+ ├── openenv.yaml        # OpenEnv manifest
+ ├── pyproject.toml      # Dependencies
+ ├── inference.py        # GPT-4o baseline script
+ ├── README.md           # This file
+ └── server/
+     ├── __init__.py
+     ├── app.py          # FastAPI application
+     ├── environment.py  # Core environment logic + graders
+     ├── requirements.txt
+     └── Dockerfile
+ ```
 
 
 
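The medium task's partial-scoring rubric above maps naturally onto three independent SQL checks. A minimal standalone sketch of how such a grader could be written (the `grade_medium` name and the `users` schema here are illustrative assumptions, not the repository's actual `server/environment.py` code):

```python
import sqlite3

def grade_medium(conn: sqlite3.Connection) -> float:
    """Illustrative partial grader: 0.3 + 0.4 + 0.3 as in the README rubric."""
    cur = conn.cursor()
    score = 0.0

    # 0.3: every email is already lowercase
    (bad_case,) = cur.execute(
        "SELECT COUNT(*) FROM users WHERE email != LOWER(email)"
    ).fetchone()
    if bad_case == 0:
        score += 0.3

    # 0.4: no two rows share an email
    (total,) = cur.execute("SELECT COUNT(*) FROM users").fetchone()
    (distinct,) = cur.execute("SELECT COUNT(DISTINCT email) FROM users").fetchone()
    if total == distinct:
        score += 0.4

    # 0.3: no NULL ages remain
    (null_ages,) = cur.execute(
        "SELECT COUNT(*) FROM users WHERE age IS NULL"
    ).fetchone()
    if null_ages == 0:
        score += 0.3

    return round(score, 2)
```

Because the checks are independent, an agent that fixes only one of the three issues still earns partial credit, which is what gives the environment its dense progress signal.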
inference.py CHANGED
@@ -1,400 +1,200 @@
- """
-
- Baseline Inference Script for the SQL/Data Cleaning Sandbox (OpenAI Edition).
-
-
-
- Uses OpenAI (gpt-4o) to solve all three tasks and prints reproducible
-
- scores via the OpenEnv WebSocket client.
-
-
-
- Usage:
-
-     set OPENAI_API_KEY=sk-...              # Windows
-
-     export OPENAI_API_KEY=sk-...           # Linux/macOS
-
-     python inference.py                    # local server
-
-     python inference.py --url https://...  # remote server
-
- """
-
-
-
- import argparse
-
- import json
-
- import os
-
- import sys
-
-
-
- from dotenv import load_dotenv
-
- load_dotenv()
-
-
-
- from openai import OpenAI
-
-
-
- from client import SqlSandboxEnv
-
- from models import SqlSandboxAction
-
-
-
-
-
- # ---------------------------------------------------------------------------
-
- # System prompt shared across all tasks
-
- # ---------------------------------------------------------------------------
-
- SYSTEM_PROMPT = """\
-
- You are a data engineering assistant working inside a SQLite sandbox.
-
-
-
- You can execute two types of actions:
-
- 1. {"tool": "sql", "command": "<SQL query>"}
-
- 2. {"tool": "python", "command": "<Python code>"}
-
-
-
- Rules:
-
- - Respond with EXACTLY ONE JSON object per turn: no markdown, no explanation.
-
- - In Python code, the variables `conn` (sqlite3.Connection) and `cursor`
-
-   (sqlite3.Cursor) are already available. Do NOT call sqlite3.connect().
-
- - SQLite STRFTIME months are zero-padded: use '01' not '1', or use LIKE '2024-01-%'.
-
- - When you believe the task is fully complete, send:
-
-   {"tool": "sql", "command": "SELECT 'DONE'"}
-
- """
-
-
-
-
-
- # ---------------------------------------------------------------------------
-
- # Core agent loop: one task, one WebSocket session
-
- # ---------------------------------------------------------------------------
-
- def _run_task_agent(base_url: str, task_id: str, max_turns: int = 15) -> float:
-
-     """
-
-     Open a fresh WebSocket session, reset the environment to the given task,
-
-     then run an LLM agent loop until done or max_turns is reached.
-
-     Returns the final reward (0.0 to 1.0).
-
-     """
-
-     api_key = os.environ.get("OPENAI_API_KEY") or os.environ.get("HF_TOKEN")
-
-     api_base_url = os.environ.get("API_BASE_URL")
-
-     model_name = os.environ.get("MODEL_NAME", "gpt-4o")
-
-
-
-     client_llm = OpenAI(
-
-         api_key=api_key,
-
-         base_url=api_base_url,
-
-     )
-
-     final_reward = 0.0
-
-
-
-     # Each task gets its own WebSocket session to avoid state leakage
-
-     with SqlSandboxEnv(base_url=base_url).sync() as env:
-
-         # reset() with task_id seeds the correct DB table for this task
-
-         reset_resp = env.reset(task_id=task_id)
-
-         task_desc = reset_resp.observation.task_description
-
-
-
-         messages = [
-
-             {"role": "system", "content": SYSTEM_PROMPT},
-
-             {"role": "user", "content": f"Task: {task_desc}\n\nBegin."},
-
-         ]
-
-
-
-         print(f"\n --- Session: {task_id} ---")
-
-
-
-         for turn in range(max_turns):
-
-             # 1. Ask the LLM
-
-             response = client_llm.chat.completions.create(
-
-                 model=model_name,
-
-                 messages=messages,
-
-                 temperature=0.0,
-
-                 max_tokens=512,
-
-             )
-
-             assistant_msg = response.choices[0].message.content.strip()
-
-
-
-             # 2. Parse the action JSON (handle optional markdown fences)
-
-             try:
-
-                 raw = assistant_msg
-
-                 if raw.startswith("```"):
-
-                     raw = raw.split("```")[1]
-
-                     if raw.startswith("json"):
-
-                         raw = raw[4:]
-
-                 action_data = json.loads(raw)
-
-                 tool = action_data["tool"]
-
-                 command = action_data["command"]
-
-             except (json.JSONDecodeError, KeyError):
-
-                 # Feed the parse error back to the LLM; do NOT count it as a step
-
-                 messages.append({"role": "assistant", "content": assistant_msg})
-
-                 messages.append({
-
-                     "role": "user",
-
-                     "content": (
-
-                         'Invalid JSON. Reply with exactly one JSON object:\n'
-
-                         '{"tool": "sql" | "python", "command": "..."}'
-
-                     ),
-
-                 })
-
-                 continue
-
-
-
-             # 3. Execute the action via OpenEnv step()
-
-             step_resp = env.step(SqlSandboxAction(tool=tool, command=command))
-
-
-
-             reward = step_resp.reward or 0.0
-
-             done = step_resp.done
-
-             output = step_resp.observation.output or ""
-
-             error = step_resp.observation.error or ""
-
-
-
-             final_reward = reward
-
-             print(f" [Turn {turn+1:02d}] tool={tool:<6} | reward={reward:.4f} | done={done}")
-
-
-
-             if done:
-
-                 break
-
-
-
-             # 4. Feed the result back to the LLM for the next turn
-
-             messages.append({"role": "assistant", "content": assistant_msg})
-
-             feedback = f"Output:\n{output[:1500]}"
-
-             if error:
-
-                 feedback += f"\nError:\n{error[:500]}"
-
-             feedback += f"\nReward so far: {reward:.4f}"
-
-             messages.append({"role": "user", "content": feedback})
-
-
-
-     return final_reward
-
-
-
-
-
- # ---------------------------------------------------------------------------
-
- # Per-difficulty entry points (called by main, importable for custom use)
-
- # ---------------------------------------------------------------------------
-
- def easy_run(base_url: str, max_turns: int = 15) -> float:
-
-     print(f"\n{'='*50}\nRunning task: easy\n{'='*50}")
-
-     score = _run_task_agent(base_url, "easy", max_turns)
-
-     print(f" Final score: {score:.4f}")
-
-     return score
-
-
-
-
-
- def med_run(base_url: str, max_turns: int = 15) -> float:
-
-     print(f"\n{'='*50}\nRunning task: medium\n{'='*50}")
-
-     score = _run_task_agent(base_url, "medium", max_turns)
-
-     print(f" Final score: {score:.4f}")
-
-     return score
-
-
-
-
-
- def hard_run(base_url: str, max_turns: int = 15) -> float:
-
-     print(f"\n{'='*50}\nRunning task: hard\n{'='*50}")
-
-     score = _run_task_agent(base_url, "hard", max_turns)
-
-     print(f" Final score: {score:.4f}")
-
-     return score
-
-
-
-
-
- # ---------------------------------------------------------------------------
-
- # CLI entry point
-
- # ---------------------------------------------------------------------------
-
- def main():
-
-     parser = argparse.ArgumentParser(
-
-         description="OpenAI baseline inference for the SQL/Data Cleaning Sandbox"
-
-     )
-
-     parser.add_argument(
-
-         "--url",
-
-         default="http://localhost:8000",
-
-         help="Base URL of the running environment server (default: http://localhost:8000)",
-
-     )
-
-     parser.add_argument(
-
-         "--max-turns",
-
-         type=int,
-
-         default=15,
-
-         help="Maximum agent turns per task (default: 15)",
-
-     )
-
-     args = parser.parse_args()
-
-
-
-     if not os.environ.get("HF_TOKEN") and not os.environ.get("OPENAI_API_KEY"):
-
-         print("ERROR: the HF_TOKEN (or OPENAI_API_KEY) environment variable is not set.")
-
-         sys.exit(1)
-
-
-
-     results: dict[str, float] = {}
-
-     results["easy"] = easy_run(args.url, args.max_turns)
-
-     results["medium"] = med_run(args.url, args.max_turns)
-
-     results["hard"] = hard_run(args.url, args.max_turns)
-
-
-
-     avg = sum(results.values()) / len(results)
-
-     print(f"\n{'='*50}")
-
-     print("RESULTS SUMMARY")
-
-     print(f"{'='*50}")
-
-     for task_id, score in results.items():
-
-         print(f" {task_id:<10}: {score:.4f}")
-
-     print(f" {'average':<10}: {avg:.4f}")
-
-
-
-
-
- if __name__ == "__main__":
-
-     main()
-
 
+ """
+ Baseline Inference Script for the SQL/Data Cleaning Sandbox (OpenAI Edition).
+
+ Uses OpenAI (gpt-4o) to solve all three tasks and prints reproducible
+ scores via the OpenEnv WebSocket client.
+
+ Usage:
+     set HF_TOKEN=hf_...                    # Windows
+     export HF_TOKEN=hf_...                 # Linux/macOS
+     python inference.py                    # local server
+     python inference.py --url https://...  # remote server
+ """
+
+ import argparse
+ import json
+ import os
+ import sys
+
+ from dotenv import load_dotenv
+ load_dotenv()
+
+ from openai import OpenAI
+
+ from client import SqlSandboxEnv
+ from models import SqlSandboxAction
+
+
+ # ---------------------------------------------------------------------------
+ # System prompt shared across all tasks
+ # ---------------------------------------------------------------------------
+ SYSTEM_PROMPT = """\
+ You are a data engineering assistant working inside a SQLite sandbox.
+
+ You can execute two types of actions:
+ 1. {"tool": "sql", "command": "<SQL query>"}
+ 2. {"tool": "python", "command": "<Python code>"}
+
+ Rules:
+ - Respond with EXACTLY ONE JSON object per turn: no markdown, no explanation.
+ - In Python code, the variables `conn` (sqlite3.Connection) and `cursor`
+   (sqlite3.Cursor) are already available. Do NOT call sqlite3.connect().
+ - SQLite STRFTIME months are zero-padded: use '01' not '1', or use LIKE '2024-01-%'.
+ - When you believe the task is fully complete, send:
+   {"tool": "sql", "command": "SELECT 'DONE'"}
+ """
+
+
+ # ---------------------------------------------------------------------------
+ # Core agent loop: one task, one WebSocket session
+ # ---------------------------------------------------------------------------
+ def _run_task_agent(base_url: str, task_id: str, max_turns: int = 15) -> float:
+     """
+     Open a fresh WebSocket session, reset the environment to the given task,
+     then run an LLM agent loop until done or max_turns is reached.
+     Returns the final reward (0.0 to 1.0).
+     """
+     api_key = os.environ.get("HF_TOKEN") or os.environ.get("OPENAI_API_KEY")
+     api_base_url = os.environ.get("API_BASE_URL")
+     model_name = os.environ.get("MODEL_NAME", "gpt-4o")
+
+     client_llm = OpenAI(
+         api_key=api_key,
+         base_url=api_base_url,
+     )
+     final_reward = 0.0
+
+     # Each task gets its own WebSocket session to avoid state leakage
+     with SqlSandboxEnv(base_url=base_url).sync() as env:
+         # reset() with task_id seeds the correct DB table for this task
+         reset_resp = env.reset(task_id=task_id)
+         task_desc = reset_resp.observation.task_description
+
+         messages = [
+             {"role": "system", "content": SYSTEM_PROMPT},
+             {"role": "user", "content": f"Task: {task_desc}\n\nBegin."},
+         ]
+
+         print(f"\n --- Session: {task_id} ---")
+
+         for turn in range(max_turns):
+             # 1. Ask the LLM
+             response = client_llm.chat.completions.create(
+                 model=model_name,
+                 messages=messages,
+                 temperature=0.0,
+                 max_tokens=512,
+             )
+             assistant_msg = response.choices[0].message.content.strip()
+
+             # 2. Parse the action JSON (handle optional markdown fences)
+             try:
+                 raw = assistant_msg
+                 if raw.startswith("```"):
+                     raw = raw.split("```")[1]
+                     if raw.startswith("json"):
+                         raw = raw[4:]
+                 action_data = json.loads(raw)
+                 tool = action_data["tool"]
+                 command = action_data["command"]
+             except (json.JSONDecodeError, KeyError):
+                 # Feed the parse error back to the LLM; do NOT count it as a step
+                 messages.append({"role": "assistant", "content": assistant_msg})
+                 messages.append({
+                     "role": "user",
+                     "content": (
+                         'Invalid JSON. Reply with exactly one JSON object:\n'
+                         '{"tool": "sql" | "python", "command": "..."}'
+                     ),
+                 })
+                 continue
+
+             # 3. Execute the action via OpenEnv step()
+             step_resp = env.step(SqlSandboxAction(tool=tool, command=command))
+
+             reward = step_resp.reward or 0.0
+             done = step_resp.done
+             output = step_resp.observation.output or ""
+             error = step_resp.observation.error or ""
+
+             final_reward = reward
+             print(f" [Turn {turn+1:02d}] tool={tool:<6} | reward={reward:.4f} | done={done}")
+
+             if done:
+                 break
+
+             # 4. Feed the result back to the LLM for the next turn
+             messages.append({"role": "assistant", "content": assistant_msg})
+             feedback = f"Output:\n{output[:1500]}"
+             if error:
+                 feedback += f"\nError:\n{error[:500]}"
+             feedback += f"\nReward so far: {reward:.4f}"
+             messages.append({"role": "user", "content": feedback})
+
+     return final_reward
+
+
+ # ---------------------------------------------------------------------------
+ # Per-difficulty entry points (called by main, importable for custom use)
+ # ---------------------------------------------------------------------------
+ def easy_run(base_url: str, max_turns: int = 15) -> float:
+     print(f"\n{'='*50}\nRunning task: easy\n{'='*50}")
+     score = _run_task_agent(base_url, "easy", max_turns)
+     print(f" Final score: {score:.4f}")
+     return score
+
+
+ def med_run(base_url: str, max_turns: int = 15) -> float:
+     print(f"\n{'='*50}\nRunning task: medium\n{'='*50}")
+     score = _run_task_agent(base_url, "medium", max_turns)
+     print(f" Final score: {score:.4f}")
+     return score
+
+
+ def hard_run(base_url: str, max_turns: int = 15) -> float:
+     print(f"\n{'='*50}\nRunning task: hard\n{'='*50}")
+     score = _run_task_agent(base_url, "hard", max_turns)
+     print(f" Final score: {score:.4f}")
+     return score
+
+
+ # ---------------------------------------------------------------------------
+ # CLI entry point
+ # ---------------------------------------------------------------------------
+ def main():
+     parser = argparse.ArgumentParser(
+         description="OpenAI baseline inference for the SQL/Data Cleaning Sandbox"
+     )
+     parser.add_argument(
+         "--url",
+         default="http://localhost:8000",
+         help="Base URL of the running environment server (default: http://localhost:8000)",
+     )
+     parser.add_argument(
+         "--max-turns",
+         type=int,
+         default=15,
+         help="Maximum agent turns per task (default: 15)",
+     )
+     args = parser.parse_args()
+
+     if not os.environ.get("HF_TOKEN") and not os.environ.get("OPENAI_API_KEY"):
+         print("ERROR: the HF_TOKEN (or OPENAI_API_KEY) environment variable is not set.")
+         sys.exit(1)
+
+     results: dict[str, float] = {}
+     results["easy"] = easy_run(args.url, args.max_turns)
+     results["medium"] = med_run(args.url, args.max_turns)
+     results["hard"] = hard_run(args.url, args.max_turns)
+
+     avg = sum(results.values()) / len(results)
+     print(f"\n{'='*50}")
+     print("RESULTS SUMMARY")
+     print(f"{'='*50}")
+     for task_id, score in results.items():
+         print(f" {task_id:<10}: {score:.4f}")
+     print(f" {'average':<10}: {avg:.4f}")
+
+
+ if __name__ == "__main__":
+     main()
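The fence handling in step 2 of the agent loop can be exercised in isolation. A minimal sketch under the same assumptions as the script (the `parse_action` helper is illustrative; `inference.py` inlines this logic rather than defining a function):

```python
import json

def parse_action(assistant_msg: str) -> tuple[str, str]:
    """Strip an optional markdown code fence, then parse the action JSON."""
    raw = assistant_msg.strip()
    if raw.startswith("```"):
        raw = raw.split("```")[1]   # keep only the fenced body
        if raw.startswith("json"):
            raw = raw[4:]           # drop the language tag
    data = json.loads(raw)
    return data["tool"], data["command"]
```

Plain JSON passes straight through, while a markdown-fenced reply is unwrapped before parsing; anything else raises `json.JSONDecodeError` or `KeyError`, which the loop feeds back to the model without consuming an environment step.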