Sidharth1743 committed on
Commit
689c71b
·
1 Parent(s): da3c180

Hackathon polishing

grid2op_env/.dockerignore → .dockerignore RENAMED
File without changes
.gitignore CHANGED
@@ -1,13 +1,11 @@
1
- # Python-generated files
 
2
  __pycache__/
3
- *.py[oc]
4
- build/
5
- dist/
6
- wheels/
7
- *.egg-info
8
 
9
- # Virtual environments
10
- .venv
11
- OpenEnv/
12
- OpenEnv
13
- .env
 
1
+ .venv/
2
+ .pytest_cache/
3
  __pycache__/
4
+ *.pyc
5
+ outputs/logs/*
6
+ !outputs/logs/.gitkeep
7
+ outputs/evals/*
8
+ !outputs/evals/.gitkeep
9
 
10
+ .env
11
+ grid2op_env.egg-info
.python-version CHANGED
@@ -1 +1 @@
1
- 3.13
 
1
+ 3.12
AGENTS.md ADDED
@@ -0,0 +1,25 @@
1
+ # Repository Guidelines
2
+
3
+ ## Project Structure & Module Organization
4
+ The package is rooted at the repository top level. Core models live in `models.py`, the baseline agent in `inference.py`, the client helper in `client.py`, and topology analysis in `graph_analysis.py`. The FastAPI/OpenEnv server lives in `server/` with `app.py`, `grid_environment.py`, `tasks.py`, `graders.py`, and logging helpers. Tests are in `tests/`, reference docs in `docs/` and `architecture/`, and submission utilities in `submission/`. Runtime artifacts go under `outputs/logs/` and `outputs/evals/`.
5
+
6
+ ## Build, Test, and Development Commands
7
+ Use `uv` for local work.
8
+
9
+ - `env UV_CACHE_DIR=/tmp/uv-cache uv run --no-dev server --port 7860` starts the FastAPI server declared in `openenv.yaml`.
10
+ - `env UV_CACHE_DIR=/tmp/uv-cache uv run --no-dev grid2op-smoke --task-id single_fault --steps 1` runs a quick environment smoke test.
11
+ - `env UV_CACHE_DIR=/tmp/uv-cache uv run --extra dev pytest tests/test_grid2op_env.py -q` runs the current pytest suite.
12
+ - `docker build -t grid2op-env:local -f server/Dockerfile .` builds the local container image.
13
+ - `bash submission/pre_validation.sh` runs submission checks before packaging.
14
+
15
+ ## Coding Style & Naming Conventions
16
+ Follow the existing Python style: 4-space indentation, type hints, `from __future__ import annotations`, and compact module-level imports. Use `snake_case` for functions, variables, and modules, `PascalCase` for Pydantic models, and `UPPER_CASE` for constants like `TASKS`. Keep OpenEnv payloads strongly typed with Pydantic models instead of raw dicts when practical. No formatter or linter config is committed, so match surrounding code and keep diffs minimal.
17
+
18
+ ## Testing Guidelines
19
+ Tests use `pytest`. Add new coverage in `tests/test_grid2op_env.py` or split into `tests/test_<feature>.py` as the suite grows. Prefer deterministic assertions over probabilistic checks; this repository already tests grader determinism, task resets, proposal parsing, and graph-analysis output. Run the smoke command plus pytest before opening a PR.
20
+
21
+ ## Commit & Pull Request Guidelines
22
+ Recent commits use short, direct subjects such as `docs updated` and `task 3 refining`. Keep commit titles imperative (lowercase is acceptable) and under roughly 60 characters. PRs should describe the affected task or subsystem, list the validation commands run, and note baseline or API behavior changes when relevant. Add screenshots only for UI or HTTP response examples.
23
+
24
+ ## Configuration & Runtime Notes
25
+ `openenv.yaml` points to `server.app:app` on port `7860`. Keep API credentials in environment variables or `.env`; do not hardcode secrets. If you change server routes or environment logic, restart the server before rerunning `inference.py`.
README.md CHANGED
@@ -0,0 +1,220 @@
1
+ # Grid2Op Environment
2
+
3
+ Standalone OpenEnv environment package for the full `PROJECT.md` design.
4
+
5
+ The current planner uses server-side simulation on the live Grid2Op session. It does not rely on a replayed local mirror.
6
+
7
+ ## File structure
8
+
9
+ ```text
10
+ grid2op_env/
11
+ ├── .dockerignore
12
+ ├── .env
13
+ ├── .gitignore
14
+ ├── __init__.py
15
+ ├── models.py
16
+ ├── client.py
17
+ ├── inference.py
18
+ ├── README.md
19
+ ├── openenv.yaml
20
+ ├── outputs/
21
+ │ ├── logs/
22
+ │ └── evals/
23
+ ├── pyproject.toml
24
+ └── server/
25
+ ├── grid_environment.py
26
+ ├── tasks.py
27
+ ├── graders.py
28
+ ├── app.py
29
+ ├── requirements.txt
30
+ └── Dockerfile
31
+ ```
32
+
33
+ The top-level package now follows the canonical OpenEnv environment layout:
34
+
35
+ - `.dockerignore`
36
+ - `__init__.py`
37
+ - `models.py`
38
+ - `client.py`
39
+ - `README.md`
40
+ - `openenv.yaml`
41
+ - `pyproject.toml`
42
+ - `outputs/logs`
43
+ - `outputs/evals`
44
+ - `server/`
45
+
46
+ Supporting files outside the minimum template remain for quality and verification:
47
+
48
+ - `inference.py`
49
+ - `tests/test_grid2op_env.py`
50
+ - helper server modules such as `tasks.py`, `graders.py`, and `logging_utils.py`
51
+
52
+ ## What is implemented
53
+
54
+ - Grid2Op core simulator using `l2rpn_case14_sandbox`
55
+ - Typed `GridAction`, `GridObservation`, and `GridState`
56
+ - Four tasks: `single_fault`, `n_minus_1`, `cascade_prevent`, `multi_stage_cascade`
57
+ - Reset-time scenario injection and retry logic for non-convergent starts
58
+ - Shaped reward, episode logging, and deterministic graders
59
+ - OpenEnv WebSocket interface plus `/tasks`, `/grader`, and `/baseline`
60
+ - Server-side planner support via:
61
+ - `POST /planning_context`
62
+ - `POST /simulate`
63
+ - Qwen3.5 baseline using the Chat Completions API
64
+ - Local Docker workflow with dataset pre-download
65
+
66
+ ## Recent fixes
67
+
68
+ 1. **Task 1 (single_fault) benchmark ranges corrected** (tasks.py:297-304):
69
+ - `single_fault_easy`: 0.82-0.85 (was mathematically impossible 0.90-0.94)
70
+ - `single_fault_moderate`: 0.86-0.89 (was 0.94-0.97)
71
+ - `single_fault_severe`: 0.90-0.93 (was 0.96-0.99)
72
+ - Warmup phase finds a high-loading state in the chronics, then the agent has 10 steps to solve
73
+
74
+ 2. **Task 1 reward function** (grid_environment.py:589-596):
75
+ - Target achieved bonus: `1.0 / step_count` (rewards early solution)
76
+ - Safe margin bonus: `0.05 × max(0.0, 1.0 - max_rho)`
77
+ - Overload penalty: `0.2 × overloaded_count` (lines > 100%)
78
+ - Redispatch penalty: `0.01 × MW` (discourages large interventions)
79
+ - Failure penalty: `-5.0` if time limit reached without target
80
+
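Taken together, the Task 1 shaping above can be sketched as a single function. This is an illustrative reconstruction from the bullet list, not the code in `grid_environment.py`; the names and the exact way the terms combine are assumptions.

```python
def task1_step_reward(step_count: int, max_rho: float, overloaded_count: int,
                      redispatch_mw: float, target_achieved: bool,
                      timed_out: bool) -> float:
    # Illustrative reconstruction of the Task 1 shaping; not the repo's code.
    reward = 0.0
    if target_achieved:
        reward += 1.0 / step_count                 # early-solution bonus
    reward += 0.05 * max(0.0, 1.0 - max_rho)       # safe-margin bonus
    reward -= 0.2 * overloaded_count               # lines loaded above 100%
    reward -= 0.01 * redispatch_mw                 # discourage large redispatch
    if timed_out and not target_achieved:
        reward -= 5.0                              # failure penalty at time limit
    return reward
```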
81
+ 3. **Task 1 grading** (graders.py:28-55):
82
+ - 70% weight on survival ratio
83
+ - 50% target achieved bonus
84
+ - Final state bonus (0.3 if below target, 0.15/+0.05, 0.05/+0.10)
85
+ - Legacy success score for early completion: `1.0 - 0.08 × (step - 1)`
86
+
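The legacy early-completion formula above is simple enough to state directly; the clamp at zero for very late solutions is an assumption, not confirmed from `graders.py`.

```python
def legacy_success_score(solve_step: int) -> float:
    # 1.0 - 0.08 * (step - 1), clamped at 0.0 (the clamp is an assumption)
    return max(0.0, 1.0 - 0.08 * (solve_step - 1))
```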
87
+ 4. **Task 2 (n_minus_1) redesign** based on RL2Grid paper (grid_environment.py:598-609):
88
+ - Three-component reward: `0.3×R_survive + 0.6×R_overload + 0.1×R_cost`
89
+ - `R_survive`: +1.0 per step (constant survival signal)
90
+ - `R_overload`: `(1/n) × Σ clip(1-ρ, -1, 1)` - loading margin quality
91
+ - `R_cost`: `-0.05 × Σ|ΔMW|/max_ramp` (normalized redispatch cost)
92
+ - Reconnection bonus: +2.0 when safely reconnecting (grid_environment.py:853-869)
93
+ - Terminal: +10×(s/m)² quadratic survival, -15 blackout
94
+ - Phase-aware grader (graders.py:58-83):
95
+ - Emergency response (30%): cleared within 5 steps at rho < 0.92
96
+ - Sustained security (50%): steps 6-20 at rho < 0.90
97
+ - Reconnection (20%): did agent reconnect line 0?
98
+ - N-1 security score (bridge lines) in prompt
99
+ - **Grading now honest**: score = survival_ratio × mastery_score (no override)
100
+ - Latest eval: 0.952 (was 1.0 with old override)
101
+
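A minimal sketch of the three-component step reward described above. The per-generator ramp normalization and all names are assumptions for illustration; they are not taken from the repo.

```python
def task2_step_reward(rho: list[float], delta_mw: list[float],
                      max_ramp: list[float]) -> float:
    # 0.3*R_survive + 0.6*R_overload + 0.1*R_cost, per the redesign notes
    r_survive = 1.0                                            # constant survival signal
    r_overload = sum(max(-1.0, min(1.0, 1.0 - r)) for r in rho) / len(rho)
    r_cost = -0.05 * sum(abs(d) / m for d, m in zip(delta_mw, max_ramp))
    return 0.3 * r_survive + 0.6 * r_overload + 0.1 * r_cost
```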
102
+ 5. **Task 3 (cascade_prevent)** (grid_environment.py:611-628):
103
+ - 1-2 lines disconnected at reset + 5-15% load increase
104
+ - Key metric: `timestep_overflow` countdowns (not just max_rho)
105
+ - Quadratic overflow penalty: `-0.05 × Σ(overflow²)`; a line at overflow=2 is 4x more urgent than one at overflow=1
106
+ - Reward components:
107
+ - Cascade prevention: +0.3 if no auto-trip, -2.5 if auto-trip
108
+ - Thermal margin: +0.1 × mean(clip(1-ρ, -1, 1))
109
+ - Terminal: +5.0 × (1 - auto_trips/5)² survival bonus, -12.0 blackout
110
+ - Grading (graders.py:86-121):
111
+ - Cascade containment (50%): steps without auto-trips / 30
112
+ - Thermal stability (30%): safe_steps / containment_steps
113
+ - Recovery speed (20%): how fast recovered from first overload
114
+ - Latest eval: 0.798 (hard/extreme tiers challenging)
115
+
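The quadratic urgency weighting above is worth making concrete. A hypothetical helper (the repo combines this term with the other reward components):

```python
def overflow_penalty(timestep_overflow: list[int]) -> float:
    # -0.05 * sum(overflow^2): a line two steps into overflow
    # weighs 4x one that just started overflowing
    return -0.05 * sum(t * t for t in timestep_overflow)
```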
116
+ 6. **Task 4 (multi_stage_cascade)** (tasks.py:334-337, grid_environment.py:630-647):
117
+ - 3 lines disconnected at reset + **20% load increase** (not 15%)
118
+ - Three explicit stages (10 steps each) with stage boundaries at step 10 and 20
119
+ - Overflow window: 2 (faster cascades than default 3)
120
+ - Do-nothing survival probe: 5 steps minimum
121
+ - Island availability assessment at stage boundaries (grid_environment.py:767-814)
122
+ - Candidate filtering (inference.py:1003-1030): filters unsafe topology disconnects
123
+ - Reward (grid_environment.py:630-647):
124
+ - Generation cost: -0.02 × (total_gen / initial_load)
125
+ - Convergence: +0.5 × available_island_ratio
126
+ - Load loss penalty: -5.0 × (1 - available_load_ratio) at boundaries only
127
+ - Terminal win: +8.0 × (available_load_ratio)² if ≥50% load at step 30
128
+ - Terminal blackout: -12.0
129
+ - Grading (graders.py:124-174):
130
+ - Stage completion (30%): survived stages 1, 2, 3
131
+ - Load preservation (40%): available_load_ratio at end
132
+ - Island quality (20%): majority islands viable at boundaries
133
+ - Speed bonus (10%): how fast stability returned each stage
134
+ - Latest eval: 0.929 (31x improvement from 0.027)
135
+
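A sketch of the Task 4 terminal reward described above, assuming the ≥50% load threshold gates the win bonus and that intermediate outcomes contribute nothing; illustrative only, not the repo's implementation.

```python
def task4_terminal_reward(available_load_ratio: float, blackout: bool) -> float:
    # Terminal win/blackout from the Task 4 notes; other cases assumed 0.0
    if blackout:
        return -12.0
    if available_load_ratio >= 0.5:
        return 8.0 * available_load_ratio ** 2   # quadratic load-preservation bonus
    return 0.0
```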
136
+ ## Planner architecture
137
+
138
+ `inference.py` now uses this flow:
139
+
140
+ 1. `reset()` live episode
141
+ 2. `state()` to obtain `episode_id`
142
+ 3. `planning_context(episode_id)` for graph intelligence and redispatchable generators
143
+ 4. LLM proposes 3 candidate actions
144
+ 5. `simulate_candidates(episode_id, actions)` on the live server session
145
+ 6. LLM selects the safest simulated action
146
+ 7. `step(action)`
147
+
148
+ This avoids the old replay-mirror drift problem.
149
+
150
+ ## Local Docker workflow
151
+
152
+ Build:
153
+
154
+ ```bash
155
+ cd grid2op_env
156
+ docker build -t grid2op-env:local -f server/Dockerfile .
157
+ ```
158
+
159
+ Run:
160
+
161
+ ```bash
162
+ docker run --rm -p 7860:7860 grid2op-env:local
163
+ ```
164
+
165
+ If your Qwen-compatible API is running on the host machine, use:
166
+
167
+ ```bash
168
+ docker run --rm \
169
+ --add-host=host.docker.internal:host-gateway \
170
+ -e OPENAI_BASE_URL=http://host.docker.internal:8000/v1 \
171
+ -e OPENAI_API_KEY=EMPTY \
172
+ -p 7860:7860 \
173
+ grid2op-env:local
174
+ ```
175
+
176
+ ## Local UV workflow
177
+
178
+ ```bash
179
+ cd grid2op_env
180
+ env UV_CACHE_DIR=/tmp/uv-cache uv run --no-dev grid2op-smoke --task-id single_fault --steps 1
181
+ env UV_CACHE_DIR=/tmp/uv-cache uv run --no-dev server --port 7860
182
+ env UV_CACHE_DIR=/tmp/uv-cache uv run --extra dev pytest tests/test_grid2op_env.py -q
183
+ ```
184
+
185
+ ## Qwen baseline
186
+
187
+ The baseline uses the OpenAI Python SDK against a local Chat Completions API.
188
+
189
+ ```bash
190
+ cat > .env <<'EOF'
191
+ OPENAI_BASE_URL=http://localhost:8000/v1
192
+ OPENAI_API_KEY=EMPTY
193
+ OPENAI_MODEL=cyankiwi/Qwen3.5-9B-AWQ-4bit
194
+ EOF
195
+
196
+ env UV_CACHE_DIR=/tmp/uv-cache uv run --no-dev inference.py
197
+ ```
198
+
199
+ ## Important runtime note
200
+
201
+ After changing server code, restart the Grid2Op server before running `inference.py`. The planner depends on the live server routes `/planning_context` and `/simulate`.
202
+
203
+ ## Latest verified result
204
+
205
+ Latest saved run:
206
+
207
+ - `single_fault`: `0.752`
208
+ - `n_minus_1`: `0.952`
209
+ - `cascade_prevent`: `0.798`
210
+ - `multi_stage_cascade`: `0.929`
211
+
212
+ This confirms the server-side simulation path is active.
213
+
214
+ ## Architecture Documentation
215
+
216
+ - [architecture/task_1_architecture.md](architecture/task_1_architecture.md) - Task 1 detailed walkthrough
217
+ - [architecture/task_2_architecture.md](architecture/task_2_architecture.md) - Task 2 N-1 contingency management
218
+ - [architecture/task_3_architecture.md](architecture/task_3_architecture.md) - Task 3 cascade prevention
219
+ - [architecture/task_4_architecture.md](architecture/task_4_architecture.md) - Task 4 multi-stage cascade management
220
+ - [architecture/architecture.md](architecture/architecture.md) - Overall system architecture
grid2op_env/__init__.py → __init__.py RENAMED
File without changes
grid2op_env/client.py → client.py RENAMED
File without changes
evaluation.md → docs/evaluation.md RENAMED
@@ -923,3 +923,126 @@ Summary scores:
923
  }
924
  ```
925
 
926
+ ## Run 20260407_130112
927
+
928
+ - Model: `openai/gpt-oss-20b:groq`
929
+ - Tasks: `single_fault, n_minus_1, cascade_prevent, multi_stage_cascade`
930
+ - Seeds: `0` to `4`
931
+ - Scenario mode: `benchmark`
932
+ - Sampling: `temperature=0.7`, `top_p=0.8`, `top_k=20`, `min_p=0.0`, `presence_penalty=1.5`, `repetition_penalty=1.0`
933
+ - JSON output: [/home/sidharth/Desktop/Openenv_modules/grid2op_env/outputs/evals/baseline_eval_20260407_130112.json](/home/sidharth/Desktop/Openenv_modules/grid2op_env/outputs/evals/baseline_eval_20260407_130112.json)
934
+ - CSV output: [/home/sidharth/Desktop/Openenv_modules/grid2op_env/outputs/evals/baseline_eval_20260407_130112.csv](/home/sidharth/Desktop/Openenv_modules/grid2op_env/outputs/evals/baseline_eval_20260407_130112.csv)
935
+ - Log file: [/home/sidharth/Desktop/Openenv_modules/grid2op_env/outputs/logs/baseline_run_20260407_130112.log](/home/sidharth/Desktop/Openenv_modules/grid2op_env/outputs/logs/baseline_run_20260407_130112.log)
936
+
937
+ | Task | Tier | Mean Score | Mean Episode Length | Mean Time (s) | Mean Do-Nothing Steps |
938
+ | --- | --- | ---: | ---: | ---: | ---: |
939
+ | `single_fault` | `single_fault_easy` | `0.750000` | `10.00` | `30.82` | `9.00` |
940
+ | `single_fault` | `single_fault_moderate` | `0.725000` | `10.00` | `30.94` | `7.50` |
941
+ | `single_fault` | `single_fault_severe` | `1.000000` | `1.00` | `5.72` | `0.00` |
942
+ | `n_minus_1` | `n_minus_1_fixed` | `0.645333` | `20.00` | `57.54` | `17.00` |
943
+ | `cascade_prevent` | `cascade_prevent_easy` | `1.000000` | `30.00` | `86.27` | `28.50` |
944
+ | `cascade_prevent` | `cascade_prevent_medium` | `1.000000` | `30.00` | `87.11` | `25.00` |
945
+ | `cascade_prevent` | `cascade_prevent_extreme` | `0.596666` | `16.50` | `49.61` | `15.50` |
946
+ | `multi_stage_cascade` | `multi_stage_cascade_expert` | `0.831466` | `28.40` | `96.60` | `9.80` |
947
+
948
+ Summary scores:
949
+ ```json
950
+ {
951
+ "model": "openai/gpt-oss-20b:groq",
952
+ "scores": {
953
+ "single_fault": 0.825,
954
+ "n_minus_1": 0.645333,
955
+ "cascade_prevent": 0.865556,
956
+ "multi_stage_cascade": 0.831466
957
+ },
958
+ "episode_lengths": {
959
+ "single_fault": 7,
960
+ "n_minus_1": 20,
961
+ "cascade_prevent": 26,
962
+ "multi_stage_cascade": 28
963
+ }
964
+ }
965
+ ```
966
+
967
+ ## Run 20260407_145958
968
+
969
+ - Model: `openai/gpt-oss-20b:groq`
970
+ - Tasks: `single_fault, n_minus_1, cascade_prevent, multi_stage_cascade`
971
+ - Seeds: `0` to `4`
972
+ - Scenario mode: `benchmark`
973
+ - Sampling: `temperature=0.7`, `top_p=0.8`, `top_k=20`, `min_p=0.0`, `presence_penalty=1.5`, `repetition_penalty=1.0`
974
+ - JSON output: [/home/sidharth/Desktop/Openenv_modules/grid2op_env/outputs/evals/baseline_eval_20260407_145958.json](/home/sidharth/Desktop/Openenv_modules/grid2op_env/outputs/evals/baseline_eval_20260407_145958.json)
975
+ - CSV output: [/home/sidharth/Desktop/Openenv_modules/grid2op_env/outputs/evals/baseline_eval_20260407_145958.csv](/home/sidharth/Desktop/Openenv_modules/grid2op_env/outputs/evals/baseline_eval_20260407_145958.csv)
976
+ - Log file: [/home/sidharth/Desktop/Openenv_modules/grid2op_env/outputs/logs/baseline_run_20260407_145958.log](/home/sidharth/Desktop/Openenv_modules/grid2op_env/outputs/logs/baseline_run_20260407_145958.log)
977
+
978
+ | Task | Tier | Mean Score | Mean Episode Length | Mean Time (s) | Mean Do-Nothing Steps |
979
+ | --- | --- | ---: | ---: | ---: | ---: |
980
+ | `single_fault` | `single_fault_easy` | `0.750000` | `10.00` | `30.10` | `9.00` |
981
+ | `single_fault` | `single_fault_moderate` | `0.725000` | `10.00` | `28.88` | `7.50` |
982
+ | `single_fault` | `single_fault_severe` | `1.000000` | `1.00` | `6.25` | `0.00` |
983
+ | `n_minus_1` | `n_minus_1_fixed` | `0.575000` | `20.00` | `59.77` | `17.25` |
984
+ | `cascade_prevent` | `cascade_prevent_easy` | `1.000000` | `30.00` | `92.46` | `29.00` |
985
+ | `cascade_prevent` | `cascade_prevent_medium` | `1.000000` | `30.00` | `94.56` | `27.00` |
986
+ | `cascade_prevent` | `cascade_prevent_extreme` | `0.596666` | `16.50` | `50.18` | `16.00` |
987
+ | `multi_stage_cascade` | `multi_stage_cascade_expert` | `0.917766` | `30.00` | `94.55` | `10.00` |
988
+
989
+ Summary scores:
990
+ ```json
991
+ {
992
+ "model": "openai/gpt-oss-20b:groq",
993
+ "scores": {
994
+ "single_fault": 0.825,
995
+ "n_minus_1": 0.575,
996
+ "cascade_prevent": 0.865556,
997
+ "multi_stage_cascade": 0.917766
998
+ },
999
+ "episode_lengths": {
1000
+ "single_fault": 7,
1001
+ "n_minus_1": 20,
1002
+ "cascade_prevent": 26,
1003
+ "multi_stage_cascade": 30
1004
+ }
1005
+ }
1006
+ ```
1007
+
1008
+ ## Run 20260407_163224
1009
+
1010
+ - Model: `openai/gpt-oss-20b:groq`
1011
+ - Tasks: `single_fault, n_minus_1, cascade_prevent, multi_stage_cascade`
1012
+ - Seeds: `0` to `4`
1013
+ - Scenario mode: `benchmark`
1014
+ - Sampling: `temperature=0.7`, `top_p=0.8`, `top_k=20`, `min_p=0.0`, `presence_penalty=1.5`, `repetition_penalty=1.0`
1015
+ - JSON output: [/home/sidharth/Desktop/Openenv_modules/grid2op_env/outputs/evals/baseline_eval_20260407_163224.json](/home/sidharth/Desktop/Openenv_modules/grid2op_env/outputs/evals/baseline_eval_20260407_163224.json)
1016
+ - CSV output: [/home/sidharth/Desktop/Openenv_modules/grid2op_env/outputs/evals/baseline_eval_20260407_163224.csv](/home/sidharth/Desktop/Openenv_modules/grid2op_env/outputs/evals/baseline_eval_20260407_163224.csv)
1017
+ - Log file: [/home/sidharth/Desktop/Openenv_modules/grid2op_env/outputs/logs/baseline_run_20260407_163224.log](/home/sidharth/Desktop/Openenv_modules/grid2op_env/outputs/logs/baseline_run_20260407_163224.log)
1018
+
1019
+ | Task | Tier | Mean Score | Mean Episode Length | Mean Time (s) | Mean Do-Nothing Steps |
1020
+ | --- | --- | ---: | ---: | ---: | ---: |
1021
+ | `single_fault` | `single_fault_easy` | `0.750000` | `10.00` | `19.32` | `10.00` |
1022
+ | `single_fault` | `single_fault_moderate` | `0.750000` | `10.00` | `18.22` | `10.00` |
1023
+ | `single_fault` | `single_fault_severe` | `0.750000` | `10.00` | `20.68` | `9.00` |
1024
+ | `n_minus_1` | `n_minus_1_fixed` | `0.576750` | `15.75` | `54.19` | `15.25` |
1025
+ | `cascade_prevent` | `cascade_prevent_easy` | `1.000000` | `30.00` | `97.00` | `28.50` |
1026
+ | `cascade_prevent` | `cascade_prevent_medium` | `1.000000` | `30.00` | `93.13` | `28.50` |
1027
+ | `cascade_prevent` | `cascade_prevent_extreme` | `0.596666` | `16.50` | `52.60` | `15.00` |
1028
+ | `multi_stage_cascade` | `multi_stage_cascade_expert` | `0.812543` | `28.00` | `94.51` | `8.50` |
1029
+
1030
+ Summary scores:
1031
+ ```json
1032
+ {
1033
+ "model": "openai/gpt-oss-20b:groq",
1034
+ "scores": {
1035
+ "single_fault": 0.75,
1036
+ "n_minus_1": 0.57675,
1037
+ "cascade_prevent": 0.865556,
1038
+ "multi_stage_cascade": 0.812543
1039
+ },
1040
+ "episode_lengths": {
1041
+ "single_fault": 10,
1042
+ "n_minus_1": 16,
1043
+ "cascade_prevent": 26,
1044
+ "multi_stage_cascade": 28
1045
+ }
1046
+ }
1047
+ ```
1048
+
graph_build.md → docs/graph_build.md RENAMED
File without changes
implementation.md → docs/implementation.md RENAMED
File without changes
grid2op_env/graph_analysis.py → graph_analysis.py RENAMED
File without changes
grid2op_env/.gitignore DELETED
@@ -1,9 +0,0 @@
1
- .venv/
2
- .pytest_cache/
3
- __pycache__/
4
- *.pyc
5
- outputs/logs/*
6
- !outputs/logs/.gitkeep
7
- outputs/evals/*
8
- !outputs/evals/.gitkeep
9
-
grid2op_env/README.md DELETED
@@ -1,220 +0,0 @@
1
- # Grid2Op Environment
2
-
3
- Standalone OpenEnv environment package for the full `PROJECT.md` design.
4
-
5
- The current planner uses server-side simulation on the live Grid2Op session. It does not rely on a replayed local mirror.
6
-
7
- ## File structure
8
-
9
- ```text
10
- grid2op_env/
11
- ├── .dockerignore
12
- ├── .env
13
- ├── .gitignore
14
- ├── __init__.py
15
- ├── models.py
16
- ├── client.py
17
- ├── inference.py
18
- ├── README.md
19
- ├── openenv.yaml
20
- ├── outputs/
21
- │ ├── logs/
22
- │ └── evals/
23
- ├── pyproject.toml
24
- └── server/
25
- ├── grid_environment.py
26
- ├── tasks.py
27
- ├── graders.py
28
- ├── app.py
29
- ├── requirements.txt
30
- └── Dockerfile
31
- ```
32
-
33
- The top-level package now follows the canonical OpenEnv environment layout:
34
-
35
- - `.dockerignore`
36
- - `__init__.py`
37
- - `models.py`
38
- - `client.py`
39
- - `README.md`
40
- - `openenv.yaml`
41
- - `pyproject.toml`
42
- - `outputs/logs`
43
- - `outputs/evals`
44
- - `server/`
45
-
46
- Supporting files outside the minimum template remain for quality and verification:
47
-
48
- - `inference.py`
49
- - `tests/test_grid2op_env.py`
50
- - helper server modules such as `tasks.py`, `graders.py`, and `logging_utils.py`
51
-
52
- ## What is implemented
53
-
54
- - Grid2Op core simulator using `l2rpn_case14_sandbox`
55
- - Typed `GridAction`, `GridObservation`, and `GridState`
56
- - Four tasks: `single_fault`, `n_minus_1`, `cascade_prevent`, `multi_stage_cascade`
57
- - Reset-time scenario injection and retry logic for non-convergent starts
58
- - Shaped reward, episode logging, and deterministic graders
59
- - OpenEnv WebSocket interface plus `/tasks`, `/grader`, and `/baseline`
60
- - Server-side planner support via:
61
- - `POST /planning_context`
62
- - `POST /simulate`
63
- - Qwen3.5 baseline using the Chat Completions API
64
- - Local Docker workflow with dataset pre-download
65
-
66
- ## Recent fixes
67
-
68
- 1. **Task 1 (single_fault) benchmark ranges corrected** (tasks.py:297-304):
69
- - `single_fault_easy`: 0.82-0.85 (was mathematically impossible 0.90-0.94)
70
- - `single_fault_moderate`: 0.86-0.89 (was 0.94-0.97)
71
- - `single_fault_severe`: 0.90-0.93 (was 0.96-0.99)
72
- - Warmup phase finds high-loading state in chronics, then agent has 10 steps to solve
73
-
74
- 2. **Task 1 reward function** (grid_environment.py:589-596):
75
- - Target achieved bonus: `1.0 / step_count` (rewards early solution)
76
- - Safe margin bonus: `0.05 × max(0.0, 1.0 - max_rho)`
77
- - Overload penalty: `0.2 × overloaded_count` (lines > 100%)
78
- - Redispatch penalty: `0.01 × MW` (discourages large interventions)
79
- - Failure penalty: `-5.0` if time limit reached without target
80
-
81
- 3. **Task 1 grading** (graders.py:28-55):
82
- - 70% weight on survival ratio
83
- - 50% target achieved bonus
84
- - Final state bonus (0.3 if below target, 0.15/+0.05, 0.05/+0.10)
85
- - Legacy success score for early completion: `1.0 - 0.08 × (step - 1)`
86
-
87
- 4. **Task 2 (n_minus_1) redesign** based on RL2Grid paper (grid_environment.py:598-609):
88
- - Three-component reward: `0.3×R_survive + 0.6×R_overload + 0.1×R_cost`
89
- - `R_survive`: +1.0 per step (constant survival signal)
90
- - `R_overload`: `(1/n) × Σ clip(1-ρ, -1, 1)` - loading margin quality
91
- - `R_cost`: `-0.05 × Σ|ΔMW|/max_ramp` (normalized redispatch cost)
92
- - Reconnection bonus: +2.0 when safely reconnecting (grid_environment.py:853-869)
93
- - Terminal: +10×(s/m)² quadratic survival, -15 blackout
94
- - Phase-aware grader (graders.py:58-83):
95
- - Emergency response (30%): cleared within 5 steps at rho < 0.92
96
- - Sustained security (50%): steps 6-20 at rho < 0.90
97
- - Reconnection (20%): did agent reconnect line 0?
98
- - N-1 security score (bridge lines) in prompt
99
- - **Grading now honest**: score = survival_ratio × mastery_score (no override)
100
- - Latest eval: 0.952 (was 1.0 with old override)
101
-
102
- 5. **Task 3 (cascade_prevent)** (grid_environment.py:611-628):
103
- - 1-2 lines disconnected at reset + 5-15% load increase
104
- - Key metric: `timestep_overflow` countdowns (not just max_rho)
105
- - Quadratic overflow penalty: `-0.05 × Σ(overflow²)` - line at overflow=2 is 4x more urgent than overflow=1
106
- - Reward components:
107
- - Cascade prevention: +0.3 if no auto-trip, -2.5 if auto-trip
108
- - Thermal margin: +0.1 × mean(clip(1-ρ, -1, 1))
109
- - Terminal: +5.0 × (1 - auto_trips/5)² survival bonus, -12.0 blackout
110
- - Grading (graders.py:86-121):
111
- - Cascade containment (50%): steps without auto-trips / 30
112
- - Thermal stability (30%): safe_steps / containment_steps
113
- - Recovery speed (20%): how fast recovered from first overload
114
- - Latest eval: 0.798 (hard/extreme tiers challenging)
115
-
116
- 6. **Task 4 (multi_stage_cascade)** (tasks.py:334-337, grid_environment.py:630-647):
117
- - 3 lines disconnected at reset + **20% load increase** (not 15%)
118
- - Three explicit stages (10 steps each) with stage boundaries at step 10 and 20
119
- - Overflow window: 2 (faster cascades than default 3)
120
- - Do-nothing survival probe: 5 steps minimum
121
- - Island availability assessment at stage boundaries (grid_environment.py:767-814)
122
- - Candidate filtering (inference.py:1003-1030): filters unsafe topology disconnects
123
- - Reward (grid_environment.py:630-647):
124
- - Generation cost: -0.02 × (total_gen / initial_load)
125
- - Convergence: +0.5 × available_island_ratio
126
- - Load loss penalty: -5.0 × (1 - available_load_ratio) at boundaries only
127
- - Terminal win: +8.0 × (available_load_ratio)² if ≥50% load at step 30
128
- - Terminal blackout: -12.0
129
- - Grading (graders.py:124-174):
130
- - Stage completion (30%): survived stages 1, 2, 3
131
- - Load preservation (40%): available_load_ratio at end
132
- - Island quality (20%): majority islands viable at boundaries
133
- - Speed bonus (10%): how fast stability returned each stage
134
- - Latest eval: 0.929 (31x improvement from 0.027)
135
-
136
- ## Planner architecture
137
-
138
- `inference.py` now uses this flow:
139
-
140
- 1. `reset()` live episode
141
- 2. `state()` to obtain `episode_id`
142
- 3. `planning_context(episode_id)` for graph intelligence and redispatchable generators
143
- 4. LLM proposes 3 candidate actions
144
- 5. `simulate_candidates(episode_id, actions)` on the live server session
145
- 6. LLM selects the safest simulated action
146
- 7. `step(action)`
147
-
148
- This avoids the old replay-mirror drift problem.
149
-
150
- ## Local Docker workflow
151
-
152
- Build:
153
-
154
- ```bash
155
- cd grid2op_env
156
- docker build -t grid2op-env:local -f server/Dockerfile .
157
- ```
158
-
159
- Run:
160
-
161
- ```bash
162
- docker run --rm -p 7860:7860 grid2op-env:local
163
- ```
164
-
165
- If your Qwen-compatible API is running on the host machine, use:
166
-
167
- ```bash
168
- docker run --rm \
169
- --add-host=host.docker.internal:host-gateway \
170
- -e OPENAI_BASE_URL=http://host.docker.internal:8000/v1 \
171
- -e OPENAI_API_KEY=EMPTY \
172
- -p 7860:7860 \
173
- grid2op-env:local
174
- ```
175
-
176
- ## Local UV workflow
177
-
178
- ```bash
179
- cd grid2op_env
180
- env UV_CACHE_DIR=/tmp/uv-cache uv run --no-dev grid2op-smoke --task-id single_fault --steps 1
181
- env UV_CACHE_DIR=/tmp/uv-cache uv run --no-dev server --port 7860
182
- env UV_CACHE_DIR=/tmp/uv-cache uv run --extra dev pytest tests/test_grid2op_env.py -q
183
- ```
184
-
185
- ## Qwen baseline
186
-
187
- The baseline uses the OpenAI Python SDK against a local Chat Completions API.
188
-
189
- ```bash
190
- cat > .env <<'EOF'
191
- OPENAI_BASE_URL=http://localhost:8000/v1
192
- OPENAI_API_KEY=EMPTY
193
- OPENAI_MODEL=cyankiwi/Qwen3.5-9B-AWQ-4bit
194
- EOF
195
-
196
- env UV_CACHE_DIR=/tmp/uv-cache uv run --no-dev inference.py
197
- ```
198
-
199
- ## Important runtime note
200
-
201
- After changing server code, restart the Grid2Op server before running `inference.py`. The planner depends on the live server routes `/planning_context` and `/simulate`.
202
-
203
- ## Latest verified result
204
-
205
- Latest saved run:
206
-
207
- - `single_fault`: `0.752`
208
- - `n_minus_1`: `0.952`
209
- - `cascade_prevent`: `0.798`
210
- - `multi_stage_cascade`: `0.929`
211
-
212
- This confirms the server-side simulation path is active.
213
-
214
- ## Architecture Documentation
215
-
216
- - [architecture/task_1_architecture.md](/home/sidharth/Desktop/Openenv_modules/architecture/task_1_architecture.md) - Task 1 detailed walkthrough
217
- - [architecture/task_2_architecture.md](/home/sidharth/Desktop/Openenv_modules/architecture/task_2_architecture.md) - Task 2 N-1 contingency management
218
- - [architecture/task_3_architecture.md](/home/sidharth/Desktop/Openenv_modules/architecture/task_3_architecture.md) - Task 3 cascade prevention
219
- - [architecture/task_4_architecture.md](/home/sidharth/Desktop/Openenv_modules/architecture/task_4_architecture.md) - Task 4 multi-stage cascade management
220
- - [architecture/architecture.md](/home/sidharth/Desktop/Openenv_modules/architecture/architecture.md) - Overall system architecture
grid2op_env/pyproject.toml DELETED
@@ -1,35 +0,0 @@
1
- [build-system]
2
- requires = ["setuptools>=45", "wheel"]
3
- build-backend = "setuptools.build_meta"
4
-
5
- [project]
6
- name = "grid2op-env"
7
- version = "0.1.0"
8
- description = "Standalone OpenEnv wrapper around Grid2Op"
9
- readme = "README.md"
10
- requires-python = ">=3.10,<3.13"
11
- dependencies = [
12
- "openenv-core[core]>=0.2.2",
13
- "grid2op>=1.10.5",
14
- "numpy>=1.24.0",
15
- "openai>=2.7.2",
16
- "python-dotenv>=1.0.1",
17
- "requests>=2.31.0",
18
- ]
19
-
20
- [project.optional-dependencies]
21
- dev = [
22
- "pytest>=8.0.0",
23
- ]
24
- lightsim = [
25
- "lightsim2grid>=0.10.0",
26
- ]
27
-
28
- [project.scripts]
29
- server = "grid2op_env.server.app:main"
30
- grid2op-smoke = "grid2op_env.server.grid_environment:smoke_main"
31
-
32
- [tool.setuptools]
33
- include-package-data = true
34
- packages = ["grid2op_env", "grid2op_env.server"]
35
- package-dir = { "grid2op_env" = ".", "grid2op_env.server" = "server" }
grid2op_env/uv.lock DELETED
The diff for this file is too large to render. See raw diff
 
grid2op_env/inference.py → inference.py RENAMED
@@ -27,7 +27,7 @@ from grid2op_env.models import (
27
  from grid2op_env.server.tasks import TASKS, benchmark_tiers_for_task
28
 
29
 
30
- def configure_logging(level: int = logging.INFO) -> None:
31
  root_logger = logging.getLogger()
32
  if root_logger.handlers:
33
  root_logger.setLevel(level)
@@ -56,11 +56,18 @@ configure_logging()
56
  logger = logging.getLogger(__name__)
57
 
58
  TASK_SEED_OVERRIDES: dict[TaskId, int] = {
59
- "single_fault": 2,
 
60
  "cascade_prevent": 2,
 
61
  }
62
  HF_ROUTER_BASE_URL = "https://router.huggingface.co/v1"
63
- HF_ROUTER_DEFAULT_MODEL = "openai/gpt-oss-safeguard-20b:groq"
64
 
65
 
66
  @dataclass
@@ -94,36 +101,23 @@ class SimulationOutcome:
94
  raw_result: dict[str, Any]
95
 
96
 
97
- def _env_flag(name: str, default: bool) -> bool:
98
- raw_value = os.environ.get(name)
99
- if raw_value is None:
100
- return default
101
- return raw_value.strip().lower() in {"1", "true", "yes", "on"}
102
-
103
 
104
- def _use_local_llm_setup() -> bool:
105
- if "local_setup" in os.environ:
106
- return _env_flag("local_setup", True)
107
- return _env_flag("LOCAL_SETUP", True)
108
 
109
-
110
- def _default_model_name() -> str:
111
- if _use_local_llm_setup():
112
- return os.environ.get("OPENAI_MODEL", "Qwen/Qwen3.5-9B")
113
- return os.environ.get("HF_ROUTER_MODEL", HF_ROUTER_DEFAULT_MODEL)
114
 
115
 
116
  def _build_llm_client() -> OpenAI:
117
- if _use_local_llm_setup():
118
- return OpenAI()
119
- hf_token = os.environ.get("HF_TOKEN")
120
- if not hf_token:
121
  raise RuntimeError(
122
- "HF Router mode requires HF_TOKEN in the environment when local_setup=false."
123
  )
124
  return OpenAI(
125
- base_url=HF_ROUTER_BASE_URL,
126
- api_key=hf_token,
127
  )
128
 
129
 
@@ -138,17 +132,152 @@ def _chat_completion_kwargs(
138
  "temperature": llm_config.temperature,
139
  "top_p": llm_config.top_p,
140
  "presence_penalty": llm_config.presence_penalty,
 
141
  }
142
- if _use_local_llm_setup():
143
- request_kwargs["extra_body"] = {
144
- "top_k": llm_config.top_k,
145
- "min_p": llm_config.min_p,
146
- "repetition_penalty": llm_config.repetition_penalty,
147
- "chat_template_kwargs": {"enable_thinking": llm_config.enable_thinking},
148
- }
149
  return request_kwargs
150
 
151
 
152
  def run_baseline_suite(
153
  base_url: str,
154
  config: BaselineRequest | None = None,
@@ -158,9 +287,7 @@ def run_baseline_suite(
158
  run_paths = prepare_run_paths(timestamp)
159
  attach_file_logger(run_paths["log"])
160
 
161
- request_config = config or BaselineRequest(
162
- model=_default_model_name()
163
- )
164
  llm_config = BaselineConfig(
165
  model=request_config.model,
166
  max_tokens=request_config.max_tokens,
@@ -182,12 +309,12 @@ def run_baseline_suite(
182
  episode_lengths: Dict[TaskId, int] = {}
183
  evaluation_records: list[dict[str, Any]] = []
184
  logger.info(
185
- "Starting baseline suite base_url=%s model=%s num_seeds=%s seed_start=%s local_setup=%s",
186
  base_url,
 
187
  llm_config.model,
188
  llm_config.num_seeds,
189
  llm_config.seed_start,
190
- _use_local_llm_setup(),
191
  )
192
 
193
  with GridEnv(base_url=base_url).sync() as env:
@@ -528,6 +655,25 @@ def choose_action_with_qwen(
528
  },
529
  }
530
531
  final_prompt = build_final_selection_prompt(
532
  task_id=task_id,
533
  observation=observation,
@@ -610,12 +756,26 @@ def build_proposal_prompt(
610
  majority_islands_available = bool(
611
  observation.metadata.get("majority_islands_available", False)
612
  )
613
  lines = [
614
  "You are a grid operator proposing actions for a deterministic simulator.",
615
  "Propose exactly 3 candidate actions to test in the physics sandbox.",
616
  "Allowed action types: disconnect_line, reconnect_line, redispatch, do_nothing.",
617
  "Return a single JSON object only.",
618
- 'Use this exact schema: {"candidates":[{"action_type":"disconnect_line|reconnect_line|redispatch|do_nothing","line_id":null|int,"gen_id":null|int,"delta_mw":null|float,"reason":"short string"}]}',
619
  "Rules: no markdown, no prose, no code fences, no extra keys, exactly 3 candidates.",
620
  "Diversity rule: use at least two different action types when plausible.",
621
  "CRITICAL PHYSICS RULE: You must prioritize candidates from the sensitivity_guidance list. These actions have been mathematically verified by power-flow sensitivity factors to reduce the load on the stressed line.",
@@ -648,6 +808,10 @@ def build_proposal_prompt(
648
  6,
649
  "TASK RULE: For single_fault, do not propose disconnect_line or reconnect_line. Use redispatch and do_nothing only. Solve congestion by shifting generation, not by cutting topology.",
650
  )
 
 
 
 
651
  if task_id == "n_minus_1":
652
  danger_lines = [
653
  entry for entry in stressed_lines if float(entry["rho"]) >= 0.92
@@ -666,22 +830,50 @@ def build_proposal_prompt(
666
  )
667
  lines.insert(
668
  7,
669
- f"N-1 STRUCTURAL SECURITY: score={float(graph_intelligence.get('n1_security_score', 0.0)):.3f}; bridge_lines={json.dumps(graph_intelligence.get('bridge_lines', []), separators=(',', ':'))}",
670
  )
671
  lines.insert(
672
  8,
673
- "THRESHOLDS: EMERGENCY if any line rho >= 0.92, WARNING for 0.80 <= rho < 0.92, SAFE if all lines are below 0.80.",
674
  )
675
  lines.insert(
676
  9,
677
- "EMERGENCY_LINES=" + json.dumps(danger_lines, separators=(",", ":")),
678
  )
679
  lines.insert(
680
  10,
681
- "WARNING_LINES=" + json.dumps(warning_lines, separators=(",", ":")),
682
  )
683
  lines.insert(
684
  11,
 
 
 
685
  "RECONNECT_WINDOW_LINES="
686
  + json.dumps(cooldown_zero_lines, separators=(",", ":")),
687
  )
@@ -806,6 +998,19 @@ def build_final_selection_prompt(
806
  7,
807
  "RULE: If a simulated candidate safely reduces max_rho compared to the current state, you MUST select it over do_nothing, no matter how small the reduction is. Do not choose do_nothing unless every other candidate increases max_rho or causes a failure. Safe, incremental redispatch improvements are the only way to win.",
808
  )
809
  if task_id == "multi_stage_cascade":
810
  lines.insert(
811
  7,
@@ -927,7 +1132,14 @@ def parse_candidate_proposals(
927
  task_id: TaskId = "n_minus_1",
928
  ) -> tuple[list[tuple[GridAction, dict[str, Any]]], dict[str, Any]]:
929
  payload = parse_json_action(content)
930
- raw_candidates = payload.get("candidates", [])
 
931
  candidates: list[tuple[GridAction, dict[str, Any]]] = []
932
  if isinstance(raw_candidates, list):
933
  for item in raw_candidates[:3]:
@@ -1649,15 +1861,23 @@ def append_evaluation_markdown(
1649
 
1650
  if __name__ == "__main__":
1651
  parser = argparse.ArgumentParser()
1652
  parser.add_argument(
1653
  "--task-id",
1654
  dest="task_ids",
1655
  nargs="+",
1656
  choices=sorted(TASKS.keys()),
1657
- help="Run only the selected task ids. Defaults to all tasks.",
1658
  )
1659
  args = parser.parse_args()
1660
 
1661
- base_url = os.environ.get("GRID2OP_BASE_URL", "http://127.0.0.1:7860")
1662
- result = run_baseline_suite(base_url=base_url, task_ids=args.task_ids)
1663
- print(result.model_dump_json(indent=2))
27
  from grid2op_env.server.tasks import TASKS, benchmark_tiers_for_task
28
 
29
 
30
+ def configure_logging(level: int = logging.WARNING) -> None:
31
  root_logger = logging.getLogger()
32
  if root_logger.handlers:
33
  root_logger.setLevel(level)
 
56
  logger = logging.getLogger(__name__)
57
 
58
  TASK_SEED_OVERRIDES: dict[TaskId, int] = {
59
+ "single_fault": 1,
60
+ "n_minus_1": 4,
61
  "cascade_prevent": 2,
62
+ "multi_stage_cascade": 4,
63
  }
64
  HF_ROUTER_BASE_URL = "https://router.huggingface.co/v1"
65
+ HF_ROUTER_DEFAULT_MODEL = "openai/gpt-oss-20b:groq"
66
+ DEFAULT_ENV_BASE_URL = "http://127.0.0.1:7860"
67
+ DEFAULT_BENCHMARK_NAME = "grid2op_env"
68
+ SUBMISSION_SUCCESS_SCORE_THRESHOLD = float(
69
+ os.getenv("SUCCESS_SCORE_THRESHOLD", "0.1")
70
+ )
71
 
72
 
73
  @dataclass
 
101
  raw_result: dict[str, Any]
102
 
103
 
104
+ def _default_model_name() -> str:
105
+ return os.environ.get("MODEL_NAME", HF_ROUTER_DEFAULT_MODEL)
 
 
 
 
106
 
 
 
 
 
107
 
108
+ def _llm_api_base_url() -> str:
109
+ return os.environ.get("API_BASE_URL", HF_ROUTER_BASE_URL)
 
 
 
110
 
111
 
112
  def _build_llm_client() -> OpenAI:
113
+ api_key = os.environ.get("HF_TOKEN") or os.environ.get("API_KEY")
114
+ if not api_key:
 
 
115
  raise RuntimeError(
116
+ "Set HF_TOKEN or API_KEY to use Hugging Face Router inference."
117
  )
118
  return OpenAI(
119
+ base_url=_llm_api_base_url(),
120
+ api_key=api_key,
121
  )
122
 
123
 
 
132
  "temperature": llm_config.temperature,
133
  "top_p": llm_config.top_p,
134
  "presence_penalty": llm_config.presence_penalty,
135
+ "stream": False,
136
  }
 
 
 
 
 
 
 
137
  return request_kwargs
138
 
139
 
140
+ def log_start(task: str, env: str, model: str) -> None:
141
+ print(f"[START] task={task} env={env} model={model}", flush=True)
142
+
143
+
144
+ def log_step(
145
+ step: int,
146
+ action: GridAction,
147
+ reward: float,
148
+ done: bool,
149
+ error: str | None,
150
+ ) -> None:
151
+ error_val = error if error else "null"
152
+ done_val = str(done).lower()
153
+ action_str = json.dumps(action.model_dump(), separators=(",", ":"), sort_keys=True)
154
+ print(
155
+ f"[STEP] step={step} action={action_str} reward={reward:.2f} done={done_val} error={error_val}",
156
+ flush=True,
157
+ )
158
+
159
+
160
+ def log_end(success: bool, steps: int, score: float, rewards: list[float]) -> None:
161
+ rewards_str = ",".join(f"{reward:.2f}" for reward in rewards)
162
+ print(
163
+ f"[END] success={str(success).lower()} steps={steps} score={score:.3f} rewards={rewards_str}",
164
+ flush=True,
165
+ )
166
+
167
+
168
+ def run_submission_episodes(task_ids: Sequence[TaskId] | None = None) -> dict[TaskId, float]:
169
+ base_url = os.environ.get("GRID2OP_BASE_URL", DEFAULT_ENV_BASE_URL)
170
+ benchmark_name = os.environ.get("GRID2OP_BENCHMARK", DEFAULT_BENCHMARK_NAME)
171
+ scenario_mode = os.environ.get("GRID2OP_SCENARIO_MODE", "benchmark")
172
+ selected_task_ids = list(task_ids) if task_ids is not None else list(TASKS.keys())
173
+ llm_config = BaselineConfig(
174
+ model=_default_model_name(),
175
+ max_tokens=int(os.environ.get("MAX_TOKENS", "1200")),
176
+ temperature=float(os.environ.get("TEMPERATURE", "0.7")),
177
+ top_p=float(os.environ.get("TOP_P", "0.8")),
178
+ presence_penalty=float(os.environ.get("PRESENCE_PENALTY", "1.5")),
179
+ top_k=int(os.environ.get("TOP_K", "20")),
180
+ min_p=float(os.environ.get("MIN_P", "0.0")),
181
+ repetition_penalty=float(os.environ.get("REPETITION_PENALTY", "1.0")),
182
+ enable_thinking=False,
183
+ num_seeds=int(os.environ.get("NUM_SEEDS", "5")),
184
+ seed_start=int(os.environ.get("SEED_START", "0")),
185
+ scenario_mode=scenario_mode, # type: ignore[arg-type]
186
+ )
187
+ client = _build_llm_client()
188
+ task_scores: dict[TaskId, float] = {}
189
+ with GridEnv(base_url=base_url).sync() as env:
190
+ for task_id in selected_task_ids:
191
+ task = TASKS[task_id]
192
+ benchmark_tiers = benchmark_tiers_for_task(task_id)
193
+ task_num_seeds = TASK_SEED_OVERRIDES.get(task_id, llm_config.num_seeds)
194
+ task_episode_scores: list[float] = []
195
+ for benchmark_tier in benchmark_tiers:
196
+ for seed in range(
197
+ llm_config.seed_start, llm_config.seed_start + task_num_seeds
198
+ ):
199
+ rewards: list[float] = []
200
+ steps_taken = 0
201
+ score = 0.0
202
+ success = False
203
+ log_start(task=task_id, env=benchmark_name, model=llm_config.model)
204
+ try:
205
+ result = env.reset(
206
+ task_id=task_id,
207
+ seed=seed,
208
+ difficulty_level=1,
209
+ scenario_mode=scenario_mode, # type: ignore[arg-type]
210
+ benchmark_tier=benchmark_tier,
211
+ )
212
+ state = env.state()
213
+ step_idx = 0
214
+
215
+ while not result.done and step_idx < task.max_steps:
216
+ action, _planning_trace = choose_action_with_qwen(
217
+ client=client,
218
+ env=env,
219
+ episode_id=state.episode_id,
220
+ task_id=task_id,
221
+ observation=result.observation,
222
+ step_count=step_idx,
223
+ max_steps=task.max_steps,
224
+ include_task_description=(step_idx == 0),
225
+ llm_config=llm_config,
226
+ )
227
+ error: str | None = None
228
+ try:
229
+ result = env.step(action)
230
+ except Exception as exc:
231
+ error = str(exc)
232
+ log_step(
233
+ step=step_idx + 1,
234
+ action=action,
235
+ reward=0.0,
236
+ done=True,
237
+ error=error,
238
+ )
239
+ raise
240
+ reward = float(result.reward or 0.0)
241
+ rewards.append(reward)
242
+ steps_taken = step_idx + 1
243
+ log_step(
244
+ step=steps_taken,
245
+ action=action,
246
+ reward=reward,
247
+ done=bool(result.done),
248
+ error=error,
249
+ )
250
+ step_idx += 1
251
+
252
+ state = env.state()
253
+ response = requests.post(
254
+ f"{base_url}/grader",
255
+ json={
256
+ "task_id": task_id,
257
+ "episode_log": [
258
+ entry.model_dump() for entry in state.episode_log
259
+ ],
260
+ },
261
+ timeout=60,
262
+ )
263
+ response.raise_for_status()
264
+ score = float(response.json()["score"])
265
+ task_episode_scores.append(score)
266
+ success = score >= SUBMISSION_SUCCESS_SCORE_THRESHOLD
267
+ finally:
268
+ log_end(
269
+ success=success,
270
+ steps=steps_taken,
271
+ score=score,
272
+ rewards=rewards,
273
+ )
274
+ task_scores[task_id] = (
275
+ round(mean(task_episode_scores), 6) if task_episode_scores else 0.0
276
+ )
277
+
278
+ return task_scores
279
+
280
+
281
  def run_baseline_suite(
282
  base_url: str,
283
  config: BaselineRequest | None = None,
 
287
  run_paths = prepare_run_paths(timestamp)
288
  attach_file_logger(run_paths["log"])
289
 
290
+ request_config = config or BaselineRequest(model=_default_model_name())
 
 
291
  llm_config = BaselineConfig(
292
  model=request_config.model,
293
  max_tokens=request_config.max_tokens,
 
309
  episode_lengths: Dict[TaskId, int] = {}
310
  evaluation_records: list[dict[str, Any]] = []
311
  logger.info(
312
+ "Starting baseline suite base_url=%s llm_api_base_url=%s model=%s num_seeds=%s seed_start=%s",
313
  base_url,
314
+ _llm_api_base_url(),
315
  llm_config.model,
316
  llm_config.num_seeds,
317
  llm_config.seed_start,
 
318
  )
319
 
320
  with GridEnv(base_url=base_url).sync() as env:
 
655
  },
656
  }
657
 
658
+ if task_id == "single_fault":
659
+ selected_outcome = selectable_simulations[0]
660
+ return selected_outcome.action, {
661
+ "proposal_prompt": proposal_prompt,
662
+ "proposal_raw_output": proposal_raw_output,
663
+ "proposal_trace": {**proposal_trace, **prefilter_trace},
664
+ "graph_intelligence": graph_intelligence,
665
+ "simulations": [
666
+ serialize_simulation_outcome(outcome) for outcome in simulations
667
+ ],
668
+ "final_prompt": "",
669
+ "final_raw_output": "",
670
+ "final_trace": {
671
+ "decision": "single_call_ranked_selection",
672
+ "reason": selected_outcome.trace.get("reason", ""),
673
+ "selected_candidate": selected_outcome.candidate_index,
674
+ },
675
+ }
676
+
677
  final_prompt = build_final_selection_prompt(
678
  task_id=task_id,
679
  observation=observation,
 
756
  majority_islands_available = bool(
757
  observation.metadata.get("majority_islands_available", False)
758
  )
759
+ action_schema = (
760
+ '{"action_type":"disconnect_line|reconnect_line|redispatch|do_nothing","line_id":null|int,"gen_id":null|int,"delta_mw":null|float,"reason":"short string"}'
761
+ )
762
+ response_schema = (
763
+ '{"primary_action":'
764
+ + action_schema
765
+ + ',"backup_action_1":'
766
+ + action_schema
767
+ + ',"backup_action_2":'
768
+ + action_schema
769
+ + "}"
770
+ if task_id == "single_fault"
771
+ else '{"candidates":[' + action_schema + "," + action_schema + "," + action_schema + "]}"
772
+ )
773
  lines = [
774
  "You are a grid operator proposing actions for a deterministic simulator.",
775
  "Propose exactly 3 candidate actions to test in the physics sandbox.",
776
  "Allowed action types: disconnect_line, reconnect_line, redispatch, do_nothing.",
777
  "Return a single JSON object only.",
778
+ "Use this exact schema: " + response_schema,
779
  "Rules: no markdown, no prose, no code fences, no extra keys, exactly 3 candidates.",
780
  "Diversity rule: use at least two different action types when plausible.",
781
  "CRITICAL PHYSICS RULE: You must prioritize candidates from the sensitivity_guidance list. These actions have been mathematically verified by power-flow sensitivity factors to reduce the load on the stressed line.",
 
808
  6,
809
  "TASK RULE: For single_fault, do not propose disconnect_line or reconnect_line. Use redispatch and do_nothing only. Solve congestion by shifting generation, not by cutting topology.",
810
  )
811
+ lines.insert(
812
+ 7,
813
+ "TASK RULE: Rank your output strictly as primary_action first, then backup_action_1, then backup_action_2. The simulator will test all three and execute the highest-ranked safe option.",
814
+ )
815
  if task_id == "n_minus_1":
816
  danger_lines = [
817
  entry for entry in stressed_lines if float(entry["rho"]) >= 0.92
 
830
  )
831
  lines.insert(
832
  7,
833
+ f"FAULTED_LINE=0; disconnected_now={json.dumps([entry['line_id'] for entry in disconnected], separators=(',', ':'))}",
834
  )
835
  lines.insert(
836
  8,
837
+ f"N-1 PHASE={'emergency' if step_count < 5 else 'steady_state'}; emergency_window_steps_remaining={max(0, 5 - step_count)}",
838
  )
839
  lines.insert(
840
  9,
841
+ "EMERGENCY OBJECTIVE: In steps 1-5, prioritize actions that bring max_rho below 0.92 as fast as possible. Clearing the emergency window is the top priority.",
842
  )
843
  lines.insert(
844
  10,
845
+ "STEADY-STATE OBJECTIVE: From step 6 onward, prioritize keeping max_rho below 0.90 on as many steps as possible while preserving survivability.",
846
  )
847
  lines.insert(
848
  11,
849
+ "RECONNECTION OBJECTIVE: When line 0 cooldown reaches 0, include a reconnect_line candidate for line 0 unless graph intelligence or current overloads strongly suggest it is unsafe.",
850
+ )
851
+ lines.insert(
852
+ 12,
853
+ "CANDIDATE RULE: In the emergency phase, include at least one redispatch candidate aimed at immediate rho reduction. Do not fill the set with passive do_nothing-style choices.",
854
+ )
855
+ lines.insert(
856
+ 13,
857
+ "CANDIDATE RULE: If no action looks clearly better, still propose the smallest safe redispatch or a safe reconnect test rather than defaulting all candidates toward do_nothing.",
858
+ )
859
+ lines.insert(
860
+ 14,
861
+ f"N-1 STRUCTURAL SECURITY: score={float(graph_intelligence.get('n1_security_score', 0.0)):.3f}; bridge_lines={json.dumps(graph_intelligence.get('bridge_lines', []), separators=(',', ':'))}",
862
+ )
863
+ lines.insert(
864
+ 15,
865
+ "THRESHOLDS: EMERGENCY if any line rho >= 0.92, WARNING for 0.80 <= rho < 0.92, SAFE if all lines are below 0.80.",
866
+ )
867
+ lines.insert(
868
+ 16,
869
+ "EMERGENCY_LINES=" + json.dumps(danger_lines, separators=(",", ":")),
870
+ )
871
+ lines.insert(
872
+ 17,
873
+ "WARNING_LINES=" + json.dumps(warning_lines, separators=(",", ":")),
874
+ )
875
+ lines.insert(
876
+ 18,
877
  "RECONNECT_WINDOW_LINES="
878
  + json.dumps(cooldown_zero_lines, separators=(",", ":")),
879
  )
 
998
  7,
999
  "RULE: If a simulated candidate safely reduces max_rho compared to the current state, you MUST select it over do_nothing, no matter how small the reduction is. Do not choose do_nothing unless every other candidate increases max_rho or causes a failure. Safe, incremental redispatch improvements are the only way to win.",
1000
  )
1001
+ if task_id == "n_minus_1":
1002
+ lines.insert(
1003
+ 7,
1004
+ "RULE: In steps 1-5, prioritize candidates that clear the emergency by bringing max_rho below 0.92. Do not choose do_nothing in the emergency window if a safe simulated action lowers max_rho.",
1005
+ )
1006
+ lines.insert(
1007
+ 8,
1008
+ "RULE: When a safe reconnect_line action for line 0 is available after cooldown, strongly prefer it if it improves or preserves security.",
1009
+ )
1010
+ lines.insert(
1011
+ 9,
1012
+ "RULE: After step 5, prefer candidates that keep max_rho below 0.90 on future steps rather than merely surviving at higher stress.",
1013
+ )
1014
  if task_id == "multi_stage_cascade":
1015
  lines.insert(
1016
  7,
 
1132
  task_id: TaskId = "n_minus_1",
1133
  ) -> tuple[list[tuple[GridAction, dict[str, Any]]], dict[str, Any]]:
1134
  payload = parse_json_action(content)
1135
+ if task_id == "single_fault":
1136
+ raw_candidates = [
1137
+ payload.get("primary_action"),
1138
+ payload.get("backup_action_1"),
1139
+ payload.get("backup_action_2"),
1140
+ ]
1141
+ else:
1142
+ raw_candidates = payload.get("candidates", [])
1143
  candidates: list[tuple[GridAction, dict[str, Any]]] = []
1144
  if isinstance(raw_candidates, list):
1145
  for item in raw_candidates[:3]:
 
1861
 
1862
  if __name__ == "__main__":
1863
  parser = argparse.ArgumentParser()
1864
+ parser.add_argument(
1865
+ "--baseline-suite",
1866
+ action="store_true",
1867
+ help="Run the internal multi-task baseline suite instead of the submission episode runner.",
1868
+ )
1869
  parser.add_argument(
1870
  "--task-id",
1871
  dest="task_ids",
1872
  nargs="+",
1873
  choices=sorted(TASKS.keys()),
1874
+ help="Run only the selected task ids for --baseline-suite. Defaults to all tasks.",
1875
  )
1876
  args = parser.parse_args()
1877
 
1878
+ if args.baseline_suite:
1879
+ base_url = os.environ.get("GRID2OP_BASE_URL", DEFAULT_ENV_BASE_URL)
1880
+ result = run_baseline_suite(base_url=base_url, task_ids=args.task_ids)
1881
+ print(result.model_dump_json(indent=2))
1882
+ else:
1883
+ run_submission_episodes(task_ids=args.task_ids)
inference_speed_test.py DELETED
@@ -1,34 +0,0 @@
1
- import os
2
- import time
3
- from dotenv import load_dotenv
4
- from openai import OpenAI
5
-
6
- load_dotenv()
7
-
8
- client = OpenAI(
9
- base_url="https://router.huggingface.co/v1",
10
- api_key=os.environ["HF_TOKEN"],
11
- )
12
-
13
- start_time = time.perf_counter()
14
-
15
- completion = client.chat.completions.create(
16
- model="Qwen/Qwen3.5-9B:fastest",
17
- messages=[
18
- {
19
- "role": "user",
20
- "content": "Write a detailed essay about the history of artificial intelligence, its major milestones, and future implications. Include at least 500 words.",
21
- }
22
- ],
23
- max_tokens=1000,
24
- )
25
-
26
- end_time = time.perf_counter()
27
-
28
- latency = (end_time - start_time) * 1000
29
- tokens = completion.usage.completion_tokens
30
-
31
- print(f"Response: {completion.choices[0].message.content}")
32
- print(f"Latency: {latency:.2f} ms")
33
- print(f"Tokens: {tokens}")
34
- print(f"Throughput: {tokens / (latency / 1000):.2f} tokens/sec")
main.py DELETED
@@ -1,6 +0,0 @@
1
- def main():
2
- print("Hello from openenv-modules!")
3
-
4
-
5
- if __name__ == "__main__":
6
- main()
grid2op_env/models.py → models.py RENAMED
@@ -112,7 +112,7 @@ class GraderResponse(BaseModel):
112
 
113
  class BaselineRequest(BaseModel):
114
  model: str = Field(default="Qwen/Qwen3.5-9B")
115
- max_tokens: int = Field(default=500, ge=1)
116
  temperature: float = 0.7
117
  top_p: float = 0.8
118
  presence_penalty: float = 1.5
 
112
 
113
  class BaselineRequest(BaseModel):
114
  model: str = Field(default="Qwen/Qwen3.5-9B")
115
+ max_tokens: int = Field(default=1200, ge=1)
116
  temperature: float = 0.7
117
  top_p: float = 0.8
118
  presence_penalty: float = 1.5
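The `max_tokens` bump above (500 → 1200) leaves room for the three-candidate JSON responses. A dependency-free stand-in for the Pydantic model, reproducing the `ge=1` bound with a manual check (`BaselineRequestSketch` is a hypothetical name):

```python
from dataclasses import dataclass

@dataclass
class BaselineRequestSketch:
    # Defaults mirror grid2op_env.models.BaselineRequest after this commit
    model: str = "Qwen/Qwen3.5-9B"
    max_tokens: int = 1200  # raised from 500
    temperature: float = 0.7
    top_p: float = 0.8
    presence_penalty: float = 1.5

    def __post_init__(self) -> None:
        # Equivalent of Field(default=1200, ge=1)
        if self.max_tokens < 1:
            raise ValueError("max_tokens must be >= 1")
```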
grid2op_env/openenv.yaml → openenv.yaml RENAMED
File without changes
{grid2op_env/outputs → outputs}/evals/.gitkeep RENAMED
File without changes
{grid2op_env/outputs → outputs}/logs/.gitkeep RENAMED
File without changes
pyproject.toml CHANGED
@@ -1,11 +1,35 @@
1
  [project]
2
- name = "openenv-modules"
3
  version = "0.1.0"
4
- description = "Add your description here"
5
  readme = "README.md"
6
- requires-python = ">=3.13"
7
  dependencies = [
8
- "fastmcp>=3.1.1",
9
- "numba>=0.64.0",
10
- "openenv-core>=0.2.2",
  ]
 
1
+ [build-system]
2
+ requires = ["setuptools>=45", "wheel"]
3
+ build-backend = "setuptools.build_meta"
4
+
5
  [project]
6
+ name = "grid2op-env"
7
  version = "0.1.0"
8
+ description = "Standalone OpenEnv wrapper around Grid2Op"
9
  readme = "README.md"
10
+ requires-python = ">=3.10,<3.13"
11
  dependencies = [
12
+ "openenv-core[core]>=0.2.2",
13
+ "grid2op>=1.10.5",
14
+ "numpy>=1.24.0",
15
+ "openai>=2.7.2",
16
+ "python-dotenv>=1.0.1",
17
+ "requests>=2.31.0",
18
  ]
19
+
20
+ [project.optional-dependencies]
21
+ dev = [
22
+ "pytest>=8.0.0",
23
+ ]
24
+ lightsim = [
25
+ "lightsim2grid>=0.10.0",
26
+ ]
27
+
28
+ [project.scripts]
29
+ server = "grid2op_env.server.app:main"
30
+ grid2op-smoke = "grid2op_env.server.grid_environment:smoke_main"
31
+
32
+ [tool.setuptools]
33
+ include-package-data = true
34
+ packages = ["grid2op_env", "grid2op_env.server"]
35
+ package-dir = { "grid2op_env" = ".", "grid2op_env.server" = "server" }
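The scripts above reach the LLM through the env-var contract this commit standardizes on (`MODEL_NAME`, `API_BASE_URL`, and `HF_TOKEN`/`API_KEY`). A minimal sketch of the resolution logic, mirroring `_default_model_name`, `_llm_api_base_url`, and the key check in `_build_llm_client` (the function name `resolve_llm_config` is illustrative):

```python
import os

HF_ROUTER_BASE_URL = "https://router.huggingface.co/v1"
HF_ROUTER_DEFAULT_MODEL = "openai/gpt-oss-20b:groq"

def resolve_llm_config(environ=os.environ):
    # HF_TOKEN is preferred; API_KEY is the generic fallback
    api_key = environ.get("HF_TOKEN") or environ.get("API_KEY")
    if not api_key:
        raise RuntimeError("Set HF_TOKEN or API_KEY to use Hugging Face Router inference.")
    return {
        "base_url": environ.get("API_BASE_URL", HF_ROUTER_BASE_URL),
        "model": environ.get("MODEL_NAME", HF_ROUTER_DEFAULT_MODEL),
        "api_key": api_key,
    }
```

The returned dict maps directly onto the `OpenAI(base_url=..., api_key=...)` constructor call used in `inference.py`.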
{grid2op_env/server → server}/Dockerfile RENAMED
File without changes
{grid2op_env/server → server}/__init__.py RENAMED
File without changes
{grid2op_env/server → server}/app.py RENAMED
File without changes
{grid2op_env/server → server}/graders.py RENAMED
File without changes
{grid2op_env/server → server}/grid_environment.py RENAMED
File without changes
{grid2op_env/server → server}/logging_utils.py RENAMED
File without changes
{grid2op_env/server → server}/requirements.txt RENAMED
File without changes
{grid2op_env/server → server}/tasks.py RENAMED
File without changes
submission/README.md ADDED
@@ -0,0 +1,217 @@
1
+ # OpenEnv Hackathon Submission Requirements
2
+
3
+ ## Overview
4
+
5
+ This document outlines all requirements for submitting an environment to the OpenEnv Hackathon. All submissions must meet the criteria defined in this document to be evaluated.
6
+
7
+ ---
8
+
9
+ ## 1. Task Requirements
10
+
11
+ ### 1.1 Real-World Task Simulation
12
+ - The environment must simulate a task **humans actually do**
13
+ - **NOT** games or toys
14
+ - Examples of acceptable domains: email triage, code review, data cleaning, scheduling, customer support, content moderation, power grid management
15
+
16
+ ### 1.2 OpenEnv Spec Compliance
17
+ - Implement the full OpenEnv interface:
18
+ - Typed `Observation`, `Action`, and `Reward` Pydantic models
19
+ - `step(action)` → returns `observation, reward, done, info`
20
+ - `reset()` → returns initial observation
21
+ - `state()` → returns current state
22
+ - Include `openenv.yaml` with metadata
23
+ - Tested via `openenv validate`
24
+
25
+ ### 1.3 Minimum 3 Tasks with Agent Graders
26
+ - **Each task** must have:
27
+ - A concrete objective an agent must accomplish
28
+ - A programmatic grader that scores performance (0.0–1.0)
29
+ - Clear, deterministic success/failure criteria
30
+ - **Difficulty progression**: easy → medium → hard
31
+
32
+ ### 1.4 Meaningful Reward Function
33
+ - Provides signal over the full trajectory (not just binary end-of-episode)
34
+ - Rewards partial progress toward task completion
35
+ - Penalizes clearly undesirable behavior (e.g., infinite loops, destructive actions)
36
+
37
+ ---
38
+
39
+ ## 2. Functional Requirements
40
+
41
+ ### 2.1 Baseline Inference Script
42
+ - Must be named `inference.py` and placed in the **root directory**
43
+ - Use the OpenAI API client to run a model against the environment
44
+ - Read API credentials from environment variables:
45
+ - `API_BASE_URL` - The API endpoint for the LLM
46
+ - `MODEL_NAME` - The model identifier to use for inference
47
+ - `HF_TOKEN` - Your Hugging Face token / API key
48
+ - Produce a reproducible baseline score on all tasks
49
+
50
+ ### 2.2 Structured Logging
51
+ - Emit structured stdout logs strictly following the format:
52
+ - `[START]` - Episode start
53
+ - `[STEP]` - Each step
54
+ - `[END]` - Episode end
55
+ - Any deviation in field names, ordering, or formatting will result in incorrect evaluation
56
+
57
+ ---
58
+
59
+ ## 3. Deployment Requirements
60
+
61
+ ### 3.1 Hugging Face Spaces
62
+ - Environment must run as a containerized HF Space tagged with `openenv`
63
+ - Automated ping to the Space URL — must return 200 and respond to `reset()`
64
+
65
+ ### 3.2 Containerized Execution
66
+ - Must include a working `Dockerfile`
67
+ - The environment should start cleanly with `docker build && docker run`
68
+
69
+ ### 3.3 Infrastructure Restrictions
70
+ - Runtime of inference script should be **less than 20 minutes**
71
+ - Must run on a machine with `vcpu=2, memory=8gb`
72
+
73
+ ---
74
+
75
+ ## 4. Documentation Requirements
76
+
77
+ ### 4.1 README
78
+ Must include:
79
+ - Environment description and motivation
80
+ - Action and observation space definitions
81
+ - Task descriptions with expected difficulty
82
+ - Setup and usage instructions
83
+ - Baseline scores
84
+
85
+ ---
86
+
87
+ ## 5. Evaluation Criteria
88
+
89
+ ### 5.1 Parameter Weights
90
+
91
+ | Parameter | Weight | Description |
92
+ |-----------|--------|-------------|
93
+ | **Real-world utility** | 30% | Does the environment model a genuine task? Would someone actually use this to train or evaluate agents? |
94
+ | **Task & grader quality** | 25% | Are tasks well-defined with clear objectives? Do graders accurately and fairly measure success? Meaningful difficulty progression? |
95
+ | **Environment design** | 20% | Clean state management, sensible action/observation spaces, good reward shaping, proper episode boundaries |
96
+ | **Code quality & spec compliance** | 15% | Follows OpenEnv spec, clean project structure, typed models, documented, tested, Dockerfile works |
97
+ | **Creativity & novelty** | 10% | Novel problem domain, interesting mechanics, clever reward design, original approach |
98
+
99
+ ### 5.2 Scoring Breakdown
100
+
101
+ #### Real-world utility (30%)
102
+ - 0–5: Toy/artificial problem with no practical application
103
+ - 6–15: Valid domain but shallow modeling of the real task
104
+ - 16–25: Good domain modeling, would be useful for agent evaluation
105
+ - 26–30: Excellent — fills a real gap, immediate value for the RL/agent community
106
+
107
+ #### Task & grader quality (25%)
108
+ - ✅ 3+ tasks with difficulty range?
109
+ - ✅ Graders produce scores between 0.0–1.0?
110
+ - ✅ Graders deterministic and reproducible?
111
+ - ✅ Hard task genuinely challenges frontier models?
112
+
113
+ #### Environment design (20%)
114
+ - ✅ `reset()` produces clean state?
115
+ - ✅ Action/observation types well-designed and documented?
116
+ - ✅ Reward function provides useful varying signal (not just sparse)?
117
+ - ✅ Episode boundaries sensible?
118
+
#### Code quality & spec compliance (15%)
- ✅ `openenv validate` passes?
- ✅ `docker build && docker run` works?
- ✅ HF Space deploys and responds?
- ✅ Baseline script runs and reproduces scores?

#### Creativity & novelty (10%)
- ✅ Domain we haven't seen in OpenEnv before?
- ✅ Reward design has interesting properties?
- ✅ Clever mechanics that make the environment engaging?

---

## 6. Validation Checklist

Before submitting, ensure:

- [ ] `openenv validate` passes
- [ ] `docker build && docker run` works
- [ ] HF Space deploys and responds to `reset()`
- [ ] Baseline inference script runs without error
- [ ] 3+ tasks with graders (scores in 0.0–1.0 range)
- [ ] `inference.py` named correctly and in root directory
- [ ] Environment variables defined: `API_BASE_URL`, `MODEL_NAME`, `HF_TOKEN`
- [ ] Structured logs follow `[START]`, `[STEP]`, `[END]` format
- [ ] Runtime under 20 minutes
- [ ] Works on 2 vCPU, 8GB RAM machine
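The structured-log item can be checked locally before submitting. The regexes below are an assumption derived from the `[START]`/`[STEP]`/`[END]` format described in the sample inference script, not an official validator:

```python
import re

# One pattern per required line type; assumed from the documented log format.
LOG_PATTERNS = {
    "[START]": re.compile(r"^\[START\] task=\S+ env=\S+ model=\S+$"),
    "[STEP]": re.compile(r"^\[STEP\] step=\d+ action=.+ reward=\d+\.\d{2} done=(?:true|false) error=.+$"),
    "[END]": re.compile(r"^\[END\] success=(?:true|false) steps=\d+ score=\d+\.\d+ rewards=\d+\.\d{2}(?:,\d+\.\d{2})*$"),
}

def check_log_line(line: str) -> bool:
    """Return True if the line matches one of the required [START]/[STEP]/[END] shapes."""
    for prefix, pattern in LOG_PATTERNS.items():
        if line.startswith(prefix):
            return bool(pattern.match(line))
    return False
```

Running your baseline and piping stdout through a checker like this catches formatting drift (missing fields, wrong decimal places) before the automated judging phase does.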

---

## 7. Judging Phases

### Phase 1: Automated Validation (Pass/Fail)
- HF Space deploys
- OpenEnv spec compliance
- Dockerfile builds
- Baseline reproduces
- 3+ tasks with graders

### Phase 2: Agentic Evaluation (Scored)
- Baseline agent re-run
- Standard open LLM agent (e.g., Nemotron 3 Super) run against all environments
- Score variance check

### Phase 3: Human Review
- Top submissions reviewed by Meta and HuggingFace engineers
- Real-world utility check
- Creativity check
- Exploit checks

---

## 8. Disqualification Criteria

The following will result in disqualification:

- Environment does not deploy or respond
- Plagiarized or trivially modified existing environments
- Graders that always return the same score

---

## 9. Example: Power Grid Environment (Reference)

Your environment should follow a similar structure:

```
project/
├── inference.py       # Baseline inference script
├── openenv.yaml       # OpenEnv metadata
├── Dockerfile         # Container configuration
├── README.md          # Documentation
├── src/
│   ├── tasks.py       # Task definitions (3+ tasks)
│   ├── graders.py     # Task graders (scores 0.0-1.0)
│   ├── environment.py # Environment implementation
│   └── models.py      # Typed Observation/Action models
└── requirements.txt   # Dependencies
```

---

## Summary Checklist

| Requirement | Mandatory? |
|-------------|------------|
| Real-world task (not games) | ✅ Yes |
| OpenEnv spec compliance | ✅ Yes |
| 3 tasks (easy→medium→hard) | ✅ Yes |
| Graders (0.0–1.0 scores) | ✅ Yes |
| Meaningful reward function | ✅ Yes |
| `inference.py` in root | ✅ Yes |
| HF_TOKEN, MODEL_NAME, API_BASE_URL | ✅ Yes |
| Structured logs [START/STEP/END] | ✅ Yes |
| HF Space deploys | ✅ Yes |
| Dockerfile works | ✅ Yes |
| Runtime < 20 min | ✅ Yes |
| 2 vCPU, 8GB RAM | ✅ Yes |
| README with setup instructions | ✅ Yes |
submission/pre_validation.sh ADDED
@@ -0,0 +1,185 @@
#!/usr/bin/env bash
#
# pre_validation.sh — OpenEnv Submission Validator
#
# Checks that your HF Space is live, your Docker image builds, and `openenv validate` passes.
#
# Prerequisites:
#   - Docker: https://docs.docker.com/get-docker/
#   - openenv-core: pip install openenv-core
#   - curl (usually pre-installed)
#
# Run:
#   curl -fsSL https://raw.githubusercontent.com/<owner>/<repo>/main/submission/pre_validation.sh | bash -s -- <ping_url> [repo_dir]
#
# Or download and run locally:
#   chmod +x pre_validation.sh
#   ./pre_validation.sh <ping_url> [repo_dir]
#
# Arguments:
#   ping_url   Your HuggingFace Space URL (e.g. https://your-space.hf.space)
#   repo_dir   Path to your repo (default: current directory)
#
# Examples:
#   ./pre_validation.sh https://my-team.hf.space
#   ./pre_validation.sh https://my-team.hf.space ./my-repo
#

set -uo pipefail

DOCKER_BUILD_TIMEOUT=600
if [ -t 1 ]; then
    RED='\033[0;31m'
    GREEN='\033[0;32m'
    YELLOW='\033[1;33m'
    BOLD='\033[1m'
    NC='\033[0m'
else
    RED='' GREEN='' YELLOW='' BOLD='' NC=''
fi

run_with_timeout() {
    local secs="$1"; shift
    if command -v timeout &>/dev/null; then
        timeout "$secs" "$@"
    elif command -v gtimeout &>/dev/null; then
        gtimeout "$secs" "$@"
    else
        "$@" &
        local pid=$!
        ( sleep "$secs" && kill "$pid" 2>/dev/null ) &
        local watcher=$!
        wait "$pid" 2>/dev/null
        local rc=$?
        kill "$watcher" 2>/dev/null
        wait "$watcher" 2>/dev/null
        return $rc
    fi
}

portable_mktemp() {
    local prefix="${1:-validate}"
    mktemp "${TMPDIR:-/tmp}/${prefix}-XXXXXX" 2>/dev/null || mktemp
}

CLEANUP_FILES=()
cleanup() { rm -f "${CLEANUP_FILES[@]+"${CLEANUP_FILES[@]}"}"; }
trap cleanup EXIT

PING_URL="${1:-}"
REPO_DIR="${2:-.}"

if [ -z "$PING_URL" ]; then
    printf "Usage: %s <ping_url> [repo_dir]\n" "$0"
    printf "\n"
    printf "  ping_url   Your HuggingFace Space URL (e.g. https://your-space.hf.space)\n"
    printf "  repo_dir   Path to your repo (default: current directory)\n"
    exit 1
fi

if ! REPO_DIR="$(cd "$REPO_DIR" 2>/dev/null && pwd)"; then
    printf "Error: directory '%s' not found\n" "${2:-.}"
    exit 1
fi
PING_URL="${PING_URL%/}"
export PING_URL
PASS=0

log()  { printf "[%s] %b\n" "$(date -u +%H:%M:%S)" "$*"; }
pass() { log "${GREEN}PASSED${NC} -- $1"; PASS=$((PASS + 1)); }
fail() { log "${RED}FAILED${NC} -- $1"; }
hint() { printf "  ${YELLOW}Hint:${NC} %b\n" "$1"; }
stop_at() {
    printf "\n"
    printf "${RED}${BOLD}Validation stopped at %s.${NC} Fix the above before continuing.\n" "$1"
    exit 1
}

printf "\n"
printf "${BOLD}========================================${NC}\n"
printf "${BOLD}  OpenEnv Submission Validator${NC}\n"
printf "${BOLD}========================================${NC}\n"
log "Repo:     $REPO_DIR"
log "Ping URL: $PING_URL"
printf "\n"

log "${BOLD}Step 1/3: Pinging HF Space${NC} ($PING_URL/reset) ..."

CURL_OUTPUT=$(portable_mktemp "validate-curl")
CLEANUP_FILES+=("$CURL_OUTPUT")
HTTP_CODE=$(curl -s -o "$CURL_OUTPUT" -w "%{http_code}" -X POST \
    -H "Content-Type: application/json" -d '{}' \
    "$PING_URL/reset" --max-time 30 2>"$CURL_OUTPUT" || printf "000")

if [ "$HTTP_CODE" = "200" ]; then
    pass "HF Space is live and responds to /reset"
elif [ "$HTTP_CODE" = "000" ]; then
    fail "HF Space not reachable (connection failed or timed out)"
    hint "Check your network connection and that the Space is running."
    hint "Try: curl -s -o /dev/null -w '%%{http_code}' -X POST $PING_URL/reset"
    stop_at "Step 1"
else
    fail "HF Space /reset returned HTTP $HTTP_CODE (expected 200)"
    hint "Make sure your Space is running and the URL is correct."
    hint "Try opening $PING_URL in your browser first."
    stop_at "Step 1"
fi

log "${BOLD}Step 2/3: Running docker build${NC} ..."

if ! command -v docker &>/dev/null; then
    fail "docker command not found"
    hint "Install Docker: https://docs.docker.com/get-docker/"
    stop_at "Step 2"
fi

if [ -f "$REPO_DIR/Dockerfile" ]; then
    DOCKER_CONTEXT="$REPO_DIR"
elif [ -f "$REPO_DIR/server/Dockerfile" ]; then
    DOCKER_CONTEXT="$REPO_DIR/server"
else
    fail "No Dockerfile found in repo root or server/ directory"
    stop_at "Step 2"
fi

log "  Found Dockerfile in $DOCKER_CONTEXT"

BUILD_OK=false
BUILD_OUTPUT=$(run_with_timeout "$DOCKER_BUILD_TIMEOUT" docker build "$DOCKER_CONTEXT" 2>&1) && BUILD_OK=true

if [ "$BUILD_OK" = true ]; then
    pass "Docker build succeeded"
else
    fail "Docker build failed (timeout=${DOCKER_BUILD_TIMEOUT}s)"
    printf "%s\n" "$BUILD_OUTPUT" | tail -20
    stop_at "Step 2"
fi

log "${BOLD}Step 3/3: Running openenv validate${NC} ..."

if ! command -v openenv &>/dev/null; then
    fail "openenv command not found"
    hint "Install it: pip install openenv-core"
    stop_at "Step 3"
fi

VALIDATE_OK=false
VALIDATE_OUTPUT=$(cd "$REPO_DIR" && openenv validate 2>&1) && VALIDATE_OK=true

if [ "$VALIDATE_OK" = true ]; then
    pass "openenv validate passed"
    [ -n "$VALIDATE_OUTPUT" ] && log "  $VALIDATE_OUTPUT"
else
    fail "openenv validate failed"
    printf "%s\n" "$VALIDATE_OUTPUT"
    stop_at "Step 3"
fi

printf "\n"
printf "${BOLD}========================================${NC}\n"
printf "${GREEN}${BOLD}  All 3/3 checks passed!${NC}\n"
printf "${GREEN}${BOLD}  Your submission is ready to submit.${NC}\n"
printf "${BOLD}========================================${NC}\n"
printf "\n"

exit 0
submission/sample_inference.py ADDED
@@ -0,0 +1,188 @@
"""
Inference Script Example
===================================
MANDATORY
- Before submitting, ensure the following variables are defined in your environment configuration:
    API_BASE_URL       The API endpoint for the LLM.
    MODEL_NAME         The model identifier to use for inference.
    HF_TOKEN           Your Hugging Face / API key.
    LOCAL_IMAGE_NAME   The name of the local image to use for the environment if you are using the
                       from_docker_image() method.

- Defaults are set only for API_BASE_URL and MODEL_NAME
  (and should reflect your active inference setup):
    API_BASE_URL = os.getenv("API_BASE_URL", "<your-active-endpoint>")
    MODEL_NAME = os.getenv("MODEL_NAME", "<your-active-model>")

- The inference script must be named `inference.py` and placed in the root directory of the project.
- Participants must use the OpenAI client for all LLM calls, using the variables above.

STDOUT FORMAT
- The script must emit exactly three line types to stdout, in this order:

    [START] task=<task_name> env=<benchmark> model=<model_name>
    [STEP] step=<n> action=<action_str> reward=<0.00> done=<true|false> error=<msg|null>
    [END] success=<true|false> steps=<n> score=<score> rewards=<r1,r2,...,rn>

Rules:
- One [START] line at episode begin.
- One [STEP] line per step, immediately after env.step() returns.
- One [END] line after env.close(), always emitted (even on exception).
- reward and rewards are formatted to 2 decimal places.
- done and success are lowercase booleans: true or false.
- error is the raw last_action_error string, or null if none.
- All fields on a single line with no newlines within a line.
- Each task should return a score in [0, 1].

Example:
    [START] task=click-test env=miniwob model=Qwen3-VL-30B
    [STEP] step=1 action=click('123') reward=0.00 done=false error=null
    [STEP] step=2 action=fill('456','text') reward=0.00 done=false error=null
    [STEP] step=3 action=click('789') reward=1.00 done=true error=null
    [END] success=true steps=3 score=1.00 rewards=0.00,0.00,1.00
"""

import asyncio
import os
import textwrap
from typing import List, Optional

from openai import OpenAI

from my_env_v4 import MyEnvV4Action, MyEnvV4Env

IMAGE_NAME = os.getenv("IMAGE_NAME")  # If you are using a docker image
API_KEY = os.getenv("HF_TOKEN") or os.getenv("API_KEY")

API_BASE_URL = os.getenv("API_BASE_URL") or "https://router.huggingface.co/v1"
MODEL_NAME = os.getenv("MODEL_NAME") or "Qwen/Qwen2.5-72B-Instruct"
TASK_NAME = os.getenv("MY_ENV_V4_TASK", "echo")
BENCHMARK = os.getenv("MY_ENV_V4_BENCHMARK", "my_env_v4")
MAX_STEPS = 8
TEMPERATURE = 0.7
MAX_TOKENS = 150
SUCCESS_SCORE_THRESHOLD = 0.1  # normalized score in [0, 1]

# Max possible reward: each token contributes 0.1, across all steps
_MAX_REWARD_PER_STEP = MAX_TOKENS * 0.1
MAX_TOTAL_REWARD = MAX_STEPS * _MAX_REWARD_PER_STEP

SYSTEM_PROMPT = textwrap.dedent(
    """
    You are interacting with a simple echo environment.
    Each turn you must send a message. The environment will echo it back.
    Reward is proportional to message length: reward = len(message) * 0.1
    Your goal is to maximize total reward by sending meaningful, substantive messages.
    Reply with exactly one message string — no quotes, no prefixes, just the message text.
    """
).strip()


def log_start(task: str, env: str, model: str) -> None:
    print(f"[START] task={task} env={env} model={model}", flush=True)


def log_step(step: int, action: str, reward: float, done: bool, error: Optional[str]) -> None:
    error_val = error if error else "null"
    done_val = str(done).lower()
    print(
        f"[STEP] step={step} action={action} reward={reward:.2f} done={done_val} error={error_val}",
        flush=True,
    )


def log_end(success: bool, steps: int, score: float, rewards: List[float]) -> None:
    rewards_str = ",".join(f"{r:.2f}" for r in rewards)
    print(f"[END] success={str(success).lower()} steps={steps} score={score:.3f} rewards={rewards_str}", flush=True)


def build_user_prompt(step: int, last_echoed: str, last_reward: float, history: List[str]) -> str:
    history_block = "\n".join(history[-4:]) if history else "None"
    return textwrap.dedent(
        f"""
        Step: {step}
        Last echoed message: {last_echoed!r}
        Last reward: {last_reward:.2f}
        Previous steps:
        {history_block}
        Send your next message.
        """
    ).strip()


def get_model_message(client: OpenAI, step: int, last_echoed: str, last_reward: float, history: List[str]) -> str:
    user_prompt = build_user_prompt(step, last_echoed, last_reward, history)
    try:
        completion = client.chat.completions.create(
            model=MODEL_NAME,
            messages=[
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": user_prompt},
            ],
            temperature=TEMPERATURE,
            max_tokens=MAX_TOKENS,
            stream=False,
        )
        text = (completion.choices[0].message.content or "").strip()
        return text if text else "hello"
    except Exception as exc:
        print(f"[DEBUG] Model request failed: {exc}", flush=True)
        return "hello"


async def main() -> None:
    client = OpenAI(base_url=API_BASE_URL, api_key=API_KEY)

    env = await MyEnvV4Env.from_docker_image(IMAGE_NAME)

    history: List[str] = []
    rewards: List[float] = []
    steps_taken = 0
    score = 0.0
    success = False

    log_start(task=TASK_NAME, env=BENCHMARK, model=MODEL_NAME)

    try:
        result = await env.reset()  # OpenEnv reset()
        last_echoed = result.observation.echoed_message
        last_reward = 0.0

        for step in range(1, MAX_STEPS + 1):
            if result.done:
                break

            message = get_model_message(client, step, last_echoed, last_reward, history)

            result = await env.step(MyEnvV4Action(message=message))
            obs = result.observation

            reward = result.reward or 0.0
            done = result.done
            error = None

            rewards.append(reward)
            steps_taken = step
            last_echoed = obs.echoed_message
            last_reward = reward

            log_step(step=step, action=message, reward=reward, done=done, error=error)

            history.append(f"Step {step}: {message!r} -> reward {reward:+.2f}")

            if done:
                break

        score = sum(rewards) / MAX_TOTAL_REWARD if MAX_TOTAL_REWARD > 0 else 0.0
        score = min(max(score, 0.0), 1.0)  # clamp to [0, 1]
        success = score >= SUCCESS_SCORE_THRESHOLD

    finally:
        try:
            await env.close()
        except Exception as e:
            print(f"[DEBUG] env.close() error (container cleanup): {e}", flush=True)
        log_end(success=success, steps=steps_taken, score=score, rewards=rewards)


if __name__ == "__main__":
    asyncio.run(main())
{grid2op_env/tests → tests}/test_grid2op_env.py RENAMED
File without changes
uv.lock CHANGED
The diff for this file is too large to render. See raw diff