Sidharth1743 committed on
Commit
689c71b
·
1 Parent(s): da3c180

Hackathon polishing

grid2op_env/.dockerignore → .dockerignore RENAMED
File without changes
.gitignore CHANGED
@@ -1,13 +1,11 @@
1
- # Python-generated files
 
2
  __pycache__/
3
- *.py[oc]
4
- build/
5
- dist/
6
- wheels/
7
- *.egg-info
8
 
9
- # Virtual environments
10
- .venv
11
- OpenEnv/
12
- OpenEnv
13
- .env
 
1
+ .venv/
2
+ .pytest_cache/
3
  __pycache__/
4
+ *.pyc
5
+ outputs/logs/*
6
+ !outputs/logs/.gitkeep
7
+ outputs/evals/*
8
+ !outputs/evals/.gitkeep
9
 
10
+ .env
11
+ grid2op_env.egg-info
.python-version CHANGED
@@ -1 +1 @@
1
- 3.13
 
1
+ 3.12
AGENTS.md ADDED
@@ -0,0 +1,25 @@
1
+ # Repository Guidelines
2
+
3
+ ## Project Structure & Module Organization
4
+ The package is rooted at the repository top level. Core models live in `models.py`, the baseline agent in `inference.py`, the client helper in `client.py`, and topology analysis in `graph_analysis.py`. The FastAPI/OpenEnv server lives in `server/` with `app.py`, `grid_environment.py`, `tasks.py`, `graders.py`, and logging helpers. Tests are in `tests/`, reference docs in `docs/` and `architecture/`, and submission utilities in `submission/`. Runtime artifacts go under `outputs/logs/` and `outputs/evals/`.
5
+
6
+ ## Build, Test, and Development Commands
7
+ Use `uv` for local work.
8
+
9
+ - `env UV_CACHE_DIR=/tmp/uv-cache uv run --no-dev server --port 7860` starts the FastAPI server declared in `openenv.yaml`.
10
+ - `env UV_CACHE_DIR=/tmp/uv-cache uv run --no-dev grid2op-smoke --task-id single_fault --steps 1` runs a quick environment smoke test.
11
+ - `env UV_CACHE_DIR=/tmp/uv-cache uv run --extra dev pytest tests/test_grid2op_env.py -q` runs the current pytest suite.
12
+ - `docker build -t grid2op-env:local -f server/Dockerfile .` builds the local container image.
13
+ - `bash submission/pre_validation.sh` runs submission checks before packaging.
14
+
15
+ ## Coding Style & Naming Conventions
16
+ Follow the existing Python style: 4-space indentation, type hints, `from __future__ import annotations`, and compact module-level imports. Use `snake_case` for functions, variables, and modules, `PascalCase` for Pydantic models, and `UPPER_CASE` for constants like `TASKS`. Keep OpenEnv payloads strongly typed with Pydantic models instead of raw dicts when practical. No formatter or linter config is committed, so match surrounding code and keep diffs minimal.
17
+
18
+ ## Testing Guidelines
19
+ Tests use `pytest`. Add new coverage in `tests/test_grid2op_env.py` or split into `tests/test_<feature>.py` as the suite grows. Prefer deterministic assertions over probabilistic checks; this repository already tests grader determinism, task resets, proposal parsing, and graph-analysis output. Run the smoke command plus pytest before opening a PR.
20
+
21
+ ## Commit & Pull Request Guidelines
22
+ Recent commits use short, direct subjects such as `docs updated` and `task 3 refining`. Keep commit titles imperative (lowercase is acceptable) and under roughly 60 characters. PRs should describe the affected task or subsystem, list the validation commands run, and note baseline or API behavior changes when relevant. Add screenshots only for UI or HTTP response examples.
23
+
24
+ ## Configuration & Runtime Notes
25
+ `openenv.yaml` points to `server.app:app` on port `7860`. Keep API credentials in environment variables or `.env`; do not hardcode secrets. If you change server routes or environment logic, restart the server before rerunning `inference.py`.
README.md CHANGED
@@ -0,0 +1,220 @@
1
+ # Grid2Op Environment
2
+
3
+ Standalone OpenEnv environment package for the full `PROJECT.md` design.
4
+
5
+ The current planner uses server-side simulation on the live Grid2Op session. It does not rely on a replayed local mirror.
6
+
7
+ ## File structure
8
+
9
+ ```text
10
+ grid2op_env/
11
+ ├── .dockerignore
12
+ ├── .env
13
+ ├── .gitignore
14
+ ├── __init__.py
15
+ ├── models.py
16
+ ├── client.py
17
+ ├── inference.py
18
+ ├── README.md
19
+ ├── openenv.yaml
20
+ ├── outputs/
21
+ │ ├── logs/
22
+ │ └── evals/
23
+ ├── pyproject.toml
24
+ └── server/
25
+ ├── grid_environment.py
26
+ ├── tasks.py
27
+ ├── graders.py
28
+ ├── app.py
29
+ ├── requirements.txt
30
+ └── Dockerfile
31
+ ```
32
+
33
+ The top-level package now follows the canonical OpenEnv environment layout:
34
+
35
+ - `.dockerignore`
36
+ - `__init__.py`
37
+ - `models.py`
38
+ - `client.py`
39
+ - `README.md`
40
+ - `openenv.yaml`
41
+ - `pyproject.toml`
42
+ - `outputs/logs`
43
+ - `outputs/evals`
44
+ - `server/`
45
+
46
+ Supporting files outside the minimum template remain for quality and verification:
47
+
48
+ - `inference.py`
49
+ - `tests/test_grid2op_env.py`
50
+ - helper server modules such as `tasks.py`, `graders.py`, and `logging_utils.py`
51
+
52
+ ## What is implemented
53
+
54
+ - Grid2Op core simulator using `l2rpn_case14_sandbox`
55
+ - Typed `GridAction`, `GridObservation`, and `GridState`
56
+ - Four tasks: `single_fault`, `n_minus_1`, `cascade_prevent`, `multi_stage_cascade`
57
+ - Reset-time scenario injection and retry logic for non-convergent starts
58
+ - Shaped reward, episode logging, and deterministic graders
59
+ - OpenEnv WebSocket interface plus `/tasks`, `/grader`, and `/baseline`
60
+ - Server-side planner support via:
61
+ - `POST /planning_context`
62
+ - `POST /simulate`
63
+ - Qwen3.5 baseline using the Chat Completions API
64
+ - Local Docker workflow with dataset pre-download
65
+
66
+ ## Recent fixes
67
+
68
+ 1. **Task 1 (single_fault) benchmark ranges corrected** (tasks.py:297-304):
69
+ - `single_fault_easy`: 0.82-0.85 (was mathematically impossible 0.90-0.94)
70
+ - `single_fault_moderate`: 0.86-0.89 (was 0.94-0.97)
71
+ - `single_fault_severe`: 0.90-0.93 (was 0.96-0.99)
72
+ - Warmup phase finds a high-loading state in the chronics, then the agent has 10 steps to solve
73
+
74
+ 2. **Task 1 reward function** (grid_environment.py:589-596):
75
+ - Target achieved bonus: `1.0 / step_count` (rewards early solution)
76
+ - Safe margin bonus: `0.05 × max(0.0, 1.0 - max_rho)`
77
+ - Overload penalty: `0.2 × overloaded_count` (lines > 100%)
78
+ - Redispatch penalty: `0.01 × MW` (discourages large interventions)
79
+ - Failure penalty: `-5.0` if time limit reached without target
80
+
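Taken together, the Task 1 shaping above can be sketched as a single function. This is an illustrative reconstruction from the bullet list, not the code in `grid_environment.py`; the names and the exact way the terms combine are assumptions.

```python
def task1_step_reward(step_count: int, max_rho: float, overloaded_count: int,
                      redispatch_mw: float, target_achieved: bool,
                      timed_out: bool) -> float:
    # Illustrative reconstruction of the Task 1 shaping; not the repo's code.
    reward = 0.0
    if target_achieved:
        reward += 1.0 / step_count                 # early-solution bonus
    reward += 0.05 * max(0.0, 1.0 - max_rho)       # safe-margin bonus
    reward -= 0.2 * overloaded_count               # lines loaded above 100%
    reward -= 0.01 * redispatch_mw                 # discourage large redispatch
    if timed_out and not target_achieved:
        reward -= 5.0                              # failure penalty at time limit
    return reward
```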
81
+ 3. **Task 1 grading** (graders.py:28-55):
82
+ - 70% weight on survival ratio
83
+ - 50% target achieved bonus
84
+ - Final state bonus (0.3 if below target, 0.15/+0.05, 0.05/+0.10)
85
+ - Legacy success score for early completion: `1.0 - 0.08 × (step - 1)`
86
+
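The legacy early-completion formula above is simple enough to state directly; the clamp at zero for very late solutions is an assumption, not confirmed from `graders.py`.

```python
def legacy_success_score(solve_step: int) -> float:
    # 1.0 - 0.08 * (step - 1), clamped at 0.0 (the clamp is an assumption)
    return max(0.0, 1.0 - 0.08 * (solve_step - 1))
```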
87
+ 4. **Task 2 (n_minus_1) redesign** based on RL2Grid paper (grid_environment.py:598-609):
88
+ - Three-component reward: `0.3×R_survive + 0.6×R_overload + 0.1×R_cost`
89
+ - `R_survive`: +1.0 per step (constant survival signal)
90
+ - `R_overload`: `(1/n) × Σ clip(1-ρ, -1, 1)` - loading margin quality
91
+ - `R_cost`: `-0.05 × Σ|ΔMW|/max_ramp` (normalized redispatch cost)
92
+ - Reconnection bonus: +2.0 when safely reconnecting (grid_environment.py:853-869)
93
+ - Terminal: +10×(s/m)² quadratic survival, -15 blackout
94
+ - Phase-aware grader (graders.py:58-83):
95
+ - Emergency response (30%): cleared within 5 steps at rho < 0.92
96
+ - Sustained security (50%): steps 6-20 at rho < 0.90
97
+ - Reconnection (20%): did agent reconnect line 0?
98
+ - N-1 security score (bridge lines) in prompt
99
+ - **Grading now honest**: score = survival_ratio × mastery_score (no override)
100
+ - Latest eval: 0.952 (was 1.0 with old override)
101
+
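A minimal sketch of the three-component step reward described above. The per-generator ramp normalization and all names are assumptions for illustration; they are not taken from the repo.

```python
def task2_step_reward(rho: list[float], delta_mw: list[float],
                      max_ramp: list[float]) -> float:
    # 0.3*R_survive + 0.6*R_overload + 0.1*R_cost, per the redesign notes
    r_survive = 1.0                                            # constant survival signal
    r_overload = sum(max(-1.0, min(1.0, 1.0 - r)) for r in rho) / len(rho)
    r_cost = -0.05 * sum(abs(d) / m for d, m in zip(delta_mw, max_ramp))
    return 0.3 * r_survive + 0.6 * r_overload + 0.1 * r_cost
```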
102
+ 5. **Task 3 (cascade_prevent)** (grid_environment.py:611-628):
103
+ - 1-2 lines disconnected at reset + 5-15% load increase
104
+ - Key metric: `timestep_overflow` countdowns (not just max_rho)
105
+ - Quadratic overflow penalty: `-0.05 × Σ(overflow²)`; a line at overflow=2 is 4x more urgent than one at overflow=1
106
+ - Reward components:
107
+ - Cascade prevention: +0.3 if no auto-trip, -2.5 if auto-trip
108
+ - Thermal margin: +0.1 × mean(clip(1-ρ, -1, 1))
109
+ - Terminal: +5.0 × (1 - auto_trips/5)² survival bonus, -12.0 blackout
110
+ - Grading (graders.py:86-121):
111
+ - Cascade containment (50%): steps without auto-trips / 30
112
+ - Thermal stability (30%): safe_steps / containment_steps
113
+ - Recovery speed (20%): how fast recovered from first overload
114
+ - Latest eval: 0.798 (hard/extreme tiers challenging)
115
+
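The quadratic urgency weighting above is worth making concrete. A hypothetical helper (the repo combines this term with the other reward components):

```python
def overflow_penalty(timestep_overflow: list[int]) -> float:
    # -0.05 * sum(overflow^2): a line two steps into overflow
    # weighs 4x one that just started overflowing
    return -0.05 * sum(t * t for t in timestep_overflow)
```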
116
+ 6. **Task 4 (multi_stage_cascade)** (tasks.py:334-337, grid_environment.py:630-647):
117
+ - 3 lines disconnected at reset + **20% load increase** (not 15%)
118
+ - Three explicit stages (10 steps each) with stage boundaries at step 10 and 20
119
+ - Overflow window: 2 (faster cascades than default 3)
120
+ - Do-nothing survival probe: 5 steps minimum
121
+ - Island availability assessment at stage boundaries (grid_environment.py:767-814)
122
+ - Candidate filtering (inference.py:1003-1030): filters unsafe topology disconnects
123
+ - Reward (grid_environment.py:630-647):
124
+ - Generation cost: -0.02 × (total_gen / initial_load)
125
+ - Convergence: +0.5 × available_island_ratio
126
+ - Load loss penalty: -5.0 × (1 - available_load_ratio) at boundaries only
127
+ - Terminal win: +8.0 × (available_load_ratio)² if ≥50% load at step 30
128
+ - Terminal blackout: -12.0
129
+ - Grading (graders.py:124-174):
130
+ - Stage completion (30%): survived stages 1, 2, 3
131
+ - Load preservation (40%): available_load_ratio at end
132
+ - Island quality (20%): majority islands viable at boundaries
133
+ - Speed bonus (10%): how fast stability returned each stage
134
+ - Latest eval: 0.929 (31x improvement from 0.027)
135
+
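A sketch of the Task 4 terminal reward described above, assuming the ≥50% load threshold gates the win bonus and that intermediate outcomes contribute nothing; illustrative only, not the repo's implementation.

```python
def task4_terminal_reward(available_load_ratio: float, blackout: bool) -> float:
    # Terminal win/blackout from the Task 4 notes; other cases assumed 0.0
    if blackout:
        return -12.0
    if available_load_ratio >= 0.5:
        return 8.0 * available_load_ratio ** 2   # quadratic load-preservation bonus
    return 0.0
```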
136
+ ## Planner architecture
137
+
138
+ `inference.py` now uses this flow:
139
+
140
+ 1. `reset()` live episode
141
+ 2. `state()` to obtain `episode_id`
142
+ 3. `planning_context(episode_id)` for graph intelligence and redispatchable generators
143
+ 4. LLM proposes 3 candidate actions
144
+ 5. `simulate_candidates(episode_id, actions)` on the live server session
145
+ 6. LLM selects the safest simulated action
146
+ 7. `step(action)`
147
+
148
+ This avoids the old replay-mirror drift problem.
149
+
150
+ ## Local Docker workflow
151
+
152
+ Build:
153
+
154
+ ```bash
155
+ cd grid2op_env
156
+ docker build -t grid2op-env:local -f server/Dockerfile .
157
+ ```
158
+
159
+ Run:
160
+
161
+ ```bash
162
+ docker run --rm -p 7860:7860 grid2op-env:local
163
+ ```
164
+
165
+ If your Qwen-compatible API is running on the host machine, use:
166
+
167
+ ```bash
168
+ docker run --rm \
169
+ --add-host=host.docker.internal:host-gateway \
170
+ -e OPENAI_BASE_URL=http://host.docker.internal:8000/v1 \
171
+ -e OPENAI_API_KEY=EMPTY \
172
+ -p 7860:7860 \
173
+ grid2op-env:local
174
+ ```
175
+
176
+ ## Local UV workflow
177
+
178
+ ```bash
179
+ cd grid2op_env
180
+ env UV_CACHE_DIR=/tmp/uv-cache uv run --no-dev grid2op-smoke --task-id single_fault --steps 1
181
+ env UV_CACHE_DIR=/tmp/uv-cache uv run --no-dev server --port 7860
182
+ env UV_CACHE_DIR=/tmp/uv-cache uv run --extra dev pytest tests/test_grid2op_env.py -q
183
+ ```
184
+
185
+ ## Qwen baseline
186
+
187
+ The baseline uses the OpenAI Python SDK against a local Chat Completions API.
188
+
189
+ ```bash
190
+ cat > .env <<'EOF'
191
+ OPENAI_BASE_URL=http://localhost:8000/v1
192
+ OPENAI_API_KEY=EMPTY
193
+ OPENAI_MODEL=cyankiwi/Qwen3.5-9B-AWQ-4bit
194
+ EOF
195
+
196
+ env UV_CACHE_DIR=/tmp/uv-cache uv run --no-dev inference.py
197
+ ```
198
+
199
+ ## Important runtime note
200
+
201
+ After changing server code, restart the Grid2Op server before running `inference.py`. The planner depends on the live server routes `/planning_context` and `/simulate`.
202
+
203
+ ## Latest verified result
204
+
205
+ Latest saved run:
206
+
207
+ - `single_fault`: `0.752`
208
+ - `n_minus_1`: `0.952`
209
+ - `cascade_prevent`: `0.798`
210
+ - `multi_stage_cascade`: `0.929`
211
+
212
+ This confirms the server-side simulation path is active.
213
+
214
+ ## Architecture Documentation
215
+
216
+ - [architecture/task_1_architecture.md](architecture/task_1_architecture.md) - Task 1 detailed walkthrough
217
+ - [architecture/task_2_architecture.md](architecture/task_2_architecture.md) - Task 2 N-1 contingency management
218
+ - [architecture/task_3_architecture.md](architecture/task_3_architecture.md) - Task 3 cascade prevention
219
+ - [architecture/task_4_architecture.md](architecture/task_4_architecture.md) - Task 4 multi-stage cascade management
220
+ - [architecture/architecture.md](architecture/architecture.md) - Overall system architecture
grid2op_env/__init__.py → __init__.py RENAMED
File without changes
grid2op_env/client.py → client.py RENAMED
File without changes
evaluation.md → docs/evaluation.md RENAMED
@@ -923,3 +923,126 @@ Summary scores:
923
  }
924
  ```
925
 
926
+ ## Run 20260407_130112
927
+
928
+ - Model: `openai/gpt-oss-20b:groq`
929
+ - Tasks: `single_fault, n_minus_1, cascade_prevent, multi_stage_cascade`
930
+ - Seeds: `0` to `4`
931
+ - Scenario mode: `benchmark`
932
+ - Sampling: `temperature=0.7`, `top_p=0.8`, `top_k=20`, `min_p=0.0`, `presence_penalty=1.5`, `repetition_penalty=1.0`
933
+ - JSON output: [/home/sidharth/Desktop/Openenv_modules/grid2op_env/outputs/evals/baseline_eval_20260407_130112.json](/home/sidharth/Desktop/Openenv_modules/grid2op_env/outputs/evals/baseline_eval_20260407_130112.json)
934
+ - CSV output: [/home/sidharth/Desktop/Openenv_modules/grid2op_env/outputs/evals/baseline_eval_20260407_130112.csv](/home/sidharth/Desktop/Openenv_modules/grid2op_env/outputs/evals/baseline_eval_20260407_130112.csv)
935
+ - Log file: [/home/sidharth/Desktop/Openenv_modules/grid2op_env/outputs/logs/baseline_run_20260407_130112.log](/home/sidharth/Desktop/Openenv_modules/grid2op_env/outputs/logs/baseline_run_20260407_130112.log)
936
+
937
+ | Task | Tier | Mean Score | Mean Episode Length | Mean Time (s) | Mean Do-Nothing Steps |
938
+ | --- | --- | ---: | ---: | ---: | ---: |
939
+ | `single_fault` | `single_fault_easy` | `0.750000` | `10.00` | `30.82` | `9.00` |
940
+ | `single_fault` | `single_fault_moderate` | `0.725000` | `10.00` | `30.94` | `7.50` |
941
+ | `single_fault` | `single_fault_severe` | `1.000000` | `1.00` | `5.72` | `0.00` |
942
+ | `n_minus_1` | `n_minus_1_fixed` | `0.645333` | `20.00` | `57.54` | `17.00` |
943
+ | `cascade_prevent` | `cascade_prevent_easy` | `1.000000` | `30.00` | `86.27` | `28.50` |
944
+ | `cascade_prevent` | `cascade_prevent_medium` | `1.000000` | `30.00` | `87.11` | `25.00` |
945
+ | `cascade_prevent` | `cascade_prevent_extreme` | `0.596666` | `16.50` | `49.61` | `15.50` |
946
+ | `multi_stage_cascade` | `multi_stage_cascade_expert` | `0.831466` | `28.40` | `96.60` | `9.80` |
947
+
948
+ Summary scores:
949
+ ```json
950
+ {
951
+ "model": "openai/gpt-oss-20b:groq",
952
+ "scores": {
953
+ "single_fault": 0.825,
954
+ "n_minus_1": 0.645333,
955
+ "cascade_prevent": 0.865556,
956
+ "multi_stage_cascade": 0.831466
957
+ },
958
+ "episode_lengths": {
959
+ "single_fault": 7,
960
+ "n_minus_1": 20,
961
+ "cascade_prevent": 26,
962
+ "multi_stage_cascade": 28
963
+ }
964
+ }
965
+ ```
966
+
967
+ ## Run 20260407_145958
968
+
969
+ - Model: `openai/gpt-oss-20b:groq`
970
+ - Tasks: `single_fault, n_minus_1, cascade_prevent, multi_stage_cascade`
971
+ - Seeds: `0` to `4`
972
+ - Scenario mode: `benchmark`
973
+ - Sampling: `temperature=0.7`, `top_p=0.8`, `top_k=20`, `min_p=0.0`, `presence_penalty=1.5`, `repetition_penalty=1.0`
974
+ - JSON output: [/home/sidharth/Desktop/Openenv_modules/grid2op_env/outputs/evals/baseline_eval_20260407_145958.json](/home/sidharth/Desktop/Openenv_modules/grid2op_env/outputs/evals/baseline_eval_20260407_145958.json)
975
+ - CSV output: [/home/sidharth/Desktop/Openenv_modules/grid2op_env/outputs/evals/baseline_eval_20260407_145958.csv](/home/sidharth/Desktop/Openenv_modules/grid2op_env/outputs/evals/baseline_eval_20260407_145958.csv)
976
+ - Log file: [/home/sidharth/Desktop/Openenv_modules/grid2op_env/outputs/logs/baseline_run_20260407_145958.log](/home/sidharth/Desktop/Openenv_modules/grid2op_env/outputs/logs/baseline_run_20260407_145958.log)
977
+
978
+ | Task | Tier | Mean Score | Mean Episode Length | Mean Time (s) | Mean Do-Nothing Steps |
979
+ | --- | --- | ---: | ---: | ---: | ---: |
980
+ | `single_fault` | `single_fault_easy` | `0.750000` | `10.00` | `30.10` | `9.00` |
981
+ | `single_fault` | `single_fault_moderate` | `0.725000` | `10.00` | `28.88` | `7.50` |
982
+ | `single_fault` | `single_fault_severe` | `1.000000` | `1.00` | `6.25` | `0.00` |
983
+ | `n_minus_1` | `n_minus_1_fixed` | `0.575000` | `20.00` | `59.77` | `17.25` |
984
+ | `cascade_prevent` | `cascade_prevent_easy` | `1.000000` | `30.00` | `92.46` | `29.00` |
985
+ | `cascade_prevent` | `cascade_prevent_medium` | `1.000000` | `30.00` | `94.56` | `27.00` |
986
+ | `cascade_prevent` | `cascade_prevent_extreme` | `0.596666` | `16.50` | `50.18` | `16.00` |
987
+ | `multi_stage_cascade` | `multi_stage_cascade_expert` | `0.917766` | `30.00` | `94.55` | `10.00` |
988
+
989
+ Summary scores:
990
+ ```json
991
+ {
992
+ "model": "openai/gpt-oss-20b:groq",
993
+ "scores": {
994
+ "single_fault": 0.825,
995
+ "n_minus_1": 0.575,
996
+ "cascade_prevent": 0.865556,
997
+ "multi_stage_cascade": 0.917766
998
+ },
999
+ "episode_lengths": {
1000
+ "single_fault": 7,
1001
+ "n_minus_1": 20,
1002
+ "cascade_prevent": 26,
1003
+ "multi_stage_cascade": 30
1004
+ }
1005
+ }
1006
+ ```
1007
+
1008
+ ## Run 20260407_163224
1009
+
1010
+ - Model: `openai/gpt-oss-20b:groq`
1011
+ - Tasks: `single_fault, n_minus_1, cascade_prevent, multi_stage_cascade`
1012
+ - Seeds: `0` to `4`
1013
+ - Scenario mode: `benchmark`
1014
+ - Sampling: `temperature=0.7`, `top_p=0.8`, `top_k=20`, `min_p=0.0`, `presence_penalty=1.5`, `repetition_penalty=1.0`
1015
+ - JSON output: [/home/sidharth/Desktop/Openenv_modules/grid2op_env/outputs/evals/baseline_eval_20260407_163224.json](/home/sidharth/Desktop/Openenv_modules/grid2op_env/outputs/evals/baseline_eval_20260407_163224.json)
1016
+ - CSV output: [/home/sidharth/Desktop/Openenv_modules/grid2op_env/outputs/evals/baseline_eval_20260407_163224.csv](/home/sidharth/Desktop/Openenv_modules/grid2op_env/outputs/evals/baseline_eval_20260407_163224.csv)
1017
+ - Log file: [/home/sidharth/Desktop/Openenv_modules/grid2op_env/outputs/logs/baseline_run_20260407_163224.log](/home/sidharth/Desktop/Openenv_modules/grid2op_env/outputs/logs/baseline_run_20260407_163224.log)
1018
+
1019
+ | Task | Tier | Mean Score | Mean Episode Length | Mean Time (s) | Mean Do-Nothing Steps |
1020
+ | --- | --- | ---: | ---: | ---: | ---: |
1021
+ | `single_fault` | `single_fault_easy` | `0.750000` | `10.00` | `19.32` | `10.00` |
1022
+ | `single_fault` | `single_fault_moderate` | `0.750000` | `10.00` | `18.22` | `10.00` |
1023
+ | `single_fault` | `single_fault_severe` | `0.750000` | `10.00` | `20.68` | `9.00` |
1024
+ | `n_minus_1` | `n_minus_1_fixed` | `0.576750` | `15.75` | `54.19` | `15.25` |
1025
+ | `cascade_prevent` | `cascade_prevent_easy` | `1.000000` | `30.00` | `97.00` | `28.50` |
1026
+ | `cascade_prevent` | `cascade_prevent_medium` | `1.000000` | `30.00` | `93.13` | `28.50` |
1027
+ | `cascade_prevent` | `cascade_prevent_extreme` | `0.596666` | `16.50` | `52.60` | `15.00` |
1028
+ | `multi_stage_cascade` | `multi_stage_cascade_expert` | `0.812543` | `28.00` | `94.51` | `8.50` |
1029
+
1030
+ Summary scores:
1031
+ ```json
1032
+ {
1033
+ "model": "openai/gpt-oss-20b:groq",
1034
+ "scores": {
1035
+ "single_fault": 0.75,
1036
+ "n_minus_1": 0.57675,
1037
+ "cascade_prevent": 0.865556,
1038
+ "multi_stage_cascade": 0.812543
1039
+ },
1040
+ "episode_lengths": {
1041
+ "single_fault": 10,
1042
+ "n_minus_1": 16,
1043
+ "cascade_prevent": 26,
1044
+ "multi_stage_cascade": 28
1045
+ }
1046
+ }
1047
+ ```
1048
+
graph_build.md → docs/graph_build.md RENAMED
File without changes
implementation.md → docs/implementation.md RENAMED
File without changes
grid2op_env/graph_analysis.py → graph_analysis.py RENAMED
File without changes
grid2op_env/.gitignore DELETED
@@ -1,9 +0,0 @@
1
- .venv/
2
- .pytest_cache/
3
- __pycache__/
4
- *.pyc
5
- outputs/logs/*
6
- !outputs/logs/.gitkeep
7
- outputs/evals/*
8
- !outputs/evals/.gitkeep
9
-
grid2op_env/README.md DELETED
@@ -1,220 +0,0 @@
1
- # Grid2Op Environment
2
-
3
- Standalone OpenEnv environment package for the full `PROJECT.md` design.
4
-
5
- The current planner uses server-side simulation on the live Grid2Op session. It does not rely on a replayed local mirror.
6
-
7
- ## File structure
8
-
9
- ```text
10
- grid2op_env/
11
- ├── .dockerignore
12
- ├── .env
13
- ├── .gitignore
14
- ├── __init__.py
15
- ├── models.py
16
- ├── client.py
17
- ├── inference.py
18
- ├── README.md
19
- ├── openenv.yaml
20
- ├── outputs/
21
- │ ├── logs/
22
- │ └── evals/
23
- ├── pyproject.toml
24
- └── server/
25
- ├── grid_environment.py
26
- ├── tasks.py
27
- ├── graders.py
28
- ├── app.py
29
- ├── requirements.txt
30
- └── Dockerfile
31
- ```
32
-
33
- The top-level package now follows the canonical OpenEnv environment layout:
34
-
35
- - `.dockerignore`
36
- - `__init__.py`
37
- - `models.py`
38
- - `client.py`
39
- - `README.md`
40
- - `openenv.yaml`
41
- - `pyproject.toml`
42
- - `outputs/logs`
43
- - `outputs/evals`
44
- - `server/`
45
-
46
- Supporting files outside the minimum template remain for quality and verification:
47
-
48
- - `inference.py`
49
- - `tests/test_grid2op_env.py`
50
- - helper server modules such as `tasks.py`, `graders.py`, and `logging_utils.py`
51
-
52
- ## What is implemented
53
-
54
- - Grid2Op core simulator using `l2rpn_case14_sandbox`
55
- - Typed `GridAction`, `GridObservation`, and `GridState`
56
- - Four tasks: `single_fault`, `n_minus_1`, `cascade_prevent`, `multi_stage_cascade`
57
- - Reset-time scenario injection and retry logic for non-convergent starts
58
- - Shaped reward, episode logging, and deterministic graders
59
- - OpenEnv WebSocket interface plus `/tasks`, `/grader`, and `/baseline`
60
- - Server-side planner support via:
61
- - `POST /planning_context`
62
- - `POST /simulate`
63
- - Qwen3.5 baseline using the Chat Completions API
64
- - Local Docker workflow with dataset pre-download
65
-
66
- ## Recent fixes
67
-
68
- 1. **Task 1 (single_fault) benchmark ranges corrected** (tasks.py:297-304):
69
- - `single_fault_easy`: 0.82-0.85 (was mathematically impossible 0.90-0.94)
70
- - `single_fault_moderate`: 0.86-0.89 (was 0.94-0.97)
71
- - `single_fault_severe`: 0.90-0.93 (was 0.96-0.99)
72
- - Warmup phase finds high-loading state in chronics, then agent has 10 steps to solve
73
-
74
- 2. **Task 1 reward function** (grid_environment.py:589-596):
75
- - Target achieved bonus: `1.0 / step_count` (rewards early solution)
76
- - Safe margin bonus: `0.05 × max(0.0, 1.0 - max_rho)`
77
- - Overload penalty: `0.2 × overloaded_count` (lines > 100%)
78
- - Redispatch penalty: `0.01 × MW` (discourages large interventions)
79
- - Failure penalty: `-5.0` if time limit reached without target
80
-
81
- 3. **Task 1 grading** (graders.py:28-55):
82
- - 70% weight on survival ratio
83
- - 50% target achieved bonus
84
- - Final state bonus (0.3 if below target, 0.15/+0.05, 0.05/+0.10)
85
- - Legacy success score for early completion: `1.0 - 0.08 × (step - 1)`
86
-
87
- 4. **Task 2 (n_minus_1) redesign** based on RL2Grid paper (grid_environment.py:598-609):
88
- - Three-component reward: `0.3×R_survive + 0.6×R_overload + 0.1×R_cost`
89
- - `R_survive`: +1.0 per step (constant survival signal)
90
- - `R_overload`: `(1/n) × Σ clip(1-ρ, -1, 1)` - loading margin quality
91
- - `R_cost`: `-0.05 × Σ|ΔMW|/max_ramp` (normalized redispatch cost)
92
- - Reconnection bonus: +2.0 when safely reconnecting (grid_environment.py:853-869)
93
- - Terminal: +10×(s/m)² quadratic survival, -15 blackout
94
- - Phase-aware grader (graders.py:58-83):
95
- - Emergency response (30%): cleared within 5 steps at rho < 0.92
96
- - Sustained security (50%): steps 6-20 at rho < 0.90
97
- - Reconnection (20%): did agent reconnect line 0?
98
- - N-1 security score (bridge lines) in prompt
99
- - **Grading now honest**: score = survival_ratio × mastery_score (no override)
100
- - Latest eval: 0.952 (was 1.0 with old override)
101
-
102
- 5. **Task 3 (cascade_prevent)** (grid_environment.py:611-628):
103
- - 1-2 lines disconnected at reset + 5-15% load increase
104
- - Key metric: `timestep_overflow` countdowns (not just max_rho)
105
- - Quadratic overflow penalty: `-0.05 × Σ(overflow²)` - line at overflow=2 is 4x more urgent than overflow=1
106
- - Reward components:
107
- - Cascade prevention: +0.3 if no auto-trip, -2.5 if auto-trip
108
- - Thermal margin: +0.1 × mean(clip(1-ρ, -1, 1))
109
- - Terminal: +5.0 × (1 - auto_trips/5)² survival bonus, -12.0 blackout
110
- - Grading (graders.py:86-121):
111
- - Cascade containment (50%): steps without auto-trips / 30
112
- - Thermal stability (30%): safe_steps / containment_steps
113
- - Recovery speed (20%): how fast recovered from first overload
114
- - Latest eval: 0.798 (hard/extreme tiers challenging)
115
-
116
- 6. **Task 4 (multi_stage_cascade)** (tasks.py:334-337, grid_environment.py:630-647):
117
- - 3 lines disconnected at reset + **20% load increase** (not 15%)
118
- - Three explicit stages (10 steps each) with stage boundaries at step 10 and 20
119
- - Overflow window: 2 (faster cascades than default 3)
120
- - Do-nothing survival probe: 5 steps minimum
121
- - Island availability assessment at stage boundaries (grid_environment.py:767-814)
122
- - Candidate filtering (inference.py:1003-1030): filters unsafe topology disconnects
123
- - Reward (grid_environment.py:630-647):
124
- - Generation cost: -0.02 × (total_gen / initial_load)
125
- - Convergence: +0.5 × available_island_ratio
126
- - Load loss penalty: -5.0 × (1 - available_load_ratio) at boundaries only
127
- - Terminal win: +8.0 × (available_load_ratio)² if ≥50% load at step 30
128
- - Terminal blackout: -12.0
129
- - Grading (graders.py:124-174):
130
- - Stage completion (30%): survived stages 1, 2, 3
131
- - Load preservation (40%): available_load_ratio at end
132
- - Island quality (20%): majority islands viable at boundaries
133
- - Speed bonus (10%): how fast stability returned each stage
134
- - Latest eval: 0.929 (31x improvement from 0.027)
135
-
136
- ## Planner architecture
137
-
138
- `inference.py` now uses this flow:
139
-
140
- 1. `reset()` live episode
141
- 2. `state()` to obtain `episode_id`
142
- 3. `planning_context(episode_id)` for graph intelligence and redispatchable generators
143
- 4. LLM proposes 3 candidate actions
144
- 5. `simulate_candidates(episode_id, actions)` on the live server session
145
- 6. LLM selects the safest simulated action
146
- 7. `step(action)`
147
-
148
- This avoids the old replay-mirror drift problem.
149
-
150
- ## Local Docker workflow
151
-
152
- Build:
153
-
154
- ```bash
155
- cd grid2op_env
156
- docker build -t grid2op-env:local -f server/Dockerfile .
157
- ```
158
-
159
- Run:
160
-
161
- ```bash
162
- docker run --rm -p 7860:7860 grid2op-env:local
163
- ```
164
-
165
- If your Qwen-compatible API is running on the host machine, use:
166
-
167
- ```bash
168
- docker run --rm \
169
- --add-host=host.docker.internal:host-gateway \
170
- -e OPENAI_BASE_URL=http://host.docker.internal:8000/v1 \
171
- -e OPENAI_API_KEY=EMPTY \
172
- -p 7860:7860 \
173
- grid2op-env:local
174
- ```
175
-
176
- ## Local UV workflow
177
-
178
- ```bash
179
- cd grid2op_env
180
- env UV_CACHE_DIR=/tmp/uv-cache uv run --no-dev grid2op-smoke --task-id single_fault --steps 1
181
- env UV_CACHE_DIR=/tmp/uv-cache uv run --no-dev server --port 7860
182
- env UV_CACHE_DIR=/tmp/uv-cache uv run --extra dev pytest tests/test_grid2op_env.py -q
183
- ```
184
-
185
- ## Qwen baseline
186
-
187
- The baseline uses the OpenAI Python SDK against a local Chat Completions API.
188
-
189
- ```bash
190
- cat > .env <<'EOF'
191
- OPENAI_BASE_URL=http://localhost:8000/v1
192
- OPENAI_API_KEY=EMPTY
193
- OPENAI_MODEL=cyankiwi/Qwen3.5-9B-AWQ-4bit
194
- EOF
195
-
196
- env UV_CACHE_DIR=/tmp/uv-cache uv run --no-dev inference.py
197
- ```
198
-
199
- ## Important runtime note
200
-
201
- After changing server code, restart the Grid2Op server before running `inference.py`. The planner depends on the live server routes `/planning_context` and `/simulate`.
202
-
203
- ## Latest verified result
204
-
205
- Latest saved run:
206
-
207
- - `single_fault`: `0.752`
208
- - `n_minus_1`: `0.952`
209
- - `cascade_prevent`: `0.798`
210
- - `multi_stage_cascade`: `0.929`
211
-
212
- This confirms the server-side simulation path is active.
213
-
214
- ## Architecture Documentation
215
-
216
- - [architecture/task_1_architecture.md](/home/sidharth/Desktop/Openenv_modules/architecture/task_1_architecture.md) - Task 1 detailed walkthrough
217
- - [architecture/task_2_architecture.md](/home/sidharth/Desktop/Openenv_modules/architecture/task_2_architecture.md) - Task 2 N-1 contingency management
218
- - [architecture/task_3_architecture.md](/home/sidharth/Desktop/Openenv_modules/architecture/task_3_architecture.md) - Task 3 cascade prevention
219
- - [architecture/task_4_architecture.md](/home/sidharth/Desktop/Openenv_modules/architecture/task_4_architecture.md) - Task 4 multi-stage cascade management
220
- - [architecture/architecture.md](/home/sidharth/Desktop/Openenv_modules/architecture/architecture.md) - Overall system architecture
grid2op_env/pyproject.toml DELETED
@@ -1,35 +0,0 @@
1
- [build-system]
2
- requires = ["setuptools>=45", "wheel"]
3
- build-backend = "setuptools.build_meta"
4
-
5
- [project]
6
- name = "grid2op-env"
7
- version = "0.1.0"
8
- description = "Standalone OpenEnv wrapper around Grid2Op"
9
- readme = "README.md"
10
- requires-python = ">=3.10,<3.13"
11
- dependencies = [
12
- "openenv-core[core]>=0.2.2",
13
- "grid2op>=1.10.5",
14
- "numpy>=1.24.0",
15
- "openai>=2.7.2",
16
- "python-dotenv>=1.0.1",
17
- "requests>=2.31.0",
18
- ]
19
-
20
- [project.optional-dependencies]
21
- dev = [
22
- "pytest>=8.0.0",
23
- ]
24
- lightsim = [
25
- "lightsim2grid>=0.10.0",
26
- ]
27
-
28
- [project.scripts]
29
- server = "grid2op_env.server.app:main"
30
- grid2op-smoke = "grid2op_env.server.grid_environment:smoke_main"
31
-
32
- [tool.setuptools]
33
- include-package-data = true
34
- packages = ["grid2op_env", "grid2op_env.server"]
35
- package-dir = { "grid2op_env" = ".", "grid2op_env.server" = "server" }
grid2op_env/uv.lock DELETED
The diff for this file is too large to render. See raw diff
 
grid2op_env/inference.py → inference.py RENAMED
@@ -27,7 +27,7 @@ from grid2op_env.models import (
27
  from grid2op_env.server.tasks import TASKS, benchmark_tiers_for_task
28
 
29
 
30
- def configure_logging(level: int = logging.INFO) -> None:
31
  root_logger = logging.getLogger()
32
  if root_logger.handlers:
33
  root_logger.setLevel(level)
@@ -56,11 +56,18 @@ configure_logging()
56
  logger = logging.getLogger(__name__)
57
 
58
  TASK_SEED_OVERRIDES: dict[TaskId, int] = {
59
- "single_fault": 2,
 
60
  "cascade_prevent": 2,
 
61
  }
62
  HF_ROUTER_BASE_URL = "https://router.huggingface.co/v1"
63
- HF_ROUTER_DEFAULT_MODEL = "openai/gpt-oss-safeguard-20b:groq"
64
 
65
 
66
  @dataclass
@@ -94,36 +101,23 @@ class SimulationOutcome:
94
  raw_result: dict[str, Any]
95
 
96
 
97
- def _env_flag(name: str, default: bool) -> bool:
98
- raw_value = os.environ.get(name)
99
- if raw_value is None:
100
- return default
101
- return raw_value.strip().lower() in {"1", "true", "yes", "on"}
102
-
103
 
104
- def _use_local_llm_setup() -> bool:
105
- if "local_setup" in os.environ:
106
- return _env_flag("local_setup", True)
107
- return _env_flag("LOCAL_SETUP", True)
108
 
109
-
110
- def _default_model_name() -> str:
111
- if _use_local_llm_setup():
112
- return os.environ.get("OPENAI_MODEL", "Qwen/Qwen3.5-9B")
113
- return os.environ.get("HF_ROUTER_MODEL", HF_ROUTER_DEFAULT_MODEL)
114
 
115
 
116
  def _build_llm_client() -> OpenAI:
117
- if _use_local_llm_setup():
118
- return OpenAI()
119
- hf_token = os.environ.get("HF_TOKEN")
120
- if not hf_token:
121
  raise RuntimeError(
122
- "HF Router mode requires HF_TOKEN in the environment when local_setup=false."
123
  )
124
  return OpenAI(
125
- base_url=HF_ROUTER_BASE_URL,
126
- api_key=hf_token,
127
  )
128
 
129
 
@@ -138,17 +132,152 @@ def _chat_completion_kwargs(
138
  "temperature": llm_config.temperature,
139
  "top_p": llm_config.top_p,
140
  "presence_penalty": llm_config.presence_penalty,
 
141
  }
142
- if _use_local_llm_setup():
143
- request_kwargs["extra_body"] = {
144
- "top_k": llm_config.top_k,
145
- "min_p": llm_config.min_p,
146
- "repetition_penalty": llm_config.repetition_penalty,
147
- "chat_template_kwargs": {"enable_thinking": llm_config.enable_thinking},
148
- }
149
  return request_kwargs
150
 
151
 
152
  def run_baseline_suite(
153
  base_url: str,
154
  config: BaselineRequest | None = None,
@@ -158,9 +287,7 @@ def run_baseline_suite(
158
  run_paths = prepare_run_paths(timestamp)
159
  attach_file_logger(run_paths["log"])
160
 
161
- request_config = config or BaselineRequest(
162
- model=_default_model_name()
163
- )
164
  llm_config = BaselineConfig(
165
  model=request_config.model,
166
  max_tokens=request_config.max_tokens,
@@ -182,12 +309,12 @@ def run_baseline_suite(
182
  episode_lengths: Dict[TaskId, int] = {}
183
  evaluation_records: list[dict[str, Any]] = []
184
  logger.info(
185
- "Starting baseline suite base_url=%s model=%s num_seeds=%s seed_start=%s local_setup=%s",
186
  base_url,
 
187
  llm_config.model,
188
  llm_config.num_seeds,
189
  llm_config.seed_start,
190
- _use_local_llm_setup(),
191
  )
192
 
193
  with GridEnv(base_url=base_url).sync() as env:
@@ -528,6 +655,25 @@ def choose_action_with_qwen(
528
  },
529
  }
530
531
  final_prompt = build_final_selection_prompt(
532
  task_id=task_id,
533
  observation=observation,
@@ -610,12 +756,26 @@ def build_proposal_prompt(
610
  majority_islands_available = bool(
611
  observation.metadata.get("majority_islands_available", False)
612
  )
613
  lines = [
614
  "You are a grid operator proposing actions for a deterministic simulator.",
615
  "Propose exactly 3 candidate actions to test in the physics sandbox.",
616
  "Allowed action types: disconnect_line, reconnect_line, redispatch, do_nothing.",
617
  "Return a single JSON object only.",
618
- 'Use this exact schema: {"candidates":[{"action_type":"disconnect_line|reconnect_line|redispatch|do_nothing","line_id":null|int,"gen_id":null|int,"delta_mw":null|float,"reason":"short string"}]}',
619
  "Rules: no markdown, no prose, no code fences, no extra keys, exactly 3 candidates.",
620
  "Diversity rule: use at least two different action types when plausible.",
621
  "CRITICAL PHYSICS RULE: You must prioritize candidates from the sensitivity_guidance list. These actions have been mathematically verified by power-flow sensitivity factors to reduce the load on the stressed line.",
@@ -648,6 +808,10 @@ def build_proposal_prompt(
648
  6,
649
  "TASK RULE: For single_fault, do not propose disconnect_line or reconnect_line. Use redispatch and do_nothing only. Solve congestion by shifting generation, not by cutting topology.",
650
  )
 
 
 
 
651
  if task_id == "n_minus_1":
652
  danger_lines = [
653
  entry for entry in stressed_lines if float(entry["rho"]) >= 0.92
@@ -666,22 +830,50 @@ def build_proposal_prompt(
666
  )
667
  lines.insert(
668
  7,
669
- f"N-1 STRUCTURAL SECURITY: score={float(graph_intelligence.get('n1_security_score', 0.0)):.3f}; bridge_lines={json.dumps(graph_intelligence.get('bridge_lines', []), separators=(',', ':'))}",
670
  )
671
  lines.insert(
672
  8,
673
- "THRESHOLDS: EMERGENCY if any line rho >= 0.92, WARNING for 0.80 <= rho < 0.92, SAFE if all lines are below 0.80.",
674
  )
675
  lines.insert(
676
  9,
677
- "EMERGENCY_LINES=" + json.dumps(danger_lines, separators=(",", ":")),
678
  )
679
  lines.insert(
680
  10,
681
- "WARNING_LINES=" + json.dumps(warning_lines, separators=(",", ":")),
682
  )
683
  lines.insert(
684
  11,
 
 
 
685
  "RECONNECT_WINDOW_LINES="
686
  + json.dumps(cooldown_zero_lines, separators=(",", ":")),
687
  )
@@ -806,6 +998,19 @@ def build_final_selection_prompt(
806
  7,
807
  "RULE: If a simulated candidate safely reduces max_rho compared to the current state, you MUST select it over do_nothing, no matter how small the reduction is. Do not choose do_nothing unless every other candidate increases max_rho or causes a failure. Safe, incremental redispatch improvements are the only way to win.",
808
  )
809
  if task_id == "multi_stage_cascade":
810
  lines.insert(
811
  7,
@@ -927,7 +1132,14 @@ def parse_candidate_proposals(
927
  task_id: TaskId = "n_minus_1",
928
  ) -> tuple[list[tuple[GridAction, dict[str, Any]]], dict[str, Any]]:
929
  payload = parse_json_action(content)
930
- raw_candidates = payload.get("candidates", [])
 
931
  candidates: list[tuple[GridAction, dict[str, Any]]] = []
932
  if isinstance(raw_candidates, list):
933
  for item in raw_candidates[:3]:
@@ -1649,15 +1861,23 @@ def append_evaluation_markdown(
1649
 
1650
  if __name__ == "__main__":
1651
  parser = argparse.ArgumentParser()
1652
  parser.add_argument(
1653
  "--task-id",
1654
  dest="task_ids",
1655
  nargs="+",
1656
  choices=sorted(TASKS.keys()),
1657
- help="Run only the selected task ids. Defaults to all tasks.",
1658
  )
1659
  args = parser.parse_args()
1660
 
1661
- base_url = os.environ.get("GRID2OP_BASE_URL", "http://127.0.0.1:7860")
1662
- result = run_baseline_suite(base_url=base_url, task_ids=args.task_ids)
1663
- print(result.model_dump_json(indent=2))
27
  from grid2op_env.server.tasks import TASKS, benchmark_tiers_for_task
28
 
29
 
30
+ def configure_logging(level: int = logging.WARNING) -> None:
31
  root_logger = logging.getLogger()
32
  if root_logger.handlers:
33
  root_logger.setLevel(level)
 
56
  logger = logging.getLogger(__name__)
57
 
58
  TASK_SEED_OVERRIDES: dict[TaskId, int] = {
59
+ "single_fault": 1,
60
+ "n_minus_1": 4,
61
  "cascade_prevent": 2,
62
+ "multi_stage_cascade": 4,
63
  }
64
  HF_ROUTER_BASE_URL = "https://router.huggingface.co/v1"
65
+ HF_ROUTER_DEFAULT_MODEL = "openai/gpt-oss-20b:groq"
66
+ DEFAULT_ENV_BASE_URL = "http://127.0.0.1:7860"
67
+ DEFAULT_BENCHMARK_NAME = "grid2op_env"
68
+ SUBMISSION_SUCCESS_SCORE_THRESHOLD = float(
69
+ os.getenv("SUCCESS_SCORE_THRESHOLD", "0.1")
70
+ )
71
 
72
 
73
  @dataclass
 
101
  raw_result: dict[str, Any]
102
 
103
 
104
+ def _default_model_name() -> str:
105
+ return os.environ.get("MODEL_NAME", HF_ROUTER_DEFAULT_MODEL)
 
 
 
 
106
 
 
 
 
 
107
 
108
+ def _llm_api_base_url() -> str:
109
+ return os.environ.get("API_BASE_URL", HF_ROUTER_BASE_URL)
 
 
 
110
 
111
 
112
  def _build_llm_client() -> OpenAI:
113
+ api_key = os.environ.get("HF_TOKEN") or os.environ.get("API_KEY")
114
+ if not api_key:
 
 
115
  raise RuntimeError(
116
+ "Set HF_TOKEN or API_KEY to use Hugging Face Router inference."
117
  )
118
  return OpenAI(
119
+ base_url=_llm_api_base_url(),
120
+ api_key=api_key,
121
  )
122
 
123
 
 
132
  "temperature": llm_config.temperature,
133
  "top_p": llm_config.top_p,
134
  "presence_penalty": llm_config.presence_penalty,
135
+ "stream": False,
136
  }
 
 
 
 
 
 
 
137
  return request_kwargs
138
 
139
 
140
+ def log_start(task: str, env: str, model: str) -> None:
141
+ print(f"[START] task={task} env={env} model={model}", flush=True)
142
+
143
+
144
+ def log_step(
145
+ step: int,
146
+ action: GridAction,
147
+ reward: float,
148
+ done: bool,
149
+ error: str | None,
150
+ ) -> None:
151
+ error_val = error if error else "null"
152
+ done_val = str(done).lower()
153
+ action_str = json.dumps(action.model_dump(), separators=(",", ":"), sort_keys=True)
154
+ print(
155
+ f"[STEP] step={step} action={action_str} reward={reward:.2f} done={done_val} error={error_val}",
156
+ flush=True,
157
+ )
158
+
159
+
160
+ def log_end(success: bool, steps: int, score: float, rewards: list[float]) -> None:
161
+ rewards_str = ",".join(f"{reward:.2f}" for reward in rewards)
162
+ print(
163
+ f"[END] success={str(success).lower()} steps={steps} score={score:.3f} rewards={rewards_str}",
164
+ flush=True,
165
+ )
166
+
167
+
168
+ def run_submission_episodes(task_ids: Sequence[TaskId] | None = None) -> dict[TaskId, float]:
169
+ base_url = os.environ.get("GRID2OP_BASE_URL", DEFAULT_ENV_BASE_URL)
170
+ benchmark_name = os.environ.get("GRID2OP_BENCHMARK", DEFAULT_BENCHMARK_NAME)
171
+ scenario_mode = os.environ.get("GRID2OP_SCENARIO_MODE", "benchmark")
172
+ selected_task_ids = list(task_ids) if task_ids is not None else list(TASKS.keys())
173
+ llm_config = BaselineConfig(
174
+ model=_default_model_name(),
175
+ max_tokens=int(os.environ.get("MAX_TOKENS", "1200")),
176
+ temperature=float(os.environ.get("TEMPERATURE", "0.7")),
177
+ top_p=float(os.environ.get("TOP_P", "0.8")),
178
+ presence_penalty=float(os.environ.get("PRESENCE_PENALTY", "1.5")),
179
+ top_k=int(os.environ.get("TOP_K", "20")),
180
+ min_p=float(os.environ.get("MIN_P", "0.0")),
181
+ repetition_penalty=float(os.environ.get("REPETITION_PENALTY", "1.0")),
182
+ enable_thinking=False,
183
+ num_seeds=int(os.environ.get("NUM_SEEDS", "5")),
184
+ seed_start=int(os.environ.get("SEED_START", "0")),
185
+ scenario_mode=scenario_mode, # type: ignore[arg-type]
186
+ )
187
+ client = _build_llm_client()
188
+ task_scores: dict[TaskId, float] = {}
189
+ with GridEnv(base_url=base_url).sync() as env:
190
+ for task_id in selected_task_ids:
191
+ task = TASKS[task_id]
192
+ benchmark_tiers = benchmark_tiers_for_task(task_id)
193
+ task_num_seeds = TASK_SEED_OVERRIDES.get(task_id, llm_config.num_seeds)
194
+ task_episode_scores: list[float] = []
195
+ for benchmark_tier in benchmark_tiers:
196
+ for seed in range(
197
+ llm_config.seed_start, llm_config.seed_start + task_num_seeds
198
+ ):
199
+ rewards: list[float] = []
200
+ steps_taken = 0
201
+ score = 0.0
202
+ success = False
203
+ log_start(task=task_id, env=benchmark_name, model=llm_config.model)
204
+ try:
205
+ result = env.reset(
206
+ task_id=task_id,
207
+ seed=seed,
208
+ difficulty_level=1,
209
+ scenario_mode=scenario_mode, # type: ignore[arg-type]
210
+ benchmark_tier=benchmark_tier,
211
+ )
212
+ state = env.state()
213
+ step_idx = 0
214
+
215
+ while not result.done and step_idx < task.max_steps:
216
+ action, _planning_trace = choose_action_with_qwen(
217
+ client=client,
218
+ env=env,
219
+ episode_id=state.episode_id,
220
+ task_id=task_id,
221
+ observation=result.observation,
222
+ step_count=step_idx,
223
+ max_steps=task.max_steps,
224
+ include_task_description=(step_idx == 0),
225
+ llm_config=llm_config,
226
+ )
227
+ error: str | None = None
228
+ try:
229
+ result = env.step(action)
230
+ except Exception as exc:
231
+ error = str(exc)
232
+ log_step(
233
+ step=step_idx + 1,
234
+ action=action,
235
+ reward=0.0,
236
+ done=True,
237
+ error=error,
238
+ )
239
+ raise
240
+ reward = float(result.reward or 0.0)
241
+ rewards.append(reward)
242
+ steps_taken = step_idx + 1
243
+ log_step(
244
+ step=steps_taken,
245
+ action=action,
246
+ reward=reward,
247
+ done=bool(result.done),
248
+ error=error,
249
+ )
250
+ step_idx += 1
251
+
252
+ state = env.state()
253
+ response = requests.post(
254
+ f"{base_url}/grader",
255
+ json={
256
+ "task_id": task_id,
257
+ "episode_log": [
258
+ entry.model_dump() for entry in state.episode_log
259
+ ],
260
+ },
261
+ timeout=60,
262
+ )
263
+ response.raise_for_status()
264
+ score = float(response.json()["score"])
265
+ task_episode_scores.append(score)
266
+ success = score >= SUBMISSION_SUCCESS_SCORE_THRESHOLD
267
+ finally:
268
+ log_end(
269
+ success=success,
270
+ steps=steps_taken,
271
+ score=score,
272
+ rewards=rewards,
273
+ )
274
+ task_scores[task_id] = (
275
+ round(mean(task_episode_scores), 6) if task_episode_scores else 0.0
276
+ )
277
+
278
+ return task_scores
279
+
280
+
281
  def run_baseline_suite(
282
  base_url: str,
283
  config: BaselineRequest | None = None,
 
287
  run_paths = prepare_run_paths(timestamp)
288
  attach_file_logger(run_paths["log"])
289
 
290
+ request_config = config or BaselineRequest(model=_default_model_name())
 
 
291
  llm_config = BaselineConfig(
292
  model=request_config.model,
293
  max_tokens=request_config.max_tokens,
 
309
  episode_lengths: Dict[TaskId, int] = {}
310
  evaluation_records: list[dict[str, Any]] = []
311
  logger.info(
312
+ "Starting baseline suite base_url=%s llm_api_base_url=%s model=%s num_seeds=%s seed_start=%s",
313
  base_url,
314
+ _llm_api_base_url(),
315
  llm_config.model,
316
  llm_config.num_seeds,
317
  llm_config.seed_start,
 
318
  )
319
 
320
  with GridEnv(base_url=base_url).sync() as env:
 
655
  },
656
  }
657
 
658
+ if task_id == "single_fault":
659
+ selected_outcome = selectable_simulations[0]
660
+ return selected_outcome.action, {
661
+ "proposal_prompt": proposal_prompt,
662
+ "proposal_raw_output": proposal_raw_output,
663
+ "proposal_trace": {**proposal_trace, **prefilter_trace},
664
+ "graph_intelligence": graph_intelligence,
665
+ "simulations": [
666
+ serialize_simulation_outcome(outcome) for outcome in simulations
667
+ ],
668
+ "final_prompt": "",
669
+ "final_raw_output": "",
670
+ "final_trace": {
671
+ "decision": "single_call_ranked_selection",
672
+ "reason": selected_outcome.trace.get("reason", ""),
673
+ "selected_candidate": selected_outcome.candidate_index,
674
+ },
675
+ }
676
+
677
  final_prompt = build_final_selection_prompt(
678
  task_id=task_id,
679
  observation=observation,
 
756
  majority_islands_available = bool(
757
  observation.metadata.get("majority_islands_available", False)
758
  )
759
+ action_schema = (
760
+ '{"action_type":"disconnect_line|reconnect_line|redispatch|do_nothing","line_id":null|int,"gen_id":null|int,"delta_mw":null|float,"reason":"short string"}'
761
+ )
762
+ response_schema = (
763
+ '{"primary_action":'
764
+ + action_schema
765
+ + ',"backup_action_1":'
766
+ + action_schema
767
+ + ',"backup_action_2":'
768
+ + action_schema
769
+ + "}"
770
+ if task_id == "single_fault"
771
+ else '{"candidates":[' + action_schema + "," + action_schema + "," + action_schema + "]}"
772
+ )
773
  lines = [
774
  "You are a grid operator proposing actions for a deterministic simulator.",
775
  "Propose exactly 3 candidate actions to test in the physics sandbox.",
776
  "Allowed action types: disconnect_line, reconnect_line, redispatch, do_nothing.",
777
  "Return a single JSON object only.",
778
+ "Use this exact schema: " + response_schema,
779
  "Rules: no markdown, no prose, no code fences, no extra keys, exactly 3 candidates.",
780
  "Diversity rule: use at least two different action types when plausible.",
781
  "CRITICAL PHYSICS RULE: You must prioritize candidates from the sensitivity_guidance list. These actions have been mathematically verified by power-flow sensitivity factors to reduce the load on the stressed line.",
 
808
  6,
809
  "TASK RULE: For single_fault, do not propose disconnect_line or reconnect_line. Use redispatch and do_nothing only. Solve congestion by shifting generation, not by cutting topology.",
810
  )
811
+ lines.insert(
812
+ 7,
813
+ "TASK RULE: Rank your output strictly as primary_action first, then backup_action_1, then backup_action_2. The simulator will test all three and execute the highest-ranked safe option.",
814
+ )
815
  if task_id == "n_minus_1":
816
  danger_lines = [
817
  entry for entry in stressed_lines if float(entry["rho"]) >= 0.92
 
830
  )
831
  lines.insert(
832
  7,
833
+ f"FAULTED_LINE=0; disconnected_now={json.dumps([entry['line_id'] for entry in disconnected], separators=(',', ':'))}",
834
  )
835
  lines.insert(
836
  8,
837
+ f"N-1 PHASE={'emergency' if step_count < 5 else 'steady_state'}; emergency_window_steps_remaining={max(0, 5 - step_count)}",
838
  )
839
  lines.insert(
840
  9,
841
+ "EMERGENCY OBJECTIVE: In steps 1-5, prioritize actions that bring max_rho below 0.92 as fast as possible. Clearing the emergency window is the top priority.",
842
  )
843
  lines.insert(
844
  10,
845
+ "STEADY-STATE OBJECTIVE: From step 6 onward, prioritize keeping max_rho below 0.90 on as many steps as possible while preserving survivability.",
846
  )
847
  lines.insert(
848
  11,
849
+ "RECONNECTION OBJECTIVE: When line 0 cooldown reaches 0, include a reconnect_line candidate for line 0 unless graph intelligence or current overloads strongly suggest it is unsafe.",
850
+ )
851
+ lines.insert(
852
+ 12,
853
+ "CANDIDATE RULE: In the emergency phase, include at least one redispatch candidate aimed at immediate rho reduction. Do not fill the set with passive do_nothing-style choices.",
854
+ )
855
+ lines.insert(
856
+ 13,
857
+ "CANDIDATE RULE: If no action looks clearly better, still propose the smallest safe redispatch or a safe reconnect test rather than defaulting all candidates toward do_nothing.",
858
+ )
859
+ lines.insert(
860
+ 14,
861
+ f"N-1 STRUCTURAL SECURITY: score={float(graph_intelligence.get('n1_security_score', 0.0)):.3f}; bridge_lines={json.dumps(graph_intelligence.get('bridge_lines', []), separators=(',', ':'))}",
862
+ )
863
+ lines.insert(
864
+ 15,
865
+ "THRESHOLDS: EMERGENCY if any line rho >= 0.92, WARNING for 0.80 <= rho < 0.92, SAFE if all lines are below 0.80.",
866
+ )
867
+ lines.insert(
868
+ 16,
869
+ "EMERGENCY_LINES=" + json.dumps(danger_lines, separators=(",", ":")),
870
+ )
871
+ lines.insert(
872
+ 17,
873
+ "WARNING_LINES=" + json.dumps(warning_lines, separators=(",", ":")),
874
+ )
875
+ lines.insert(
876
+ 18,
877
  "RECONNECT_WINDOW_LINES="
878
  + json.dumps(cooldown_zero_lines, separators=(",", ":")),
879
  )
 
998
  7,
999
  "RULE: If a simulated candidate safely reduces max_rho compared to the current state, you MUST select it over do_nothing, no matter how small the reduction is. Do not choose do_nothing unless every other candidate increases max_rho or causes a failure. Safe, incremental redispatch improvements are the only way to win.",
1000
  )
1001
+ if task_id == "n_minus_1":
1002
+ lines.insert(
1003
+ 7,
1004
+ "RULE: In steps 1-5, prioritize candidates that clear the emergency by bringing max_rho below 0.92. Do not choose do_nothing in the emergency window if a safe simulated action lowers max_rho.",
1005
+ )
1006
+ lines.insert(
1007
+ 8,
1008
+ "RULE: When a safe reconnect_line action for line 0 is available after cooldown, strongly prefer it if it improves or preserves security.",
1009
+ )
1010
+ lines.insert(
1011
+ 9,
1012
+ "RULE: After step 5, prefer candidates that keep max_rho below 0.90 on future steps rather than merely surviving at higher stress.",
1013
+ )
1014
  if task_id == "multi_stage_cascade":
1015
  lines.insert(
1016
  7,
 
1132
  task_id: TaskId = "n_minus_1",
1133
  ) -> tuple[list[tuple[GridAction, dict[str, Any]]], dict[str, Any]]:
1134
  payload = parse_json_action(content)
1135
+ if task_id == "single_fault":
1136
+ raw_candidates = [
1137
+ payload.get("primary_action"),
1138
+ payload.get("backup_action_1"),
1139
+ payload.get("backup_action_2"),
1140
+ ]
1141
+ else:
1142
+ raw_candidates = payload.get("candidates", [])
1143
  candidates: list[tuple[GridAction, dict[str, Any]]] = []
1144
  if isinstance(raw_candidates, list):
1145
  for item in raw_candidates[:3]:
 
1861
 
1862
  if __name__ == "__main__":
1863
  parser = argparse.ArgumentParser()
1864
+ parser.add_argument(
1865
+ "--baseline-suite",
1866
+ action="store_true",
1867
+ help="Run the internal multi-task baseline suite instead of the submission episode runner.",
1868
+ )
1869
  parser.add_argument(
1870
  "--task-id",
1871
  dest="task_ids",
1872
  nargs="+",
1873
  choices=sorted(TASKS.keys()),
1874
+ help="Run only the selected task ids for --baseline-suite. Defaults to all tasks.",
1875
  )
1876
  args = parser.parse_args()
1877
 
1878
+ if args.baseline_suite:
1879
+ base_url = os.environ.get("GRID2OP_BASE_URL", DEFAULT_ENV_BASE_URL)
1880
+ result = run_baseline_suite(base_url=base_url, task_ids=args.task_ids)
1881
+ print(result.model_dump_json(indent=2))
1882
+ else:
1883
+ run_submission_episodes(task_ids=args.task_ids)
inference_speed_test.py DELETED
@@ -1,34 +0,0 @@
1
- import os
2
- import time
3
- from dotenv import load_dotenv
4
- from openai import OpenAI
5
-
6
- load_dotenv()
7
-
8
- client = OpenAI(
9
- base_url="https://router.huggingface.co/v1",
10
- api_key=os.environ["HF_TOKEN"],
11
- )
12
-
13
- start_time = time.perf_counter()
14
-
15
- completion = client.chat.completions.create(
16
- model="Qwen/Qwen3.5-9B:fastest",
17
- messages=[
18
- {
19
- "role": "user",
20
- "content": "Write a detailed essay about the history of artificial intelligence, its major milestones, and future implications. Include at least 500 words.",
21
- }
22
- ],
23
- max_tokens=1000,
24
- )
25
-
26
- end_time = time.perf_counter()
27
-
28
- latency = (end_time - start_time) * 1000
29
- tokens = completion.usage.completion_tokens
30
-
31
- print(f"Response: {completion.choices[0].message.content}")
32
- print(f"Latency: {latency:.2f} ms")
33
- print(f"Tokens: {tokens}")
34
- print(f"Throughput: {tokens / (latency / 1000):.2f} tokens/sec")
main.py DELETED
@@ -1,6 +0,0 @@
1
- def main():
2
- print("Hello from openenv-modules!")
3
-
4
-
5
- if __name__ == "__main__":
6
- main()
grid2op_env/models.py → models.py RENAMED
@@ -112,7 +112,7 @@ class GraderResponse(BaseModel):
112
 
113
  class BaselineRequest(BaseModel):
114
  model: str = Field(default="Qwen/Qwen3.5-9B")
115
- max_tokens: int = Field(default=500, ge=1)
116
  temperature: float = 0.7
117
  top_p: float = 0.8
118
  presence_penalty: float = 1.5
 
112
 
113
  class BaselineRequest(BaseModel):
114
  model: str = Field(default="Qwen/Qwen3.5-9B")
115
+ max_tokens: int = Field(default=1200, ge=1)
116
  temperature: float = 0.7
117
  top_p: float = 0.8
118
  presence_penalty: float = 1.5
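The `max_tokens` bump above (500 → 1200) leaves room for the three-candidate JSON responses. A dependency-free stand-in for the Pydantic model, reproducing the `ge=1` bound with a manual check (`BaselineRequestSketch` is a hypothetical name):

```python
from dataclasses import dataclass

@dataclass
class BaselineRequestSketch:
    # Defaults mirror grid2op_env.models.BaselineRequest after this commit
    model: str = "Qwen/Qwen3.5-9B"
    max_tokens: int = 1200  # raised from 500
    temperature: float = 0.7
    top_p: float = 0.8
    presence_penalty: float = 1.5

    def __post_init__(self) -> None:
        # Equivalent of Field(default=1200, ge=1)
        if self.max_tokens < 1:
            raise ValueError("max_tokens must be >= 1")
```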
grid2op_env/openenv.yaml → openenv.yaml RENAMED
File without changes
{grid2op_env/outputs → outputs}/evals/.gitkeep RENAMED
File without changes
{grid2op_env/outputs → outputs}/logs/.gitkeep RENAMED
File without changes
pyproject.toml CHANGED
@@ -1,11 +1,35 @@
1
  [project]
2
- name = "openenv-modules"
3
  version = "0.1.0"
4
- description = "Add your description here"
5
  readme = "README.md"
6
- requires-python = ">=3.13"
7
  dependencies = [
8
- "fastmcp>=3.1.1",
9
- "numba>=0.64.0",
10
- "openenv-core>=0.2.2",
  ]
 
1
+ [build-system]
2
+ requires = ["setuptools>=45", "wheel"]
3
+ build-backend = "setuptools.build_meta"
4
+
5
  [project]
6
+ name = "grid2op-env"
7
  version = "0.1.0"
8
+ description = "Standalone OpenEnv wrapper around Grid2Op"
9
  readme = "README.md"
10
+ requires-python = ">=3.10,<3.13"
11
  dependencies = [
12
+ "openenv-core[core]>=0.2.2",
13
+ "grid2op>=1.10.5",
14
+ "numpy>=1.24.0",
15
+ "openai>=2.7.2",
16
+ "python-dotenv>=1.0.1",
17
+ "requests>=2.31.0",
18
  ]
19
+
20
+ [project.optional-dependencies]
21
+ dev = [
22
+ "pytest>=8.0.0",
23
+ ]
24
+ lightsim = [
25
+ "lightsim2grid>=0.10.0",
26
+ ]
27
+
28
+ [project.scripts]
29
+ server = "grid2op_env.server.app:main"
30
+ grid2op-smoke = "grid2op_env.server.grid_environment:smoke_main"
31
+
32
+ [tool.setuptools]
33
+ include-package-data = true
34
+ packages = ["grid2op_env", "grid2op_env.server"]
35
+ package-dir = { "grid2op_env" = ".", "grid2op_env.server" = "server" }
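The scripts above reach the LLM through the env-var contract this commit standardizes on (`MODEL_NAME`, `API_BASE_URL`, and `HF_TOKEN`/`API_KEY`). A minimal sketch of the resolution logic, mirroring `_default_model_name`, `_llm_api_base_url`, and the key check in `_build_llm_client` (the function name `resolve_llm_config` is illustrative):

```python
import os

HF_ROUTER_BASE_URL = "https://router.huggingface.co/v1"
HF_ROUTER_DEFAULT_MODEL = "openai/gpt-oss-20b:groq"

def resolve_llm_config(environ=os.environ):
    # HF_TOKEN is preferred; API_KEY is the generic fallback
    api_key = environ.get("HF_TOKEN") or environ.get("API_KEY")
    if not api_key:
        raise RuntimeError("Set HF_TOKEN or API_KEY to use Hugging Face Router inference.")
    return {
        "base_url": environ.get("API_BASE_URL", HF_ROUTER_BASE_URL),
        "model": environ.get("MODEL_NAME", HF_ROUTER_DEFAULT_MODEL),
        "api_key": api_key,
    }
```

The returned dict maps directly onto the `OpenAI(base_url=..., api_key=...)` constructor call used in `inference.py`.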
{grid2op_env/server → server}/Dockerfile RENAMED
File without changes
{grid2op_env/server → server}/__init__.py RENAMED
File without changes
{grid2op_env/server → server}/app.py RENAMED
File without changes
{grid2op_env/server → server}/graders.py RENAMED
File without changes
{grid2op_env/server → server}/grid_environment.py RENAMED
File without changes
{grid2op_env/server → server}/logging_utils.py RENAMED
File without changes
{grid2op_env/server → server}/requirements.txt RENAMED
File without changes
{grid2op_env/server → server}/tasks.py RENAMED
File without changes
submission/README.md ADDED
@@ -0,0 +1,217 @@
1
+ # OpenEnv Hackathon Submission Requirements
2
+
3
+ ## Overview
4
+
5
+ This document outlines all requirements for submitting an environment to the OpenEnv Hackathon. All submissions must meet the criteria defined in this document to be evaluated.
6
+
7
+ ---
8
+
9
+ ## 1. Task Requirements
10
+
11
+ ### 1.1 Real-World Task Simulation
12
+ - The environment must simulate a task **humans actually do**
13
+ - **NOT** games or toys
14
+ - Examples of acceptable domains: email triage, code review, data cleaning, scheduling, customer support, content moderation, power grid management
15
+
16
+ ### 1.2 OpenEnv Spec Compliance
17
+ - Implement the full OpenEnv interface:
18
+ - Typed `Observation`, `Action`, and `Reward` Pydantic models
19
+ - `step(action)` → returns `observation, reward, done, info`
20
+ - `reset()` → returns initial observation
21
+ - `state()` → returns current state
22
+ - Include `openenv.yaml` with metadata
23
+ - Tested via `openenv validate`
24
+
25
+ ### 1.3 Minimum 3 Tasks with Agent Graders
26
+ - **Each task** must have:
27
+ - A concrete objective an agent must accomplish
28
+ - A programmatic grader that scores performance (0.0–1.0)
29
+ - Clear, deterministic success/failure criteria
30
+ - **Difficulty progression**: easy → medium → hard
31
+
32
+ ### 1.4 Meaningful Reward Function
33
+ - Provides signal over the full trajectory (not just binary end-of-episode)
34
+ - Rewards partial progress toward task completion
35
+ - Penalizes clearly undesirable behavior (e.g., infinite loops, destructive actions)
36
+
37
+ ---
38
+
39
+ ## 2. Functional Requirements
40
+
41
+ ### 2.1 Baseline Inference Script
42
+ - Must be named `inference.py` and placed in the **root directory**
43
+ - Use the OpenAI API client to run a model against the environment
44
+ - Read API credentials from environment variables:
45
+ - `API_BASE_URL` - The API endpoint for the LLM
46
+ - `MODEL_NAME` - The model identifier to use for inference
47
+ - `HF_TOKEN` - Your Hugging Face token / API key
48
+ - Produce a reproducible baseline score on all tasks
49
+
50
+ ### 2.2 Structured Logging
51
+ - Emit structured stdout logs strictly following the format:
52
+ - `[START]` - Episode start
53
+ - `[STEP]` - Each step
54
+ - `[END]` - Episode end
55
+ - Any deviation in field names, ordering, or formatting will result in incorrect evaluation
56
+
57
+ ---
58
+
59
+ ## 3. Deployment Requirements
60
+
61
+ ### 3.1 Hugging Face Spaces
62
+ - Environment must run as a containerized HF Space tagged with `openenv`
63
+ - Automated ping to the Space URL — must return 200 and respond to `reset()`
64
+
65
+ ### 3.2 Containerized Execution
66
+ - Must include a working `Dockerfile`
67
+ - The environment should start cleanly with `docker build && docker run`
68
+
69
+ ### 3.3 Infrastructure Restrictions
70
+ - Runtime of inference script should be **less than 20 minutes**
71
+ - Must run on a machine with `vcpu=2, memory=8gb`
72
+
73
+ ---
74
+
75
+ ## 4. Documentation Requirements
76
+
77
+ ### 4.1 README
78
+ Must include:
79
+ - Environment description and motivation
80
+ - Action and observation space definitions
81
+ - Task descriptions with expected difficulty
82
+ - Setup and usage instructions
83
+ - Baseline scores
84
+
85
+ ---
86
+
87
+ ## 5. Evaluation Criteria
88
+
89
+ ### 5.1 Parameter Weights
90
+
91
+ | Parameter | Weight | Description |
92
+ |-----------|--------|-------------|
93
+ | **Real-world utility** | 30% | Does the environment model a genuine task? Would someone actually use this to train or evaluate agents? |
94
+ | **Task & grader quality** | 25% | Are tasks well-defined with clear objectives? Do graders accurately and fairly measure success? Meaningful difficulty progression? |
95
+ | **Environment design** | 20% | Clean state management, sensible action/observation spaces, good reward shaping, proper episode boundaries |
96
+ | **Code quality & spec compliance** | 15% | Follows OpenEnv spec, clean project structure, typed models, documented, tested, Dockerfile works |
97
+ | **Creativity & novelty** | 10% | Novel problem domain, interesting mechanics, clever reward design, original approach |
98
+
99
+ ### 5.2 Scoring Breakdown
100
+
101
+ #### Real-world utility (30%)
102
+ - 0–5: Toy/artificial problem with no practical application
103
+ - 6–15: Valid domain but shallow modeling of the real task
104
+ - 16–25: Good domain modeling, would be useful for agent evaluation
105
+ - 26–30: Excellent — fills a real gap, immediate value for the RL/agent community
106
+
107
+ #### Task & grader quality (25%)
108
+ - ✅ 3+ tasks with difficulty range?
109
+ - ✅ Graders produce scores between 0.0–1.0?
110
+ - ✅ Graders deterministic and reproducible?
111
+ - ✅ Hard task genuinely challenges frontier models?
112
+
113
+ #### Environment design (20%)
114
+ - ✅ `reset()` produces clean state?
115
+ - ✅ Action/observation types well-designed and documented?
116
+ - ✅ Reward function provides useful varying signal (not just sparse)?
117
+ - ✅ Episode boundaries sensible?
118
+
#### Code quality & spec compliance (15%)
- ✅ `openenv validate` passes?
- ✅ `docker build && docker run` works?
- ✅ HF Space deploys and responds?
- ✅ Baseline script runs and reproduces scores?

#### Creativity & novelty (10%)
- ✅ Domain we haven't seen in OpenEnv before?
- ✅ Reward design has interesting properties?
- ✅ Clever mechanics that make the environment engaging?

---

## 6. Validation Checklist

Before submitting, ensure:

- [ ] `openenv validate` passes
- [ ] `docker build && docker run` works
- [ ] HF Space deploys and responds to `reset()`
- [ ] Baseline inference script runs without error
- [ ] 3+ tasks with graders (scores in 0.0–1.0 range)
- [ ] `inference.py` named correctly and in root directory
- [ ] Environment variables defined: `API_BASE_URL`, `MODEL_NAME`, `HF_TOKEN`
- [ ] Structured logs follow `[START]`, `[STEP]`, `[END]` format
- [ ] Runtime under 20 minutes
- [ ] Works on 2 vCPU, 8GB RAM machine
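The structured-log item can be checked locally before submitting. The regexes below are an assumption derived from the `[START]`/`[STEP]`/`[END]` format described in the sample inference script, not an official validator:

```python
import re

# One pattern per required line type; assumed from the documented log format.
LOG_PATTERNS = {
    "[START]": re.compile(r"^\[START\] task=\S+ env=\S+ model=\S+$"),
    "[STEP]": re.compile(r"^\[STEP\] step=\d+ action=.+ reward=\d+\.\d{2} done=(?:true|false) error=.+$"),
    "[END]": re.compile(r"^\[END\] success=(?:true|false) steps=\d+ score=\d+\.\d+ rewards=\d+\.\d{2}(?:,\d+\.\d{2})*$"),
}

def check_log_line(line: str) -> bool:
    """Return True if the line matches one of the required [START]/[STEP]/[END] shapes."""
    for prefix, pattern in LOG_PATTERNS.items():
        if line.startswith(prefix):
            return bool(pattern.match(line))
    return False
```

Running your baseline and piping stdout through a checker like this catches formatting drift (missing fields, wrong decimal places) before the automated judging phase does.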

---

## 7. Judging Phases

### Phase 1: Automated Validation (Pass/Fail)
- HF Space deploys
- OpenEnv spec compliance
- Dockerfile builds
- Baseline reproduces
- 3+ tasks with graders

### Phase 2: Agentic Evaluation (Scored)
- Baseline agent re-run
- Standard open LLM agent (e.g., Nemotron 3 Super) run against all environments
- Score variance check

### Phase 3: Human Review
- Top submissions reviewed by Meta and HuggingFace engineers
- Real-world utility check
- Creativity check
- Exploit checks

---

## 8. Disqualification Criteria

The following will result in disqualification:

- Environment does not deploy or respond
- Plagiarized or trivially modified existing environments
- Graders that always return the same score

---

## 9. Example: Power Grid Environment (Reference)

Your environment should follow a similar structure:

```
project/
├── inference.py       # Baseline inference script
├── openenv.yaml       # OpenEnv metadata
├── Dockerfile         # Container configuration
├── README.md          # Documentation
├── src/
│   ├── tasks.py       # Task definitions (3+ tasks)
│   ├── graders.py     # Task graders (scores 0.0-1.0)
│   ├── environment.py # Environment implementation
│   └── models.py      # Typed Observation/Action models
└── requirements.txt   # Dependencies
```

---

## Summary Checklist

| Requirement | Mandatory? |
|-------------|------------|
| Real-world task (not games) | ✅ Yes |
| OpenEnv spec compliance | ✅ Yes |
| 3 tasks (easy→medium→hard) | ✅ Yes |
| Graders (0.0–1.0 scores) | ✅ Yes |
| Meaningful reward function | ✅ Yes |
| `inference.py` in root | ✅ Yes |
| HF_TOKEN, MODEL_NAME, API_BASE_URL | ✅ Yes |
| Structured logs [START/STEP/END] | ✅ Yes |
| HF Space deploys | ✅ Yes |
| Dockerfile works | ✅ Yes |
| Runtime < 20 min | ✅ Yes |
| 2 vCPU, 8GB RAM | ✅ Yes |
| README with setup instructions | ✅ Yes |
submission/pre_validation.sh ADDED
@@ -0,0 +1,185 @@
#!/usr/bin/env bash
#
# pre_validation.sh — OpenEnv Submission Validator
#
# Checks that your HF Space is live, your Docker image builds, and `openenv validate` passes.
#
# Prerequisites:
#   - Docker: https://docs.docker.com/get-docker/
#   - openenv-core: pip install openenv-core
#   - curl (usually pre-installed)
#
# Run:
#   curl -fsSL https://raw.githubusercontent.com/<owner>/<repo>/main/submission/pre_validation.sh | bash -s -- <ping_url> [repo_dir]
#
# Or download and run locally:
#   chmod +x pre_validation.sh
#   ./pre_validation.sh <ping_url> [repo_dir]
#
# Arguments:
#   ping_url   Your HuggingFace Space URL (e.g. https://your-space.hf.space)
#   repo_dir   Path to your repo (default: current directory)
#
# Examples:
#   ./pre_validation.sh https://my-team.hf.space
#   ./pre_validation.sh https://my-team.hf.space ./my-repo
#

set -uo pipefail

DOCKER_BUILD_TIMEOUT=600
if [ -t 1 ]; then
    RED='\033[0;31m'
    GREEN='\033[0;32m'
    YELLOW='\033[1;33m'
    BOLD='\033[1m'
    NC='\033[0m'
else
    RED='' GREEN='' YELLOW='' BOLD='' NC=''
fi

run_with_timeout() {
    local secs="$1"; shift
    if command -v timeout &>/dev/null; then
        timeout "$secs" "$@"
    elif command -v gtimeout &>/dev/null; then
        gtimeout "$secs" "$@"
    else
        "$@" &
        local pid=$!
        ( sleep "$secs" && kill "$pid" 2>/dev/null ) &
        local watcher=$!
        wait "$pid" 2>/dev/null
        local rc=$?
        kill "$watcher" 2>/dev/null
        wait "$watcher" 2>/dev/null
        return $rc
    fi
}

portable_mktemp() {
    local prefix="${1:-validate}"
    mktemp "${TMPDIR:-/tmp}/${prefix}-XXXXXX" 2>/dev/null || mktemp
}

CLEANUP_FILES=()
cleanup() { rm -f "${CLEANUP_FILES[@]+"${CLEANUP_FILES[@]}"}"; }
trap cleanup EXIT

PING_URL="${1:-}"
REPO_DIR="${2:-.}"

if [ -z "$PING_URL" ]; then
    printf "Usage: %s <ping_url> [repo_dir]\n" "$0"
    printf "\n"
    printf "  ping_url   Your HuggingFace Space URL (e.g. https://your-space.hf.space)\n"
    printf "  repo_dir   Path to your repo (default: current directory)\n"
    exit 1
fi

if ! REPO_DIR="$(cd "$REPO_DIR" 2>/dev/null && pwd)"; then
    printf "Error: directory '%s' not found\n" "${2:-.}"
    exit 1
fi
PING_URL="${PING_URL%/}"
export PING_URL
PASS=0

log()  { printf "[%s] %b\n" "$(date -u +%H:%M:%S)" "$*"; }
pass() { log "${GREEN}PASSED${NC} -- $1"; PASS=$((PASS + 1)); }
fail() { log "${RED}FAILED${NC} -- $1"; }
hint() { printf "  ${YELLOW}Hint:${NC} %b\n" "$1"; }
stop_at() {
    printf "\n"
    printf "${RED}${BOLD}Validation stopped at %s.${NC} Fix the above before continuing.\n" "$1"
    exit 1
}

printf "\n"
printf "${BOLD}========================================${NC}\n"
printf "${BOLD}  OpenEnv Submission Validator${NC}\n"
printf "${BOLD}========================================${NC}\n"
log "Repo:     $REPO_DIR"
log "Ping URL: $PING_URL"
printf "\n"

log "${BOLD}Step 1/3: Pinging HF Space${NC} ($PING_URL/reset) ..."

CURL_OUTPUT=$(portable_mktemp "validate-curl")
CLEANUP_FILES+=("$CURL_OUTPUT")
HTTP_CODE=$(curl -s -o "$CURL_OUTPUT" -w "%{http_code}" -X POST \
    -H "Content-Type: application/json" -d '{}' \
    "$PING_URL/reset" --max-time 30 2>"$CURL_OUTPUT" || printf "000")

if [ "$HTTP_CODE" = "200" ]; then
    pass "HF Space is live and responds to /reset"
elif [ "$HTTP_CODE" = "000" ]; then
    fail "HF Space not reachable (connection failed or timed out)"
    hint "Check your network connection and that the Space is running."
    hint "Try: curl -s -o /dev/null -w '%%{http_code}' -X POST $PING_URL/reset"
    stop_at "Step 1"
else
    fail "HF Space /reset returned HTTP $HTTP_CODE (expected 200)"
    hint "Make sure your Space is running and the URL is correct."
    hint "Try opening $PING_URL in your browser first."
    stop_at "Step 1"
fi

log "${BOLD}Step 2/3: Running docker build${NC} ..."

if ! command -v docker &>/dev/null; then
    fail "docker command not found"
    hint "Install Docker: https://docs.docker.com/get-docker/"
    stop_at "Step 2"
fi

if [ -f "$REPO_DIR/Dockerfile" ]; then
    DOCKER_CONTEXT="$REPO_DIR"
elif [ -f "$REPO_DIR/server/Dockerfile" ]; then
    DOCKER_CONTEXT="$REPO_DIR/server"
else
    fail "No Dockerfile found in repo root or server/ directory"
    stop_at "Step 2"
fi

log "  Found Dockerfile in $DOCKER_CONTEXT"

BUILD_OK=false
BUILD_OUTPUT=$(run_with_timeout "$DOCKER_BUILD_TIMEOUT" docker build "$DOCKER_CONTEXT" 2>&1) && BUILD_OK=true

if [ "$BUILD_OK" = true ]; then
    pass "Docker build succeeded"
else
    fail "Docker build failed (timeout=${DOCKER_BUILD_TIMEOUT}s)"
    printf "%s\n" "$BUILD_OUTPUT" | tail -20
    stop_at "Step 2"
fi

log "${BOLD}Step 3/3: Running openenv validate${NC} ..."

if ! command -v openenv &>/dev/null; then
    fail "openenv command not found"
    hint "Install it: pip install openenv-core"
    stop_at "Step 3"
fi

VALIDATE_OK=false
VALIDATE_OUTPUT=$(cd "$REPO_DIR" && openenv validate 2>&1) && VALIDATE_OK=true

if [ "$VALIDATE_OK" = true ]; then
    pass "openenv validate passed"
    [ -n "$VALIDATE_OUTPUT" ] && log "  $VALIDATE_OUTPUT"
else
    fail "openenv validate failed"
    printf "%s\n" "$VALIDATE_OUTPUT"
    stop_at "Step 3"
fi

printf "\n"
printf "${BOLD}========================================${NC}\n"
printf "${GREEN}${BOLD}  All 3/3 checks passed!${NC}\n"
printf "${GREEN}${BOLD}  Your submission is ready to submit.${NC}\n"
printf "${BOLD}========================================${NC}\n"
printf "\n"

exit 0
submission/sample_inference.py ADDED
@@ -0,0 +1,188 @@
"""
Inference Script Example
===================================
MANDATORY
- Before submitting, ensure the following variables are defined in your environment configuration:
    API_BASE_URL       The API endpoint for the LLM.
    MODEL_NAME         The model identifier to use for inference.
    HF_TOKEN           Your Hugging Face / API key.
    LOCAL_IMAGE_NAME   The name of the local image to use for the environment if you are using the
                       from_docker_image() method.

- Defaults are set only for API_BASE_URL and MODEL_NAME
  (and should reflect your active inference setup):
    API_BASE_URL = os.getenv("API_BASE_URL", "<your-active-endpoint>")
    MODEL_NAME = os.getenv("MODEL_NAME", "<your-active-model>")

- The inference script must be named `inference.py` and placed in the root directory of the project.
- Participants must use the OpenAI client for all LLM calls, using the variables above.

STDOUT FORMAT
- The script must emit exactly three line types to stdout, in this order:

    [START] task=<task_name> env=<benchmark> model=<model_name>
    [STEP] step=<n> action=<action_str> reward=<0.00> done=<true|false> error=<msg|null>
    [END] success=<true|false> steps=<n> score=<score> rewards=<r1,r2,...,rn>

Rules:
- One [START] line at episode begin.
- One [STEP] line per step, immediately after env.step() returns.
- One [END] line after env.close(), always emitted (even on exception).
- reward and rewards are formatted to 2 decimal places.
- done and success are lowercase booleans: true or false.
- error is the raw last_action_error string, or null if none.
- All fields on a single line with no newlines within a line.
- Each task should return a score in [0, 1].

Example:
    [START] task=click-test env=miniwob model=Qwen3-VL-30B
    [STEP] step=1 action=click('123') reward=0.00 done=false error=null
    [STEP] step=2 action=fill('456','text') reward=0.00 done=false error=null
    [STEP] step=3 action=click('789') reward=1.00 done=true error=null
    [END] success=true steps=3 score=1.00 rewards=0.00,0.00,1.00
"""

import asyncio
import os
import textwrap
from typing import List, Optional

from openai import OpenAI

from my_env_v4 import MyEnvV4Action, MyEnvV4Env

IMAGE_NAME = os.getenv("IMAGE_NAME")  # If you are using a docker image
API_KEY = os.getenv("HF_TOKEN") or os.getenv("API_KEY")

API_BASE_URL = os.getenv("API_BASE_URL") or "https://router.huggingface.co/v1"
MODEL_NAME = os.getenv("MODEL_NAME") or "Qwen/Qwen2.5-72B-Instruct"
TASK_NAME = os.getenv("MY_ENV_V4_TASK", "echo")
BENCHMARK = os.getenv("MY_ENV_V4_BENCHMARK", "my_env_v4")
MAX_STEPS = 8
TEMPERATURE = 0.7
MAX_TOKENS = 150
SUCCESS_SCORE_THRESHOLD = 0.1  # normalized score in [0, 1]

# Max possible reward: each token contributes 0.1, across all steps
_MAX_REWARD_PER_STEP = MAX_TOKENS * 0.1
MAX_TOTAL_REWARD = MAX_STEPS * _MAX_REWARD_PER_STEP

SYSTEM_PROMPT = textwrap.dedent(
    """
    You are interacting with a simple echo environment.
    Each turn you must send a message. The environment will echo it back.
    Reward is proportional to message length: reward = len(message) * 0.1
    Your goal is to maximize total reward by sending meaningful, substantive messages.
    Reply with exactly one message string — no quotes, no prefixes, just the message text.
    """
).strip()


def log_start(task: str, env: str, model: str) -> None:
    print(f"[START] task={task} env={env} model={model}", flush=True)


def log_step(step: int, action: str, reward: float, done: bool, error: Optional[str]) -> None:
    error_val = error if error else "null"
    done_val = str(done).lower()
    print(
        f"[STEP] step={step} action={action} reward={reward:.2f} done={done_val} error={error_val}",
        flush=True,
    )


def log_end(success: bool, steps: int, score: float, rewards: List[float]) -> None:
    rewards_str = ",".join(f"{r:.2f}" for r in rewards)
    print(f"[END] success={str(success).lower()} steps={steps} score={score:.3f} rewards={rewards_str}", flush=True)


def build_user_prompt(step: int, last_echoed: str, last_reward: float, history: List[str]) -> str:
    history_block = "\n".join(history[-4:]) if history else "None"
    return textwrap.dedent(
        f"""
        Step: {step}
        Last echoed message: {last_echoed!r}
        Last reward: {last_reward:.2f}
        Previous steps:
        {history_block}
        Send your next message.
        """
    ).strip()


def get_model_message(client: OpenAI, step: int, last_echoed: str, last_reward: float, history: List[str]) -> str:
    user_prompt = build_user_prompt(step, last_echoed, last_reward, history)
    try:
        completion = client.chat.completions.create(
            model=MODEL_NAME,
            messages=[
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": user_prompt},
            ],
            temperature=TEMPERATURE,
            max_tokens=MAX_TOKENS,
            stream=False,
        )
        text = (completion.choices[0].message.content or "").strip()
        return text if text else "hello"
    except Exception as exc:
        print(f"[DEBUG] Model request failed: {exc}", flush=True)
        return "hello"


async def main() -> None:
    client = OpenAI(base_url=API_BASE_URL, api_key=API_KEY)

    env = await MyEnvV4Env.from_docker_image(IMAGE_NAME)

    history: List[str] = []
    rewards: List[float] = []
    steps_taken = 0
    score = 0.0
    success = False

    log_start(task=TASK_NAME, env=BENCHMARK, model=MODEL_NAME)

    try:
        result = await env.reset()  # OpenEnv reset()
        last_echoed = result.observation.echoed_message
        last_reward = 0.0

        for step in range(1, MAX_STEPS + 1):
            if result.done:
                break

            message = get_model_message(client, step, last_echoed, last_reward, history)

            result = await env.step(MyEnvV4Action(message=message))
            obs = result.observation

            reward = result.reward or 0.0
            done = result.done
            error = None

            rewards.append(reward)
            steps_taken = step
            last_echoed = obs.echoed_message
            last_reward = reward

            log_step(step=step, action=message, reward=reward, done=done, error=error)

            history.append(f"Step {step}: {message!r} -> reward {reward:+.2f}")

            if done:
                break

        score = sum(rewards) / MAX_TOTAL_REWARD if MAX_TOTAL_REWARD > 0 else 0.0
        score = min(max(score, 0.0), 1.0)  # clamp to [0, 1]
        success = score >= SUCCESS_SCORE_THRESHOLD

    finally:
        try:
            await env.close()
        except Exception as e:
            print(f"[DEBUG] env.close() error (container cleanup): {e}", flush=True)
        log_end(success=success, steps=steps_taken, score=score, rewards=rewards)


if __name__ == "__main__":
    asyncio.run(main())
{grid2op_env/tests → tests}/test_grid2op_env.py RENAMED
File without changes
uv.lock CHANGED
The diff for this file is too large to render. See raw diff