metadata
title: Grid2Op Environment
emoji: 
colorFrom: blue
colorTo: green
sdk: docker
app_port: 8000
pinned: false

Grid2Op OpenEnv Environment

Power grid topology control for reinforcement learning — four tasks, from overload relief to multi-stage cascade damage control.



What This Is

This environment wraps the IEEE 14-bus power grid (l2rpn_case14_sandbox) as a fully compliant OpenEnv environment. It exposes four tasks of increasing difficulty, each grounded in real power systems research, with deterministic graders and a physics-backed simulation endpoint.

The design principle is simple: the server owns the simulation state. Planning calls obs.simulate() on the live session rather than on a local mirror, so simulated outcomes are computed from the actual session state and cannot drift from it.

The included baseline agent uses a Think → Simulate → Act loop: the LLM proposes candidate actions, the server validates each one via physics simulation, and the LLM selects the safest option. This is a physics-grounded LLM planner, not a zero-shot guesser.


Quick Start

Prerequisites

  • Python 3.10–3.12
  • uv (recommended) or pip
  • Docker (for containerised deployment)

1. Install

uv venv
source .venv/bin/activate
uv pip install -e .

2. Run the server

uv run server --port 8000

The server listens at http://127.0.0.1:8000.

3. Smoke test

curl -X POST http://127.0.0.1:8000/reset \
  -H "Content-Type: application/json" \
  -d '{}'

curl http://127.0.0.1:8000/tasks

An empty POST /reset body ({}) defaults to:

  • task_id=single_fault
  • scenario_mode=benchmark
  • the first backend benchmark tier for that task (single_fault_easy)

This default keeps public resets fast and matches the benchmark-oriented submission flow.
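
The same reset can be issued from Python. The sketch below only builds the request with the standard library and does not send it; `build_reset_request` is a hypothetical helper, and placing `task_id` and `seed` at the top level of the JSON body is an assumption based on the field names this README lists for /reset, not on a verified schema.

```python
import json
import urllib.request


def build_reset_request(base_url, task_id=None, seed=None):
    """Build (but do not send) a POST /reset request.

    Hypothetical helper: when both fields are omitted the body is an empty
    JSON object, which the server resolves to its documented defaults.
    """
    payload = {}
    if task_id is not None:
        payload["task_id"] = task_id
    if seed is not None:
        payload["seed"] = seed
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/reset",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_reset_request("http://127.0.0.1:8000", task_id="n_minus_1", seed=0)
print(request.get_method(), request.full_url, request.data.decode("utf-8"))

# To send it against a running server:
#   with urllib.request.urlopen(request) as response:
#       observation = json.load(response)
```

Keeping request construction separate from sending makes the payload easy to inspect before a server is running.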

4. Run the baseline agent

Create a .env file:

GRID2OP_BASE_URL=http://127.0.0.1:8000
API_BASE_URL=https://router.huggingface.co/v1
HF_TOKEN=hf_your_token
MODEL_NAME=openai/gpt-oss-20b:groq

Then run against any task:

python inference.py --task-id single_fault
python inference.py --task-id n_minus_1
python inference.py --task-id cascade_prevent
python inference.py --task-id multi_stage_cascade

5. Run tests

uv run --extra dev pytest tests/test_grid2op_env.py -q

Docker

docker build -t grid2op-env:local .
docker run --rm -p 8000:8000 grid2op-env:local

The image pre-downloads l2rpn_case14_sandbox at build time, so the container does not need to fetch the dataset when it starts.


Grid

All four tasks use the same underlying grid:

| Property | Value |
|---|---|
| Environment | l2rpn_case14_sandbox (IEEE 14-bus) |
| Substations | 14 |
| Transmission lines | 20 |
| Generators | 6 |
| Loads | 11 |
| Power flow backend | lightsim2grid (AC) |
| Time resolution | 5 minutes per step |
| Scenario data | Pre-recorded chronics (load + generation time series) |

The Scenario Dataset

Our companion repository, grid2op-data, documents and analyses all 1,014 scenarios.

Data Files per Scenario

Each scenario folder contains:

| File | Description | Example Values |
|---|---|---|
| load_p.csv | Active power demand (MW) | 11 columns, 8,065 rows |
| load_q.csv | Reactive power demand (MVAr) | ~70% of active power |
| prod_p.csv | Generator output (MW) | 6 generators tracked |
| prod_v.csv | Voltage setpoints (p.u.) | Typically 0.95–1.05 |
| *_forecasted.csv | 5-minute-ahead forecasts | Used by RL agents |

Key Dataset Statistics

Total scenarios analyzed:    1,014
Timesteps per scenario:      8,065 (~4 weeks each at 5-minute resolution)
Total data points:           ~8.2 million
Average system load:         ~257 MW
Peak load observed:          ~321 MW (scenario 0001)
Minimum load observed:       ~190 MW
Average peak hour:           19:24 (7:24 PM)
Reactive power burden:       0.70 (load_q / load_p)
Average ramp rate:           1.72 MW per 5-min step
Supply-demand imbalance:     4.92 MW (mean)
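
The duration and total-size figures follow directly from the step count and the 5-minute resolution. A short check, using only numbers stated in this README:

```python
STEP_MINUTES = 5            # time resolution of the environment
STEPS_PER_SCENARIO = 8_065  # rows per scenario file
SCENARIOS = 1_014

minutes_per_scenario = STEPS_PER_SCENARIO * STEP_MINUTES
days_per_scenario = minutes_per_scenario / (60 * 24)
total_timesteps = SCENARIOS * STEPS_PER_SCENARIO

print(f"{days_per_scenario:.2f} days per scenario")
print(f"{total_timesteps:,} timesteps across all scenarios")
```

Each scenario therefore spans about 28 days, and the dataset holds roughly 8.2 million timesteps in total.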

Action Space

Agents submit a GridAction with any combination of:

| Field | Type | Description |
|---|---|---|
| line_set | dict[int, int] | Map of line_id → status where 1 = connect, -1 = disconnect |
| redispatch | dict[int, float] | Map of gen_id → delta_mw (subject to ramp limits) |
| do_nothing | bool | Explicit no-op |

Example action:

{
  "line_set": {"4": -1},
  "redispatch": {"2": 15.0},
  "do_nothing": false
}

All actions are validated server-side. Invalid indices are silently stripped. Ramp limits are enforced. Bridge lines that would island the network are rejected before reaching the physics solver.
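
A minimal sketch of what this kind of sanitisation could look like. It is not the server's implementation: `sanitize_action` is a hypothetical function, the `max_ramp` values are made up, and clamping out-of-range redispatch to the ramp limit (rather than rejecting it) is an assumption. The bridge-line check is left out here.

```python
def sanitize_action(action, n_lines, n_gens, max_ramp):
    """Return a cleaned copy of a GridAction-like dict.

    Assumptions (not confirmed by the server code): unknown line or generator
    ids are dropped, line statuses other than 1 / -1 are dropped, and
    redispatch deltas are clamped to +/- max_ramp[gen_id].
    """
    line_set = {}
    for key, status in action.get("line_set", {}).items():
        line_id = int(key)  # JSON object keys arrive as strings
        if 0 <= line_id < n_lines and status in (1, -1):
            line_set[line_id] = status

    redispatch = {}
    for key, delta in action.get("redispatch", {}).items():
        gen_id = int(key)
        if 0 <= gen_id < n_gens:
            limit = max_ramp[gen_id]
            redispatch[gen_id] = max(-limit, min(limit, float(delta)))

    is_noop = not line_set and not redispatch
    return {"line_set": line_set, "redispatch": redispatch, "do_nothing": is_noop}


# Line 99 and generator 7 do not exist on a 20-line, 6-generator grid.
raw = {"line_set": {"4": -1, "99": 1}, "redispatch": {"2": 15.0, "7": 3.0}}
clean = sanitize_action(
    raw, n_lines=20, n_gens=6, max_ramp=[5.0, 5.0, 10.0, 0.0, 0.0, 15.0]
)
print(clean)
```

In this example the invalid ids disappear and the 15 MW request on generator 2 is reduced to the illustrative 10 MW ramp limit.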


Observation Space

Each step returns a GridObservation:

| Field | Shape | Description |
|---|---|---|
| rho | [20] | Line loading ratio (1.0 = thermal limit) |
| gen_p | [6] | Generator active power output (MW) |
| load_p | [11] | Load active power demand (MW) |
| line_status | [20] | true = connected, false = disconnected |
| timestep_overflow | [20] | Consecutive steps each line has been overloaded |
| reward | float | Shaped reward for this step |
| done | bool | Whether the episode has ended |
| metadata | dict | Task-specific fields (stage index, available load ratio, etc.) |
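
As an illustration of how these fields combine, the sketch below lists connected lines whose loading exceeds a threshold, together with their overflow counters. `lines_above` is a hypothetical helper, and the observation is shortened to four lines for readability; a real observation carries 20 entries per array.

```python
def lines_above(observation, threshold=1.0):
    """Return (line_id, rho, overflow_steps) for connected lines above threshold."""
    rows = []
    per_line = zip(
        observation["rho"],
        observation["line_status"],
        observation["timestep_overflow"],
    )
    for line_id, (rho, connected, overflow) in enumerate(per_line):
        if connected and rho > threshold:
            rows.append((line_id, rho, overflow))
    return rows


# Shortened, made-up observation: line 3 is disconnected.
obs = {
    "rho": [0.45, 1.03, 0.95, 0.0],
    "line_status": [True, True, True, False],
    "timestep_overflow": [0, 2, 0, 0],
}
print(lines_above(obs))        # lines past their thermal limit
print(lines_above(obs, 0.9))   # lines worth watching
```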

🕸️ Graph Intelligence

One of the most important capabilities of this environment is the graph intelligence layer exposed through:

POST /planning_context

This intelligence is computed from the live server observation inside server/environment.py, which internally calls graph_analysis.py.

Instead of relying only on raw rho values, the planner also receives structural grid insights. This allows the agent to:

  • avoid unsafe topology edits
  • detect critical transmission corridors
  • reason about cascading risks
  • make topology-safe switching decisions

🔍 What Graph Intelligence Includes

The graph analysis currently provides:

  • bridge_lines → connected lines whose removal would split the active grid graph
  • safe_to_disconnect → connected lines that can be disconnected without fragmenting the grid
  • n_minus_1_critical_lines → structurally critical lines important for N-1 contingency reasoning
  • high_centrality_buses → buses with high betweenness centrality in the active network
  • islanded_clusters → bus clusters already separated from the main connected component
  • congestion_corridor → short summary of exporter buses, importer buses, and stressed lines
  • flow_clusters → exporter/importer bus rankings derived from flow_bus_matrix
  • stressed_lines → highest-rho connected lines with endpoint and overflow context
  • parallel_groups → transmission lines sharing the same terminal substations

This makes the planner topology-aware, contingency-aware, and cascade-aware, rather than purely overload-reactive.
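
To make the bridge_lines / safe_to_disconnect distinction concrete, here is a brute-force sketch: remove each connected line in turn and check whether the number of connected components grows. This is not the code in graph_analysis.py, which may use a different algorithm or library; `classify_lines` and the small 4-bus topology are illustrative assumptions.

```python
from collections import defaultdict, deque


def count_components(n_buses, edges):
    """Count connected components with breadth-first search."""
    adjacency = defaultdict(list)
    for a, b in edges:
        adjacency[a].append(b)
        adjacency[b].append(a)
    seen, components = set(), 0
    for start in range(n_buses):
        if start in seen:
            continue
        components += 1
        seen.add(start)
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
    return components


def classify_lines(n_buses, lines, line_status):
    """Split connected lines into (bridge_lines, safe_to_disconnect)."""
    active = {i: ends for i, ends in enumerate(lines) if line_status[i]}
    baseline = count_components(n_buses, active.values())
    bridges, safe = [], []
    for line_id in active:
        remaining = [ends for other, ends in active.items() if other != line_id]
        if count_components(n_buses, remaining) > baseline:
            bridges.append(line_id)   # removing it splits the active graph
        else:
            safe.append(line_id)
    return bridges, safe


# Toy grid: a triangle 0-1-2, a spur 2-3, and a parallel line between 0 and 1.
toy_lines = [(0, 1), (1, 2), (2, 0), (2, 3), (0, 1)]
print(classify_lines(4, toy_lines, [True] * 5))
print(classify_lines(4, toy_lines, [True, True, False, True, False]))
```

Removing one line at a time handles parallel lines naturally: a line with a parallel twin is never a bridge. The repeated search is affordable on a 20-line grid, though a linear-time bridge-finding algorithm would scale better.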

Tasks

Task 1 — single_fault · Easy · 10 steps

Scenario: The grid is intact but one or more lines are running hot (90–98% loading). No lines have tripped yet, but the chronic is trending toward overload.

Objective: Bring all lines below the safe threshold within 10 steps.

What makes it easy: Single problem region, full topology intact, multiple redundant paths available.

Reward:

  • success bonus when all lines fall below the task threshold
  • +0.05 × max(0, 1 - max_rho) to reward lower peak loading
  • −0.2 × overloaded_line_count to penalise unresolved overloads
  • redispatch penalty proportional to total redispatch magnitude
  • −5.0 terminal penalty if the episode reaches the time limit without clearing the target
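
The terms above can be combined as in the following sketch. It is not the server's reward code: the success bonus and redispatch coefficient are not given in this README, so `success_bonus` and `redispatch_coeff` are placeholder parameters, and counting a line as overloaded when rho > 1.0 is an assumption.

```python
def single_fault_step_reward(
    rho,
    redispatch_mw,
    threshold,
    timed_out=False,
    success_bonus=1.0,       # placeholder value, not from the README
    redispatch_coeff=0.01,   # placeholder value, not from the README
):
    """Combine the single_fault reward terms for one step."""
    max_rho = max(rho)
    overloaded_line_count = sum(1 for value in rho if value > 1.0)

    reward = 0.05 * max(0.0, 1.0 - max_rho)        # lower peak loading
    reward -= 0.2 * overloaded_line_count          # unresolved overloads
    reward -= redispatch_coeff * sum(abs(d) for d in redispatch_mw)
    if max_rho < threshold:
        reward += success_bonus                    # target reached
    elif timed_out:
        reward -= 5.0                              # time limit hit without success
    return reward


solved = single_fault_step_reward([0.5, 0.8], [], threshold=0.9)
failed = single_fault_step_reward([1.05, 0.7], [10.0], threshold=0.9, timed_out=True)
print(solved, failed)
```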

Grader: Score based on whether all lines fell below the threshold and how many steps that took. Faster resolution scores higher.

Research basis: Physics reward formulation from Dwivedi et al. (2024) [arXiv:2411.18050]. Switching cost design from the same paper's µ_line × c_line × W_ℓ[n] formulation.


Task 2 — n_minus_1 · Medium · 20 steps

Scenario: Line 0 is disconnected at reset (N-1 contingency). The remaining 19 lines absorb the rerouted flow. Several lines are immediately pushed to 70–90% loading.

Objective: Clear the emergency, maintain N-1 secure operation for 20 steps, and reconnect the faulted line when its cooldown expires.

What makes it medium: The agent must manage sustained stress across multiple lines over a longer horizon, and must understand that the cooldown on line 0 creates a reconnection opportunity that should not be missed.

Reward: Three-component RL2Grid structure:

R = 0.3 × R_survive + 0.6 × R_overload + 0.1 × R_cost
  • R_survive: constant +1.0 per step alive
  • R_overload: clipped thermal margin mean(clip(1 - ρ, -1, 1))
  • R_cost: economic penalty for redispatch proportional to |delta_mw| / max_ramp
  • Reconnection bonus: +2.0 when a faulted line is successfully reconnected without worsening loading
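
A sketch of the weighted sum, under stated assumptions: R_cost is treated as a non-positive term (the negative sum of |delta_mw| / max_ramp over generators), and the reconnection bonus is added on top of the weighted sum. Neither detail is spelled out above, so this should be read as one plausible interpretation rather than the server's formula.

```python
def clipped_margin(rho):
    """mean(clip(1 - rho, -1, 1)) over all lines."""
    margins = [max(-1.0, min(1.0, 1.0 - value)) for value in rho]
    return sum(margins) / len(margins)


def n_minus_1_step_reward(rho, redispatch_mw, max_ramp, reconnected=False):
    """R = 0.3 * R_survive + 0.6 * R_overload + 0.1 * R_cost (+ bonus)."""
    r_survive = 1.0
    r_overload = clipped_margin(rho)
    # Assumption: cost is negative and normalised by each generator's ramp limit.
    r_cost = -sum(
        abs(delta) / ramp for delta, ramp in zip(redispatch_mw, max_ramp) if ramp > 0
    )
    reward = 0.3 * r_survive + 0.6 * r_overload + 0.1 * r_cost
    if reconnected:
        reward += 2.0
    return reward


base = n_minus_1_step_reward([0.8, 0.9], [5.0, 0.0], [10.0, 5.0])
with_bonus = n_minus_1_step_reward([0.8, 0.9], [5.0, 0.0], [10.0, 5.0], reconnected=True)
print(base, with_bonus)
```

With a mean margin of 0.15 and half of one generator's ramp used, the step reward comes to about 0.34, so the +2.0 reconnection bonus dominates any single ordinary step.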

Grader:

  • 30% emergency response quality — did all lines reach below ρ_danger = 0.92 within 5 steps?
  • 50% sustained secure operation — fraction of steps 6–20 with ρ_max < 0.90
  • 20% reconnection achievement — binary, did line 0 get reconnected?
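
The three grader weights can be sketched as follows. The input format (one peak-loading value per step) and the handling of episodes that end early (missing steps count as not secure) are assumptions made for illustration; the deterministic grader in server/graders.py may differ.

```python
def n_minus_1_score(max_rho_per_step, line0_reconnected):
    """Weighted n_minus_1 score in [0, 1].

    max_rho_per_step[i] is assumed to be the peak line loading after step i + 1.
    """
    # 30%: did peak loading fall below rho_danger = 0.92 within the first 5 steps?
    emergency = 1.0 if any(v < 0.92 for v in max_rho_per_step[:5]) else 0.0
    # 50%: fraction of steps 6-20 with peak loading below 0.90.
    late_steps = max_rho_per_step[5:20]
    sustained = sum(1 for v in late_steps if v < 0.90) / 15
    # 20%: binary reconnection of line 0.
    reconnection = 1.0 if line0_reconnected else 0.0
    return 0.3 * emergency + 0.5 * sustained + 0.2 * reconnection


trace = [0.97, 0.93, 0.91, 0.88, 0.86] + [0.85] * 12 + [0.95] * 3
score = n_minus_1_score(trace, line0_reconnected=True)
print(round(score, 3))
```

Here the emergency is cleared at step 3, 12 of the 15 later steps are secure, and the line is reconnected, giving 0.3 + 0.5 × 0.8 + 0.2 = 0.9.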

Research basis: Three-component reward from Marchesini et al. (2025) [arXiv:2503.23101]. Activation threshold pattern from Yoon et al. (2021), ICLR 2021. Reconnection heuristic from the L2RPN 2023 winning agent (LJNAgent).


Task 3 — cascade_prevent · Hard · 30 steps

Scenario: One or two lines are already disconnected and load is elevated by 5–15%. Several remaining lines are near or above their thermal limits. Grid2Op is counting down to automatic line trips — and each trip redistributes flow, potentially overloading more lines.

Objective: Prevent automatic line trips from propagating into a cascade for 30 steps.

What makes it hard: The key signal is not max_rho but timestep_overflow, the per-line countdown to automatic disconnection. A line at 103% with overflow=2 is more urgent than a line that has only just exceeded its limit at a higher loading (say 108%) with overflow=0, even though the second has the higher absolute loading. The task is triage under time pressure, not global optimisation.
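
One way to express this priority is to sort stressed lines by overflow counter first and loading second. `triage_order` and its `watch_threshold` parameter are hypothetical; the baseline agent may rank lines differently.

```python
def triage_order(rho, timestep_overflow, line_status, watch_threshold=0.9):
    """Order stressed lines by urgency: longest overflow first, then highest rho."""
    candidates = [
        line_id
        for line_id, connected in enumerate(line_status)
        if connected and rho[line_id] >= watch_threshold
    ]
    return sorted(
        candidates,
        key=lambda line_id: (-timestep_overflow[line_id], -rho[line_id]),
    )


# Line 1 is less loaded than line 0 but has been overloaded for two steps.
order = triage_order(
    rho=[1.08, 1.03, 0.95, 0.50],
    timestep_overflow=[0, 2, 0, 0],
    line_status=[True, True, True, True],
)
print(order)
```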

Reward:

  • +0.3 per step with no automatic trip
  • −2.5 per automatic trip detected
  • −0.05 × Σ overflow² — quadratic overflow penalty that escalates with each step of inaction
  • +0.1 × mean(clip(1 - ρ, -1, 1)) — thermal margin signal
  • Terminal: survival bonus reduced proportionally by auto-trip count, blackout penalty −12.0
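
The per-step terms can be sketched as below; the terminal survival bonus and blackout penalty are left out because their exact form is not given here. Treating the +0.3 as applying only on steps with zero automatic trips is a reading of the first bullet, not a confirmed detail.

```python
def cascade_step_reward(rho, timestep_overflow, auto_trips):
    """Per-step cascade_prevent reward (terminal terms omitted)."""
    reward = 0.3 if auto_trips == 0 else 0.0
    reward -= 2.5 * auto_trips
    # Quadratic penalty: 1 step costs 0.05, 2 steps 0.20, 3 steps 0.45.
    reward -= 0.05 * sum(steps ** 2 for steps in timestep_overflow)
    margins = [max(-1.0, min(1.0, 1.0 - value)) for value in rho]
    reward += 0.1 * sum(margins) / len(margins)
    return reward


quiet = cascade_step_reward([1.03, 0.90], [2, 0], auto_trips=0)
tripped = cascade_step_reward([1.03, 0.90], [2, 0], auto_trips=1)
print(quiet, tripped)
```

Because the overflow term grows with the square of the counter, leaving one line overloaded for three steps costs more than twice as much as leaving it for two.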

Grader:

  • 50% cascade containment — steps_without_auto_trip / 30
  • 30% thermal stability — fraction of safe steps with all lines below 100%
  • 20% recovery speed — how fast the grid reached ρ_max < 1.0 after initial stress

Research basis: Cascade prevention framing and curriculum profiles from Ramapuram Matavalam et al. (2023), IEEE Transactions on Power Systems. Thermal margin shaping from RL2Grid [arXiv:2503.23101]. Quadratic overflow penalty from physics urgency analysis.


Task 4 — multi_stage_cascade · Expert · 30 steps (3 × 10)

Scenario: Three lines are simultaneously disconnected and load is increased by 20%. The overflow window is shortened to 2 steps. The grid will fragment — the question is not whether a cascade occurs but how much viable load survives it.

Objective: Preserve as much load in self-sustaining grid islands as possible across three 10-step stages.

What makes it expert: Cascade propagation is physically inevitable. Actions that appear beneficial in Stage 1 can destroy island viability in Stage 2. The agent must plan across stage boundaries, not just the current timestep.

Island viability rule: For each connected component after fragmentation:

  • gen_total ≥ load_total → island is available (self-sustaining)
  • gen_total < load_total → island is unavailable (will collapse)

Key metric: available_load_ratio — fraction of original total load still located in viable islands at each step.
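
The viability rule and the resulting ratio can be sketched with a small union-find over the connected lines. The function names, the dict-based inputs, and the 5-bus example are illustrative assumptions; whether gen_total means current output or maximum capacity is not settled by the rule as written, so the sketch simply takes whatever per-bus generation figure it is given.

```python
def connected_components(n_buses, active_lines):
    """Group buses into islands using union-find."""
    parent = list(range(n_buses))

    def find(node):
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for a, b in active_lines:
        parent[find(a)] = find(b)

    groups = {}
    for bus in range(n_buses):
        groups.setdefault(find(bus), []).append(bus)
    return list(groups.values())


def available_load_ratio(n_buses, active_lines, gen_by_bus, load_by_bus, original_load):
    """Fraction of the original load located in islands with gen_total >= load_total."""
    viable_load = 0.0
    for island in connected_components(n_buses, active_lines):
        gen_total = sum(gen_by_bus.get(bus, 0.0) for bus in island)
        load_total = sum(load_by_bus.get(bus, 0.0) for bus in island)
        if gen_total >= load_total:
            viable_load += load_total
    return viable_load / original_load


gen = {0: 100.0, 3: 20.0}
load = {1: 40.0, 2: 30.0, 4: 50.0}
# Two islands: {0, 1, 2} is viable (100 >= 70), {3, 4} is not (20 < 50).
split = available_load_ratio(5, [(0, 1), (1, 2), (3, 4)], gen, load, 120.0)
# Fully connected: 120 MW of generation covers 120 MW of load.
whole = available_load_ratio(5, [(0, 1), (1, 2), (2, 3), (3, 4)], gen, load, 120.0)
print(round(split, 3), whole)
```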

Reward (four-component MSCF structure, plus a failure penalty):

  • −0.02 × (total_gen / initial_load) — generation cost penalty per step
  • +0.5 × available_island_ratio — reward for keeping more islands viable
  • −5.0 × (1 − available_load_ratio) — stage-boundary load-loss penalty (applied at steps 10, 20)
  • +8.0 × (available_load_ratio²) — terminal win reward if ≥50% load preserved at step 30
  • −12.0 — early collapse or convergence failure
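
A sketch of how these terms might be assembled per step. Returning only the −12.0 penalty on collapse, without the other terms, is an assumption, as are the function and parameter names.

```python
def multi_stage_step_reward(
    step,
    total_gen,
    initial_load,
    available_island_ratio,
    available_load_ratio,
    collapsed=False,
):
    """Per-step multi_stage_cascade reward under the assumptions stated above."""
    if collapsed:
        return -12.0                                   # early collapse or divergence
    reward = -0.02 * (total_gen / initial_load)        # generation cost
    reward += 0.5 * available_island_ratio             # viable islands
    if step in (10, 20):
        reward -= 5.0 * (1.0 - available_load_ratio)   # stage-boundary load loss
    if step == 30 and available_load_ratio >= 0.5:
        reward += 8.0 * available_load_ratio ** 2      # terminal win reward
    return reward


boundary = multi_stage_step_reward(10, 300.0, 300.0, 1.0, 0.8)
final = multi_stage_step_reward(30, 300.0, 300.0, 1.0, 0.8)
print(boundary, final)
```

With 80% of load preserved, the stage boundary costs about 1.0 while the terminal term adds about 5.12, so preserving load to the end outweighs a single boundary penalty.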

Grader:

  • 30% stage completion — did the agent cross step 10, step 20, and step 30?
  • 40% load preservation — available_load_ratio at episode end (largest component)
  • 20% island quality — fraction of stage boundaries where majority of islands were viable
  • 10% speed — how fast each stage reached all lines below 100%

Research basis: MSCF formulation, island availability assessment (Max_Gen_Total ≥ Load_Total), and four-component reward structure from Meng, Xu & Zhu (2025) [arXiv:2505.09012], ICLR 2025. Stage-interdependence principle and continuous action design from the same paper. Earlier MSCF MDP formulation from Zhu (2021) [arXiv:2108.10424].


Task Summary

| | Task 1 | Task 2 | Task 3 | Task 4 |
|---|---|---|---|---|
| Name | single_fault | n_minus_1 | cascade_prevent | multi_stage_cascade |
| Difficulty | Easy | Medium | Hard | Expert |
| Max steps | 10 | 20 | 30 | 30 |
| Lines down at reset | 0 | 1 | 1–2 | 3 |
| Load increase | 0% | 0% | +5% to +15% | +20% |
| Core question | Relieve overload | Survive degraded grid | Stop cascade propagation | Preserve viable load |
| Key signal | max_rho | ρ_max trending | timestep_overflow | available_load_ratio |
| Unique reward | Quadratic physics reward | Three-component RL2Grid | Quadratic overflow penalty | Load preservation + island quality |

API Endpoints

| Endpoint | Method | Description |
|---|---|---|
| /reset | POST | Reset the environment; accepts task_id, seed, difficulty_level |
| /step | POST | Execute a GridAction; returns GridObservation |
| /state | GET | Current episode metadata and episode_id |
| /tasks | GET | List all tasks with descriptions and action schema |
| /simulate | POST | Simulate candidate actions on the live session without advancing state |
| /planning_context | GET | Graph topology intelligence for the current episode |
| /grader | POST | Score a completed episode using the deterministic grader |
| /baseline | POST | Run the full baseline agent against all tasks and return scores |
| /ws | WebSocket | OpenEnv-compliant persistent session interface |

Baseline Agent — Think → Simulate → Act

The agent in inference.py implements a physics-grounded LLM planning loop:

reset()
  └─ state() → episode_id
      └─ planning_context() → graph topology, safe actions, LODF guidance
          └─ LLM proposes 3 candidate actions
              └─ /simulate → physics validation for each candidate
                  └─ deterministic ranking for Task 1, LLM final selection for Tasks 2–4
                      └─ step(action)
                          └─ /grader at episode end

Key properties:

  • Candidates with convergence_failed=True are filtered before final selection
  • Bridge lines (whose removal would island the network) are excluded from the candidate pool
  • Ramp limits are included in the prompt so that proposed redispatch values do not collapse into duplicate candidates after sanitisation
  • Context window is kept compact: only lines above 80%, active overflow countdowns, and stage-specific metadata are included
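
The filtering and deterministic ranking steps can be sketched as follows. This is not the code in inference.py: the shape of the simulation results (a `max_rho` field next to `convergence_failed`), ranking by lowest simulated peak loading, and falling back to a no-op when nothing survives are all assumptions.

```python
def select_action(simulated, bridge_lines=frozenset()):
    """Pick the candidate with the lowest simulated max_rho.

    `simulated` is assumed to be a list of dicts with keys
    'action', 'max_rho' and 'convergence_failed'.
    """

    def disconnects_bridge(action):
        return any(
            status == -1 and int(line_id) in bridge_lines
            for line_id, status in action.get("line_set", {}).items()
        )

    viable = [
        candidate
        for candidate in simulated
        if not candidate["convergence_failed"]
        and not disconnects_bridge(candidate["action"])
    ]
    if not viable:
        return {"do_nothing": True}  # assumed fallback when nothing survives
    return min(viable, key=lambda candidate: candidate["max_rho"])["action"]


candidates = [
    {"action": {"line_set": {"4": -1}}, "max_rho": 0.81, "convergence_failed": False},
    {"action": {"redispatch": {"2": 10.0}}, "max_rho": 0.88, "convergence_failed": False},
    {"action": {"line_set": {"7": -1}}, "max_rho": 0.60, "convergence_failed": True},
]
print(select_action(candidates))
print(select_action(candidates, bridge_lines={4}))
```

The third candidate has the lowest simulated loading but is discarded because the power flow did not converge; when line 4 is flagged as a bridge, the redispatch candidate wins instead.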

Repository Layout

grid2op-openenv/
├── server/
│   ├── app.py             # FastAPI/OpenEnv entrypoint plus HTTP routes like /tasks, /grader, /simulate
│   ├── environment.py     # Live Grid2Op adapter, reset defaults, reward shaping, planning support
│   ├── tasks.py           # Task specs, benchmark tiers, dynamic scenario injection, reset replay logic
│   ├── graders.py         # Deterministic per-task graders used by /grader
│   ├── gradio_ui.py       # Optional OpenEnv web UI customization
│   └── logging_utils.py   # Server logging setup
├── models.py              # Pydantic models for actions, observations, state, logs, and baseline config
├── client.py              # GridEnv client wrapper for reset/step/state/planning_context/simulate
├── inference.py           # Submission baseline agent with Docker- or URL-based environment startup
├── graph_analysis.py      # Topology analysis used in planning_context graph intelligence
├── tests/                 # Pytest coverage for tasks, graders, parsing, and graph analysis
├── openenv.yaml           # OpenEnv manifest (FastAPI app on port 8000)
├── Dockerfile             # Root Docker image used for local builds and submission validation
├── server/Dockerfile      # Server-focused Docker build variant
├── architecture/          # System and per-task architecture notes
└── docs/                  # Implementation notes and research-grounding writeups

References

  1. Meng, B., Xu, C., & Zhu, Y. (2025). Deep Reinforcement Learning for Power Grid Multi-Stage Cascading Failure Mitigation. ICLR 2025. arXiv:2505.09012

  2. Dwivedi, A., Tajer, A., Paternain, S., & Virani, N. (2024). RL for Mitigating Cascading Failures: Targeted Exploration via Sensitivity Factors. NeurIPS 2024 Workshop. arXiv:2411.18050

  3. Marchesini, E., Marzari, L., & Leofante, F. (2025). RL2Grid: Benchmarking Reinforcement Learning in Power Grid Operations. arXiv:2503.23101

  4. Yoon, D., Hong, S., Lee, B.-J., & Kim, K.-E. (2021). Winning the L2RPN Challenge: Power Grid Management via Semi-Markov Afterstate Actor-Critic. ICLR 2021. OpenReview

  5. Ramapuram Matavalam, A. R., Guddanti, K. P., Weng, Y., & Ajjarapu, V. (2023). Curriculum Based Reinforcement Learning of Grid Topology Controllers to Prevent Thermal Cascading. IEEE Transactions on Power Systems, 38(5), 4206–4220.

  6. van der Sar, E. et al. (2025). Centrally Coordinated Multi-Agent Reinforcement Learning for Power Grid Topology Control. ACM e-Energy 2025. arXiv:2502.08681

  7. Zhu, Y. (2021). Power Grid Cascading Failure Mitigation by Reinforcement Learning. arXiv:2108.10424

  8. Donnot, B. (2020). Grid2Op: A Testbed Platform to Model Sequential Decision Making in Power Systems. GitHub


License

MIT