Commit · da3c180
Parent(s): a65650a

docs updated

Files changed:
- architecture/task_1_architecture.md (+41 -23)
- architecture/task_2_architecture.md (+88 -24)
- architecture/task_4_architecture.md (+12 -11)
- grid2op_env/README.md (+58 -18)
architecture/task_1_architecture.md
CHANGED
@@ -24,10 +24,10 @@ At each step, calculate: max_rho = max(all line loadings)
 ▼
 Find the step where: target_min ≤ max_rho ≤ target_max
 │
-│ Difficulty levels:
-│ - easy: 0.90-0.94
-│ - moderate: 0.94-0.97
-│ - severe: 0.96-0.99
+│ Difficulty levels (BENCHMARK - FIXED in tasks.py:297-304):
+│ - easy: 0.82-0.85 (was impossible 0.90-0.94)
+│ - moderate: 0.86-0.89 (was impossible 0.94-0.97)
+│ - severe: 0.90-0.93 (was impossible 0.96-0.99)
 │
 ▼
 STOP at that step - this is your starting state
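The warm-up search in this hunk is only drawn as a flow diagram; a minimal Python sketch of the same loop (stepping a do-nothing action until max_rho lands inside the target band) is shown below. The function name `find_start_state` and the exact env/observation calls are illustrative assumptions, not the actual code in `tasks.py`.

```python
# Editor's sketch of the warm-up search described above: step a do-nothing action
# until max(rho) falls inside [target_min, target_max]. Env/obs API details are assumed.
def find_start_state(env, target_min, target_max, max_warmup_steps=100):
    obs = env.reset()
    warmup_steps = 0
    while warmup_steps < max_warmup_steps:
        max_rho = float(max(obs.rho))
        if target_min <= max_rho <= target_max:
            return obs, {"warmup_steps": warmup_steps, "target_matched": True}
        obs, _, done, _ = env.step(env.action_space({}))  # do-nothing action
        warmup_steps += 1
        if done:
            break
    # Fallback: no step matched the target band exactly
    return obs, {"warmup_steps": warmup_steps, "target_matched": False}
```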
@@ -45,6 +45,7 @@ Return observation + scenario metadata
 - `target_rho_range`: [min, max] that was searched for
 - `warmup_steps`: How many steps were taken to find the state
 - `target_matched`: True if exact target found, False if fallback used
+- `scenario`: "high_loading"

 ---

@@ -294,12 +295,18 @@ Result:

 ### Reward Breakdown (Step 1)

+From `grid_environment.py:589-596`:
+
 ```
-
-
-
-
-
+# Actual implementation:
+safe_margin_bonus = 0.05 × max(0.0, 1.0 - max_rho)   # = 0.05 × 0.18 = 0.009
+overload_penalty = 0.2 × overloaded_count            # = 0 (no lines > 1.0)
+redispatch_penalty = _action_penalty(action)          # = 0.01 × 20 = 0.2
+
+# Plus: early termination bonus if target achieved (step 1)
+# target_achieved_bonus = 1.0 / step_count = 1.0/1 = 1.0
+
+Total reward: 0.009 - 0.2 + 1.0 = 0.809 (if target achieved)
 ```

 ---
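The step-1 numbers in the hunk above can be re-derived directly. The sketch below assumes max_rho = 0.82 and 20 MW of redispatch, which are the values implied by the inline comments; it is an editor's illustration, not code from the repository.

```python
# Editor's sketch: recompute the documented step-1 reward from the example values.
max_rho, overloaded_count, redispatch_mw, step_count = 0.82, 0, 20, 1

safe_margin_bonus = 0.05 * max(0.0, 1.0 - max_rho)   # 0.009
overload_penalty = 0.2 * overloaded_count             # 0.0
redispatch_penalty = 0.01 * redispatch_mw             # 0.2
target_achieved_bonus = 1.0 / step_count              # 1.0

total = safe_margin_bonus - overload_penalty - redispatch_penalty + target_achieved_bonus
print(round(total, 3))  # 0.809
```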
@@ -445,7 +452,7 @@ if self._task_id == "single_fault" and all_lines_below_target:
 ]
 ```

-### Grader Calculation (from graders.py)
+### Grader Calculation (from graders.py:28-55)

 ```python
 def grade_single_fault(episode_log):
@@ -454,31 +461,42 @@ def grade_single_fault(episode_log):
     survival_score = survival_ratio * 0.7  # = 0.21

     # 2. Target achieved bonus (50%)
-    achieved_target = any(entry.all_lines_below_target for entry in episode_log)
+    achieved_target = any(entry.all_lines_below_target or entry.all_lines_below_80 for entry in episode_log)
     target_bonus = 0.5 if achieved_target else 0.0  # = 0.5

-    # 3.
+    # 3. Legacy success score (bonus for early completion)
+    legacy_success_score = 0.0
+    for entry in episode_log:
+        if entry.all_lines_below_target or entry.all_lines_below_80:
+            legacy_success_score = round(max(0.0, 1.0 - (0.08 * max(0, entry.step - 1))), 6)
+            break
+
+    # 4. Final state bonus (0.3 if below target, 0.15 if within +0.05, 0.05 if within +0.10)
     final_rho = 0.77
     target_threshold = 0.80
     if final_rho < target_threshold:
         final_bonus = 0.3  # = 0.3
     elif final_rho < target_threshold + 0.05:
         final_bonus = 0.15
+    elif final_rho < target_threshold + 0.10:
+        final_bonus = 0.05
     else:
         final_bonus = 0.0

     # Total
-    score =
+    score = (survival_ratio * 0.7) + target_bonus + final_bonus
+    score = max(score, legacy_success_score)  # Take max of both scoring methods
     score = min(1.0, max(0.0, score))

-    return score
+    return round(score, 6)

-#
-# survival_score =
-# target_bonus = 0.5
-#
+# For example: episode completes at step 3 with max_rho=0.77:
+# survival_score = (3/10) * 0.7 = 0.21
+# target_bonus = 0.5 (achieved target)
+# legacy_success_score = 1.0 - 0.08 * 2 = 0.84
+# final_bonus = 0.3 (below target)
 #
-#
+# score = 0.21 + 0.5 + 0.3 = 1.01 → capped at 1.0
 ```

 ---
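To sanity-check the worked example in the grader hunk, here is a small self-contained sketch of the documented scoring formula. The `StepLog` dataclass and its fields mirror the `EpisodeStepLog` fields referenced in the docs but are assumptions for illustration, not the repository's models.

```python
from dataclasses import dataclass

# Minimal stand-in for the episode-log entries used by the documented grader
# (field names are assumptions based on the docs, not the real model classes).
@dataclass
class StepLog:
    step: int
    max_rho: float
    all_lines_below_target: bool
    all_lines_below_80: bool = False

def grade_single_fault_sketch(episode_log, max_steps=10, target_threshold=0.80):
    """Illustrative re-implementation of the documented scoring formula."""
    survival_ratio = min(1.0, len(episode_log) / max_steps)
    survival_score = survival_ratio * 0.7

    achieved = any(e.all_lines_below_target or e.all_lines_below_80 for e in episode_log)
    target_bonus = 0.5 if achieved else 0.0

    legacy_success_score = 0.0
    for e in episode_log:
        if e.all_lines_below_target or e.all_lines_below_80:
            legacy_success_score = round(max(0.0, 1.0 - 0.08 * max(0, e.step - 1)), 6)
            break

    final_rho = episode_log[-1].max_rho
    if final_rho < target_threshold:
        final_bonus = 0.3
    elif final_rho < target_threshold + 0.05:
        final_bonus = 0.15
    elif final_rho < target_threshold + 0.10:
        final_bonus = 0.05
    else:
        final_bonus = 0.0

    score = survival_score + target_bonus + final_bonus
    score = max(score, legacy_success_score)
    return round(min(1.0, max(0.0, score)), 6)

# Worked example from the doc: target reached at step 3 with max_rho = 0.77
log = [StepLog(1, 0.93, False), StepLog(2, 0.85, False), StepLog(3, 0.77, True)]
print(grade_single_fault_sketch(log))  # 0.21 + 0.5 + 0.3 = 1.01 -> capped to 1.0
```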
@@ -561,11 +579,11 @@ STEP 3

 | Tier | Target Range | Fixed in Code |
 |------|--------------|---------------|
-| `single_fault_easy` | 0.82-0.85 | tasks.py:
-| `single_fault_moderate` | 0.86-0.89 | tasks.py:
-| `single_fault_severe` | 0.90-0.93 | tasks.py:
+| `single_fault_easy` | 0.82-0.85 | tasks.py:299 |
+| `single_fault_moderate` | 0.86-0.89 | tasks.py:301 |
+| `single_fault_severe` | 0.90-0.93 | tasks.py:303 |

-**Note**: The original benchmark ranges (0.90-0.94, etc.) were mathematically impossible because generators could only reduce ~0.03-0.05 rho per step. Fixed in recent updates.
+**Note**: The original benchmark ranges (0.90-0.94, etc.) were mathematically impossible because generators could only reduce ~0.03-0.05 rho per step. Fixed in recent updates to 0.82-0.85, 0.86-0.89, 0.90-0.93.

 ---

architecture/task_2_architecture.md
CHANGED
@@ -13,7 +13,7 @@ Key difference from Task 1:
 ## 1. Reset Phase

 ```python
-# tasks.py:
+# tasks.py:566-602 (via _reset_n_minus_1)
 def _reset_n_minus_1(env, seed, difficulty_level, scenario_mode, benchmark_tier):
     obs = env.reset(
         seed=seed,
@@ -22,6 +22,8 @@ def _reset_n_minus_1(env, seed, difficulty_level, scenario_mode, benchmark_tier)
     return obs, {
         "faulted_lines": [0],
         "curriculum_stage": "fixed_n_minus_1",
+        "scenario_mode": scenario_mode,
+        "benchmark_tier": benchmark_tier or "n_minus_1_fixed",
         ...
     }
 ```
@@ -68,24 +70,68 @@ Based on how grid operators actually work:

 ## 3. Reward Function (RL2Grid-inspired)

+From `grid_environment.py:598-609`:
+
 ```python
-# grid_environment.py:588-599
 elif self._task_id == "n_minus_1":
+    # Component 1: Survival signal (+1.0 per step)
     r_survive = 1.0
+
+    # Component 2: Loading margin quality
     clipped_margins = [max(-1.0, min(1.0, 1.0 - float(rho))) for rho in observation.rho]
     r_overload = sum(clipped_margins) / len(clipped_margins)
+
+    # Component 3: Redispatch cost (from _n_minus_1_redispatch_cost)
     r_cost = -self._n_minus_1_redispatch_cost(action)
+
+    # Combined reward
     reward += (0.3 * r_survive) + (0.6 * r_overload) + (0.1 * r_cost)
-
+
+    # Reconnection bonus (+2.0 if safe)
+    if reconnect_successful and self._reconnection_within_margin(previous_observation=self._last_obs, observation=observation):
         reward += 2.0
-
-
-
+
+    # Terminal rewards
+    if reached_time_limit and not observation.metadata.get("convergence_failed"):
+        reward += 10.0 * ((self._state.step_count / max(1, self._max_steps)) ** 2)
+    elif done and not reached_time_limit:
         reward -= 15.0  # blackout penalty
 ```

 ### Components

+| Component | Formula | Weight | Purpose |
+|-----------|---------|--------|---------|
+| `R_survive` | +1.0 per step | 0.3 | Constant survival signal |
+| `R_overload` | (1/n) × Σ clip(1-ρ, -1, 1) | 0.6 | Loading margin quality |
+| `R_cost` | -0.05 × Σ|ΔMW|/ramp | 0.1 | Economic cost of redispatch |
+| `R_reconnect` | +2.0 if safe reconnection | - | Heuristic from winning agents |
+| Terminal | +10×(s/m)² / -15 | - | Quadratic survival / blackout |
+
+### Reconnection Detection & Validation
+
+From `grid_environment.py:853-869` (`_detect_successful_reconnection`):
+```python
+def _detect_successful_reconnection(previous_observation, observation, action):
+    # Check if any requested reconnection actually succeeded
+    requested_reconnects = {line_id for line_id, status in action.line_set.items() if status == 1}
+    for idx, (before, after) in enumerate(zip(previous_observation.line_status, observation.line_status)):
+        if not before and after and idx in requested_reconnects:
+            return True
+    return False
+```
+
+From `grid_environment.py:728-737` (`_reconnection_within_margin`):
+```python
+def _reconnection_within_margin(previous_observation, observation):
+    # Ensure reconnection doesn't worsen max_rho by more than 10%
+    previous_max = max(previous_observation.rho)
+    current_max = max(observation.rho)
+    return current_max <= previous_max + 0.1
+```
+
+### Components
+
 | Component | Formula | Weight | Purpose |
 |-----------|---------|--------|---------|
 | `R_survive` | +1.0 per step | 0.3 | Constant survival signal |
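As a quick illustration of how the three components in the table above combine, the sketch below plugs in made-up line loadings and a made-up redispatch of 20 MW against a 40 MW ramp limit. The exact normalization inside `_n_minus_1_redispatch_cost` is not shown in the diff, so the `R_cost` line simply applies the documented formula.

```python
# Editor's sketch: the three-component reward on illustrative numbers
# (rho values and the 20 MW / 40 MW-ramp redispatch are invented for the example).
rho = [0.45, 0.62, 0.88, 0.95, 0.71]

r_survive = 1.0
clipped_margins = [max(-1.0, min(1.0, 1.0 - r)) for r in rho]
r_overload = sum(clipped_margins) / len(clipped_margins)   # mean margin ≈ 0.278
r_cost = -0.05 * (20 / 40)                                  # documented formula: -0.05 × Σ|ΔMW|/ramp

reward = (0.3 * r_survive) + (0.6 * r_overload) + (0.1 * r_cost)
print(round(reward, 3))  # ≈ 0.464
```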
@@ -140,35 +186,51 @@ N-1 STRUCTURAL SECURITY: score=0.941; bridge_lines=[4, 11, 15]

 ## 6. Grading (Phase-Aware)

+From `graders.py:58-83`:
+
 ```python
-
-
-
-survival_ratio = min(1.0, len(episode_log) / max_steps)
+def grade_n_minus_1(episode_log: list[EpisodeStepLog], max_steps: int = 20) -> float:
+    if not episode_log:
+        return 0.0

     # Component A: Emergency response (30%)
     emergency_clear_step = next(
-        (entry.step for entry in episode_log[:5] if max_rho < 0.92),
+        (entry.step for entry in episode_log[:5] if float(entry.max_rho) < 0.92),
         None
     )
-    emergency_score =
+    emergency_score = (
+        max(0.0, 1.0 - (0.2 * max(0, emergency_clear_step - 1)))
+        if emergency_clear_step is not None
+        else 0.0
+    )

     # Component B: Sustained security (50%)
-
-
+    # Phase 2: steps 6-20 (15 steps)
+    phase2_logs = [entry for entry in episode_log if entry.step >= 6]
+    security_ratio = (
+        sum(1 for entry in phase2_logs if float(entry.max_rho) < 0.90) / 15.0
+        if phase2_logs
+        else 0.0
+    )

     # Component C: Reconnection (20%)
-
+    # Did line 0 get reconnected at any point?
+    reconnection_score = 1.0 if any(0 not in entry.disconnected_lines for entry in episode_log) else 0.0
+
+    # Survival gates the score
+    survival_ratio = min(max_steps, max(entry.step for entry in episode_log)) / max_steps

-
+    # Mastery = weighted combination
+    mastery_score = (0.30 * emergency_score) + (0.50 * security_ratio) + (0.20 * reconnection_score)

     # Final: survival × mastery (no legacy override)
-
+    final_score = mastery_score * survival_ratio
+    return round(min(1.0, max(0.0, final_score)), 6)
 ```

 | Component | Weight | What it measures |
 |-----------|--------|------------------|
-| Emergency response | 30% | Cleared within 5 steps? |
+| Emergency response | 30% | Cleared within 5 steps? (0.92 threshold) |
 | Sustained security | 50% | Steps 6-20 with rho < 0.90? |
 | Reconnection | 20% | Did agent reconnect line 0? |

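For orientation, the phase-aware formula works out as follows for a hypothetical episode that survives all 20 steps, clears the emergency at step 2, keeps rho below 0.90 in 12 of the 15 phase-2 steps, and reconnects line 0 (numbers invented for illustration):

```python
# Editor's sketch: plugging a hypothetical episode into the documented weights.
emergency_score = max(0.0, 1.0 - 0.2 * max(0, 2 - 1))   # cleared at step 2 -> 0.8
security_ratio = 12 / 15.0                                # 12 safe steps out of 15 -> 0.8
reconnection_score = 1.0                                  # line 0 reconnected
survival_ratio = 20 / 20                                  # survived all steps -> 1.0

mastery = 0.30 * emergency_score + 0.50 * security_ratio + 0.20 * reconnection_score
print(round(mastery * survival_ratio, 3))  # 0.3*0.8 + 0.5*0.8 + 0.2*1.0 = 0.84
```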
@@ -267,12 +329,14 @@ Behavior observed:

 | File | Purpose |
 |------|---------|
-| `tasks.py:
-| `
-| `grid_environment.py:
-| `
-| `
-| `
+| `tasks.py:115-132` | Task 2 task spec and reset dispatch |
+| `tasks.py:566-602` | `_reset_n_minus_1` - line 0 disconnection |
+| `grid_environment.py:598-609` | Three-component reward function |
+| `grid_environment.py:728-737` | `_reconnection_within_margin` - safety check |
+| `grid_environment.py:853-869` | `_detect_successful_reconnection` |
+| `graders.py:58-83` | Phase-aware grader |
+| `graph_analysis.py` | N-1 security score (bridge line analysis) |
+| `inference.py` | Prompt with two-threshold framing (EMERGENCY/WARNING/SAFE) |

 ---

architecture/task_4_architecture.md
CHANGED
@@ -611,18 +611,19 @@ MSCF RULE: Prefer actions that preserve transferable generation and keep islands
 | File | Line Numbers | Purpose |
 |------|------------|---------|
 | `grid2op_env/server/tasks.py` | 53-62 | Task definition |
-| `grid2op_env/server/tasks.py` | 70-74 | Line triplets |
-| `grid2op_env/server/tasks.py` |
-| `grid2op_env/server/tasks.py` |
-| `grid2op_env/server/tasks.py` | 623-633 | Survival probe |
-| `grid2op_env/server/grid_environment.py` | 63-66 | Reward constants |
+| `grid2op_env/server/tasks.py` | 70-74 | Line triplets: (2,4,14), (2,4,15), (4,14,16) |
+| `grid2op_env/server/tasks.py` | 334-337 | Profile function: returns 1.20 (20% load) |
+| `grid2op_env/server/tasks.py` | 565-620 | Reset function with survival probe |
+| `grid2op_env/server/tasks.py` | 623-633 | Survival probe function |
+| `grid2op_env/server/grid_environment.py` | 63-66 | Reward constants (0.02, 5.0, 0.5, 8.0) |
 | `grid2op_env/server/grid_environment.py` | 630-647 | Reward function |
-| `grid2op_env/server/grid_environment.py` | 750-765 | Stage metadata |
-| `grid2op_env/server/grid_environment.py` | 767-814 | Island assessment |
-| `grid2op_env/server/grid_environment.py` | 816-849 | Connected components |
-| `grid2op_env/server/graders.py` | 124-174 |
-| `grid2op_env/inference.py` | 606-648 | LLM prompt |
-| `grid2op_env/
+| `grid2op_env/server/grid_environment.py` | 750-765 | Stage metadata computation |
+| `grid2op_env/server/grid_environment.py` | 767-814 | Island availability assessment |
+| `grid2op_env/server/grid_environment.py` | 816-849 | Connected components detection |
+| `grid2op_env/server/graders.py` | 124-174 | Four-component grading |
+| `grid2op_env/inference.py` | 606-648 | LLM prompt with stage context |
+| `grid2op_env/inference.py` | 1003-1030 | Candidate filtering (removes unsafe disconnects) |
+| `grid2op_env/models.py` | 54-59 | EpisodeStepLog fields for Task 4 |

 ---

grid2op_env/README.md
CHANGED
@@ -53,7 +53,7 @@ Supporting files outside the minimum template remain for quality and verification

 - Grid2Op core simulator using `l2rpn_case14_sandbox`
 - Typed `GridAction`, `GridObservation`, and `GridState`
--
+- Four tasks: `single_fault`, `n_minus_1`, `cascade_prevent`, `multi_stage_cascade`
 - Reset-time scenario injection and retry logic for non-convergent starts
 - Shaped reward, episode logging, and deterministic graders
 - OpenEnv WebSocket interface plus `/tasks`, `/grader`, and `/baseline`
@@ -65,32 +65,72 @@ Supporting files outside the minimum template remain for quality and verification

 ## Recent fixes

-1. **
+1. **Task 1 (single_fault) benchmark ranges corrected** (tasks.py:297-304):
    - `single_fault_easy`: 0.82-0.85 (was mathematically impossible 0.90-0.94)
    - `single_fault_moderate`: 0.86-0.89 (was 0.94-0.97)
    - `single_fault_severe`: 0.90-0.93 (was 0.96-0.99)
-
-
-
-
-
-   -
-
-
-
-
+   - Warmup phase finds high-loading state in chronics, then agent has 10 steps to solve
+
+2. **Task 1 reward function** (grid_environment.py:589-596):
+   - Target achieved bonus: `1.0 / step_count` (rewards early solution)
+   - Safe margin bonus: `0.05 × max(0.0, 1.0 - max_rho)`
+   - Overload penalty: `0.2 × overloaded_count` (lines > 100%)
+   - Redispatch penalty: `0.01 × MW` (discourages large interventions)
+   - Failure penalty: `-5.0` if time limit reached without target
+
+3. **Task 1 grading** (graders.py:28-55):
+   - 70% weight on survival ratio
+   - 50% target achieved bonus
+   - Final state bonus (0.3 if below target, 0.15/+0.05, 0.05/+0.10)
+   - Legacy success score for early completion: `1.0 - 0.08 × (step - 1)`
+
+4. **Task 2 (n_minus_1) redesign** based on RL2Grid paper (grid_environment.py:598-609):
+   - Three-component reward: `0.3×R_survive + 0.6×R_overload + 0.1×R_cost`
+   - `R_survive`: +1.0 per step (constant survival signal)
+   - `R_overload`: `(1/n) × Σ clip(1-ρ, -1, 1)` - loading margin quality
+   - `R_cost`: `-0.05 × Σ|ΔMW|/max_ramp` (normalized redispatch cost)
+   - Reconnection bonus: +2.0 when safely reconnecting (grid_environment.py:853-869)
    - Terminal: +10×(s/m)² quadratic survival, -15 blackout
-   - Phase-aware grader
+   - Phase-aware grader (graders.py:58-83):
+     - Emergency response (30%): cleared within 5 steps at rho < 0.92
+     - Sustained security (50%): steps 6-20 at rho < 0.90
+     - Reconnection (20%): did agent reconnect line 0?
   - N-1 security score (bridge lines) in prompt
   - **Grading now honest**: score = survival_ratio × mastery_score (no override)
   - Latest eval: 0.952 (was 1.0 with old override)

-5. **Task
-   -
+5. **Task 3 (cascade_prevent)** (grid_environment.py:611-628):
+   - 1-2 lines disconnected at reset + 5-15% load increase
+   - Key metric: `timestep_overflow` countdowns (not just max_rho)
+   - Quadratic overflow penalty: `-0.05 × Σ(overflow²)` - line at overflow=2 is 4x more urgent than overflow=1
+   - Reward components:
+     - Cascade prevention: +0.3 if no auto-trip, -2.5 if auto-trip
+     - Thermal margin: +0.1 × mean(clip(1-ρ, -1, 1))
+     - Terminal: +5.0 × (1 - auto_trips/5)² survival bonus, -12.0 blackout
+   - Grading (graders.py:86-121):
+     - Cascade containment (50%): steps without auto-trips / 30
+     - Thermal stability (30%): safe_steps / containment_steps
+     - Recovery speed (20%): how fast recovered from first overload
+   - Latest eval: 0.798 (hard/extreme tiers challenging)
+
+6. **Task 4 (multi_stage_cascade)** (tasks.py:334-337, grid_environment.py:630-647):
+   - 3 lines disconnected at reset + **20% load increase** (not 15%)
    - Three explicit stages (10 steps each) with stage boundaries at step 10 and 20
-   -
-   -
-   -
+   - Overflow window: 2 (faster cascades than default 3)
+   - Do-nothing survival probe: 5 steps minimum
+   - Island availability assessment at stage boundaries (grid_environment.py:767-814)
+   - Candidate filtering (inference.py:1003-1030): filters unsafe topology disconnects
+   - Reward (grid_environment.py:630-647):
+     - Generation cost: -0.02 × (total_gen / initial_load)
+     - Convergence: +0.5 × available_island_ratio
+     - Load loss penalty: -5.0 × (1 - available_load_ratio) at boundaries only
+     - Terminal win: +8.0 × (available_load_ratio)² if ≥50% load at step 30
+     - Terminal blackout: -12.0
+   - Grading (graders.py:124-174):
+     - Stage completion (30%): survived stages 1, 2, 3
+     - Load preservation (40%): available_load_ratio at end
+     - Island quality (20%): majority islands viable at boundaries
+     - Speed bonus (10%): how fast stability returned each stage
    - Latest eval: 0.929 (31x improvement from 0.027)

 ## Planner architecture
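One detail from the Task 3 notes above that is easy to misread is the quadratic overflow penalty. A tiny sketch with invented `timestep_overflow` counters shows why a line at overflow = 2 weighs four times as much as a line at overflow = 1.

```python
# Editor's sketch: the documented quadratic overflow penalty for Task 3,
# applied to made-up per-line timestep_overflow counters.
timestep_overflow = [0, 0, 1, 2, 0]   # two overloaded lines; one has been over limit for 2 steps

overflow_penalty = -0.05 * sum(o ** 2 for o in timestep_overflow)
print(round(overflow_penalty, 2))  # -0.05 * (1 + 4) = -0.25; the overflow=2 line contributes 4x as much
```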