Sidharth1743 committed
Commit da3c180 · 1 Parent(s): a65650a

docs updated
architecture/task_1_architecture.md CHANGED
@@ -24,10 +24,10 @@ At each step, calculate: max_rho = max(all line loadings)
  Find the step where: target_min ≤ max_rho ≤ target_max

- │ Difficulty levels:
- │ - easy/curriculum: 0.90-0.94 0.82-0.85 (benchmark)
- │ - moderate: 0.94-0.97 0.86-0.89 (benchmark)
- │ - severe: 0.96-0.99 0.90-0.93 (benchmark)


  STOP at that step - this is your starting state
@@ -45,6 +45,7 @@ Return observation + scenario metadata
  - `target_rho_range`: [min, max] that was searched for
  - `warmup_steps`: How many steps were taken to find the state
  - `target_matched`: True if exact target found, False if fallback used

  ---

@@ -294,12 +295,18 @@ Result:
  ### Reward Breakdown (Step 1)

  ```
- Safe margin bonus: 0.05 × (1.0 - 0.82) = 0.05 × 0.18 = 0.009
- Overload penalty: 0 (no lines > 1.0)
- Redispatch penalty: 0.01 × |−10| + 0.01 × |10| = 0.01 × 20 = 0.2
- ─────────────────
- Total reward: 0.009 - 0.2 = -0.191
  ```

  ---
@@ -445,7 +452,7 @@ if self._task_id == "single_fault" and all_lines_below_target:
  ]
  ```

- ### Grader Calculation (from graders.py)

  ```python
  def grade_single_fault(episode_log):
@@ -454,31 +461,42 @@ def grade_single_fault(episode_log):
      survival_score = survival_ratio * 0.7  # = 0.21

      # 2. Target achieved bonus (50%)
-     achieved_target = any(entry.all_lines_below_target for entry in episode_log)
      target_bonus = 0.5 if achieved_target else 0.0  # = 0.5

-     # 3. Final state bonus
      final_rho = 0.77
      target_threshold = 0.80
      if final_rho < target_threshold:
          final_bonus = 0.3  # = 0.3
      elif final_rho < target_threshold + 0.05:
          final_bonus = 0.15
      else:
          final_bonus = 0.0

      # Total
-     score = survival_score + target_bonus + final_bonus
      score = min(1.0, max(0.0, score))

-     return score

-     # Calculation:
-     # survival_score = 0.3 × 0.7 = 0.21
-     # target_bonus = 0.5
-     # final_bonus = 0.3
      #
-     # TOTAL = 0.21 + 0.5 + 0.3 = 1.01 → capped at 1.0
  ```

  ---
@@ -561,11 +579,11 @@ STEP 3
  | Tier | Target Range | Fixed in Code |
  |------|--------------|---------------|
- | `single_fault_easy` | 0.82-0.85 | tasks.py:250 |
- | `single_fault_moderate` | 0.86-0.89 | tasks.py:252 |
- | `single_fault_severe` | 0.90-0.93 | tasks.py:254 |

- **Note**: The original benchmark ranges (0.90-0.94, etc.) were mathematically impossible because generators could only reduce ~0.03-0.05 rho per step. Fixed in recent updates.

  ---

  Find the step where: target_min ≤ max_rho ≤ target_max

+ │ Difficulty levels (BENCHMARK - FIXED in tasks.py:297-304):
+ │ - easy: 0.82-0.85 (was impossible 0.90-0.94)
+ │ - moderate: 0.86-0.89 (was impossible 0.94-0.97)
+ │ - severe: 0.90-0.93 (was impossible 0.96-0.99)


  STOP at that step - this is your starting state

  - `target_rho_range`: [min, max] that was searched for
  - `warmup_steps`: How many steps were taken to find the state
  - `target_matched`: True if exact target found, False if fallback used
+ - `scenario`: "high_loading"

  ---

  ### Reward Breakdown (Step 1)

+ From `grid_environment.py:589-596`:
+
  ```
+ # Actual implementation:
+ safe_margin_bonus = 0.05 × max(0.0, 1.0 - max_rho)   # = 0.05 × 0.18 = 0.009
+ overload_penalty = 0.2 × overloaded_count            # = 0 (no lines > 1.0)
+ redispatch_penalty = _action_penalty(action)         # = 0.01 × 20 = 0.2
+
+ # Plus: early termination bonus if target achieved (step 1)
+ # target_achieved_bonus = 1.0 / step_count = 1.0/1 = 1.0
+
+ Total reward: 0.009 - 0.2 + 1.0 = 0.809 (if target achieved)
  ```

  ---
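The step-1 arithmetic above can be sanity-checked with a short sketch. The function name and signature here are hypothetical, mirroring the documented formulas rather than the real module API:

```python
# Hypothetical sketch of the documented single_fault step-reward formulas;
# single_fault_step_reward is an illustrative name, not the real function.
def single_fault_step_reward(max_rho, overloaded_count, redispatch_mw,
                             step_count, target_achieved):
    safe_margin_bonus = 0.05 * max(0.0, 1.0 - max_rho)   # 0.05 × 0.18 = 0.009
    overload_penalty = 0.2 * overloaded_count            # 0 when no line > 1.0
    redispatch_penalty = 0.01 * redispatch_mw            # 0.01 per |MW| moved
    reward = safe_margin_bonus - overload_penalty - redispatch_penalty
    if target_achieved:
        reward += 1.0 / step_count                       # early-termination bonus
    return reward

# Step 1 of the walkthrough: max_rho = 0.82, no overloads, 20 MW redispatched
reward = single_fault_step_reward(0.82, 0, 20, 1, True)   # ≈ 0.809
```

Without the early-termination bonus the same inputs give ≈ -0.191, which matches the pre-fix breakdown shown in the removed lines.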
 
  ]
  ```

+ ### Grader Calculation (from graders.py:28-55)

  ```python
  def grade_single_fault(episode_log):
      survival_score = survival_ratio * 0.7  # = 0.21

      # 2. Target achieved bonus (50%)
+     achieved_target = any(entry.all_lines_below_target or entry.all_lines_below_80 for entry in episode_log)
      target_bonus = 0.5 if achieved_target else 0.0  # = 0.5

+     # 3. Legacy success score (bonus for early completion)
+     legacy_success_score = 0.0
+     for entry in episode_log:
+         if entry.all_lines_below_target or entry.all_lines_below_80:
+             legacy_success_score = round(max(0.0, 1.0 - (0.08 * max(0, entry.step - 1))), 6)
+             break
+
+     # 4. Final state bonus (0.3 if below target, 0.15 if within +0.05, 0.05 if within +0.10)
      final_rho = 0.77
      target_threshold = 0.80
      if final_rho < target_threshold:
          final_bonus = 0.3  # = 0.3
      elif final_rho < target_threshold + 0.05:
          final_bonus = 0.15
+     elif final_rho < target_threshold + 0.10:
+         final_bonus = 0.05
      else:
          final_bonus = 0.0

      # Total
+     score = (survival_ratio * 0.7) + target_bonus + final_bonus
+     score = max(score, legacy_success_score)  # Take max of both scoring methods
      score = min(1.0, max(0.0, score))

+     return round(score, 6)

+     # For example, an episode that completes at step 3 with max_rho = 0.77:
+     # survival_score = (3/10) * 0.7 = 0.21
+     # target_bonus = 0.5 (achieved target)
+     # legacy_success_score = 1.0 - 0.08 * 2 = 0.84
+     # final_bonus = 0.3 (below target)
      #
+     # score = 0.21 + 0.5 + 0.3 = 1.01 → capped at 1.0
  ```

  ---
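Run end-to-end, the grader above can be sketched as follows. The `Entry` record and the explicit `final_rho` parameter are illustrative stand-ins for the real episode-log types:

```python
from dataclasses import dataclass

# Hypothetical stand-in for the real episode-log entry type.
@dataclass
class Entry:
    step: int
    all_lines_below_target: bool
    all_lines_below_80: bool

def grade_single_fault(episode_log, final_rho, target_threshold=0.80, max_steps=10):
    # Survival over the 10-step budget, weighted 70%
    survival_ratio = min(1.0, max(e.step for e in episode_log) / max_steps)
    achieved = any(e.all_lines_below_target or e.all_lines_below_80 for e in episode_log)
    target_bonus = 0.5 if achieved else 0.0
    # Legacy early-completion score: 1.0 - 0.08 per step after the first
    legacy = 0.0
    for e in episode_log:
        if e.all_lines_below_target or e.all_lines_below_80:
            legacy = round(max(0.0, 1.0 - 0.08 * max(0, e.step - 1)), 6)
            break
    # Final-state bonus tiers around the target threshold
    if final_rho < target_threshold:
        final_bonus = 0.3
    elif final_rho < target_threshold + 0.05:
        final_bonus = 0.15
    elif final_rho < target_threshold + 0.10:
        final_bonus = 0.05
    else:
        final_bonus = 0.0
    score = (survival_ratio * 0.7) + target_bonus + final_bonus
    score = max(score, legacy)  # take the better of both scoring methods
    return round(min(1.0, max(0.0, score)), 6)

# The worked example: target reached at step 3, final max_rho = 0.77
log = [Entry(1, False, False), Entry(2, False, False), Entry(3, True, True)]
score = grade_single_fault(log, final_rho=0.77)   # 0.21 + 0.5 + 0.3 = 1.01 → 1.0
```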
 
  | Tier | Target Range | Fixed in Code |
  |------|--------------|---------------|
+ | `single_fault_easy` | 0.82-0.85 | tasks.py:299 |
+ | `single_fault_moderate` | 0.86-0.89 | tasks.py:301 |
+ | `single_fault_severe` | 0.90-0.93 | tasks.py:303 |

+ **Note**: The original benchmark ranges (0.90-0.94, etc.) were mathematically impossible because generators could only reduce ~0.03-0.05 rho per step. Fixed in recent updates to 0.82-0.85, 0.86-0.89, 0.90-0.93.

  ---

architecture/task_2_architecture.md CHANGED
@@ -13,7 +13,7 @@ Key difference from Task 1:
  ## 1. Reset Phase

  ```python
- # tasks.py:434-456
  def _reset_n_minus_1(env, seed, difficulty_level, scenario_mode, benchmark_tier):
      obs = env.reset(
          seed=seed,
@@ -22,6 +22,8 @@ def _reset_n_minus_1(env, seed, difficulty_level, scenario_mode, benchmark_tier)
      return obs, {
          "faulted_lines": [0],
          "curriculum_stage": "fixed_n_minus_1",
          ...
      }
  ```
@@ -68,24 +70,68 @@ Based on how grid operators actually work:
  ## 3. Reward Function (RL2Grid-inspired)

  ```python
- # grid_environment.py:588-599
  elif self._task_id == "n_minus_1":
      r_survive = 1.0
      clipped_margins = [max(-1.0, min(1.0, 1.0 - float(rho))) for rho in observation.rho]
      r_overload = sum(clipped_margins) / len(clipped_margins)
      r_cost = -self._n_minus_1_redispatch_cost(action)
      reward += (0.3 * r_survive) + (0.6 * r_overload) + (0.1 * r_cost)
-     if reconnect_successful and self._reconnection_within_margin(...):
          reward += 2.0
-     if reached_time_limit:
-         reward += 10.0 * ((step / max_steps) ** 2)  # quadratic survival
-     elif done:
          reward -= 15.0  # blackout penalty
  ```

  ### Components

  | Component | Formula | Weight | Purpose |
  |-----------|---------|--------|---------|
  | `R_survive` | +1.0 per step | 0.3 | Constant survival signal |
@@ -140,35 +186,51 @@ N-1 STRUCTURAL SECURITY: score=0.941; bridge_lines=[4, 11, 15]
  ## 6. Grading (Phase-Aware)

  ```python
- # graders.py:56-80
- def grade_n_minus_1(episode_log, max_steps=20):
-     # Survival gates the score
-     survival_ratio = min(1.0, len(episode_log) / max_steps)

      # Component A: Emergency response (30%)
      emergency_clear_step = next(
-         (entry.step for entry in episode_log[:5] if max_rho < 0.92),
          None
      )
-     emergency_score = max(0, 1.0 - 0.2 × max(0, emergency_clear_step - 1))

      # Component B: Sustained security (50%)
-     phase2_logs = [e for e in episode_log if e.step >= 6]
-     security_ratio = sum(1 for e in phase2_logs if max_rho < 0.90) / 15

      # Component C: Reconnection (20%)
-     reconnection_score = 1.0 if any(0 not in e.disconnected_lines) else 0.0

-     mastery_score = 0.30 × emergency + 0.50 × security + 0.20 × reconnect

      # Final: survival × mastery (no legacy override)
-     return survival_ratio × mastery_score
  ```

  | Component | Weight | What it measures |
  |-----------|--------|------------------|
- | Emergency response | 30% | Cleared within 5 steps? |
  | Sustained security | 50% | Steps 6-20 with rho < 0.90? |
  | Reconnection | 20% | Did agent reconnect line 0? |

@@ -267,12 +329,14 @@ Behavior observed:
  | File | Purpose |
  |------|---------|
- | `tasks.py:434` | Scenario injection (line 0 disconnection) |
- | `grid_environment.py:588` | Three-component reward function |
- | `grid_environment.py:689` | Reconnection margin check |
- | `graph_analysis.py:129` | N-1 security score calculation |
- | `graders.py:56` | Phase-aware grader |
- | `inference.py:521` | Prompt with two-threshold framing |

  ---

  ## 1. Reset Phase

  ```python
+ # tasks.py:566-602 (via _reset_n_minus_1)
  def _reset_n_minus_1(env, seed, difficulty_level, scenario_mode, benchmark_tier):
      obs = env.reset(
          seed=seed,

      return obs, {
          "faulted_lines": [0],
          "curriculum_stage": "fixed_n_minus_1",
+         "scenario_mode": scenario_mode,
+         "benchmark_tier": benchmark_tier or "n_minus_1_fixed",
          ...
      }
  ```

  ## 3. Reward Function (RL2Grid-inspired)

+ From `grid_environment.py:598-609`:
+
  ```python
  elif self._task_id == "n_minus_1":
+     # Component 1: Survival signal (+1.0 per step)
      r_survive = 1.0
+
+     # Component 2: Loading margin quality
      clipped_margins = [max(-1.0, min(1.0, 1.0 - float(rho))) for rho in observation.rho]
      r_overload = sum(clipped_margins) / len(clipped_margins)
+
+     # Component 3: Redispatch cost (from _n_minus_1_redispatch_cost)
      r_cost = -self._n_minus_1_redispatch_cost(action)
+
+     # Combined reward
      reward += (0.3 * r_survive) + (0.6 * r_overload) + (0.1 * r_cost)
+
+     # Reconnection bonus (+2.0 if safe)
+     if reconnect_successful and self._reconnection_within_margin(previous_observation=self._last_obs, observation=observation):
          reward += 2.0
+
+     # Terminal rewards
+     if reached_time_limit and not observation.metadata.get("convergence_failed"):
+         reward += 10.0 * ((self._state.step_count / max(1, self._max_steps)) ** 2)
+     elif done and not reached_time_limit:
          reward -= 15.0  # blackout penalty
  ```

  ### Components

+ | Component | Formula | Weight | Purpose |
+ |-----------|---------|--------|---------|
+ | `R_survive` | +1.0 per step | 0.3 | Constant survival signal |
+ | `R_overload` | (1/n) × Σ clip(1-ρ, -1, 1) | 0.6 | Loading margin quality |
+ | `R_cost` | -0.05 × Σ\|ΔMW\|/ramp | 0.1 | Economic cost of redispatch |
+ | `R_reconnect` | +2.0 if safe reconnection | - | Heuristic from winning agents |
+ | Terminal | +10×(s/m)² / -15 | - | Quadratic survival / blackout |
+
+ ### Reconnection Detection & Validation
+
+ From `grid_environment.py:853-869` (`_detect_successful_reconnection`):
+
+ ```python
+ def _detect_successful_reconnection(previous_observation, observation, action):
+     # Check if any requested reconnection actually succeeded
+     requested_reconnects = {line_id for line_id, status in action.line_set.items() if status == 1}
+     for idx, (before, after) in enumerate(zip(previous_observation.line_status, observation.line_status)):
+         if not before and after and idx in requested_reconnects:
+             return True
+     return False
+ ```
+
+ From `grid_environment.py:728-737` (`_reconnection_within_margin`):
+
+ ```python
+ def _reconnection_within_margin(previous_observation, observation):
+     # Ensure reconnection doesn't worsen max_rho by more than 10%
+     previous_max = max(previous_observation.rho)
+     current_max = max(observation.rho)
+     return current_max <= previous_max + 0.1
+ ```
+
+ ### Components
+
  | Component | Formula | Weight | Purpose |
  |-----------|---------|--------|---------|
  | `R_survive` | +1.0 per step | 0.3 | Constant survival signal |
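A minimal sketch of the three-component per-step reward above, assuming a plain list of per-line loadings and a precomputed redispatch cost (the function name is illustrative, not the real environment method):

```python
def n_minus_1_step_reward(rho, redispatch_cost):
    # Component 1: constant survival signal
    r_survive = 1.0
    # Component 2: mean clipped loading margin across all lines
    clipped_margins = [max(-1.0, min(1.0, 1.0 - float(r))) for r in rho]
    r_overload = sum(clipped_margins) / len(clipped_margins)
    # Component 3: negative redispatch cost
    r_cost = -redispatch_cost
    return (0.3 * r_survive) + (0.6 * r_overload) + (0.1 * r_cost)

# Two lines with margin plus one overloaded line (rho = 1.2), small redispatch
reward = n_minus_1_step_reward([0.9, 0.8, 1.2], redispatch_cost=0.05)   # ≈ 0.315
```

Note how the overloaded line's negative margin is clipped at -1.0, so a single severe overload cannot dominate the mean.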
 
  ## 6. Grading (Phase-Aware)

+ From `graders.py:58-83`:
+
  ```python
+ def grade_n_minus_1(episode_log: list[EpisodeStepLog], max_steps: int = 20) -> float:
+     if not episode_log:
+         return 0.0

      # Component A: Emergency response (30%)
      emergency_clear_step = next(
+         (entry.step for entry in episode_log[:5] if float(entry.max_rho) < 0.92),
          None
      )
+     emergency_score = (
+         max(0.0, 1.0 - (0.2 * max(0, emergency_clear_step - 1)))
+         if emergency_clear_step is not None
+         else 0.0
+     )

      # Component B: Sustained security (50%)
+     # Phase 2: steps 6-20 (15 steps)
+     phase2_logs = [entry for entry in episode_log if entry.step >= 6]
+     security_ratio = (
+         sum(1 for entry in phase2_logs if float(entry.max_rho) < 0.90) / 15.0
+         if phase2_logs
+         else 0.0
+     )

      # Component C: Reconnection (20%)
+     # Did line 0 get reconnected at any point?
+     reconnection_score = 1.0 if any(0 not in entry.disconnected_lines for entry in episode_log) else 0.0
+
+     # Survival gates the score
+     survival_ratio = min(max_steps, max(entry.step for entry in episode_log)) / max_steps

+     # Mastery = weighted combination
+     mastery_score = (0.30 * emergency_score) + (0.50 * security_ratio) + (0.20 * reconnection_score)

      # Final: survival × mastery (no legacy override)
+     final_score = mastery_score * survival_ratio
+     return round(min(1.0, max(0.0, final_score)), 6)
  ```

  | Component | Weight | What it measures |
  |-----------|--------|------------------|
+ | Emergency response | 30% | Cleared within 5 steps? (0.92 threshold) |
  | Sustained security | 50% | Steps 6-20 with rho < 0.90? |
  | Reconnection | 20% | Did agent reconnect line 0? |
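The phase-aware grader can be exercised end-to-end with a toy episode. `StepLog` here is a hypothetical stand-in for the real `EpisodeStepLog` model, keeping only the three fields the grader reads:

```python
from dataclasses import dataclass, field

# Hypothetical stand-in for EpisodeStepLog (step, max_rho, disconnected_lines only).
@dataclass
class StepLog:
    step: int
    max_rho: float
    disconnected_lines: list = field(default_factory=list)

def grade_n_minus_1(episode_log, max_steps=20):
    if not episode_log:
        return 0.0
    # Component A: emergency response within the first 5 steps (30%)
    emergency_clear_step = next(
        (e.step for e in episode_log[:5] if float(e.max_rho) < 0.92), None
    )
    emergency_score = (
        max(0.0, 1.0 - 0.2 * max(0, emergency_clear_step - 1))
        if emergency_clear_step is not None else 0.0
    )
    # Component B: sustained security over steps 6-20 (50%)
    phase2 = [e for e in episode_log if e.step >= 6]
    security_ratio = (
        sum(1 for e in phase2 if float(e.max_rho) < 0.90) / 15.0 if phase2 else 0.0
    )
    # Component C: was line 0 ever reconnected? (20%)
    reconnection_score = 1.0 if any(0 not in e.disconnected_lines for e in episode_log) else 0.0
    # Survival gates everything
    survival_ratio = min(max_steps, max(e.step for e in episode_log)) / max_steps
    mastery = 0.30 * emergency_score + 0.50 * security_ratio + 0.20 * reconnection_score
    return round(min(1.0, max(0.0, mastery * survival_ratio)), 6)

# 20-step episode: emergency cleared at step 2, phase 2 fully secure,
# line 0 reconnected from step 4 onward
log = [StepLog(step=s, max_rho=0.95 if s == 1 else 0.85,
               disconnected_lines=[0] if s < 4 else []) for s in range(1, 21)]
score = grade_n_minus_1(log)   # 1.0 × (0.30·0.8 + 0.50·1.0 + 0.20·1.0) = 0.94
```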
236
 
 
  | File | Purpose |
  |------|---------|
+ | `tasks.py:115-132` | Task 2 task spec and reset dispatch |
+ | `tasks.py:566-602` | `_reset_n_minus_1` - line 0 disconnection |
+ | `grid_environment.py:598-609` | Three-component reward function |
+ | `grid_environment.py:728-737` | `_reconnection_within_margin` - safety check |
+ | `grid_environment.py:853-869` | `_detect_successful_reconnection` |
+ | `graders.py:58-83` | Phase-aware grader |
+ | `graph_analysis.py` | N-1 security score (bridge line analysis) |
+ | `inference.py` | Prompt with two-threshold framing (EMERGENCY/WARNING/SAFE) |

  ---

architecture/task_4_architecture.md CHANGED
@@ -611,18 +611,19 @@ MSCF RULE: Prefer actions that preserve transferable generation and keep islands
  | File | Line Numbers | Purpose |
  |------|------------|---------|
  | `grid2op_env/server/tasks.py` | 53-62 | Task definition |
- | `grid2op_env/server/tasks.py` | 70-74 | Line triplets |
- | `grid2op_env/server/tasks.py` | 335-338 | Profile function |
- | `grid2op_env/server/tasks.py` | 566-620 | Reset function |
- | `grid2op_env/server/tasks.py` | 623-633 | Survival probe |
- | `grid2op_env/server/grid_environment.py` | 63-66 | Reward constants |
  | `grid2op_env/server/grid_environment.py` | 630-647 | Reward function |
- | `grid2op_env/server/grid_environment.py` | 750-765 | Stage metadata |
- | `grid2op_env/server/grid_environment.py` | 767-814 | Island assessment |
- | `grid2op_env/server/grid_environment.py` | 816-849 | Connected components |
- | `grid2op_env/server/graders.py` | 124-174 | Grading function |
- | `grid2op_env/inference.py` | 606-648 | LLM prompt |
- | `grid2op_env/models.py` | 54-59 | EpisodeStepLog fields |

  ---

  | File | Line Numbers | Purpose |
  |------|------------|---------|
  | `grid2op_env/server/tasks.py` | 53-62 | Task definition |
+ | `grid2op_env/server/tasks.py` | 70-74 | Line triplets: (2,4,14), (2,4,15), (4,14,16) |
+ | `grid2op_env/server/tasks.py` | 334-337 | Profile function: returns 1.20 (20% load) |
+ | `grid2op_env/server/tasks.py` | 565-620 | Reset function with survival probe |
+ | `grid2op_env/server/tasks.py` | 623-633 | Survival probe function |
+ | `grid2op_env/server/grid_environment.py` | 63-66 | Reward constants (0.02, 5.0, 0.5, 8.0) |
  | `grid2op_env/server/grid_environment.py` | 630-647 | Reward function |
+ | `grid2op_env/server/grid_environment.py` | 750-765 | Stage metadata computation |
+ | `grid2op_env/server/grid_environment.py` | 767-814 | Island availability assessment |
+ | `grid2op_env/server/grid_environment.py` | 816-849 | Connected components detection |
+ | `grid2op_env/server/graders.py` | 124-174 | Four-component grading |
+ | `grid2op_env/inference.py` | 606-648 | LLM prompt with stage context |
+ | `grid2op_env/inference.py` | 1003-1030 | Candidate filtering (removes unsafe disconnects) |
+ | `grid2op_env/models.py` | 54-59 | EpisodeStepLog fields for Task 4 |

  ---

grid2op_env/README.md CHANGED
@@ -53,7 +53,7 @@ Supporting files outside the minimum template remain for quality and verificatio
  - Grid2Op core simulator using `l2rpn_case14_sandbox`
  - Typed `GridAction`, `GridObservation`, and `GridState`
- - Three tasks: `single_fault`, `n_minus_1`, `cascade_prevent`, `multi_stage_cascade`
  - Reset-time scenario injection and retry logic for non-convergent starts
  - Shaped reward, episode logging, and deterministic graders
  - OpenEnv WebSocket interface plus `/tasks`, `/grader`, and `/baseline`
@@ -65,32 +65,72 @@ Supporting files outside the minimum template remain for quality and verificatio
  ## Recent fixes

- 1. **Benchmark ranges corrected** (tasks.py lines 248-255):
     - `single_fault_easy`: 0.82-0.85 (was mathematically impossible 0.90-0.94)
     - `single_fault_moderate`: 0.86-0.89 (was 0.94-0.97)
     - `single_fault_severe`: 0.90-0.93 (was 0.96-0.99)
-
- 2. **Redispatch penalty added** (grid_environment.py line 58):
-    - `SINGLE_FAULT_REDISPATCH_PENALTY_PER_MW = 0.01` per MW to discourage large interventions
-
- 3. **Survival-focused grading** (graders.py):
-    - 70% weight on survival ratio + bonuses
-
- 4. **Task 2 (n_minus_1) redesign** based on RL2Grid paper:
-    - Three-component reward: 0.3×R_survive + 0.6×R_overload + 0.1×R_cost
-    - Reconnection bonus: +2.0 when safely reconnecting faulted line
     - Terminal: +10×(s/m)² quadratic survival, -15 blackout
-    - Phase-aware grader: 30% emergency + 50% security + 20% reconnection
     - N-1 security score (bridge lines) in prompt
     - **Grading now honest**: score = survival_ratio × mastery_score (no override)
     - Latest eval: 0.952 (was 1.0 with old override)

- 5. **Task 4 (multi_stage_cascade) added**:
-    - 3 lines disconnected at reset + 15% load increase
     - Three explicit stages (10 steps each) with stage boundaries at step 10 and 20
-    - Island availability assessment at stage boundaries
-    - Candidate filtering prevents grid collapse actions
-    - Four-component grading: stage completion (30%) + load preservation (40%) + island quality (20%) + speed bonus (10%)
     - Latest eval: 0.929 (31x improvement from 0.027)

  ## Planner architecture

  - Grid2Op core simulator using `l2rpn_case14_sandbox`
  - Typed `GridAction`, `GridObservation`, and `GridState`
+ - Four tasks: `single_fault`, `n_minus_1`, `cascade_prevent`, `multi_stage_cascade`
  - Reset-time scenario injection and retry logic for non-convergent starts
  - Shaped reward, episode logging, and deterministic graders
  - OpenEnv WebSocket interface plus `/tasks`, `/grader`, and `/baseline`

  ## Recent fixes

+ 1. **Task 1 (single_fault) benchmark ranges corrected** (tasks.py:297-304):
     - `single_fault_easy`: 0.82-0.85 (was mathematically impossible 0.90-0.94)
     - `single_fault_moderate`: 0.86-0.89 (was 0.94-0.97)
     - `single_fault_severe`: 0.90-0.93 (was 0.96-0.99)
+    - Warmup phase finds a high-loading state in the chronics; the agent then has 10 steps to solve it
+
+ 2. **Task 1 reward function** (grid_environment.py:589-596):
+    - Target achieved bonus: `1.0 / step_count` (rewards early solution)
+    - Safe margin bonus: `0.05 × max(0.0, 1.0 - max_rho)`
+    - Overload penalty: `0.2 × overloaded_count` (lines > 100%)
+    - Redispatch penalty: `0.01 × MW` (discourages large interventions)
+    - Failure penalty: `-5.0` if time limit reached without target
+
+ 3. **Task 1 grading** (graders.py:28-55):
+    - 70% weight on survival ratio
+    - 50% target achieved bonus
+    - Final state bonus: 0.3 if below target, 0.15 if within +0.05, 0.05 if within +0.10
+    - Legacy success score for early completion: `1.0 - 0.08 × (step - 1)`
+
+ 4. **Task 2 (n_minus_1) redesign** based on RL2Grid paper (grid_environment.py:598-609):
+    - Three-component reward: `0.3×R_survive + 0.6×R_overload + 0.1×R_cost`
+    - `R_survive`: +1.0 per step (constant survival signal)
+    - `R_overload`: `(1/n) × Σ clip(1-ρ, -1, 1)` - loading margin quality
+    - `R_cost`: `-0.05 × Σ|ΔMW|/max_ramp` (normalized redispatch cost)
+    - Reconnection bonus: +2.0 when safely reconnecting (grid_environment.py:853-869)
     - Terminal: +10×(s/m)² quadratic survival, -15 blackout
+    - Phase-aware grader (graders.py:58-83):
+      - Emergency response (30%): cleared within 5 steps at rho < 0.92
+      - Sustained security (50%): steps 6-20 at rho < 0.90
+      - Reconnection (20%): did agent reconnect line 0?
     - N-1 security score (bridge lines) in prompt
     - **Grading now honest**: score = survival_ratio × mastery_score (no override)
     - Latest eval: 0.952 (was 1.0 with old override)

+ 5. **Task 3 (cascade_prevent)** (grid_environment.py:611-628):
+    - 1-2 lines disconnected at reset + 5-15% load increase
+    - Key metric: `timestep_overflow` countdowns (not just max_rho)
+    - Quadratic overflow penalty: `-0.05 × Σ(overflow²)` - a line at overflow=2 is 4x more urgent than one at overflow=1
+    - Reward components:
+      - Cascade prevention: +0.3 if no auto-trip, -2.5 if auto-trip
+      - Thermal margin: +0.1 × mean(clip(1-ρ, -1, 1))
+      - Terminal: +5.0 × (1 - auto_trips/5)² survival bonus, -12.0 blackout
+    - Grading (graders.py:86-121):
+      - Cascade containment (50%): steps without auto-trips / 30
+      - Thermal stability (30%): safe_steps / containment_steps
+      - Recovery speed (20%): how fast the grid recovered from the first overload
+    - Latest eval: 0.798 (hard/extreme tiers remain challenging)
+
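The quadratic overflow penalty in item 5 can be checked with a one-liner (the helper name and example countdown values are hypothetical):

```python
# Hypothetical helper for the documented -0.05 × Σ(overflow²) penalty.
def overflow_penalty(timestep_overflow):
    # A line at overflow=2 contributes 4x the penalty of one at overflow=1
    return -0.05 * sum(o ** 2 for o in timestep_overflow)

penalty = overflow_penalty([2, 1, 0, 0])   # -0.05 × (4 + 1) = -0.25
```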
+ 6. **Task 4 (multi_stage_cascade)** (tasks.py:334-337, grid_environment.py:630-647):
+    - 3 lines disconnected at reset + **20% load increase** (not 15%)
     - Three explicit stages (10 steps each) with stage boundaries at step 10 and 20
+    - Overflow window: 2 (faster cascades than the default 3)
+    - Do-nothing survival probe: 5 steps minimum
+    - Island availability assessment at stage boundaries (grid_environment.py:767-814)
+    - Candidate filtering (inference.py:1003-1030): filters unsafe topology disconnects
+    - Reward (grid_environment.py:630-647):
+      - Generation cost: -0.02 × (total_gen / initial_load)
+      - Convergence: +0.5 × available_island_ratio
+      - Load loss penalty: -5.0 × (1 - available_load_ratio), applied at boundaries only
+      - Terminal win: +8.0 × (available_load_ratio)² if ≥50% load remains at step 30
+      - Terminal blackout: -12.0
+    - Grading (graders.py:124-174):
+      - Stage completion (30%): survived stages 1, 2, 3
+      - Load preservation (40%): available_load_ratio at end
+      - Island quality (20%): majority of islands viable at boundaries
+      - Speed bonus (10%): how fast stability returned each stage
     - Latest eval: 0.929 (31x improvement from 0.027)
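A minimal sketch of the Task 4 terminal-reward arithmetic from item 6, assuming a hypothetical helper name and example load ratio:

```python
# Hypothetical helper for the documented Task 4 terminal rewards.
def multi_stage_terminal_reward(available_load_ratio, blackout=False):
    if blackout:
        return -12.0                               # terminal blackout
    if available_load_ratio >= 0.5:
        # Quadratic bonus favors keeping as much load energized as possible
        return 8.0 * available_load_ratio ** 2     # terminal win
    return 0.0

reward = multi_stage_terminal_reward(0.9)   # 8.0 × 0.81 = 6.48
```

The quadratic shape means preserving 90% of load earns roughly 2.6x the bonus of preserving the bare 50% minimum (6.48 vs 2.0).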

  ## Planner architecture