Sidharth1743 committed
Commit da3c180 · 1 Parent(s): a65650a

docs updated
architecture/task_1_architecture.md CHANGED
@@ -24,10 +24,10 @@ At each step, calculate: max_rho = max(all line loadings)
  Find the step where: target_min ≤ max_rho ≤ target_max

- │ Difficulty levels:
- │ - easy/curriculum: 0.90-0.94 0.82-0.85 (benchmark)
- │ - moderate: 0.94-0.97 0.86-0.89 (benchmark)
- │ - severe: 0.96-0.99 0.90-0.93 (benchmark)


  STOP at that step - this is your starting state
@@ -45,6 +45,7 @@ Return observation + scenario metadata
  - `target_rho_range`: [min, max] that was searched for
  - `warmup_steps`: How many steps were taken to find the state
  - `target_matched`: True if exact target found, False if fallback used

  ---

@@ -294,12 +295,18 @@ Result:
  ### Reward Breakdown (Step 1)

  ```
- Safe margin bonus: 0.05 × (1.0 - 0.82) = 0.05 × 0.18 = 0.009
- Overload penalty: 0 (no lines > 1.0)
- Redispatch penalty: 0.01 × |−10| + 0.01 × |10| = 0.01 × 20 = 0.2
- ─────────────────
- Total reward: 0.009 - 0.2 = -0.191
  ```

  ---
@@ -445,7 +452,7 @@ if self._task_id == "single_fault" and all_lines_below_target:
  ]
  ```

- ### Grader Calculation (from graders.py)

  ```python
  def grade_single_fault(episode_log):
@@ -454,31 +461,42 @@ def grade_single_fault(episode_log):
      survival_score = survival_ratio * 0.7  # = 0.21

      # 2. Target achieved bonus (50%)
-     achieved_target = any(entry.all_lines_below_target for entry in episode_log)
      target_bonus = 0.5 if achieved_target else 0.0  # = 0.5

-     # 3. Final state bonus
      final_rho = 0.77
      target_threshold = 0.80
      if final_rho < target_threshold:
          final_bonus = 0.3  # = 0.3
      elif final_rho < target_threshold + 0.05:
          final_bonus = 0.15
      else:
          final_bonus = 0.0

      # Total
-     score = survival_score + target_bonus + final_bonus
      score = min(1.0, max(0.0, score))

-     return score

-     # Calculation:
-     # survival_score = 0.3 × 0.7 = 0.21
-     # target_bonus = 0.5
-     # final_bonus = 0.3
      #
-     # TOTAL = 0.21 + 0.5 + 0.3 = 1.01 → capped at 1.0
  ```

  ---
@@ -561,11 +579,11 @@ STEP 3
  | Tier | Target Range | Fixed in Code |
  |------|--------------|---------------|
- | `single_fault_easy` | 0.82-0.85 | tasks.py:250 |
- | `single_fault_moderate` | 0.86-0.89 | tasks.py:252 |
- | `single_fault_severe` | 0.90-0.93 | tasks.py:254 |

- **Note**: The original benchmark ranges (0.90-0.94, etc.) were mathematically impossible because generators could only reduce ~0.03-0.05 rho per step. Fixed in recent updates.

  ---

  Find the step where: target_min ≤ max_rho ≤ target_max

+ │ Difficulty levels (BENCHMARK - FIXED in tasks.py:297-304):
+ │ - easy: 0.82-0.85 (was impossible 0.90-0.94)
+ │ - moderate: 0.86-0.89 (was impossible 0.94-0.97)
+ │ - severe: 0.90-0.93 (was impossible 0.96-0.99)


  STOP at that step - this is your starting state

  - `target_rho_range`: [min, max] that was searched for
  - `warmup_steps`: How many steps were taken to find the state
  - `target_matched`: True if exact target found, False if fallback used
+ - `scenario`: "high_loading"

  ---

  ### Reward Breakdown (Step 1)

+ From `grid_environment.py:589-596`:
+
  ```
+ # Actual implementation:
+ safe_margin_bonus = 0.05 × max(0.0, 1.0 - max_rho)   # = 0.05 × 0.18 = 0.009
+ overload_penalty = 0.2 × overloaded_count            # = 0 (no lines > 1.0)
+ redispatch_penalty = _action_penalty(action)         # = 0.01 × 20 = 0.2
+
+ # Plus: early termination bonus if target achieved (step 1)
+ # target_achieved_bonus = 1.0 / step_count = 1.0/1 = 1.0
+
+ Total reward: 0.009 - 0.2 + 1.0 = 0.809 (if target achieved)
  ```

  ---
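The step-1 arithmetic above can be sanity-checked with a short sketch. The function name and signature here are hypothetical, mirroring the documented formulas rather than the real module API:

```python
# Hypothetical sketch of the documented single_fault step-reward formulas;
# single_fault_step_reward is an illustrative name, not the real function.
def single_fault_step_reward(max_rho, overloaded_count, redispatch_mw,
                             step_count, target_achieved):
    safe_margin_bonus = 0.05 * max(0.0, 1.0 - max_rho)   # 0.05 × 0.18 = 0.009
    overload_penalty = 0.2 * overloaded_count            # 0 when no line > 1.0
    redispatch_penalty = 0.01 * redispatch_mw            # 0.01 per |MW| moved
    reward = safe_margin_bonus - overload_penalty - redispatch_penalty
    if target_achieved:
        reward += 1.0 / step_count                       # early-termination bonus
    return reward

# Step 1 of the walkthrough: max_rho = 0.82, no overloads, 20 MW redispatched
reward = single_fault_step_reward(0.82, 0, 20, 1, True)   # ≈ 0.809
```

Without the early-termination bonus the same inputs give ≈ -0.191, which matches the pre-fix breakdown shown in the removed lines.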
 
  ]
  ```

+ ### Grader Calculation (from graders.py:28-55)

  ```python
  def grade_single_fault(episode_log):
      survival_score = survival_ratio * 0.7  # = 0.21

      # 2. Target achieved bonus (50%)
+     achieved_target = any(entry.all_lines_below_target or entry.all_lines_below_80 for entry in episode_log)
      target_bonus = 0.5 if achieved_target else 0.0  # = 0.5

+     # 3. Legacy success score (bonus for early completion)
+     legacy_success_score = 0.0
+     for entry in episode_log:
+         if entry.all_lines_below_target or entry.all_lines_below_80:
+             legacy_success_score = round(max(0.0, 1.0 - (0.08 * max(0, entry.step - 1))), 6)
+             break
+
+     # 4. Final state bonus (0.3 if below target, 0.15 if within +0.05, 0.05 if within +0.10)
      final_rho = 0.77
      target_threshold = 0.80
      if final_rho < target_threshold:
          final_bonus = 0.3  # = 0.3
      elif final_rho < target_threshold + 0.05:
          final_bonus = 0.15
+     elif final_rho < target_threshold + 0.10:
+         final_bonus = 0.05
      else:
          final_bonus = 0.0

      # Total
+     score = (survival_ratio * 0.7) + target_bonus + final_bonus
+     score = max(score, legacy_success_score)  # Take max of both scoring methods
      score = min(1.0, max(0.0, score))

+     return round(score, 6)

+     # For example, an episode that completes at step 3 with max_rho = 0.77:
+     # survival_score = (3/10) * 0.7 = 0.21
+     # target_bonus = 0.5 (achieved target)
+     # legacy_success_score = 1.0 - 0.08 * 2 = 0.84
+     # final_bonus = 0.3 (below target)
      #
+     # score = 0.21 + 0.5 + 0.3 = 1.01 → capped at 1.0
  ```

  ---
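Run end-to-end, the grader above can be sketched as follows. The `Entry` record and the explicit `final_rho` parameter are illustrative stand-ins for the real episode-log types:

```python
from dataclasses import dataclass

# Hypothetical stand-in for the real episode-log entry type.
@dataclass
class Entry:
    step: int
    all_lines_below_target: bool
    all_lines_below_80: bool

def grade_single_fault(episode_log, final_rho, target_threshold=0.80, max_steps=10):
    # Survival over the 10-step budget, weighted 70%
    survival_ratio = min(1.0, max(e.step for e in episode_log) / max_steps)
    achieved = any(e.all_lines_below_target or e.all_lines_below_80 for e in episode_log)
    target_bonus = 0.5 if achieved else 0.0
    # Legacy early-completion score: 1.0 - 0.08 per step after the first
    legacy = 0.0
    for e in episode_log:
        if e.all_lines_below_target or e.all_lines_below_80:
            legacy = round(max(0.0, 1.0 - 0.08 * max(0, e.step - 1)), 6)
            break
    # Final-state bonus tiers around the target threshold
    if final_rho < target_threshold:
        final_bonus = 0.3
    elif final_rho < target_threshold + 0.05:
        final_bonus = 0.15
    elif final_rho < target_threshold + 0.10:
        final_bonus = 0.05
    else:
        final_bonus = 0.0
    score = (survival_ratio * 0.7) + target_bonus + final_bonus
    score = max(score, legacy)  # take the better of both scoring methods
    return round(min(1.0, max(0.0, score)), 6)

# The worked example: target reached at step 3, final max_rho = 0.77
log = [Entry(1, False, False), Entry(2, False, False), Entry(3, True, True)]
score = grade_single_fault(log, final_rho=0.77)   # 0.21 + 0.5 + 0.3 = 1.01 → 1.0
```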
 
  | Tier | Target Range | Fixed in Code |
  |------|--------------|---------------|
+ | `single_fault_easy` | 0.82-0.85 | tasks.py:299 |
+ | `single_fault_moderate` | 0.86-0.89 | tasks.py:301 |
+ | `single_fault_severe` | 0.90-0.93 | tasks.py:303 |

+ **Note**: The original benchmark ranges (0.90-0.94, etc.) were mathematically impossible because generators could only reduce ~0.03-0.05 rho per step. Fixed in recent updates to 0.82-0.85, 0.86-0.89, 0.90-0.93.

  ---

architecture/task_2_architecture.md CHANGED
@@ -13,7 +13,7 @@ Key difference from Task 1:
  ## 1. Reset Phase

  ```python
- # tasks.py:434-456
  def _reset_n_minus_1(env, seed, difficulty_level, scenario_mode, benchmark_tier):
      obs = env.reset(
          seed=seed,
@@ -22,6 +22,8 @@ def _reset_n_minus_1(env, seed, difficulty_level, scenario_mode, benchmark_tier)
      return obs, {
          "faulted_lines": [0],
          "curriculum_stage": "fixed_n_minus_1",
          ...
      }
  ```
@@ -68,24 +70,68 @@ Based on how grid operators actually work:
  ## 3. Reward Function (RL2Grid-inspired)

  ```python
- # grid_environment.py:588-599
  elif self._task_id == "n_minus_1":
      r_survive = 1.0
      clipped_margins = [max(-1.0, min(1.0, 1.0 - float(rho))) for rho in observation.rho]
      r_overload = sum(clipped_margins) / len(clipped_margins)
      r_cost = -self._n_minus_1_redispatch_cost(action)
      reward += (0.3 * r_survive) + (0.6 * r_overload) + (0.1 * r_cost)
-     if reconnect_successful and self._reconnection_within_margin(...):
          reward += 2.0
-     if reached_time_limit:
-         reward += 10.0 * ((step / max_steps) ** 2)  # quadratic survival
-     elif done:
          reward -= 15.0  # blackout penalty
  ```

  ### Components

  | Component | Formula | Weight | Purpose |
  |-----------|---------|--------|---------|
  | `R_survive` | +1.0 per step | 0.3 | Constant survival signal |
@@ -140,35 +186,51 @@ N-1 STRUCTURAL SECURITY: score=0.941; bridge_lines=[4, 11, 15]
  ## 6. Grading (Phase-Aware)

  ```python
- # graders.py:56-80
- def grade_n_minus_1(episode_log, max_steps=20):
-     # Survival gates the score
-     survival_ratio = min(1.0, len(episode_log) / max_steps)

      # Component A: Emergency response (30%)
      emergency_clear_step = next(
-         (entry.step for entry in episode_log[:5] if max_rho < 0.92),
          None
      )
-     emergency_score = max(0, 1.0 - 0.2 × max(0, emergency_clear_step - 1))

      # Component B: Sustained security (50%)
-     phase2_logs = [e for e in episode_log if e.step >= 6]
-     security_ratio = sum(1 for e in phase2_logs if max_rho < 0.90) / 15

      # Component C: Reconnection (20%)
-     reconnection_score = 1.0 if any(0 not in e.disconnected_lines) else 0.0

-     mastery_score = 0.30 × emergency + 0.50 × security + 0.20 × reconnect

      # Final: survival × mastery (no legacy override)
-     return survival_ratio × mastery_score
  ```

  | Component | Weight | What it measures |
  |-----------|--------|------------------|
- | Emergency response | 30% | Cleared within 5 steps? |
  | Sustained security | 50% | Steps 6-20 with rho < 0.90? |
  | Reconnection | 20% | Did agent reconnect line 0? |

@@ -267,12 +329,14 @@ Behavior observed:
  | File | Purpose |
  |------|---------|
- | `tasks.py:434` | Scenario injection (line 0 disconnection) |
- | `grid_environment.py:588` | Three-component reward function |
- | `grid_environment.py:689` | Reconnection margin check |
- | `graph_analysis.py:129` | N-1 security score calculation |
- | `graders.py:56` | Phase-aware grader |
- | `inference.py:521` | Prompt with two-threshold framing |

  ---

  ## 1. Reset Phase

  ```python
+ # tasks.py:566-602 (via _reset_n_minus_1)
  def _reset_n_minus_1(env, seed, difficulty_level, scenario_mode, benchmark_tier):
      obs = env.reset(
          seed=seed,

      return obs, {
          "faulted_lines": [0],
          "curriculum_stage": "fixed_n_minus_1",
+         "scenario_mode": scenario_mode,
+         "benchmark_tier": benchmark_tier or "n_minus_1_fixed",
          ...
      }
  ```

  ## 3. Reward Function (RL2Grid-inspired)

+ From `grid_environment.py:598-609`:
+
  ```python
  elif self._task_id == "n_minus_1":
+     # Component 1: Survival signal (+1.0 per step)
      r_survive = 1.0
+
+     # Component 2: Loading margin quality
      clipped_margins = [max(-1.0, min(1.0, 1.0 - float(rho))) for rho in observation.rho]
      r_overload = sum(clipped_margins) / len(clipped_margins)
+
+     # Component 3: Redispatch cost (from _n_minus_1_redispatch_cost)
      r_cost = -self._n_minus_1_redispatch_cost(action)
+
+     # Combined reward
      reward += (0.3 * r_survive) + (0.6 * r_overload) + (0.1 * r_cost)
+
+     # Reconnection bonus (+2.0 if safe)
+     if reconnect_successful and self._reconnection_within_margin(previous_observation=self._last_obs, observation=observation):
          reward += 2.0
+
+     # Terminal rewards
+     if reached_time_limit and not observation.metadata.get("convergence_failed"):
+         reward += 10.0 * ((self._state.step_count / max(1, self._max_steps)) ** 2)
+     elif done and not reached_time_limit:
          reward -= 15.0  # blackout penalty
  ```

  ### Components

+ | Component | Formula | Weight | Purpose |
+ |-----------|---------|--------|---------|
+ | `R_survive` | +1.0 per step | 0.3 | Constant survival signal |
+ | `R_overload` | (1/n) × Σ clip(1-ρ, -1, 1) | 0.6 | Loading margin quality |
+ | `R_cost` | -0.05 × Σ\|ΔMW\|/ramp | 0.1 | Economic cost of redispatch |
+ | `R_reconnect` | +2.0 if safe reconnection | - | Heuristic from winning agents |
+ | Terminal | +10×(s/m)² / -15 | - | Quadratic survival / blackout |
+
+ ### Reconnection Detection & Validation
+
+ From `grid_environment.py:853-869` (`_detect_successful_reconnection`):
+
+ ```python
+ def _detect_successful_reconnection(previous_observation, observation, action):
+     # Check if any requested reconnection actually succeeded
+     requested_reconnects = {line_id for line_id, status in action.line_set.items() if status == 1}
+     for idx, (before, after) in enumerate(zip(previous_observation.line_status, observation.line_status)):
+         if not before and after and idx in requested_reconnects:
+             return True
+     return False
+ ```
+
+ From `grid_environment.py:728-737` (`_reconnection_within_margin`):
+
+ ```python
+ def _reconnection_within_margin(previous_observation, observation):
+     # Ensure reconnection doesn't worsen max_rho by more than 10%
+     previous_max = max(previous_observation.rho)
+     current_max = max(observation.rho)
+     return current_max <= previous_max + 0.1
+ ```
+
+ ### Components
+
  | Component | Formula | Weight | Purpose |
  |-----------|---------|--------|---------|
  | `R_survive` | +1.0 per step | 0.3 | Constant survival signal |
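A minimal sketch of the three-component per-step reward above, assuming a plain list of per-line loadings and a precomputed redispatch cost (the function name is illustrative, not the real environment method):

```python
def n_minus_1_step_reward(rho, redispatch_cost):
    # Component 1: constant survival signal
    r_survive = 1.0
    # Component 2: mean clipped loading margin across all lines
    clipped_margins = [max(-1.0, min(1.0, 1.0 - float(r))) for r in rho]
    r_overload = sum(clipped_margins) / len(clipped_margins)
    # Component 3: negative redispatch cost
    r_cost = -redispatch_cost
    return (0.3 * r_survive) + (0.6 * r_overload) + (0.1 * r_cost)

# Two lines with margin plus one overloaded line (rho = 1.2), small redispatch
reward = n_minus_1_step_reward([0.9, 0.8, 1.2], redispatch_cost=0.05)   # ≈ 0.315
```

Note how the overloaded line's negative margin is clipped at -1.0, so a single severe overload cannot dominate the mean.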
 
  ## 6. Grading (Phase-Aware)

+ From `graders.py:58-83`:
+
  ```python
+ def grade_n_minus_1(episode_log: list[EpisodeStepLog], max_steps: int = 20) -> float:
+     if not episode_log:
+         return 0.0

      # Component A: Emergency response (30%)
      emergency_clear_step = next(
+         (entry.step for entry in episode_log[:5] if float(entry.max_rho) < 0.92),
          None
      )
+     emergency_score = (
+         max(0.0, 1.0 - (0.2 * max(0, emergency_clear_step - 1)))
+         if emergency_clear_step is not None
+         else 0.0
+     )

      # Component B: Sustained security (50%)
+     # Phase 2: steps 6-20 (15 steps)
+     phase2_logs = [entry for entry in episode_log if entry.step >= 6]
+     security_ratio = (
+         sum(1 for entry in phase2_logs if float(entry.max_rho) < 0.90) / 15.0
+         if phase2_logs
+         else 0.0
+     )

      # Component C: Reconnection (20%)
+     # Did line 0 get reconnected at any point?
+     reconnection_score = 1.0 if any(0 not in entry.disconnected_lines for entry in episode_log) else 0.0
+
+     # Survival gates the score
+     survival_ratio = min(max_steps, max(entry.step for entry in episode_log)) / max_steps

+     # Mastery = weighted combination
+     mastery_score = (0.30 * emergency_score) + (0.50 * security_ratio) + (0.20 * reconnection_score)

      # Final: survival × mastery (no legacy override)
+     final_score = mastery_score * survival_ratio
+     return round(min(1.0, max(0.0, final_score)), 6)
  ```

  | Component | Weight | What it measures |
  |-----------|--------|------------------|
+ | Emergency response | 30% | Cleared within 5 steps? (0.92 threshold) |
  | Sustained security | 50% | Steps 6-20 with rho < 0.90? |
  | Reconnection | 20% | Did agent reconnect line 0? |
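The phase-aware grader can be exercised end-to-end with a toy episode. `StepLog` here is a hypothetical stand-in for the real `EpisodeStepLog` model, keeping only the three fields the grader reads:

```python
from dataclasses import dataclass, field

# Hypothetical stand-in for EpisodeStepLog (step, max_rho, disconnected_lines only).
@dataclass
class StepLog:
    step: int
    max_rho: float
    disconnected_lines: list = field(default_factory=list)

def grade_n_minus_1(episode_log, max_steps=20):
    if not episode_log:
        return 0.0
    # Component A: emergency response within the first 5 steps (30%)
    emergency_clear_step = next(
        (e.step for e in episode_log[:5] if float(e.max_rho) < 0.92), None
    )
    emergency_score = (
        max(0.0, 1.0 - 0.2 * max(0, emergency_clear_step - 1))
        if emergency_clear_step is not None else 0.0
    )
    # Component B: sustained security over steps 6-20 (50%)
    phase2 = [e for e in episode_log if e.step >= 6]
    security_ratio = (
        sum(1 for e in phase2 if float(e.max_rho) < 0.90) / 15.0 if phase2 else 0.0
    )
    # Component C: was line 0 ever reconnected? (20%)
    reconnection_score = 1.0 if any(0 not in e.disconnected_lines for e in episode_log) else 0.0
    # Survival gates everything
    survival_ratio = min(max_steps, max(e.step for e in episode_log)) / max_steps
    mastery = 0.30 * emergency_score + 0.50 * security_ratio + 0.20 * reconnection_score
    return round(min(1.0, max(0.0, mastery * survival_ratio)), 6)

# 20-step episode: emergency cleared at step 2, phase 2 fully secure,
# line 0 reconnected from step 4 onward
log = [StepLog(step=s, max_rho=0.95 if s == 1 else 0.85,
               disconnected_lines=[0] if s < 4 else []) for s in range(1, 21)]
score = grade_n_minus_1(log)   # 1.0 × (0.30·0.8 + 0.50·1.0 + 0.20·1.0) = 0.94
```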
236
 
 
  | File | Purpose |
  |------|---------|
+ | `tasks.py:115-132` | Task 2 task spec and reset dispatch |
+ | `tasks.py:566-602` | `_reset_n_minus_1` - line 0 disconnection |
+ | `grid_environment.py:598-609` | Three-component reward function |
+ | `grid_environment.py:728-737` | `_reconnection_within_margin` - safety check |
+ | `grid_environment.py:853-869` | `_detect_successful_reconnection` |
+ | `graders.py:58-83` | Phase-aware grader |
+ | `graph_analysis.py` | N-1 security score (bridge line analysis) |
+ | `inference.py` | Prompt with two-threshold framing (EMERGENCY/WARNING/SAFE) |

  ---

architecture/task_4_architecture.md CHANGED
@@ -611,18 +611,19 @@ MSCF RULE: Prefer actions that preserve transferable generation and keep islands
  | File | Line Numbers | Purpose |
  |------|------------|---------|
  | `grid2op_env/server/tasks.py` | 53-62 | Task definition |
- | `grid2op_env/server/tasks.py` | 70-74 | Line triplets |
- | `grid2op_env/server/tasks.py` | 335-338 | Profile function |
- | `grid2op_env/server/tasks.py` | 566-620 | Reset function |
- | `grid2op_env/server/tasks.py` | 623-633 | Survival probe |
- | `grid2op_env/server/grid_environment.py` | 63-66 | Reward constants |
  | `grid2op_env/server/grid_environment.py` | 630-647 | Reward function |
- | `grid2op_env/server/grid_environment.py` | 750-765 | Stage metadata |
- | `grid2op_env/server/grid_environment.py` | 767-814 | Island assessment |
- | `grid2op_env/server/grid_environment.py` | 816-849 | Connected components |
- | `grid2op_env/server/graders.py` | 124-174 | Grading function |
- | `grid2op_env/inference.py` | 606-648 | LLM prompt |
- | `grid2op_env/models.py` | 54-59 | EpisodeStepLog fields |

  ---

  | File | Line Numbers | Purpose |
  |------|------------|---------|
  | `grid2op_env/server/tasks.py` | 53-62 | Task definition |
+ | `grid2op_env/server/tasks.py` | 70-74 | Line triplets: (2,4,14), (2,4,15), (4,14,16) |
+ | `grid2op_env/server/tasks.py` | 334-337 | Profile function: returns 1.20 (20% load) |
+ | `grid2op_env/server/tasks.py` | 565-620 | Reset function with survival probe |
+ | `grid2op_env/server/tasks.py` | 623-633 | Survival probe function |
+ | `grid2op_env/server/grid_environment.py` | 63-66 | Reward constants (0.02, 5.0, 0.5, 8.0) |
  | `grid2op_env/server/grid_environment.py` | 630-647 | Reward function |
+ | `grid2op_env/server/grid_environment.py` | 750-765 | Stage metadata computation |
+ | `grid2op_env/server/grid_environment.py` | 767-814 | Island availability assessment |
+ | `grid2op_env/server/grid_environment.py` | 816-849 | Connected components detection |
+ | `grid2op_env/server/graders.py` | 124-174 | Four-component grading |
+ | `grid2op_env/inference.py` | 606-648 | LLM prompt with stage context |
+ | `grid2op_env/inference.py` | 1003-1030 | Candidate filtering (removes unsafe disconnects) |
+ | `grid2op_env/models.py` | 54-59 | EpisodeStepLog fields for Task 4 |

  ---

grid2op_env/README.md CHANGED
@@ -53,7 +53,7 @@ Supporting files outside the minimum template remain for quality and verificatio
  - Grid2Op core simulator using `l2rpn_case14_sandbox`
  - Typed `GridAction`, `GridObservation`, and `GridState`
- - Three tasks: `single_fault`, `n_minus_1`, `cascade_prevent`, `multi_stage_cascade`
  - Reset-time scenario injection and retry logic for non-convergent starts
  - Shaped reward, episode logging, and deterministic graders
  - OpenEnv WebSocket interface plus `/tasks`, `/grader`, and `/baseline`
@@ -65,32 +65,72 @@ Supporting files outside the minimum template remain for quality and verificatio
  ## Recent fixes

- 1. **Benchmark ranges corrected** (tasks.py lines 248-255):
     - `single_fault_easy`: 0.82-0.85 (was mathematically impossible 0.90-0.94)
     - `single_fault_moderate`: 0.86-0.89 (was 0.94-0.97)
     - `single_fault_severe`: 0.90-0.93 (was 0.96-0.99)
-
- 2. **Redispatch penalty added** (grid_environment.py line 58):
-    - `SINGLE_FAULT_REDISPATCH_PENALTY_PER_MW = 0.01` per MW to discourage large interventions
-
- 3. **Survival-focused grading** (graders.py):
-    - 70% weight on survival ratio + bonuses
-
- 4. **Task 2 (n_minus_1) redesign** based on RL2Grid paper:
-    - Three-component reward: 0.3×R_survive + 0.6×R_overload + 0.1×R_cost
-    - Reconnection bonus: +2.0 when safely reconnecting faulted line
     - Terminal: +10×(s/m)² quadratic survival, -15 blackout
-    - Phase-aware grader: 30% emergency + 50% security + 20% reconnection
     - N-1 security score (bridge lines) in prompt
     - **Grading now honest**: score = survival_ratio × mastery_score (no override)
     - Latest eval: 0.952 (was 1.0 with old override)

- 5. **Task 4 (multi_stage_cascade) added**:
-    - 3 lines disconnected at reset + 15% load increase
     - Three explicit stages (10 steps each) with stage boundaries at step 10 and 20
-    - Island availability assessment at stage boundaries
-    - Candidate filtering prevents grid collapse actions
-    - Four-component grading: stage completion (30%) + load preservation (40%) + island quality (20%) + speed bonus (10%)
     - Latest eval: 0.929 (31x improvement from 0.027)

  ## Planner architecture

  - Grid2Op core simulator using `l2rpn_case14_sandbox`
  - Typed `GridAction`, `GridObservation`, and `GridState`
+ - Four tasks: `single_fault`, `n_minus_1`, `cascade_prevent`, `multi_stage_cascade`
  - Reset-time scenario injection and retry logic for non-convergent starts
  - Shaped reward, episode logging, and deterministic graders
  - OpenEnv WebSocket interface plus `/tasks`, `/grader`, and `/baseline`

  ## Recent fixes

+ 1. **Task 1 (single_fault) benchmark ranges corrected** (tasks.py:297-304):
     - `single_fault_easy`: 0.82-0.85 (was mathematically impossible 0.90-0.94)
     - `single_fault_moderate`: 0.86-0.89 (was 0.94-0.97)
     - `single_fault_severe`: 0.90-0.93 (was 0.96-0.99)
+    - Warmup phase finds a high-loading state in the chronics; the agent then has 10 steps to solve it
+
+ 2. **Task 1 reward function** (grid_environment.py:589-596):
+    - Target achieved bonus: `1.0 / step_count` (rewards early solution)
+    - Safe margin bonus: `0.05 × max(0.0, 1.0 - max_rho)`
+    - Overload penalty: `0.2 × overloaded_count` (lines > 100%)
+    - Redispatch penalty: `0.01 × MW` (discourages large interventions)
+    - Failure penalty: `-5.0` if time limit reached without target
+
+ 3. **Task 1 grading** (graders.py:28-55):
+    - 70% weight on survival ratio
+    - 50% target achieved bonus
+    - Final state bonus: 0.3 if below target, 0.15 if within +0.05, 0.05 if within +0.10
+    - Legacy success score for early completion: `1.0 - 0.08 × (step - 1)`
+
+ 4. **Task 2 (n_minus_1) redesign** based on RL2Grid paper (grid_environment.py:598-609):
+    - Three-component reward: `0.3×R_survive + 0.6×R_overload + 0.1×R_cost`
+    - `R_survive`: +1.0 per step (constant survival signal)
+    - `R_overload`: `(1/n) × Σ clip(1-ρ, -1, 1)` - loading margin quality
+    - `R_cost`: `-0.05 × Σ|ΔMW|/max_ramp` (normalized redispatch cost)
+    - Reconnection bonus: +2.0 when safely reconnecting (grid_environment.py:853-869)
     - Terminal: +10×(s/m)² quadratic survival, -15 blackout
+    - Phase-aware grader (graders.py:58-83):
+      - Emergency response (30%): cleared within 5 steps at rho < 0.92
+      - Sustained security (50%): steps 6-20 at rho < 0.90
+      - Reconnection (20%): did agent reconnect line 0?
     - N-1 security score (bridge lines) in prompt
     - **Grading now honest**: score = survival_ratio × mastery_score (no override)
     - Latest eval: 0.952 (was 1.0 with old override)

+ 5. **Task 3 (cascade_prevent)** (grid_environment.py:611-628):
+    - 1-2 lines disconnected at reset + 5-15% load increase
+    - Key metric: `timestep_overflow` countdowns (not just max_rho)
+    - Quadratic overflow penalty: `-0.05 × Σ(overflow²)` - a line at overflow=2 is 4x more urgent than one at overflow=1
+    - Reward components:
+      - Cascade prevention: +0.3 if no auto-trip, -2.5 if auto-trip
+      - Thermal margin: +0.1 × mean(clip(1-ρ, -1, 1))
+      - Terminal: +5.0 × (1 - auto_trips/5)² survival bonus, -12.0 blackout
+    - Grading (graders.py:86-121):
+      - Cascade containment (50%): steps without auto-trips / 30
+      - Thermal stability (30%): safe_steps / containment_steps
+      - Recovery speed (20%): how fast the grid recovered from the first overload
+    - Latest eval: 0.798 (hard/extreme tiers remain challenging)
+
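The quadratic overflow penalty in item 5 can be checked with a one-liner (the helper name and example countdown values are hypothetical):

```python
# Hypothetical helper for the documented -0.05 × Σ(overflow²) penalty.
def overflow_penalty(timestep_overflow):
    # A line at overflow=2 contributes 4x the penalty of one at overflow=1
    return -0.05 * sum(o ** 2 for o in timestep_overflow)

penalty = overflow_penalty([2, 1, 0, 0])   # -0.05 × (4 + 1) = -0.25
```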
+ 6. **Task 4 (multi_stage_cascade)** (tasks.py:334-337, grid_environment.py:630-647):
+    - 3 lines disconnected at reset + **20% load increase** (not 15%)
     - Three explicit stages (10 steps each) with stage boundaries at step 10 and 20
+    - Overflow window: 2 (faster cascades than the default 3)
+    - Do-nothing survival probe: 5 steps minimum
+    - Island availability assessment at stage boundaries (grid_environment.py:767-814)
+    - Candidate filtering (inference.py:1003-1030): filters unsafe topology disconnects
+    - Reward (grid_environment.py:630-647):
+      - Generation cost: -0.02 × (total_gen / initial_load)
+      - Convergence: +0.5 × available_island_ratio
+      - Load loss penalty: -5.0 × (1 - available_load_ratio), applied at boundaries only
+      - Terminal win: +8.0 × (available_load_ratio)² if ≥50% load remains at step 30
+      - Terminal blackout: -12.0
+    - Grading (graders.py:124-174):
+      - Stage completion (30%): survived stages 1, 2, 3
+      - Load preservation (40%): available_load_ratio at end
+      - Island quality (20%): majority of islands viable at boundaries
+      - Speed bonus (10%): how fast stability returned each stage
     - Latest eval: 0.929 (31x improvement from 0.027)
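A minimal sketch of the Task 4 terminal-reward arithmetic from item 6, assuming a hypothetical helper name and example load ratio:

```python
# Hypothetical helper for the documented Task 4 terminal rewards.
def multi_stage_terminal_reward(available_load_ratio, blackout=False):
    if blackout:
        return -12.0                               # terminal blackout
    if available_load_ratio >= 0.5:
        # Quadratic bonus favors keeping as much load energized as possible
        return 8.0 * available_load_ratio ** 2     # terminal win
    return 0.0

reward = multi_stage_terminal_reward(0.9)   # 8.0 × 0.81 = 6.48
```

The quadratic shape means preserving 90% of load earns roughly 2.6x the bonus of preserving the bare 50% minimum (6.48 vs 2.0).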

  ## Planner architecture