dc_ops_env / DEMO_PLANS.md
Melikshah's picture
Upload folder using huggingface_hub
5c44f63 verified

DC-Ops: Scenario Demo Plans

All demos verified with 256/256 tests passing. Every plan resolves successfully in 8–10 steps.


Overview

# Scenario Facility Steps Reward Key Skill Tested
1 A1 — Setpoint Optimization Default 9 +0.324 Energy efficiency tuning (PUE optimization)
2 A2 — Thermal Event Default 8 +0.873 Single-failure incident response
3 A2 — Thermal Event Large 8 +0.831 Multi-zone awareness, H1 isolation
4 A4 — CRAC Cascade Default 8 +1.230 Multi-failure triage + load shedding
5 A4 — CRAC Cascade Large 8 +1.150 Multi-zone cascade with H1 envelope
6 B1 — UPS Alarm Default 8 +0.512 Power chain audit, alarm management
7 B3 — Generator Test Default 10 +0.567 5-step protocol compliance
8 B4 — Power Failure Default 8 +0.934 Battery management, gen startup
9 B4 — Power Failure Small 8 +0.948 Aggressive load shedding at scale

All demos resolve in 8–10 steps and demonstrate proper operational procedure: assess → diagnose → compensate → verify → resolve.


Demo 1: A1 — Cooling Setpoint Optimization

Facility: Default (2 zones, 160 kW, 4 CRACs)
Difficulty: Easy | Budget: 10 steps | Result: Resolved in 9 steps, cumulative reward +0.324

Context

All four CRACs are set to 15°C — far below what the servers need. This wastes energy: the compressors run hard, fans blow at 100%, and PUE sits at 1.87. ASHRAE A2 class allows inlet temps up to 27°C. The agent must raise setpoints and reduce fan speeds to approach the PUE target of 1.6 without overheating.

Step-by-Step

Step Command Reward Cumul PUE Inlets (A/B) Reasoning
1 check_status +0.131 +0.131 1.87 17.1 / 17.1 Procedure rule: must check status before adjusting setpoints (+0.2 bonus). Establishes baseline — all CRACs at 15°C, PUE 1.87, inlet temps far below recommended range.
2 adjust_setpoint CRAC-1 24 +0.047 +0.178 1.80 17.7 / 17.1 Raise from 15→24°C. CRAC-1 compressor works less → immediate PUE drop. Supply temp lags ~30s behind the setpoint due to first-order thermal lag.
3 adjust_setpoint CRAC-2 24 +0.039 +0.217 1.76 19.2 / 17.0 Same for CRAC-2. PUE continues falling. Zone A inlets rising as CRACs supply warmer air — still safely below 27°C limit.
4 adjust_setpoint CRAC-3 24 +0.038 +0.255 1.71 20.7 / 17.6 Zone B beginning to respond. The thermal mass (11.1 kJ/K per server) causes gradual warming — the RC network models this physical inertia.
5 adjust_setpoint CRAC-4 24 +0.025 +0.280 1.69 21.9 / 19.1 All CRACs at 24°C. PUE at 1.69 — still above 1.6 target. Diminishing returns from setpoint-only optimization; fan power reduction is the next lever.
6 set_fan_speed CRAC-1 70 −0.019 +0.261 1.68 22.8 / 20.6 Reduce fan from 100→70%. Fan power follows cubic affinity law: P = P_rated × (speed%)³. At 70%, fan power = 34% of rated — a 66% power reduction per CRAC fan.
7 set_fan_speed CRAC-2 70 −0.017 +0.244 1.66 23.4 / 21.9 PUE 1.66 — getting close. Inlet temps at 23.4°C — still 3.6°C below ASHRAE A2 recommended max (27°C).
8 set_fan_speed CRAC-3 70 −0.012 +0.232 1.63 23.9 / 22.7 Almost there. The diminishing negative rewards show the system approaching equilibrium.
9 set_fan_speed CRAC-4 70 +0.092 +0.324 1.60 24.3 / 23.3 ✓ RESOLVED. PUE hits target (≤1.6). All inlets within ASHRAE recommended range (18–27°C). Speed bonus: (10−9)/10 = +0.1.

Why This Works

The agent follows a two-phase optimization strategy:

  1. Phase 1 (steps 2–5): Setpoint adjustment. Raising setpoints from 15→24°C reduces compressor load. PUE drops 1.87→1.69 (10% improvement). But setpoint changes alone can't reach 1.6 because fan power is a significant fraction of total cooling power.

  2. Phase 2 (steps 6–9): Fan speed reduction. The cubic fan law means reducing speed from 100→70% cuts fan power by 66%. This pushes PUE from 1.69→1.60, crossing the target.


Demo 2: A2 — Thermal Event Response (Default Facility)

Facility: Default (2 zones, 160 kW, 4 CRACs)
Difficulty: Medium | Budget: 15 steps | Result: Resolved in 8 steps, cumulative reward +0.873

Context

CRAC-3 compressor has failed. With only 3 of 4 CRACs operational, cooling capacity is reduced. The default facility has N+1 redundancy, so temperatures won't spike catastrophically, but the agent must diagnose the fault and compensate to ensure long-term stability.

Step-by-Step

Step Command Reward Cumul Inlets (A/B) Reasoning
1 check_status +0.204 +0.204 19.8 / 20.0 Assess the situation. Dashboard shows CRAC-3 as "!! COMPRESSOR" with no supply temp, no airflow, 0 kW. Zone B slightly warmer — the lost CRAC served zone B.
2 diagnose CRAC-3 +0.054 +0.258 19.8 / 20.2 Unlocks the resolution gate. Diagnostic report confirms "FAULT DETECTED: compressor." Procedure rule rewards diagnosing before adjusting (+0.3 bonus on subsequent setpoint changes).
3 diagnose CRAC-1 −0.021 +0.237 19.8 / 20.4 Verify remaining CRACs are healthy. Report: "No faults detected. Unit operating normally."
4 adjust_setpoint CRAC-1 16 +0.034 +0.270 19.6 / 20.5 Lower setpoint 18→16°C on a surviving CRAC. Increases cooling output. Earns procedure bonus (diagnosis was done first).
5 adjust_setpoint CRAC-2 16 +0.034 +0.304 19.3 / 20.6 Both zone A CRACs now overcooling to compensate. Zone A cools to 19.3°C while zone B stabilizes around 20.6°C.
6 set_fan_speed CRAC-1 100 +0.034 +0.338 19.0 / 20.8 Ensure max airflow on surviving CRACs.
7 set_fan_speed CRAC-2 100 +0.034 +0.372 18.7 / 20.8 Zone B stabilizing at ~20.8°C — well within ASHRAE A2 recommended range (27°C max).
8 set_fan_speed CRAC-4 100 +0.501 +0.873 18.5 / 20.9 ✓ RESOLVED. All zones within recommended range for 2+ consecutive stable steps. Speed bonus: (15−8)/15 = +0.467.

Demo 3: A2 — Thermal Event Response (Large Facility)

Facility: Large (4 zones, 600 kW, 8 CRACs including H1 zone)
Difficulty: Medium | Budget: 15 steps | Result: Resolved in 8 steps, cumulative reward +0.831

Context

Same CRAC-3 failure, but in a larger facility with 4 zones including an H1 (high-density GPU) zone. The H1 zone has a tighter thermal envelope (recommended max 22°C vs 27°C for A2).

Step-by-Step

Step Command Reward Cumul Inlets (A/B/C/D) Reasoning
1 check_status +0.207 +0.207 20.0 / 20.8 / 19.2 / 19.7 4-zone dashboard. Zone B already 0.8°C warmer. Zone C (H1) at 19.2°C — only 2.8°C below its 22°C recommended max.
2 diagnose CRAC-3 +0.057 +0.263 20.0 / 21.6 / 19.2 / 19.7 FAULT: compressor. Zone B rising +0.8°C/step. Zone C stable — dedicated CRACs unaffected.
3 diagnose CRAC-1 −0.019 +0.245 20.0 / 22.3 / 19.2 / 19.6 Confirm CRAC-1 healthy.
4 diagnose CRAC-2 −0.019 +0.225 20.0 / 23.0 / 19.2 / 19.6 Confirm CRAC-2 healthy. Thorough diagnosis of all CRACs in the affected zone pair.
5 adjust_setpoint CRAC-2 16 +0.035 +0.261 19.9 / 23.6 / 19.2 / 19.6 Lower setpoint on surviving zone B CRAC. Procedure bonus earned (diagnosed first).
6 adjust_setpoint CRAC-4 16 +0.035 +0.296 19.7 / 24.0 / 19.2 / 19.6 Zone B rate of rise slowing — compensation taking effect.
7 set_fan_speed CRAC-2 100 +0.035 +0.330 19.6 / 24.3 / 19.2 / 19.6 Max airflow. Zone B at 24.3°C — 2.7°C below ASHRAE A2 recommended max.
8 set_fan_speed CRAC-4 100 +0.501 +0.831 19.5 / 24.6 / 19.2 / 19.6 ✓ RESOLVED. Zone B stabilized at 24.6°C. H1 zone completely unaffected (19.2°C). Speed bonus: +0.467.

Default vs Large Comparison

Metric Default Large
Max inlet temp 20.9°C 24.6°C
H1 zone impact N/A None (19.2°C)
Cumulative reward +0.873 +0.831

Demo 4: A4 — CRAC Failure Cascade (Default Facility)

Facility: Default (2 zones, 160 kW, 4 CRACs)
Difficulty: Hard | Budget: 20 steps | Result: Resolved in 8 steps, cumulative reward +1.230

Context

Two CRACs fail simultaneously: CRAC-1 (compressor) and CRAC-3 (fan). Only CRAC-2 and CRAC-4 remain operational — 50% cooling capacity lost. The agent must diagnose both failures, aggressively compensate, and consider load shedding.

Step-by-Step

Step Command Reward Cumul Inlets (A/B) Reasoning
1 check_status +0.212 +0.212 19.9 / 19.9 Dashboard shows two red-flagged CRACs. 50% cooling capacity lost.
2 diagnose CRAC-1 +0.062 +0.274 20.0 / 20.0 "FAULT DETECTED: compressor." First of two required diagnoses.
3 diagnose CRAC-3 +0.027 +0.301 20.1 / 20.1 "FAULT DETECTED: fan." Both failures identified → resolution gate unlocked. Extra +0.2 procedure bonus for diagnosing both units.
4 adjust_setpoint CRAC-2 16 +0.062 +0.363 20.2 / 20.2 Lower surviving CRAC setpoint. Procedure bonus (+0.2) earned because diagnosis preceded this.
5 adjust_setpoint CRAC-4 16 +0.062 +0.425 20.2 / 20.3 Both survivors at aggressive 16°C setpoints.
6 set_fan_speed CRAC-2 100 +0.062 +0.487 20.1 / 20.2 Confirm max airflow. Temps stabilizing.
7 set_fan_speed CRAC-4 100 +0.062 +0.549 20.1 / 20.2 Temps flat at ~20.1°C. Stable for consecutive steps.
8 set_rack_load B-05 4 +0.681 +1.230 20.0 / 20.1 ✓ RESOLVED. Load shed on hottest rack as precaution. Speed bonus: (20−8)/20 = +0.600.

Why Load Shedding Matters

Reducing rack B-05 from 8 kW to 4 kW:

  • Provides additional thermal margin (less heat to extract)
  • Demonstrates workload migration capability
  • Earns action quality bonus (+0.2 for interventions)

Demo 5: A4 — CRAC Failure Cascade (Large Facility)

Facility: Large (4 zones, 600 kW, 8 CRACs)
Difficulty: Hard | Budget: 20 steps | Result: Resolved in 8 steps, cumulative reward +1.150

Step-by-Step

Step Command Reward Cumul Inlets (A/B/C/D) Reasoning
1 check_status +0.210 +0.210 20.4 / 20.4 / 19.2 / 19.7 CRAC-1 and CRAC-3 down out of 8. Zones C/D (including H1) have their own CRACs.
2 diagnose CRAC-1 +0.059 +0.269 20.8 / 20.8 / 19.2 / 19.7 FAULT: compressor. Zone A/B rising ~0.4°C/step.
3 diagnose CRAC-3 +0.024 +0.293 21.2 / 21.2 / 19.2 / 19.7 FAULT: fan. Both diagnosed — resolution gate unlocked.
4 diagnose CRAC-2 +0.024 +0.317 21.6 / 21.6 / 19.2 / 19.7 Verify CRAC-2 healthy. With 2 failures, confirming survivor health is critical.
5 adjust_setpoint CRAC-2 16 +0.059 +0.376 21.9 / 22.0 / 19.2 / 19.6 Compensate. Zone B at 22.0°C — 13°C below ASHRAE A2 allowable max (35°C).
6 adjust_setpoint CRAC-4 16 +0.058 +0.434 22.2 / 22.3 / 19.2 / 19.6 Rate of rise slowing. H1 zone completely unaffected.
7 set_fan_speed CRAC-2 100 +0.058 +0.492 22.5 / 22.5 / 19.2 / 19.6 Max airflow on survivor. Zone B stabilizing.
8 set_fan_speed CRAC-4 100 +0.658 +1.150 22.7 / 22.8 / 19.2 / 19.6 ✓ RESOLVED. All zones within allowable. Speed bonus: +0.600.

Default vs Large Comparison

Metric Default Large
Final zone B inlet 20.1°C 22.8°C
H1 zone impact N/A None (19.2°C)
Cumulative reward +1.230 +1.150

Demo 6: B1 — UPS Alarm Response

Facility: Default (2 zones, 160 kW)
Difficulty: Medium | Budget: 10 steps | Result: Resolved in 8 steps, cumulative reward +0.512

Context

A brief utility dip caused UPS-1 to transfer to battery. Utility has been restored and UPS switched back to double-conversion mode, but the alarm persists. The agent must investigate the entire power chain and acknowledge the alarm.

Step-by-Step

Step Command Reward Cumul Reasoning
1 check_status −0.007 −0.007 Baseline assessment. Dashboard shows utility NORMAL, generator OFF, ATS on UTILITY.
2 diagnose UPS-1 +0.143 +0.137 Key step. Report: mode=double_conversion, SOC=86%. Resolution gate requires this diagnosis.
3 diagnose UPS-2 −0.007 +0.130 Verify redundant UPS. mode=double_conversion, SOC=87%.
4 diagnose GEN-1 −0.007 +0.123 Generator in standby and ready — confirming readiness is proper procedure.
5 diagnose PDU-A-01 −0.007 +0.117 Verify power distribution is intact.
6 check_status −0.007 +0.110 Re-verify overall state before closing incident.
7 diagnose PDU-B-01 −0.007 +0.103 Complete the power chain audit.
8 acknowledge_alarm +0.408 +0.512 ✓ RESOLVED. Alarm acknowledged after thorough investigation. Speed bonus: (10−8)/10 = +0.200.

Reward Structure Note

Steps 3–7 return −0.007 each because the delta-based progress metric doesn't change during investigation — only the final acknowledgment triggers progress completion. The cumulative reward is still positive (+0.512) thanks to the large resolution step reward.


Demo 7: B3 — Generator Test Protocol

Facility: Default (2 zones, 160 kW)
Difficulty: Easy | Budget: 15 steps | Result: Resolved in 10 steps, cumulative reward +0.567

Context

Routine monthly generator test. No fault, no emergency — the agent must follow the 5-step protocol: check → start → verify → stop → acknowledge.

Step-by-Step

Step Command Reward Cumul Generator State Reasoning
1 check_status −0.007 −0.007 OFF Baseline. All systems normal.
2 diagnose GEN-1 −0.007 −0.013 off Pre-test inspection — verify before starting.
3 start_generator +0.113 +0.100 START_DELAY→CRANKING Generator start sequence initiated.
4 wait −0.022 +0.078 READY/LOADED Let the generator complete warmup.
5 diagnose GEN-1 +0.043 +0.122 ready Critical verification. Confirms generator running properly.
6 check_status −0.007 +0.115 LOADED Full dashboard confirms generator loaded.
7 stop_generator +0.113 +0.228 COOLDOWN Initiate cooldown (300s for turbocharger).
8 wait −0.022 +0.207 COOLDOWN Allow cooldown to proceed.
9 diagnose GEN-1 −0.032 +0.175 cooldown Post-shutdown inspection.
10 acknowledge_alarm +0.392 +0.567 COOLDOWN ✓ RESOLVED. Protocol complete: started ✓, verified ✓, stopped ✓, acknowledged ✓. Speed bonus: (15−10)/15 = +0.333.

Protocol Enforcement

B3 tracks four internal flags that must be set in order:

  1. _startedstart_generator issued
  2. _verifieddiagnose GEN-1 while generator is running
  3. _stoppedstop_generator (only if started + verified)
  4. _completedacknowledge_alarm (only if stopped)

The agent cannot skip steps — issuing stop_generator before diagnose GEN-1 won't set _stopped.


Demo 8: B4 — Power Failure Cascade (Default Facility)

Facility: Default (2 zones, 160 kW)
Difficulty: Hard | Budget: 20 steps | Result: Resolved in 8 steps, cumulative reward +0.934

Context

Total utility power loss. UPS batteries bridging while generator starts. Generator warmup extended (15s vs default 8s). Agent must manage battery life, consider load shedding, and verify generator operation.

Step-by-Step

Step Command Reward Cumul Key Metrics Reasoning
1 check_status +0.108 +0.108 UPS on battery, SOC ~97% Utility LOST. ATS initiating transfer. Generator auto-starting.
2 diagnose UPS-1 +0.131 +0.239 mode=on_battery, SOC=95% Resolution gate unlocked. Battery draining ~2%/step at 160 kW load.
3 diagnose UPS-2 +0.078 +0.317 mode=on_battery, SOC=90% Redundant UPS also on battery.
4 start_generator −0.007 +0.310 Gen: CRANKING→READY Generator already auto-starting — hence slight negative reward for redundant command.
5 set_rack_load A-05 4 +0.062 +0.371 IT load: 156 kW Shed 4 kW to extend battery life.
6 set_rack_load B-05 4 +0.054 +0.425 IT load: 152 kW Total shed: 8 kW (5% of IT load). Conservative but effective.
7 wait −0.052 +0.373 Gen: LOADED, ATS: GEN Generator online. UPS batteries begin recharging.
8 diagnose GEN-1 +0.561 +0.934 state=loaded ✓ RESOLVED. All conditions met: gen loaded, temps OK, SOC >10%. Speed bonus: (20−8)/20 = +0.600.

Battery SOC Timeline

Step Sim Time SOC (UPS-1) Event
0 0 min 97% Utility lost
2 2 min 95% Diagnosed
4 4 min ~90% Gen starting
6 6 min ~87% Load shed
7 7 min ~88% Gen loaded, recharging begins
8 8 min ~89% Resolved

Demo 9: B4 — Power Failure Cascade (Small Facility)

Facility: Small (1 zone, 80 kW, 2 CRACs)
Difficulty: Hard | Budget: 20 steps | Result: Resolved in 8 steps, cumulative reward +0.948

Step-by-Step

Step Command Reward Cumul Key Metrics Reasoning
1 check_status +0.111 +0.111 1 zone, UPS on battery Smaller facility: 80 kW IT load, 1 zone, 2 CRACs. Less redundancy.
2 diagnose UPS-1 +0.133 +0.244 SOC=91% Battery draining faster relative to capacity. Gate unlocked.
3 start_generator +0.073 +0.317 Gen starting Explicit start command.
4 set_rack_load A-05 4 +0.066 +0.383 IT: 76 kW 5% load reduction.
5 set_rack_load A-04 4 +0.056 +0.438 IT: 72 kW 10% load reduction.
6 set_rack_load A-03 4 +0.042 +0.481 IT: 68 kW 15% total load shed — more aggressive for smaller facility.
7 wait −0.069 +0.412 Gen LOADED Generator online. Battery recharging.
8 diagnose GEN-1 +0.537 +0.948 state=loaded ✓ RESOLVED. Speed bonus: +0.600.

Small vs Default Comparison

Metric Default (160 kW) Small (80 kW)
Racks shed 2 (8 kW, 5%) 3 (12 kW, 15%)
Cumulative reward +0.934 +0.948

The small facility earns slightly higher reward due to more aggressive proportional load shedding, producing a stronger positive signal from the power safety component.


Resolution Gate Design

Each affected scenario requires the agent to actually do something before resolution:

Scenario Diagnosis Gate Min Steps
A2 Must diagnose CRAC-3 ≥ 8 steps
A4 Must diagnose CRAC-1 AND diagnose CRAC-3 ≥ 8 steps
B4 Must diagnose UPS-* ≥ 8 steps

Without these gates, scenarios A2, A4, and B4 would auto-resolve within 2–3 steps of passive wait commands due to:

  • N+1 cooling redundancy handling CRAC failures naturally
  • Generator auto-start/load completing within one 60-second agent step
  • resolved only checking _stable_count >= 2 with no gate on agent behavior

Reward ordering validates the design: fast diagnosis > late diagnosis > no diagnosis (never resolves).