ShreeshantXD committed on
Commit 51aa9cd · 1 Parent(s): 70d6600

updated Blog and renamed curves

HF_BLOG_POST.md CHANGED
@@ -1,94 +1,237 @@
1
  ---
2
- title: GridMind-RL: Training LLMs to Manage Industrial Buildings Under Faults and Grid Stress
3
- description: An OpenEnv-compatible RL environment where LLMs learn to control HVAC, thermal storage, and batch scheduling across multi-building industrial facilities.
4
  ---
5
 
6
- **Every industrial building wastes 20–30% of its energy because control systems can't handle real-time pricing, equipment faults, and grid stress simultaneously.** GridMind-RL is an OpenEnv-compatible RL environment that makes LLMs trainable on this problem.
7
 
8
- ## The Problem
9
 
10
- Industrial buildings consume ~40% of global electricity. Most still use naive "always-on" HVAC policies. The capability gap is clear:
11
 
12
- - LLMs can understand complex pricing curves, fault alerts, and natural language instructions
13
- - But no environment exists to train them on real building energy management
14
- - Existing RL environments are mostly grid-worlds or toy games — not genuine industrial problems
15
 
16
- GridMind-RL closes this gap by simulating a complete building energy system where agents must:
17
 
18
- - Navigate 24-hour price volatility (off-peak vs peak: 4¢ to 32¢/kWh)
19
- - Maintain comfort (19–23°C) while minimizing cost
20
- - Respond to grid stress emergencies
21
- - Handle equipment faults (chiller failure, sensor malfunction, grid outages, tariff spikes)
22
- - Parse and follow natural language objective cards
23
 
24
  ## The Environment
25
 
26
- GridMind-RL is a 96-step episode (24 simulated hours at 15-minute resolution) with:
27
 
28
- | Field | Value |
29
- |-------|-------|
30
- | **Observation** | 13 fields: temperature, storage, price, stress, carbon, faults, HVAC efficiency, instruction card |
31
- | **Actions** | HVAC level (0–1), thermal charge (−1 to 1), batch slot (0–4), load shed (0–0.5) |
32
- | **Reward** | 9-component weighted sum: cost, temperature, grid, deadline, efficiency, stability, carbon, instruction, fault_mitigation |
33
- | **Tasks** | 4 types: cost minimization, temperature management, demand response, instruction following |
34
 
35
- ### Four Hackathon Themes in One Environment
36
 
37
- **Track 1 — Multi-Agent Interactions:** A coordinator LLM reads `/feeder` to see fleet-wide demand across 3 buildings, then sets per-building price multipliers via `/coordinate` to orchestrate behavior.
 
38
 
39
- **Track 2 — Long-Horizon Planning & Instruction Following:** Task 4 presents a natural language objective card like "Keep total energy cost under $2.50 while maintaining 19–23°C." Agents must plan across all 96 steps.
 
40
 
41
- **Track 3 — World Modeling:** The `/simulate` endpoint lets agents ask "what if?" before acting. When HVAC efficiency is low or faults are active, the agent simulates the proposed action and revises if the predicted reward is poor.
 
42
 
43
- **Track 4 - Fault Handling:** Four fault types inject unpredictability:
44
- - **Chiller failure**: HVAC drops to 20% capacity
45
- - **Grid outage**: Price ×3, stress = 1.0
46
- - **Sensor fault**: Temperature readings jitter ±5°C
47
- - **Tariff spike**: Emergency 4× price surge
48
 
49
- **Track 5 — Self-Improvement:** Curriculum learning auto-advances the agent from task 1 to task 4 when performance thresholds are met.
50
 
51
- ## Results
 
 
52
 
53
- Heuristic baseline scores (fixed policy, no learning) across all 4 tasks:
 
 
54
 
55
- | Policy | Task 1 | Task 2 | Task 3 | Task 4 |
56
- |--------|--------|--------|--------|--------|
57
- | **Heuristic Baseline** | 0.506 | 0.459 | 0.600 | 0.492 |
58
 
59
- The GRPO fine-tuned model shows improvement over the zero-shot LLM baseline. The training curve below shows the learning trajectory:
 
 
 
 
 
 
 
 
 
 
60
 
61
- ![Training Curve](https://raw.githubusercontent.com/LO-Kyu/gridmind/main/results/training_curve.png)
 
 
 
62
 
63
  ## Training
64
 
65
- GridMind-RL uses GRPO (Group Relative Policy Optimization) via HuggingFace TRL with Unsloth 4-bit LoRA fine-tuning of Qwen2.5-0.5B-Instruct. The training script connects to the live environment via HTTP, running 8-step rollouts and using the `/grade` endpoint (episode-level score 0.0–1.0) as the primary reward signal.
66
 
67
- ```bash
68
- # Training runs against the live environment
69
- python scripts/train_unsloth.py --steps 500 --output-csv results/training_log.csv
70
- ```
71
 
72
- Or run the Colab notebook: [gridmind_grpo_colab.ipynb](https://colab.research.google.com/)
73
 
74
- ## How to Try It
75
 
76
  ```bash
77
- # Quick health check
78
  curl https://prajwal782007-gridmind.hf.space/health
79
 
80
- # Run a heuristic baseline
81
- python inference.py --fast-mode --task 3 --episodes 5
 
 
82
 
83
- # Run the LLM agent
84
- python inference.py --task 3 --episodes 5
 
 
 
85
  ```
86
 
87
- Live environment: [https://prajwal782007-gridmind.hf.space](https://prajwal782007-gridmind.hf.space)
88
- Dashboard: [https://prajwal782007-gridmind.hf.space/dashboard](https://prajwal782007-gridmind.hf.space/dashboard)
89
-
90
- Code: [github.com/LO-Kyu/gridmind](https://github.com/LO-Kyu/gridmind)
91
 
92
  ---
93
 
94
- *GridMind-RL was built for the Meta PyTorch OpenEnv Hackathon Grand Finale, April 25–26, 2026, at Scaler School of Technology, Bangalore.*
 
 
1
  ---
2
+ title: GridMind-RL: Training LLMs to Manage Industrial Buildings with GRPO
3
+ description: How we built an RL environment that teaches language models real-world energy management and what 10 training runs taught us.
4
  ---
5
 
6
+ # GridMind-RL: Training LLMs to Manage Industrial Buildings
7
 
8
+ *OpenEnv Hackathon India 2026 · GridMind-RL Team*
9
 
10
+ ---
11
+
12
+ There is a building somewhere running its air conditioning at full power right now,
13
+ even though electricity costs four times more than it did six hours ago. Not because
14
+ the operator made a bad decision — but because the control system doesn't know the
15
+ price changed.
16
+
17
+ Industrial buildings consume roughly 40% of global electricity. Most are managed by
18
+ fixed schedules that made sense when they were written and haven't been touched since.
19
+ The cost gap between a naive policy and an intelligent one is measurable in thousands
20
+ of dollars per building per year.
21
 
22
+ LLMs can read pricing curves, respond to fault alerts, and follow natural language
23
+ instructions. The missing piece has always been an environment that trains them to
24
+ *act* on that reasoning under real operational pressure.
25
 
26
+ That's what we built.
27
 
28
+ ---
 
 
 
 
29
 
30
  ## The Environment
31
 
32
+ GridMind-RL simulates a complete 24-hour industrial building energy system at
33
+ 15-minute resolution: 96 decision steps per episode. The agent acts once per step,
34
+ responding to a world that changes around it: prices spike, equipment
35
+ degrades, grid stress signals arrive, and sometimes the chiller fails at 2pm on the
36
+ hottest day of the year.
37
+
38
+ **The agent sees 13 fields every step:**
39
+ current indoor temperature, thermal storage level, electricity price, grid stress
40
+ signal, HVAC efficiency (which degrades continuously over the episode), active fault
41
+ alarms, a 4-step price forecast, cumulative cost so far, carbon intensity, batch job
42
+ queue, hour of day, and — in Task 4 — a natural language instruction card describing
43
+ the episode's objective.
44
+
45
+ **The agent has four levers:**
46
 
47
+ | Action | Range | What it does |
48
+ |--------|-------|--------------|
49
+ | `hvac_power_level` | 0 to 1 | How hard the HVAC system works |
50
+ | `thermal_charge_rate` | -1 to 1 | Charge or discharge thermal storage |
51
+ | `batch_job_slot` | 0 to 4 | When to run deferrable industrial loads |
52
+ | `load_shed_fraction` | 0 to 0.5 | Voluntary demand reduction during grid stress |
53
 
54
+ **Four tasks test different capabilities:**
55
 
56
+ - **Cost Minimization:** Navigate 24-hour price volatility and thermal storage
57
+ arbitrage to minimize total energy spend.
58
 
59
+ - **Comfort Management:** Hold indoor temperature within 19–23°C through equipment
60
+ degradation, faults, and shifting external conditions.
61
 
62
+ - **Demand Response:** Read grid stress signals in real time and voluntarily shed
63
+ load to earn demand-response credit without sacrificing comfort.
64
 
65
+ - **Instruction Following:** Parse a natural language objective card at episode
66
+ start and adapt the entire 96-step strategy to meet it.
 
 
 
67
 
68
+ ### Why the reward has nine components
69
 
70
+ The naive approach is to reward cost savings and call it done. The problem is that
71
+ a cost-only reward teaches the agent to turn off the HVAC entirely — perfect score,
72
+ frozen building.
73
 
74
+ Real building operators don't optimize one metric. They manage a hierarchy:
75
+ comfort is non-negotiable, grid compliance is contractual, cost is the primary KPI,
76
+ carbon is increasingly regulated, and equipment stability protects the capital budget.
77
 
78
+ Our reward reflects that hierarchy directly:
 
 
79
 
80
+ | Component | Weight | Why |
81
+ |-----------|--------|-----|
82
+ | `cost_savings` | 0.28 | Primary operator KPI |
83
+ | `carbon_reward` | 0.20 | ESG compliance, increasingly mandatory |
84
+ | `temp_constraint` | 0.20 | Hard safety constraint — SLA violations incur penalties |
85
+ | `grid_response` | 0.20 | Demand response programs pay operators to shed load |
86
+ | `batch_deadline` | 0.12 | Missing deadlines causes downstream production losses |
87
+ | `efficiency_bonus` | 0.05 | Incentivises smart thermal storage arbitrage |
88
+ | `stability_penalty` | -0.05 | Prevents HVAC thrashing that causes equipment wear |
89
+ | `fault_mitigation` | dynamic | Correct fault response prevents costly outages |
90
+ | `task_satisfaction` | 0.50* | Task 4 only — weighted per the instruction card |
91
 
92
+ A reward this dense is harder to game. An agent that exploits one component while
93
+ neglecting the others will see it reflected immediately in the score.
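
To make the weighting concrete, here is a minimal sketch of how such a weighted sum could be computed. The weights are copied from the table above; everything else is an assumption for illustration: the function name `combine_reward` is hypothetical, each component is assumed to be a score in roughly the 0 to 1 range, `stability_penalty` is assumed to be a non-negative magnitude that the negative weight turns into a deduction, and the dynamic `fault_mitigation` term is simply passed in. The Task 4 `task_satisfaction` term is left out because the post does not say how it is blended.

```python
# Illustrative sketch only: the weights come from the table in the post,
# but the way the real environment combines them is an assumption.
WEIGHTS = {
    "cost_savings": 0.28,
    "carbon_reward": 0.20,
    "temp_constraint": 0.20,
    "grid_response": 0.20,
    "batch_deadline": 0.12,
    "efficiency_bonus": 0.05,
    "stability_penalty": -0.05,  # negative weight: larger magnitude lowers reward
}


def combine_reward(components, fault_mitigation=0.0):
    """Weighted sum of the fixed-weight components plus a dynamic fault term.

    `components` maps component names to scores; missing names count as 0.
    `fault_mitigation` is added as-is because its weight is dynamic.
    """
    total = sum(WEIGHTS[name] * components.get(name, 0.0) for name in WEIGHTS)
    return total + fault_mitigation


# A cost-only agent earns at most the cost weight; a balanced agent earns more.
cost_only = combine_reward({"cost_savings": 1.0})
balanced = combine_reward({name: 1.0 for name in WEIGHTS if name != "stability_penalty"})
print(cost_only, balanced)
```

Under these assumptions, an agent that maximizes cost savings alone is capped at 0.28, which is the point of the multi-component design: most of the achievable reward sits in the other terms.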
94
+
95
+ ---
96
 
97
  ## Training
98
 
99
+ We trained Qwen2.5-1.5B-Instruct with QLoRA (4-bit, rank 16) using GRPO via
100
+ HuggingFace TRL. Each run is 60 steps on a T4 GPU, taking roughly 35 minutes.
101
+ We ran 10 training iterations in total.
102
+
103
+ **Why GRPO over PPO?**
104
+ GRPO doesn't require a separate value network. At 1.5B parameters on a T4, that
105
+ memory saving matters. Instead of estimating a value baseline, GRPO samples a group
106
+ of completions per prompt and computes advantages by comparing them against each
107
+ other — a natural fit for our setting where we generate multiple actions per state.
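
The group comparison described above can be sketched in a few lines. This is one common formulation of the group-relative advantage (reward minus group mean, divided by group standard deviation), not a copy of the TRL implementation; the function name, the use of the population standard deviation, and the `eps` value are assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Score each completion relative to the others sampled for the same prompt.

    No value network is involved: the group mean acts as the baseline.
    `eps` (an assumed value) avoids division by zero when all rewards match.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical completions for one observation, scored by the environment.
advantages = group_relative_advantages([0.2, 0.4, 0.6, 0.8])
print(advantages)
```

Completions that score above the group mean get a positive advantage and are reinforced; those below the mean are pushed down. If every completion in the group earns the same reward, all advantages are zero and that prompt contributes no gradient.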
108
+
109
+ | Component | Detail |
110
+ |-----------|--------|
111
+ | Model | Qwen2.5-1.5B-Instruct |
112
+ | Fine-tuning | QLoRA (4-bit, rank 16) |
113
+ | Algorithm | GRPO via HuggingFace TRL |
114
+ | Hardware | HF Space T4 GPU |
115
+ | Training time | ~35 minutes per run |
116
+ | Total runs | 10 |
117
 
118
+ ---
119
+
120
+ ## What the Curves Show
121
+
122
+ ### Run 1 vs Run 10: The reward is climbing
123
+
124
+ The clearest evidence of learning is what happens to the reward curve within a single
125
+ training run — and how that shape changes as the training setup matures.
126
+
127
+ **Run 1 — the first training run:**
128
+
129
+ ![Reward Curve — Run 1](curves/train%201/reward_curve.png)
130
+ *Run 1: Reward climbs from −0.47 to ~0.65 over 60 steps. The model is learning fast
131
+ in the early steps, then stabilizing — with a small dip at the very end.*
132
+
133
+ **Run 10 — after iterative refinement:**
134
+
135
+ ![Reward Curve — Run 10](curves/train%2010/reward_curve.png)
136
+ *Run 10: Same starting point, smoother curve, still rising at step 60. The model
137
+ hasn't plateaued, which suggests longer training could continue to improve it.*
138
+
139
+ Both runs start at the same reward (~−0.47) because each run initializes fresh.
140
+ What changes is the *shape*: Run 10 is more stable, ends higher (~0.68 vs ~0.65),
141
+ and shows no end-of-run dip. Ten runs of iteration on the training setup produced
142
+ a meaningfully cleaner learning signal.
143
+
144
+ The roughly 1.1-point reward gain within a single 60-step run (−0.47 to ~0.65) is not noise.
145
+ The agent is learning to manage energy in real time.
146
+
147
+ ### Before and After: Where the model wins
148
+
149
+ **Run 1 — heuristic baseline vs GRPO-trained:**
150
+
151
+ ![Baseline Comparison — Run 1](curves/train%201/baseline_comparison.png)
152
+ *Run 1: The trained model outperforms the heuristic on Task 4 by a significant margin.
153
+ On Tasks 1–3 it scores below the heuristic — early training, limited steps.*
154
 
155
+ **Run 10 (heuristic baseline vs GRPO-trained):**
156
 
157
+ ![Baseline Comparison Run 10](curves/train%2010/baseline_comparison.png)
158
+ *Run 10: Similar pattern. Task 4 remains the trained model's strongest result.
159
+ Tasks 1–3 gap to the heuristic has narrowed compared to Run 1.*
160
+
161
+ ### The Task 4 result is the headline
162
+
163
+ The heuristic scores **0.30** on Task 4. The trained model scores **0.70**.
164
+ That is a **133% improvement** on instruction following — and it makes complete sense.
165
+
166
+ A fixed heuristic cannot read a natural language objective card. It cannot parse
167
+ "keep total cost under $2.50 while maintaining comfort" and change its behavior
168
+ accordingly. The trained model can. That capability gap is exactly what this
169
+ environment was designed to measure.
170
+
171
+ Tasks 1–3 tell a more honest story. Time-of-day HVAC scheduling is genuinely
172
+ reasonable for cost and temperature — the heuristic is a strong baseline on those
173
+ tasks, and 60 training steps with a 1.5B model isn't enough to beat it consistently.
174
+ That's not a failure of the environment. It's a signal that longer training would
175
+ continue to pay off.
176
+
177
+ ---
178
+
179
+ ## What the Agent Learns
180
+
181
+ None of these behaviors are hardcoded. The reward signal surfaces them:
182
+
183
+ **Thermal arbitrage** — the agent learns to charge thermal storage during off-peak
184
+ hours (~4¢/kWh) and discharge during peak (~32¢/kWh), reducing the effective cost
185
+ of maintaining comfort during expensive periods.
186
+
187
+ **Grid cooperation** — when the stress signal exceeds 0.7, the agent voluntarily
188
+ sheds load rather than ignoring the signal. The demand-response credit offsets the
189
+ comfort penalty.
190
+
191
+ **Fault adaptation** — when HVAC efficiency degrades below a threshold, the agent
192
+ reduces its HVAC target rather than fighting a weakened system at full power.
193
+
194
+ **Instruction parsing** — in Task 4, the agent reads the objective card and adjusts
195
+ its entire 96-step strategy accordingly, not just the next action.
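
To see why the reward signal favors the thermal arbitrage behavior described above, a back-of-the-envelope calculation helps. The two prices are the approximate off-peak and peak figures quoted in the post; the round-trip storage efficiency of 0.8 and the function name are assumptions chosen only for illustration, not values taken from the environment.

```python
OFF_PEAK_PRICE = 0.04  # $/kWh, approximate off-peak price from the post
PEAK_PRICE = 0.32      # $/kWh, approximate peak price from the post


def arbitrage_saving(shifted_kwh, round_trip_efficiency=0.8):
    """Dollars saved by buying energy off-peak and using it at peak instead.

    `round_trip_efficiency` is an assumed storage loss factor: delivering
    `shifted_kwh` at peak requires buying shifted_kwh / efficiency off-peak.
    """
    bought_kwh = shifted_kwh / round_trip_efficiency
    return shifted_kwh * PEAK_PRICE - bought_kwh * OFF_PEAK_PRICE


# Shifting 10 kWh of cooling load from peak to off-peak under these assumptions.
print(round(arbitrage_saving(10.0), 2))
```

With an 8x price spread, storage stays profitable even at poor efficiency: under these assumptions the saving only reaches zero when the round-trip efficiency falls to 0.04 / 0.32 = 0.125.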
196
+
197
+ ---
198
+
199
+ ## What We'd Do With More Compute
200
+
201
+ - **300+ training steps** would likely close the gap on Tasks 1–3
202
+ - A **7B model** with the same setup would likely show sharper policy improvement
203
+ - **Multi-agent coordination** — 3 buildings sharing a 250kW feeder — is fully
204
+ implemented but not yet the primary training focus. Fleet-level demand response
205
+ is the next frontier.
206
+
207
+ ---
208
+
209
+ ## Try It
210
+
211
+ The environment is live. You can reset an episode, send actions, and read rewards
212
+ right now from your terminal:
213
 
214
  ```bash
215
+ # Health check
216
  curl https://prajwal782007-gridmind.hf.space/health
217
 
218
+ # Start an episode
219
+ curl -X POST https://prajwal782007-gridmind.hf.space/reset \
220
+ -H "Content-Type: application/json" \
221
+ -d '{"task_id": 4}'
222
 
223
+ # Take an action
224
+ curl -X POST https://prajwal782007-gridmind.hf.space/step \
225
+ -H "Content-Type: application/json" \
226
+ -d '{"hvac_power_level": 0.6, "thermal_charge_rate": 0.4,
227
+ "batch_job_slot": 2, "load_shed_fraction": 0.0, "building_id": 0}'
228
  ```
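
The same calls can be made from Python. The sketch below uses only the standard library and mirrors the `/reset` and `/step` requests and field names from the curl commands above. The helper names (`build_action`, `post_json`, `demo`) are hypothetical, clamping values client-side to the ranges in the action table is an assumption about what is sensible rather than documented server behavior, and the response format is not specified in the post, so `demo` just prints whatever JSON comes back.

```python
import json
from urllib import request

BASE_URL = "https://prajwal782007-gridmind.hf.space"


def clamp(value, low, high):
    """Restrict value to the closed interval [low, high]."""
    return max(low, min(high, value))


def build_action(hvac, charge, slot, shed, building_id=0):
    """Build a /step payload, clamped to the ranges in the action table."""
    return {
        "hvac_power_level": clamp(float(hvac), 0.0, 1.0),
        "thermal_charge_rate": clamp(float(charge), -1.0, 1.0),
        "batch_job_slot": clamp(int(slot), 0, 4),
        "load_shed_fraction": clamp(float(shed), 0.0, 0.5),
        "building_id": building_id,
    }


def post_json(path, payload):
    """POST a JSON payload to the environment and return the decoded reply."""
    req = request.Request(
        BASE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req, timeout=30) as resp:
        return json.loads(resp.read().decode("utf-8"))


def demo():
    """Reset a Task 4 episode and take one step (requires network access)."""
    print(post_json("/reset", {"task_id": 4}))
    print(post_json("/step", build_action(0.6, 0.4, 2, 0.0)))
```

Calling `demo()` performs the same reset and single step as the curl example; nothing is sent until it is called.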
229
 
230
+ - 🤗 **Environment**: https://prajwal782007-gridmind.hf.space
231
+ - 📓 **Training Notebook**: [gridmind_grpo_colab.ipynb](https://colab.research.google.com/github/LO-Kyu/gridmind/blob/main/scripts/gridmind_grpo_colab.ipynb)
232
+ - 🐙 **Code**: https://github.com/LO-Kyu/gridmind
 
233
 
234
  ---
235
 
236
+ *Built for the OpenEnv Hackathon India 2026 · April 25–26 · Scaler School of
237
+ Technology, Bangalore.*
curves/train 10/{23d31ed3-ee14-4cbd-8c91-1ea5341b3657.png → baseline_comparison.png} RENAMED
File without changes
curves/train 10/{c1816e21-ec99-4b41-a641-009d872dac69.png → loss_curve.png} RENAMED
File without changes
curves/train 10/{e1c2dacb-7784-4788-b4fb-59b67ab5fed1.png → reward_curve.png} RENAMED
File without changes