Prajwal782007 committed on
Commit
fb9da93
·
1 Parent(s): 51aa9cd

docs: update HF_BLOG_POST.md with OpenEnv interface details, reward engineering insights, and refined training results

  1. HF_BLOG_POST.md +153 -98
HF_BLOG_POST.md CHANGED
@@ -1,16 +1,16 @@
1
  ---
2
  title: GridMind-RL: Training LLMs to Manage Industrial Buildings with GRPO
3
- description: How we built an RL environment that teaches language models real-world energy management — and what 10 training runs taught us.
4
  ---
5
 
6
  # GridMind-RL: Training LLMs to Manage Industrial Buildings
7
 
8
- *OpenEnv Hackathon India 2026 · GridMind-RL Team*
9
 
10
  ---
11
 
12
  There is a building somewhere running its air conditioning at full power right now,
13
- even though electricity costs four times more than it did six hours ago. Not because
14
  the operator made a bad decision — but because the control system doesn't know the
15
  price changed.
16
 
@@ -20,27 +20,58 @@ The cost gap between a naive policy and an intelligent one is measurable in thou
20
  of dollars per building per year.
21
 
22
  LLMs can read pricing curves, respond to fault alerts, and follow natural language
23
- instructions. The missing piece has always been an environment that trains them to
24
- *act* on that reasoning under real operational pressure.
 
25
 
26
- That's what we built.
27
 
28
  ---
29
 
30
  ## The Environment
31
 
32
- GridMind-RL simulates a complete 24-hour industrial building energy system at
33
- 15-minute resolution: 96 decision steps per episode. The agent operates in
34
- continuous time, responding to a world that changes around it: prices spike, equipment
35
- degrades, grid stress signals arrive, and sometimes the chiller fails at 2pm on the
36
- hottest day of the year.
 
 
 
37
 
38
- **The agent sees 13 fields every step:**
39
- current indoor temperature, thermal storage level, electricity price, grid stress
40
- signal, HVAC efficiency (which degrades continuously over the episode), active fault
41
- alarms, a 4-step price forecast, cumulative cost so far, carbon intensity, batch job
42
- queue, hour of day, and (in Task 4) a natural language instruction card describing
43
- the episode's objective.
44
 
45
  **The agent has four levers:**
46
 
@@ -51,16 +82,17 @@ the episode's objective.
51
  | `batch_job_slot` | 0 → 4 | When to run deferrable industrial loads |
52
  | `load_shed_fraction` | 0 → 0.5 | Voluntary demand reduction during grid stress |
53
 
54
- **Four tasks test different capabilities:**
55
 
56
- - **Cost Minimization** — Navigate 24-hour price volatility and thermal storage
57
- arbitrage to minimize total energy spend.
58
 
59
  - **Comfort Management** — Hold indoor temperature within 19–23°C through equipment
60
  degradation, faults, and shifting external conditions.
61
 
62
  - **Demand Response** — Read grid stress signals in real time and voluntarily shed
63
- load to earn demand-response credit without sacrificing comfort.
 
64
 
65
  - **Instruction Following** — Parse a natural language objective card at episode
66
  start and adapt the entire 96-step strategy to meet it.
@@ -69,7 +101,7 @@ the episode's objective.
69
 
70
  The naive approach is to reward cost savings and call it done. The problem is that
71
  a cost-only reward teaches the agent to turn off the HVAC entirely — perfect score,
72
- frozen building.
73
 
74
  Real building operators don't optimize one metric. They manage a hierarchy:
75
  comfort is non-negotiable, grid compliance is contractual, cost is the primary KPI,
@@ -87,24 +119,33 @@ Our reward reflects that hierarchy directly:
87
  | `efficiency_bonus` | 0.05 | Incentivises smart thermal storage arbitrage |
88
  | `stability_penalty` | -0.05 | Prevents HVAC thrashing that causes equipment wear |
89
  | `fault_mitigation` | dynamic | Correct fault response prevents costly outages |
90
- | `task_satisfaction` | 0.50* | Task 4 only — weighted per the instruction card |
 
 
 
 
 
 
 
91
 
92
- A reward this dense is harder to game. An agent that exploits one component while
93
- neglecting the others will see it reflected immediately in the score.
 
 
 
 
 
 
 
 
 
94
 
95
  ---
96
 
97
  ## Training
98
 
99
  We trained Qwen2.5-1.5B-Instruct with QLoRA (4-bit, rank 16) using GRPO via
100
- HuggingFace TRL. Each run is 60 steps on a T4 GPU, taking roughly 35 minutes.
101
- We ran 10 training iterations in total.
102
-
103
- **Why GRPO over PPO?**
104
- GRPO doesn't require a separate value network. At 1.5B parameters on a T4, that
105
- memory saving matters. Instead of estimating a value baseline, GRPO samples a group
106
- of completions per prompt and computes advantages by comparing them against each
107
- other — a natural fit for our setting where we generate multiple actions per state.
108
 
109
  | Component | Detail |
110
  |-----------|--------|
@@ -112,67 +153,61 @@ other — a natural fit for our setting where we generate multiple actions per s
112
  | Fine-tuning | QLoRA (4-bit, rank 16) |
113
  | Algorithm | GRPO via HuggingFace TRL |
114
  | Hardware | HF Space T4 GPU |
115
- | Training time | ~35 minutes per run |
116
- | Total runs | 10 |
117
-
118
- ---
119
-
120
- ## What the Curves Show
121
-
122
- ### Run 1 vs Run 10: The reward is climbing
123
 
124
- The clearest evidence of learning is what happens to the reward curve within a single
125
- training run and how that shape changes as the training setup matures.
126
-
127
- **Run 1, the first training run:**
128
-
129
- ![Reward Curve Run 1](curves/train%201/reward_curve.png)
130
- *Run 1: Reward climbs from −0.47 to ~0.65 over 60 steps. The model is learning fast
131
- in the early steps, then stabilizing — with a small dip at the very end.*
132
 
133
- **Run 10, after iterative refinement:**
 
 
134
 
135
- ![Reward Curve — Run 10](curves/train%2010/reward_curve.png)
136
- *Run 10: Same starting point, smoother curve, still rising at step 60. The model
137
- hasn't plateaued — which means longer training would continue to improve it.*
138
 
139
- Both runs start at the same reward (~−0.47) because each run initializes fresh.
140
- What changes is the *shape*: Run 10 is more stable, ends higher (~0.68 vs ~0.65),
141
- and shows no end-of-run dip. Ten runs of iteration on the training setup produced
142
- a meaningfully cleaner learning signal.
143
 
144
- The 1.1-point reward improvement within a single 60-step run is not noise.
145
- The agent is learning to manage energy in real time.
146
 
147
- ### Before and After: Where the model wins
 
 
 
148
 
149
- **Run 1, heuristic baseline vs GRPO-trained:**
 
 
150
 
151
- ![Baseline Comparison Run 1](curves/train%201/baseline_comparison.png)
152
- *Run 1: The trained model outperforms the heuristic on Task 4 by a significant margin.
153
- On Tasks 1–3 it scores below the heuristic — early training, limited steps.*
154
 
155
- **Run 10, heuristic baseline vs GRPO-trained:**
 
 
 
156
 
157
- ![Baseline Comparison Run 10](curves/train%2010/baseline_comparison.png)
158
- *Run 10: Similar pattern. Task 4 remains the trained model's strongest result.
159
- Tasks 1–3 gap to the heuristic has narrowed compared to Run 1.*
 
 
160
 
161
- ### The Task 4 result is the headline
162
 
163
- The heuristic scores **0.30** on Task 4. The trained model scores **0.70**.
164
- That is a **133% improvement** on instruction following, and it makes complete sense.
 
 
165
 
166
- A fixed heuristic cannot read a natural language objective card. It cannot parse
167
- "keep total cost under $2.50 while maintaining comfort" and change its behavior
168
- accordingly. The trained model can. That capability gap is exactly what this
169
- environment was designed to measure.
170
 
171
- Tasks 1–3 tell a more honest story. Time-of-day HVAC scheduling is genuinely
172
- reasonable for cost and temperature — the heuristic is a strong baseline on those
173
- tasks, and 60 training steps with a 1.5B model isn't enough to beat it consistently.
174
- That's not a failure of the environment. It's a signal that longer training would
175
- continue to pay off.
176
 
177
  ---
178
 
@@ -181,50 +216,70 @@ continue to pay off.
181
  None of these behaviors are hardcoded. The reward signal surfaces them:
182
 
183
  **Thermal arbitrage** — the agent learns to charge thermal storage during off-peak
184
- hours (~4¢/kWh) and discharge during peak (~32¢/kWh), reducing the effective cost
185
  of maintaining comfort during expensive periods.
186
 
187
  **Grid cooperation** — when the stress signal exceeds 0.7, the agent voluntarily
188
- sheds load rather than ignoring the signal. The demand-response credit offsets the
189
- comfort penalty.
190
 
191
- **Fault adaptation** — when HVAC efficiency degrades below a threshold, the agent
192
- reduces its HVAC target rather than fighting a weakened system at full power.
 
193
 
194
  **Instruction parsing** — in Task 4, the agent reads the objective card and adjusts
195
- its entire 96-step strategy accordingly, not just the next action.
 
196
 
197
  ---
198
 
199
- ## What We'd Do With More Compute
 
 
 
200
 
201
- - **300+ training steps** would likely close the gap on Tasks 1–3
202
- - A **7B model** with the same setup would show sharper policy improvement
203
- - **Multi-agent coordination** — 3 buildings sharing a 250kW feeder — is fully
204
- implemented but not yet the primary training focus. Fleet-level demand response
205
- is the next frontier.
206
 
207
  ---
208
 
209
  ## Try It
210
 
211
- The environment is live. You can reset an episode, send actions, and read rewards
212
- right now from your terminal:
 
213
 
214
  ```bash
215
  # Health check
216
  curl https://prajwal782007-gridmind.hf.space/health
217
 
218
- # Start an episode
219
  curl -X POST https://prajwal782007-gridmind.hf.space/reset \
220
  -H "Content-Type: application/json" \
221
  -d '{"task_id": 4}'
222
 
223
- # Take an action
224
  curl -X POST https://prajwal782007-gridmind.hf.space/step \
225
  -H "Content-Type: application/json" \
226
  -d '{"hvac_power_level": 0.6, "thermal_charge_rate": 0.4,
227
  "batch_job_slot": 2, "load_shed_fraction": 0.0, "building_id": 0}'
 
 
 
228
  ```
229
 
230
  - 🤗 **Environment**: https://prajwal782007-gridmind.hf.space
@@ -233,5 +288,5 @@ curl -X POST https://prajwal782007-gridmind.hf.space/step \
233
 
234
  ---
235
 
236
- *Built for the OpenEnv Hackathon India 2026 · April 25–26 · Scaler School of
237
- Technology, Bangalore.*
 
1
  ---
2
  title: GridMind-RL: Training LLMs to Manage Industrial Buildings with GRPO
3
+ description: How we built an OpenEnv-compatible RL environment that teaches language models real-world energy management — and what the training curves actually show.
4
  ---
5
 
6
  # GridMind-RL: Training LLMs to Manage Industrial Buildings
7
 
8
+ *OpenEnv Hackathon India 2026 · Aditya Suryavanshi, Shreeshant Bokade, Prajwal Valekar*
9
 
10
  ---
11
 
12
  There is a building somewhere running its air conditioning at full power right now,
13
+ even though electricity costs five times more than it did six hours ago. Not because
14
  the operator made a bad decision — but because the control system doesn't know the
15
  price changed.
16
 
 
20
  of dollars per building per year.
21
 
22
  LLMs can read pricing curves, respond to fault alerts, and follow natural language
23
+ instructions, but there has never been an environment that trains them to *act* on
24
+ that reasoning under real operational pressure. We built one, trained on it, and the
25
+ results show an agent that beats a hand-crafted heuristic on instruction following and comes within 6% of it on demand response.
26
 
27
+ ---
28
+
29
+ ## Who We Are
30
+
31
+ We are a team of three fascinated by the gap between what LLMs can reason about and
32
+ what they can actually *do*. Building energy management sits right at that frontier —
33
+ the domain is rich, the stakes are real, and we found no RL benchmark aimed at LLM agents in it.
34
+ GridMind-RL is our attempt to change that.
35
+
36
+ We built this for the Meta PyTorch OpenEnv Hackathon Grand Finale at Scaler School
37
+ of Technology, Bangalore, April 25–26, 2026.
38
+
39
+ ---
40
+
41
+ ## Which Themes We're Targeting
42
+
43
+ GridMind-RL directly addresses two hackathon themes:
44
+
45
+ **Theme 1 — Multi-Agent Interactions:** Three buildings share a 360kW grid feeder
46
+ (120kW per building). A coordinator LLM reads fleet-wide demand via `/feeder` and
47
+ sets per-building price multipliers via `/coordinate`. Buildings that ignore the
48
+ signal trip the feeder limit — causing a grid fault penalty for all three. This
49
+ creates genuine emergent coordination pressure without explicit communication.
50
+
51
+ **Theme 3.1 — World Modeling (Professional Tasks):** The `/simulate` endpoint lets
52
+ the agent ask "what if?" before committing an action. When HVAC efficiency is low or
53
+ faults are active, the agent can simulate a proposed action and revise its plan if
54
+ the predicted reward is poor. This trains causal reasoning and persistent world
55
+ modeling — exactly what Theme 3 targets.
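The shared-feeder constraint described under Theme 1 can be sketched as a simple aggregate check. This is only an illustration, not the Go server's logic: the helper name `feeder_tripped` and the rule that the feeder trips as soon as summed demand exceeds the limit are assumptions. Only the 360 kW limit and the three-building fleet come from the text.

```python
FEEDER_LIMIT_KW = 360.0  # shared feeder capacity stated in the post


def feeder_tripped(building_loads_kw, limit_kw=FEEDER_LIMIT_KW):
    """Return True when combined fleet demand exceeds the feeder limit.

    Hypothetical helper: the real trip logic lives in the Go server.
    """
    return sum(building_loads_kw) > limit_kw


# Every building stays near its 120 kW share, so the fleet is within the limit.
print(feeder_tripped([110.0, 120.0, 115.0]))
# One building overdraws while the others sit at their share: the fleet trips.
print(feeder_tripped([150.0, 120.0, 120.0]))
```

The second call shows why the penalty is collective: two buildings behave, yet the total still crosses the limit.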
56
 
57
  ---
58
 
59
  ## The Environment
60
 
61
+ GridMind-RL implements the OpenEnv-compatible interface (reset/step/state/grade)
62
+ via a high-performance Go HTTP server. The training side uses `openenv-core==0.2.3` as its
63
+ Python client library. The environment simulates a complete 24-hour industrial
64
+ building energy system at 15-minute resolution: 96 decision steps per episode.
65
+
66
+ The agent operates in continuous time, responding to a world that changes around it:
67
+ prices spike up to 5× during tariff faults, equipment degrades, grid stress signals
68
+ arrive, and sometimes the chiller fails at 2pm on the hottest day of the year.
69
 
70
+ **The agent sees a rich observation space every step, including:**
71
+ indoor temperature, thermal storage level, electricity price, grid stress signal,
72
+ HVAC efficiency (which degrades continuously throughout the episode), active fault
73
+ alarms, a 4-step price forecast, cumulative cost, carbon intensity, batch job queue,
74
+ and hour of day. In Task 4, this also includes a natural language instruction card.
 
75
 
76
  **The agent has four levers:**
77
 
 
82
  | `batch_job_slot` | 0 → 4 | When to run deferrable industrial loads |
83
  | `load_shed_fraction` | 0 → 0.5 | Voluntary demand reduction during grid stress |
84
 
85
+ **Four tasks of increasing difficulty:**
86
 
87
+ - **Cost Minimization** — Navigate 24-hour price volatility (~2¢ to ~36¢/kWh) and
88
+ thermal storage arbitrage to minimize total energy spend.
89
 
90
  - **Comfort Management** — Hold indoor temperature within 19–23°C through equipment
91
  degradation, faults, and shifting external conditions.
92
 
93
  - **Demand Response** — Read grid stress signals in real time and voluntarily shed
94
+ load (when signal exceeds 0.7) to earn demand-response credit without sacrificing
95
+ comfort.
96
 
97
  - **Instruction Following** — Parse a natural language objective card at episode
98
  start and adapt the entire 96-step strategy to meet it.
 
101
 
102
  The naive approach is to reward cost savings and call it done. The problem is that
103
  a cost-only reward teaches the agent to turn off the HVAC entirely — perfect score,
104
+ frozen building. This is textbook reward hacking.
105
 
106
  Real building operators don't optimize one metric. They manage a hierarchy:
107
  comfort is non-negotiable, grid compliance is contractual, cost is the primary KPI,
 
119
  | `efficiency_bonus` | 0.05 | Incentivises smart thermal storage arbitrage |
120
  | `stability_penalty` | -0.05 | Prevents HVAC thrashing that causes equipment wear |
121
  | `fault_mitigation` | dynamic | Correct fault response prevents costly outages |
122
+ | `task_satisfaction` | 0.10–0.50* | Task 4 only — weighted per the instruction card |
123
+
124
+ > *`task_satisfaction` weight varies by instruction template, ranging from
125
+ > 0.10 to 0.50 depending on the episode's objective card (tasks.go).
126
+
127
+ ### How we prevent reward hacking
128
+
129
+ A multi-component reward is only part of the answer. We also:
130
 
131
+ - **Clamp all actions** on the server side: the agent cannot exceed valid ranges
132
+ regardless of what it outputs (`hvac_power_level` hard-clamped 0–1,
133
+ `load_shed_fraction` hard-clamped 0–0.5, etc.)
134
+ - **Inject four fault types** that make naive exploitation brittle: chiller failure
135
+ (HVAC drops to 20% capacity), grid outage (price up to ×4, stress = 1.0), sensor
136
+ fault (temperature jitter ±5°C), and tariff spike (price up to ×5)
137
+ - **Use a seeded but stochastic environment** — price curves, fault timing, and
138
+ demand patterns vary across episodes, preventing the agent from memorizing a
139
+ fixed solution
140
+ - **Score via `/grade`** at episode end using a separate grading function that is
141
+ decoupled from the per-step reward signal
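The clamping step in the list above can be sketched in a few lines. The actual server is written in Go, so this Python version is an illustration under assumptions: the function name `clamp_action` is hypothetical, and only ranges quoted in this post are included (`thermal_charge_rate` is left untouched because its range is not stated here).

```python
# Valid ranges quoted in the post. thermal_charge_rate is deliberately absent
# because its range is not given in this section.
ACTION_RANGES = {
    "hvac_power_level": (0.0, 1.0),
    "load_shed_fraction": (0.0, 0.5),
    "batch_job_slot": (0, 4),
}


def clamp_action(action):
    """Return a copy of `action` with every known field forced into range."""
    clamped = dict(action)
    for field, (low, high) in ACTION_RANGES.items():
        if field in clamped:
            clamped[field] = max(low, min(high, clamped[field]))
    return clamped


raw = {"hvac_power_level": 1.7, "load_shed_fraction": 0.9, "batch_job_slot": 2}
print(clamp_action(raw))
```

Clamping on the server rather than in the prompt means a malformed or adversarial completion can never push the simulator outside its physical limits.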
142
 
143
  ---
144
 
145
  ## Training
146
 
147
  We trained Qwen2.5-1.5B-Instruct with QLoRA (4-bit, rank 16) using GRPO via
148
+ HuggingFace TRL on a T4 GPU, at roughly 35 minutes per run.
149
 
150
  | Component | Detail |
151
  |-----------|--------|
 
153
  | Fine-tuning | QLoRA (4-bit, rank 16) |
154
  | Algorithm | GRPO via HuggingFace TRL |
155
  | Hardware | HF Space T4 GPU |
156
+ | Training time | ~35 minutes |
157
+ | Steps | 60 |
 
 
 
 
 
 
158
 
159
+ **Why GRPO over PPO?**
160
+ GRPO doesn't require a separate value network. At 1.5B parameters on a T4, that
161
+ memory saving matters. Instead of estimating a value baseline, GRPO samples a group
162
+ of completions per prompt and computes advantages by comparing them against each
163
+ other — a natural fit for our setting where we generate multiple actions per state
164
+ and want to reinforce the better ones.
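To make the group comparison concrete, here is a minimal sketch of a group-relative advantage. It is not TRL's internal code: the function name and the reward values are made up for illustration, and the population standard deviation is one possible choice (implementations differ on this detail).

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalise each reward against the mean and spread of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four completions sampled for one prompt, each scored by the environment.
# The reward values are invented for this example.
rewards = [-0.2, 0.1, 0.4, 0.5]
advantages = group_relative_advantages(rewards)
print([round(a, 2) for a in advantages])
```

Completions that score above the group mean get a positive advantage and are reinforced; those below it are pushed down. No learned value network is involved.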
 
 
165
 
166
+ The hackathon context emphasized that RL only works if the probability of a good
167
+ answer is greater than zero. We confirmed this by running a heuristic baseline first
168
+ to verify the environment produces non-zero reward before starting RL training.
169
 
170
+ ---
 
 
171
 
172
+ ## Results
 
 
 
173
 
174
+ ### The numbers first
 
175
 
176
+ | Policy | Task 1 | Task 2 | Task 3 | Task 4 | Avg (unweighted) |
177
+ |--------|--------|--------|--------|--------|------------------|
178
+ | Heuristic Baseline | 0.54 | 0.56 | 0.50 | 0.31 | 0.48 |
179
+ | GRPO Fine-tuned | 0.42 | 0.34 | 0.47 | **0.49** | 0.43 |
180
 
181
+ > Heuristic = fixed time-of-day HVAC scheduling, no learning.
182
+ > GRPO Fine-tuned = Qwen2.5-1.5B-Instruct after 60 steps of GRPO training
183
+ > against the live environment.
184
 
185
+ The trained model **beats the heuristic on Task 4 by 58%** (0.49 vs 0.31) and
186
+ **comes within 6% of the heuristic on Task 3** (0.47 vs 0.50).
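The percentages and averages quoted here follow directly from the table. The short check below recomputes them; the dictionary keys and the helper `relative_change` are just names chosen for this example.

```python
heuristic = {"task1": 0.54, "task2": 0.56, "task3": 0.50, "task4": 0.31}
grpo = {"task1": 0.42, "task2": 0.34, "task3": 0.47, "task4": 0.49}


def relative_change(new, base):
    """Fractional change of `new` relative to `base`."""
    return (new - base) / base


task4_gain = relative_change(grpo["task4"], heuristic["task4"])
task3_gap = relative_change(grpo["task3"], heuristic["task3"])
avg_heuristic = sum(heuristic.values()) / len(heuristic)
avg_grpo = sum(grpo.values()) / len(grpo)

print(f"Task 4 change vs heuristic: {task4_gain:+.0%}")
print(f"Task 3 change vs heuristic: {task3_gap:+.0%}")
print(f"Unweighted averages: {avg_heuristic:.4f} vs {avg_grpo:.4f}")
```

The unweighted averages come out at 0.4775 and 0.43, which the table reports rounded to two decimals.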
 
187
 
188
+ These are the two tasks where intelligent reasoning matters most — instruction
189
+ parsing and real-time grid cooperation. A fixed schedule cannot read an objective
190
+ card. A fixed schedule cannot respond to a grid stress signal that arrives mid-episode.
191
+ The trained model can do both.
192
 
193
+ Tasks 1 and 2 are an honest result. Time-of-day HVAC scheduling is genuinely
194
+ competitive for cost and comfort: the heuristic baseline is strong on those
195
+ objectives because the physics are predictable. Closing that gap requires more
196
+ training steps. The reward curve shows the trend is still moving upward at step 60,
197
+ meaning training had not plateaued.
198
 
199
+ ### The reward curve
200
 
201
+ ![Reward Curve](curves/train%204/reward_curve.png)
202
+ *Reward vs training step. From −0.47 at step 5 to +0.61 at step 60 — a 1.08-point
203
+ gain. The smoothed average (red dashed) is still rising at the final step, suggesting
204
+ training had not saturated.*
205
 
206
+ ### The before/after
 
 
 
207
 
208
+ ![Baseline Comparison](curves/train%204/baseline_comparison.png)
209
+ *Grade scores per task: heuristic baseline (blue) vs GRPO-trained LLM (green).
210
+ Task 4 is where the trained model pulls clearly ahead, 58% above the heuristic.*
 
 
211
 
212
  ---
213
 
 
216
  None of these behaviors are hardcoded. The reward signal surfaces them:
217
 
218
  **Thermal arbitrage** — the agent learns to charge thermal storage during off-peak
219
+ hours (~3.5¢/kWh) and discharge during peak (~31¢/kWh), reducing the effective cost
220
  of maintaining comfort during expensive periods.
221
 
222
  **Grid cooperation** — when the stress signal exceeds 0.7, the agent voluntarily
223
+ sheds load rather than ignoring it. The demand-response credit offsets the comfort
224
+ penalty — which is why Task 3 performance is closest to the heuristic.
225
 
226
+ **Fault adaptation** — when HVAC efficiency degrades, the agent reduces its HVAC
227
+ target rather than fighting a weakened system at full power. This behavior emerges
228
+ purely from the `fault_mitigation` reward component.
229
 
230
  **Instruction parsing** — in Task 4, the agent reads the objective card and adjusts
231
+ its entire 96-step strategy to meet it. This is the hardest capability for a
232
+ heuristic to replicate — and where the trained model wins most clearly.
233
 
234
  ---
235
 
236
+ ## What's Next
237
+
238
+ GridMind-RL is a foundation, not a finished product. The directions we find most
239
+ interesting:
240
 
241
+ **Longer training runs:** the reward curve hasn't plateaued at 60 steps. 300+
242
+ steps would likely close the gap on Tasks 1 and 2 and push Task 4 performance
243
+ further above the heuristic.
244
+
245
+ **Larger models** — a 7B model with the same training setup would bring stronger
246
+ instruction-following capability and better multi-step planning out of the box.
247
+
248
+ **Fleet-level coordination** — three buildings share a 360kW grid feeder (120kW per
249
+ building). Fleet-level coordination is fully implemented — training a coordinator LLM
250
+ that orchestrates all three through price signals is the next research direction.
251
+ The shared feeder constraint creates genuine emergent coordination pressure — if one
252
+ building ignores the signal, all three pay the penalty.
253
+
254
+ **Real deployment** — the environment's physics are grounded in real building
255
+ parameters. The gap between this simulator and a real BMS integration is smaller
256
+ than it looks.
257
 
258
  ---
259
 
260
  ## Try It
261
 
262
+ GridMind-RL is live and OpenEnv-compliant. Task 4 is the most interesting to try —
263
+ the agent receives a natural language objective card and must adapt its entire
264
+ strategy to meet it:
265
 
266
  ```bash
267
  # Health check
268
  curl https://prajwal782007-gridmind.hf.space/health
269
 
270
+ # Start a Task 4 episode (instruction following)
271
  curl -X POST https://prajwal782007-gridmind.hf.space/reset \
272
  -H "Content-Type: application/json" \
273
  -d '{"task_id": 4}'
274
 
275
+ # Take an action and observe the reward
276
  curl -X POST https://prajwal782007-gridmind.hf.space/step \
277
  -H "Content-Type: application/json" \
278
  -d '{"hvac_power_level": 0.6, "thermal_charge_rate": 0.4,
279
  "batch_job_slot": 2, "load_shed_fraction": 0.0, "building_id": 0}'
280
+
281
+ # Grade the full episode
282
+ curl https://prajwal782007-gridmind.hf.space/grade
283
  ```
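For anyone who prefers Python over curl, the same `/reset` and `/step` calls can be prepared with the standard library. This is a sketch: the helper name `build_post` is hypothetical, the paths and payload fields are copied from the curl examples above, and the requests are only built here, not sent, so the snippet runs without network access.

```python
import json
import urllib.request

BASE_URL = "https://prajwal782007-gridmind.hf.space"


def build_post(path, payload):
    """Build a JSON POST request for the environment (not sent here)."""
    return urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


reset_req = build_post("/reset", {"task_id": 4})
step_req = build_post("/step", {
    "hvac_power_level": 0.6,
    "thermal_charge_rate": 0.4,
    "batch_job_slot": 2,
    "load_shed_fraction": 0.0,
    "building_id": 0,
})

# To call the live Space (requires network access):
# with urllib.request.urlopen(reset_req) as resp:
#     observation = json.load(resp)
print(reset_req.get_method(), reset_req.full_url)
print(step_req.get_method(), step_req.full_url)
```

The shape of the JSON returned by the server is not documented in this post, so the commented-out call simply decodes whatever comes back.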
284
 
285
  - 🤗 **Environment**: https://prajwal782007-gridmind.hf.space
 
288
 
289
  ---
290
 
291
+ *Built for the Meta PyTorch OpenEnv Hackathon × Scaler School of Technology ·
292
+ Grand Finale, April 25–26, 2026, Bangalore.*