Commit ·
51aa9cd
1
Parent(s): 70d6600
updated Blog and renamed curves
HF_BLOG_POST.md
CHANGED
|
@@ -1,94 +1,237 @@
|
|
| 1 |
---
|
| 2 |
-
title: GridMind-RL: Training LLMs to Manage Industrial Buildings
|
| 3 |
-
description:
|
| 4 |
---
|
| 5 |
|
| 6 |
-
|
| 7 |
|
| 8 |
-
|
| 9 |
|
| 10 |
-
|
| 11 |
|
| 12 |
-
|
| 13 |
-
|
| 14 |
-
|
| 15 |
|
| 16 |
-
|
| 17 |
|
| 18 |
-
-
|
| 19 |
-
- Maintain comfort (19–23°C) while minimizing cost
|
| 20 |
-
- Respond to grid stress emergencies
|
| 21 |
-
- Handle equipment faults (chiller failure, sensor malfunction, grid outages, tariff spikes)
|
| 22 |
-
- Parse and follow natural language objective cards
|
| 23 |
|
| 24 |
## The Environment
|
| 25 |
|
| 26 |
-
GridMind-RL
|
| 27 |
|
| 28 |
-
|
|
| 29 |
-
|-------|-------|
|
| 30 |
-
|
|
| 31 |
-
|
|
| 32 |
-
|
|
| 33 |
-
|
|
| 34 |
|
| 35 |
-
|
| 36 |
|
| 37 |
-
**
|
|
|
|
| 38 |
|
| 39 |
-
**
|
|
|
|
| 40 |
|
| 41 |
-
**
|
|
|
|
| 42 |
|
| 43 |
-
**
|
| 44 |
-
|
| 45 |
-
- **Grid outage**: Price ×3, stress = 1.0
|
| 46 |
-
- **Sensor fault**: Temperature readings jitter ±5°C
|
| 47 |
-
- **Tariff spike**: Emergency 4× price surge
|
| 48 |
|
| 49 |
-
|
| 50 |
|
| 51 |
-
|
|
|
|
|
|
|
| 52 |
|
| 53 |
-
|
|
|
|
|
|
|
| 54 |
|
| 55 |
-
|
| 56 |
-
|--------|--------|--------|--------|--------|
|
| 57 |
-
| **Heuristic Baseline** | 0.506 | 0.459 | 0.600 | 0.492 |
|
| 58 |
|
| 59 |
-
|
| 60 |
|
| 61 |
-
|
|
|
|
|
|
|
|
|
|
| 62 |
|
| 63 |
## Training
|
| 64 |
|
| 65 |
-
|
| 66 |
|
| 67 |
-
|
| 68 |
-
|
| 69 |
-
|
| 70 |
-
|
| 71 |
|
| 72 |
-
|
| 73 |
|
| 74 |
-
|
| 75 |
|
| 76 |
```bash
|
| 77 |
-
#
|
| 78 |
curl https://prajwal782007-gridmind.hf.space/health
|
| 79 |
|
| 80 |
-
#
|
| 81 |
-
|
|
|
|
|
|
|
| 82 |
|
| 83 |
-
#
|
| 84 |
-
|
|
|
|
|
|
|
|
|
|
| 85 |
```
|
| 86 |
|
| 87 |
-
|
| 88 |
-
|
| 89 |
-
|
| 90 |
-
Code: [github.com/LO-Kyu/gridmind](https://github.com/LO-Kyu/gridmind)
|
| 91 |
|
| 92 |
---
|
| 93 |
|
| 94 |
-
*
|
|
|
|
|
|
| 1 |
---
|
| 2 |
+
title: GridMind-RL: Training LLMs to Manage Industrial Buildings with GRPO
|
| 3 |
+
description: How we built an RL environment that teaches language models real-world energy management — and what 10 training runs taught us.
|
| 4 |
---
|
| 5 |
|
| 6 |
+
# GridMind-RL: Training LLMs to Manage Industrial Buildings
|
| 7 |
|
| 8 |
+
*OpenEnv Hackathon India 2026 · GridMind-RL Team*
|
| 9 |
|
| 10 |
+
---
|
| 11 |
+
|
| 12 |
+
There is a building somewhere running its air conditioning at full power right now,
|
| 13 |
+
even though electricity costs four times more than it did six hours ago. Not because
|
| 14 |
+
the operator made a bad decision — but because the control system doesn't know the
|
| 15 |
+
price changed.
|
| 16 |
+
|
| 17 |
+
Industrial buildings consume roughly 40% of global electricity. Most are managed by
|
| 18 |
+
fixed schedules that made sense when they were written and haven't been touched since.
|
| 19 |
+
The cost gap between a naive policy and an intelligent one is measurable in thousands
|
| 20 |
+
of dollars per building per year.
|
| 21 |
|
| 22 |
+
LLMs can read pricing curves, respond to fault alerts, and follow natural language
|
| 23 |
+
instructions. The missing piece has always been an environment that trains them to
|
| 24 |
+
*act* on that reasoning under real operational pressure.
|
| 25 |
|
| 26 |
+
That's what we built.
|
| 27 |
|
| 28 |
+
---
|
|
|
|
|
|
|
|
|
|
|
|
|
| 29 |
|
| 30 |
## The Environment
|
| 31 |
|
| 32 |
+
GridMind-RL simulates a complete 24-hour industrial building energy system at
|
| 33 |
+
15-minute resolution — 96 decision steps per episode. The agent operates in
|
| 34 |
+
discrete 15-minute steps, responding to a world that changes around it: prices spike, equipment
|
| 35 |
+
degrades, grid stress signals arrive, and sometimes the chiller fails at 2pm on the
|
| 36 |
+
hottest day of the year.
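
The episode arithmetic is easy to check: 24 hours at 15-minute resolution gives 96 steps. The sketch below maps a step index to a clock time. It assumes step 0 corresponds to midnight, which the post does not state, and the helper name `step_to_clock` is hypothetical rather than part of the environment.

```python
STEP_MINUTES = 15
STEPS_PER_EPISODE = 24 * 60 // STEP_MINUTES  # 96 decision steps per day


def step_to_clock(step: int) -> tuple[int, int]:
    """Map a step index (0..95) to (hour, minute), assuming step 0 is 00:00."""
    if not 0 <= step < STEPS_PER_EPISODE:
        raise ValueError(f"step must be in [0, {STEPS_PER_EPISODE - 1}], got {step}")
    minutes = step * STEP_MINUTES
    return minutes // 60, minutes % 60
```

Under that assumption, the "chiller fails at 2pm" scenario would land on step 56.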
|
| 37 |
+
|
| 38 |
+
**The agent sees 13 fields every step, including:**
|
| 39 |
+
current indoor temperature, thermal storage level, electricity price, grid stress
|
| 40 |
+
signal, HVAC efficiency (which degrades continuously over the episode), active fault
|
| 41 |
+
alarms, a 4-step price forecast, cumulative cost so far, carbon intensity, batch job
|
| 42 |
+
queue, hour of day, and — in Task 4 — a natural language instruction card describing
|
| 43 |
+
the episode's objective.
|
| 44 |
+
|
| 45 |
+
**The agent has four levers:**
|
| 46 |
|
| 47 |
+
| Action | Range | What it does |
|
| 48 |
+
|--------|-------|--------------|
|
| 49 |
+
| `hvac_power_level` | 0 → 1 | How hard the HVAC system works |
|
| 50 |
+
| `thermal_charge_rate` | -1 → 1 | Charge or discharge thermal storage |
|
| 51 |
+
| `batch_job_slot` | 0 → 4 | When to run deferrable industrial loads |
|
| 52 |
+
| `load_shed_fraction` | 0 → 0.5 | Voluntary demand reduction during grid stress |
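
The ranges in the table can be enforced on the client side before an action is sent. This is a minimal sketch: the field names and ranges come from the table, while the `Action` class and `clamped` helper are hypothetical, and treating `batch_job_slot` as an integer is an assumption. The post does not say whether the environment itself clamps or rejects out-of-range values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    hvac_power_level: float     # 0 to 1
    thermal_charge_rate: float  # -1 (discharge) to 1 (charge)
    batch_job_slot: int         # 0 to 4 (assumed to be an integer slot index)
    load_shed_fraction: float   # 0 to 0.5


def _clip(value, low, high):
    """Limit value to the closed interval [low, high]."""
    return max(low, min(high, value))


def clamped(action: Action) -> Action:
    """Return a copy of the action with every field forced into its documented range."""
    return Action(
        hvac_power_level=_clip(action.hvac_power_level, 0.0, 1.0),
        thermal_charge_rate=_clip(action.thermal_charge_rate, -1.0, 1.0),
        batch_job_slot=int(_clip(round(action.batch_job_slot), 0, 4)),
        load_shed_fraction=_clip(action.load_shed_fraction, 0.0, 0.5),
    )
```

Clamping is useful when actions are parsed from model text, where a value such as `1.4` for HVAC power is easy to produce and cheap to correct.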
|
| 53 |
|
| 54 |
+
**Four tasks test different capabilities:**
|
| 55 |
|
| 56 |
+
- **Cost Minimization** — Navigate 24-hour price volatility and thermal storage
|
| 57 |
+
arbitrage to minimize total energy spend.
|
| 58 |
|
| 59 |
+
- **Comfort Management** — Hold indoor temperature within 19–23°C through equipment
|
| 60 |
+
degradation, faults, and shifting external conditions.
|
| 61 |
|
| 62 |
+
- **Demand Response** — Read grid stress signals in real time and voluntarily shed
|
| 63 |
+
load to earn demand-response credit without sacrificing comfort.
|
| 64 |
|
| 65 |
+
- **Instruction Following** — Parse a natural language objective card at episode
|
| 66 |
+
start and adapt the entire 96-step strategy to meet it.
|
|
|
|
|
|
|
|
|
|
| 67 |
|
| 68 |
+
### Why the reward has nine components
|
| 69 |
|
| 70 |
+
The naive approach is to reward cost savings and call it done. The problem is that
|
| 71 |
+
a cost-only reward teaches the agent to turn off the HVAC entirely — perfect score,
|
| 72 |
+
uninhabitable building.
|
| 73 |
|
| 74 |
+
Real building operators don't optimize one metric. They manage a hierarchy:
|
| 75 |
+
comfort is non-negotiable, grid compliance is contractual, cost is the primary KPI,
|
| 76 |
+
carbon is increasingly regulated, and equipment stability protects the capital budget.
|
| 77 |
|
| 78 |
+
Our reward reflects that hierarchy directly:
|
|
|
|
|
|
|
| 79 |
|
| 80 |
+
| Component | Weight | Why |
|
| 81 |
+
|-----------|--------|-----|
|
| 82 |
+
| `cost_savings` | 0.28 | Primary operator KPI |
|
| 83 |
+
| `carbon_reward` | 0.20 | ESG compliance, increasingly mandatory |
|
| 84 |
+
| `temp_constraint` | 0.20 | Hard safety constraint — SLA violations incur penalties |
|
| 85 |
+
| `grid_response` | 0.20 | Demand response programs pay operators to shed load |
|
| 86 |
+
| `batch_deadline` | 0.12 | Missing deadlines causes downstream production losses |
|
| 87 |
+
| `efficiency_bonus` | 0.05 | Incentivises smart thermal storage arbitrage |
|
| 88 |
+
| `stability_penalty` | -0.05 | Prevents HVAC thrashing that causes equipment wear |
|
| 89 |
+
| `fault_mitigation` | dynamic | Correct fault response prevents costly outages |
|
| 90 |
+
| `task_satisfaction` | 0.50* | Task 4 only — weighted per the instruction card |
|
| 91 |
|
| 92 |
+
A reward this dense is harder to game. An agent that exploits one component while
|
| 93 |
+
neglecting the others will see it reflected immediately in the score.
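
One way to read the table is as a weighted sum, sketched below. The weights are taken from the table; everything else is an assumption: that the components combine linearly, that each component value lies in [0, 1], and that `stability_penalty` is a non-negative magnitude multiplied by its negative weight. `fault_mitigation` and `task_satisfaction` are left out because their weights are dynamic or task-specific. The function name `combine` is hypothetical.

```python
FIXED_WEIGHTS = {
    "cost_savings": 0.28,
    "carbon_reward": 0.20,
    "temp_constraint": 0.20,
    "grid_response": 0.20,
    "batch_deadline": 0.12,
    "efficiency_bonus": 0.05,
    "stability_penalty": -0.05,
}


def combine(components: dict[str, float]) -> float:
    """Weighted sum over the fixed-weight components; missing components count as 0."""
    return sum(weight * components.get(name, 0.0)
               for name, weight in FIXED_WEIGHTS.items())


# An agent that maximizes cost savings and ignores everything else
# captures only the cost_savings share of the total.
cost_only = combine({"cost_savings": 1.0})
```

The seven fixed weights sum to 1.00, so under these assumptions a cost-only policy tops out at 0.28, which illustrates why neglecting the other components shows up in the score.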
|
| 94 |
+
|
| 95 |
+
---
|
| 96 |
|
| 97 |
## Training
|
| 98 |
|
| 99 |
+
We trained Qwen2.5-1.5B-Instruct with QLoRA (4-bit, rank 16) using GRPO via
|
| 100 |
+
HuggingFace TRL. Each run is 60 steps on a T4 GPU, taking roughly 35 minutes.
|
| 101 |
+
We ran 10 training runs in total.
|
| 102 |
+
|
| 103 |
+
**Why GRPO over PPO?**
|
| 104 |
+
GRPO doesn't require a separate value network. At 1.5B parameters on a T4, that
|
| 105 |
+
memory saving matters. Instead of estimating a value baseline, GRPO samples a group
|
| 106 |
+
of completions per prompt and computes advantages by comparing them against each
|
| 107 |
+
other — a natural fit for our setting where we generate multiple actions per state.
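
The group comparison can be illustrated in a few lines. This sketch shows only the advantage normalization step (each reward minus the group mean, divided by the group standard deviation); it is not TRL's implementation, and it leaves out the policy-gradient loss, clipping, and KL term. The function name and the `eps` value are hypothetical.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each completion's reward against the other completions in its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    # eps keeps the division defined when every completion gets the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]


# Three completions for the same prompt: the best one gets a positive
# advantage, the worst a negative one, and no value network is involved.
advantages = group_relative_advantages([0.2, 0.5, 0.8])
```

When every completion in a group earns the same reward, all advantages are zero and that prompt contributes no learning signal.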
|
| 108 |
+
|
| 109 |
+
| Component | Detail |
|
| 110 |
+
|-----------|--------|
|
| 111 |
+
| Model | Qwen2.5-1.5B-Instruct |
|
| 112 |
+
| Fine-tuning | QLoRA (4-bit, rank 16) |
|
| 113 |
+
| Algorithm | GRPO via HuggingFace TRL |
|
| 114 |
+
| Hardware | HF Space T4 GPU |
|
| 115 |
+
| Training time | ~35 minutes per run |
|
| 116 |
+
| Total runs | 10 |
|
| 117 |
|
| 118 |
+
---
|
| 119 |
+
|
| 120 |
+
## What the Curves Show
|
| 121 |
+
|
| 122 |
+
### Run 1 vs Run 10: The reward is climbing
|
| 123 |
+
|
| 124 |
+
The clearest evidence of learning is what happens to the reward curve within a single
|
| 125 |
+
training run — and how that shape changes as the training setup matures.
|
| 126 |
+
|
| 127 |
+
**Run 1 — the first training run:**
|
| 128 |
+
|
| 129 |
+

|
| 130 |
+
*Run 1: Reward climbs from −0.47 to ~0.65 over 60 steps. The model is learning fast
|
| 131 |
+
in the early steps, then stabilizing — with a small dip at the very end.*
|
| 132 |
+
|
| 133 |
+
**Run 10 — after iterative refinement:**
|
| 134 |
+
|
| 135 |
+

|
| 136 |
+
*Run 10: Same starting point, smoother curve, still rising at step 60. The model
|
| 137 |
+
hasn't plateaued, which suggests longer training would continue to improve it.*
|
| 138 |
+
|
| 139 |
+
Both runs start at the same reward (~−0.47) because each run initializes fresh.
|
| 140 |
+
What changes is the *shape*: Run 10 is more stable, ends higher (~0.68 vs ~0.65),
|
| 141 |
+
and shows no end-of-run dip. Ten runs of iteration on the training setup produced
|
| 142 |
+
a meaningfully cleaner learning signal.
|
| 143 |
+
|
| 144 |
+
The roughly 1.1-point reward improvement within a single 60-step run is unlikely to be noise.
|
| 145 |
+
The agent is learning to manage energy in real time.
|
| 146 |
+
|
| 147 |
+
### Before and After: Where the model wins
|
| 148 |
+
|
| 149 |
+
**Run 1 — heuristic baseline vs GRPO-trained:**
|
| 150 |
+
|
| 151 |
+

|
| 152 |
+
*Run 1: The trained model outperforms the heuristic on Task 4 by a significant margin.
|
| 153 |
+
On Tasks 1–3 it scores below the heuristic — early training, limited steps.*
|
| 154 |
|
| 155 |
+
**Run 10 — heuristic baseline vs GRPO-trained:**
|
| 156 |
|
| 157 |
+

|
| 158 |
+
*Run 10: Similar pattern. Task 4 remains the trained model's strongest result.
|
| 159 |
+
Tasks 1–3 gap to the heuristic has narrowed compared to Run 1.*
|
| 160 |
+
|
| 161 |
+
### The Task 4 result is the headline
|
| 162 |
+
|
| 163 |
+
The heuristic scores **0.30** on Task 4. The trained model scores **0.70**.
|
| 164 |
+
That is a **133% improvement** on instruction following — and it makes complete sense.
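
The percentage follows directly from the two scores above. A quick check, using a hypothetical helper name:

```python
def relative_improvement(baseline: float, trained: float) -> float:
    """Percentage gain of the trained score over the baseline score."""
    return (trained - baseline) / baseline * 100.0


print(round(relative_improvement(0.30, 0.70)))  # prints 133
```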
|
| 165 |
+
|
| 166 |
+
A fixed heuristic cannot read a natural language objective card. It cannot parse
|
| 167 |
+
"keep total cost under $2.50 while maintaining comfort" and change its behavior
|
| 168 |
+
accordingly. The trained model can. That capability gap is exactly what this
|
| 169 |
+
environment was designed to measure.
|
| 170 |
+
|
| 171 |
+
Tasks 1–3 tell a more honest story. Time-of-day HVAC scheduling is genuinely
|
| 172 |
+
reasonable for cost and temperature — the heuristic is a strong baseline on those
|
| 173 |
+
tasks, and 60 training steps with a 1.5B model isn't enough to beat it consistently.
|
| 174 |
+
That's not a failure of the environment. It suggests that longer training would
|
| 175 |
+
continue to pay off.
|
| 176 |
+
|
| 177 |
+
---
|
| 178 |
+
|
| 179 |
+
## What the Agent Learns
|
| 180 |
+
|
| 181 |
+
None of these behaviors are hardcoded. The reward signal surfaces them:
|
| 182 |
+
|
| 183 |
+
**Thermal arbitrage** — the agent learns to charge thermal storage during off-peak
|
| 184 |
+
hours (~4¢/kWh) and discharge during peak (~32¢/kWh), reducing the effective cost
|
| 185 |
+
of maintaining comfort during expensive periods.
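
The size of the arbitrage opportunity can be estimated from the two prices quoted above. This is a back-of-the-envelope sketch, not the environment's cost model: it treats stored energy as electrical-equivalent kWh, and the `round_trip_efficiency` parameter and the 10 kWh example are hypothetical.

```python
OFF_PEAK_CENTS_PER_KWH = 4.0
PEAK_CENTS_PER_KWH = 32.0


def arbitrage_saving_cents(shifted_kwh: float, round_trip_efficiency: float = 1.0) -> float:
    """Saving from covering `shifted_kwh` of peak demand with storage charged off-peak."""
    if not 0.0 < round_trip_efficiency <= 1.0:
        raise ValueError("round_trip_efficiency must be in (0, 1]")
    # Losses mean more energy must be bought off-peak than is delivered at peak.
    charged_kwh = shifted_kwh / round_trip_efficiency
    avoided_peak_cost = shifted_kwh * PEAK_CENTS_PER_KWH
    off_peak_cost = charged_kwh * OFF_PEAK_CENTS_PER_KWH
    return avoided_peak_cost - off_peak_cost
```

With lossless storage, shifting 10 kWh saves 280 cents; at an assumed 80% round-trip efficiency the saving drops to about 270 cents, so the spread still dominates the losses.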
|
| 186 |
+
|
| 187 |
+
**Grid cooperation** — when the stress signal exceeds 0.7, the agent voluntarily
|
| 188 |
+
sheds load rather than ignoring the signal. The demand-response credit offsets the
|
| 189 |
+
comfort penalty.
|
| 190 |
+
|
| 191 |
+
**Fault adaptation** — when HVAC efficiency degrades below a threshold, the agent
|
| 192 |
+
reduces its HVAC target rather than fighting a weakened system at full power.
|
| 193 |
+
|
| 194 |
+
**Instruction parsing** — in Task 4, the agent reads the objective card and adjusts
|
| 195 |
+
its entire 96-step strategy accordingly, not just the next action.
|
| 196 |
+
|
| 197 |
+
---
|
| 198 |
+
|
| 199 |
+
## What We'd Do With More Compute
|
| 200 |
+
|
| 201 |
+
- **300+ training steps** would likely close the gap on Tasks 1–3
|
| 202 |
+
- A **7B model** with the same setup would likely show sharper policy improvement
|
| 203 |
+
- **Multi-agent coordination** (3 buildings sharing a 250 kW feeder) is fully
|
| 204 |
+
implemented but not yet the primary training focus. Fleet-level demand response
|
| 205 |
+
is the next frontier.
|
| 206 |
+
|
| 207 |
+
---
|
| 208 |
+
|
| 209 |
+
## Try It
|
| 210 |
+
|
| 211 |
+
The environment is live. You can reset an episode, send actions, and read rewards
|
| 212 |
+
right now from your terminal:
|
| 213 |
|
| 214 |
```bash
|
| 215 |
+
# Health check
|
| 216 |
curl https://prajwal782007-gridmind.hf.space/health
|
| 217 |
|
| 218 |
+
# Start an episode
|
| 219 |
+
curl -X POST https://prajwal782007-gridmind.hf.space/reset \
|
| 220 |
+
-H "Content-Type: application/json" \
|
| 221 |
+
-d '{"task_id": 4}'
|
| 222 |
|
| 223 |
+
# Take an action
|
| 224 |
+
curl -X POST https://prajwal782007-gridmind.hf.space/step \
|
| 225 |
+
-H "Content-Type: application/json" \
|
| 226 |
+
-d '{"hvac_power_level": 0.6, "thermal_charge_rate": 0.4,
|
| 227 |
+
"batch_job_slot": 2, "load_shed_fraction": 0.0, "building_id": 0}'
|
| 228 |
```
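
The same two POST calls can be prepared from Python with only the standard library. This is a sketch: the endpoints and payload fields mirror the curl commands above, but the helper names are hypothetical, and the assumption that responses are JSON objects is not confirmed by the post. Building the requests does not touch the network; only `send` does.

```python
import json
from urllib.request import Request, urlopen

BASE_URL = "https://prajwal782007-gridmind.hf.space"


def json_post(path: str, payload: dict) -> Request:
    """Build (but do not send) a JSON POST request for the given endpoint path."""
    return Request(
        BASE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: Request, timeout: float = 30.0) -> dict:
    """Send a request and decode the body, assuming the server replies with JSON."""
    with urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


reset_request = json_post("/reset", {"task_id": 4})
step_request = json_post("/step", {
    "hvac_power_level": 0.6,
    "thermal_charge_rate": 0.4,
    "batch_job_slot": 2,
    "load_shed_fraction": 0.0,
    "building_id": 0,
})
```

Calling `send(reset_request)` followed by `send(step_request)` would perform the same sequence as the curl commands.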
|
| 229 |
|
| 230 |
+
- 🤗 **Environment**: https://prajwal782007-gridmind.hf.space
|
| 231 |
+
- 📓 **Training Notebook**: [gridmind_grpo_colab.ipynb](https://colab.research.google.com/github/LO-Kyu/gridmind/blob/main/scripts/gridmind_grpo_colab.ipynb)
|
| 232 |
+
- 🐙 **Code**: https://github.com/LO-Kyu/gridmind
|
|
|
|
| 233 |
|
| 234 |
---
|
| 235 |
|
| 236 |
+
*Built for the OpenEnv Hackathon India 2026 · April 25–26 · Scaler School of
|
| 237 |
+
Technology, Bangalore.*
|
curves/train 10/{23d31ed3-ee14-4cbd-8c91-1ea5341b3657.png → baseline_comparison.png}
RENAMED
|
File without changes
|
curves/train 10/{c1816e21-ec99-4b41-a641-009d872dac69.png → loss_curve.png}
RENAMED
|
File without changes
|
curves/train 10/{e1c2dacb-7784-4788-b4fb-59b67ab5fed1.png → reward_curve.png}
RENAMED
|
File without changes
|