---
title: GridMind-RL
emoji: ⚡
colorFrom: green
colorTo: blue
sdk: docker
app_port: 7860
pinned: false
license: mit
---
# GridMind-RL — Train LLMs to manage industrial buildings under faults, grid stress, and natural language objectives.
[](https://openenv.org/)
[](https://golang.org/)
[](https://www.python.org/)
[](https://www.docker.com/)
[](LICENSE)
---
## Why This Environment Is Novel
Industrial buildings consume ~40% of global electricity, yet many still rely on naive "always-on" HVAC policies. LLMs can reason about pricing curves, fault alerts, and natural language objectives, but no environment trains them for this. GridMind-RL simulates a full 24-hour building energy system with stochastic electricity prices, equipment faults, and instruction cards: a domain where a policy has to trade off cost, comfort, and grid obligations across a whole day.
## Live Demo
| | URL |
|--|-----|
| **Environment API** | https://prajwal782007-gridmind.hf.space |
| **Live Dashboard** | https://prajwal782007-gridmind.hf.space/dashboard |
**Quick test:**
```bash
curl https://prajwal782007-gridmind.hf.space/health
curl https://prajwal782007-gridmind.hf.space/tasks
```
---
## Environment
| | Description |
|---|-------------|
| **Observation** | 13 fields, including indoor temperature, thermal storage level, price, grid stress, carbon intensity, active faults, HVAC efficiency, process demand, batch queue, and price forecast (full list under Observation Fields) |
| **Actions** | HVAC level (0-1), thermal charge (-1 to 1), batch slot (0-4), load shed (0-0.5) |
| **Reward** | 9-component weighted sum: cost, temperature, grid, deadline, efficiency, stability, carbon, instruction, fault_mitigation |
| **Episode** | 96 steps = 24 simulated hours @ 15-min resolution |
| **Tasks** | 4 tasks: (1) cost, (2) temperature, (3) demand_response, (4) instruction_following |
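
The episode layout above (96 steps at 15-minute resolution) can be turned into wall-clock time with a small helper. This is a minimal sketch, assuming a zero-based step index and an episode that starts at midnight; neither is stated in this README, and `step_to_clock` is a hypothetical name.

```python
STEPS_PER_EPISODE = 96   # from the Environment table
MINUTES_PER_STEP = 15    # from the Environment table


def step_to_clock(step: int) -> tuple[int, int]:
    """Return (hour, minute) of the simulated day at the start of a step.

    Assumes step 0 is 00:00, which is an assumption of this sketch.
    """
    if not 0 <= step < STEPS_PER_EPISODE:
        raise ValueError(f"step must be in [0, {STEPS_PER_EPISODE}), got {step}")
    minutes = step * MINUTES_PER_STEP
    return minutes // 60, minutes % 60
```

Under these assumptions the last step (index 95) starts at 23:45, which is how 96 steps cover exactly 24 simulated hours.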
### Reward Weight Rationale
Weights reflect real-world building operator priorities — not arbitrary values:
| Component | Weight | Rationale |
|---|---|---|
| `cost_savings` | 0.28 | Primary operator KPI — energy spend is the main business metric |
| `carbon_reward` | 0.20 | ESG compliance — increasingly mandatory for industrial operators |
| `temp_constraint` | 0.20 | Hard safety constraint — comfort SLA violations incur penalties |
| `grid_response` | 0.20 | Regulatory SLA — demand response programs pay operators to shed load |
| `batch_deadline` | 0.12 | Production continuity — missing batch deadlines causes downstream losses |
| `efficiency_bonus` | 0.05 | Storage arbitrage — incentivises smart charge/discharge timing |
| `stability_penalty` | -0.05 | Anti-cycling — prevents HVAC thrashing that causes equipment wear |
| `task_satisfaction` | 0.50* | Task 4 only — weighted per the episode's instruction card |
| `fault_mitigation` | dynamic | Emergency response — computed based on fault type and response |
> *Task 4 instruction reward weight comes from the sampled instruction card, not a fixed value.
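
To make the weighting concrete, here is a minimal Python sketch of a weighted sum over the seven fixed-weight components in the table. It is an illustration only: the real computation lives in `env/rewards.go`, the function name `weighted_reward` is hypothetical, and the two dynamic terms (`task_satisfaction`, `fault_mitigation`) are deliberately left out because their weights are not fixed.

```python
# Fixed weights copied from the table above.
FIXED_WEIGHTS = {
    "cost_savings": 0.28,
    "carbon_reward": 0.20,
    "temp_constraint": 0.20,
    "grid_response": 0.20,
    "batch_deadline": 0.12,
    "efficiency_bonus": 0.05,
    "stability_penalty": -0.05,
}


def weighted_reward(components: dict[str, float]) -> float:
    """Weighted sum of the fixed components; missing components count as 0.

    Assumes each component is already a normalized score, which this
    README does not confirm.
    """
    return sum(w * components.get(name, 0.0) for name, w in FIXED_WEIGHTS.items())
```

Note that the seven fixed weights, including the negative stability term, sum to 1.00.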
### Observation Fields
| Field | Type | Description |
|-------|------|-------------|
| indoor_temperature | float | °C |
| thermal_storage_level | float | 0-1 (0=empty, 1=full) |
| process_demand | float | kW current industrial power demand |
| current_price | float | $/kWh |
| grid_stress_signal | float | 0-1 (>0.7 = critical) |
| carbon_intensity | float | gCO2/kWh |
| hour_of_day | int | 0-23 |
| batch_queue | int[] | pending job deadline slots |
| cumulative_cost | float | $ total incurred this episode |
| hvac_efficiency | float | 1.0 → degrades to 0.5 over episode |
| active_faults | string[] | Active fault alarm strings |
| instruction_card | object | Task 4 objective only |
| price_forecast | float[] | 4-step upcoming price preview |
### Action Fields
| Field | Type | Range |
|-------|------|-------|
| hvac_power_level | float | 0.0-1.0 |
| thermal_charge_rate | float | -1.0 to 1.0 |
| batch_job_slot | int | 0-4 |
| load_shed_fraction | float | 0.0-0.5 |
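
An agent may want to clamp its output to these ranges before sending it. The sketch below is one way to do that on the client side; the `Action` class and its `clamped` method are hypothetical helpers, and whether the server also validates or clamps actions is not stated here.

```python
from dataclasses import dataclass, asdict


def _clamp(value, low, high):
    """Restrict value to the closed interval [low, high]."""
    return max(low, min(high, value))


@dataclass
class Action:
    hvac_power_level: float
    thermal_charge_rate: float
    batch_job_slot: int
    load_shed_fraction: float

    def clamped(self) -> "Action":
        """Return a copy with every field forced into its documented range."""
        return Action(
            hvac_power_level=_clamp(float(self.hvac_power_level), 0.0, 1.0),
            thermal_charge_rate=_clamp(float(self.thermal_charge_rate), -1.0, 1.0),
            batch_job_slot=int(_clamp(round(self.batch_job_slot), 0, 4)),
            load_shed_fraction=_clamp(float(self.load_shed_fraction), 0.0, 0.5),
        )
```

`asdict(Action(...).clamped())` then yields a plain dict keyed by the field names above, ready to be serialized as JSON.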
---
## Core Capabilities
### Multi-Agent Coordination
A single oversight LLM coordinates multiple buildings through price signals. The coordinator reads `/feeder` to see fleet-wide demand, then sets per-building price multipliers via `/coordinate` to orchestrate behavior.
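
As a rough illustration of the coordination idea, the sketch below derives per-building price multipliers from fleet demand. Everything in it is an assumption: the input shape (a dict of building demand in kW), the `feeder_limit_kw` and `gain` parameters, and the proportional rule itself. The actual `/feeder` and `/coordinate` payloads are not documented in this README, and in GridMind-RL the coordinator is an LLM rather than a fixed formula.

```python
def price_multipliers(
    demand_kw: dict[str, float],
    feeder_limit_kw: float,
    gain: float = 1.0,
) -> dict[str, float]:
    """Raise prices for buildings in proportion to their share of an overload.

    Returns 1.0 for every building while total demand is within the limit.
    """
    total = sum(demand_kw.values())
    if total == 0 or total <= feeder_limit_kw:
        return {building: 1.0 for building in demand_kw}
    overload = total / feeder_limit_kw - 1.0  # e.g. 0.25 means 25% over the limit
    n = len(demand_kw)
    return {
        building: 1.0 + gain * overload * (demand / total) * n
        for building, demand in demand_kw.items()
    }
```

With this rule the largest consumers see the largest multiplier, which nudges them hardest to cut demand.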
### Long-Horizon Instruction Following
Task 4 presents a natural language objective card such as "Keep total energy cost under $2.50 while maintaining 19-23°C". Agents must plan across all 96 steps rather than optimizing each step greedily.
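
One simple way to reason about such a card is to pace the budget over the remaining steps. The sketch below uses the $2.50 figure from the example card and the `cumulative_cost` observation field; the helper name and the idea of even pacing are assumptions for illustration, not how the agent or grader works.

```python
def remaining_budget_per_step(
    cumulative_cost: float,
    step: int,
    budget: float = 2.50,    # from the example instruction card
    total_steps: int = 96,   # episode length
) -> float:
    """Average spend per remaining step that still meets the budget.

    A negative result means the budget is already exceeded.
    """
    steps_left = total_steps - step
    if steps_left <= 0:
        raise ValueError("no steps left in the episode")
    return (budget - cumulative_cost) / steps_left
```

For example, an agent that has spent $1.00 by step 48 can afford about $0.031 per step for the rest of the episode.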
These two capabilities map directly to Theme 1 and Theme 3 of the OpenEnv Hackathon.
---
## Results
### What the Agent Learns
A naive heuristic runs HVAC at fixed levels based on time-of-day. After GRPO training on GridMind-RL, the agent learns to charge thermal storage during off-peak hours (4¢/kWh) and discharge during peak (32¢/kWh), voluntarily shed load during grid stress signals above 0.7, and adjust HVAC intensity as efficiency degrades over the episode. None of these behaviors are hardcoded — the agent discovers them through the reward signal alone.
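
For intuition, the behaviors described above can be written down as explicit rules. The sketch below is not the trained policy and not the repository's heuristic baseline: the price thresholds, the fixed HVAC level, the shed fraction, and the convention that a positive `thermal_charge_rate` means charging are all assumptions. Only the 0.7 stress threshold and the field names come from this README.

```python
CHEAP_PRICE = 0.08       # $/kWh, hypothetical "off-peak" threshold
EXPENSIVE_PRICE = 0.25   # $/kWh, hypothetical "peak" threshold
CRITICAL_STRESS = 0.7    # grid_stress_signal above this is critical


def rule_based_action(obs: dict) -> dict:
    """Hand-written imitation of the behaviors a trained agent is said to learn."""
    price = obs["current_price"]
    storage = obs["thermal_storage_level"]
    if price <= CHEAP_PRICE and storage < 1.0:
        charge = 1.0     # assumed sign: positive charges storage
    elif price >= EXPENSIVE_PRICE and storage > 0.0:
        charge = -1.0    # assumed sign: negative discharges storage
    else:
        charge = 0.0
    shed = 0.3 if obs["grid_stress_signal"] > CRITICAL_STRESS else 0.0
    return {
        "hvac_power_level": 0.5,   # placeholder level
        "thermal_charge_rate": charge,
        "batch_job_slot": 0,       # placeholder slot
        "load_shed_fraction": shed,
    }
```

The point of training is that the agent arrives at this kind of behavior from the reward signal, without rules like these being written in.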
| Policy | Task 1 | Task 2 | Task 3 | Task 4 |
|--------|--------|--------|--------|--------|
| Heuristic Baseline | 0.494 | 0.471 | 0.748 | 0.478 |
| Zero-shot LLM | 0.715 | 0.645 | 0.610 | 0.582 |
| GRPO Fine-tuned LLM | — | — | — | — |
> *GRPO fine-tuned scores will be added once the full training run on a T4 GPU completes.
> The training plots below show progress from that run.*
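
The per-task gap between the two populated rows can be computed directly from the table. This is a small convenience sketch; the variable and function names are made up for illustration.

```python
TASKS = ["Task 1", "Task 2", "Task 3", "Task 4"]
HEURISTIC = [0.494, 0.471, 0.748, 0.478]   # from the results table
ZERO_SHOT = [0.715, 0.645, 0.610, 0.582]   # from the results table


def score_deltas(candidate: list[float], baseline: list[float]) -> list[float]:
    """Per-task difference candidate - baseline, rounded to 3 decimals."""
    return [round(c - b, 3) for c, b in zip(candidate, baseline)]


deltas = dict(zip(TASKS, score_deltas(ZERO_SHOT, HEURISTIC)))
for task, delta in deltas.items():
    print(f"{task}: {delta:+.3f}")
```

The zero-shot LLM scores higher than the heuristic on Tasks 1, 2, and 4, but lower on Task 3 (demand response).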

*Reward vs training step. Blue = per-step reward, red dashed = smoothed average.*

*Training loss decreasing over steps — confirms the model is updating.*

*Grade scores per task: heuristic baseline vs GRPO-trained LLM.*
> Scores are episode grade scores, clamped to the open interval (0.0, 1.0). Heuristic = fixed policy with no learning. Zero-shot = Qwen2.5-1.5B-Instruct prompted with the task description and no fine-tuning, evaluated over a single episode per task. Fine-tuned = GRPO-trained on the GridMind-RL environment.
> 🔄 **Live update:** GRPO fine-tuned scores will be filled in here immediately
> after the final training run completes on the T4 GPU.
---
## How to Run
### Start the environment server
```bash
go run main.go
```
### Run the LLM agent (tasks 1-4)
```bash
# Set up your API token
cp .env.example .env
# Edit .env with HF_TOKEN
# Task 1: Cost minimization
python inference.py --task 1 --episodes 5
# Task 2: Temperature management
python inference.py --task 2 --episodes 5
# Task 3: Full demand response
python inference.py --task 3 --episodes 5
# Task 4: Instruction following
python inference.py --task 4 --episodes 5
# Heuristic baseline (fast, no LLM)
python inference.py --fast-mode --task 3 --episodes 5
```
### Run multi-building coordinator demo
```bash
python scripts/multi_building_demo.py
```
### Run training (requires GPU)
```bash
python scripts/train_unsloth.py --steps 500 --output-csv results/training_log.csv
```
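
For readers new to GRPO, its core idea is to score each sampled completion relative to the other completions in its group, rather than against a learned value function. The sketch below shows that normalization step in plain Python. It is not taken from `scripts/train_unsloth.py`; the function name is hypothetical, and it uses the population standard deviation, while some implementations use the sample standard deviation instead.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize rewards within one group: (r - mean) / (std + eps).

    eps avoids division by zero when every reward in the group is equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Completions that beat their group's average get a positive advantage and are reinforced; a group with identical rewards produces all-zero advantages and contributes no gradient signal.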
### Generate training curve plot
```bash
python scripts/plot_results.py
```
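
The smoothed curve in a reward plot is typically some kind of moving average. The sketch below shows a trailing-window version; it is an assumption about the general technique, not a description of what `scripts/plot_results.py` actually does, and `moving_average` is a hypothetical helper.

```python
def moving_average(values: list[float], window: int) -> list[float]:
    """Trailing moving average; early points average over what is available."""
    if window < 1:
        raise ValueError("window must be at least 1")
    smoothed = []
    running = 0.0
    for i, value in enumerate(values):
        running += value
        if i >= window:
            running -= values[i - window]  # drop the value that left the window
        smoothed.append(running / min(i + 1, window))
    return smoothed
```

A larger window gives a smoother line at the cost of lagging further behind the raw per-step reward.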
---
## Architecture
```
Agent (inference.py)
→ HTTP POST /reset, /step; GET /grade
↓
Go Environment Server (main.go) → Port 7860
↓
Physics Engine (env/environment.go) + Rewards (env/rewards.go) + Tasks (env/tasks.go)
↓
Web Dashboard (dashboard/server.py) → Port 7861
```
**Design philosophy:**
- **Separation of concerns**: Physics engine (Go) decoupled from policy layer (Python)
- **OpenEnv compliance**: Standardized REST API lets agents written in any language connect
- **Deterministic simulation**: Seeded RNG for reproducible experiments
- **Dense rewards**: 9-component reward for effective learning
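
The deterministic-simulation point can be illustrated with a seeded generator: the same seed always yields the same stochastic trajectory. The sketch below is in Python for readability, while the actual engine is written in Go; the uniform noise model and the function name are placeholders, and the 0.04 and 0.32 bounds are simply the off-peak and peak prices quoted in the Results section.

```python
import random


def sample_price_curve(
    seed: int,
    steps: int = 96,
    low: float = 0.04,
    high: float = 0.32,
) -> list[float]:
    """Draw one price per step from a private, seeded RNG.

    Uniform noise is a stand-in; it is not the simulator's price model.
    """
    rng = random.Random(seed)  # private instance, so global RNG state is untouched
    return [rng.uniform(low, high) for _ in range(steps)]
```

Using a dedicated `random.Random(seed)` instance instead of the module-level functions keeps runs reproducible even if other code also draws random numbers.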
---
## API Reference
| Method | Endpoint | Description |
|--------|----------|-------------|
| GET | /health | Health check |
| GET | /ping | Liveness probe |
| POST | /reset | Start new episode |
| POST | /step | Take action step |
| GET | /state | Get current state |
| GET | /grade | Grade episode (0.0-1.0 score) |
| GET | /tasks | Available tasks |
| GET | /metrics | Prometheus metrics |
| GET | /replay | Episode history |
| GET | /feeder | Aggregate fleet state |
| POST | /coordinate | Set price multipliers |
| POST | /simulate | World model prediction |
| POST | /coordinator/reset | Reset multi-building episode |
| POST | /coordinator/step | Step with per-building actions |
| GET | /info | OpenEnv metadata |
| GET | /ws | WebSocket endpoint |
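
A client can be built on the standard library alone. The sketch below only constructs the HTTP requests for the endpoints above; the request bodies (for example a `task_id` key for `/reset`) are hypothetical, since payload schemas are not documented in this README. The default base URL assumes a local server on port 7860.

```python
import json
from urllib.request import Request

BASE_URL = "http://localhost:7860"  # assumed local server address


def build_request(method: str, path: str, payload=None, base_url: str = BASE_URL) -> Request:
    """Build (but do not send) a request, JSON-encoding the payload if given."""
    data = None
    headers = {}
    if payload is not None:
        data = json.dumps(payload).encode("utf-8")
        headers["Content-Type"] = "application/json"
    return Request(base_url + path, data=data, headers=headers, method=method)


# Hypothetical bodies; check openenv.yaml or the server code for the real schema.
reset_req = build_request("POST", "/reset", {"task_id": 1})
step_req = build_request("POST", "/step", {"hvac_power_level": 0.5})
grade_req = build_request("GET", "/grade")
```

Each request can then be sent with `urllib.request.urlopen(req)` and the response body decoded with `json.loads`.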
---
## Project Structure
```
gridmind-rl/
├── main.go # HTTP server & OpenEnv API
├── inference.py # Agent entry point (LLM + heuristic)
├── openenv.yaml # OpenEnv spec
├── Dockerfile # Container build
├── HF_BLOG_POST.md # Blog write-up
├── baseline_scores.json # Heuristic baseline scores
├── env/
│ ├── environment.go # Physics simulation
│ ├── models.go # Data models
│ ├── rewards.go # Reward computation
│ ├── tasks.go # Task grading
│ └── faults.go # Fault injection
├── scripts/
│ ├── train_unsloth.py # GRPO training
│ ├── plot_results.py # Training curve visualizer
│ ├── multi_building_demo.py # Fleet AI demo
│ └── gridmind_grpo_colab.ipynb # Colab training notebook
├── server/
│ └── app.py # Python fallback server
├── dashboard/
│ ├── server.py # Web server (port 7861)
│ └── static/ # Frontend assets
├── curves/ # Training curves (train N/)
│ └── train N/ # Per-run plots
├── results/ # Training outputs (generated)
└── README.md
```
---
## Links
- 🤗 HuggingFace Space: [GridMind-RL](https://prajwal782007-gridmind.hf.space)
- 📓 Training Notebook: [gridmind_grpo_colab.ipynb](https://colab.research.google.com/github/LO-Kyu/gridmind/blob/main/scripts/gridmind_grpo_colab.ipynb)
- 📝 Blog Post: [Read the write-up](./HF_BLOG_POST.md)
- 🐙 GitHub: [Code Repository](https://github.com/LO-Kyu/gridmind)
---
## License
MIT License. See [LICENSE](LICENSE) file.
---
**Questions?** Open an issue on GitHub.