Commit · 52635ef
1 Parent(s): 28abef0
Updated README for Round 2
README.md
CHANGED
|
@@ -21,7 +21,7 @@ license: mit
|
|
| 21 |
|
| 22 |
## Why This Environment Is Novel
|
| 23 |
|
| 24 |
-
|
| 25 |
|
| 26 |
## Live Demo
|
| 27 |
|
|
@@ -38,24 +38,11 @@ curl https://prajwal782007-gridmind.hf.space/tasks
|
|
| 38 |
|
| 39 |
---
|
| 40 |
|
| 41 |
-
## Problem
|
| 42 |
-
|
| 43 |
-
Industrial buildings consume ~40% of global electricity, yet most still use naive "always-on" HVAC policies. The capability gap is clear: **LLMs can understand complex pricing curves, natural language instructions, and fault alerts—but no environment exists to train them to manage buildings.**
|
| 44 |
-
|
| 45 |
-
GridMind-RL closes this gap by simulating a complete building energy system where agents must:
|
| 46 |
-
- Navigate 24-hour price volatility (off-peak vs peak: 4¢ to 32¢/kWh)
|
| 47 |
-
- Maintain comfort (19-23°C) while minimizing cost
|
| 48 |
-
- Respond to grid stress emergencies
|
| 49 |
-
- Handle equipment faults (chiller failure, sensor malfunction, grid outages)
|
| 50 |
-
- Parse and follow natural language objective cards
|
| 51 |
-
|
| 52 |
-
---
|
| 53 |
-
|
| 54 |
## Environment
|
| 55 |
|
| 56 |
| | Description |
|
| 57 |
|---|-------------|
|
| 58 |
-
| **Observation** |
|
| 59 |
| **Actions** | HVAC level (0-1), thermal charge (-1 to 1), batch slot (0-4), load shed (0-0.5) |
|
| 60 |
| **Reward** | 9-component weighted sum: cost, temperature, grid, deadline, efficiency, stability, carbon, instruction, fault_mitigation |
|
| 61 |
| **Episode** | 96 steps = 24 simulated hours @ 15-min resolution |
|
|
@@ -74,8 +61,8 @@ Weights reflect real-world building operator priorities — not arbitrary values
|
|
| 74 |
| `batch_deadline` | 0.12 | Production continuity — missing batch deadlines causes downstream losses |
|
| 75 |
| `efficiency_bonus` | 0.05 | Storage arbitrage — incentivises smart charge/discharge timing |
|
| 76 |
| `stability_penalty` | -0.05 | Anti-cycling — prevents HVAC thrashing that causes equipment wear |
|
| 77 |
-
| `
|
| 78 |
-
| `
|
| 79 |
|
| 80 |
> *Task 4 instruction reward weight comes from the sampled instruction card, not a fixed value.
|
| 81 |
|
|
@@ -85,11 +72,17 @@ Weights reflect real-world building operator priorities — not arbitrary values
|
|
| 85 |
|-------|------|-------------|
|
| 86 |
| indoor_temperature | float | °C |
|
| 87 |
| thermal_storage_level | float | 0-1 (0=empty, 1=full) |
|
|
|
|
| 88 |
| current_price | float | $/kWh |
|
| 89 |
| grid_stress_signal | float | 0-1 (>0.7 = critical) |
|
|
|
|
|
|
|
|
|
|
|
|
|
| 90 |
| hvac_efficiency | float | 1.0 → degrades to 0.5 over episode |
|
| 91 |
| active_faults | string[] | Active fault alarm strings |
|
| 92 |
| instruction_card | object | Task 4 objective only |
|
|
|
|
| 93 |
|
| 94 |
### Action Fields
|
| 95 |
|
|
@@ -102,41 +95,46 @@ Weights reflect real-world building operator priorities — not arbitrary values
|
|
| 102 |
|
| 103 |
---
|
| 104 |
|
| 105 |
-
##
|
| 106 |
|
| 107 |
-
###
|
| 108 |
A single oversight LLM coordinates multiple buildings through price signals. The coordinator reads `/feeder` to see fleet-wide demand, then sets per-building price multipliers via `/coordinate` to orchestrate behavior.
|
| 109 |
|
| 110 |
-
###
|
| 111 |
Task 4 presents a natural language objective card like "Keep total energy cost under $2.50 while maintaining 19-23°C". Agents must plan across all 96 steps rather than act greedily at each step.
|
| 112 |
|
| 113 |
-
|
| 114 |
-
The `/simulate` endpoint lets agents ask "what if?" before acting. When HVAC efficiency is low or faults are active, the agent simulates the proposed action and revises if the predicted reward is poor.
|
| 115 |
-
|
| 116 |
-
### Track 4: Fault Handling (Wild Card)
|
| 117 |
-
Four fault types inject unpredictability:
|
| 118 |
-
- **Chiller failure**: HVAC drops to 20% capacity
|
| 119 |
-
- **Grid outage**: Price ×3, stress = 1.0
|
| 120 |
-
- **Sensor fault**: Temperature readings jitter ±5°C
|
| 121 |
-
- **Tariff spike**: Emergency 4× price surge
|
| 122 |
-
|
| 123 |
-
### Track 5: HVAC Degradation
|
| 124 |
-
Real HVAC systems degrade over time. Efficiency starts at 1.0 and drops ~0.1% per step. The agent must account for declining capacity—a hidden state requiring inference.
|
| 125 |
|
| 126 |
---
|
| 127 |
|
| 128 |
## Results
|
| 129 |
|
| 130 |
-
|
| 131 |
-
|
|
|
|
| 132 |
|
| 133 |
| Policy | Task 1 | Task 2 | Task 3 | Task 4 |
|
| 134 |
|--------|--------|--------|--------|--------|
|
| 135 |
-
| Heuristic Baseline | 0.
|
| 136 |
| Zero-shot LLM | 0.715 | 0.645 | 0.610 | 0.582 |
|
| 137 |
-
| GRPO Fine-tuned LLM |
|
|
|
|
|
|
|
|
|
|
| 138 |
|
| 139 |
-
|
| 140 |
|
| 141 |
---
|
| 142 |
|
|
@@ -186,18 +184,6 @@ python scripts/plot_results.py
|
|
| 186 |
|
| 187 |
---
|
| 188 |
|
| 189 |
-
## Self-Improvement: Curriculum Learning
|
| 190 |
-
|
| 191 |
-
The `--curriculum` flag enables automatic task progression:
|
| 192 |
-
- Agent starts on Task 1 (easy)
|
| 193 |
-
- After 5 episodes with average reward ≥ 0.55, advances to Task 2
|
| 194 |
-
- After 5 episodes with average reward ≥ 0.50, advances to Task 3
|
| 195 |
-
- After 5 episodes with average reward ≥ 0.45, advances to Task 4
|
| 196 |
-
|
| 197 |
-
This directly targets the Self-Improvement hackathon theme.
|
| 198 |
-
|
| 199 |
-
---
|
| 200 |
-
|
| 201 |
## Architecture
|
| 202 |
|
| 203 |
```
|
|
@@ -230,11 +216,15 @@ Web Dashboard (dashboard/server.py) → Port 7861
|
|
| 230 |
| GET | /state | Get current state |
|
| 231 |
| GET | /grade | Grade episode (0.0-1.0 score) |
|
| 232 |
| GET | /tasks | Available tasks |
|
| 233 |
-
| GET | /metrics |
|
| 234 |
| GET | /replay | Episode history |
|
| 235 |
| GET | /feeder | Aggregate fleet state |
|
| 236 |
| POST | /coordinate | Set price multipliers |
|
| 237 |
| POST | /simulate | World model prediction |
|
|
|
|
|
|
|
|
|
|
|
|
|
| 238 |
|
| 239 |
---
|
| 240 |
|
|
@@ -246,6 +236,8 @@ gridmind-rl/
|
|
| 246 |
├── inference.py # Agent entry point (LLM + heuristic)
|
| 247 |
├── openenv.yaml # OpenEnv spec
|
| 248 |
├── Dockerfile # Container build
|
|
|
|
|
|
|
| 249 |
├── env/
|
| 250 |
│ ├── environment.go # Physics simulation
|
| 251 |
│ ├── models.go # Data models
|
|
@@ -256,10 +248,14 @@ gridmind-rl/
|
|
| 256 |
│ ├── train_unsloth.py # GRPO training
|
| 257 |
│ ├── plot_results.py # Training curve visualizer
|
| 258 |
│ ├── multi_building_demo.py # Fleet AI demo
|
| 259 |
-
│ └──
|
|
|
|
|
|
|
| 260 |
├── dashboard/
|
| 261 |
│ ├── server.py # Web server (port 7861)
|
| 262 |
│ └── static/ # Frontend assets
|
|
|
|
|
|
|
| 263 |
├── results/ # Training outputs (generated)
|
| 264 |
└── README.md
|
| 265 |
```
|
|
@@ -269,9 +265,9 @@ gridmind-rl/
|
|
| 269 |
## Links
|
| 270 |
|
| 271 |
- 🤗 HuggingFace Space: [GridMind-RL](https://prajwal782007-gridmind.hf.space)
|
| 272 |
-
-
|
| 273 |
-
-
|
| 274 |
-
- 🐙
|
| 275 |
|
| 276 |
---
|
| 277 |
|
|
|
|
| 21 |
|
| 22 |
## Why This Environment Is Novel
|
| 23 |
|
| 24 |
+
Industrial buildings consume ~40% of global electricity yet rely on naive "always-on" HVAC policies. LLMs can reason about pricing curves, fault alerts, and natural language objectives—but no environment trains them for this. GridMind-RL simulates a full 24-hour building energy system with stochastic electricity prices, equipment faults, and instruction cards, creating a genuinely challenging domain where learned policies translate to real operational value.
|
| 25 |
|
| 26 |
## Live Demo
|
| 27 |
|
|
|
|
| 38 |
|
| 39 |
---
|
| 40 |
|
| 41 |
## Environment
|
| 42 |
|
| 43 |
| | Description |
|
| 44 |
|---|-------------|
|
| 45 |
+
| **Observation** | 13 fields, including temperature, storage, price, stress, carbon, faults, HVAC efficiency, process demand, batch queue, and price forecast |
|
| 46 |
| **Actions** | HVAC level (0-1), thermal charge (-1 to 1), batch slot (0-4), load shed (0-0.5) |
|
| 47 |
| **Reward** | 9-component weighted sum: cost, temperature, grid, deadline, efficiency, stability, carbon, instruction, fault_mitigation |
|
| 48 |
| **Episode** | 96 steps = 24 simulated hours @ 15-min resolution |
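
The action ranges and episode length in the table above can be sketched in Python as follows. This is a minimal illustration, not the environment's own code: the field names (`hvac_level`, `thermal_charge`, `batch_slot`, `load_shed`) and the helper functions are assumptions; only the numeric ranges and the 96 x 15-minute episode come from the table.

```python
from dataclasses import dataclass

STEPS_PER_EPISODE = 96   # from the Environment table
MINUTES_PER_STEP = 15    # 96 steps x 15 min = 24 simulated hours


@dataclass(frozen=True)
class Action:
    # Field names are hypothetical; only the ranges come from the table.
    hvac_level: float      # 0 to 1
    thermal_charge: float  # -1 to 1
    batch_slot: int        # 0 to 4
    load_shed: float       # 0 to 0.5


def clamp(value, low, high):
    """Limit value to the closed range [low, high]."""
    return max(low, min(high, value))


def clamp_action(hvac_level, thermal_charge, batch_slot, load_shed):
    """Force raw (for example, LLM-proposed) values into the documented ranges."""
    return Action(
        hvac_level=clamp(float(hvac_level), 0.0, 1.0),
        thermal_charge=clamp(float(thermal_charge), -1.0, 1.0),
        batch_slot=int(clamp(round(batch_slot), 0, 4)),
        load_shed=clamp(float(load_shed), 0.0, 0.5),
    )


def step_to_clock(step):
    """Map a step index (0..95) to the simulated time of day as HH:MM."""
    if not 0 <= step < STEPS_PER_EPISODE:
        raise ValueError("step must be in 0..95")
    minutes = step * MINUTES_PER_STEP
    return f"{minutes // 60:02d}:{minutes % 60:02d}"
```

Clamping before sending an action is one way to keep malformed model output from being rejected; whether the server also clamps is not stated here.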
|
|
|
|
| 61 |
| `batch_deadline` | 0.12 | Production continuity — missing batch deadlines causes downstream losses |
|
| 62 |
| `efficiency_bonus` | 0.05 | Storage arbitrage — incentivises smart charge/discharge timing |
|
| 63 |
| `stability_penalty` | -0.05 | Anti-cycling — prevents HVAC thrashing that causes equipment wear |
|
| 64 |
+
| `task_satisfaction` | 0.50* | Task 4 only — weighted per the episode's instruction card |
|
| 65 |
+
| `fault_mitigation` | dynamic | Emergency response — computed based on fault type and response |
|
| 66 |
|
| 67 |
> *Task 4 instruction reward weight comes from the sampled instruction card, not a fixed value.
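
As a rough sketch of how a weighted sum like this could be combined, the snippet below uses only the three fixed weights visible in this excerpt. The remaining rows of the weight table are not shown here, so they are deliberately left out rather than guessed. The function name, the component-score dictionary, and the way the instruction-card weight is passed in are all assumptions for illustration.

```python
# Only the fixed weights visible in this excerpt; the other components
# of the 9-part reward are omitted on purpose.
KNOWN_WEIGHTS = {
    "batch_deadline": 0.12,
    "efficiency_bonus": 0.05,
    "stability_penalty": -0.05,
}


def weighted_reward(components, weights=KNOWN_WEIGHTS, instruction_weight=None):
    """Weighted sum of per-component scores.

    components: dict of component name -> score (assumed to be in 0..1).
    instruction_weight: Task 4 weight taken from the sampled instruction
        card (hypothetical parameter); None for Tasks 1-3.
    """
    total = 0.0
    for name, weight in weights.items():
        total += weight * components.get(name, 0.0)
    if instruction_weight is not None:
        total += instruction_weight * components.get("task_satisfaction", 0.0)
    return total
```

For example, scores of 1.0 for `batch_deadline`, 0.5 for `efficiency_bonus`, and 1.0 for `stability_penalty` give 0.12 + 0.025 - 0.05 = 0.095 from these three terms alone.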
|
| 68 |
|
|
|
|
| 72 |
|-------|------|-------------|
|
| 73 |
| indoor_temperature | float | °C |
|
| 74 |
| thermal_storage_level | float | 0-1 (0=empty, 1=full) |
|
| 75 |
+
| process_demand | float | kW current industrial power demand |
|
| 76 |
| current_price | float | $/kWh |
|
| 77 |
| grid_stress_signal | float | 0-1 (>0.7 = critical) |
|
| 78 |
+
| carbon_intensity | float | gCO2/kWh |
|
| 79 |
+
| hour_of_day | int | 0-23 |
|
| 80 |
+
| batch_queue | int[] | pending job deadline slots |
|
| 81 |
+
| cumulative_cost | float | $ total incurred this episode |
|
| 82 |
| hvac_efficiency | float | 1.0 → degrades to 0.5 over episode |
|
| 83 |
| active_faults | string[] | Active fault alarm strings |
|
| 84 |
| instruction_card | object | Task 4 objective only |
|
| 85 |
+
| price_forecast | float[] | 4-step upcoming price preview |
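
The 13 observation fields listed above could be mirrored on the client side with a small dataclass. This is a sketch under the assumption that the JSON keys match the field names in the table exactly; the `from_dict` and `grid_is_critical` helpers are hypothetical conveniences, not part of the environment.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class Observation:
    indoor_temperature: float      # °C
    thermal_storage_level: float   # 0-1 (0=empty, 1=full)
    process_demand: float          # kW
    current_price: float           # $/kWh
    grid_stress_signal: float      # 0-1 (>0.7 = critical)
    carbon_intensity: float        # gCO2/kWh
    hour_of_day: int               # 0-23
    batch_queue: List[int]         # pending job deadline slots
    cumulative_cost: float         # $ incurred this episode
    hvac_efficiency: float         # 1.0 degrading towards 0.5
    active_faults: List[str]       # active fault alarm strings
    price_forecast: List[float]    # 4-step upcoming price preview
    instruction_card: Optional[dict] = None  # Task 4 only

    @classmethod
    def from_dict(cls, payload):
        """Build an Observation from a decoded JSON dict, ignoring unknown keys."""
        known = {f.name for f in fields(cls)}
        return cls(**{k: v for k, v in payload.items() if k in known})

    def grid_is_critical(self):
        """True when the stress signal exceeds the 0.7 threshold from the table."""
        return self.grid_stress_signal > 0.7
```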
|
| 86 |
|
| 87 |
### Action Fields
|
| 88 |
|
|
|
|
| 95 |
|
| 96 |
---
|
| 97 |
|
| 98 |
+
## Core Capabilities
|
| 99 |
|
| 100 |
+
### Multi-Agent Coordination
|
| 101 |
A single oversight LLM coordinates multiple buildings through price signals. The coordinator reads `/feeder` to see fleet-wide demand, then sets per-building price multipliers via `/coordinate` to orchestrate behavior.
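
To make the price-signal idea concrete, here is one possible rule a coordinator could apply after reading fleet demand. It is only a sketch: the building IDs, the feeder limit, the proportional rule, and the cap are all assumptions, and the real `/feeder` and `/coordinate` payload shapes are not shown in this README excerpt.

```python
def price_multipliers(demand_kw, feeder_limit_kw, max_multiplier=2.0):
    """Return a per-building price multiplier (hypothetical rule).

    demand_kw: dict of building id -> current demand in kW.
    feeder_limit_kw: assumed capacity of the shared feeder in kW.
    Buildings drawing more than the fleet average get a larger multiplier
    when the fleet as a whole exceeds the feeder limit.
    """
    if feeder_limit_kw <= 0:
        raise ValueError("feeder_limit_kw must be positive")
    total = sum(demand_kw.values())
    if total <= feeder_limit_kw:
        # No overload: leave every building's price unchanged.
        return {building: 1.0 for building in demand_kw}
    overload = (total - feeder_limit_kw) / feeder_limit_kw
    fair_share = total / len(demand_kw)
    return {
        building: round(min(max_multiplier, 1.0 + overload * demand / fair_share), 3)
        for building, demand in demand_kw.items()
    }
```

With demands of 100 kW and 300 kW against a 320 kW limit, the fleet is 25% over capacity, so the lighter building gets a multiplier of 1.125 and the heavier one 1.375.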
|
| 102 |
|
| 103 |
+
### Long-Horizon Instruction Following
|
| 104 |
Task 4 presents a natural language objective card like "Keep total energy cost under $2.50 while maintaining 19-23°C". Agents must plan across all 96 steps rather than act greedily at each step.
|
| 105 |
|
| 106 |
+
These two capabilities map directly to Theme 1 and Theme 3 of the OpenEnv Hackathon.
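
The example card above ("under $2.50 while maintaining 19-23°C") suggests why per-step greedy control falls short: the cost budget has to be paced over the whole episode. The helpers below are a sketch of that bookkeeping; the function names and the idea of an even per-step budget are assumptions, and the real instruction card is an object whose structure is not shown here.

```python
TOTAL_STEPS = 96  # one episode


def remaining_budget_per_step(cumulative_cost, step, max_cost, total_steps=TOTAL_STEPS):
    """Average spend still allowed per remaining step.

    A negative result means the budget is already exceeded.
    """
    if not 0 <= step < total_steps:
        raise ValueError("step must be in 0..total_steps-1")
    return (max_cost - cumulative_cost) / (total_steps - step)


def comfort_fraction(temperatures, low=19.0, high=23.0):
    """Fraction of recorded steps whose temperature stayed within [low, high]."""
    if not temperatures:
        raise ValueError("no temperatures recorded")
    inside = sum(1 for t in temperatures if low <= t <= high)
    return inside / len(temperatures)
```

Halfway through an episode (step 48) with $1.00 already spent against a $2.50 cap, the agent can afford about $0.031 per remaining step on average.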
|
| 107 |
|
| 108 |
---
|
| 109 |
|
| 110 |
## Results
|
| 111 |
|
| 112 |
+
### What the Agent Learns
|
| 113 |
+
|
| 114 |
+
A naive heuristic runs HVAC at fixed levels based on time-of-day. After GRPO training on GridMind-RL, the agent learns to charge thermal storage during off-peak hours (4¢/kWh) and discharge during peak (32¢/kWh), voluntarily shed load during grid stress signals above 0.7, and adjust HVAC intensity as efficiency degrades over the episode. None of these behaviors are hardcoded — the agent discovers them through the reward signal alone.
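
For readers who want the three behaviours spelled out, the hand-written rule below mimics them. It is explicitly not the trained policy, which learns from reward alone; it is an illustration only. The 4¢ and 32¢ price levels and the 0.7 stress threshold come from the text, while the price cut-offs, the base HVAC level, and the compensation rule are assumptions.

```python
STRESS_CRITICAL = 0.7   # from the text: shed load above this signal
CHEAP_PRICE = 0.10      # $/kWh, hypothetical cut-off above the 4¢ off-peak rate
EXPENSIVE_PRICE = 0.25  # $/kWh, hypothetical cut-off below the 32¢ peak rate


def illustrative_policy(current_price, grid_stress_signal, hvac_efficiency, base_hvac=0.5):
    """Hand-coded imitation of the behaviours described above."""
    # Storage arbitrage: charge when power is cheap, discharge when it is expensive.
    if current_price <= CHEAP_PRICE:
        thermal_charge = 1.0
    elif current_price >= EXPENSIVE_PRICE:
        thermal_charge = -1.0
    else:
        thermal_charge = 0.0

    # Voluntary load shedding during critical grid stress (0.5 is the action maximum).
    load_shed = 0.5 if grid_stress_signal > STRESS_CRITICAL else 0.0

    # Compensate for degraded efficiency (1.0 down to 0.5) by raising HVAC level.
    hvac_level = min(1.0, base_hvac / max(hvac_efficiency, 0.5))

    return {
        "hvac_level": hvac_level,
        "thermal_charge": thermal_charge,
        "load_shed": load_shed,
    }
```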
|
| 115 |
|
| 116 |
| Policy | Task 1 | Task 2 | Task 3 | Task 4 |
|
| 117 |
|--------|--------|--------|--------|--------|
|
| 118 |
+
| Heuristic Baseline | 0.494 | 0.471 | 0.748 | 0.478 |
|
| 119 |
| Zero-shot LLM | 0.715 | 0.645 | 0.610 | 0.582 |
|
| 120 |
+
| GRPO Fine-tuned LLM | — | — | — | — |
|
| 121 |
+
|
| 122 |
+
> *GRPO fine-tuned scores will be updated after the full training run on a T4 GPU.
|
| 123 |
+
> Training plots below show live progress from the actual run.*
|
| 124 |
|
| 125 |
+

|
| 126 |
+
*Reward vs training step. Blue = per-step reward, red dashed = smoothed average.*
|
| 127 |
+
|
| 128 |
+

|
| 129 |
+
*Training loss decreases over steps, confirming that the model is updating.*
|
| 130 |
+
|
| 131 |
+

|
| 132 |
+
*Grade scores per task: heuristic baseline vs GRPO-trained LLM.*
|
| 133 |
+
|
| 134 |
+
> Scores are episode grade scores clamped to the open interval (0.0, 1.0). Heuristic = fixed policy with no learning. Zero-shot = Qwen2.5-1.5B-Instruct prompted with the task description, no fine-tuning, evaluated over 1 episode per task. Fine-tuned = GRPO-trained on the GridMind-RL environment.
|
| 135 |
+
|
| 136 |
+
> 🔄 **Live update:** GRPO fine-tuned scores will be filled in here immediately
|
| 137 |
+
> after the final training run completes on the T4 GPU.
|
| 138 |
|
| 139 |
---
|
| 140 |
|
|
|
|
| 184 |
|
| 185 |
---
|
| 186 |
|
| 187 |
## Architecture
|
| 188 |
|
| 189 |
```
|
|
|
|
| 216 |
| GET | /state | Get current state |
|
| 217 |
| GET | /grade | Grade episode (0.0-1.0 score) |
|
| 218 |
| GET | /tasks | Available tasks |
|
| 219 |
+
| GET | /metrics | Prometheus metrics |
|
| 220 |
| GET | /replay | Episode history |
|
| 221 |
| GET | /feeder | Aggregate fleet state |
|
| 222 |
| POST | /coordinate | Set price multipliers |
|
| 223 |
| POST | /simulate | World model prediction |
|
| 224 |
+
| POST | /coordinator/reset | Reset multi-building episode |
|
| 225 |
+
| POST | /coordinator/step | Step with per-building actions |
|
| 226 |
+
| GET | /info | OpenEnv metadata |
|
| 227 |
+
| GET | /ws | WebSocket endpoint |
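
A thin client for the endpoints in this table might look like the sketch below, using only the Python standard library. The base URL is the Space URL used elsewhere in this README; the assumption that every endpoint returns JSON, and any request payload keys, are not confirmed by this excerpt.

```python
import json
from urllib.request import Request, urlopen

BASE_URL = "https://prajwal782007-gridmind.hf.space"


def build_request(method, path, payload=None, base_url=BASE_URL):
    """Create a urllib Request for one endpoint without sending it."""
    url = base_url.rstrip("/") + "/" + path.lstrip("/")
    headers = {"Accept": "application/json"}
    data = None
    if payload is not None:
        data = json.dumps(payload).encode("utf-8")
        headers["Content-Type"] = "application/json"
    return Request(url, data=data, headers=headers, method=method)


def call(method, path, payload=None, timeout=10):
    """Send the request and decode the body, assuming a JSON response."""
    with urlopen(build_request(method, path, payload), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

For example, `call("GET", "/tasks")` would list the available tasks and `call("GET", "/grade")` would fetch the episode score, provided the Space is running.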
|
| 228 |
|
| 229 |
---
|
| 230 |
|
|
|
|
| 236 |
├── inference.py # Agent entry point (LLM + heuristic)
|
| 237 |
├── openenv.yaml # OpenEnv spec
|
| 238 |
├── Dockerfile # Container build
|
| 239 |
+
├── HF_BLOG_POST.md # Blog write-up
|
| 240 |
+
├── baseline_scores.json # Heuristic baseline scores
|
| 241 |
├── env/
|
| 242 |
│ ├── environment.go # Physics simulation
|
| 243 |
│ ├── models.go # Data models
|
|
|
|
| 248 |
│ ├── train_unsloth.py # GRPO training
|
| 249 |
│ ├── plot_results.py # Training curve visualizer
|
| 250 |
│ ├── multi_building_demo.py # Fleet AI demo
|
| 251 |
+
│ └── gridmind_grpo_colab.ipynb # Colab training notebook
|
| 252 |
+
├── server/
|
| 253 |
+
│ └── app.py # Python fallback server
|
| 254 |
├── dashboard/
|
| 255 |
│ ├── server.py # Web server (port 7861)
|
| 256 |
│ └── static/ # Frontend assets
|
| 257 |
+
├── curves/ # Training curves (train N/)
|
| 258 |
+
│ └── train N/ # Per-run plots
|
| 259 |
├── results/ # Training outputs (generated)
|
| 260 |
└── README.md
|
| 261 |
```
|
|
|
|
| 265 |
## Links
|
| 266 |
|
| 267 |
- 🤗 HuggingFace Space: [GridMind-RL](https://prajwal782007-gridmind.hf.space)
|
| 268 |
+
- 📓 Training Notebook: [gridmind_grpo_colab.ipynb](https://colab.research.google.com/github/LO-Kyu/gridmind/blob/main/scripts/gridmind_grpo_colab.ipynb)
|
| 269 |
+
- 📝 Blog Post: [Read the write-up](./HF_BLOG_POST.md)
|
| 270 |
+
- 🐙 GitHub: [Code Repository](https://github.com/LO-Kyu/gridmind)
|
| 271 |
|
| 272 |
---
|
| 273 |
|