ShreeshantXD committed on
Commit
52635ef
·
1 Parent(s): 28abef0

Updated Readme for ROund 2

Files changed (1)
  1. README.md +50 -54
README.md CHANGED
@@ -21,7 +21,7 @@ license: mit
21
 
22
  ## Why This Environment Is Novel
23
 
24
- Most RL environments for LLMs are grid-worlds or toy games. GridMind-RL simulates a **real industrial problem**, building energy management, where agents must juggle stochastic electricity prices, multi-objective constraints, equipment faults, and natural language operating objectives. An LLM that learns to manage a building under these conditions has a genuinely useful skill, not just a high game score.
25
 
26
  ## Live Demo
27
 
@@ -38,24 +38,11 @@ curl https://prajwal782007-gridmind.hf.space/tasks
38
 
39
  ---
40
 
41
- ## Problem
42
-
43
- Industrial buildings consume ~40% of global electricity, yet most still use naive "always-on" HVAC policies. The capability gap is clear: **LLMs can understand complex pricing curves, natural language instructions, and fault alerts—but no environment exists to train them to manage buildings.**
44
-
45
- GridMind-RL closes this gap by simulating a complete building energy system where agents must:
46
- - Navigate 24-hour price volatility (off-peak vs peak: 4¢ to 32¢/kWh)
47
- - Maintain comfort (19-23°C) while minimizing cost
48
- - Respond to grid stress emergencies
49
- - Handle equipment faults (chiller failure, sensor malfunction, grid outages)
50
- - Parse and follow natural language objective cards
51
-
52
- ---
53
-
54
  ## Environment
55
 
56
  | | Description |
57
  |---|-------------|
58
- | **Observation** | 11 fields: temperature, storage, price, stress, carbon, faults, HVAC efficiency |
59
  | **Actions** | HVAC level (0-1), thermal charge (-1 to 1), batch slot (0-4), load shed (0-0.5) |
60
  | **Reward** | 9-component weighted sum: cost, temperature, grid, deadline, efficiency, stability, carbon, instruction, fault_mitigation |
61
  | **Episode** | 96 steps = 24 simulated hours @ 15-min resolution |
@@ -74,8 +61,8 @@ Weights reflect real-world building operator priorities — not arbitrary values
74
  | `batch_deadline` | 0.12 | Production continuity — missing batch deadlines causes downstream losses |
75
  | `efficiency_bonus` | 0.05 | Storage arbitrage — incentivises smart charge/discharge timing |
76
  | `stability_penalty` | -0.05 | Anti-cycling — prevents HVAC thrashing that causes equipment wear |
77
- | `fault_mitigation` | 0.05 | Emergency response: correct fault handling prevents costly outages |
78
- | `instruction_reward` | 0.50* | Task 4 only: weighted per the episode's instruction card |
79
 
80
  > *Task 4 instruction reward weight comes from the sampled instruction card, not a fixed value.
81
 
@@ -85,11 +72,17 @@ Weights reflect real-world building operator priorities — not arbitrary values
85
  |-------|------|-------------|
86
  | indoor_temperature | float | °C |
87
  | thermal_storage_level | float | 0-1 (0=empty, 1=full) |
 
88
  | current_price | float | $/kWh |
89
  | grid_stress_signal | float | 0-1 (>0.7 = critical) |
 
 
 
 
90
  | hvac_efficiency | float | 1.0 → degrades to 0.5 over episode |
91
  | active_faults | string[] | Active fault alarm strings |
92
  | instruction_card | object | Task 4 objective only |
 
93
 
94
  ### Action Fields
95
 
@@ -102,41 +95,46 @@ Weights reflect real-world building operator priorities — not arbitrary values
102
 
103
  ---
104
 
105
- ## Five Tracks
106
 
107
- ### Track 1: Multi-Agent Interactions
108
  A single oversight LLM coordinates multiple buildings through price signals. The coordinator reads `/feeder` to see fleet-wide demand, then sets per-building price multipliers via `/coordinate` to orchestrate behavior.
109
 
110
- ### Track 2: Long-Horizon Planning & Instruction Following
111
  Task 4 presents a natural language objective card like "Keep total energy cost under $2.50 while maintaining 19-23°C". Agents must plan across all 96 steps—not greedy per-step control.
112
 
113
- ### Track 3: World Modeling
114
- The `/simulate` endpoint lets agents ask "what if?" before acting. When HVAC efficiency is low or faults are active, the agent simulates the proposed action and revises if the predicted reward is poor.
115
-
116
- ### Track 4: Fault Handling (Wild Card)
117
- Four fault types inject unpredictability:
118
- - **Chiller failure**: HVAC drops to 20% capacity
119
- - **Grid outage**: Price ×3, stress = 1.0
120
- - **Sensor fault**: Temperature readings jitter ±5°C
121
- - **Tariff spike**: Emergency 4× price surge
122
-
123
- ### Track 5: HVAC Degradation
124
- Real HVAC systems degrade over time. Efficiency starts at 1.0 and drops ~0.1% per step. The agent must account for declining capacity—a hidden state requiring inference.
125
 
126
  ---
127
 
128
  ## Results
129
 
130
- ![Training Curve](results/training_curve.png)
131
- *Episode grade scores vs training step. Heuristic baseline (red) vs GRPO fine-tuned LLM (teal). Higher = better energy management.*
 
132
 
133
  | Policy | Task 1 | Task 2 | Task 3 | Task 4 |
134
  |--------|--------|--------|--------|--------|
135
- | Heuristic Baseline | 0.506 | 0.459 | 0.600 | 0.492 |
136
  | Zero-shot LLM | 0.715 | 0.645 | 0.610 | 0.582 |
137
- | GRPO Fine-tuned LLM | TBD | TBD | TBD | TBD |
 
 
 
138
 
139
- > Scores are episode grade scores (0.0–1.0, clamped open interval). Heuristic = fixed policy with no learning. Zero-shot = pretrained Qwen2.5-7B-Instruct. Fine-tuned = GRPO-trained on GridMind-RL environment.
 
 
 
 
 
 
 
 
 
 
 
 
140
 
141
  ---
142
 
@@ -186,18 +184,6 @@ python scripts/plot_results.py
186
 
187
  ---
188
 
189
- ## Self-Improvement: Curriculum Learning
190
-
191
- The `--curriculum` flag enables automatic task progression:
192
- - Agent starts on Task 1 (easy)
193
- - After 5 episodes with average reward ≥ 0.55, advances to Task 2
194
- - After 5 episodes with average reward ≥ 0.50, advances to Task 3
195
- - After 5 episodes with average reward ≥ 0.45, advances to Task 4
196
-
197
- This directly targets the Self-Improvement hackathon theme.
198
-
199
- ---
200
-
201
  ## Architecture
202
 
203
  ```
@@ -230,11 +216,15 @@ Web Dashboard (dashboard/server.py) → Port 7861
230
  | GET | /state | Get current state |
231
  | GET | /grade | Grade episode (0.0-1.0 score) |
232
  | GET | /tasks | Available tasks |
233
- | GET | /metrics | System metrics |
234
  | GET | /replay | Episode history |
235
  | GET | /feeder | Aggregate fleet state |
236
  | POST | /coordinate | Set price multipliers |
237
  | POST | /simulate | World model prediction |
 
 
 
 
238
 
239
  ---
240
 
@@ -246,6 +236,8 @@ gridmind-rl/
246
  ├── inference.py # Agent entry point (LLM + heuristic)
247
  ├── openenv.yaml # OpenEnv spec
248
  ├── Dockerfile # Container build
 
 
249
  ├── env/
250
  │ ├── environment.go # Physics simulation
251
  │ ├── models.go # Data models
@@ -256,10 +248,14 @@ gridmind-rl/
256
  │ ├── train_unsloth.py # GRPO training
257
  │ ├── plot_results.py # Training curve visualizer
258
  │ ├── multi_building_demo.py # Fleet AI demo
259
- │ └── run_baseline.sh # Baseline scorer
 
 
260
  ├── dashboard/
261
  │ ├── server.py # Web server (port 7861)
262
  │ └── static/ # Frontend assets
 
 
263
  ├── results/ # Training outputs (generated)
264
  └── README.md
265
  ```
@@ -269,9 +265,9 @@ gridmind-rl/
269
  ## Links
270
 
271
  - 🤗 HuggingFace Space: [GridMind-RL](https://prajwal782007-gridmind.hf.space)
272
- - 📝 Blog: [Read the blog post](./HF_BLOG_POST.md)
273
- - 📓 Colab Notebook: [gridmind_grpo_colab.ipynb](https://colab.research.google.com/github/LO-Kyu/gridmind/blob/main/scripts/gridmind_grpo_colab.ipynb)
274
- - 🐙 Code Repository: [GitHub](https://github.com/LO-Kyu/gridmind)
275
 
276
  ---
277
 
 
21
 
22
  ## Why This Environment Is Novel
23
 
24
+ Industrial buildings consume ~40% of global electricity yet rely on naive "always-on" HVAC policies. LLMs can reason about pricing curves, fault alerts, and natural language objectives—but no environment trains them for this. GridMind-RL simulates a full 24-hour building energy system with stochastic electricity prices, equipment faults, and instruction cards, creating a genuinely challenging domain where learned policies translate to real operational value.
25
 
26
  ## Live Demo
27
 
 
38
 
39
  ---
40
 
 
 
 
 
 
 
 
 
 
 
 
 
 
41
  ## Environment
42
 
43
  | | Description |
44
  |---|-------------|
45
+ | **Observation** | 13 fields: temperature, storage, price, stress, carbon, faults, HVAC efficiency, process demand, batch queue, price forecast |
46
  | **Actions** | HVAC level (0-1), thermal charge (-1 to 1), batch slot (0-4), load shed (0-0.5) |
47
  | **Reward** | 9-component weighted sum: cost, temperature, grid, deadline, efficiency, stability, carbon, instruction, fault_mitigation |
48
  | **Episode** | 96 steps = 24 simulated hours @ 15-min resolution |
 
61
  | `batch_deadline` | 0.12 | Production continuity — missing batch deadlines causes downstream losses |
62
  | `efficiency_bonus` | 0.05 | Storage arbitrage — incentivises smart charge/discharge timing |
63
  | `stability_penalty` | -0.05 | Anti-cycling — prevents HVAC thrashing that causes equipment wear |
64
+ | `task_satisfaction` | 0.50* | Task 4 only: weighted per the episode's instruction card |
65
+ | `fault_mitigation` | dynamic | Emergency response: computed based on fault type and response |
66
 
67
  > *Task 4 instruction reward weight comes from the sampled instruction card, not a fixed value.
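
As a rough illustration of how a weighted-sum reward with a per-episode instruction weight could be combined, here is a minimal sketch. Only the three weights visible in this hunk are filled in; `weighted_reward`, its arguments, and the convention that each component arrives as a non-negative magnitude are assumptions for illustration, not the environment's actual implementation.

```python
from typing import Optional

# Only the weights visible in this diff hunk; the remaining components'
# weights are not shown here and must be supplied by the caller.
VISIBLE_WEIGHTS = {
    "batch_deadline": 0.12,
    "efficiency_bonus": 0.05,
    "stability_penalty": -0.05,
}


def weighted_reward(
    components: dict[str, float],
    weights: dict[str, float],
    instruction_weight: Optional[float] = None,
) -> float:
    """Combine per-component scores into one scalar reward (hypothetical helper).

    `task_satisfaction` is weighted by `instruction_weight`, which stands in
    for the value read from the sampled instruction card; it is skipped when
    no card is active.
    """
    total = 0.0
    for name, value in components.items():
        if name == "task_satisfaction":
            if instruction_weight is None:
                continue  # not a Task 4 episode
            weight = instruction_weight
        else:
            weight = weights[name]  # KeyError flags an unknown component
        total += weight * value
    return total
```

For example, `weighted_reward({"batch_deadline": 1.0, "task_satisfaction": 0.8}, VISIBLE_WEIGHTS, instruction_weight=0.5)` adds the deadline term and the instruction term; without `instruction_weight` the instruction term is ignored.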
68
 
 
72
  |-------|------|-------------|
73
  | indoor_temperature | float | °C |
74
  | thermal_storage_level | float | 0-1 (0=empty, 1=full) |
75
+ | process_demand | float | kW (current industrial power demand) |
76
  | current_price | float | $/kWh |
77
  | grid_stress_signal | float | 0-1 (>0.7 = critical) |
78
+ | carbon_intensity | float | gCO2/kWh |
79
+ | hour_of_day | int | 0-23 |
80
+ | batch_queue | int[] | pending job deadline slots |
81
+ | cumulative_cost | float | $ (total incurred this episode) |
82
  | hvac_efficiency | float | 1.0 → degrades to 0.5 over episode |
83
  | active_faults | string[] | Active fault alarm strings |
84
  | instruction_card | object | Task 4 objective only |
85
+ | price_forecast | float[] | 4-step upcoming price preview |
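
The 13 observation fields listed above could be mirrored on the agent side with a small typed container. The sketch below assumes the JSON keys match the field names in the table exactly; the `Observation` class, `parse_observation`, and the `grid_is_critical` helper are hypothetical names, not part of the repository.

```python
from dataclasses import dataclass, field, fields
from typing import Any, Optional


@dataclass
class Observation:
    """One observation, mirroring the fields in the table (assumed key names)."""

    indoor_temperature: float      # °C
    thermal_storage_level: float   # 0-1 (0=empty, 1=full)
    process_demand: float          # kW
    current_price: float           # $/kWh
    grid_stress_signal: float      # 0-1
    carbon_intensity: float        # gCO2/kWh
    hour_of_day: int               # 0-23
    cumulative_cost: float         # $
    hvac_efficiency: float         # 1.0 degrading toward 0.5
    batch_queue: list[int] = field(default_factory=list)
    active_faults: list[str] = field(default_factory=list)
    price_forecast: list[float] = field(default_factory=list)
    instruction_card: Optional[dict[str, Any]] = None  # Task 4 only

    @property
    def grid_is_critical(self) -> bool:
        # The table marks a stress signal above 0.7 as critical.
        return self.grid_stress_signal > 0.7


def parse_observation(payload: dict[str, Any]) -> Observation:
    """Build an Observation from a decoded JSON dict, ignoring unknown keys.

    Raises TypeError if a required scalar field is missing.
    """
    known = {f.name for f in fields(Observation)}
    return Observation(**{k: v for k, v in payload.items() if k in known})
```

Ignoring unknown keys keeps the parser tolerant if the server adds fields later, while missing required fields still fail loudly.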
86
 
87
  ### Action Fields
88
 
 
95
 
96
  ---
97
 
98
+ ## Core Capabilities
99
 
100
+ ### Multi-Agent Coordination
101
  A single oversight LLM coordinates multiple buildings through price signals. The coordinator reads `/feeder` to see fleet-wide demand, then sets per-building price multipliers via `/coordinate` to orchestrate behavior.
102
 
103
+ ### Long-Horizon Instruction Following
104
  Task 4 presents a natural language objective card like "Keep total energy cost under $2.50 while maintaining 19-23°C". Agents must plan across all 96 steps—not greedy per-step control.
105
 
106
+ These two capabilities map directly to Theme 1 and Theme 3 of the OpenEnv Hackathon.
 
 
 
 
 
 
 
 
 
 
 
107
 
108
  ---
109
 
110
  ## Results
111
 
112
+ ### What the Agent Learns
113
+
114
+ A naive heuristic runs HVAC at fixed levels based on time-of-day. After GRPO training on GridMind-RL, the agent learns to charge thermal storage during off-peak hours (4¢/kWh) and discharge during peak (32¢/kWh), voluntarily shed load during grid stress signals above 0.7, and adjust HVAC intensity as efficiency degrades over the episode. None of these behaviors are hardcoded — the agent discovers them through the reward signal alone.
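
To make the three behaviours concrete, here is a hand-written rule that imitates them. It is only an illustration of what the described policy looks like, not the trained agent: the action key names, the midpoint price split, the linear load-shed ramp, and the `base_hvac` parameter are all assumptions; only the price levels, the 0.7 stress threshold, and the action ranges come from the README.

```python
OFF_PEAK_PRICE = 0.04   # $/kWh, from the README
PEAK_PRICE = 0.32       # $/kWh, from the README
STRESS_CRITICAL = 0.7   # grid stress threshold, from the README
MAX_LOAD_SHED = 0.5     # upper bound of the load-shed action


def rule_based_action(
    price: float,
    stress: float,
    storage_level: float,
    hvac_efficiency: float,
    base_hvac: float = 0.5,
) -> dict[str, float]:
    """Return an action dict that mimics the behaviours described above."""
    # Assumed split: treat prices below the off-peak/peak midpoint as cheap.
    mid_price = (OFF_PEAK_PRICE + PEAK_PRICE) / 2
    if price < mid_price and storage_level < 1.0:
        thermal_charge = 1.0    # charge storage while energy is cheap
    elif price >= mid_price and storage_level > 0.0:
        thermal_charge = -1.0   # discharge storage while energy is expensive
    else:
        thermal_charge = 0.0

    # Shed load only above the critical stress level, ramping linearly to the cap.
    if stress > STRESS_CRITICAL:
        ramp = (stress - STRESS_CRITICAL) / (1.0 - STRESS_CRITICAL)
        load_shed = min(MAX_LOAD_SHED, MAX_LOAD_SHED * ramp)
    else:
        load_shed = 0.0

    # Raise the HVAC command as efficiency drops, capped at the action maximum.
    hvac_level = min(1.0, base_hvac / max(hvac_efficiency, 0.5))

    return {
        "hvac_level": hvac_level,
        "thermal_charge": thermal_charge,
        "batch_slot": 0,  # placeholder: batch scheduling is not modelled here
        "load_shed": load_shed,
    }
```

A fixed rule like this is a useful reference point when reading the scores below: the claim in the paragraph above is that the trained agent arrives at similar behaviour from the reward signal rather than from explicit thresholds.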
115
 
116
  | Policy | Task 1 | Task 2 | Task 3 | Task 4 |
117
  |--------|--------|--------|--------|--------|
118
+ | Heuristic Baseline | 0.494 | 0.471 | 0.748 | 0.478 |
119
  | Zero-shot LLM | 0.715 | 0.645 | 0.610 | 0.582 |
120
+ | GRPO Fine-tuned LLM | | | | |
121
+
122
+ > *GRPO fine-tuned scores will be added after the full training run on a T4 GPU.
123
+ > Training plots below show live progress from the actual run.*
124
 
125
+ ![Reward Curve](curves/train%202/reward_curve.png)
126
+ *Reward vs training step. Blue = per-step reward, red dashed = smoothed average.*
127
+
128
+ ![Loss Curve](curves/train%202/loss_curve.png)
129
+ *Training loss decreasing over steps — confirms the model is updating.*
130
+
131
+ ![Baseline Comparison](curves/train%202/baseline_comparison.png)
132
+ *Grade scores per task: heuristic baseline vs GRPO-trained LLM.*
133
+
134
+ > Scores are episode grade scores (0.0–1.0, clamped open interval). Heuristic = fixed policy with no learning. Zero-shot = Qwen2.5-1.5B-Instruct prompted with task description, no fine-tuning, evaluated over 1 episode per task. Fine-tuned = GRPO-trained on GridMind-RL environment.
135
+
136
+ > 🔄 **Live update:** GRPO fine-tuned scores will be filled in here immediately
137
+ > after the final training run completes on the T4 GPU.
138
 
139
  ---
140
 
 
184
 
185
  ---
186
 
 
 
 
 
 
 
 
 
 
 
 
 
187
  ## Architecture
188
 
189
  ```
 
216
  | GET | /state | Get current state |
217
  | GET | /grade | Grade episode (0.0-1.0 score) |
218
  | GET | /tasks | Available tasks |
219
+ | GET | /metrics | Prometheus metrics |
220
  | GET | /replay | Episode history |
221
  | GET | /feeder | Aggregate fleet state |
222
  | POST | /coordinate | Set price multipliers |
223
  | POST | /simulate | World model prediction |
224
+ | POST | /coordinator/reset | Reset multi-building episode |
225
+ | POST | /coordinator/step | Step with per-building actions |
226
+ | GET | /info | OpenEnv metadata |
227
+ | GET | /ws | WebSocket endpoint |
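
A client for these endpoints might prepare its requests as in the sketch below, which uses only the Python standard library and builds the requests without sending them. The base URL is the Space linked in this README; the JSON body for `/coordinate` (a `price_multipliers` object keyed by building ID) is a guess at the payload shape and should be checked against the server before use.

```python
import json
from urllib.request import Request

BASE_URL = "https://prajwal782007-gridmind.hf.space"


def build_get(path: str) -> Request:
    """Prepare a GET request for a read-only endpoint such as /state or /grade."""
    return Request(BASE_URL + path, method="GET")


def build_coordinate(multipliers: dict[str, float]) -> Request:
    """Prepare a POST to /coordinate.

    Assumption: the server accepts {"price_multipliers": {...}}; the real
    payload schema is not shown in this README.
    """
    body = json.dumps({"price_multipliers": multipliers}).encode("utf-8")
    return Request(
        BASE_URL + "/coordinate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Either request can then be sent with `urllib.request.urlopen(req)` while the Space is running; a coordinator loop would typically read `/feeder` first and post multipliers afterwards.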
228
 
229
  ---
230
 
 
236
  ├── inference.py # Agent entry point (LLM + heuristic)
237
  ├── openenv.yaml # OpenEnv spec
238
  ├── Dockerfile # Container build
239
+ ├── HF_BLOG_POST.md # Blog write-up
240
+ ├── baseline_scores.json # Heuristic baseline scores
241
  ├── env/
242
  │ ├── environment.go # Physics simulation
243
  │ ├── models.go # Data models
 
248
  │ ├── train_unsloth.py # GRPO training
249
  │ ├── plot_results.py # Training curve visualizer
250
  │ ├── multi_building_demo.py # Fleet AI demo
251
+ │ └── gridmind_grpo_colab.ipynb # Colab training notebook
252
+ ├── server/
253
+ │ └── app.py # Python fallback server
254
  ├── dashboard/
255
  │ ├── server.py # Web server (port 7861)
256
  │ └── static/ # Frontend assets
257
+ ├── curves/ # Training curves (train N/)
258
+ │ └── train N/ # Per-run plots
259
  ├── results/ # Training outputs (generated)
260
  └── README.md
261
  ```
 
265
  ## Links
266
 
267
  - 🤗 HuggingFace Space: [GridMind-RL](https://prajwal782007-gridmind.hf.space)
268
+ - 📓 Training Notebook: [gridmind_grpo_colab.ipynb](https://colab.research.google.com/github/LO-Kyu/gridmind/blob/main/scripts/gridmind_grpo_colab.ipynb)
269
+ - 📝 Blog Post: [Read the write-up](./HF_BLOG_POST.md)
270
+ - 🐙 GitHub: [Code Repository](https://github.com/LO-Kyu/gridmind)
271
 
272
  ---
273