adityss committed on
Commit
2256ed6
·
1 Parent(s): c70e17d

docs: add HF blog post draft for community posting

Files changed (1)
  1. HF_BLOG_POST.md +94 -0
HF_BLOG_POST.md ADDED
@@ -0,0 +1,94 @@
+ ---
+ title: "GridMind-RL: Training LLMs to Manage Industrial Buildings Under Faults and Grid Stress"
+ description: An OpenEnv-compatible RL environment where LLMs learn to control HVAC, thermal storage, and batch scheduling across multi-building industrial facilities.
+ ---
+
+ **Industrial buildings can waste 20–30% of their energy when control systems can't handle real-time pricing, equipment faults, and grid stress simultaneously.** GridMind-RL is an OpenEnv-compatible RL environment for training LLMs on this problem.
+
+ ## The Problem
+
+ Industrial buildings consume ~40% of global electricity. Most still use naive "always-on" HVAC policies. The capability gap is clear:
+
+ - LLMs can understand complex pricing curves, fault alerts, and natural language instructions
+ - But no environment exists to train them on real building energy management
+ - Existing RL environments are mostly grid-worlds or toy games, not genuine industrial problems
+
+ GridMind-RL closes this gap by simulating a complete building energy system where agents must:
+
+ - Navigate 24-hour price volatility (from 4¢/kWh off-peak to 32¢/kWh at peak)
+ - Maintain comfort (19–23°C) while minimizing cost
+ - Respond to grid stress emergencies
+ - Handle equipment faults (chiller failure, sensor malfunction, grid outages, tariff spikes)
+ - Parse and follow natural language objective cards
+
+ ## The Environment
+
+ Each GridMind-RL episode runs for 96 steps (24 simulated hours at 15-minute resolution) with:
+
+ | Field | Value |
+ |-------|-------|
+ | **Observation** | 13 fields, including temperature, storage, price, stress, carbon, faults, HVAC efficiency, and the instruction card |
+ | **Actions** | HVAC level (0–1), thermal charge (−1 to 1), batch slot (0–4), load shed (0–0.5) |
+ | **Reward** | 9-component weighted sum: cost, temperature, grid, deadline, efficiency, stability, carbon, instruction, fault_mitigation |
+ | **Tasks** | 4 types: cost minimization, temperature management, demand response, instruction following |
+
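The action ranges in the table can be enforced on the client side before an action is sent to the environment. The sketch below is illustrative only: the `Action` class, its field names, and `from_raw` are assumptions, and only the numeric ranges and the 96-step episode length come from the table above.

```python
from dataclasses import dataclass

# 24 simulated hours at 15-minute resolution gives 96 steps per episode.
STEPS_PER_EPISODE = 24 * 60 // 15


def clamp(value: float, low: float, high: float) -> float:
    """Limit value to the closed interval [low, high]."""
    return max(low, min(high, value))


@dataclass(frozen=True)
class Action:
    # Field names are hypothetical; only the ranges come from the table.
    hvac_level: float      # 0 to 1
    thermal_charge: float  # -1 to 1
    batch_slot: int        # 0 to 4
    load_shed: float       # 0 to 0.5

    @classmethod
    def from_raw(cls, hvac_level, thermal_charge, batch_slot, load_shed):
        """Build an Action from unchecked model output, clamping each field."""
        return cls(
            hvac_level=clamp(float(hvac_level), 0.0, 1.0),
            thermal_charge=clamp(float(thermal_charge), -1.0, 1.0),
            batch_slot=int(clamp(round(float(batch_slot)), 0, 4)),
            load_shed=clamp(float(load_shed), 0.0, 0.5),
        )
```

Clamping rather than rejecting keeps a rollout alive when an LLM emits an out-of-range number, at the cost of hiding the mistake from the model. Whether the real environment clamps or rejects such actions is not stated here.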
+ ### Five Hackathon Themes in One Environment
+
+ **Track 1 - Multi-Agent Interactions:** A coordinator LLM reads `/feeder` to see fleet-wide demand across 3 buildings, then sets per-building price multipliers via `/coordinate` to orchestrate behavior.
+
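To make the coordinator's job concrete, here is one possible multiplier rule. It is a sketch and not GridMind-RL's own logic: the function name, the proportional rule, and the `gain`, `low`, and `high` parameters are all assumptions, and the payload formats of `/feeder` and `/coordinate` are not shown here.

```python
from typing import Dict


def price_multipliers(
    demand_kw: Dict[str, float],
    gain: float = 0.5,
    low: float = 0.5,
    high: float = 2.0,
) -> Dict[str, float]:
    """Hypothetical rule: raise the multiplier for buildings above the fleet average.

    demand_kw maps a building id to its current demand, as a coordinator
    might read it from the feeder view.
    """
    average = sum(demand_kw.values()) / len(demand_kw)
    if average <= 0:
        # No load anywhere: leave every price unchanged.
        return {building: 1.0 for building in demand_kw}
    return {
        building: max(low, min(high, 1.0 + gain * (demand / average - 1.0)))
        for building, demand in demand_kw.items()
    }
```

With demands of 100, 200, and 300 kW, this rule yields multipliers of 0.75, 1.0, and 1.25, nudging the heaviest consumer to shed load first.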
+ **Track 2 - Long-Horizon Planning & Instruction Following:** Task 4 presents a natural language objective card like "Keep total energy cost under $2.50 while maintaining 19–23°C." Agents must plan across all 96 steps.
+
+ **Track 3 - World Modeling:** The `/simulate` endpoint lets agents ask "what if?" before acting. When HVAC efficiency is low or faults are active, the agent simulates the proposed action and revises if the predicted reward is poor.
+
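The simulate-then-revise loop described for Track 3 can be sketched as follows. The `simulate` callable stands in for a call to `/simulate`; its request and response format are not shown here, so it is injected rather than implemented. `pick_action`, `fallbacks`, and `min_reward` are hypothetical names.

```python
from typing import Callable, Dict, Sequence

Action = Dict[str, float]


def pick_action(
    proposed: Action,
    fallbacks: Sequence[Action],
    simulate: Callable[[Action], float],
    min_reward: float,
) -> Action:
    """Keep the proposed action if its predicted reward is acceptable.

    Otherwise try the fallback candidates and return whichever action
    (proposed included) has the highest predicted reward.
    """
    predicted = simulate(proposed)
    if predicted >= min_reward:
        return proposed

    best, best_reward = proposed, predicted
    for candidate in fallbacks:
        reward = simulate(candidate)
        if reward > best_reward:
            best, best_reward = candidate, reward
    return best
```

Each extra candidate costs one more simulation call, so a real agent would likely trigger this path only when HVAC efficiency is low or a fault is active, as described above.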
+ **Track 4 - Fault Handling:** Four fault types inject unpredictability:
+ - **Chiller failure**: HVAC drops to 20% capacity
+ - **Grid outage**: Price ×3, stress = 1.0
+ - **Sensor fault**: Temperature readings jitter ±5°C
+ - **Tariff spike**: Emergency 4× price surge
+
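The four fault effects listed above can be written as a small state transform. This is a sketch under stated assumptions: `PlantState`, its fields, the fault identifiers, and the uniform noise model for the sensor fault are hypothetical, and only the multipliers and magnitudes come from the list.

```python
import random
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class PlantState:
    # Hypothetical state fields, for illustration only.
    hvac_capacity: float = 1.0   # fraction of nominal HVAC capacity
    price: float = 0.10          # $/kWh
    grid_stress: float = 0.0     # 0.0 to 1.0
    sensed_temp_c: float = 21.0  # what the agent observes


def apply_fault(state: PlantState, fault: str, rng: random.Random) -> PlantState:
    """Return a copy of state with the named fault's effect applied."""
    if fault == "chiller_failure":
        return replace(state, hvac_capacity=0.2)
    if fault == "grid_outage":
        return replace(state, price=state.price * 3, grid_stress=1.0)
    if fault == "sensor_fault":
        # Assumption: jitter is drawn uniformly from [-5, 5] degrees C.
        noisy = state.sensed_temp_c + rng.uniform(-5.0, 5.0)
        return replace(state, sensed_temp_c=noisy)
    if fault == "tariff_spike":
        return replace(state, price=state.price * 4)
    raise ValueError(f"unknown fault: {fault}")
```

Note that the sensor fault changes only what the agent observes, not the true temperature, which is why an agent cannot simply trust a single reading while that fault is active.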
+ **Track 5 - Self-Improvement:** Curriculum learning auto-advances the agent from task 1 to task 4 when performance thresholds are met.
+
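A threshold-based curriculum like the one described for Track 5 might look like the sketch below. The `Curriculum` class, the rolling-window average, and the threshold values a caller passes in are assumptions. The actual thresholds and advancement rule are not given here.

```python
from collections import deque
from typing import Dict


class Curriculum:
    """Advance from task 1 towards the final task once scores clear a threshold."""

    def __init__(self, thresholds: Dict[int, float], window: int = 5):
        # thresholds[task] is the mean score needed to leave that task.
        # A task with no entry (the final one) is never left.
        self.thresholds = thresholds
        self.window = window
        self.task = 1
        self.scores = deque(maxlen=window)

    def record(self, score: float) -> int:
        """Record one episode score and return the task to run next."""
        self.scores.append(score)
        threshold = self.thresholds.get(self.task)
        if threshold is None or len(self.scores) < self.window:
            return self.task
        if sum(self.scores) / len(self.scores) >= threshold:
            self.task += 1
            self.scores.clear()  # start the next task with a fresh window
        return self.task
```

Averaging over a window rather than advancing on a single good episode guards against promoting the agent on a lucky rollout.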
+ ## Results
+
+ Heuristic baseline scores (fixed policy, no learning) across all 4 tasks:
+
+ | Policy | Task 1 | Task 2 | Task 3 | Task 4 |
+ |--------|--------|--------|--------|--------|
+ | **Heuristic Baseline** | 0.506 | 0.459 | 0.600 | 0.492 |
+
+ The GRPO fine-tuned model shows improvement over the zero-shot LLM baseline. The training curve below shows the learning trajectory:
+
+ ![Training Curve](https://raw.githubusercontent.com/LO-Kyu/gridmind/main/results/training_curve.png)
+
+ ## Training
+
+ GridMind-RL uses GRPO (Group Relative Policy Optimization) via Hugging Face TRL with Unsloth 4-bit LoRA fine-tuning of Qwen2.5-0.5B-Instruct. The training script connects to the live environment over HTTP, runs 8-step rollouts, and uses the `/grade` endpoint (an episode-level score from 0.0 to 1.0) as the primary reward signal.
+
+ ```bash
+ # Training runs against the live environment
+ python scripts/train_unsloth.py --steps 500 --output-csv results/training_log.csv
+ ```
+
+ Or run the Colab notebook: [gridmind_grpo_colab.ipynb](https://colab.research.google.com/)
+
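The "group relative" part of GRPO means each sampled completion is scored against the other completions for the same prompt instead of against a learned value function. The sketch below shows that normalization on a list of `/grade`-style scores. It is a simplified illustration, not the TRL implementation; the function name and the `eps` constant are assumptions.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one group of completions for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), so
    completions that beat the group average get a positive advantage.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]
```

If every completion in a group receives the same score, all advantages are zero and that group contributes no gradient, which is one reason a reward with enough spread across rollouts matters.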
+ ## How to Try It
+
+ ```bash
+ # Quick health check
+ curl https://lo-kyu-gridmind.hf.space/health
+
+ # Run a heuristic baseline
+ python inference.py --fast-mode --task 3 --episodes 5
+
+ # Run the LLM agent
+ python inference.py --task 3 --episodes 5
+ ```
+
+ Live environment: [https://lo-kyu-gridmind.hf.space](https://lo-kyu-gridmind.hf.space)
+ Dashboard: [https://lo-kyu-gridmind.hf.space/dashboard](https://lo-kyu-gridmind.hf.space/dashboard)
+
+ Code: [github.com/LO-Kyu/gridmind](https://github.com/LO-Kyu/gridmind)
+
+ ---
+
+ *GridMind-RL was built for the Meta PyTorch OpenEnv Hackathon Grand Finale, April 25–26, 2026, at Scaler School of Technology, Bangalore.*