---
title: GridMind-RL
emoji: ⚡
colorFrom: green
colorTo: blue
sdk: docker
app_port: 7860
pinned: false
license: mit
---

# GridMind-RL — Train LLMs to manage industrial buildings under faults, grid stress, and natural language objectives.

[![OpenEnv Compatible](https://img.shields.io/badge/OpenEnv-Compatible-blue)](https://openenv.org/)
[![Go 1.21](https://img.shields.io/badge/Go-1.21-00ADD8)](https://golang.org/)
[![Python 3.11](https://img.shields.io/badge/Python-3.11+-3776ab)](https://www.python.org/)
[![Docker Ready](https://img.shields.io/badge/Docker-Ready-2496ED)](https://www.docker.com/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE)

---

## Why This Environment Is Novel

Industrial buildings consume ~40% of global electricity yet rely on naive "always-on" HVAC policies. LLMs can reason about pricing curves, fault alerts, and natural language objectives—but no environment trains them for this. GridMind-RL simulates a full 24-hour building energy system with stochastic electricity prices, equipment faults, and instruction cards, creating a genuinely challenging domain where learned policies translate to real operational value.

## Live Demo

| | URL |
|--|-----|
| **Environment API** | https://prajwal782007-gridmind.hf.space |
| **Live Dashboard** | https://prajwal782007-gridmind.hf.space/dashboard |

**Quick test:**
```bash
curl https://prajwal782007-gridmind.hf.space/health
curl https://prajwal782007-gridmind.hf.space/tasks
```

---

## Environment

| | Description |
|---|-------------|
| **Observation** | 13 fields: indoor temperature, thermal storage level, process demand, current price, grid stress, carbon intensity, hour of day, batch queue, cumulative cost, HVAC efficiency, active faults, instruction card, price forecast |
| **Actions** | HVAC level (0-1), thermal charge (-1 to 1), batch slot (0-4), load shed (0-0.5) |
| **Reward** | 9-component weighted sum: cost, temperature, grid, deadline, efficiency, stability, carbon, instruction, fault_mitigation |
| **Episode** | 96 steps = 24 simulated hours @ 15-min resolution |
| **Tasks** | 4 tasks: (1) cost, (2) temperature, (3) demand_response, (4) instruction_following |

### Reward Weight Rationale

The weights reflect real-world building-operator priorities rather than arbitrary values:

| Component | Weight | Rationale |
|---|---|---|
| `cost_savings` | 0.28 | Primary operator KPI — energy spend is the main business metric |
| `carbon_reward` | 0.20 | ESG compliance — increasingly mandatory for industrial operators |
| `temp_constraint` | 0.20 | Hard safety constraint — comfort SLA violations incur penalties |
| `grid_response` | 0.20 | Regulatory SLA — demand response programs pay operators to shed load |
| `batch_deadline` | 0.12 | Production continuity — missing batch deadlines causes downstream losses |
| `efficiency_bonus` | 0.05 | Storage arbitrage — incentivises smart charge/discharge timing |
| `stability_penalty` | -0.05 | Anti-cycling — prevents HVAC thrashing that causes equipment wear |
| `task_satisfaction` | 0.50* | Task 4 only — weighted per the episode's instruction card |
| `fault_mitigation` | dynamic | Emergency response — computed based on fault type and response |

> *Task 4 instruction reward weight comes from the sampled instruction card, not a fixed value.
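
As a rough illustration of how these weights combine, the sketch below computes a weighted sum in Python. The weights are copied from the table (the seven fixed ones sum to 1.0). The `weighted_reward` helper, the example component values, and the way `task_satisfaction` and `fault_mitigation` enter the total are assumptions made for illustration, not the logic in `env/rewards.go`.

```python
# Fixed weights copied from the table above; everything else is illustrative.
REWARD_WEIGHTS = {
    "cost_savings": 0.28,
    "carbon_reward": 0.20,
    "temp_constraint": 0.20,
    "grid_response": 0.20,
    "batch_deadline": 0.12,
    "efficiency_bonus": 0.05,
    "stability_penalty": -0.05,
}


def weighted_reward(components, instruction_weight=0.0, fault_mitigation=0.0):
    """Combine per-step reward components into one scalar (hypothetical helper).

    components: component name -> value, assumed to lie roughly in [0, 1].
    instruction_weight: weight read from the Task 4 instruction card
        (0.0 for Tasks 1-3); how the card encodes it is an assumption.
    fault_mitigation: the dynamic term, assumed to be precomputed and added as-is.
    """
    total = sum(
        weight * components.get(name, 0.0)
        for name, weight in REWARD_WEIGHTS.items()
    )
    total += instruction_weight * components.get("task_satisfaction", 0.0)
    return total + fault_mitigation


# Made-up component values for a single step.
example_step = {
    "cost_savings": 0.6,
    "carbon_reward": 0.4,
    "temp_constraint": 1.0,
    "grid_response": 0.0,
    "batch_deadline": 1.0,
    "efficiency_bonus": 0.5,
    "stability_penalty": 0.2,
}
print(round(weighted_reward(example_step), 4))
```

Because `stability_penalty` carries a negative weight, a larger cycling value lowers the total instead of raising it.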

### Observation Fields

| Field | Type | Description |
|-------|------|-------------|
| indoor_temperature | float | °C |
| thermal_storage_level | float | 0-1 (0=empty, 1=full) |
| process_demand | float | kW current industrial power demand |
| current_price | float | $/kWh |
| grid_stress_signal | float | 0-1 (>0.7 = critical) |
| carbon_intensity | float | gCO2/kWh |
| hour_of_day | int | 0-23 |
| batch_queue | int[] | pending job deadline slots |
| cumulative_cost | float | $ total incurred this episode |
| hvac_efficiency | float | 1.0 → degrades to 0.5 over episode |
| active_faults | string[] | Active fault alarm strings |
| instruction_card | object | Task 4 objective only |
| price_forecast | float[] | 4-step upcoming price preview |

### Action Fields

| Field | Type | Range |
|-------|------|-------|
| hvac_power_level | float | 0.0-1.0 |
| thermal_charge_rate | float | -1.0 to 1.0 |
| batch_job_slot | int | 0-4 |
| load_shed_fraction | float | 0.0-0.5 |
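
An agent, and an LLM agent in particular, can emit values outside these ranges. The sketch below clamps a proposed action to the ranges in the table before it is sent. The field names and ranges come from the table; the `Action` class and its `clamped` constructor are hypothetical helpers, and whether the server also clamps or rejects out-of-range values is not stated here.

```python
from dataclasses import asdict, dataclass


def _clamp(value, low, high):
    """Restrict value to the closed range [low, high]."""
    return max(low, min(high, value))


@dataclass(frozen=True)
class Action:
    hvac_power_level: float     # 0.0-1.0
    thermal_charge_rate: float  # -1.0 to 1.0
    batch_job_slot: int         # 0-4
    load_shed_fraction: float   # 0.0-0.5

    @classmethod
    def clamped(cls, hvac, charge, slot, shed):
        """Build an action, forcing every field into its documented range."""
        return cls(
            hvac_power_level=_clamp(float(hvac), 0.0, 1.0),
            thermal_charge_rate=_clamp(float(charge), -1.0, 1.0),
            batch_job_slot=int(_clamp(int(slot), 0, 4)),
            load_shed_fraction=_clamp(float(shed), 0.0, 0.5),
        )


# Every input here is out of range and gets pulled back to the nearest bound.
action = Action.clamped(hvac=1.3, charge=-2.0, slot=7, shed=0.9)
print(asdict(action))
```

`asdict(action)` yields a plain dictionary keyed by the field names above, which is a convenient shape for a JSON request body.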

---

## Core Capabilities

### Multi-Agent Coordination
A single oversight LLM coordinates multiple buildings through price signals. The coordinator reads `/feeder` to see fleet-wide demand, then sets per-building price multipliers via `/coordinate` to orchestrate behavior.
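
To make the price-signal idea concrete, here is a rule-based stand-in for the coordinator's decision. It is not the LLM coordinator and not the repository's code: the `price_multipliers` function, the `feeder_limit_kw` parameter, and the per-building demand dictionary are all hypothetical, and the real `/feeder` and `/coordinate` payload shapes are not documented in this README.

```python
def price_multipliers(building_demand_kw, feeder_limit_kw, max_multiplier=2.0):
    """Return a price multiplier per building (hypothetical rule).

    While total demand is within the feeder limit every building keeps a
    multiplier of 1.0. When the feeder is overloaded, each building's
    multiplier rises with its share of total demand, so the largest
    consumers see the strongest incentive to cut back.
    """
    if feeder_limit_kw <= 0:
        raise ValueError("feeder_limit_kw must be positive")

    total = sum(building_demand_kw.values())
    if total <= feeder_limit_kw:
        return {name: 1.0 for name in building_demand_kw}

    overload = total / feeder_limit_kw - 1.0  # fraction above the limit
    count = len(building_demand_kw)
    return {
        name: min(max_multiplier, 1.0 + overload * (demand / total) * count)
        for name, demand in building_demand_kw.items()
    }


# Two buildings drawing 400 kW in total against an assumed 200 kW limit.
print(price_multipliers({"b0": 100.0, "b1": 300.0}, feeder_limit_kw=200.0))
```

In the environment, an LLM would make this decision after reading `/feeder`, and the resulting multipliers would be sent to `/coordinate`.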

### Long-Horizon Instruction Following
Task 4 presents a natural language objective card such as "Keep total energy cost under $2.50 while maintaining 19-23°C". Agents must plan across all 96 steps rather than act greedily at each step.
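
A card like this can only be judged over the whole episode, which is what makes the task long-horizon. The sketch below checks the example card against a finished trajectory. The `satisfies_card` helper and the list-of-observations trajectory format are assumptions for illustration (only the field names `cumulative_cost` and `indoor_temperature` come from the observation table), and the real grader in `env/tasks.go` may score this differently.

```python
def satisfies_card(trajectory, max_cost=2.50, temp_low=19.0, temp_high=23.0):
    """Check the example card against one episode (hypothetical helper).

    trajectory: list of observation dicts, one per step, each carrying
    `cumulative_cost` ($) and `indoor_temperature` (°C).
    """
    if not trajectory:
        return False
    # The cost limit applies to the episode total, i.e. the last cumulative value.
    within_budget = trajectory[-1]["cumulative_cost"] < max_cost
    # The temperature band must hold at every step, not just on average.
    temps_ok = all(
        temp_low <= obs["indoor_temperature"] <= temp_high for obs in trajectory
    )
    return within_budget and temps_ok


# A made-up 96-step episode that spends $0.02 per step at a steady 21 °C.
episode = [
    {"cumulative_cost": 0.02 * (step + 1), "indoor_temperature": 21.0}
    for step in range(96)
]
print(satisfies_card(episode))
```

Spread evenly, a $2.50 budget over 96 steps is about $0.026 per 15-minute step, so a few expensive peak-price steps have to be offset by cheaper ones elsewhere in the day.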

These two capabilities map directly to Theme 1 and Theme 3 of the OpenEnv Hackathon.

---

## Results

### What the Agent Learns

A naive heuristic runs HVAC at fixed levels based on time-of-day. After GRPO training on GridMind-RL, the agent learns to charge thermal storage during off-peak hours (4¢/kWh) and discharge during peak (32¢/kWh), voluntarily shed load during grid stress signals above 0.7, and adjust HVAC intensity as efficiency degrades over the episode. None of these behaviors are hardcoded — the agent discovers them through the reward signal alone.
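
For reference, the three behaviors described above can be written down as explicit rules. This is a hand-written illustration of the target behavior, not the trained policy and not the heuristic baseline in `inference.py`. The price and stress thresholds come from this README; the `reference_policy` name, the base HVAC level of 0.5, and the sign convention (positive `thermal_charge_rate` means charging) are assumptions.

```python
OFF_PEAK_PRICE = 0.04  # $/kWh, off-peak price quoted above
PEAK_PRICE = 0.32      # $/kWh, peak price quoted above
STRESS_CRITICAL = 0.7  # grid_stress_signal threshold quoted above


def reference_policy(obs):
    """Map one observation dict to one action dict using fixed rules."""
    price = obs["current_price"]
    storage = obs["thermal_storage_level"]

    # Storage arbitrage: charge when power is cheap, discharge when it is expensive.
    if price <= OFF_PEAK_PRICE and storage < 1.0:
        charge = 1.0   # assumed convention: positive = charge
    elif price >= PEAK_PRICE and storage > 0.0:
        charge = -1.0  # assumed convention: negative = discharge
    else:
        charge = 0.0

    # Demand response: shed the maximum allowed load under critical grid stress.
    shed = 0.5 if obs["grid_stress_signal"] > STRESS_CRITICAL else 0.0

    # Compensate for degraded HVAC efficiency (1.0 down to 0.5) with more power.
    hvac = min(1.0, 0.5 / max(obs["hvac_efficiency"], 0.5))

    return {
        "hvac_power_level": hvac,
        "thermal_charge_rate": charge,
        "batch_job_slot": 0,
        "load_shed_fraction": shed,
    }


peak_obs = {
    "current_price": 0.32,
    "thermal_storage_level": 0.8,
    "grid_stress_signal": 0.9,
    "hvac_efficiency": 0.5,
}
print(reference_policy(peak_obs))
```

A learned policy is expected to go beyond fixed thresholds like these, for example by using `price_forecast` to decide how much storage to hold back.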

| Policy | Task 1 | Task 2 | Task 3 | Task 4 |
|--------|--------|--------|--------|--------|
| Heuristic Baseline | 0.494 | 0.471 | 0.748 | 0.478 |
| Zero-shot LLM | 0.715 | 0.645 | 0.610 | 0.582 |
| GRPO Fine-tuned LLM | — | — | — | — |

> *GRPO fine-tuned scores will be added once the full training run on a T4 GPU completes.
> The training plots below show progress from that run so far.*

![Reward Curve](curves/train%202/reward_curve.png)
*Reward vs training step. Blue = per-step reward, red dashed = smoothed average.*

![Loss Curve](curves/train%202/loss_curve.png)
*Training loss decreasing over steps — confirms the model is updating.*

![Baseline Comparison](curves/train%202/baseline_comparison.png)
*Grade scores per task: heuristic baseline vs GRPO-trained LLM.*

> Scores are episode grade scores clamped to the open interval (0.0, 1.0). Heuristic = fixed policy with no learning. Zero-shot = Qwen2.5-1.5B-Instruct prompted with the task description, no fine-tuning, evaluated over 1 episode per task. Fine-tuned = GRPO-trained on the GridMind-RL environment.
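
Clamping to an open interval presumably means a grade never lands exactly on 0.0 or 1.0. A minimal sketch of such a clamp follows; the `clamp_open` name and the `eps` margin are assumptions, not values taken from the grader.

```python
def clamp_open(score, eps=1e-3):
    """Keep a score strictly inside (0, 1) by a margin of eps (assumed value)."""
    return min(1.0 - eps, max(eps, score))


for raw in (-0.2, 0.0, 0.5, 1.0, 1.7):
    print(raw, "->", clamp_open(raw))
```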

---

## How to Run

### Start the environment server
```bash
go run main.go
```

### Run the LLM agent (tasks 1-4)
```bash
# Set up your API token
cp .env.example .env
# Edit .env with HF_TOKEN

# Task 1: Cost minimization
python inference.py --task 1 --episodes 5

# Task 2: Temperature management  
python inference.py --task 2 --episodes 5

# Task 3: Full demand response
python inference.py --task 3 --episodes 5

# Task 4: Instruction following
python inference.py --task 4 --episodes 5

# Heuristic baseline (fast, no LLM)
python inference.py --fast-mode --task 3 --episodes 5
```

### Run multi-building coordinator demo
```bash
python scripts/multi_building_demo.py
```

### Run training (requires GPU)
```bash
python scripts/train_unsloth.py --steps 500 --output-csv results/training_log.csv
```

### Generate training curve plot
```bash
python scripts/plot_results.py
```
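
The reward curve under Results overlays a smoothed average on the per-step reward. How `scripts/plot_results.py` smooths the series is not shown in this README; a trailing moving average is one common choice, sketched here with a hypothetical `moving_average` helper and an assumed window size.

```python
def moving_average(values, window=10):
    """Trailing moving average; early points average over what is available."""
    if window < 1:
        raise ValueError("window must be at least 1")
    smoothed = []
    running = 0.0
    for i, value in enumerate(values):
        running += value
        if i >= window:
            running -= values[i - window]  # drop the value leaving the window
        smoothed.append(running / min(i + 1, window))
    return smoothed


rewards = [1.0, 2.0, 3.0, 4.0]
print(moving_average(rewards, window=2))  # [1.0, 1.5, 2.5, 3.5]
```

A larger window gives a smoother line but reacts more slowly to real changes in the reward.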

---

## Architecture

```
Agent (inference.py)
    → HTTP POST /reset, /step; GET /grade
    ↓
Go Environment Server (main.go) → Port 7860
    ↓
Physics Engine (env/environment.go) + Rewards (env/rewards.go) + Tasks (env/tasks.go)
    ↓
Web Dashboard (dashboard/server.py) → Port 7861
```

**Design philosophy:**
- **Separation of concerns**: Physics engine (Go) decoupled from policy layer (Python)
- **OpenEnv compliance**: Standardized REST API enables any language agent
- **Deterministic simulation**: Seeded RNG for reproducible experiments
- **Dense rewards**: 9-component reward for effective learning
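
To show what seeded determinism looks like in practice, the sketch below generates a 96-step price series from a private, seeded generator. It is a Python illustration only: the actual simulation lives in the Go server, and the `price_curve` function, the 16:00-21:00 peak window, and the ±10% noise band are assumptions. Only the 4¢ and 32¢ price levels and the 15-minute step size come from this README.

```python
import random


def price_curve(seed, steps=96, off_peak=0.04, peak=0.32):
    """Return one day of stochastic prices in $/kWh, reproducible per seed."""
    rng = random.Random(seed)  # private generator, so global state is untouched
    prices = []
    for step in range(steps):
        hour = step // 4  # four 15-minute steps per simulated hour
        level = peak if 16 <= hour < 21 else off_peak  # assumed peak window
        prices.append(round(level * rng.uniform(0.9, 1.1), 4))
    return prices


# The same seed reproduces the same day; a different seed gives a different one.
print(price_curve(seed=7) == price_curve(seed=7))
print(price_curve(seed=7)[:4])
```

Keeping the generator local to the function is what makes two runs with the same seed comparable, even when other code also draws random numbers.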

---

## API Reference

| Method | Endpoint | Description |
|--------|----------|-------------|
| GET | /health | Health check |
| GET | /ping | Liveness probe |
| POST | /reset | Start new episode |
| POST | /step | Take action step |
| GET | /state | Get current state |
| GET | /grade | Grade episode (0.0-1.0 score) |
| GET | /tasks | Available tasks |
| GET | /metrics | Prometheus metrics |
| GET | /replay | Episode history |
| GET | /feeder | Aggregate fleet state |
| POST | /coordinate | Set price multipliers |
| POST | /simulate | World model prediction |
| POST | /coordinator/reset | Reset multi-building episode |
| POST | /coordinator/step | Step with per-building actions |
| GET | /info | OpenEnv metadata |
| GET | /ws | WebSocket endpoint |
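
A minimal client loop over three of these endpoints might look like the sketch below, using only the Python standard library. The endpoint paths, HTTP methods, and port 7860 come from this README. The request and response shapes are assumptions: the `{"task_id": ...}` reset body is hypothetical, and each response is passed to the policy untouched because its structure is not documented here. Check `openenv.yaml` or `inference.py` for the real schema.

```python
import json
import urllib.request


def http_json(method, url, payload=None):
    """Send an optional JSON body and decode the JSON response."""
    data = json.dumps(payload).encode("utf-8") if payload is not None else None
    request = urllib.request.Request(
        url,
        data=data,
        method=method,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


def run_episode(policy, base_url="http://localhost:7860", task_id=1,
                steps=96, send=http_json):
    """Reset, step `steps` times with `policy`, then fetch the grade.

    policy: callable taking the latest server response and returning an
    action dict. `send` is injectable so the loop can be tested offline.
    """
    latest = send("POST", f"{base_url}/reset", {"task_id": task_id})  # assumed body
    for _ in range(steps):
        latest = send("POST", f"{base_url}/step", policy(latest))
    return send("GET", f"{base_url}/grade")


def constant_policy(_response):
    """Ignore the response and always return the same mid-range action."""
    return {
        "hvac_power_level": 0.5,
        "thermal_charge_rate": 0.0,
        "batch_job_slot": 0,
        "load_shed_fraction": 0.0,
    }
```

With the server running locally, `run_episode(constant_policy)` would play one 96-step episode and return whatever `/grade` responds with.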

---

## Project Structure

```
gridmind-rl/
├── main.go                   # HTTP server & OpenEnv API
├── inference.py              # Agent entry point (LLM + heuristic)
├── openenv.yaml              # OpenEnv spec
├── Dockerfile                # Container build
├── HF_BLOG_POST.md           # Blog write-up
├── baseline_scores.json      # Heuristic baseline scores
├── env/
│   ├── environment.go        # Physics simulation
│   ├── models.go             # Data models
│   ├── rewards.go            # Reward computation
│   ├── tasks.go              # Task grading
│   └── faults.go             # Fault injection
├── scripts/
│   ├── train_unsloth.py      # GRPO training
│   ├── plot_results.py       # Training curve visualizer
│   ├── multi_building_demo.py  # Fleet AI demo
│   └── gridmind_grpo_colab.ipynb  # Colab training notebook
├── server/
│   └── app.py                # Python fallback server
├── dashboard/
│   ├── server.py             # Web server (port 7861)
│   └── static/               # Frontend assets
├── curves/                   # Training curves (train N/)
│   └── train N/              # Per-run plots
├── results/                  # Training outputs (generated)
└── README.md
```

---

## Links

- 🤗 HuggingFace Space: [GridMind-RL](https://prajwal782007-gridmind.hf.space)
- 📓 Training Notebook: [gridmind_grpo_colab.ipynb](https://colab.research.google.com/github/LO-Kyu/gridmind/blob/main/scripts/gridmind_grpo_colab.ipynb)
- 📝 Blog Post: [Read the write-up](./HF_BLOG_POST.md)
- 🐙 GitHub: [Code Repository](https://github.com/LO-Kyu/gridmind)

---

## License

MIT License. See the [LICENSE](LICENSE) file.

---

**Questions?** Open an issue on GitHub.