---
title: GridMind-RL
emoji: ⚡
colorFrom: green
colorTo: blue
sdk: docker
app_port: 7860
pinned: false
license: mit
---

# GridMind-RL: Train LLMs to manage industrial buildings under faults, grid stress, and natural language objectives.

[![OpenEnv Compatible](https://img.shields.io/badge/OpenEnv-Compatible-blue)](https://openenv.org/)
[![Go 1.21](https://img.shields.io/badge/Go-1.21-00ADD8)](https://golang.org/)
[![Python 3.11](https://img.shields.io/badge/Python-3.11+-3776ab)](https://www.python.org/)
[![Docker Ready](https://img.shields.io/badge/Docker-Ready-2496ED)](https://www.docker.com/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE)

---

## Why This Environment Is Novel

Industrial buildings consume ~40% of global electricity yet rely on naive "always-on" HVAC policies. LLMs can reason about pricing curves, fault alerts, and natural language objectives, but no environment trains them for this. GridMind-RL simulates a full 24-hour building energy system with stochastic electricity prices, equipment faults, and instruction cards, creating a genuinely challenging domain where learned policies translate to real operational value.

## Live Demo

| | URL |
|--|-----|
| **Environment API** | https://prajwal782007-gridmind.hf.space |
| **Live Dashboard** | https://prajwal782007-gridmind.hf.space/dashboard |

**Quick test:**

```bash
curl https://prajwal782007-gridmind.hf.space/health
curl https://prajwal782007-gridmind.hf.space/tasks
```

---

## Environment

| | Description |
|---|-------------|
| **Observation** | 13 fields: temperature, storage, price, stress, carbon, faults, HVAC efficiency, process demand, batch queue, price forecast |
| **Actions** | HVAC level (0-1), thermal charge (-1 to 1), batch slot (0-4), load shed (0-0.5) |
| **Reward** | 9-component weighted sum: cost, temperature, grid, deadline, efficiency, stability, carbon, instruction, fault_mitigation |
| **Episode** | 96 steps = 24 simulated hours @ 15-min resolution |
| **Tasks** | 4 tasks: (1) cost, (2) temperature, (3) demand_response, (4) instruction_following |

### Reward Weight Rationale

Weights reflect real-world building operator priorities rather than arbitrary values:

| Component | Weight | Rationale |
|---|---|---|
| `cost_savings` | 0.28 | Primary operator KPI: energy spend is the main business metric |
| `carbon_reward` | 0.20 | ESG compliance: increasingly mandatory for industrial operators |
| `temp_constraint` | 0.20 | Hard safety constraint: comfort SLA violations incur penalties |
| `grid_response` | 0.20 | Regulatory SLA: demand response programs pay operators to shed load |
| `batch_deadline` | 0.12 | Production continuity: missing batch deadlines causes downstream losses |
| `efficiency_bonus` | 0.05 | Storage arbitrage: incentivises smart charge/discharge timing |
| `stability_penalty` | -0.05 | Anti-cycling: prevents HVAC thrashing that causes equipment wear |
| `task_satisfaction` | 0.50\* | Task 4 only: weighted per the episode's instruction card |
| `fault_mitigation` | dynamic | Emergency response: computed from the fault type and the agent's response |

> \*Task 4 instruction reward weight comes from the sampled instruction card, not a fixed value.

### Observation Fields

| Field | Type | Description |
|-------|------|-------------|
| indoor_temperature | float | Indoor temperature (°C) |
| thermal_storage_level | float | 0-1 (0 = empty, 1 = full) |
| process_demand | float | Current industrial power demand (kW) |
| current_price | float | Electricity price ($/kWh) |
| grid_stress_signal | float | 0-1 (>0.7 = critical) |
| carbon_intensity | float | Grid carbon intensity (gCO2/kWh) |
| hour_of_day | int | 0-23 |
| batch_queue | int[] | Deadline slots of pending jobs |
| cumulative_cost | float | Total cost incurred this episode ($) |
| hvac_efficiency | float | Starts at 1.0 and degrades to 0.5 over the episode |
| active_faults | string[] | Active fault alarm strings |
| instruction_card | object | Objective card (Task 4 only) |
| price_forecast | float[] | 4-step upcoming price preview |

### Action Fields

| Field | Type | Range |
|-------|------|-------|
| hvac_power_level | float | 0.0-1.0 |
| thermal_charge_rate | float | -1.0 to 1.0 |
| batch_job_slot | int | 0-4 |
| load_shed_fraction | float | 0.0-0.5 |

---

## Core Capabilities

### Multi-Agent Coordination

A single oversight LLM coordinates multiple buildings through price signals. The coordinator reads `/feeder` to see fleet-wide demand, then sets per-building price multipliers via `/coordinate` to orchestrate behavior.

### Long-Horizon Instruction Following

Task 4 presents a natural language objective card like "Keep total energy cost under $2.50 while maintaining 19-23°C". Agents must plan across all 96 steps rather than rely on greedy per-step control.

These two capabilities map directly to Theme 1 and Theme 3 of the OpenEnv Hackathon.

---

## Results

### What the Agent Learns

A naive heuristic runs HVAC at fixed levels based on time-of-day. After GRPO training on GridMind-RL, the agent learns to charge thermal storage during off-peak hours (4¢/kWh) and discharge during peak (32¢/kWh), voluntarily shed load during grid stress signals above 0.7, and adjust HVAC intensity as efficiency degrades over the episode.
None of these behaviors are hardcoded; the agent discovers them through the reward signal alone.

| Policy | Task 1 | Task 2 | Task 3 | Task 4 |
|--------|--------|--------|--------|--------|
| Heuristic Baseline | 0.494 | 0.471 | 0.748 | 0.478 |
| Zero-shot LLM | 0.715 | 0.645 | 0.610 | 0.582 |
| GRPO Fine-tuned LLM | — | — | — | — |

> *GRPO fine-tuned scores will be updated after the full training run on a T4 GPU.
> Training plots below show live progress from the actual run.*

![Reward Curve](curves/train%202/reward_curve.png)
*Reward vs training step. Blue = per-step reward, red dashed = smoothed average.*

![Loss Curve](curves/train%202/loss_curve.png)
*Training loss decreasing over steps, which confirms the model is updating.*

![Baseline Comparison](curves/train%202/baseline_comparison.png)
*Grade scores per task: heuristic baseline vs GRPO-trained LLM.*

> Scores are episode grade scores, clamped to the open interval between 0.0 and 1.0.
> Heuristic = fixed policy with no learning.
> Zero-shot = Qwen2.5-1.5B-Instruct prompted with the task description, no fine-tuning, evaluated over 1 episode per task.
> Fine-tuned = GRPO-trained on the GridMind-RL environment.

> 🔄 **Live update:** GRPO fine-tuned scores will be filled in here immediately
> after the final training run completes on the T4 GPU.
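
### Illustrative Rule Sketch

To make the behaviors listed under "What the Agent Learns" concrete, here is a minimal hand-written sketch of what such rules could look like. It is not the repository's heuristic baseline and not the trained policy. The field names and ranges come from the Observation Fields and Action Fields tables, and the 0.7 stress threshold is the one quoted above. Everything else is an assumption made for illustration: the function names `clamp` and `rule_sketch`, the price thresholds `CHEAP_PRICE` and `EXPENSIVE_PRICE`, the convention that a positive `thermal_charge_rate` charges storage, the choice to raise HVAC power as efficiency drops, and the placeholder batch slot.

```python
from typing import Any, Dict

# Hypothetical thresholds chosen for illustration; they are not taken from the repo.
CHEAP_PRICE = 0.08      # $/kWh: at or below this, treat the period as off-peak
EXPENSIVE_PRICE = 0.20  # $/kWh: at or above this, treat the period as peak
STRESS_CRITICAL = 0.7   # grid_stress_signal above 0.7 is critical (from the README)


def clamp(value: float, low: float, high: float) -> float:
    """Keep a value inside the documented action range."""
    return max(low, min(high, value))


def rule_sketch(obs: Dict[str, Any]) -> Dict[str, Any]:
    """Map an observation to an action using three simple rules."""
    price = obs["current_price"]
    stress = obs["grid_stress_signal"]
    efficiency = obs["hvac_efficiency"]

    # Storage arbitrage: charge when cheap, discharge when expensive.
    # Assumption: positive values charge, negative values discharge.
    if price <= CHEAP_PRICE:
        charge = 1.0
    elif price >= EXPENSIVE_PRICE:
        charge = -1.0
    else:
        charge = 0.0

    # Shed the maximum allowed load only under critical grid stress.
    shed = 0.5 if stress > STRESS_CRITICAL else 0.0

    # Assumption: compensate for degraded efficiency by raising HVAC power.
    # Efficiency is floored at 0.5, the documented end-of-episode value.
    hvac = 0.5 / max(efficiency, 0.5)

    return {
        "hvac_power_level": clamp(hvac, 0.0, 1.0),
        "thermal_charge_rate": clamp(charge, -1.0, 1.0),
        "batch_job_slot": 0,  # placeholder: batch scheduling is not modelled here
        "load_shed_fraction": clamp(shed, 0.0, 0.5),
    }
```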
---

## How to Run

### Start the environment server

```bash
go run main.go
```

### Run the LLM agent (tasks 1-4)

```bash
# Set up your API token
cp .env.example .env
# Edit .env with HF_TOKEN

# Task 1: Cost minimization
python inference.py --task 1 --episodes 5

# Task 2: Temperature management
python inference.py --task 2 --episodes 5

# Task 3: Full demand response
python inference.py --task 3 --episodes 5

# Task 4: Instruction following
python inference.py --task 4 --episodes 5

# Heuristic baseline (fast, no LLM)
python inference.py --fast-mode --task 3 --episodes 5
```

### Run multi-building coordinator demo

```bash
python scripts/multi_building_demo.py
```

### Run training (requires GPU)

```bash
python scripts/train_unsloth.py --steps 500 --output-csv results/training_log.csv
```

### Generate training curve plot

```bash
python scripts/plot_results.py
```

---

## Architecture

```
Agent (inference.py) → HTTP POST /reset, /step; GET /grade
        ↓
Go Environment Server (main.go) → Port 7860
        ↓
Physics Engine (env/environment.go) + Rewards (env/rewards.go) + Tasks (env/tasks.go)
        ↓
Web Dashboard (dashboard/server.py) → Port 7861
```

**Design philosophy:**

- **Separation of concerns**: Physics engine (Go) decoupled from policy layer (Python)
- **OpenEnv compliance**: Standardized REST API lets agents be written in any language
- **Deterministic simulation**: Seeded RNG for reproducible experiments
- **Dense rewards**: 9-component reward for effective learning

---

## API Reference

| Method | Endpoint | Description |
|--------|----------|-------------|
| GET | /health | Health check |
| GET | /ping | Liveness probe |
| POST | /reset | Start new episode |
| POST | /step | Take action step |
| GET | /state | Get current state |
| GET | /grade | Grade episode (0.0-1.0 score) |
| GET | /tasks | Available tasks |
| GET | /metrics | Prometheus metrics |
| GET | /replay | Episode history |
| GET | /feeder | Aggregate fleet state |
| POST | /coordinate | Set price multipliers |
| POST | /simulate | World model prediction |
| POST | /coordinator/reset | Reset multi-building episode |
| POST | /coordinator/step | Step with per-building actions |
| GET | /info | OpenEnv metadata |
| GET | /ws | WebSocket endpoint |

---

## Project Structure

```
gridmind-rl/
├── main.go                        # HTTP server & OpenEnv API
├── inference.py                   # Agent entry point (LLM + heuristic)
├── openenv.yaml                   # OpenEnv spec
├── Dockerfile                     # Container build
├── HF_BLOG_POST.md                # Blog write-up
├── baseline_scores.json           # Heuristic baseline scores
├── env/
│   ├── environment.go             # Physics simulation
│   ├── models.go                  # Data models
│   ├── rewards.go                 # Reward computation
│   ├── tasks.go                   # Task grading
│   └── faults.go                  # Fault injection
├── scripts/
│   ├── train_unsloth.py           # GRPO training
│   ├── plot_results.py            # Training curve visualizer
│   ├── multi_building_demo.py     # Fleet AI demo
│   └── gridmind_grpo_colab.ipynb  # Colab training notebook
├── server/
│   └── app.py                     # Python fallback server
├── dashboard/
│   ├── server.py                  # Web server (port 7861)
│   └── static/                    # Frontend assets
├── curves/                        # Training curves (train N/)
│   └── train N/                   # Per-run plots
├── results/                       # Training outputs (generated)
└── README.md
```

---

## Links

- 🤗 Hugging Face Space: [GridMind-RL](https://prajwal782007-gridmind.hf.space)
- 📓 Training Notebook: [gridmind_grpo_colab.ipynb](https://colab.research.google.com/github/LO-Kyu/gridmind/blob/main/scripts/gridmind_grpo_colab.ipynb)
- 📝 Blog Post: [Read the write-up](./HF_BLOG_POST.md)
- 🐙 GitHub: [Code Repository](https://github.com/LO-Kyu/gridmind)

---

## License

MIT License. See the [LICENSE](LICENSE) file.

---

**Questions?** Open an issue on GitHub.