---
title: GridMind-RL
emoji: ⚡
colorFrom: green
colorTo: blue
sdk: docker
app_port: 7860
pinned: false
license: mit
---

# GridMind-RL: Train LLMs to manage industrial buildings under faults, grid stress, and natural language objectives

[OpenEnv](https://openenv.org/) · [Go](https://golang.org/) · [Python](https://www.python.org/) · [Docker](https://www.docker.com/) · [License](LICENSE)

---

## Why This Environment Is Novel

Industrial buildings consume ~40% of global electricity yet rely on naive "always-on" HVAC policies. LLMs can reason about pricing curves, fault alerts, and natural language objectives, but no environment trains them for this. GridMind-RL simulates a full 24-hour building energy system with stochastic electricity prices, equipment faults, and instruction cards. The result is a challenging domain in which learned policies translate to real operational value.

## Live Demo

| Service | URL |
|---------|-----|
| **Environment API** | https://prajwal782007-gridmind.hf.space |
| **Live Dashboard** | https://prajwal782007-gridmind.hf.space/dashboard |

**Quick test:**

```bash
curl https://prajwal782007-gridmind.hf.space/health
curl https://prajwal782007-gridmind.hf.space/tasks
```

---

## Environment

| Aspect | Description |
|---|-------------|
| **Observation** | 13 fields, including temperature, storage, price, stress, carbon, faults, HVAC efficiency, process demand, batch queue, and price forecast (full list under Observation Fields) |
| **Actions** | HVAC level (0-1), thermal charge (-1 to 1), batch slot (0-4), load shed (0-0.5) |
| **Reward** | 9-component weighted sum: cost, temperature, grid, deadline, efficiency, stability, carbon, instruction, fault_mitigation |
| **Episode** | 96 steps = 24 simulated hours at 15-minute resolution |
| **Tasks** | 4 tasks: (1) cost, (2) temperature, (3) demand_response, (4) instruction_following |

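
The episode arithmetic (96 steps of 15 minutes covering 24 hours) can be made concrete with a small helper. This is only a sketch: the function name is hypothetical, and it assumes step 0 starts at 00:00 simulated time, which the README does not state.

```python
STEPS_PER_EPISODE = 96   # from the Episode row above
MINUTES_PER_STEP = 15    # 15-minute resolution


def step_to_clock(step: int) -> str:
    """Return the simulated HH:MM at the start of a step.

    Assumption: step 0 corresponds to 00:00 (not confirmed by the README).
    """
    if not 0 <= step < STEPS_PER_EPISODE:
        raise ValueError(f"step must be in [0, {STEPS_PER_EPISODE - 1}]")
    minutes = step * MINUTES_PER_STEP
    return f"{minutes // 60:02d}:{minutes % 60:02d}"


if __name__ == "__main__":
    print(step_to_clock(0), step_to_clock(95))
```

Under that assumption, the last step of an episode (step 95) starts at 23:45, so the 96 steps cover exactly 1440 minutes.
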
### Reward Weight Rationale

Weights reflect real-world building operator priorities rather than arbitrary values:

| Component | Weight | Rationale |
|---|---|---|
| `cost_savings` | 0.28 | Primary operator KPI: energy spend is the main business metric |
| `carbon_reward` | 0.20 | ESG compliance: increasingly mandatory for industrial operators |
| `temp_constraint` | 0.20 | Hard safety constraint: comfort SLA violations incur penalties |
| `grid_response` | 0.20 | Regulatory SLA: demand response programs pay operators to shed load |
| `batch_deadline` | 0.12 | Production continuity: missing batch deadlines causes downstream losses |
| `efficiency_bonus` | 0.05 | Storage arbitrage: incentivises smart charge/discharge timing |
| `stability_penalty` | -0.05 | Anti-cycling: prevents HVAC thrashing that causes equipment wear |
| `task_satisfaction` | 0.50\* | Task 4 only: weighted per the episode's instruction card |
| `fault_mitigation` | dynamic | Emergency response: computed based on fault type and response |

> \*The Task 4 instruction reward weight comes from the sampled instruction card; it is not a fixed value.

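
As a rough illustration of how the table combines into one scalar, the sketch below computes a weighted sum in Python. It is not the code in `env/rewards.go`: the function name is hypothetical, and it assumes each component is a score in [0, 1], that `stability_penalty` is passed as a non-negative magnitude, and that the dynamic `fault_mitigation` term is simply added.

```python
# Fixed weights copied from the table above.
REWARD_WEIGHTS = {
    "cost_savings": 0.28,
    "carbon_reward": 0.20,
    "temp_constraint": 0.20,
    "grid_response": 0.20,
    "batch_deadline": 0.12,
    "efficiency_bonus": 0.05,
    "stability_penalty": -0.05,
}


def weighted_reward(components, instruction_weight=0.0, fault_mitigation=0.0):
    """Combine per-step reward components into a single scalar.

    Assumptions (not confirmed by the README): components lie in [0, 1],
    missing components count as 0, and fault_mitigation is added as-is.
    """
    base = sum(w * components.get(name, 0.0) for name, w in REWARD_WEIGHTS.items())
    # Task 4 only: the weight comes from the sampled instruction card.
    instruction = instruction_weight * components.get("task_satisfaction", 0.0)
    return base + instruction + fault_mitigation


if __name__ == "__main__":
    print(weighted_reward({"cost_savings": 1.0, "temp_constraint": 0.5}))
```

With these weights, the seven fixed components sum to 1.0 when every component is at its maximum, because the -0.05 stability term offsets the 0.05 efficiency bonus.
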
### Observation Fields

| Field | Type | Description |
|-------|------|-------------|
| indoor_temperature | float | Indoor temperature in °C |
| thermal_storage_level | float | 0-1 (0=empty, 1=full) |
| process_demand | float | Current industrial power demand in kW |
| current_price | float | $/kWh |
| grid_stress_signal | float | 0-1 (>0.7 = critical) |
| carbon_intensity | float | gCO2/kWh |
| hour_of_day | int | 0-23 |
| batch_queue | int[] | Pending job deadline slots |
| cumulative_cost | float | Total cost in $ incurred this episode |
| hvac_efficiency | float | Starts at 1.0 and degrades to 0.5 over the episode |
| active_faults | string[] | Active fault alarm strings |
| instruction_card | object | Objective card (Task 4 only) |
| price_forecast | float[] | 4-step upcoming price preview |

### Action Fields

| Field | Type | Range |
|-------|------|-------|
| hvac_power_level | float | 0.0-1.0 |
| thermal_charge_rate | float | -1.0 to 1.0 |
| batch_job_slot | int | 0-4 |
| load_shed_fraction | float | 0.0-0.5 |

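
Because an LLM may emit values outside these ranges, a client might clamp its output before sending it. The sketch below shows one way to do that with the ranges from the table. The function name and the defaults for missing fields are assumptions, and whether the server itself clamps or rejects out-of-range actions is not stated here.

```python
# Ranges copied from the Action Fields table.
FLOAT_BOUNDS = {
    "hvac_power_level": (0.0, 1.0),
    "thermal_charge_rate": (-1.0, 1.0),
    "load_shed_fraction": (0.0, 0.5),
}
BATCH_SLOT_BOUNDS = (0, 4)


def clamp_action(raw: dict) -> dict:
    """Coerce a raw action dict into the documented ranges.

    Assumption: missing fields default to 0 (a hypothetical choice).
    """
    action = {}
    for field, (low, high) in FLOAT_BOUNDS.items():
        value = float(raw.get(field, 0.0))
        action[field] = min(max(value, low), high)
    # batch_job_slot is an integer, so round before clamping.
    slot = int(round(float(raw.get("batch_job_slot", 0))))
    action["batch_job_slot"] = min(max(slot, BATCH_SLOT_BOUNDS[0]), BATCH_SLOT_BOUNDS[1])
    return action


if __name__ == "__main__":
    print(clamp_action({"hvac_power_level": 1.7, "batch_job_slot": 9}))
```
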
---

## Core Capabilities

### Multi-Agent Coordination

A single oversight LLM coordinates multiple buildings through price signals. The coordinator reads `/feeder` to see fleet-wide demand, then sets per-building price multipliers via `/coordinate` to steer each building's behavior.

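
To make the price-signal idea concrete, here is a rule-based stand-in for the coordinator's decision. It is not the oversight LLM and not the project's implementation: the function name, the `feeder_limit_kw` parameter, the input shape, and the scaling rule are all hypothetical, and the real `/feeder` and `/coordinate` payloads may look different.

```python
from typing import Dict


def price_multipliers(demand_kw: Dict[str, float],
                      feeder_limit_kw: float,
                      max_multiplier: float = 2.0) -> Dict[str, float]:
    """Raise prices for buildings in proportion to their share of an overload.

    Assumption: demand_kw maps a building id to its current demand in kW.
    """
    total = sum(demand_kw.values())
    if total <= feeder_limit_kw or total == 0:
        # No overload: leave every building at the base price.
        return {building: 1.0 for building in demand_kw}
    overload = total / feeder_limit_kw - 1.0  # fractional overload, e.g. 0.25
    count = len(demand_kw)
    multipliers = {}
    for building, demand in demand_kw.items():
        share = demand / total
        # A building with an average share gets exactly 1 + overload.
        multipliers[building] = min(1.0 + overload * share * count, max_multiplier)
    return multipliers


if __name__ == "__main__":
    print(price_multipliers({"b1": 60.0, "b2": 40.0}, feeder_limit_kw=80.0))
```

In this sketch, buildings that contribute more to the overload see a higher multiplier, which is one simple way a price signal can shift load without issuing direct commands.
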
### Long-Horizon Instruction Following

Task 4 presents a natural language objective card such as "Keep total energy cost under $2.50 while maintaining 19-23°C". Agents must plan across all 96 steps rather than optimize greedily at each step.

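
One way to reason about an episode-level cost cap is to pace the remaining budget over the remaining steps. The helper below is a sketch of that arithmetic only. Its name is hypothetical, it assumes the budget has already been parsed from the instruction card, and it uses the `cumulative_cost` observation field as the running total.

```python
def per_step_allowance(budget: float, cumulative_cost: float,
                       step: int, total_steps: int = 96) -> float:
    """Return the average spend per remaining step that still meets the budget.

    Assumption: `step` is the zero-based index of the step about to be taken.
    """
    remaining_steps = total_steps - step
    if remaining_steps <= 0:
        return 0.0
    remaining_budget = max(budget - cumulative_cost, 0.0)
    return remaining_budget / remaining_steps


if __name__ == "__main__":
    # Example with the $2.50 card quoted above, halfway through the episode.
    print(per_step_allowance(2.50, cumulative_cost=1.30, step=48))
```

An agent that has spent $1.30 by step 48 has about $0.025 per step left, which is slightly less than the even pace of $2.50 / 96 ≈ $0.026. A greedy per-step controller never sees this constraint, which is why the task needs planning over the whole horizon.
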
These two capabilities map directly to Theme 1 and Theme 3 of the OpenEnv Hackathon.

---

## Results

### What the Agent Learns

A naive heuristic runs HVAC at fixed levels based on time of day. After GRPO training on GridMind-RL, the agent learns to charge thermal storage during off-peak hours (4¢/kWh) and discharge during peak hours (32¢/kWh), to shed load voluntarily when the grid stress signal rises above 0.7, and to adjust HVAC intensity as efficiency degrades over the episode. None of these behaviors is hardcoded; the agent discovers them through the reward signal alone.

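
For readers who want a concrete picture of those behaviors, the sketch below writes them down as hand-coded rules. It is neither the learned policy nor the repository's heuristic baseline. The function name, the comparison against the mean of `price_forecast`, the 0.5 base HVAC level, and the assumption that a positive `thermal_charge_rate` means charging are all illustrative choices.

```python
def illustrative_policy(obs: dict) -> dict:
    """Hand-written rules that mimic the behaviors described above."""
    price = obs["current_price"]
    forecast = obs.get("price_forecast") or [price]
    upcoming = sum(forecast) / len(forecast)
    storage = obs["thermal_storage_level"]
    efficiency = obs.get("hvac_efficiency", 1.0)

    # Storage arbitrage: charge when power is cheaper now than soon, else discharge.
    # Assumption: positive thermal_charge_rate means charging.
    if price < upcoming and storage < 1.0:
        charge = 1.0
    elif price > upcoming and storage > 0.0:
        charge = -1.0
    else:
        charge = 0.0

    # Demand response: shed the maximum allowed load when stress is critical.
    shed = 0.5 if obs["grid_stress_signal"] > 0.7 else 0.0

    # Compensate for degraded efficiency (1.0 -> 0.5) by raising HVAC power.
    hvac = min(1.0, 0.5 / max(efficiency, 0.5))

    return {
        "hvac_power_level": hvac,
        "thermal_charge_rate": charge,
        "batch_job_slot": 0,
        "load_shed_fraction": shed,
    }


if __name__ == "__main__":
    print(illustrative_policy({
        "current_price": 0.04, "price_forecast": [0.32] * 4,
        "grid_stress_signal": 0.8, "thermal_storage_level": 0.2,
    }))
```
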
| Policy | Task 1 | Task 2 | Task 3 | Task 4 |
|--------|--------|--------|--------|--------|
| Heuristic Baseline | 0.494 | 0.471 | 0.748 | 0.478 |
| Zero-shot LLM | 0.715 | 0.645 | 0.610 | 0.582 |
| GRPO Fine-tuned LLM | pending | pending | pending | pending |

> GRPO fine-tuned scores will be filled in here once the full training run on the T4 GPU completes. The training plots below show live progress from the actual run.

*Figure: Reward vs. training step. Blue = per-step reward, red dashed = smoothed average.*

*Figure: Training loss decreasing over steps, which confirms the model is updating.*

*Figure: Grade scores per task, heuristic baseline vs. GRPO-trained LLM.*

> Scores are episode grade scores clamped to the open interval (0.0, 1.0). Heuristic = fixed policy with no learning. Zero-shot = Qwen2.5-1.5B-Instruct prompted with the task description, no fine-tuning, evaluated over 1 episode per task. Fine-tuned = GRPO-trained on the GridMind-RL environment.

---

## How to Run

### Start the environment server

```bash
go run main.go
```

### Run the LLM agent (tasks 1-4)

```bash
# Set up your API token
cp .env.example .env
# Edit .env with HF_TOKEN

# Task 1: Cost minimization
python inference.py --task 1 --episodes 5

# Task 2: Temperature management
python inference.py --task 2 --episodes 5

# Task 3: Full demand response
python inference.py --task 3 --episodes 5

# Task 4: Instruction following
python inference.py --task 4 --episodes 5

# Heuristic baseline (fast, no LLM)
python inference.py --fast-mode --task 3 --episodes 5
```

### Run multi-building coordinator demo

```bash
python scripts/multi_building_demo.py
```

### Run training (requires GPU)

```bash
python scripts/train_unsloth.py --steps 500 --output-csv results/training_log.csv
```

### Generate training curve plot

```bash
python scripts/plot_results.py
```

---

## Architecture

```
Agent (inference.py)
  → HTTP: POST /reset, POST /step, GET /grade
  ↓
Go Environment Server (main.go) → Port 7860
  ↓
Physics Engine (env/environment.go) + Rewards (env/rewards.go) + Tasks (env/tasks.go)
  ↓
Web Dashboard (dashboard/server.py) → Port 7861
```

**Design philosophy:**

- **Separation of concerns**: Physics engine (Go) decoupled from policy layer (Python)
- **OpenEnv compliance**: Standardized REST API supports agents written in any language
- **Deterministic simulation**: Seeded RNG for reproducible experiments
- **Dense rewards**: 9-component reward for effective learning

---

## API Reference

| Method | Endpoint | Description |
|--------|----------|-------------|
| GET | /health | Health check |
| GET | /ping | Liveness probe |
| POST | /reset | Start new episode |
| POST | /step | Take action step |
| GET | /state | Get current state |
| GET | /grade | Grade episode (0.0-1.0 score) |
| GET | /tasks | Available tasks |
| GET | /metrics | Prometheus metrics |
| GET | /replay | Episode history |
| GET | /feeder | Aggregate fleet state |
| POST | /coordinate | Set price multipliers |
| POST | /simulate | World model prediction |
| POST | /coordinator/reset | Reset multi-building episode |
| POST | /coordinator/step | Step with per-building actions |
| GET | /info | OpenEnv metadata |
| GET | /ws | WebSocket endpoint |

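
A minimal episode loop over `/reset`, `/step`, and `/grade` might look like the sketch below, using only the Python standard library. The request and response shapes are assumptions rather than the documented schema: the `task_id` field in the reset body and the `observation` and `done` keys in the responses are hypothetical, so check `/info` or `openenv.yaml` for the real format. The function names are also illustrative.

```python
import json
from urllib import request


def http_json(base_url: str):
    """Return a call(method, path, payload) function that speaks JSON over HTTP."""
    def call(method, path, payload=None):
        data = json.dumps(payload).encode() if payload is not None else None
        req = request.Request(base_url + path, data=data, method=method,
                              headers={"Content-Type": "application/json"})
        with request.urlopen(req, timeout=30) as resp:
            return json.loads(resp.read().decode())
    return call


def run_episode(call, policy, task_id: int, max_steps: int = 96):
    """Reset, step until done or max_steps, then fetch the grade.

    Assumed (hypothetical) shapes: reset takes {"task_id": ...}; step responses
    may wrap the observation under "observation" and flag the end with "done".
    """
    reset = call("POST", "/reset", {"task_id": task_id})
    obs = reset.get("observation", reset)
    for _ in range(max_steps):
        result = call("POST", "/step", policy(obs))
        obs = result.get("observation", result)
        if result.get("done"):
            break
    return call("GET", "/grade")


if __name__ == "__main__":
    # Requires a running server, e.g. `go run main.go` on port 7860.
    call = http_json("http://localhost:7860")
    fixed = lambda obs: {"hvac_power_level": 0.5, "thermal_charge_rate": 0.0,
                         "batch_job_slot": 0, "load_shed_fraction": 0.0}
    print(run_episode(call, fixed, task_id=1))
```

Passing the transport in as `call` keeps the loop testable without a live server, since a stub function can stand in for `http_json`.
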
---

## Project Structure

```
gridmind-rl/
├── main.go                        # HTTP server & OpenEnv API
├── inference.py                   # Agent entry point (LLM + heuristic)
├── openenv.yaml                   # OpenEnv spec
├── Dockerfile                     # Container build
├── HF_BLOG_POST.md                # Blog write-up
├── baseline_scores.json           # Heuristic baseline scores
├── env/
│   ├── environment.go             # Physics simulation
│   ├── models.go                  # Data models
│   ├── rewards.go                 # Reward computation
│   ├── tasks.go                   # Task grading
│   └── faults.go                  # Fault injection
├── scripts/
│   ├── train_unsloth.py           # GRPO training
│   ├── plot_results.py            # Training curve visualizer
│   ├── multi_building_demo.py     # Fleet AI demo
│   └── gridmind_grpo_colab.ipynb  # Colab training notebook
├── server/
│   └── app.py                     # Python fallback server
├── dashboard/
│   ├── server.py                  # Web server (port 7861)
│   └── static/                    # Frontend assets
├── curves/                        # Training curves (train N/)
│   └── train N/                   # Per-run plots
├── results/                       # Training outputs (generated)
└── README.md
```

---

## Links

- 🤗 HuggingFace Space: [GridMind-RL](https://prajwal782007-gridmind.hf.space)
- 📓 Training Notebook: [gridmind_grpo_colab.ipynb](https://colab.research.google.com/github/LO-Kyu/gridmind/blob/main/scripts/gridmind_grpo_colab.ipynb)
- 📝 Blog Post: [Read the write-up](./HF_BLOG_POST.md)
- 🐙 GitHub: [Code Repository](https://github.com/LO-Kyu/gridmind)

---

## License

MIT License. See the [LICENSE](LICENSE) file.

---

**Questions?** Open an issue on GitHub.