	---
	title: GridMind-RL
	emoji: ⚡
	colorFrom: green
	colorTo: blue
	sdk: docker
	app_port: 7860
	pinned: false
	license: mit
	---

	# GridMind-RL — Train LLMs to manage industrial buildings under faults, grid stress, and natural language objectives.

	[![OpenEnv Compatible](https://img.shields.io/badge/OpenEnv-Compatible-blue)](https://openenv.org/)
	[![Go 1.21](https://img.shields.io/badge/Go-1.21-00ADD8)](https://golang.org/)
	[![Python 3.11](https://img.shields.io/badge/Python-3.11+-3776ab)](https://www.python.org/)
	[![Docker Ready](https://img.shields.io/badge/Docker-Ready-2496ED)](https://www.docker.com/)
	[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE)

	---

	## Why This Environment Is Novel

	Industrial buildings consume an estimated ~40% of global electricity, yet many still rely on naive "always-on" HVAC policies. LLMs can reason about pricing curves, fault alerts, and natural language objectives, but to our knowledge no existing environment trains them for this setting. GridMind-RL simulates a full 24-hour building energy system with stochastic electricity prices, equipment faults, and instruction cards. The result is a demanding domain in which learned policies translate to real operational value.

	## Live Demo

	| Service | URL |
	|---|---|
	| Environment API | https://prajwal782007-gridmind.hf.space |
	| Live Dashboard | https://prajwal782007-gridmind.hf.space/dashboard |

	Quick test:
	```bash
	curl https://prajwal782007-gridmind.hf.space/health
	curl https://prajwal782007-gridmind.hf.space/tasks
	```

	---

	## Environment

	| Aspect | Description |
	|---|---|
	| Observation | 13 fields, including temperature, storage, price, stress, carbon, faults, HVAC efficiency, process demand, batch queue, and price forecast |
	| Actions | HVAC level (0-1), thermal charge (-1 to 1), batch slot (0-4), load shed (0-0.5) |
	| Reward | 9-component weighted sum: cost, temperature, grid, deadline, efficiency, stability, carbon, instruction, fault_mitigation |
	| Episode | 96 steps = 24 simulated hours at 15-minute resolution |
	| Tasks | 4 tasks: (1) cost, (2) temperature, (3) demand_response, (4) instruction_following |
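
The episode row implies a simple mapping from step index to simulated clock time. The sketch below assumes steps are numbered from 0 and that the episode starts at midnight, neither of which is stated above. `step_to_clock` is a hypothetical helper, not part of the environment.

```python
STEPS_PER_EPISODE = 96   # from the table above
MINUTES_PER_STEP = 15    # 15-minute resolution


def step_to_clock(step: int) -> tuple[int, int]:
    """Map a 0-based step index to (hour_of_day, minute).

    Assumes the episode starts at 00:00 (an assumption, not documented).
    """
    if not 0 <= step < STEPS_PER_EPISODE:
        raise ValueError(f"step must be in [0, {STEPS_PER_EPISODE}), got {step}")
    total_minutes = step * MINUTES_PER_STEP
    return total_minutes // 60, total_minutes % 60
```

Under these assumptions, `step_to_clock(95)` returns `(23, 45)`, the last 15-minute slot of the simulated day.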

	### Reward Weight Rationale

	Weights reflect real-world building operator priorities — not arbitrary values:

	| Component | Weight | Rationale |
	|---|---|---|
	| `cost_savings` | 0.28 | Primary operator KPI: energy spend is the main business metric |
	| `carbon_reward` | 0.20 | ESG compliance: increasingly mandatory for industrial operators |
	| `temp_constraint` | 0.20 | Hard safety constraint: comfort SLA violations incur penalties |
	| `grid_response` | 0.20 | Regulatory SLA: demand response programs pay operators to shed load |
	| `batch_deadline` | 0.12 | Production continuity: missing batch deadlines causes downstream losses |
	| `efficiency_bonus` | 0.05 | Storage arbitrage: incentivises smart charge/discharge timing |
	| `stability_penalty` | -0.05 | Anti-cycling: prevents HVAC thrashing that causes equipment wear |
	| `task_satisfaction` | 0.50* | Task 4 only: weighted per the episode's instruction card |
	| `fault_mitigation` | dynamic | Emergency response: computed based on fault type and response |

	> *Task 4 instruction reward weight comes from the sampled instruction card, not a fixed value.
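
As a rough illustration of how the fixed weights combine, here is a minimal sketch of the weighted sum. It assumes each component arrives as a plain float and that `stability_penalty` is reported as a non-negative magnitude, so the negative weight does the penalising. `task_satisfaction` and `fault_mitigation` are left out because their weights are not fixed. The function name and input shape are hypothetical; the actual computation lives in `env/rewards.go` and may differ.

```python
# Fixed weights copied from the table above. task_satisfaction and
# fault_mitigation are omitted because their weights are not fixed.
FIXED_WEIGHTS = {
    "cost_savings": 0.28,
    "carbon_reward": 0.20,
    "temp_constraint": 0.20,
    "grid_response": 0.20,
    "batch_deadline": 0.12,
    "efficiency_bonus": 0.05,
    "stability_penalty": -0.05,
}


def weighted_reward(components: dict[str, float]) -> float:
    """Weighted sum over the fixed components; missing ones count as 0."""
    return sum(
        weight * components.get(name, 0.0)
        for name, weight in FIXED_WEIGHTS.items()
    )
```

The fixed weights in the table sum to 1.00 once the -0.05 stability term is included (1.05 without it).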

	### Observation Fields

	| Field | Type | Description |
	|---|---|---|
	| indoor_temperature | float | °C |
	| thermal_storage_level | float | 0-1 (0 = empty, 1 = full) |
	| process_demand | float | kW, current industrial power demand |
	| current_price | float | $/kWh |
	| grid_stress_signal | float | 0-1 (>0.7 = critical) |
	| carbon_intensity | float | gCO2/kWh |
	| hour_of_day | int | 0-23 |
	| batch_queue | int[] | Pending job deadline slots |
	| cumulative_cost | float | $, total incurred this episode |
	| hvac_efficiency | float | Starts at 1.0, degrades to 0.5 over the episode |
	| active_faults | string[] | Active fault alarm strings |
	| instruction_card | object | Task 4 objective only |
	| price_forecast | float[] | 4-step upcoming price preview |
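
On the agent side, the 13 fields could be wrapped in a typed container. The sketch below assumes the server returns a flat JSON object whose keys match the field names in the table; the exact response shape is not documented here. `Observation` and `from_dict` are illustrative names, not part of the project.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class Observation:
    """Typed view of the 13 observation fields listed in the table above."""
    indoor_temperature: float      # °C
    thermal_storage_level: float   # 0-1
    process_demand: float          # kW
    current_price: float           # $/kWh
    grid_stress_signal: float      # 0-1
    carbon_intensity: float        # gCO2/kWh
    hour_of_day: int               # 0-23
    batch_queue: list[int]
    cumulative_cost: float         # $
    hvac_efficiency: float
    active_faults: list[str] = field(default_factory=list)
    instruction_card: Optional[dict[str, Any]] = None  # Task 4 only
    price_forecast: list[float] = field(default_factory=list)

    @classmethod
    def from_dict(cls, raw: dict[str, Any]) -> "Observation":
        # Ignore keys this sketch does not know about.
        known = set(cls.__dataclass_fields__)
        return cls(**{k: v for k, v in raw.items() if k in known})

    @property
    def grid_is_critical(self) -> bool:
        # The table marks grid_stress_signal > 0.7 as critical.
        return self.grid_stress_signal > 0.7
```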

	### Action Fields

	| Field | Type | Range |
	|---|---|---|
	| hvac_power_level | float | 0.0-1.0 |
	| thermal_charge_rate | float | -1.0 to 1.0 |
	| batch_job_slot | int | 0-4 |
	| load_shed_fraction | float | 0.0-0.5 |
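
LLM output does not always respect these ranges, so an agent may want to clamp actions before sending them. A minimal sketch follows. The defaults used for missing keys are assumptions, and `sanitize_action` is a hypothetical helper; whether the server itself clamps or rejects out-of-range values is not stated here.

```python
def clamp(value: float, low: float, high: float) -> float:
    """Restrict value to the closed interval [low, high]."""
    return max(low, min(high, value))


def sanitize_action(raw: dict) -> dict:
    """Coerce a raw action dict into the ranges from the table above.

    Missing keys fall back to 0 (an assumption made for this sketch).
    """
    return {
        "hvac_power_level": clamp(float(raw.get("hvac_power_level", 0.0)), 0.0, 1.0),
        "thermal_charge_rate": clamp(float(raw.get("thermal_charge_rate", 0.0)), -1.0, 1.0),
        "batch_job_slot": int(clamp(round(float(raw.get("batch_job_slot", 0))), 0, 4)),
        "load_shed_fraction": clamp(float(raw.get("load_shed_fraction", 0.0)), 0.0, 0.5),
    }
```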

	---

	## Core Capabilities

	### Multi-Agent Coordination
	A single oversight LLM coordinates multiple buildings through price signals. The coordinator reads `/feeder` to see fleet-wide demand, then sets per-building price multipliers via `/coordinate` to orchestrate behavior.
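
To make the price-signal idea concrete, here is one hand-written rule a coordinator could apply: raise prices for the buildings that contribute most when fleet demand exceeds a feeder limit. Everything in it is hypothetical, including the function name, the `feeder_limit_kw` parameter, the proportional rule, and the multiplier cap. The payloads of `/feeder` and `/coordinate` are not documented in this README, so the sketch works on plain dictionaries.

```python
def price_multipliers(
    demand_kw: dict[str, float],
    feeder_limit_kw: float,
    max_multiplier: float = 2.0,
) -> dict[str, float]:
    """Return a per-building price multiplier (1.0 = no change)."""
    if feeder_limit_kw <= 0:
        raise ValueError("feeder_limit_kw must be positive")

    total = sum(demand_kw.values())
    if total <= feeder_limit_kw:
        # Fleet is within the limit: leave prices untouched.
        return {building: 1.0 for building in demand_kw}

    overload = (total - feeder_limit_kw) / feeder_limit_kw  # fractional overload
    multipliers = {}
    for building, demand in demand_kw.items():
        share = demand / total
        # Buildings with a larger share of demand see a larger increase.
        raw = 1.0 + overload * share * len(demand_kw)
        multipliers[building] = min(max_multiplier, raw)
    return multipliers
```

In the project itself, an LLM makes this decision; a fixed rule like this one is only useful as a baseline or sanity check.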

	### Long-Horizon Instruction Following
	Task 4 presents a natural language objective card such as "Keep total energy cost under $2.50 while maintaining 19-23°C". Agents must plan across all 96 steps rather than act greedily one step at a time.
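
One way to see why this needs planning: the cost limit is a budget that has to last the whole episode. The sketch below paces that budget and checks the example objective after the fact. The thresholds come from the example card above, and both helper names are hypothetical. How the environment actually grades an instruction card is defined in `env/tasks.go` and may differ.

```python
def remaining_budget_per_step(
    cumulative_cost: float,
    step: int,
    cost_limit: float = 2.50,
    total_steps: int = 96,
) -> float:
    """Average spend still affordable per remaining step."""
    steps_left = total_steps - step
    if steps_left <= 0:
        return 0.0
    return max(0.0, cost_limit - cumulative_cost) / steps_left


def objective_met(
    final_cost: float,
    temperatures: list[float],
    cost_limit: float = 2.50,
    temp_low: float = 19.0,
    temp_high: float = 23.0,
) -> bool:
    """True if cost stayed under the limit and every reading was in range."""
    in_range = all(temp_low <= t <= temp_high for t in temperatures)
    return final_cost < cost_limit and in_range
```

For example, an agent that has spent $1.30 by step 48 can afford about $0.025 per step for the rest of the episode.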

	These two capabilities map directly to Theme 1 and Theme 3 of the OpenEnv Hackathon.

	---

	## Results

	### What the Agent Learns

	A naive heuristic runs HVAC at fixed levels based on time-of-day. After GRPO training on GridMind-RL, the agent learns to charge thermal storage during off-peak hours (4¢/kWh) and discharge during peak (32¢/kWh), voluntarily shed load during grid stress signals above 0.7, and adjust HVAC intensity as efficiency degrades over the episode. None of these behaviors are hardcoded — the agent discovers them through the reward signal alone.
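
To show what these behaviours look like as actions, here is a hand-written rule-based illustration. It is not the learned policy and not the project's heuristic baseline. The price thresholds and base HVAC level are hypothetical, and the sketch assumes a positive `thermal_charge_rate` means charging, which is not stated above.

```python
CHEAP_PRICE = 0.08       # $/kWh, hypothetical off-peak threshold
EXPENSIVE_PRICE = 0.25   # $/kWh, hypothetical peak threshold
STRESS_CRITICAL = 0.7    # critical level from the observation table
BASE_HVAC = 0.5          # hypothetical HVAC level at full efficiency


def illustrative_policy(obs: dict) -> dict:
    """Rule-based stand-in for the three behaviours described above."""
    price = obs["current_price"]
    storage = obs["thermal_storage_level"]

    # 1. Storage arbitrage: charge when cheap, discharge when expensive.
    if price <= CHEAP_PRICE and storage < 1.0:
        charge = 1.0
    elif price >= EXPENSIVE_PRICE and storage > 0.0:
        charge = -1.0
    else:
        charge = 0.0

    # 2. Shed load while the grid stress signal is critical.
    shed = 0.5 if obs["grid_stress_signal"] > STRESS_CRITICAL else 0.0

    # 3. Raise HVAC intensity as efficiency degrades (guard against zero).
    efficiency = max(obs["hvac_efficiency"], 0.1)
    hvac = min(1.0, BASE_HVAC / efficiency)

    return {
        "hvac_power_level": hvac,
        "thermal_charge_rate": charge,
        "batch_job_slot": 0,
        "load_shed_fraction": shed,
    }
```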

	| Policy | Task 1 | Task 2 | Task 3 | Task 4 |
	|---|---|---|---|---|
	| Heuristic Baseline | 0.494 | 0.471 | 0.748 | 0.478 |
	| Zero-shot LLM | 0.715 | 0.645 | 0.610 | 0.582 |
	| GRPO Fine-tuned LLM | pending | pending | pending | pending |

	> *GRPO fine-tuned scores will be added once the full training run on a T4 GPU completes.
	> The training plots below show progress from that run.*
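
The two populated rows can be compared directly. The short sketch below uses only the scores from the table and computes the per-task difference and the mean across tasks. The variable names are illustrative.

```python
# Scores copied from the results table (Task 1 through Task 4).
HEURISTIC = [0.494, 0.471, 0.748, 0.478]
ZERO_SHOT = [0.715, 0.645, 0.610, 0.582]

# Positive delta means the zero-shot LLM scored higher than the heuristic.
deltas = [round(z - h, 3) for h, z in zip(HEURISTIC, ZERO_SHOT)]

mean_heuristic = sum(HEURISTIC) / len(HEURISTIC)
mean_zero_shot = sum(ZERO_SHOT) / len(ZERO_SHOT)

for task, delta in enumerate(deltas, start=1):
    print(f"Task {task}: {delta:+.3f}")
print(f"Mean: heuristic {mean_heuristic:.3f}, zero-shot {mean_zero_shot:.3f}")
```

The zero-shot LLM is ahead on Tasks 1, 2, and 4 but behind the heuristic on Task 3 (demand response) by 0.138. Since the zero-shot row was evaluated over a single episode per task, these gaps are indicative only.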

	![Reward Curve](curves/train%202/reward_curve.png)
	Reward vs training step. Blue = per-step reward, red dashed = smoothed average.
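
The smoothing method behind the red dashed line is not specified here. A trailing moving average is one common choice, sketched below with a hypothetical window size; `scripts/plot_results.py` may use a different method.

```python
def moving_average(values: list[float], window: int = 10) -> list[float]:
    """Trailing moving average; early points average over what is available."""
    if window < 1:
        raise ValueError("window must be at least 1")
    smoothed = []
    running = 0.0
    for i, value in enumerate(values):
        running += value
        if i >= window:
            # Drop the value that just left the window.
            running -= values[i - window]
        smoothed.append(running / min(i + 1, window))
    return smoothed
```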

	![Loss Curve](curves/train%202/loss_curve.png)
	Training loss decreasing over steps — confirms the model is updating.

	![Baseline Comparison](curves/train%202/baseline_comparison.png)
	Grade scores per task: heuristic baseline vs GRPO-trained LLM.

	> Scores are episode grade scores clamped to the open interval (0.0, 1.0). Heuristic = fixed policy with no learning. Zero-shot = Qwen2.5-1.5B-Instruct prompted with the task description, no fine-tuning, evaluated over 1 episode per task. Fine-tuned = GRPO-trained on the GridMind-RL environment.
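
Clamping to an open interval means a score never reaches exactly 0.0 or 1.0. A minimal sketch of that idea follows. The margin `eps` is a hypothetical value chosen for illustration; the grader's actual margin is not given here.

```python
def clamp_open(score: float, eps: float = 1e-3) -> float:
    """Keep score strictly inside (0, 1) by a small margin eps."""
    return min(1.0 - eps, max(eps, score))
```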

	> 🔄 Live update: GRPO fine-tuned scores will be filled in here immediately
	> after the final training run completes on the T4 GPU.

	---

	## How to Run

	### Start the environment server
	```bash
	go run main.go
	```

	### Run the LLM agent (tasks 1-4)
	```bash
	# Set up your API token
	cp .env.example .env
	# Edit .env with HF_TOKEN

	# Task 1: Cost minimization
	python inference.py --task 1 --episodes 5

	# Task 2: Temperature management
	python inference.py --task 2 --episodes 5

	# Task 3: Full demand response
	python inference.py --task 3 --episodes 5

	# Task 4: Instruction following
	python inference.py --task 4 --episodes 5

	# Heuristic baseline (fast, no LLM)
	python inference.py --fast-mode --task 3 --episodes 5
	```

	### Run multi-building coordinator demo
	```bash
	python scripts/multi_building_demo.py
	```

	### Run training (requires GPU)
	```bash
	python scripts/train_unsloth.py --steps 500 --output-csv results/training_log.csv
	```

	### Generate training curve plot
	```bash
	python scripts/plot_results.py
	```

	---

	## Architecture

	```
	Agent (inference.py, Python)
	→ HTTP POST /reset, POST /step, GET /grade
	↓
	Go Environment Server (main.go) → Port 7860
	↓
	Physics Engine (env/environment.go) + Rewards (env/rewards.go) + Tasks (env/tasks.go)
	↓
	Web Dashboard (dashboard/server.py) → Port 7861
	```

	Design philosophy:
	- Separation of concerns: Physics engine (Go) decoupled from policy layer (Python)
	- OpenEnv compliance: Standardized REST API lets agents written in any language connect
	- Deterministic simulation: Seeded RNG for reproducible experiments
	- Dense rewards: 9-component reward for effective learning
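
The "seeded RNG" point can be illustrated in a few lines. This Python sketch is a stand-in only: the real simulation is written in Go, and its price model is not described here. The uniform draw and the `sample_prices` name are assumptions. The price bounds reuse the 4¢ and 32¢ per kWh figures from the Results section.

```python
import random


def sample_prices(seed: int, steps: int = 96,
                  low: float = 0.04, high: float = 0.32) -> list[float]:
    """Draw one price per step from a private, seeded generator."""
    rng = random.Random(seed)  # independent of the global random state
    return [round(rng.uniform(low, high), 4) for _ in range(steps)]


# Same seed, same series: this is what makes experiments reproducible.
assert sample_prices(seed=42) == sample_prices(seed=42)
```

A private `random.Random` instance, rather than the module-level functions, keeps the sequence unaffected by other code that also draws random numbers.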

	---

	## API Reference

	| Method | Endpoint | Description |
	|---|---|---|
	| GET | /health | Health check |
	| GET | /ping | Liveness probe |
	| POST | /reset | Start new episode |
	| POST | /step | Take action step |
	| GET | /state | Get current state |
	| GET | /grade | Grade episode (0.0-1.0 score) |
	| GET | /tasks | Available tasks |
	| GET | /metrics | Prometheus metrics |
	| GET | /replay | Episode history |
	| GET | /feeder | Aggregate fleet state |
	| POST | /coordinate | Set price multipliers |
	| POST | /simulate | World model prediction |
	| POST | /coordinator/reset | Reset multi-building episode |
	| POST | /coordinator/step | Step with per-building actions |
	| GET | /info | OpenEnv metadata |
	| GET | /ws | WebSocket endpoint |
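
A minimal client loop over `/reset`, `/step`, and `/grade` might look like the sketch below. The request and response shapes are assumptions: the `task_id` key in the reset body and the `observation` and `done` keys in the step response are not documented in this README, so check `openenv.yaml` or `inference.py` for the real schema. The transport is passed in as a function so the loop can be tried without a running server.

```python
import json
from typing import Any, Callable, Optional
from urllib import request

BASE_URL = "http://localhost:7860"  # port taken from the architecture diagram


def http_send(method: str, path: str, payload: Optional[dict] = None) -> Any:
    """Send a JSON request to the environment server and decode the reply."""
    data = json.dumps(payload).encode("utf-8") if payload is not None else None
    req = request.Request(
        BASE_URL + path,
        data=data,
        method=method,
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req, timeout=10) as resp:
        return json.loads(resp.read().decode("utf-8"))


def run_episode(
    policy: Callable[[dict], dict],
    send: Callable[..., Any] = http_send,
    task_id: int = 1,
    max_steps: int = 96,
) -> Any:
    """Reset, step until done or max_steps, then fetch the grade."""
    # Assumed request shape: {"task_id": ...}
    obs = send("POST", "/reset", {"task_id": task_id})
    for _ in range(max_steps):
        result = send("POST", "/step", policy(obs))
        # Assumed response shape: {"observation": ..., "done": ...}
        obs = result.get("observation", result)
        if result.get("done"):
            break
    return send("GET", "/grade")
```

With the server started via `go run main.go`, calling `run_episode(lambda obs: {"hvac_power_level": 0.5, "thermal_charge_rate": 0.0, "batch_job_slot": 0, "load_shed_fraction": 0.0})` would exercise the loop with a constant action, provided the assumed payload shapes match.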

	---

	## Project Structure

	```
	gridmind-rl/
	├── main.go # HTTP server & OpenEnv API
	├── inference.py # Agent entry point (LLM + heuristic)
	├── openenv.yaml # OpenEnv spec
	├── Dockerfile # Container build
	├── HF_BLOG_POST.md # Blog write-up
	├── baseline_scores.json # Heuristic baseline scores
	├── env/
	│ ├── environment.go # Physics simulation
	│ ├── models.go # Data models
	│ ├── rewards.go # Reward computation
	│ ├── tasks.go # Task grading
	│ └── faults.go # Fault injection
	├── scripts/
	│ ├── train_unsloth.py # GRPO training
	│ ├── plot_results.py # Training curve visualizer
	│ ├── multi_building_demo.py # Fleet AI demo
	│ └── gridmind_grpo_colab.ipynb # Colab training notebook
	├── server/
	│ └── app.py # Python fallback server
	├── dashboard/
	│ ├── server.py # Web server (port 7861)
	│ └── static/ # Frontend assets
	├── curves/ # Training curves (train N/)
	│ └── train N/ # Per-run plots
	├── results/ # Training outputs (generated)
	└── README.md
	```

	---

	## Links

	- 🤗 HuggingFace Space: [GridMind-RL](https://prajwal782007-gridmind.hf.space)
	- 📓 Training Notebook: [gridmind_grpo_colab.ipynb](https://colab.research.google.com/github/LO-Kyu/gridmind/blob/main/scripts/gridmind_grpo_colab.ipynb)
	- 📝 Blog Post: [Read the write-up](./HF_BLOG_POST.md)
	- 🐙 GitHub: [Code Repository](https://github.com/LO-Kyu/gridmind)

	---

	## License

	MIT License. See [LICENSE](LICENSE) file.

	---

	Questions? Open an issue on GitHub.