ShreeshantXD committed on
Commit
b054ef7
·
1 Parent(s): 926806a

feat: add example environment variables, update README, and enhance inference script for better error handling

Files changed (5)
  1. .env.example +21 -0
  2. README.md +367 -149
  3. baseline_scores.json +20 -20
  4. inference.py +32 -0
  5. python/inference.py +84 -40
.env.example ADDED
@@ -0,0 +1,21 @@
+ # Example environment variables for GridMind-RL
+ # Copy this to .env and fill in real keys for local testing.
+
+ # Mandatory hackathon secret (set this in HF Space secrets too)
+ HF_TOKEN=your_provider_api_key_here
+
+ # OpenAI-compatible endpoint (default: OpenRouter free tier)
+ API_BASE_URL=https://openrouter.ai/api/v1
+
+ # Model to use (change to a smaller model if you need lower latency/cost)
+ MODEL_NAME=your_chosen_model_name_here
+
+ # Optional: provider-specific API key fallback for development
+ OPENAI_API_KEY=your_api_key_here
+
+ # Environment server URL (local Docker)
+ ENV_URL=http://localhost:7860
+
+ # Inference script flags
+ # --fast-mode : run the heuristic (no LLM calls) for deterministic, instant runs
+ # --episodes N : number of episodes per task
README.md CHANGED
@@ -1,165 +1,241 @@
- # GridMind-RL

- **A real-world RL environment for building energy management** — control HVAC systems, thermal storage, batch job scheduling, and demand response under stochastic electricity prices and grid stress events.

- Built on the [OpenEnv](https://github.com/meta-pytorch/OpenEnv) specification. Containerized. Ready for Hugging Face Spaces.

  ---

- ## 🎯 Why GridMind-RL?

- Optimizing building energy use is a **real problem** that utilities, building managers, and industrial operators face every day. An agent must balance:

- - **Cost** — buy electricity when it's cheap, avoid peak pricing
- - **Comfort** — keep indoor temperature within comfortable bounds
- - **Grid compliance** — shed load when the grid signals demand-response events
- - **Scheduling** — complete batch processing jobs before their deadlines
- - **Carbon** — minimize carbon emissions by timing consumption to clean-grid periods

- This isn't a toy or a game. It's a simulation of decisions that **humans actually make** in industrial energy management, packaged as an RL environment where agents can learn to do it better.

  ---

- ## 📐 Observation Space
-
- Each timestep (15 minutes of simulated time), the agent receives:
-
- | Field | Type | Range | Description |
- |-------|------|-------|-------------|
- | `indoor_temperature` | float | 10–40 °C | Current building temperature |
- | `thermal_storage_level` | float | 0.0–1.0 | Thermal tank fill level (0=empty, 1=full) |
- | `process_demand` | float | ≥ 0 kW | Current industrial power demand |
- | `current_price` | float | > 0 $/kWh | Real-time electricity price |
- | `grid_stress_signal` | float | 0.0–1.0 | Utility demand-response urgency (>0.7 = critical) |
- | `carbon_intensity` | float | ≥ 0 gCO₂/kWh | Grid carbon intensity |
- | `hour_of_day` | int | 0–23 | Current hour |
- | `batch_queue` | int[] | — | Deadline slots of pending batch jobs |
- | `cumulative_cost` | float | ≥ 0 $ | Total energy cost so far this episode |
- | `step` | int | 0–95 | Current timestep (96 steps = 24 hours) |
- | `building_id` | int | 0+ | Building index in multi-building mode |
-
- ## 🕹️ Action Space
-
- Each timestep, the agent sends:
-
- | Field | Type | Range | Description |
- |-------|------|-------|-------------|
- | `hvac_power_level` | float | 0.0–1.0 | Fraction of max HVAC power (0=off, 1=full) |
- | `thermal_charge_rate` | float | -1.0–1.0 | Charge (+) or discharge (-) thermal storage |
- | `batch_job_slot` | int | 0–4 | Schedule next batch job: 0=now, 1–4=defer |
- | `load_shed_fraction` | float | 0.0–0.5 | Fraction of non-critical load to shed |
- | `building_id` | int | 0+ | Which building this action targets |
-
- ## 💰 Reward Structure
-
- The environment provides a **dense, multi-component reward** every step — not just a binary win/lose at the end. Each step returns a scalar `reward` (the sum) plus a detailed `reward_components` breakdown:
-
- | Component | Key | Description |
- |-----------|-----|-------------|
- | Cost Savings | `cost_savings` | Rewards reducing energy spend vs baseline |
- | Temperature | `temp_constraint` | Gaussian bonus near setpoint, penalty outside bounds |
- | Grid Response | `grid_response` | Bonus for shedding load during grid stress |
- | Efficiency | `efficiency_bonus` | Thermal storage arbitrage + balanced usage |
- | Stability | `stability_penalty` | Rewards smooth control, penalizes oscillation |
- | Deadlines | `deadline_penalty` | Penalty for missed batch jobs |
- | Carbon | `carbon_reward` | Bonus for low-carbon operation |

  ---

- ## 📋 Tasks (3 difficulty levels)

- Each task defines a concrete objective with a **deterministic programmatic grader** that scores performance from **0.0 to 1.0**.

- | ID | Difficulty | Name | What the Agent Must Do | Grader Weights |
- |----|:----------:|------|------------------------|----------------|
- | 1 | 🟢 Easy | **Cost Minimization** | Minimize total energy cost over 24 hours. No temperature or scheduling constraints. | cost: 100% |
- | 2 | 🟡 Medium | **Constrained Temperature** | Minimize cost **and** keep temperature within 19–23°C at all times. | cost: 60%, temperature: 40% |
- | 3 | 🔴 Hard | **Full Demand Response** | Minimize cost, maintain temperature, respond to grid stress, complete batch jobs on time, minimize carbon. | cost: 28%, temperature: 20%, grid: 20%, batch: 12%, carbon: 20% |

- **Graders are deterministic**: given the same seed, the same actions always produce the same score.

  ---

- ## 🚀 Getting Started (Step by Step)

  ### Prerequisites

- - **Docker** — [Install Docker Desktop](https://www.docker.com/products/docker-desktop/)
  - **Python 3.9+** — [Download Python](https://www.python.org/downloads/)
  - **Git** — [Download Git](https://git-scm.com/downloads)

- ### Step 1: Clone the Repository

  ```bash
- git clone https://github.com/LO-Kyu/gridmind.git
- cd gridmind
  ```

- ### Step 2: Build and Start the Environment Server

  ```bash
  docker build -t gridmind-rl .
  docker run --rm -d -p 7860:7860 -p 7861:7861 --name gridmind gridmind-rl
  ```

- This starts the GridMind-RL environment server on port **7860**. Verify it's running:

  ```bash
- # Linux/macOS
  curl http://localhost:7860/health
-
- # Windows (PowerShell)
- Invoke-RestMethod -Uri http://localhost:7860/health
  ```

- You should see: `{"status":"ok","version":"1.0.0"}`

- ### Step 3: Install Python Dependencies
-
- Open a **new terminal** (keep Docker running) and install:

  ```bash
  pip install -r python/requirements.txt
  ```

- ### Step 4: Get a Free API Key
-
- The inference script uses an LLM to make decisions. You need a **free** API key:
-
- 1. Go to [openrouter.ai/keys](https://openrouter.ai/keys)
- 2. Sign in with Google or GitHub (free)
- 3. Click **"Create Key"** and copy it
-
- ### Step 5: Configure Your API Key

- Open the `.env` file in the project root and paste your key:

- ```env
- API_BASE_URL=https://openrouter.ai/api/v1
- MODEL_NAME=meta-llama/llama-3.1-8b-instruct:free
- OPENAI_API_KEY=sk-or-v1-paste-your-actual-key-here
- ENV_URL=http://localhost:7860
  ```

- > **Note:** The model `meta-llama/llama-3.1-8b-instruct:free` is **completely free** on OpenRouter. No credit card needed.

- ### Step 6: Run the Baseline Inference

- ```bash
- # Run LLM agent on all 3 tasks
- python inference.py --episodes 1

- # Or run without LLM (fast heuristic mode — no API key needed)
- python inference.py --fast-mode --episodes 1
- ```

- The script will:
- 1. Connect to the environment server
- 2. Run the agent on Task 1 (easy), Task 2 (medium), Task 3 (hard)
- 3. Print `[START]`, `[STEP1]`...`[STEP96]`, `[END]` for each episode
- 4. Save results to `baseline_scores.json`

- ### Step 7: Stop the Server (When Done)

  ```bash
  docker stop gridmind
@@ -167,56 +243,149 @@ docker stop gridmind

  ---

- ## 📊 Baseline Scores

- Produced by running `python inference.py --fast-mode --episodes 1` (heuristic policy):

- | Task | Difficulty | Score | Details |
- |------|:----------:|:-----:|---------|
- | 1 — Cost Minimization | 🟢 Easy | **0.7063** | cost: 0.706 |
- | 2 — Temperature Management | 🟡 Medium | **0.6333** | cost: 0.701, temperature: 0.531 |
- | 3 — Full Demand Response | 🔴 Hard | **0.5966** | cost: 0.670, temp: 0.573, grid: 0.214, batch: 1.000, carbon: 0.657 |
- | **Overall Average** | | **0.6454** | |

- Scores are in the **0.0–1.0** range. Higher is better.

  ---

- ## 🔌 HTTP API Reference

- Base URL: `http://localhost:7860`

- | Method | Path | Purpose |
- |--------|------|---------|
- | `GET` | `/health` | Health check `{"status":"ok","version":"1.0.0"}` |
- | `GET` | `/ping` | Lightweight liveness check |
- | `POST` | `/reset` | Start a new episode. Body: `{"task_id": 1, "seed": 42}` |
- | `POST` | `/step` | Take one action. Body: action JSON (see Action Space above) |
- | `GET` | `/state` | Full environment state snapshot |
- | `GET` | `/grade` | Episode score (0.0–1.0) with sub-scores |
- | `GET` | `/replay` | Full step-by-step replay of the episode |
- | `GET` | `/tasks` | List all task definitions and grader weights |
- | `GET` | `/metrics` | Prometheus-format operational metrics |

- ### Example API Calls

  ```bash
- # Reset to Task 1 (easy) with seed 42
  curl -X POST http://localhost:7860/reset \
  -H "Content-Type: application/json" \
  -d '{"task_id": 1, "seed": 42}'

- # Take one step
  curl -X POST http://localhost:7860/step \
  -H "Content-Type: application/json" \
- -d '{"hvac_power_level": 0.5, "thermal_charge_rate": 0.1, "batch_job_slot": 1, "load_shed_fraction": 0.0}'
-
- # Check score after episode
  curl http://localhost:7860/grade
  ```

  ---

  ## 🏗️ Architecture

  ```
@@ -385,21 +554,70 @@ Each episode emits structured markers for automated evaluation:

  ---

- ## 📎 OpenEnv Spec Compliance
-
- | Requirement | Status |
- |-------------|--------|
- | `openenv.yaml` with metadata | ✅ |
- | Typed Pydantic models (Observation, Action, Reward) | ✅ |
- | `step(action)` → observation, reward, done, info | ✅ |
- | `reset()` → initial observation | ✅ |
- | `state()` → current state | ✅ |
- | 3 tasks with programmatic graders (0.0–1.0) | ✅ |
- | Dense reward function (not binary) | ✅ |
- | Baseline inference using OpenAI client | ✅ |
- | Working Dockerfile | ✅ |
- | Deterministic with seed | ✅ |
- | Exploit detection | ✅ |

  ---
 
+ # 🏢 GridMind-RL — Energy Management Reinforcement Learning Environment

+ **A real-world RL environment for intelligent building energy optimization.** Control HVAC systems, thermal storage, batch job scheduling, and demand response under stochastic electricity prices and grid stress events.

+ Built on the [OpenEnv](https://github.com/meta-pytorch/OpenEnv) specification. Containerized. Ready for Hugging Face Spaces deployment.

  ---

+ ## 📖 Overview & Motivation

+ Building energy management is a **real-world optimization problem** facing utilities, facility operators, and industrial sites globally. Traditional rule-based controls waste billions in energy costs and miss opportunities for grid participation.

+ **GridMind-RL** simulates decisions that facility operators must make daily:

+ - **Cost Optimization** — Buy electricity when prices are low, avoid peak surcharges
+ - **Comfort & Safety** — Maintain indoor temperature within acceptable ranges while managing thermal inertia
+ - **Grid Participation** — Respond to demand-response signals and grid stress events
+ - **Batch Scheduling** — Coordinate industrial process timings to meet deadlines and minimize energy cost
+ - **Carbon Minimization** — Shift consumption to periods when grid carbon intensity is low
+
+ **Why this matters:** An RL agent trained in this environment can learn strategies that would be difficult or impossible for humans to hand-craft. The combination of continuous control (HVAC power, thermal storage), discrete decisions (batch scheduling), and multiple simultaneous objectives (cost, comfort, grid, deadlines, carbon) creates a realistic, challenging benchmark.
+
+ **Episode Length:** 96 steps = 24 hours at 15-minute resolution. A complete episode requires strategic decision-making across a full day-night cycle.

  ---

+ ## 📐 Observation Space
+
+ At each timestep, the environment provides the following observations. **Episode length: 96 steps** (15-minute intervals = 24 hours).
+
+ | Field | Data Type | Range / Values | Description |
+ |-------|-----------|----------------|-------------|
+ | `indoor_temperature` | float | 10–40 °C | Current building interior temperature |
+ | `thermal_storage_level` | float | 0.0–1.0 | Thermal tank charge state (0 = empty, 1 = full) |
+ | `process_demand` | float | ≥ 0 kW | Current industrial batch process power draw |
+ | `current_price` | float | > 0 $/kWh | Real-time spot electricity price |
+ | `grid_stress_signal` | float | 0.0–1.0 | Utility demand-response urgency (0.7+ = critical) |
+ | `carbon_intensity` | float | ≥ 0 gCO₂/kWh | Current grid carbon intensity |
+ | `hour_of_day` | int | 0–23 | Time-of-day context |
+ | `batch_queue` | int array | — | Pending batch jobs with deadline slots |
+ | `cumulative_cost` | float | ≥ 0 $ | Energy cost accumulated in the current episode so far |
+ | `step` | int | 0–95 | Current timestep (96 total = 24 hours) |
+ | `building_id` | int | 0+ | Building identifier (for multi-building scenarios) |
+
+ **Observation Properties:**
+ - Observations are **deterministic** given the seed — the same seed produces identical sequences
+ - All fields are **normalized or bounded** for stable learning
+ - Prices follow realistic time-of-use patterns; carbon intensity varies with grid mix
+ - Batch queue starts empty; jobs appear stochastically based on the task/seed
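The bounds in the table above are easy to enforce in client code. A minimal sketch, assuming the field names from the table (the `Observation` class itself is hypothetical, not the environment's actual Pydantic model):

```python
from dataclasses import dataclass

@dataclass
class Observation:
    """Hypothetical container mirroring part of the observation table."""
    indoor_temperature: float    # 10-40 degC
    thermal_storage_level: float # 0.0-1.0
    current_price: float         # > 0 $/kWh
    grid_stress_signal: float    # 0.0-1.0
    hour_of_day: int             # 0-23
    step: int                    # 0-95

    def validate(self) -> None:
        """Raise AssertionError if any field is outside its documented range."""
        assert 10 <= self.indoor_temperature <= 40
        assert 0.0 <= self.thermal_storage_level <= 1.0
        assert self.current_price > 0
        assert 0.0 <= self.grid_stress_signal <= 1.0
        assert 0 <= self.hour_of_day <= 23
        assert 0 <= self.step <= 95

obs = Observation(21.5, 0.4, 0.12, 0.3, 14, 56)
obs.validate()  # passes silently for in-range values
```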
+
+ ---
+
+ ## 🎮 Action Space
+
+ At each step, the agent sends an action controlling four independent subsystems:
+
+ | Field | Data Type | Range | Description |
+ |-------|-----------|-------|-------------|
+ | `hvac_power_level` | float | 0.0–1.0 | HVAC system power (0 = off, 1 = full) |
+ | `thermal_charge_rate` | float | -1.0–1.0 | Thermal storage control (+charge, -discharge) |
+ | `batch_job_slot` | int | 0–4 | Schedule next batch job: 0=immediate, 1–4=defer |
+ | `load_shed_fraction` | float | 0.0–0.5 | Non-critical load reduction (0–50%) for demand response |
+ | `building_id` | int | 0+ | Building identifier (routing) |
+
+ **Action Space Properties:**
+ - **Continuous** (HVAC, thermal charging, load shedding) + **discrete** (batch scheduling) → hybrid control
+ - Actions are applied every 15-minute step
+ - Load shedding is capped at 50% to ensure safety/habitability
+ - Batch scheduling decisions affect energy cost and deadline compliance
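A policy's raw outputs can be coerced into a legal action before sending it to `/step`. A sketch under the ranges stated in the table (the `make_action` helper is illustrative, not part of the repo):

```python
def clamp(value, lo, hi):
    """Clip a control value into its legal range."""
    return max(lo, min(hi, value))

def make_action(hvac, charge, slot, shed, building=0):
    """Hypothetical helper: coerce raw policy outputs into a valid
    action dict using the ranges from the action-space table."""
    return {
        "hvac_power_level": clamp(hvac, 0.0, 1.0),
        "thermal_charge_rate": clamp(charge, -1.0, 1.0),
        "batch_job_slot": int(clamp(slot, 0, 4)),
        # load shedding is capped at 50% for habitability
        "load_shed_fraction": clamp(shed, 0.0, 0.5),
        "building_id": building,
    }

action = make_action(hvac=1.3, charge=-2.0, slot=7, shed=0.9)
# clipped to: hvac=1.0, charge=-1.0, slot=4, shed=0.5
```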

  ---

+ ## 💡 Reward Function
+
+ The environment provides **dense rewards every step** (not sparse, not binary). Each step returns:
+ - A scalar reward (the sum of components)
+ - A dictionary of 7 weighted sub-components for transparency

+ | Component | Purpose | Possible Values |
+ |-----------|---------|-----------------|
+ | **cost_savings** | Minimize energy bill | Negative (cost increases) to positive (savings vs baseline) |
+ | **temp_constraint** | Maintain comfort | Gaussian bonus near 21°C, penalty outside 19–23°C bounds |
+ | **grid_response** | Shift load during stress | Bonus proportional to shed fraction when grid signal > 0.7 |
+ | **efficiency_bonus** | Exploit thermal storage | Rewards charge/discharge timing and thermal arbitrage |
+ | **stability_penalty** | Smooth control | Small penalty for rapid oscillations in HVAC/storage |
+ | **deadline_penalty** | Meet job deadlines | Large penalty if a batch job finishes after its deadline |
+ | **carbon_reward** | Low-carbon consumption | Bonus for consuming during low-carbon grid periods |

+ **Example Reward Calculation:**
+ If an agent takes a well-timed action during a high-price, high-stress period:
+ - Large positive `cost_savings` (avoided an expensive hour)
+ - Positive `grid_response` (shed load successfully)
+ - Possibly positive `carbon_reward` (if the grid is clean)
+ - **Total step reward** = weighted sum of all components

+ This multi-objective reward structure encourages **learning tradeoffs** between cost, comfort, grid support, and carbon efficiency.

  ---

+ ## 📋 Tasks & Difficulty Levels
+
+ Three independent tasks with **deterministic programmatic graders**. Scores range **0.0–1.0**; higher is better.
+
+ ### Task 1 — Cost Minimization (🟢 Easy)
+
+ **Objective:** Minimize total energy cost over 24 hours with no other constraints.
+
+ **Difficulty Rationale:** Only one objective (cost) to optimize; temperature and grid constraints are relaxed.
+
+ **Grader Metrics:**
+ - **Cost score (100%)** — Compares total episode energy cost to a deterministic baseline. Higher savings → higher score.
+
+ **Baseline Score:** **0.7063**
+
+ ---
+
+ ### Task 2 — Constrained Temperature Control (🟡 Medium)
+
+ **Objective:** Minimize cost while maintaining indoor temperature between **19–23°C** throughout the episode.
+
+ **Difficulty Rationale:** Introduces a hard constraint (temperature bounds). The agent must use thermal storage strategically to meet both cost and comfort goals.
+
+ **Grader Metrics:**
+ - **Cost score (60%)** — Total energy cost vs baseline
+ - **Temperature score (40%)** — Fraction of steps within bounds (hard penalty for violations)
+
+ **Notes:** A naive agent might achieve low cost by disabling HVAC, but then temperatures drift out of bounds (0 score). Trade-off learning is required.
+
+ **Baseline Score:** **0.6333**
+
+ ---
+
+ ### Task 3 — Full Demand Response (🔴 Hard)
+
+ **Objective:** Minimize cost, maintain temperature, respond to grid events, complete batch jobs on time, and minimize carbon emissions. This is a **multi-objective constraint satisfaction** problem.
+
+ **Difficulty Rationale:** The most realistic task. The agent must balance five competing objectives simultaneously; any single failure is costly.
+
+ **Grader Metrics:**
+ - **Cost score (28%)** — Energy cost
+ - **Temperature score (20%)** — Time within comfort bounds
+ - **Grid response score (20%)** — Load shed during demand-response events (signal > 0.7)
+ - **Batch deadline score (12%)** — Fraction of jobs completed before deadline
+ - **Carbon reward score (20%)** — Shift load to low-carbon periods
+
+ **Baseline Breakdown:**
+ - Cost: 0.670, Temperature: 0.573, Grid: 0.214, Batch: 1.000, Carbon: 0.657
+ - **Overall: 0.5966**
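The overall grade is the weighted sum of the sub-scores using the percentages listed under Grader Metrics. A sketch (the function name is illustrative, not the grader's actual API) that reproduces the published Task 3 score from its sub-scores to within rounding:

```python
# Task 3 grader weights, as documented: cost 28%, temperature 20%,
# grid 20%, batch 12%, carbon 20%
TASK3_WEIGHTS = {"cost": 0.28, "temperature": 0.20, "grid": 0.20,
                 "batch": 0.12, "carbon": 0.20}

def overall_score(sub_scores, weights):
    """Weighted sum of per-objective sub-scores (each in 0.0-1.0)."""
    return sum(weights[k] * sub_scores[k] for k in weights)

baseline = {"cost": 0.670, "temperature": 0.573, "grid": 0.214,
            "batch": 1.000, "carbon": 0.657}
score = overall_score(baseline, TASK3_WEIGHTS)
# agrees with the published 0.5966 up to rounding of the sub-scores
```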
+
+ **Challenge:** The grid response score (~0.21) shows that the baseline heuristic rarely sheds load opportunistically. Learning agents should discover that quick load shedding during high-price, high-stress periods yields significant cost savings.
+
+ **Grader Determinism:** The same seed always produces identical evaluations. Episodes are seeded internally; reproducible batches of evaluations can be generated for benchmark comparisons.
+
+ ---
+
+ ## 🚀 Setup & Usage

  ### Prerequisites

+ - **Docker** — [Download Docker Desktop](https://www.docker.com/products/docker-desktop/)
  - **Python 3.9+** — [Download Python](https://www.python.org/downloads/)
  - **Git** — [Download Git](https://git-scm.com/downloads)

+ ### Quick Start (5 minutes)
+
+ #### 1. Clone the Repository

  ```bash
+ git clone https://github.com/LO-Kyu/gridmind-rl.git
+ cd gridmind-rl
  ```

+ #### 2. Build and Start the Environment Server

  ```bash
  docker build -t gridmind-rl .
  docker run --rm -d -p 7860:7860 -p 7861:7861 --name gridmind gridmind-rl
  ```

+ Verify the server is running:

  ```bash
+ # Check health endpoint
  curl http://localhost:7860/health
+ # Expected: {"status":"ok","version":"1.0.0"}
  ```

+ #### 3. Install Python Dependencies

+ Open a **new terminal** and install:

  ```bash
  pip install -r python/requirements.txt
  ```

+ #### 4. Run Inference (No LLM — Fast)

+ Run a fast, deterministic baseline using the heuristic policy:

+ ```bash
+ python inference.py --fast-mode --episodes 1
  ```

+ Expected output (sample):
+ ```
+ [START] task=Cost_Minimization env=gridmind model=heuristic
+ [STEP1] step=1 action={...} reward=10.5 done=false
+ [STEP2] step=2 action={...} reward=12.3 done=false
+ ...
+ [STEP96] step=96 action={...} reward=8.9 done=true
+ [END] success=true steps=96 rewards=[10.5, 12.3, ..., 8.9]
+ ```

+ Results saved to: `baseline_scores.json`
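The `[START]`/`[STEP]`/`[END]` markers follow a simple line protocol that automated graders can parse. A hedged sketch of an emitter (the exact field set in the real script may differ):

```python
def emit_start(task, model="heuristic"):
    # One START line per episode, identifying the task and policy.
    print(f"[START] task={task} env=gridmind model={model}")

def emit_step(i, reward, done):
    # One STEP line per timestep; booleans are lowercased as in the sample.
    print(f"[STEP{i}] step={i} reward={reward} done={str(done).lower()}")

def emit_end(rewards):
    # One END line per episode summarizing step count and rewards.
    print(f"[END] success=true steps={len(rewards)} rewards={rewards}")

emit_start("Cost_Minimization")
rewards = []
for i in range(1, 4):  # 3 steps for illustration; a real episode has 96
    r = 10.0 + i
    rewards.append(r)
    emit_step(i, r, done=(i == 3))
emit_end(rewards)
```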
+ #### 5. (Optional) Run with LLM

+ To use an LLM agent for decision-making:

+ 1. Get a **free API key** from [openrouter.ai/keys](https://openrouter.ai/keys) (no credit card needed)
+ 2. Create a `.env` file (copy from `.env.example`):
+ ```bash
+ cp .env.example .env
+ ```
+ 3. Edit `.env` and add your API key:
+ ```env
+ HF_TOKEN=sk-or-v1-your-key-here
+ # or
+ OPENAI_API_KEY=sk-or-v1-your-key-here
+ ```
+ 4. Run with the LLM:
+ ```bash
+ python inference.py --episodes 1
+ ```

+ #### 6. Stop the Server (When Done)

  ```bash
  docker stop gridmind

  ---

+ ### Inference Script Reference

+ The `inference.py` script (project root) is the **hackathon submission entrypoint**.

+ **Environment Variables:**

+ | Variable | Default | Description |
+ |----------|---------|-------------|
+ | `HF_TOKEN` | (required for submission) | API key for the LLM provider or HF Spaces |
+ | `OPENAI_API_KEY` | (optional fallback) | Alternative OpenAI-compatible key |
+ | `API_BASE_URL` | `https://openrouter.ai/api/v1` | LLM endpoint URL |
+ | `MODEL_NAME` | `meta-llama/llama-3.3-70b-instruct:free` | Model identifier |
+ | `ENV_URL` | `http://localhost:7860` | Environment server address |
+
+ **Command-Line Flags:**
+
+ | Flag | Default | Description |
+ |------|---------|-------------|
+ | `--episodes N` | 1 | Episodes per task (runs tasks 1, 2, 3 in sequence) |
+ | `--fast-mode` | off | Don't call the LLM; use the heuristic policy only (reproducible, no API calls) |
+ | `--llm-every N` | 4 | Reuse each LLM decision for N steps (reduces API calls) |
+ | `--max-steps N` | 96 | Stop the episode early after N steps |
+ | `--env-url URL` | from env var | Override the environment server URL |
+ | `--output FILE` | `baseline_scores.json` | Output results filename |
+ | `--verbose` | off | Print detailed logs for each step |
+
+ **Examples:**
+
+ ```bash
+ # Run all 3 tasks with the LLM (1 episode each)
+ python inference.py --episodes 1
+
+ # Reproduce the baseline fast (no LLM)
+ python inference.py --fast-mode --episodes 1
+
+ # Heuristic policy with verbose output
+ python inference.py --fast-mode --episodes 1 --verbose
+
+ # Run 5 episodes per task against a custom environment
+ python inference.py --episodes 5 --env-url http://my-server:7860
+ ```
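The flag table maps directly onto a standard argument parser. A sketch of how those defaults could be declared (the real script's parser may differ in details):

```python
import argparse
import os

def build_parser():
    """Declare the documented flags with their documented defaults."""
    p = argparse.ArgumentParser(description="GridMind-RL inference runner (sketch)")
    p.add_argument("--episodes", type=int, default=1, help="episodes per task")
    p.add_argument("--fast-mode", action="store_true",
                   help="heuristic only, no LLM calls")
    p.add_argument("--llm-every", type=int, default=4,
                   help="reuse each LLM decision for N steps")
    p.add_argument("--max-steps", type=int, default=96,
                   help="stop the episode early after N steps")
    p.add_argument("--env-url",
                   default=os.environ.get("ENV_URL", "http://localhost:7860"))
    p.add_argument("--output", default="baseline_scores.json")
    p.add_argument("--verbose", action="store_true")
    return p

args = build_parser().parse_args(["--fast-mode", "--episodes", "2"])
# args.fast_mode is True, args.episodes is 2; other flags keep their defaults
```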

  ---

+ ### HTTP API Reference

+ **Base URL:** `http://localhost:7860`

+ | Endpoint | Method | Purpose | Example Body |
+ |----------|--------|---------|--------------|
+ | `/health` | GET | Liveness check | — |
+ | `/ping` | GET | Lightweight ping | — |
+ | `/reset` | POST | Reset episode for a task | `{"task_id": 1, "seed": 42}` |
+ | `/step` | POST | Apply action, get next observation | `{"hvac_power_level": 0.5, "thermal_charge_rate": 0.1, ...}` |
+ | `/state` | GET | Current full state snapshot | — |
+ | `/grade` | GET | Episode score (0.0–1.0) with sub-scores | — |
+ | `/replay` | GET | Full step-by-step trajectory | — |
+ | `/tasks` | GET | Task definitions and grader weights | — |
+ | `/metrics` | GET | Prometheus-format metrics | — |

+ **Example Workflow:**

  ```bash
+ # 1. Reset to Task 1 with seed 42
  curl -X POST http://localhost:7860/reset \
  -H "Content-Type: application/json" \
  -d '{"task_id": 1, "seed": 42}'

+ # 2. Get the initial observation
+ curl http://localhost:7860/state
+
+ # 3. Take an action
  curl -X POST http://localhost:7860/step \
  -H "Content-Type: application/json" \
+ -d '{
+ "hvac_power_level": 0.5,
+ "thermal_charge_rate": 0.1,
+ "batch_job_slot": 1,
+ "load_shed_fraction": 0.0
+ }'
+
+ # 4. Check the final score after the episode completes
  curl http://localhost:7860/grade
  ```
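The same workflow can be driven from Python with only the standard library. A minimal sketch that builds the JSON requests (sending them assumes the Docker server from the setup steps is running, so the send is left commented out):

```python
import json
import urllib.request

BASE = "http://localhost:7860"  # assumes the local Docker server

def post_request(path, payload):
    """Build a JSON POST request for the environment server (not yet sent)."""
    return urllib.request.Request(
        BASE + path,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

reset_req = post_request("/reset", {"task_id": 1, "seed": 42})
step_req = post_request("/step", {
    "hvac_power_level": 0.5,
    "thermal_charge_rate": 0.1,
    "batch_job_slot": 1,
    "load_shed_fraction": 0.0,
})

# With the server running, send a request and decode the response:
# with urllib.request.urlopen(step_req) as resp:
#     observation = json.load(resp)
```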

  ---

+ ## 📊 Baseline Performance Scores
+
+ The baseline is a **heuristic policy** (rule-based, no LLM) representing a reasonable but non-optimized control strategy. Your RL agent should aim to exceed these scores.
+
+ **Baseline Run:** `python inference.py --fast-mode --episodes 1`
+
+ ### Summary Scores
+
+ | Task | Difficulty | Score | Model |
+ |------|:----------:|:-----:|-------|
+ | Task 1 — Cost Minimization | 🟢 Easy | **0.7063** | Heuristic |
+ | Task 2 — Temperature Control | 🟡 Medium | **0.6333** | Heuristic |
+ | Task 3 — Full Demand Response | 🔴 Hard | **0.5966** | Heuristic |
+ | **Overall Average** | — | **0.6454** | Heuristic |
+
+ ### Detailed Breakdown
+
+ #### Task 1 Results
+ - **Task:** Cost minimization (96 steps × 15 min = 24 hours)
+ - **Score:** 0.7063
+ - **Sub-score:** Cost = 0.706
+ - **Interpretation:** The heuristic achieves ~70% of the optimal cost reduction vs baseline
+
+ #### Task 2 Results
+ - **Task:** Minimize cost while maintaining temperature 19–23°C
+ - **Score:** 0.6333
+ - **Sub-scores:**
+ - Cost: 0.701
+ - Temperature constraint: 0.531 (the agent violated comfort bounds ~47% of the time)
+ - **Interpretation:** Temperature management is challenging for the heuristic. Tighter thermal control could improve this score significantly.
+
+ #### Task 3 Results (Most Interesting)
+ - **Task:** Multi-objective: cost, temperature, grid response, batch deadlines, carbon
+ - **Score:** 0.5966
+ - **Sub-scores:**
+ - Cost: 0.670
+ - Temperature: 0.573 (the same temperature-control challenge as Task 2)
+ - **Grid response: 0.214** ← the heuristic rarely participates in demand response
+ - Batch deadline: 1.000 (the heuristic always completes jobs on time)
+ - Carbon: 0.657
+
+ **Key Insight:** The heuristic's low grid response score (0.21) suggests that learned agents have significant room for improvement by:
+ 1. Recognizing high-price + high-stress periods
+ 2. Proactively shedding load to reduce cost
+ 3. Using thermal storage to recover comfort afterward
+
+ This multi-objective setting is where RL agents typically exceed heuristic baselines.
+
+ ### Reproducibility & Evaluation
+
+ - **Deterministic:** Baseline scores are **deterministic** — the same seed always produces identical actions and rewards
+ - **Seeding:** Each task uses a fixed base seed (1100, 1200, 1300) for reproducible evaluation
+ - **Your Submissions:** Your agent will be evaluated on the same seed distribution; compare your scores directly to the baseline
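The determinism property can be sanity-checked on the client side. A sketch, with a seeded random stream standing in for the environment's internal noise process (the `price_trace` function is illustrative, not the environment's actual generator):

```python
import random

def price_trace(seed, steps=96):
    """Stand-in for a seeded price process: same seed, same trace."""
    rng = random.Random(seed)
    # Arbitrary illustrative price model in $/kWh.
    return [round(0.10 + 0.05 * rng.random(), 4) for _ in range(steps)]

# Identical seeds reproduce the episode exactly; different seeds diverge.
assert price_trace(1100) == price_trace(1100)
assert price_trace(1100) != price_trace(1200)
```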
+
+ ---
+
  ## 🏗️ Architecture

  ```

  ---

+ ## 📎 OpenEnv Specification Compliance
+
+ GridMind-RL fully implements the OpenEnv specification for standardized RL environments. All components are present and tested:
+
+ | Requirement | Status | Notes |
+ |-------------|:------:|-------|
+ | Manifest (`openenv.yaml`) | ✅ | All metadata, schema definitions, and version info |
+ | Observation Schema | ✅ | 11-field object: temperature, storage, price, grid signal, carbon, hour, batch queue, cost, step, building_id |
+ | Action Schema | ✅ | 5-field object: HVAC, thermal rate, batch slot, load shed, building_id |
+ | HTTP Endpoints | ✅ | `/reset`, `/step`, `/state`, `/grade`, `/replay`, `/tasks`, `/health`, `/metrics` |
+ | Determinism | ✅ | Seeded episode generation; identical seeds produce identical trajectories |
+ | Typed Models | ✅ | Pydantic models (Python) mirror Go structs exactly |
+ | Dense Rewards | ✅ | 7-component reward breakdown every step |
+ | Graders | ✅ | 3 tasks with programmatic, deterministic graders (0.0–1.0 range) |
+ | Exploit Detection | ✅ | Built into the grading pipeline to flag unrealistic scores |
+
+ ---
+
+ ## ❓ FAQ
+
+ **Q: Can I use a different model?**
+ A: Yes. Set the `MODEL_NAME` environment variable to any OpenAI-compatible model. The default (`meta-llama/llama-3.3-70b-instruct:free`) is free on OpenRouter with no credit card.
+
+ **Q: How do I avoid rate limiting?**
+ A: (1) Use `--fast-mode` for local testing (no API calls), (2) set `--llm-every 4` to reuse decisions, (3) use a paid API tier for submission, or (4) train and submit an offline policy.
+
+ **Q: Will my API key be exposed in submissions?**
+ A: No. Store your API key in `.env` (git-ignored). On HF Spaces, set secrets via the Space settings UI; keys are never committed to the repo.
+
+ **Q: What's the difference between `HF_TOKEN` and `OPENAI_API_KEY`?**
+ A: `HF_TOKEN` is used in HF Space deployments and external evaluations. `OPENAI_API_KEY` is a fallback for local development. The code tries `HF_TOKEN` first, then `OPENAI_API_KEY`. At least one must be set.
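The lookup order described in that answer can be sketched as follows (the `resolve_api_key` helper name is illustrative; only the `HF_TOKEN`-before-`OPENAI_API_KEY` order comes from the FAQ):

```python
import os

def resolve_api_key():
    """Return the first configured key, trying HF_TOKEN before OPENAI_API_KEY."""
    key = os.environ.get("HF_TOKEN") or os.environ.get("OPENAI_API_KEY")
    if not key:
        raise RuntimeError("Set HF_TOKEN or OPENAI_API_KEY in .env")
    return key

# Usage (assumes one of the variables is set in the environment):
# client = OpenAI(base_url=os.environ["API_BASE_URL"], api_key=resolve_api_key())
```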
588
+
+ **Q: Can I submit an offline/trained policy?**
+ A: Yes. Modify `python/inference.py` to use your trained agent instead of LLM calls. Ensure you still emit the required `[START]`, `[STEP]`, `[END]` output format.
+
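Whatever policy you plug in, the stdout contract stays the same. The sketch below emits the required markers; the field layout follows the format documented in `inference.py`, while `emit_episode`, `policy`, and `transitions` are illustrative names, not repo APIs.

```python
import json

def emit_episode(policy, transitions, task_name, model="offline-policy"):
    """Print the hackathon markers for one episode.

    `transitions` is a list of (observation, reward) pairs; `policy` maps an
    observation to an action dict. Both stand in for your real agent loop.
    """
    print(f"[START] task={task_name} env=gridmind model={model}")
    rewards = []
    for step, (obs, reward) in enumerate(transitions, start=1):
        action = policy(obs)
        rewards.append(reward)
        done = step == len(transitions)
        action_json = json.dumps(action, separators=(",", ":"))
        print(f"[STEP] step={step} action={action_json} "
              f"reward={reward:.2f} done={'true' if done else 'false'} error=null")
    rewards_str = ",".join(f"{r:.2f}" for r in rewards)
    print(f"[END] success=true steps={len(rewards)} rewards={rewards_str}")
```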
+ **Q: What if my submission times out?**
+ A: Each episode is 96 steps, and the environment runs 3 episodes (one per task). Optimize for latency: reduce LLM calls (use `--llm-every`), use a faster model, or submit a heuristic/trained offline policy.
+
+ ---
+
+ ## 🎯 Submission Checklist
+
+ Before submitting, verify:
+
+ - [ ] Clone repo, build Docker, run `docker run -p 7860:7860 -p 7861:7861 gridmind-rl`
+ - [ ] Run `python inference.py --fast-mode --episodes 1` locally — should produce `baseline_scores.json`
+ - [ ] Check `[START]`, `[STEP]`, `[END]` markers in stdout
+ - [ ] Set `HF_TOKEN` or `OPENAI_API_KEY` in `.env` for LLM runs
+ - [ ] Test with LLM: `python inference.py --episodes 1`
+ - [ ] Verify Dockerfile builds without errors: `docker build -t gridmind-rl .`
+ - [ ] Create HF Space (Docker SDK, CPU Basic)
+ - [ ] Push repo to HF Space: `git push hf main`
+ - [ ] Set secrets in HF Space UI: `HF_TOKEN`, `API_BASE_URL` (optional), `MODEL_NAME` (optional)
+ - [ ] Verify Space is running: `curl https://YOUR_USERNAME-gridmind-rl.hf.space/health`
+ - [ ] Submit Space URL to hackathon organizers
+
+ ---
+
+ ## 📚 Additional Resources
+
+ - **OpenEnv Spec:** https://github.com/meta-pytorch/OpenEnv
+ - **OpenRouter Free Models:** https://openrouter.ai/keys
+ - **HF Spaces Docs:** https://huggingface.co/docs/hub/spaces
+ - **GridMind Repository:** https://github.com/LO-Kyu/gridmind-rl

 ---

baseline_scores.json CHANGED
@@ -3,54 +3,54 @@
   "api_base": "https://openrouter.ai/api/v1",
   "episodes_per_task": 1,
   "seed_base": 1000,
-  "fast_mode": false,
   "llm_every": 4,
   "max_steps": null,
   "task_averages": {
-    "1": 0.6841,
-    "2": 0.6354,
-    "3": 0.5762
   },
-  "overall_average": 0.6319,
   "all_results": [
     {
       "task_id": 1,
       "seed": 1100,
-      "total_reward": 249.86282530532836,
       "total_steps": 96,
-      "elapsed_sec": 89.6078360080719,
-      "score": 0.6841,
       "sub_scores": {
-        "cost": 0.684121466192583
       },
       "exploit_detected": false
     },
     {
       "task_id": 2,
       "seed": 1200,
-      "total_reward": 249.49791911192042,
       "total_steps": 96,
-      "elapsed_sec": 1.0353844165802002,
-      "score": 0.6354,
       "sub_scores": {
-        "cost": 0.6979584216616654,
-        "temperature": 0.5416666666666666
       },
       "exploit_detected": false
     },
     {
       "task_id": 3,
       "seed": 1300,
-      "total_reward": 247.87201304624966,
       "total_steps": 96,
-      "elapsed_sec": 0.9833629131317139,
-      "score": 0.5762,
       "sub_scores": {
         "batch_deadline": 1,
-        "carbon": 0.6156696345619006,
-        "cost": 0.6199066683519812,
         "grid_response": 0.21428571428571427,
-        "temperature": 0.5833333333333334
       },
       "exploit_detected": false
     }

   "api_base": "https://openrouter.ai/api/v1",
   "episodes_per_task": 1,
   "seed_base": 1000,
+  "fast_mode": true,
   "llm_every": 4,
   "max_steps": null,
   "task_averages": {
+    "1": 0.7063,
+    "2": 0.6333,
+    "3": 0.5966
   },
+  "overall_average": 0.6454,
   "all_results": [
     {
       "task_id": 1,
       "seed": 1100,
+      "total_reward": 251.40178983938813,
       "total_steps": 96,
+      "elapsed_sec": 1.8465147018432617,
+      "score": 0.7063,
       "sub_scores": {
+        "cost": 0.7063441549865395
       },
       "exploit_detected": false
     },
     {
       "task_id": 2,
       "seed": 1200,
+      "total_reward": 246.40262234598185,
       "total_steps": 96,
+      "elapsed_sec": 1.826324224472046,
+      "score": 0.6333,
       "sub_scores": {
+        "cost": 0.7014155357169216,
+        "temperature": 0.53125
       },
       "exploit_detected": false
     },
     {
       "task_id": 3,
       "seed": 1300,
+      "total_reward": 255.60231973463087,
       "total_steps": 96,
+      "elapsed_sec": 1.8300776481628418,
+      "score": 0.5966,
       "sub_scores": {
         "batch_deadline": 1,
+        "carbon": 0.6574530318382599,
+        "cost": 0.670084941969173,
         "grid_response": 0.21428571428571427,
+        "temperature": 0.5729166666666666
       },
       "exploit_detected": false
     }
inference.py CHANGED
@@ -1,11 +1,43 @@
 """
 Hackathon entrypoint: run from repo root with:
     python inference.py
 Delegates to python/inference.py (single source of truth).
 """
 import runpy
 from pathlib import Path

 if __name__ == "__main__":
     impl = Path(__file__).resolve().parent / "python" / "inference.py"
     runpy.run_path(str(impl), run_name="__main__")

 """
 Hackathon entrypoint: run from repo root with:
     python inference.py
+
+Reads environment variables:
+  - API_BASE_URL (default: https://openrouter.ai/api/v1)
+  - MODEL_NAME (default: meta-llama/llama-3.3-70b-instruct:free)
+  - HF_TOKEN (mandatory, no default)
+
+Emits hackathon-compliant stdout format:
+    [START] task=<name> env=gridmind model=<model>
+    [STEP] step=<n> action=<json> reward=<0.00> done=<true|false> error=<msg|null>
+    [END] success=<true|false> steps=<n> rewards=<r1,r2,...>
+
 Delegates to python/inference.py (single source of truth).
 """
+import os
+import sys
 import runpy
 from pathlib import Path

 if __name__ == "__main__":
+    # Load .env file FIRST (if present)
+    try:
+        from dotenv import load_dotenv
+        load_dotenv()  # reads .env from current directory or project root
+    except ImportError:
+        pass  # python-dotenv not installed — env vars must be set manually
+
+    # Now validate HF_TOKEN after .env is loaded
+    hf_token = os.getenv("HF_TOKEN")
+    if not hf_token:
+        # Allow OPENAI_API_KEY as fallback for development
+        if not os.getenv("OPENAI_API_KEY"):
+            print(
+                "[ERROR] HF_TOKEN environment variable is required "
+                "(or OPENAI_API_KEY for development)",
+                file=sys.stderr,
+            )
+            sys.exit(1)
+
     impl = Path(__file__).resolve().parent / "python" / "inference.py"
     runpy.run_path(str(impl), run_name="__main__")
python/inference.py CHANGED
@@ -47,8 +47,9 @@ except ImportError:
 ENV_URL = os.getenv("ENV_URL", "http://localhost:7860")
 MODEL_NAME = os.getenv("MODEL_NAME", "meta-llama/llama-3.3-70b-instruct:free")
 API_BASE_URL = os.getenv("API_BASE_URL", "https://openrouter.ai/api/v1")
-# Accept OPENAI_API_KEY (hackathon standard) or HF_TOKEN (HuggingFace convention)
-OPENAI_API_KEY = os.getenv("OPENAI_API_KEY", "") or os.getenv("HF_TOKEN", "")
 DEFAULT_EPISODES = 1
 DEFAULT_SEED_BASE = 1000
 MAX_RETRIES = 3
@@ -277,60 +278,103 @@ def run_episode(
     max_steps: int | None,
     verbose: bool = False,
 ) -> dict[str, Any]:
-    """Run a single episode and return grade + metadata. Prints [START], [STEPn], [END]."""
     reset_resp = env_client.reset(task_id=task_id, seed=seed)
     obs = reset_resp["observations"][0]
-
-    print("[START]", flush=True)
-
     total_reward = 0.0
     total_steps = 0
     start_time = time.time()
     step_resp: dict[str, Any] = {}
     step_limit = EPISODE_STEPS if max_steps is None else min(max_steps, EPISODE_STEPS)
-
     llm_reuse_remaining = 0
     cached_action = agent._default_action()
-
     while not step_resp.get("done", False):
         if total_steps >= step_limit:
             break
-
-        if fast_mode:
-            action = agent._heuristic_action(obs)
-        else:
-            if llm_reuse_remaining <= 0:
-                cached_action = agent.choose_action(obs, task_id)
-                llm_reuse_remaining = max(1, llm_every)
-            action = cached_action
-
-        step_resp = env_client.step(action)
-        if step_resp is None or "observation" not in step_resp:
-            print(f"  [WARN] step {total_steps}: invalid step response", flush=True)
-            break
-
-        if not fast_mode:
-            llm_reuse_remaining -= 1
-
-        obs = step_resp["observation"]
-        total_reward += float(step_resp["reward"])
-        total_steps += 1
-        print(f"[STEP{total_steps}]", flush=True)
-
-        if verbose and total_steps % 16 == 0:
             print(
-                f"  step={total_steps:02d} price=${obs['current_price']:.3f} "
-                f"temp={obs['indoor_temperature']:.1f}°C "
-                f"stress={obs['grid_stress_signal']:.2f} "
-                f"cost=${obs['cumulative_cost']:.2f} "
-                f"reward={step_resp['reward']:.3f}",
-                flush=True,
             )
-
     elapsed = time.time() - start_time
     grade = env_client.grade()
-    print("[END]", flush=True)
-
     return {
         "task_id": task_id,
         "seed": seed,

 ENV_URL = os.getenv("ENV_URL", "http://localhost:7860")
 MODEL_NAME = os.getenv("MODEL_NAME", "meta-llama/llama-3.3-70b-instruct:free")
 API_BASE_URL = os.getenv("API_BASE_URL", "https://openrouter.ai/api/v1")
+# Hackathon spec: HF_TOKEN is mandatory. Accept OPENAI_API_KEY as secondary fallback for dev.
+HF_TOKEN = os.getenv("HF_TOKEN")
+OPENAI_API_KEY = os.getenv("OPENAI_API_KEY", "") or HF_TOKEN or ""
 DEFAULT_EPISODES = 1
 DEFAULT_SEED_BASE = 1000
 MAX_RETRIES = 3

     max_steps: int | None,
     verbose: bool = False,
 ) -> dict[str, Any]:
+    """Run a single episode and emit hackathon-compliant stdout format.
+
+    Emits:
+        [START] task=<name> env=gridmind model=<model>
+        [STEP] step=<n> action=<json> reward=<0.00> done=<true|false> error=<msg|null>
+        ...
+        [END] success=<true|false> steps=<n> rewards=<r1,r2,...>
+    """
     reset_resp = env_client.reset(task_id=task_id, seed=seed)
     obs = reset_resp["observations"][0]
+
+    task_name = f"gridmind-task-{task_id}"
+
+    # Emit [START] with required fields
+    print(f"[START] task={task_name} env=gridmind model={MODEL_NAME}", flush=True)
+
     total_reward = 0.0
     total_steps = 0
     start_time = time.time()
     step_resp: dict[str, Any] = {}
     step_limit = EPISODE_STEPS if max_steps is None else min(max_steps, EPISODE_STEPS)
+
     llm_reuse_remaining = 0
     cached_action = agent._default_action()
+
+    step_rewards: list[float] = []
+    last_error: str | None = None
+
     while not step_resp.get("done", False):
         if total_steps >= step_limit:
             break
+
+        try:
+            if fast_mode:
+                action = agent._heuristic_action(obs)
+            else:
+                if llm_reuse_remaining <= 0:
+                    cached_action = agent.choose_action(obs, task_id)
+                    llm_reuse_remaining = max(1, llm_every)
+                action = cached_action
+
+            step_resp = env_client.step(action)
+            if step_resp is None or "observation" not in step_resp:
+                last_error = "invalid step response"
+                break
+
+            if not fast_mode:
+                llm_reuse_remaining -= 1
+
+            obs = step_resp["observation"]
+            reward = float(step_resp["reward"])
+            total_reward += reward
+            step_rewards.append(reward)
+            total_steps += 1
+            done = bool(step_resp.get("done", False))
+
+            # Emit [STEP] with required fields (action as compact JSON, reward to 2 decimals)
+            action_json = json.dumps(action, separators=(',', ':'))
+            error_field = "null" if last_error is None else f'"{last_error}"'
             print(
+                f"[STEP] step={total_steps} action={action_json} "
+                f"reward={reward:.2f} done={'true' if done else 'false'} error={error_field}",
+                flush=True
             )
+
+            last_error = None  # Clear error after successful step
+
+            if verbose and total_steps % 16 == 0:
+                print(
+                    f"  step={total_steps:02d} price=${obs['current_price']:.3f} "
+                    f"temp={obs['indoor_temperature']:.1f}°C "
+                    f"stress={obs['grid_stress_signal']:.2f} "
+                    f"cost=${obs['cumulative_cost']:.2f}",
+                    flush=True,
+                )
+
+        except Exception as e:
+            last_error = str(e)
+            print(
+                f"[STEP] step={total_steps + 1} action=null "
+                f"reward=0.00 done=true error=\"{last_error}\"",
+                flush=True
+            )
+            break
+
     elapsed = time.time() - start_time
     grade = env_client.grade()
+
+    success = (total_steps > 0 and total_steps >= step_limit) or last_error is None
+    rewards_str = ",".join(f"{r:.2f}" for r in step_rewards)
+
+    # Emit [END] with required fields
+    print(
+        f"[END] success={'true' if success else 'false'} steps={total_steps} rewards={rewards_str}",
+        flush=True
+    )
+
     return {
         "task_id": task_id,
         "seed": seed,