RL Environment for Datacenter Cooling and Operations

The Problem

A shared AI compute cluster has a hard 900 kW power budget. Two research teams compete every scheduling window. Team A is honest — true priority, accurate deadlines, genuine carbon preferences. Team B games the system: inflating priority by 1–2 levels, always claiming urgent deadlines, and hiding carbon flexibility 60% of the time.

A naive scheduler trusting stated claims over-allocates to Team B, crowds out legitimate work, and misses carbon deferral opportunities. The goal: train an LLM scheduler that learns — from environment reward alone — to detect and discount systematic misrepresentation.
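The reporting behaviour of the two teams can be sketched as a function from a job's true metadata to the claim the scheduler sees. This is a minimal illustration, not the environment's actual generator: the `JobClaim` fields, the 1 to 5 priority scale, and the `stated_claim` helper are assumptions made for the example. Only the inflation range (1–2 levels), the always-urgent deadlines, and the 60% hiding rate come from the description above.

```python
import random
from dataclasses import dataclass, replace

MAX_PRIORITY = 5  # assumed scale: 1 (low) to 5 (critical); not stated in the source


@dataclass(frozen=True)
class JobClaim:
    team: str
    priority: int          # 1..MAX_PRIORITY
    deadline_urgent: bool
    carbon_flexible: bool  # True if the job could be deferred to a low-carbon window


def stated_claim(true_job: JobClaim, rng: random.Random) -> JobClaim:
    """Return what a team reports to the scheduler for a given true job."""
    if true_job.team == "A":
        return true_job  # Team A reports its metadata truthfully

    # Team B: inflate priority by 1-2 levels, always claim an urgent deadline,
    # and hide carbon flexibility 60% of the time.
    inflated = min(MAX_PRIORITY, true_job.priority + rng.randint(1, 2))
    hide_flexibility = rng.random() < 0.60
    return replace(
        true_job,
        priority=inflated,
        deadline_urgent=True,
        carbon_flexible=true_job.carbon_flexible and not hide_flexibility,
    )
```

A scheduler that only sees the output of `stated_claim` cannot tell the teams apart from a single request; the difference shows up only in the statistics across windows, which is what the reward signal has to teach.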

This environment bridges Round 1 (physics-based datacenter cooling, evaluated zero-shot) with the Finale (operational scheduling layer built on the same physics engine, trained end-to-end via GRPO).

Architecture at a Glance

🧠 LLM Scheduler (GRPO)

Qwen2.5-3B-Instruct, 4-bit, LoRA r=16. Acts once per window. Reads stated job metadata, team history, oversight flags, power headroom, and carbon forecast. Issues ACCEPT / REJECT / DEFER per job request.
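Because the scheduler answers in free text, each completion has to be turned into one action per job before it can be submitted, and a completion that cannot be interpreted counts as a parse failure. The sketch below shows one way this might look. It assumes the model is prompted to emit a JSON object mapping job IDs to actions; that output format, the `parse_decisions` helper, and the job ID strings are assumptions for illustration, not the project's actual parser.

```python
import json
from typing import Optional

VALID_ACTIONS = {"ACCEPT", "REJECT", "DEFER"}


def parse_decisions(completion: str, job_ids: list[str]) -> Optional[dict[str, str]]:
    """Extract one action per job from a model completion.

    Returns None (a parse failure) if no JSON object is found, the JSON is
    malformed, or any job is missing a valid action.
    """
    # Tolerate chatter around the JSON by taking the outermost braces.
    start, end = completion.find("{"), completion.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        raw = json.loads(completion[start:end + 1])
    except json.JSONDecodeError:
        return None
    if not isinstance(raw, dict):
        return None

    decisions = {}
    for job_id in job_ids:
        action = str(raw.get(job_id, "")).upper()
        if action not in VALID_ACTIONS:
            return None
        decisions[job_id] = action
    return decisions
```

Treating a missing or invalid action as a failure for the whole window, rather than silently defaulting to REJECT, keeps the parse-failure rate an honest measure of output quality.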

🤖 PPO Cooling Controller

SB3 MLP policy, pre-trained across all three cooling scenarios including mid-episode chiller failure. Runs 18 steps per window, controlling fan speeds (0–100%) and chiller setpoint (6–15 °C). Invisible to the LLM scheduler.
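The control ranges above can be illustrated with a small decoding step. The sketch assumes the policy emits actions normalised to [-1, 1], with one value per fan followed by one value for the chiller; that layout and the `decode_action` helper are assumptions, since the source only gives the physical ranges.

```python
FAN_RANGE = (0.0, 100.0)     # fan speed in percent, from the description above
CHILLER_RANGE = (6.0, 15.0)  # chiller setpoint in degrees Celsius


def _rescale(x: float, low: float, high: float) -> float:
    """Clip x to [-1, 1] and map it linearly onto [low, high]."""
    x = max(-1.0, min(1.0, x))
    return low + (x + 1.0) * 0.5 * (high - low)


def decode_action(raw: list[float]) -> tuple[list[float], float]:
    """Split a normalised action into fan speeds (%) and a chiller setpoint (deg C).

    Assumed layout: all entries but the last are fans, the last is the chiller.
    """
    *fans, chiller = raw
    fan_speeds = [_rescale(f, *FAN_RANGE) for f in fans]
    setpoint = _rescale(chiller, *CHILLER_RANGE)
    return fan_speeds, setpoint
```

For example, `decode_action([-1.0, 0.0, 1.0, 0.0])` gives fan speeds of 0%, 50%, and 100% and a setpoint of 10.5 °C, the midpoint of the allowed range.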

🔍 Oversight Monitor

Four rule-based detectors run after every window, using ground-truth job metadata that is hidden from the scheduler. They cover priority inflation (confidence 0.62–0.97), deadline compression, carbon gaming, and pattern escalation (≥3 windows). Flags are injected into the next observation.
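A rough sketch of how two of these detectors might fit together is shown below. It is an illustration under stated assumptions, not the monitor's real logic: the `Flag` record, the `OversightSketch` class, and the tuple layout of `jobs` are hypothetical, escalation is assumed to count flagged windows in total rather than consecutively, and confidence scoring is left out because the source does not say how it is computed.

```python
from collections import defaultdict
from dataclasses import dataclass

ESCALATION_WINDOWS = 3  # "pattern escalation (>=3 windows)" from the description


@dataclass(frozen=True)
class Flag:
    window: int
    team: str
    kind: str


class OversightSketch:
    """Hypothetical monitor: priority-inflation check plus pattern escalation."""

    def __init__(self) -> None:
        # team -> set of windows in which that team inflated at least one priority
        self._flagged_windows = defaultdict(set)

    def review(self, window: int, jobs: list[tuple[str, int, int]]) -> list[Flag]:
        """jobs holds (team, stated_priority, true_priority) for one window."""
        flags = []
        inflating_teams = sorted({team for team, stated, true in jobs if stated > true})
        for team in inflating_teams:
            self._flagged_windows[team].add(window)
            flags.append(Flag(window, team, "priority_inflation"))
            # Escalate once the same team has been flagged in enough windows.
            if len(self._flagged_windows[team]) >= ESCALATION_WINDOWS:
                flags.append(Flag(window, team, "pattern_escalation"))
        return flags
```

The flags returned for window `w` would then be attached to the observation for window `w + 1`, so the scheduler only ever sees the monitor's conclusions, never the ground truth behind them.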

🏭 Physics Engine

Thermal mass model per zone: ΔT = (heat_in − heat_out) / thermal_mass. Chiller COP degrades with outside temperature. Optional chiller fault at window 5. Carbon grid schedule varies: low→high→low across the 8-window episode.
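The zone update above is a one-line computation, sketched here together with a placeholder for the chiller efficiency. The temperature step follows the stated formula directly, with the step length assumed to be folded into `thermal_mass`. The COP function is only a stand-in: the source says COP degrades with outside temperature but gives no curve, so the linear form and every default parameter in `chiller_cop` are assumptions.

```python
def zone_temperature_step(temp_c: float, heat_in: float, heat_out: float,
                          thermal_mass: float) -> float:
    """Apply delta_T = (heat_in - heat_out) / thermal_mass to one zone."""
    return temp_c + (heat_in - heat_out) / thermal_mass


def chiller_cop(outside_temp_c: float, cop_ref: float = 5.0,
                ref_temp_c: float = 20.0, slope_per_c: float = 0.1,
                cop_min: float = 1.5) -> float:
    """Hypothetical linear COP degradation above a reference outside temperature.

    All default values are illustrative, not taken from the environment.
    """
    excess = max(0.0, outside_temp_c - ref_temp_c)
    return max(cop_min, cop_ref - slope_per_c * excess)


def chiller_power(heat_removed: float, outside_temp_c: float) -> float:
    """Electrical power needed to remove a given heat load (same units as the load)."""
    return heat_removed / chiller_cop(outside_temp_c)
```

Under this sketch, a hotter day lowers the COP, so removing the same heat load draws more electrical power and leaves less headroom under the 900 kW budget.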

Training Results

Run	Hardware	Iterations	Peak Reward	Parse Fails
Colab notebook	T4 GPU	30	+0.1937	0% by iter 5
HF Space	L40S GPU	50	+0.2406	0% from iter 25, final 26 iters
Rule-based baseline	—	—	+0.28 (target)	—

OpenEnv HTTP API

POST /reset   ← start a new episode → returns WindowState observation
POST /step    ← submit admission decisions → returns (WindowState, reward, done, info)
GET  /state   ← current environment state (no side effects)
GET  /health  ← liveness probe
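For callers who prefer raw HTTP over a client library, the four endpoints can be reached with the Python standard library alone. The sketch below is hedged accordingly: the endpoint paths and methods come from the listing above, but the JSON request bodies (for example `{"seed": 42}` for a reset) and the assumption that responses are JSON are guesses, and `build_request` and `call` are helper names invented for this example.

```python
import json
import urllib.request

BASE_URL = "https://mephisto2412-datacenter-env.hf.space"


def build_request(method: str, path: str, payload=None) -> urllib.request.Request:
    """Build a request for one endpoint; payload (if any) is sent as JSON."""
    data = None if payload is None else json.dumps(payload).encode("utf-8")
    headers = {"Content-Type": "application/json"} if data is not None else {}
    return urllib.request.Request(BASE_URL + path, data=data,
                                  headers=headers, method=method)


def call(method: str, path: str, payload=None, timeout: float = 30.0):
    """Send the request and decode the response body, assumed to be JSON."""
    request = build_request(method, path, payload)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Example calls (they need network access, so they are left commented out):
# call("GET", "/health")
# call("POST", "/reset", {"seed": 42})  # body shape is an assumption
# call("GET", "/state")
```

Keeping request construction separate from sending makes it easy to inspect exactly what would go over the wire before pointing the code at a live Space.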

Quick Start

from openenv import EnvClient
from server.agents.baseline_scheduler import priority_weighted_threshold

client = EnvClient("https://mephisto2412-datacenter-env.hf.space")
obs = client.reset(seed=42)

for window in range(8):
    decisions = priority_weighted_threshold(obs)  # or your trained agent
    obs, reward, done, info = client.step(decisions)
    print(f"Window {window} reward={reward:+.4f} flags={len(obs.oversight_flags)}")
    if done:
        break
