# FastAPI server exposing the Data Centre OpenEnv environment (EnvClient-compatible).
from fastapi.responses import HTMLResponse
from openenv.core.env_server.http_server import create_app

from .environment import DCEnvironment
from .models import DCAction, DCObservation

app = create_app(
    DCEnvironment,
    DCAction,
    DCObservation,
    env_name="datacenter_env",
    max_concurrent_envs=1,
)


@app.get("/", response_class=HTMLResponse)
async def root():
    return """
A shared AI compute cluster has a hard 900 kW power budget. Two research teams compete every scheduling window. Team A is honest — true priority, accurate deadlines, genuine carbon preferences. Team B games the system: inflating priority by 1–2 levels, always claiming urgent deadlines, and hiding carbon flexibility 60% of the time.
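To make Team B's behaviour concrete, the sketch below applies the three distortions described above to a job's true metadata. The `Job` fields, the 1 to 5 priority scale, and the name `team_b_report` are assumptions for illustration only; they are not taken from the environment's actual data model.

```python
import random
from dataclasses import dataclass, replace

MAX_PRIORITY = 5  # assumed priority scale of 1..5 (not stated in the source)


@dataclass(frozen=True)
class Job:
    # Hypothetical job record; field names are illustrative only.
    priority: int
    urgent: bool
    carbon_flexible: bool


def team_b_report(true_job, rng):
    # Return the metadata a gaming team would state for true_job.
    inflated = min(MAX_PRIORITY, true_job.priority + rng.randint(1, 2))
    # Hide carbon flexibility 60% of the time; never invent flexibility.
    hide = rng.random() < 0.6
    return replace(
        true_job,
        priority=inflated,
        urgent=True,  # always claims an urgent deadline
        carbon_flexible=true_job.carbon_flexible and not hide,
    )


rng = random.Random(0)
true_job = Job(priority=2, urgent=False, carbon_flexible=True)
stated = [team_b_report(true_job, rng) for _ in range(1000)]
hidden_rate = sum(not s.carbon_flexible for s in stated) / len(stated)
```

Over many samples `hidden_rate` should sit near 0.6, while every stated job is marked urgent and carries a priority one or two levels above the truth.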
A naive scheduler trusting stated claims over-allocates to Team B, crowds out legitimate work, and misses carbon deferral opportunities. The goal: train an LLM scheduler that learns — from environment reward alone — to detect and discount systematic misrepresentation.
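The crowding-out effect can be illustrated with a small greedy admission loop. This is a sketch under stated assumptions, not the environment's scheduler: the `Request` record, its field names, and the per-job power figures are hypothetical, and only the 900 kW budget comes from the description above.

```python
from dataclasses import dataclass

POWER_BUDGET_KW = 900.0  # hard budget from the description above


@dataclass(frozen=True)
class Request:
    # Hypothetical request record; field names are illustrative only.
    job_id: str
    stated_priority: int
    true_priority: int
    power_kw: float


def greedy_admit(requests, key):
    # Admit jobs in descending order of key while the power budget allows.
    admitted, used = [], 0.0
    for req in sorted(requests, key=key, reverse=True):
        if used + req.power_kw <= POWER_BUDGET_KW:
            admitted.append(req.job_id)
            used += req.power_kw
    return admitted


requests = [
    Request("A1", stated_priority=4, true_priority=4, power_kw=500.0),
    Request("B1", stated_priority=5, true_priority=3, power_kw=500.0),
]
# A scheduler that trusts stated claims versus one that sees true priority.
naive = greedy_admit(requests, key=lambda r: r.stated_priority)
informed = greedy_admit(requests, key=lambda r: r.true_priority)
```

Here the naive ordering admits only Team B's inflated job, while ordering by true priority admits Team A's job instead. The trained scheduler never sees true priority directly, so it has to approximate the second ordering from reward and oversight flags.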
This environment bridges Round 1 (physics-based datacenter cooling, evaluated zero-shot) with the Finale (operational scheduling layer built on the same physics engine, trained end-to-end via GRPO).
The LLM scheduler is Qwen2.5-3B-Instruct, quantized to 4-bit with LoRA adapters (r=16). It acts once per window: it reads stated job metadata, team history, oversight flags, power headroom, and the carbon forecast, then issues ACCEPT, REJECT, or DEFER for each job request.
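Because the scheduler's decisions arrive as model text, they have to be parsed before they can be submitted. The sketch below assumes a reply format of one `<job_id>: <DECISION>` pair per line; that format, the `Decision` enum, and `parse_decisions` are hypothetical and may differ from the environment's real action schema.

```python
from enum import Enum


class Decision(Enum):
    ACCEPT = "ACCEPT"
    REJECT = "REJECT"
    DEFER = "DEFER"


def parse_decisions(lines, job_ids):
    # Map each requested job id to a Decision.
    # Returns None (a parse failure) if any job lacks a valid decision.
    decisions = {}
    for line in lines:
        job_id, sep, verdict = line.partition(":")
        if not sep:
            continue  # skip lines that are not "<job_id>: <DECISION>"
        job_id, verdict = job_id.strip(), verdict.strip().upper()
        if job_id in job_ids and verdict in Decision.__members__:
            decisions[job_id] = Decision[verdict]
    return decisions if set(decisions) == set(job_ids) else None


reply_lines = ["j1: ACCEPT", "j2: defer"]
parsed = parse_decisions(reply_lines, ["j1", "j2"])
```

Treating an incomplete reply as a failure, rather than filling in a default decision, keeps malformed output visible as a metric during training.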
The cooling controller is an SB3 MLP policy, pre-trained across all three cooling scenarios, including mid-episode chiller failure. It runs 18 steps per window, controlling fan speeds (0–100%) and the chiller setpoint (6–15 °C), and is invisible to the LLM scheduler.
Four rule-based detectors run after every window using ground-truth job metadata (hidden from the scheduler): priority inflation (confidence 0.62–0.97), deadline compression, carbon gaming, and pattern escalation (≥3 windows). Their flags are injected into the next observation.
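Two of these detectors can be sketched in a few lines. The tuple layout, the class name `EscalationTracker`, and the choice to count consecutive (rather than cumulative) flagged windows are assumptions; the source only states the threshold of three windows and does not say how confidence scores are computed, so none are produced here.

```python
from collections import defaultdict

ESCALATION_WINDOWS = 3  # threshold from the description above


def inflation_flags(jobs):
    # jobs: iterable of (team, stated_priority, true_priority) tuples (assumed layout).
    # Returns the set of teams that overstated priority in this window.
    return {team for team, stated, true in jobs if stated > true}


class EscalationTracker:
    # Counts consecutive flagged windows per team (consecutive is an assumption).
    def __init__(self):
        self.streak = defaultdict(int)

    def update(self, flagged_teams, all_teams):
        escalated = set()
        for team in all_teams:
            if team in flagged_teams:
                self.streak[team] += 1
            else:
                self.streak[team] = 0  # a clean window resets the streak
            if self.streak[team] >= ESCALATION_WINDOWS:
                escalated.add(team)
        return escalated


tracker = EscalationTracker()
window_jobs = [("A", 3, 3), ("B", 5, 3)]
history = [tracker.update(inflation_flags(window_jobs), {"A", "B"}) for _ in range(3)]
```

With the same inflated request repeated for three windows, `history` stays empty for the first two windows and escalates Team B on the third.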
Thermal mass model per zone: ΔT = (heat_in − heat_out) / thermal_mass. Chiller COP degrades with outside temperature. Optional chiller fault at window 5. Carbon grid schedule varies: low→high→low across the 8-window episode.
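A minimal sketch of one zone update follows, assuming the formula above is applied once per time step. The explicit step length `dt_s`, the units, and the linear COP curve with its default coefficients are illustrative assumptions; the source only says that COP degrades with outside temperature.

```python
def zone_temperature_step(temp_c, heat_in_kw, heat_out_kw,
                          thermal_mass_kj_per_k, dt_s=1.0):
    # ΔT = (heat_in − heat_out) / thermal_mass, scaled by the step length.
    # kW * s = kJ, so dividing by kJ/K yields a temperature change in K.
    delta_t = (heat_in_kw - heat_out_kw) * dt_s / thermal_mass_kj_per_k
    return temp_c + delta_t


def chiller_cop(outside_temp_c, cop_ref=5.0, ref_temp_c=20.0,
                slope_per_c=0.1, cop_min=1.0):
    # Hypothetical linear degradation above a reference outside temperature.
    excess = max(0.0, outside_temp_c - ref_temp_c)
    return max(cop_min, cop_ref - slope_per_c * excess)


# 20 kW of net heating for 10 s on a 500 kJ/K zone raises it by about 0.4 K.
next_temp = zone_temperature_step(24.0, heat_in_kw=120.0, heat_out_kw=100.0,
                                  thermal_mass_kj_per_k=500.0, dt_s=10.0)
hot_day_cop = chiller_cop(35.0)
```

A lower COP on hot days means more electrical power is needed to remove the same heat, which in turn shrinks the power headroom the scheduler sees.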
| Run | Hardware | Iterations | Peak reward | Parse failures |
|---|---|---|---|---|
| Colab notebook | T4 GPU | 30 | +0.1937 | 0% by iter 5 |
| HF Space | L40S GPU | 50 | +0.2406 | 0% from iter 25, final 26 iters |
| Rule-based baseline | — | — | +0.28 (target) | — |
POST /reset   ← start a new episode → returns WindowState observation
POST /step    ← submit admission decisions → returns (WindowState, reward, done, info)
GET  /state   ← current environment state (no side effects)
GET  /health  ← liveness probe
from openenv import EnvClient
from server.agents.baseline_scheduler import priority_weighted_threshold

client = EnvClient("https://mephisto2412-datacenter-env.hf.space")
obs = client.reset(seed=42)

for window in range(8):
    decisions = priority_weighted_threshold(obs)  # or your trained agent
    obs, reward, done, info = client.step(decisions)
    print(f"Window {window} reward={reward:+.4f} flags={len(obs.oversight_flags)}")
    if done:
        break