title: DC-Ops Environment Server
emoji: ⚡
colorFrom: blue
colorTo: green
sdk: docker
pinned: false
app_port: 8000
base_path: /web
tags:
- openenv
- reinforcement-learning
- datacenter
- simulation
thumbnail: >-
https://cdn-uploads.huggingface.co/production/uploads/6578359f277dceb6056f1646/WSrOC1cq1OHFyKV469PIQ.png
DC-Ops Environment
A physics-based datacenter operations environment for training LLM agents, built on Meta's OpenEnv framework.
The agent reads a text-based NOC dashboard and issues natural-language operator commands — exactly as a human datacenter operator would.
Quick Start
Prerequisites
- Python 3.10+
- uv (recommended) or pip
- Docker (for containerized deployment)
Install & Run Locally
# Clone the repository
git clone <repo-url>
cd dc_ops_env
# Install dependencies
uv sync
# Run the test suite (256 tests, <10s)
uv run pytest tests/ -v
# Start the server
uv run server
The server starts at http://localhost:8000 with:
- Web UI → http://localhost:8000/web
- API docs → http://localhost:8000/docs
- Health check → http://localhost:8000/health
- WebSocket → ws://localhost:8000/ws
Run with Docker
# Build the image
docker build -t dc-ops:latest -f server/Dockerfile .
# Run the container
docker run -d -p 8000:8000 dc-ops:latest
# Verify it's running
curl http://localhost:8000/health
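A freshly started container may need a moment before it answers, so scripts often poll the health endpoint instead of issuing a single request. The following is a minimal standard-library sketch. The names `http_ok` and `wait_for_health` are illustrative and not part of DC-Ops, and the only assumption about the server is that `/health` returns HTTP 200 once it is ready.

```python
import time
import urllib.error
import urllib.request


def http_ok(url: str, timeout_s: float = 2.0) -> bool:
    """Return True if a GET to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status: not ready yet.
        return False


def wait_for_health(url, attempts=30, delay_s=1.0, probe=http_ok, sleep=time.sleep):
    """Poll `url` until `probe` succeeds; return the number of attempts used.

    Raises TimeoutError if the server never becomes healthy.
    `probe` and `sleep` are injectable so the loop can be tested offline.
    """
    for attempt in range(1, attempts + 1):
        if probe(url):
            return attempt
        sleep(delay_s)
    raise TimeoutError(f"{url} not healthy after {attempts} attempts")


# Typical call once the container is starting:
# wait_for_health("http://localhost:8000/health")
```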
OpenEnv Integration
DC-Ops is a fully compliant OpenEnv environment. OpenEnv provides:
- MCP tool-based interactions for LLM agents (WebSocket `/ws`)
- HTTP orchestration layer for training pipelines (`/reset`, `/step`, `/state`)
- HuggingFace Spaces deployment via `openenv push`
- TRL/GRPO integration for RL training with `GRPOTrainer`
Action & Observation Models
DcOpsAction — the agent's command:
class DcOpsAction(Action):
    command: str    # e.g., "diagnose CRAC-3", "adjust_setpoint CRAC-1 20"
    reasoning: str  # Optional chain-of-thought
DcOpsObservation — what the agent sees:
class DcOpsObservation(Observation):
    dashboard: str           # Text-rendered monitoring dashboard
    available_actions: list  # Valid commands the agent can issue
    alert: str               # Current active alert message
    scenario_type: str       # "thermal", "power", etc.
    steps_remaining: int     # Steps left in episode budget
    action_result: str       # Feedback from last action
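One way an LLM agent might consume these fields is to flatten the observation into a single text prompt. The sketch below uses a stand-in dataclass with the same field names so that it runs without the package installed. `ObservationStub` and `build_prompt` are hypothetical helpers, not part of DC-Ops, and the prompt wording is only an example.

```python
from dataclasses import dataclass, field


@dataclass
class ObservationStub:
    """Stand-in mirroring the DcOpsObservation fields listed above."""
    dashboard: str
    available_actions: list = field(default_factory=list)
    alert: str = ""
    scenario_type: str = ""
    steps_remaining: int = 0
    action_result: str = ""


def build_prompt(obs) -> str:
    """Render an observation as a plain-text prompt for an LLM agent."""
    parts = [obs.dashboard]
    if obs.alert:
        parts.append(f"ALERT: {obs.alert}")
    if obs.action_result:
        parts.append(f"Last action result: {obs.action_result}")
    parts.append(f"Steps remaining: {obs.steps_remaining}")
    parts.append("Available actions: " + ", ".join(obs.available_actions))
    parts.append("Reply with exactly one command.")
    return "\n".join(parts)
```

Because `build_prompt` only reads attributes, it should work equally well with a real `DcOpsObservation` instance.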
Available Commands
| Command | Format | Description |
|---|---|---|
| `diagnose` | `diagnose <unit_id>` | Inspect a CRAC/UPS/PDU for faults |
| `adjust_setpoint` | `adjust_setpoint <crac_id> <temp_c>` | Change CRAC supply air setpoint |
| `set_fan_speed` | `set_fan_speed <crac_id> <pct>` | Set CRAC fan speed (0-100%) |
| `set_rack_load` | `set_rack_load <rack_id> <kw>` | Adjust rack IT load (migrate workload) |
| `start_crac` | `start_crac <crac_id>` | Start a standby CRAC unit |
| `stop_crac` | `stop_crac <crac_id>` | Put a CRAC into standby |
| `start_generator` | `start_generator` | Manually start the diesel generator |
| `stop_generator` | `stop_generator` | Initiate generator cooldown |
| `set_ups_mode` | `set_ups_mode <ups_id> <mode>` | Set UPS mode (eco/double_conversion/bypass) |
| `refuel_generator` | `refuel_generator [liters]` | Refuel (default: full tank) |
| `acknowledge_alarm` | `acknowledge_alarm` | Acknowledge current alert |
| `check_status` | `check_status` | Request full status report |
| `escalate` | `escalate` | Escalate to senior engineer |
| `wait` | `wait` | Take no action this step |
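Because every command follows a fixed `name [args]` shape, a client can pre-check a model's output before sending it to the server. The following is a rough client-side sketch based only on the table above. It is not the server's own parser, `COMMAND_ARITY` and `check_command` are made-up names, and it checks argument counts only, not whether unit IDs or values are valid.

```python
# (min_args, max_args) per command, transcribed from the table above.
COMMAND_ARITY = {
    "diagnose": (1, 1),
    "adjust_setpoint": (2, 2),
    "set_fan_speed": (2, 2),
    "set_rack_load": (2, 2),
    "start_crac": (1, 1),
    "stop_crac": (1, 1),
    "start_generator": (0, 0),
    "stop_generator": (0, 0),
    "set_ups_mode": (2, 2),
    "refuel_generator": (0, 1),  # liters is optional
    "acknowledge_alarm": (0, 0),
    "check_status": (0, 0),
    "escalate": (0, 0),
    "wait": (0, 0),
}


def check_command(text: str):
    """Return (ok, message) for a raw command string."""
    tokens = text.split()
    if not tokens:
        return False, "empty command"
    name, args = tokens[0], tokens[1:]
    if name not in COMMAND_ARITY:
        return False, f"unknown command: {name}"
    lo, hi = COMMAND_ARITY[name]
    if not lo <= len(args) <= hi:
        return False, f"{name} expects {lo}-{hi} argument(s), got {len(args)}"
    return True, "ok"
```

For example, `check_command("diagnose CRAC-3")` passes, while `check_command("adjust_setpoint CRAC-1")` is rejected for a missing temperature argument.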
Using the Client
Programmatic Usage (Python)
import asyncio

from dc_ops_env import DcOpsAction, DcOpsEnv


async def main():
    # Connect to a running server
    async with DcOpsEnv(base_url="http://localhost:8000") as env:
        # Reset with a specific scenario
        result = await env.reset(scenario="A2")
        print(result.observation.dashboard)
        # Agent loop (a real agent would choose each command from the observation)
        while not result.done:
            result = await env.step(
                DcOpsAction(
                    command="diagnose CRAC-3",
                    reasoning="CRAC-3 shows compressor failure, need to investigate",
                )
            )
            print(f"Reward: {result.reward}")
            print(result.observation.dashboard)


# `async with` and `await` must run inside a coroutine
asyncio.run(main())
From Docker Image
from dc_ops_env import DcOpsAction, DcOpsEnv
# Start environment from Docker (auto-manages container lifecycle)
env = DcOpsEnv.from_docker_image("dc-ops:latest")
try:
    result = env.reset(scenario="A2")
    for _ in range(15):
        result = env.step(DcOpsAction(command="check_status"))
        if result.done:
            break
finally:
    env.close()
Concurrent Sessions
The server supports multiple concurrent WebSocket sessions for parallel training:
# In server/app.py — adjust max_concurrent_envs
app = create_app(
    DcOpsEnvironment,
    DcOpsAction,
    DcOpsObservation,
    max_concurrent_envs=16,  # Scale up for parallel RL
)

Clients can then run episodes in parallel, one session per thread:

from concurrent.futures import ThreadPoolExecutor
from dc_ops_env import DcOpsAction, DcOpsEnv

def run_episode(scenario_id: str):
    with DcOpsEnv(base_url="http://localhost:8000") as env:
        result = env.reset(scenario=scenario_id)
        total_reward = 0.0
        while not result.done:
            result = env.step(DcOpsAction(command="check_status"))
            total_reward += result.reward or 0.0  # guard in case reward is None
        return scenario_id, total_reward

# Run 8 episodes concurrently
scenarios = ["A1", "A2", "A4", "B1", "B3", "B4", "A2", "B4"]
with ThreadPoolExecutor(max_workers=8) as executor:
    results = list(executor.map(run_episode, scenarios))
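Since the scenario list above repeats some IDs, the `(scenario_id, total_reward)` pairs returned by `run_episode` lend themselves to a per-scenario summary. A small sketch of that aggregation follows. `mean_reward_by_scenario` is an illustrative helper, and the rewards in the usage comment are invented.

```python
from collections import defaultdict
from statistics import mean


def mean_reward_by_scenario(results):
    """Average total reward per scenario from (scenario_id, reward) pairs."""
    grouped = defaultdict(list)
    for scenario_id, total_reward in results:
        grouped[scenario_id].append(total_reward)
    # Sort by scenario ID so the summary has a stable order.
    return {sid: mean(rewards) for sid, rewards in sorted(grouped.items())}


# Example with made-up rewards:
# mean_reward_by_scenario([("A2", 1.0), ("B4", -2.0), ("A2", 3.0)])
# gives {"A2": 2.0, "B4": -2.0}
```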
Scenarios
Six operational scenarios across three difficulty levels:
| ID | Scenario | Difficulty | Type | Fault |
|---|---|---|---|---|
| A1 | Cooling Setpoint Optimization | Easy | Thermal | CRACs at 15°C (wasteful) |
| A2 | Thermal Event Response | Medium | Thermal | CRAC-3 compressor failure |
| A4 | CRAC Failure Cascade | Hard | Thermal | CRAC-1 compressor + CRAC-3 fan |
| B1 | UPS Alarm Response | Medium | Power | UPS transferred to battery |
| B3 | Generator Test Protocol | Easy | Power | None (routine test) |
| B4 | Power Failure Cascade | Hard | Power | Utility loss + extended gen warmup |
Reset with a specific scenario:
result = env.reset(scenario="A2") # By ID
result = env.reset(random_scenario=True) # Random
result = env.reset(random_scenario=True, difficulty="hard") # Random hard
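For curriculum-style training it can be convenient to choose scenario IDs on the client side and pass them to `reset(scenario=...)`. The sketch below hard-codes the scenario table above. `SCENARIOS` and `pick_scenario` are hypothetical names and are independent of the server's own `random_scenario` option.

```python
import random

# (difficulty, type) per scenario ID, copied from the table above.
SCENARIOS = {
    "A1": ("easy", "thermal"),
    "A2": ("medium", "thermal"),
    "A4": ("hard", "thermal"),
    "B1": ("medium", "power"),
    "B3": ("easy", "power"),
    "B4": ("hard", "power"),
}


def pick_scenario(difficulty=None, scenario_type=None, rng=random):
    """Pick a random scenario ID, optionally filtered by difficulty and type."""
    candidates = [
        sid
        for sid, (diff, kind) in SCENARIOS.items()
        if (difficulty is None or diff == difficulty)
        and (scenario_type is None or kind == scenario_type)
    ]
    if not candidates:
        raise ValueError("no scenario matches the given filters")
    return rng.choice(candidates)
```

Passing a seeded `random.Random` instance as `rng` makes the sampling reproducible across training runs.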
Configuration
Built-in Facility Configs
Three YAML configurations are included:
| Config | Zones | Racks | IT Load | CRACs | Use Case |
|---|---|---|---|---|---|
| `default` | 2 | 20 | 160 kW | 4 × 70 kW | Standard facility |
| `small` | 1 | 10 | 80 kW | 2 × 70 kW | Edge / branch office |
| `large` | 4 | 60 | 600 kW | 8 × 100 kW | Multi-zone + GPU (H1) |
from dc_ops_env.config import load_datacenter_config
# Load a built-in config
config = load_datacenter_config("small")
# Load a custom YAML file
config = load_datacenter_config("/path/to/my_datacenter.yaml")
# Use with environment
result = env.reset(scenario="A2", config=config)
Custom YAML Configuration
Create your own datacenter layout:
name: "My Custom Facility"
outside_temp_c: 35.0
outside_humidity_rh: 0.40
simulation_dt_s: 1.0
zones:
  - zone_id: zone_a
    containment_type: cold_aisle
    recirculation_factor: 0.08
    air_volume_m3: 500.0
    envelope_r_kw: 0.02
    initial_cold_aisle_temp_c: 20.0
    ashrae_class: A2
    racks:
      - { rack_id: A-01, row: A, position: 1, it_load_kw: 8.0,
          num_servers_2u: 20, server_thermal_mass_jk: 11100.0,
          airflow_cfm_per_kw: 160.0 }
      # ... more racks
    crac_units:
      - { unit_id: CRAC-1, rated_capacity_kw: 70.0,
          rated_return_temp_c: 24.0, capacity_slope_per_c: 0.03,
          max_airflow_cfm: 12000.0, fan_rated_power_kw: 5.0,
          cop_rated: 3.5, initial_setpoint_c: 18.0,
          initial_fan_speed_pct: 100.0, supply_temp_lag_s: 30.0 }
power:
  utility_voltage_v: 480.0
  utility_available: true
  ups_units:
    - { unit_id: UPS-1, rated_capacity_kw: 500.0,
        loss_c0: 0.013, loss_c1: 0.006, loss_c2: 0.011,
        battery_capacity_kwh: 8.3, battery_discharge_efficiency: 0.90,
        battery_aging_factor: 0.85, recharge_rate_kw: 5.0,
        initial_mode: double_conversion }
  pdus:
    - { pdu_id: PDU-A-01, voltage_ll_v: 208.0,
        max_current_per_phase_a: 24.0, num_phases: 3,
        efficiency: 0.98, continuous_derating: 0.80 }
  generator:
    gen_id: GEN-1
    rated_capacity_kw: 750.0
    start_delay_s: 4.0
    crank_time_s: 5.0
    warmup_time_s: 8.0
    fuel_tank_liters: 2000.0
    consumption_lph_full: 180.0
    cooldown_time_s: 300.0
  ats:
    ats_id: ATS-1
    transfer_time_ms: 100.0
    retransfer_delay_s: 300.0
See data/datacenter_configs/ for complete examples.
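The example values above imply some back-of-the-envelope limits that are handy when sizing a custom facility. The calculation below is a sketch under simple assumptions that may not match the simulator's internal models: usable battery energy is capacity times discharge efficiency times aging factor, fuel burn is constant at the full-load rate, and PDU capacity uses the standard three-phase formula at unity power factor with the continuous derating applied. The function names are illustrative, and the 160 kW load is an assumed figure borrowed from the default facility's IT load.

```python
import math


def ups_runtime_min(capacity_kwh, discharge_eff, aging_factor, load_kw):
    """Minutes of battery runtime at a constant load (ignores UPS losses)."""
    usable_kwh = capacity_kwh * discharge_eff * aging_factor
    return usable_kwh / load_kw * 60.0


def generator_runtime_h(tank_liters, consumption_lph_full):
    """Hours of runtime on a full tank at the full-load consumption rate."""
    return tank_liters / consumption_lph_full


def pdu_capacity_kw(voltage_ll_v, max_current_a, derating):
    """Derated three-phase capacity: sqrt(3) * V_LL * I * derating, in kW."""
    return math.sqrt(3) * voltage_ll_v * max_current_a * derating / 1000.0


# Inputs taken from the YAML example; the 160 kW load is an assumption.
print(round(ups_runtime_min(8.3, 0.90, 0.85, 160.0), 2))
print(round(generator_runtime_h(2000.0, 180.0), 2))
print(round(pdu_capacity_kw(208.0, 24.0, 0.80), 2))
```

With these inputs the sketch gives roughly 2.4 minutes of battery runtime, about 11 hours of generator fuel, and just under 7 kW per PDU.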
TRL / GRPO Training Integration
DC-Ops integrates directly with HuggingFace TRL's GRPOTrainer via the OpenEnv environment_factory pattern:
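from trl import GRPOConfig, GRPOTrainer
from dc_ops_env import DcOpsAction, DcOpsEnv

def dc_ops_environment_factory():
    """Factory that returns a DC-Ops environment instance."""
    return DcOpsEnv(base_url="http://localhost:8000")

config = GRPOConfig(
    # ... training hyperparameters
)
trainer = GRPOTrainer(
    model="your-base-model",  # the model is passed to the trainer, not to GRPOConfig
    args=config,              # GRPOTrainer takes its config as `args`
    # Keyword follows the environment_factory pattern named above;
    # check that your installed TRL version supports it.
    environment_factory=dc_ops_environment_factory,
    # ... other args (reward_funcs, train_dataset, ...)
)
trainer.train()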
For multi-environment parallel training, run multiple servers or increase max_concurrent_envs and spawn concurrent clients.
Deploy to HuggingFace Spaces
Using OpenEnv CLI
The simplest way to deploy:
# From the dc_ops_env/ directory (where openenv.yaml is located)
cd dc_ops_env
# Login to HuggingFace (if not already)
huggingface-cli login
# Push to HuggingFace Spaces
openenv push
# Or with options
openenv push --repo-id your-username/dc-ops-env --private
openenv push --namespace your-org
What Gets Deployed
The openenv push command:
- Validates the `openenv.yaml` manifest
- Builds a Docker Space on HuggingFace
- Uploads all environment code
Your deployed Space will be available at:
https://huggingface.co/spaces/<repo-id>
The Space includes:
- Web Interface at `/web`: Interactive scenario browser and dashboard viewer
- API Documentation at `/docs`: Full OpenAPI/Swagger interface
- Health Check at `/health`: Container health monitoring
- WebSocket at `/ws`: Persistent session endpoint for agent connections
Connecting to a Deployed Space
import asyncio
from dc_ops_env import DcOpsAction, DcOpsEnv

# Connect to your HuggingFace Space
space_url = "https://your-username-dc-ops-env.hf.space"

async def main():
    async with DcOpsEnv(base_url=space_url) as env:
        result = await env.reset(scenario="A2")
        print(result.observation.dashboard)

asyncio.run(main())
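The `*.hf.space` hostname is normally derived from the Space's repo ID. The helper below sketches that mapping under the assumption that the subdomain is the lowercased `owner-name` with every character other than letters, digits, and hyphens replaced by a hyphen. `space_url_from_repo_id` is not part of DC-Ops or `huggingface_hub`, so confirm the actual URL on the Space's page.

```python
import re


def space_url_from_repo_id(repo_id: str) -> str:
    """Map 'owner/space-name' to its assumed https://<owner>-<name>.hf.space URL."""
    owner, sep, name = repo_id.partition("/")
    if not (owner and sep and name) or "/" in name:
        raise ValueError(f"expected 'owner/name', got {repo_id!r}")
    # Assumed normalization: lowercase, then non [a-z0-9-] characters become '-'.
    subdomain = re.sub(r"[^a-z0-9-]", "-", f"{owner}-{name}".lower())
    return f"https://{subdomain}.hf.space"
```

Under that assumption, `space_url_from_repo_id("your-username/dc-ops-env")` yields the `space_url` used in the snippet above.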
CLI Options
| Option | Description |
|---|---|
| `--directory`, `-d` | Directory containing the OpenEnv environment (default: current) |
| `--repo-id`, `-r` | Repository ID `username/repo-name` (default: from `openenv.yaml`) |
| `--base-image`, `-b` | Override base Docker image |
| `--private` | Deploy as a private Space |
| `--namespace` | HuggingFace namespace (user or org) |
Development
Running Tests
# All tests (256 tests)
uv run pytest tests/ -v
# Specific test modules
uv run pytest tests/test_thermal.py -v # Thermal physics
uv run pytest tests/test_power.py -v # Power systems
uv run pytest tests/test_actions.py -v # Command parser
uv run pytest tests/test_rewards.py -v # Reward function
uv run pytest tests/test_scenarios.py -v # Scenario framework
uv run pytest tests/test_integration.py -v # End-to-end episodes
# With coverage
uv run pytest tests/ --cov=dc_ops_env --cov-report=term-missing
Direct Environment Testing (No Server)
Test the environment logic without the HTTP/WebSocket layer:
from dc_ops_env.server.dc_ops_env_environment import DcOpsEnvironment
from dc_ops_env.models import DcOpsAction
env = DcOpsEnvironment()
obs = env.reset(scenario="A2")
print(obs.dashboard)
obs = env.step(DcOpsAction(command="diagnose CRAC-3"))
print(f"Reward: {obs.reward}")
print(obs.dashboard)
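For regression-style checks, the same direct interface can replay a fixed list of commands and collect the per-step rewards. The helper below is a sketch: `run_script` is a hypothetical name, it relies only on the `reset()`/`step()` calls and the `reward` attribute shown above, and it assumes the observation may also expose a `done` flag (treated as `False` when absent).

```python
def run_script(env, commands, make_action, **reset_kwargs):
    """Replay `commands` against an env; return the list of per-step rewards.

    `make_action` turns a command string into whatever `env.step` expects,
    e.g. `lambda c: DcOpsAction(command=c)`.
    """
    obs = env.reset(**reset_kwargs)
    rewards = []
    for command in commands:
        # Stop early if the episode has already ended (assumed `done` flag).
        if getattr(obs, "done", False):
            break
        obs = env.step(make_action(command))
        rewards.append(obs.reward)
    return rewards


# Possible usage with the objects from the snippet above:
# run_script(DcOpsEnvironment(), ["diagnose CRAC-3", "wait"],
#            lambda c: DcOpsAction(command=c), scenario="A2")
```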
Running the Server Locally
# Via entry point (recommended)
uv run server
# With custom port
uv run server --port 8001
# Via uvicorn directly (with auto-reload for development)
uvicorn server.app:app --reload --host 0.0.0.0 --port 8000
# Production (multi-worker)
uvicorn server.app:app --host 0.0.0.0 --port 8000 --workers 4
Project Structure
dc_ops_env/
├── openenv.yaml # OpenEnv manifest
├── pyproject.toml # Dependencies and metadata
├── README.md # This file (HF Space README)
├── __init__.py # Exports: DcOpsEnv, DcOpsAction, DcOpsObservation
├── config.py # Physical constants, ASHRAE limits, YAML loader
├── models.py # Pydantic Action/Observation models
├── client.py # DcOpsEnv (EnvClient subclass)
├── simulation/
│ ├── thermal.py # RC thermal network (zones, racks, CRACs)
│ ├── power.py # UPS, PDU, generator, ATS models
│ └── types.py # Runtime state dataclasses
├── scenarios/
│ ├── base.py # Abstract Scenario + ProcedureRule
│ ├── registry.py # Scenario registration and selection
│ ├── thermal_scenarios.py # A1, A2, A4
│ └── power_scenarios.py # B1, B3, B4
├── rewards/
│ └── reward_function.py # 6-component composite reward
├── rendering/
│ └── dashboard.py # State → text dashboard
├── actions/
│ └── parser.py # Deterministic command parser
├── server/
│ ├── dc_ops_env_environment.py # OpenEnv Environment implementation
│ ├── app.py # FastAPI application
│ └── Dockerfile # Container image
├── data/
│ └── datacenter_configs/ # YAML facility definitions
│ ├── default.yaml # 2 zones, 20 racks, 160 kW
│ ├── small_facility.yaml # 1 zone, 10 racks, 80 kW
│ └── large_facility.yaml # 4 zones, 60 racks, 600 kW
└── tests/ # 256 tests across 6 modules
├── test_thermal.py
├── test_power.py
├── test_actions.py
├── test_rewards.py
├── test_scenarios.py
└── test_integration.py
License
BSD-style license. See LICENSE for details.