title: DC-Ops Environment Server
emoji: 
colorFrom: blue
colorTo: green
sdk: docker
pinned: false
app_port: 8000
base_path: /web
tags:
  - openenv
  - reinforcement-learning
  - datacenter
  - simulation
thumbnail: >-
  https://cdn-uploads.huggingface.co/production/uploads/6578359f277dceb6056f1646/WSrOC1cq1OHFyKV469PIQ.png

DC-Ops Environment

A physics-based datacenter operations environment for training LLM agents, built on Meta's OpenEnv framework.

The agent reads a text-based NOC dashboard and issues natural-language operator commands — exactly as a human datacenter operator would.

Quick Start

Prerequisites

  • Python 3.10+
  • uv (recommended) or pip
  • Docker (for containerized deployment)

Install & Run Locally

# Clone the repository
git clone <repo-url>
cd dc_ops_env

# Install dependencies
uv sync

# Run the test suite (256 tests, <10s)
uv run pytest tests/ -v

# Start the server
uv run server

The server starts at http://localhost:8000 with:

  • Web UI: http://localhost:8000/web
  • API docs: http://localhost:8000/docs
  • Health check: http://localhost:8000/health
  • WebSocket: ws://localhost:8000/ws

Run with Docker

# Build the image
docker build -t dc-ops:latest -f server/Dockerfile .

# Run the container
docker run -d -p 8000:8000 dc-ops:latest

# Verify it's running
curl http://localhost:8000/health
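
A freshly started container may take a moment before it answers on /health, so scripts that launch the server and connect right away can benefit from a short polling loop. The sketch below is one way to do that with the standard library only. The helper names `http_status` and `wait_for_health` are hypothetical and not part of dc_ops_env, and the assumption that a healthy server answers with HTTP 200 is not stated in this README.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_status(url: str, timeout_s: float = 2.0) -> int:
    """Return the HTTP status code of a GET request, or 0 if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status (e.g. 503 while starting).
        return exc.code
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, and similar.
        return 0


def wait_for_health(
    url: str = "http://localhost:8000/health",
    attempts: int = 30,
    delay_s: float = 1.0,
    fetch: Callable[[str], int] = http_status,
) -> bool:
    """Poll `url` until it answers 200 (assumed healthy) or attempts run out."""
    for attempt in range(attempts):
        if fetch(url) == 200:
            return True
        if attempt < attempts - 1:
            time.sleep(delay_s)
    return False


if __name__ == "__main__":
    print("server ready" if wait_for_health() else "server did not come up")
```

The `fetch` parameter exists so the loop can be exercised without a running server, for example by passing a stub that returns a fixed status code.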

OpenEnv Integration

DC-Ops is a fully compliant OpenEnv environment. OpenEnv provides:

  • MCP tool-based interactions for LLM agents (WebSocket /ws)
  • HTTP orchestration layer for training pipelines (/reset, /step, /state)
  • HuggingFace Spaces deployment via openenv push
  • TRL/GRPO integration for RL training with GRPOTrainer

Action & Observation Models

DcOpsAction — the agent's command:

class DcOpsAction(Action):
    command: str    # e.g., "diagnose CRAC-3", "adjust_setpoint CRAC-1 20"
    reasoning: str  # Optional chain-of-thought

DcOpsObservation — what the agent sees:

class DcOpsObservation(Observation):
    dashboard: str           # Text-rendered monitoring dashboard
    available_actions: list  # Valid commands the agent can issue
    alert: str               # Current active alert message
    scenario_type: str       # "thermal", "power", etc.
    steps_remaining: int     # Steps left in episode budget
    action_result: str       # Feedback from last action
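
Since the agent is an LLM, the observation fields usually have to be flattened into a single prompt string. The following is a minimal sketch of one possible layout. `ObservationView` is a hypothetical stand-in that only mirrors the field names listed above (it is not the real `DcOpsObservation`), and `build_prompt` is an illustrative helper, not something the package is known to provide.

```python
from dataclasses import dataclass, field


@dataclass
class ObservationView:
    """Stand-in with the same field names as DcOpsObservation (not the real class)."""
    dashboard: str
    available_actions: list = field(default_factory=list)
    alert: str = ""
    scenario_type: str = ""
    steps_remaining: int = 0
    action_result: str = ""


def build_prompt(obs: ObservationView) -> str:
    """Render an observation as a plain-text prompt for an LLM operator."""
    lines = [
        f"Scenario type: {obs.scenario_type}",
        f"Steps remaining: {obs.steps_remaining}",
    ]
    # Only mention the alert and the last result when they are non-empty.
    if obs.alert:
        lines.append(f"ACTIVE ALERT: {obs.alert}")
    if obs.action_result:
        lines.append(f"Last action result: {obs.action_result}")
    lines += ["", obs.dashboard, "", "Available commands:"]
    lines += [f"  - {action}" for action in obs.available_actions]
    lines.append("Reply with exactly one command.")
    return "\n".join(lines)


if __name__ == "__main__":
    demo = ObservationView(
        dashboard="CRAC-3: FAULT",
        available_actions=["diagnose <unit_id>", "wait"],
        alert="High inlet temperature",
        scenario_type="thermal",
        steps_remaining=12,
    )
    print(build_prompt(demo))
```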

Available Commands

| Command | Format | Description |
|---|---|---|
| diagnose | `diagnose <unit_id>` | Inspect a CRAC/UPS/PDU for faults |
| adjust_setpoint | `adjust_setpoint <crac_id> <temp_c>` | Change CRAC supply air setpoint |
| set_fan_speed | `set_fan_speed <crac_id> <pct>` | Set CRAC fan speed (0-100%) |
| set_rack_load | `set_rack_load <rack_id> <kw>` | Adjust rack IT load (migrate workload) |
| start_crac | `start_crac <crac_id>` | Start a standby CRAC unit |
| stop_crac | `stop_crac <crac_id>` | Put a CRAC into standby |
| start_generator | `start_generator` | Manually start the diesel generator |
| stop_generator | `stop_generator` | Initiate generator cooldown |
| set_ups_mode | `set_ups_mode <ups_id> <mode>` | Set UPS mode (eco/double_conversion/bypass) |
| refuel_generator | `refuel_generator [liters]` | Refuel (default: full tank) |
| acknowledge_alarm | `acknowledge_alarm` | Acknowledge current alert |
| check_status | `check_status` | Request full status report |
| escalate | `escalate` | Escalate to senior engineer |
| wait | `wait` | Take no action this step |
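
When an LLM produces the command text, it can help to reject obviously malformed commands before spending an environment step on them. The sketch below encodes only what the table above states: the command names, their argument counts, the 0-100% fan range, and the three UPS modes. `check_command` is a hypothetical client-side helper. It is not the project's own parser in actions/parser.py, whose exact behavior is not described here.

```python
# Argument counts taken from the command table above: (min_args, max_args).
COMMAND_ARITY = {
    "diagnose": (1, 1),
    "adjust_setpoint": (2, 2),
    "set_fan_speed": (2, 2),
    "set_rack_load": (2, 2),
    "start_crac": (1, 1),
    "stop_crac": (1, 1),
    "start_generator": (0, 0),
    "stop_generator": (0, 0),
    "set_ups_mode": (2, 2),
    "refuel_generator": (0, 1),  # liters argument is optional
    "acknowledge_alarm": (0, 0),
    "check_status": (0, 0),
    "escalate": (0, 0),
    "wait": (0, 0),
}
UPS_MODES = {"eco", "double_conversion", "bypass"}


def check_command(text: str) -> tuple[bool, str]:
    """Return (ok, message) for a command string; a client-side pre-check only."""
    parts = text.split()
    if not parts:
        return False, "empty command"
    name, args = parts[0], parts[1:]
    if name not in COMMAND_ARITY:
        return False, f"unknown command: {name}"
    low, high = COMMAND_ARITY[name]
    if not low <= len(args) <= high:
        return False, f"{name} takes {low} to {high} argument(s), got {len(args)}"
    if name == "set_fan_speed":
        try:
            pct = float(args[1])
        except ValueError:
            return False, "fan speed must be a number"
        if not 0 <= pct <= 100:
            return False, "fan speed must be within 0-100"
    if name == "set_ups_mode" and args[1] not in UPS_MODES:
        return False, f"unknown UPS mode: {args[1]}"
    return True, "ok"


if __name__ == "__main__":
    for candidate in ["diagnose CRAC-3", "set_fan_speed CRAC-1 120", "reboot"]:
        print(candidate, "->", check_command(candidate))
```

The check is deliberately shallow: it does not know which unit IDs exist in the loaded facility, so the server remains the authority on whether a command is valid.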

Using the Client

Programmatic Usage (Python)

import asyncio

from dc_ops_env import DcOpsAction, DcOpsEnv


async def main():
    # Connect to a running server
    async with DcOpsEnv(base_url="http://localhost:8000") as env:
        # Reset with a specific scenario
        result = await env.reset(scenario="A2")
        print(result.observation.dashboard)

        # Agent loop: a real agent would choose each command from the observation
        while not result.done:
            result = await env.step(
                DcOpsAction(
                    command="diagnose CRAC-3",
                    reasoning="CRAC-3 shows compressor failure, need to investigate",
                )
            )
            print(f"Reward: {result.reward}")
            print(result.observation.dashboard)


# `async with` and `await` must run inside a coroutine when used in a script
asyncio.run(main())

From Docker Image

from dc_ops_env import DcOpsAction, DcOpsEnv

# Start environment from Docker (auto-manages container lifecycle)
env = DcOpsEnv.from_docker_image("dc-ops:latest")

try:
    result = env.reset(scenario="A2")
    for _ in range(15):
        result = env.step(DcOpsAction(command="check_status"))
        if result.done:
            break
finally:
    env.close()

Concurrent Sessions

The server supports multiple concurrent WebSocket sessions for parallel training:

# In server/app.py — adjust max_concurrent_envs
app = create_app(
    DcOpsEnvironment,
    DcOpsAction,
    DcOpsObservation,
    max_concurrent_envs=16,  # Scale up for parallel RL
)

Clients can then run episodes in parallel, one session per thread:

from concurrent.futures import ThreadPoolExecutor
from dc_ops_env import DcOpsAction, DcOpsEnv

def run_episode(scenario_id: str):
    with DcOpsEnv(base_url="http://localhost:8000") as env:
        result = env.reset(scenario=scenario_id)
        total_reward = 0.0
        while not result.done:
            result = env.step(DcOpsAction(command="check_status"))
            total_reward += result.reward or 0.0  # guard in case a step reports no reward (None)
        return scenario_id, total_reward

# Run 8 episodes concurrently
scenarios = ["A1", "A2", "A4", "B1", "B3", "B4", "A2", "B4"]
with ThreadPoolExecutor(max_workers=8) as executor:
    results = list(executor.map(run_episode, scenarios))

Scenarios

6 operational scenarios across 3 difficulty levels:

| ID | Scenario | Difficulty | Type | Fault |
|---|---|---|---|---|
| A1 | Cooling Setpoint Optimization | Easy | Thermal | CRACs at 15°C (wasteful) |
| A2 | Thermal Event Response | Medium | Thermal | CRAC-3 compressor failure |
| A4 | CRAC Failure Cascade | Hard | Thermal | CRAC-1 compressor + CRAC-3 fan |
| B1 | UPS Alarm Response | Medium | Power | UPS transferred to battery |
| B3 | Generator Test Protocol | Easy | Power | None (routine test) |
| B4 | Power Failure Cascade | Hard | Power | Utility loss + extended gen warmup |

Reset with a specific scenario:

result = env.reset(scenario="A2")           # By ID
result = env.reset(random_scenario=True)    # Random
result = env.reset(random_scenario=True, difficulty="hard")  # Random hard
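
For reproducible evaluation runs it can be convenient to pick the scenario ID on the client with a seeded random generator and then pass it through `scenario=`, rather than relying on `random_scenario=True`. The sketch below assumes nothing beyond the scenario table above. `SCENARIOS` and `pick_scenario` are hypothetical names, and the lowercase difficulty strings follow the `difficulty="hard"` example.

```python
import random

# (id, difficulty, type) rows copied from the scenario table above.
SCENARIOS = [
    ("A1", "easy", "thermal"),
    ("A2", "medium", "thermal"),
    ("A4", "hard", "thermal"),
    ("B1", "medium", "power"),
    ("B3", "easy", "power"),
    ("B4", "hard", "power"),
]


def pick_scenario(rng, difficulty=None, scenario_type=None):
    """Pick one scenario ID that matches the optional filters."""
    candidates = [
        scenario_id
        for scenario_id, diff, kind in SCENARIOS
        if (difficulty is None or diff == difficulty)
        and (scenario_type is None or kind == scenario_type)
    ]
    if not candidates:
        raise ValueError("no scenario matches the filters")
    return rng.choice(candidates)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the sequence repeats across runs
    schedule = [pick_scenario(rng, difficulty="hard") for _ in range(4)]
    print(schedule)  # each entry can be passed to env.reset(scenario=...)
```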

Configuration

Built-in Facility Configs

Three YAML configurations are included:

| Config | Zones | Racks | IT Load | CRACs | Use Case |
|---|---|---|---|---|---|
| default | 2 | 20 | 160 kW | 4 × 70 kW | Standard facility |
| small | 1 | 10 | 80 kW | 2 × 70 kW | Edge / branch office |
| large | 4 | 60 | 600 kW | 8 × 100 kW | Multi-zone + GPU (H1) |

from dc_ops_env.config import load_datacenter_config

# Load a built-in config
config = load_datacenter_config("small")

# Load a custom YAML file
config = load_datacenter_config("/path/to/my_datacenter.yaml")

# Use with environment
result = env.reset(scenario="A2", config=config)
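
The three built-in configs differ in how much cooling headroom they leave, which matters for the CRAC failure scenarios. The short calculation below uses only the nameplate figures from the table of built-in configs. It is a back-of-the-envelope sketch and ignores effects the simulation may model, such as fan heat or capacity changing with return temperature. `FACILITIES` and `cooling_summary` are hypothetical names.

```python
# (it_load_kw, crac_count, crac_capacity_kw) copied from the built-in config table.
FACILITIES = {
    "default": (160.0, 4, 70.0),
    "small": (80.0, 2, 70.0),
    "large": (600.0, 8, 100.0),
}


def cooling_summary(it_load_kw, crac_count, crac_capacity_kw):
    """Compare rated cooling capacity with IT load, with and without one CRAC."""
    total_kw = crac_count * crac_capacity_kw
    with_one_lost_kw = (crac_count - 1) * crac_capacity_kw
    return {
        "total_cooling_kw": total_kw,
        "capacity_ratio": round(total_kw / it_load_kw, 2),
        "survives_one_crac_loss": with_one_lost_kw >= it_load_kw,
    }


if __name__ == "__main__":
    for name, figures in FACILITIES.items():
        print(name, cooling_summary(*figures))
```

On rated figures alone, default (280 kW for 160 kW) and large (800 kW for 600 kW) can still cover the IT load after losing one CRAC, while small (140 kW for 80 kW) drops to 70 kW and cannot.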

Custom YAML Configuration

Create your own datacenter layout:

name: "My Custom Facility"
outside_temp_c: 35.0
outside_humidity_rh: 0.40
simulation_dt_s: 1.0

zones:
  - zone_id: zone_a
    containment_type: cold_aisle
    recirculation_factor: 0.08
    air_volume_m3: 500.0
    envelope_r_kw: 0.02
    initial_cold_aisle_temp_c: 20.0
    ashrae_class: A2
    racks:
      - { rack_id: A-01, row: A, position: 1, it_load_kw: 8.0,
          num_servers_2u: 20, server_thermal_mass_jk: 11100.0,
          airflow_cfm_per_kw: 160.0 }
      # ... more racks
    crac_units:
      - { unit_id: CRAC-1, rated_capacity_kw: 70.0,
          rated_return_temp_c: 24.0, capacity_slope_per_c: 0.03,
          max_airflow_cfm: 12000.0, fan_rated_power_kw: 5.0,
          cop_rated: 3.5, initial_setpoint_c: 18.0,
          initial_fan_speed_pct: 100.0, supply_temp_lag_s: 30.0 }

power:
  utility_voltage_v: 480.0
  utility_available: true
  ups_units:
    - { unit_id: UPS-1, rated_capacity_kw: 500.0,
        loss_c0: 0.013, loss_c1: 0.006, loss_c2: 0.011,
        battery_capacity_kwh: 8.3, battery_discharge_efficiency: 0.90,
        battery_aging_factor: 0.85, recharge_rate_kw: 5.0,
        initial_mode: double_conversion }
  pdus:
    - { pdu_id: PDU-A-01, voltage_ll_v: 208.0,
        max_current_per_phase_a: 24.0, num_phases: 3,
        efficiency: 0.98, continuous_derating: 0.80 }
  generator:
    gen_id: GEN-1
    rated_capacity_kw: 750.0
    start_delay_s: 4.0
    crank_time_s: 5.0
    warmup_time_s: 8.0
    fuel_tank_liters: 2000.0
    consumption_lph_full: 180.0
    cooldown_time_s: 300.0
  ats:
    ats_id: ATS-1
    transfer_time_ms: 100.0
    retransfer_delay_s: 300.0

See data/datacenter_configs/ for complete examples.
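
A few derived quantities are handy sanity checks when writing a custom YAML. The sketch below works them out from the example values above using textbook formulas: the three-phase apparent power √3 × V_LL × I, fuel volume divided by burn rate, and battery energy divided by load. These are assumptions about how to read the parameters, not a description of how the simulation computes them. The 160 kW load in the UPS calculation is borrowed from the default facility and is not part of the YAML example.

```python
import math


def pdu_continuous_kva(voltage_ll_v, max_current_per_phase_a, continuous_derating):
    """Three-phase apparent power at the derated continuous current, in kVA."""
    return math.sqrt(3) * voltage_ll_v * max_current_per_phase_a * continuous_derating / 1000.0


def generator_runtime_h(fuel_tank_liters, consumption_lph_full):
    """Hours of runtime on a full tank at full-load consumption."""
    return fuel_tank_liters / consumption_lph_full


def generator_ready_s(start_delay_s, crank_time_s, warmup_time_s):
    """Seconds until the generator can take load, assuming the phases run back to back."""
    return start_delay_s + crank_time_s + warmup_time_s


def ups_bridge_s(battery_capacity_kwh, discharge_efficiency, aging_factor, load_kw):
    """Seconds of battery support, assuming usable energy = capacity x efficiency x aging."""
    usable_kwh = battery_capacity_kwh * discharge_efficiency * aging_factor
    return usable_kwh / load_kw * 3600.0


if __name__ == "__main__":
    print(f"PDU continuous capacity: {pdu_continuous_kva(208.0, 24.0, 0.80):.2f} kVA")
    print(f"Generator runtime:       {generator_runtime_h(2000.0, 180.0):.1f} h")
    print(f"Generator ready after:   {generator_ready_s(4.0, 5.0, 8.0):.0f} s")
    print(f"UPS bridge at 160 kW:    {ups_bridge_s(8.3, 0.90, 0.85, 160.0):.0f} s")
```

Under these assumptions the battery would carry a 160 kW load for roughly 143 s, comfortably longer than the 17 s the generator needs to come online. Lengthening `warmup_time_s` or shrinking the battery narrows that margin.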


TRL / GRPO Training Integration

DC-Ops integrates directly with HuggingFace TRL's GRPOTrainer via the OpenEnv environment_factory pattern:

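from trl import GRPOConfig, GRPOTrainer
from dc_ops_env import DcOpsEnv

def dc_ops_environment_factory():
    """Factory that returns a DC-Ops environment instance."""
    return DcOpsEnv(base_url="http://localhost:8000")

config = GRPOConfig(
    output_dir="dc-ops-grpo",
    # ... training hyperparameters
)

trainer = GRPOTrainer(
    model="your-base-model",   # the model is passed to the trainer, not to GRPOConfig
    args=config,
    # Keyword name follows the environment_factory pattern named above;
    # check the installed TRL version for the exact argument it expects.
    environment_factory=dc_ops_environment_factory,
    # ... reward_funcs, train_dataset, and other args
)

trainer.train()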

For multi-environment parallel training, run multiple servers or increase max_concurrent_envs and spawn concurrent clients.
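
When several servers are running, episodes have to be spread across them without exceeding each server's `max_concurrent_envs`. The sketch below shows a plain round-robin assignment and a check of the busiest server. The helper names and the example URLs (ports 8000 and 8001) are hypothetical, and the scenario list is the one used in the concurrent-sessions example earlier.

```python
from collections import Counter
from itertools import cycle


def assign_round_robin(scenario_ids, server_urls):
    """Pair each scenario with a server URL, cycling through the servers in order."""
    if not server_urls:
        raise ValueError("need at least one server URL")
    return list(zip(cycle(server_urls), scenario_ids))


def peak_sessions(assignments):
    """Largest number of episodes assigned to any single server."""
    counts = Counter(url for url, _ in assignments)
    return max(counts.values(), default=0)


if __name__ == "__main__":
    servers = ["http://localhost:8000", "http://localhost:8001"]  # example URLs
    scenarios = ["A1", "A2", "A4", "B1", "B3", "B4", "A2", "B4"]
    plan = assign_round_robin(scenarios, servers)
    for url, scenario_id in plan:
        print(url, scenario_id)
    # If every episode runs at once, this must not exceed max_concurrent_envs.
    print("peak sessions per server:", peak_sessions(plan))
```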


Deploy to HuggingFace Spaces

Using OpenEnv CLI

The simplest way to deploy:

# From the dc_ops_env/ directory (where openenv.yaml is located)
cd dc_ops_env

# Login to HuggingFace (if not already)
huggingface-cli login

# Push to HuggingFace Spaces
openenv push

# Or with options
openenv push --repo-id your-username/dc-ops-env --private
openenv push --namespace your-org

What Gets Deployed

The openenv push command:

  1. Validates the openenv.yaml manifest
  2. Builds a Docker Space on HuggingFace
  3. Uploads all environment code

Your deployed Space will be available at: https://huggingface.co/spaces/<repo-id>

The Space includes:

  • Web Interface at /web — Interactive scenario browser and dashboard viewer
  • API Documentation at /docs — Full OpenAPI/Swagger interface
  • Health Check at /health — Container health monitoring
  • WebSocket at /ws — Persistent session endpoint for agent connections

Connecting to a Deployed Space

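import asyncio

from dc_ops_env import DcOpsAction, DcOpsEnv

# Connect to your HuggingFace Space
space_url = "https://your-username-dc-ops-env.hf.space"


async def main():
    async with DcOpsEnv(base_url=space_url) as env:
        result = await env.reset(scenario="A2")
        print(result.observation.dashboard)


# `async with` and `await` must run inside a coroutine when used in a script
asyncio.run(main())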
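
The Space URL in this example follows from the repo ID used with `openenv push --repo-id your-username/dc-ops-env`: the owner and Space name are joined with a hyphen under hf.space. The sketch below reproduces that mapping and derives the matching WebSocket URL. Both helpers are hypothetical, and the mapping is only checked against the example in this README. Repo IDs with other characters (underscores, dots) may be normalized differently by HuggingFace.

```python
def space_base_url(repo_id: str) -> str:
    """Map 'owner/space-name' to the Space's direct URL (simple cases only)."""
    owner, _, name = repo_id.partition("/")
    if not owner or not name:
        raise ValueError("repo_id must look like 'owner/space-name'")
    return f"https://{owner}-{name}.hf.space".lower()


def websocket_url(base_url: str, path: str = "/ws") -> str:
    """Swap http(s) for ws(s) and append the WebSocket path."""
    for http_scheme, ws_scheme in (("https://", "wss://"), ("http://", "ws://")):
        if base_url.startswith(http_scheme):
            return ws_scheme + base_url[len(http_scheme):].rstrip("/") + path
    raise ValueError("base_url must start with http:// or https://")


if __name__ == "__main__":
    base = space_base_url("your-username/dc-ops-env")
    print(base)
    print(websocket_url(base))
    print(websocket_url("http://localhost:8000"))
```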

CLI Options

| Option | Description |
|---|---|
| `--directory`, `-d` | Directory containing the OpenEnv environment (default: current) |
| `--repo-id`, `-r` | Repository ID username/repo-name (default: from openenv.yaml) |
| `--base-image`, `-b` | Override base Docker image |
| `--private` | Deploy as a private Space |
| `--namespace` | HuggingFace namespace (user or org) |

Development

Running Tests

# All tests (256 tests)
uv run pytest tests/ -v

# Specific test modules
uv run pytest tests/test_thermal.py -v      # Thermal physics
uv run pytest tests/test_power.py -v        # Power systems
uv run pytest tests/test_actions.py -v      # Command parser
uv run pytest tests/test_rewards.py -v      # Reward function
uv run pytest tests/test_scenarios.py -v    # Scenario framework
uv run pytest tests/test_integration.py -v  # End-to-end episodes

# With coverage
uv run pytest tests/ --cov=dc_ops_env --cov-report=term-missing

Direct Environment Testing (No Server)

Test the environment logic without the HTTP/WebSocket layer:

from dc_ops_env.server.dc_ops_env_environment import DcOpsEnvironment
from dc_ops_env.models import DcOpsAction

env = DcOpsEnvironment()
obs = env.reset(scenario="A2")
print(obs.dashboard)

obs = env.step(DcOpsAction(command="diagnose CRAC-3"))
print(f"Reward: {obs.reward}")
print(obs.dashboard)

Running the Server Locally

# Via entry point (recommended)
uv run server

# With custom port
uv run server --port 8001

# Via uvicorn directly (with auto-reload for development)
uvicorn server.app:app --reload --host 0.0.0.0 --port 8000

# Production (multi-worker)
uvicorn server.app:app --host 0.0.0.0 --port 8000 --workers 4

Project Structure

dc_ops_env/
├── openenv.yaml                    # OpenEnv manifest
├── pyproject.toml                  # Dependencies and metadata
├── README.md                       # This file (HF Space README)
├── __init__.py                     # Exports: DcOpsEnv, DcOpsAction, DcOpsObservation
├── config.py                       # Physical constants, ASHRAE limits, YAML loader
├── models.py                       # Pydantic Action/Observation models
├── client.py                       # DcOpsEnv (EnvClient subclass)
├── simulation/
│   ├── thermal.py                  # RC thermal network (zones, racks, CRACs)
│   ├── power.py                    # UPS, PDU, generator, ATS models
│   └── types.py                    # Runtime state dataclasses
├── scenarios/
│   ├── base.py                     # Abstract Scenario + ProcedureRule
│   ├── registry.py                 # Scenario registration and selection
│   ├── thermal_scenarios.py        # A1, A2, A4
│   └── power_scenarios.py          # B1, B3, B4
├── rewards/
│   └── reward_function.py          # 6-component composite reward
├── rendering/
│   └── dashboard.py                # State → text dashboard
├── actions/
│   └── parser.py                   # Deterministic command parser
├── server/
│   ├── dc_ops_env_environment.py   # OpenEnv Environment implementation
│   ├── app.py                      # FastAPI application
│   └── Dockerfile                  # Container image
├── data/
│   └── datacenter_configs/         # YAML facility definitions
│       ├── default.yaml            # 2 zones, 20 racks, 160 kW
│       ├── small_facility.yaml     # 1 zone, 10 racks, 80 kW
│       └── large_facility.yaml     # 4 zones, 60 racks, 600 kW
└── tests/                          # 256 tests across 6 modules
    ├── test_thermal.py
    ├── test_power.py
    ├── test_actions.py
    ├── test_rewards.py
    ├── test_scenarios.py
    └── test_integration.py

License

BSD-style license. See LICENSE for details.