---
title: DC-Ops Environment Server
emoji: ⚡
colorFrom: blue
colorTo: green
sdk: docker
pinned: false
app_port: 8000
base_path: /web
tags:
  - openenv
  - reinforcement-learning
  - datacenter
  - simulation
thumbnail: >-
  https://cdn-uploads.huggingface.co/production/uploads/6578359f277dceb6056f1646/WSrOC1cq1OHFyKV469PIQ.png
---

# DC-Ops Environment

A physics-based datacenter operations environment for training LLM agents, built on Meta's [OpenEnv](https://github.com/meta-pytorch/OpenEnv) framework. The agent reads a text-based NOC dashboard and issues natural-language operator commands, exactly as a human datacenter operator would.

## Quick Start

### Prerequisites

- Python 3.10+
- [uv](https://docs.astral.sh/uv/) (recommended) or pip
- Docker (for containerized deployment)

### Install & Run Locally

```bash
# Clone the repository
git clone
cd dc_ops_env

# Install dependencies
uv sync

# Run the test suite (256 tests, <10s)
uv run pytest tests/ -v

# Start the server
uv run server
```

The server starts at `http://localhost:8000` with:

- **Web UI** → `http://localhost:8000/web`
- **API docs** → `http://localhost:8000/docs`
- **Health check** → `http://localhost:8000/health`
- **WebSocket** → `ws://localhost:8000/ws`

### Run with Docker

```bash
# Build the image
docker build -t dc-ops:latest -f server/Dockerfile .

# Run the container
docker run -d -p 8000:8000 dc-ops:latest

# Verify it's running
curl http://localhost:8000/health
```

---

## OpenEnv Integration

DC-Ops is a fully compliant [OpenEnv](https://github.com/meta-pytorch/OpenEnv) environment.
OpenEnv provides:

- **MCP tool-based interactions** for LLM agents (WebSocket `/ws`)
- **HTTP orchestration layer** for training pipelines (`/reset`, `/step`, `/state`)
- **HuggingFace Spaces deployment** via `openenv push`
- **TRL/GRPO integration** for RL training with `GRPOTrainer`

### Action & Observation Models

**DcOpsAction**, the agent's command:

```python
class DcOpsAction(Action):
    command: str    # e.g., "diagnose CRAC-3", "adjust_setpoint CRAC-1 20"
    reasoning: str  # Optional chain-of-thought
```

**DcOpsObservation**, what the agent sees:

```python
class DcOpsObservation(Observation):
    dashboard: str           # Text-rendered monitoring dashboard
    available_actions: list  # Valid commands the agent can issue
    alert: str               # Current active alert message
    scenario_type: str       # "thermal", "power", etc.
    steps_remaining: int     # Steps left in episode budget
    action_result: str       # Feedback from last action
```

### Available Commands

| Command | Format | Description |
|---------|--------|-------------|
| `diagnose` | `diagnose ` | Inspect a CRAC/UPS/PDU for faults |
| `adjust_setpoint` | `adjust_setpoint ` | Change CRAC supply air setpoint |
| `set_fan_speed` | `set_fan_speed ` | Set CRAC fan speed (0-100%) |
| `set_rack_load` | `set_rack_load ` | Adjust rack IT load (migrate workload) |
| `start_crac` | `start_crac ` | Start a standby CRAC unit |
| `stop_crac` | `stop_crac ` | Put a CRAC into standby |
| `start_generator` | `start_generator` | Manually start the diesel generator |
| `stop_generator` | `stop_generator` | Initiate generator cooldown |
| `set_ups_mode` | `set_ups_mode ` | Set UPS mode (eco/double_conversion/bypass) |
| `refuel_generator` | `refuel_generator [liters]` | Refuel (default: full tank) |
| `acknowledge_alarm` | `acknowledge_alarm` | Acknowledge current alert |
| `check_status` | `check_status` | Request full status report |
| `escalate` | `escalate` | Escalate to senior engineer |
| `wait` | `wait` | Take no action this step |

---

## Using the Client

### Programmatic Usage (Python)

```python
import asyncio

from dc_ops_env import DcOpsAction, DcOpsEnv


async def main():
    # Connect to a running server
    async with DcOpsEnv(base_url="http://localhost:8000") as env:
        # Reset with a specific scenario
        result = await env.reset(scenario="A2")
        print(result.observation.dashboard)

        # Agent loop
        while not result.done:
            result = await env.step(
                DcOpsAction(
                    command="diagnose CRAC-3",
                    reasoning="CRAC-3 shows compressor failure, need to investigate",
                )
            )
            print(f"Reward: {result.reward}")
            print(result.observation.dashboard)


# `async with` must run inside a coroutine, so drive it with asyncio.run()
asyncio.run(main())
```

### From Docker Image

```python
from dc_ops_env import DcOpsAction, DcOpsEnv

# Start environment from Docker (auto-manages container lifecycle)
env = DcOpsEnv.from_docker_image("dc-ops:latest")
try:
    result = env.reset(scenario="A2")
    for _ in range(15):
        result = env.step(DcOpsAction(command="check_status"))
        if result.done:
            break
finally:
    env.close()
```

### Concurrent Sessions

The server supports multiple concurrent WebSocket sessions for parallel training:

```python
# In server/app.py, adjust max_concurrent_envs
app = create_app(
    DcOpsEnvironment,
    DcOpsAction,
    DcOpsObservation,
    max_concurrent_envs=16,  # Scale up for parallel RL
)
```

```python
from concurrent.futures import ThreadPoolExecutor

from dc_ops_env import DcOpsAction, DcOpsEnv


def run_episode(scenario_id: str):
    with DcOpsEnv(base_url="http://localhost:8000") as env:
        result = env.reset(scenario=scenario_id)
        total_reward = 0.0
        while not result.done:
            result = env.step(DcOpsAction(command="check_status"))
            # Treat a missing reward as zero so the sum never fails
            total_reward += result.reward or 0.0
        return scenario_id, total_reward


# Run 8 episodes concurrently
scenarios = ["A1", "A2", "A4", "B1", "B3", "B4", "A2", "B4"]
with ThreadPoolExecutor(max_workers=8) as executor:
    results = list(executor.map(run_episode, scenarios))
```

---

## Scenarios

Six operational scenarios across three difficulty levels:

| ID | Scenario | Difficulty | Type | Fault |
|----|----------|------------|------|-------|
| A1 | Cooling Setpoint Optimization | Easy | Thermal | CRACs at 15°C (wasteful) |
| A2 | Thermal Event Response | Medium | Thermal | CRAC-3 compressor failure |
| A4 | CRAC Failure Cascade | Hard | Thermal | CRAC-1 compressor + CRAC-3 fan |
| B1 | UPS Alarm Response | Medium | Power | UPS transferred to battery |
| B3 | Generator Test Protocol | Easy | Power | None (routine test) |
| B4 | Power Failure Cascade | Hard | Power | Utility loss + extended gen warmup |

Reset with a specific scenario:

```python
result = env.reset(scenario="A2")                              # By ID
result = env.reset(random_scenario=True)                       # Random
result = env.reset(random_scenario=True, difficulty="hard")    # Random hard
```

---

## Configuration

### Built-in Facility Configs

Three YAML configurations are included:

| Config | Zones | Racks | IT Load | CRACs | Use Case |
|--------|-------|-------|---------|-------|----------|
| `default` | 2 | 20 | 160 kW | 4 × 70 kW | Standard facility |
| `small` | 1 | 10 | 80 kW | 2 × 70 kW | Edge / branch office |
| `large` | 4 | 60 | 600 kW | 8 × 100 kW | Multi-zone + GPU (H1) |

```python
from dc_ops_env.config import load_datacenter_config

# Load a built-in config
config = load_datacenter_config("small")

# Load a custom YAML file
config = load_datacenter_config("/path/to/my_datacenter.yaml")

# Use with environment
result = env.reset(scenario="A2", config=config)
```

### Custom YAML Configuration

Create your own datacenter layout:

```yaml
name: "My Custom Facility"
outside_temp_c: 35.0
outside_humidity_rh: 0.40
simulation_dt_s: 1.0

zones:
  - zone_id: zone_a
    containment_type: cold_aisle
    recirculation_factor: 0.08
    air_volume_m3: 500.0
    envelope_r_kw: 0.02
    initial_cold_aisle_temp_c: 20.0
    ashrae_class: A2
    racks:
      - { rack_id: A-01, row: A, position: 1, it_load_kw: 8.0,
          num_servers_2u: 20, server_thermal_mass_jk: 11100.0,
          airflow_cfm_per_kw: 160.0 }
      # ... more racks
    crac_units:
      - { unit_id: CRAC-1, rated_capacity_kw: 70.0,
          rated_return_temp_c: 24.0, capacity_slope_per_c: 0.03,
          max_airflow_cfm: 12000.0, fan_rated_power_kw: 5.0,
          cop_rated: 3.5, initial_setpoint_c: 18.0,
          initial_fan_speed_pct: 100.0, supply_temp_lag_s: 30.0 }

power:
  utility_voltage_v: 480.0
  utility_available: true
  ups_units:
    - { unit_id: UPS-1, rated_capacity_kw: 500.0,
        loss_c0: 0.013, loss_c1: 0.006, loss_c2: 0.011,
        battery_capacity_kwh: 8.3, battery_discharge_efficiency: 0.90,
        battery_aging_factor: 0.85, recharge_rate_kw: 5.0,
        initial_mode: double_conversion }
  pdus:
    - { pdu_id: PDU-A-01, voltage_ll_v: 208.0,
        max_current_per_phase_a: 24.0, num_phases: 3,
        efficiency: 0.98, continuous_derating: 0.80 }
  generator:
    gen_id: GEN-1
    rated_capacity_kw: 750.0
    start_delay_s: 4.0
    crank_time_s: 5.0
    warmup_time_s: 8.0
    fuel_tank_liters: 2000.0
    consumption_lph_full: 180.0
    cooldown_time_s: 300.0
  ats:
    ats_id: ATS-1
    transfer_time_ms: 100.0
    retransfer_delay_s: 300.0
```

See [data/datacenter_configs/](data/datacenter_configs/) for complete examples.

---

## TRL / GRPO Training Integration

DC-Ops integrates directly with HuggingFace TRL's `GRPOTrainer` via the OpenEnv `environment_factory` pattern:

```python
from trl import GRPOConfig, GRPOTrainer

from dc_ops_env import DcOpsEnv


def dc_ops_environment_factory():
    """Factory that returns a DC-Ops environment instance."""
    return DcOpsEnv(base_url="http://localhost:8000")


config = GRPOConfig(
    # ... training hyperparameters
)

trainer = GRPOTrainer(
    model="your-base-model",
    args=config,
    environment_factory=dc_ops_environment_factory,
    # ... other args (e.g. reward_funcs, train_dataset)
)
trainer.train()
```

For multi-environment parallel training, run multiple servers or increase `max_concurrent_envs` and spawn concurrent clients.
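
One way to spread concurrent clients over several servers is to assign scenario IDs to server URLs in round-robin order and run each pairing in a thread pool. The sketch below only illustrates that scheduling step. `assign_servers`, `run_parallel`, `fake_episode`, and the two server URLs are hypothetical names introduced for this example and are not part of DC-Ops or OpenEnv. The stand-in episode function contacts no server.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import cycle
from typing import Callable, Sequence


def assign_servers(
    scenarios: Sequence[str], server_urls: Sequence[str]
) -> list[tuple[str, str]]:
    """Pair each scenario ID with a server URL in round-robin order."""
    if not server_urls:
        raise ValueError("at least one server URL is required")
    # zip() stops when the finite scenario list is exhausted
    return list(zip(scenarios, cycle(server_urls)))


def run_parallel(
    scenarios: Sequence[str],
    server_urls: Sequence[str],
    run_episode: Callable[[str, str], object],
    max_workers: int = 8,
) -> list:
    """Run run_episode(scenario_id, base_url) for every pairing, in threads."""
    jobs = assign_servers(scenarios, server_urls)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map() returns results in the order the jobs were submitted
        return list(pool.map(lambda job: run_episode(*job), jobs))


def fake_episode(scenario_id: str, base_url: str) -> tuple[str, str, float]:
    """Hypothetical stand-in; a real version would open DcOpsEnv(base_url=base_url)."""
    return scenario_id, base_url, 0.0


if __name__ == "__main__":
    # Hypothetical deployment: two local servers on different ports
    urls = ["http://localhost:8000", "http://localhost:8001"]
    for row in run_parallel(["A1", "A2", "B1", "B4"], urls, fake_episode):
        print(row)
```

In practice `fake_episode` would be replaced by an episode function like `run_episode` from the Concurrent Sessions section, extended to take the server URL as a second argument. Keep `max_workers` at or below the combined `max_concurrent_envs` of the servers involved.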
---

## Deploy to HuggingFace Spaces

### Using OpenEnv CLI

The simplest way to deploy:

```bash
# From the dc_ops_env/ directory (where openenv.yaml is located)
cd dc_ops_env

# Login to HuggingFace (if not already)
huggingface-cli login

# Push to HuggingFace Spaces
openenv push

# Or with options
openenv push --repo-id your-username/dc-ops-env --private
openenv push --namespace your-org
```

### What Gets Deployed

The `openenv push` command:

1. Validates the `openenv.yaml` manifest
2. Builds a Docker Space on HuggingFace
3. Uploads all environment code

Your deployed Space will be available at: `https://huggingface.co/spaces/`

The Space includes:

- **Web Interface** at `/web`: interactive scenario browser and dashboard viewer
- **API Documentation** at `/docs`: full OpenAPI/Swagger interface
- **Health Check** at `/health`: container health monitoring
- **WebSocket** at `/ws`: persistent session endpoint for agent connections

### Connecting to a Deployed Space

```python
import asyncio

from dc_ops_env import DcOpsAction, DcOpsEnv

# Connect to your HuggingFace Space
space_url = "https://your-username-dc-ops-env.hf.space"


async def main():
    async with DcOpsEnv(base_url=space_url) as env:
        result = await env.reset(scenario="A2")
        print(result.observation.dashboard)


# `async with` must run inside a coroutine, so drive it with asyncio.run()
asyncio.run(main())
```

### CLI Options

| Option | Description |
|--------|-------------|
| `--directory`, `-d` | Directory containing the OpenEnv environment (default: current) |
| `--repo-id`, `-r` | Repository ID `username/repo-name` (default: from openenv.yaml) |
| `--base-image`, `-b` | Override base Docker image |
| `--private` | Deploy as a private Space |
| `--namespace` | HuggingFace namespace (user or org) |

---

## Development

### Running Tests

```bash
# All tests (256 tests)
uv run pytest tests/ -v

# Specific test modules
uv run pytest tests/test_thermal.py -v       # Thermal physics
uv run pytest tests/test_power.py -v         # Power systems
uv run pytest tests/test_actions.py -v       # Command parser
uv run pytest tests/test_rewards.py -v       # Reward function
uv run pytest tests/test_scenarios.py -v     # Scenario framework
uv run pytest tests/test_integration.py -v   # End-to-end episodes

# With coverage
uv run pytest tests/ --cov=dc_ops_env --cov-report=term-missing
```

### Direct Environment Testing (No Server)

Test the environment logic without the HTTP/WebSocket layer:

```python
from dc_ops_env.server.dc_ops_env_environment import DcOpsEnvironment
from dc_ops_env.models import DcOpsAction

env = DcOpsEnvironment()
obs = env.reset(scenario="A2")
print(obs.dashboard)

obs = env.step(DcOpsAction(command="diagnose CRAC-3"))
print(f"Reward: {obs.reward}")
print(obs.dashboard)
```

### Running the Server Locally

```bash
# Via entry point (recommended)
uv run server

# With custom port
uv run server --port 8001

# Via uvicorn directly (with auto-reload for development)
uvicorn server.app:app --reload --host 0.0.0.0 --port 8000

# Production (multi-worker)
uvicorn server.app:app --host 0.0.0.0 --port 8000 --workers 4
```

---

## Project Structure

```
dc_ops_env/
├── openenv.yaml                   # OpenEnv manifest
├── pyproject.toml                 # Dependencies and metadata
├── README.md                      # This file (HF Space README)
├── __init__.py                    # Exports: DcOpsEnv, DcOpsAction, DcOpsObservation
├── config.py                      # Physical constants, ASHRAE limits, YAML loader
├── models.py                      # Pydantic Action/Observation models
├── client.py                      # DcOpsEnv (EnvClient subclass)
├── simulation/
│   ├── thermal.py                 # RC thermal network (zones, racks, CRACs)
│   ├── power.py                   # UPS, PDU, generator, ATS models
│   └── types.py                   # Runtime state dataclasses
├── scenarios/
│   ├── base.py                    # Abstract Scenario + ProcedureRule
│   ├── registry.py                # Scenario registration and selection
│   ├── thermal_scenarios.py       # A1, A2, A4
│   └── power_scenarios.py         # B1, B3, B4
├── rewards/
│   └── reward_function.py         # 6-component composite reward
├── rendering/
│   └── dashboard.py               # State → text dashboard
├── actions/
│   └── parser.py                  # Deterministic command parser
├── server/
│   ├── dc_ops_env_environment.py  # OpenEnv Environment implementation
│   ├── app.py                     # FastAPI application
│   └── Dockerfile                 # Container image
├── data/
│   └── datacenter_configs/        # YAML facility definitions
│       ├── default.yaml           # 2 zones, 20 racks, 160 kW
│       ├── small_facility.yaml    # 1 zone, 10 racks, 80 kW
│       └── large_facility.yaml    # 4 zones, 60 racks, 600 kW
└── tests/                         # 256 tests across 6 modules
    ├── test_thermal.py
    ├── test_power.py
    ├── test_actions.py
    ├── test_rewards.py
    ├── test_scenarios.py
    └── test_integration.py
```

## License

BSD-style license. See [LICENSE](../LICENSE) for details.