---
title: DC-Ops Environment Server
emoji:
colorFrom: blue
colorTo: green
sdk: docker
pinned: false
app_port: 8000
base_path: /web
tags:
- openenv
- reinforcement-learning
- datacenter
- simulation
thumbnail: >-
https://cdn-uploads.huggingface.co/production/uploads/6578359f277dceb6056f1646/WSrOC1cq1OHFyKV469PIQ.png
---
# DC-Ops Environment
A physics-based datacenter operations environment for training LLM agents, built on Meta's [OpenEnv](https://github.com/meta-pytorch/OpenEnv) framework.
The agent reads a text-based NOC (network operations center) dashboard and issues short text commands such as `diagnose CRAC-3`, much as a human datacenter operator would.
## Quick Start
### Prerequisites
- Python 3.10+
- [uv](https://docs.astral.sh/uv/) (recommended) or pip
- Docker (for containerized deployment)
### Install & Run Locally
```bash
# Clone the repository
git clone <repo-url>
cd dc_ops_env
# Install dependencies
uv sync
# Run the test suite (256 tests, <10s)
uv run pytest tests/ -v
# Start the server
uv run server
```
The server starts at `http://localhost:8000` with:
- **Web UI**: `http://localhost:8000/web`
- **API docs**: `http://localhost:8000/docs`
- **Health check**: `http://localhost:8000/health`
- **WebSocket**: `ws://localhost:8000/ws`
### Run with Docker
```bash
# Build the image
docker build -t dc-ops:latest -f server/Dockerfile .
# Run the container
docker run -d -p 8000:8000 dc-ops:latest
# Verify it's running
curl http://localhost:8000/health
```
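A container started with `docker run -d` may need a moment before `/health` responds. The following is a minimal sketch of a readiness poll, assuming only that `/health` answers with HTTP 200 once the server is up. The helper names `http_probe` and `wait_until_healthy` are hypothetical and not part of DC-Ops or OpenEnv.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_probe(url: str) -> bool:
    """Return True if a GET to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=2.0) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, HTTP error status, etc.
        return False


def wait_until_healthy(
    probe: Callable[[], bool],
    timeout_s: float = 30.0,
    interval_s: float = 0.5,
) -> bool:
    """Poll `probe` until it succeeds or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

A typical call would be `wait_until_healthy(lambda: http_probe("http://localhost:8000/health"))`. Passing the probe as a callable keeps the polling loop independent of the transport, which also makes it easy to exercise without a running server.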
---
## OpenEnv Integration
DC-Ops is a fully compliant [OpenEnv](https://github.com/meta-pytorch/OpenEnv) environment. OpenEnv provides:
- **MCP tool-based interactions** for LLM agents (WebSocket `/ws`)
- **HTTP orchestration layer** for training pipelines (`/reset`, `/step`, `/state`)
- **HuggingFace Spaces deployment** via `openenv push`
- **TRL/GRPO integration** for RL training with `GRPOTrainer`
### Action & Observation Models
**DcOpsAction** — the agent's command:
```python
class DcOpsAction(Action):
    command: str    # e.g., "diagnose CRAC-3", "adjust_setpoint CRAC-1 20"
    reasoning: str  # Optional chain-of-thought
```
**DcOpsObservation** — what the agent sees:
```python
class DcOpsObservation(Observation):
    dashboard: str           # Text-rendered monitoring dashboard
    available_actions: list  # Valid commands the agent can issue
    alert: str               # Current active alert message
    scenario_type: str       # "thermal", "power", etc.
    steps_remaining: int     # Steps left in episode budget
    action_result: str       # Feedback from last action
```
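To feed an observation to an LLM policy, its fields have to be flattened into one prompt. The sketch below shows one possible layout. `ObservationView` is a hypothetical stand-in that only mirrors the field names listed above so the example runs without the package installed, and `build_prompt` is likewise an illustrative helper, not something DC-Ops ships.

```python
from dataclasses import dataclass, field


@dataclass
class ObservationView:
    """Hypothetical stand-in mirroring the DcOpsObservation field names."""
    dashboard: str
    available_actions: list = field(default_factory=list)
    alert: str = ""
    scenario_type: str = ""
    steps_remaining: int = 0
    action_result: str = ""


def build_prompt(obs: ObservationView) -> str:
    """Flatten one observation into a single text prompt for an LLM policy."""
    lines = [
        f"Scenario type: {obs.scenario_type}",
        f"Steps remaining: {obs.steps_remaining}",
    ]
    # Only mention the alert and the last result when they carry information.
    if obs.alert:
        lines.append(f"ALERT: {obs.alert}")
    if obs.action_result:
        lines.append(f"Last action result: {obs.action_result}")
    lines.append("Dashboard:")
    lines.append(obs.dashboard)
    lines.append("Available actions:")
    lines.extend(f"- {action}" for action in obs.available_actions)
    lines.append("Reply with exactly one command.")
    return "\n".join(lines)
```

Listing `available_actions` explicitly and ending with a single-command instruction is a design choice intended to keep model output close to the command formats the environment accepts.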
### Available Commands
| Command | Format | Description |
|---------|--------|-------------|
| `diagnose` | `diagnose <unit_id>` | Inspect a CRAC/UPS/PDU for faults |
| `adjust_setpoint` | `adjust_setpoint <crac_id> <temp_c>` | Change CRAC supply air setpoint |
| `set_fan_speed` | `set_fan_speed <crac_id> <pct>` | Set CRAC fan speed (0-100%) |
| `set_rack_load` | `set_rack_load <rack_id> <kw>` | Adjust rack IT load (migrate workload) |
| `start_crac` | `start_crac <crac_id>` | Start a standby CRAC unit |
| `stop_crac` | `stop_crac <crac_id>` | Put a CRAC into standby |
| `start_generator` | `start_generator` | Manually start the diesel generator |
| `stop_generator` | `stop_generator` | Initiate generator cooldown |
| `set_ups_mode` | `set_ups_mode <ups_id> <mode>` | Set UPS mode (eco/double_conversion/bypass) |
| `refuel_generator` | `refuel_generator [liters]` | Refuel (default: full tank) |
| `acknowledge_alarm` | `acknowledge_alarm` | Acknowledge current alert |
| `check_status` | `check_status` | Request full status report |
| `escalate` | `escalate` | Escalate to senior engineer |
| `wait` | `wait` | Take no action this step |
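When an LLM generates commands, it can help to reject malformed ones on the client before spending an environment step. The sketch below checks a command string against the formats in the table above. It is an illustration only: `COMMAND_ARITY` and `check_command` are hypothetical names, and the environment's own parser in `actions/parser.py` may apply different or stricter rules.

```python
# Allowed argument counts (min, max) per command, transcribed from the table.
COMMAND_ARITY = {
    "diagnose": (1, 1),
    "adjust_setpoint": (2, 2),
    "set_fan_speed": (2, 2),
    "set_rack_load": (2, 2),
    "start_crac": (1, 1),
    "stop_crac": (1, 1),
    "start_generator": (0, 0),
    "stop_generator": (0, 0),
    "set_ups_mode": (2, 2),
    "refuel_generator": (0, 1),  # liters argument is optional
    "acknowledge_alarm": (0, 0),
    "check_status": (0, 0),
    "escalate": (0, 0),
    "wait": (0, 0),
}
UPS_MODES = {"eco", "double_conversion", "bypass"}


def check_command(text: str) -> tuple[bool, str]:
    """Return (ok, message) for a raw command string."""
    parts = text.split()
    if not parts:
        return False, "empty command"
    verb, args = parts[0], parts[1:]
    if verb not in COMMAND_ARITY:
        return False, f"unknown command: {verb}"
    lo, hi = COMMAND_ARITY[verb]
    if not lo <= len(args) <= hi:
        return False, f"{verb} expects between {lo} and {hi} arguments, got {len(args)}"
    if verb == "set_fan_speed":
        try:
            pct = float(args[1])
        except ValueError:
            return False, "fan speed must be a number"
        if not 0.0 <= pct <= 100.0:
            return False, "fan speed must be within 0-100"
    if verb == "set_ups_mode" and args[1] not in UPS_MODES:
        return False, f"unknown UPS mode: {args[1]}"
    return True, "ok"
```

For example, `check_command("set_fan_speed CRAC-1 120")` fails the range check, while `check_command("refuel_generator")` passes because the liters argument is optional.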
---
## Using the Client
### Programmatic Usage (Python)
```python
import asyncio

from dc_ops_env import DcOpsAction, DcOpsEnv


async def main() -> None:
    # Connect to a running server
    async with DcOpsEnv(base_url="http://localhost:8000") as env:
        # Reset with a specific scenario
        result = await env.reset(scenario="A2")
        print(result.observation.dashboard)
        # Agent loop
        while not result.done:
            result = await env.step(
                DcOpsAction(
                    command="diagnose CRAC-3",
                    reasoning="CRAC-3 shows compressor failure, need to investigate",
                )
            )
            print(f"Reward: {result.reward}")
            print(result.observation.dashboard)


# `async with` must run inside a coroutine when executed as a script
asyncio.run(main())
```
### From Docker Image
```python
from dc_ops_env import DcOpsAction, DcOpsEnv
# Start environment from Docker (auto-manages container lifecycle)
env = DcOpsEnv.from_docker_image("dc-ops:latest")
try:
    result = env.reset(scenario="A2")
    for _ in range(15):
        result = env.step(DcOpsAction(command="check_status"))
        if result.done:
            break
finally:
    env.close()
```
### Concurrent Sessions
The server supports multiple concurrent WebSocket sessions for parallel training:
```python
# In server/app.py — adjust max_concurrent_envs
app = create_app(
    DcOpsEnvironment,
    DcOpsAction,
    DcOpsObservation,
    max_concurrent_envs=16,  # Scale up for parallel RL
)
```
```python
from concurrent.futures import ThreadPoolExecutor

from dc_ops_env import DcOpsAction, DcOpsEnv


def run_episode(scenario_id: str):
    with DcOpsEnv(base_url="http://localhost:8000") as env:
        result = env.reset(scenario=scenario_id)
        total_reward = 0.0
        while not result.done:
            result = env.step(DcOpsAction(command="check_status"))
            total_reward += result.reward
    return scenario_id, total_reward


# Run 8 episodes concurrently
scenarios = ["A1", "A2", "A4", "B1", "B3", "B4", "A2", "B4"]
with ThreadPoolExecutor(max_workers=8) as executor:
    results = list(executor.map(run_episode, scenarios))
```
---
## Scenarios
6 operational scenarios across 3 difficulty levels:
| ID | Scenario | Difficulty | Type | Fault |
|----|----------|------------|------|-------|
| A1 | Cooling Setpoint Optimization | Easy | Thermal | CRACs at 15°C (wasteful) |
| A2 | Thermal Event Response | Medium | Thermal | CRAC-3 compressor failure |
| A4 | CRAC Failure Cascade | Hard | Thermal | CRAC-1 compressor + CRAC-3 fan |
| B1 | UPS Alarm Response | Medium | Power | UPS transferred to battery |
| B3 | Generator Test Protocol | Easy | Power | None (routine test) |
| B4 | Power Failure Cascade | Hard | Power | Utility loss + extended gen warmup |
Reset with a specific scenario:
```python
result = env.reset(scenario="A2") # By ID
result = env.reset(random_scenario=True) # Random
result = env.reset(random_scenario=True, difficulty="hard") # Random hard
```
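For a training curriculum it can be useful to choose scenario IDs on the client and pass them to `reset(scenario=...)` explicitly, so that sampling is reproducible. The sketch below filters the scenario table by difficulty and type. `SCENARIOS` and `pick_scenario` are hypothetical helpers transcribed from the table above. They do not reflect how the server's scenario registry selects scenarios.

```python
import random

# (id, difficulty, type) transcribed from the scenario table.
SCENARIOS = [
    ("A1", "easy", "thermal"),
    ("A2", "medium", "thermal"),
    ("A4", "hard", "thermal"),
    ("B1", "medium", "power"),
    ("B3", "easy", "power"),
    ("B4", "hard", "power"),
]


def pick_scenario(rng, difficulty=None, scenario_type=None):
    """Pick a random scenario ID matching the optional filters."""
    pool = [
        sid
        for sid, diff, kind in SCENARIOS
        if (difficulty is None or diff == difficulty)
        and (scenario_type is None or kind == scenario_type)
    ]
    if not pool:
        raise ValueError("no scenario matches the given filters")
    return rng.choice(pool)


rng = random.Random(0)  # seeded so a curriculum can be replayed
hard_id = pick_scenario(rng, difficulty="hard")  # "A4" or "B4"
```

Passing a seeded `random.Random` instance instead of using the module-level functions keeps the sampling independent of any other randomness in the training process.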
---
## Configuration
### Built-in Facility Configs
Three YAML configurations are included:
| Config | Zones | Racks | IT Load | CRACs | Use Case |
|--------|-------|-------|---------|-------|----------|
| `default` | 2 | 20 | 160 kW | 4 × 70 kW | Standard facility |
| `small` | 1 | 10 | 80 kW | 2 × 70 kW | Edge / branch office |
| `large` | 4 | 60 | 600 kW | 8 × 100 kW | Multi-zone + GPU (H1) |
```python
from dc_ops_env.config import load_datacenter_config
# Load a built-in config
config = load_datacenter_config("small")
# Load a custom YAML file
config = load_datacenter_config("/path/to/my_datacenter.yaml")
# Use with environment
result = env.reset(scenario="A2", config=config)
```
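The figures in the facility table allow a quick cooling headroom check. The sketch below compares total rated CRAC capacity with IT load and tests whether the remaining units could still carry the load after losing one CRAC. It is back-of-the-envelope arithmetic on the rated values only, under the assumption that rated capacity is available. The simulator itself varies capacity with return temperature and other state. `FACILITIES` and `cooling_summary` are hypothetical names.

```python
# (IT load kW, CRAC count, rated capacity per CRAC kW) from the table.
FACILITIES = {
    "default": (160.0, 4, 70.0),
    "small": (80.0, 2, 70.0),
    "large": (600.0, 8, 100.0),
}


def cooling_summary(it_load_kw, crac_count, crac_kw):
    """Return (total capacity kW, capacity/load ratio, covers load with one CRAC down)."""
    total_kw = crac_count * crac_kw
    ratio = round(total_kw / it_load_kw, 2)
    one_down_ok = (crac_count - 1) * crac_kw >= it_load_kw
    return total_kw, ratio, one_down_ok


for name, spec in FACILITIES.items():
    print(name, cooling_summary(*spec))
# default (280.0, 1.75, True)
# small (140.0, 1.75, False)
# large (800.0, 1.33, True)
```

On rated numbers alone, the `small` facility cannot cover its 80 kW load with a single 70 kW unit, so a CRAC failure there leaves less margin than in the other two layouts.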
### Custom YAML Configuration
Create your own datacenter layout:
```yaml
name: "My Custom Facility"
outside_temp_c: 35.0
outside_humidity_rh: 0.40
simulation_dt_s: 1.0
zones:
  - zone_id: zone_a
    containment_type: cold_aisle
    recirculation_factor: 0.08
    air_volume_m3: 500.0
    envelope_r_kw: 0.02
    initial_cold_aisle_temp_c: 20.0
    ashrae_class: A2
    racks:
      - { rack_id: A-01, row: A, position: 1, it_load_kw: 8.0,
          num_servers_2u: 20, server_thermal_mass_jk: 11100.0,
          airflow_cfm_per_kw: 160.0 }
      # ... more racks
    crac_units:
      - { unit_id: CRAC-1, rated_capacity_kw: 70.0,
          rated_return_temp_c: 24.0, capacity_slope_per_c: 0.03,
          max_airflow_cfm: 12000.0, fan_rated_power_kw: 5.0,
          cop_rated: 3.5, initial_setpoint_c: 18.0,
          initial_fan_speed_pct: 100.0, supply_temp_lag_s: 30.0 }
power:
  utility_voltage_v: 480.0
  utility_available: true
  ups_units:
    - { unit_id: UPS-1, rated_capacity_kw: 500.0,
        loss_c0: 0.013, loss_c1: 0.006, loss_c2: 0.011,
        battery_capacity_kwh: 8.3, battery_discharge_efficiency: 0.90,
        battery_aging_factor: 0.85, recharge_rate_kw: 5.0,
        initial_mode: double_conversion }
  pdus:
    - { pdu_id: PDU-A-01, voltage_ll_v: 208.0,
        max_current_per_phase_a: 24.0, num_phases: 3,
        efficiency: 0.98, continuous_derating: 0.80 }
  generator:
    gen_id: GEN-1
    rated_capacity_kw: 750.0
    start_delay_s: 4.0
    crank_time_s: 5.0
    warmup_time_s: 8.0
    fuel_tank_liters: 2000.0
    consumption_lph_full: 180.0
    cooldown_time_s: 300.0
  ats:
    ats_id: ATS-1
    transfer_time_ms: 100.0
    retransfer_delay_s: 300.0
```
See [data/datacenter_configs/](data/datacenter_configs/) for complete examples.
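When tuning a custom configuration, a few derived quantities are worth checking by hand, for example whether the UPS battery can bridge the generator start sequence. The sketch below works these out from the example values above. The formulas are standard engineering approximations and are assumptions here: the simulator's own models in `simulation/power.py` may combine these parameters differently (for instance, UPS losses are ignored).

```python
import math

# Values copied from the example YAML.
battery_capacity_kwh = 8.3
battery_discharge_efficiency = 0.90
battery_aging_factor = 0.85
start_delay_s, crank_time_s, warmup_time_s = 4.0, 5.0, 8.0
fuel_tank_liters, consumption_lph_full = 2000.0, 180.0
voltage_ll_v, max_current_per_phase_a, continuous_derating = 208.0, 24.0, 0.80


def ups_runtime_s(load_kw):
    """Battery runtime, assuming efficiency and aging scale usable energy."""
    usable_kwh = (
        battery_capacity_kwh * battery_discharge_efficiency * battery_aging_factor
    )
    return usable_kwh / load_kw * 3600.0


def generator_ready_s():
    """Time from start request until the generator can accept load."""
    return start_delay_s + crank_time_s + warmup_time_s


def pdu_continuous_kva():
    """Three-phase apparent power limit: sqrt(3) * V_LL * I * derating."""
    return (
        math.sqrt(3) * voltage_ll_v * max_current_per_phase_a * continuous_derating
    ) / 1000.0


def fuel_hours_full_load():
    """Hours of generator runtime on a full tank at full-load consumption."""
    return fuel_tank_liters / consumption_lph_full


print(round(ups_runtime_s(160.0), 1))   # 142.9 s at a 160 kW load
print(generator_ready_s())              # 17.0 s
print(round(pdu_continuous_kva(), 2))   # 6.92 kVA
print(round(fuel_hours_full_load(), 1)) # 11.1 h
```

Under these assumptions, a 160 kW load leaves roughly two minutes of margin between generator readiness and battery depletion. Scenarios that extend the generator warmup, such as B4, reduce that margin.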
---
## TRL / GRPO Training Integration
DC-Ops integrates directly with HuggingFace TRL's `GRPOTrainer` via the OpenEnv `environment_factory` pattern:
```python
from trl import GRPOTrainer, GRPOConfig
from dc_ops_env import DcOpsAction, DcOpsEnv


def dc_ops_environment_factory():
    """Factory that returns a DC-Ops environment instance."""
    env = DcOpsEnv(base_url="http://localhost:8000")
    return env


config = GRPOConfig(
    model_name_or_path="your-base-model",
    # ... training hyperparameters
)
trainer = GRPOTrainer(
    config=config,
    environments=dc_ops_environment_factory,
    # ... other args
)
trainer.train()
```
For multi-environment parallel training, run multiple servers or increase `max_concurrent_envs` and spawn concurrent clients.
---
## Deploy to HuggingFace Spaces
### Using OpenEnv CLI
The simplest way to deploy:
```bash
# From the dc_ops_env/ directory (where openenv.yaml is located)
cd dc_ops_env
# Login to HuggingFace (if not already)
huggingface-cli login
# Push to HuggingFace Spaces
openenv push
# Or with options
openenv push --repo-id your-username/dc-ops-env --private
openenv push --namespace your-org
```
### What Gets Deployed
The `openenv push` command:
1. Validates the `openenv.yaml` manifest
2. Builds a Docker Space on HuggingFace
3. Uploads all environment code
Your deployed Space will be available at:
`https://huggingface.co/spaces/<repo-id>`
The Space includes:
- **Web Interface** at `/web` — Interactive scenario browser and dashboard viewer
- **API Documentation** at `/docs` — Full OpenAPI/Swagger interface
- **Health Check** at `/health` — Container health monitoring
- **WebSocket** at `/ws` — Persistent session endpoint for agent connections
### Connecting to a Deployed Space
```python
import asyncio

from dc_ops_env import DcOpsAction, DcOpsEnv

# Connect to your HuggingFace Space
space_url = "https://your-username-dc-ops-env.hf.space"


async def main() -> None:
    async with DcOpsEnv(base_url=space_url) as env:
        result = await env.reset(scenario="A2")
        print(result.observation.dashboard)


asyncio.run(main())
```
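The Space URL in the example follows the pattern `https://<owner>-<space-name>.hf.space`. The sketch below derives it from a repo ID so the same value passed to `openenv push --repo-id` can be reused on the client. `space_base_url` is a hypothetical helper. Lowercasing and replacing other characters with hyphens is an assumption about how HuggingFace normalizes subdomains, so confirm the result against the URL shown on the Space page.

```python
import re


def space_base_url(repo_id: str) -> str:
    """Map 'owner/space-name' to 'https://owner-space-name.hf.space'."""
    owner, name = repo_id.split("/", 1)  # raises ValueError without a slash
    # Assumed normalization: lowercase, runs of other characters become "-".
    subdomain = re.sub(r"[^a-z0-9]+", "-", f"{owner}-{name}".lower()).strip("-")
    return f"https://{subdomain}.hf.space"


print(space_base_url("your-username/dc-ops-env"))
# https://your-username-dc-ops-env.hf.space
```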
### CLI Options
| Option | Description |
|--------|-------------|
| `--directory`, `-d` | Directory containing the OpenEnv environment (default: current) |
| `--repo-id`, `-r` | Repository ID `username/repo-name` (default: from openenv.yaml) |
| `--base-image`, `-b` | Override base Docker image |
| `--private` | Deploy as a private Space |
| `--namespace` | HuggingFace namespace (user or org) |
---
## Development
### Running Tests
```bash
# All tests (256 tests)
uv run pytest tests/ -v
# Specific test modules
uv run pytest tests/test_thermal.py -v # Thermal physics
uv run pytest tests/test_power.py -v # Power systems
uv run pytest tests/test_actions.py -v # Command parser
uv run pytest tests/test_rewards.py -v # Reward function
uv run pytest tests/test_scenarios.py -v # Scenario framework
uv run pytest tests/test_integration.py -v # End-to-end episodes
# With coverage
uv run pytest tests/ --cov=dc_ops_env --cov-report=term-missing
```
### Direct Environment Testing (No Server)
Test the environment logic without the HTTP/WebSocket layer:
```python
from dc_ops_env.server.dc_ops_env_environment import DcOpsEnvironment
from dc_ops_env.models import DcOpsAction
env = DcOpsEnvironment()
obs = env.reset(scenario="A2")
print(obs.dashboard)
obs = env.step(DcOpsAction(command="diagnose CRAC-3"))
print(f"Reward: {obs.reward}")
print(obs.dashboard)
```
### Running the Server Locally
```bash
# Via entry point (recommended)
uv run server
# With custom port
uv run server --port 8001
# Via uvicorn directly (with auto-reload for development)
uvicorn server.app:app --reload --host 0.0.0.0 --port 8000
# Production (multi-worker)
uvicorn server.app:app --host 0.0.0.0 --port 8000 --workers 4
```
---
## Project Structure
```
dc_ops_env/
├── openenv.yaml # OpenEnv manifest
├── pyproject.toml # Dependencies and metadata
├── README.md # This file (HF Space README)
├── __init__.py # Exports: DcOpsEnv, DcOpsAction, DcOpsObservation
├── config.py # Physical constants, ASHRAE limits, YAML loader
├── models.py # Pydantic Action/Observation models
├── client.py # DcOpsEnv (EnvClient subclass)
├── simulation/
│ ├── thermal.py # RC thermal network (zones, racks, CRACs)
│ ├── power.py # UPS, PDU, generator, ATS models
│ └── types.py # Runtime state dataclasses
├── scenarios/
│ ├── base.py # Abstract Scenario + ProcedureRule
│ ├── registry.py # Scenario registration and selection
│ ├── thermal_scenarios.py # A1, A2, A4
│ └── power_scenarios.py # B1, B3, B4
├── rewards/
│ └── reward_function.py # 6-component composite reward
├── rendering/
│ └── dashboard.py # State → text dashboard
├── actions/
│ └── parser.py # Deterministic command parser
├── server/
│ ├── dc_ops_env_environment.py # OpenEnv Environment implementation
│ ├── app.py # FastAPI application
│ └── Dockerfile # Container image
├── data/
│ └── datacenter_configs/ # YAML facility definitions
│ ├── default.yaml # 2 zones, 20 racks, 160 kW
│ ├── small_facility.yaml # 1 zone, 10 racks, 80 kW
│ └── large_facility.yaml # 4 zones, 60 racks, 600 kW
└── tests/ # 256 tests across 6 modules
├── test_thermal.py
├── test_power.py
├── test_actions.py
├── test_rewards.py
├── test_scenarios.py
└── test_integration.py
```
## License
BSD-style license. See [LICENSE](../LICENSE) for details.