Spaces:

Melikshah
/

dc_ops_env

Running

App Files Files Community

dc_ops_env / README.md

Melikshah

Update README.md

896b01a verified 5 days ago

preview code

raw

history blame contribute delete

15.2 kB

	---
	title: DC-Ops Environment Server
	emoji: ⚡
	colorFrom: blue
	colorTo: green
	sdk: docker
	pinned: false
	app_port: 8000
	base_path: /web
	tags:
	- openenv
	- reinforcement-learning
	- datacenter
	- simulation
	thumbnail: >-
	https://cdn-uploads.huggingface.co/production/uploads/6578359f277dceb6056f1646/WSrOC1cq1OHFyKV469PIQ.png
	---

	# DC-Ops Environment

	A physics-based datacenter operations environment for training LLM agents, built on Meta's [OpenEnv](https://github.com/meta-pytorch/OpenEnv) framework.

	The agent reads a text-based NOC dashboard and issues natural-language operator commands — exactly as a human datacenter operator would.

	## Quick Start

	### Prerequisites

	- Python 3.10+
	- [uv](https://docs.astral.sh/uv/) (recommended) or pip
	- Docker (for containerized deployment)

	### Install & Run Locally

	```bash
	# Clone the repository
	git clone <repo-url>
	cd dc_ops_env

	# Install dependencies
	uv sync

	# Run the test suite (256 tests, <10s)
	uv run pytest tests/ -v

	# Start the server
	uv run server
	```

	The server starts at `http://localhost:8000` with:
	- Web UI → `http://localhost:8000/web`
	- API docs → `http://localhost:8000/docs`
	- Health check → `http://localhost:8000/health`
	- WebSocket → `ws://localhost:8000/ws`

	### Run with Docker

	```bash
	# Build the image
	docker build -t dc-ops:latest -f server/Dockerfile .

	# Run the container
	docker run -d -p 8000:8000 dc-ops:latest

	# Verify it's running
	curl http://localhost:8000/health
	```

	---

	## OpenEnv Integration

	DC-Ops is a fully compliant [OpenEnv](https://github.com/meta-pytorch/OpenEnv) environment. OpenEnv provides:
	- MCP tool-based interactions for LLM agents (WebSocket `/ws`)
	- HTTP orchestration layer for training pipelines (`/reset`, `/step`, `/state`)
	- HuggingFace Spaces deployment via `openenv push`
	- TRL/GRPO integration for RL training with `GRPOTrainer`

	### Action & Observation Models

	DcOpsAction — the agent's command:
	```python
	class DcOpsAction(Action):
	command: str # e.g., "diagnose CRAC-3", "adjust_setpoint CRAC-1 20"
	reasoning: str # Optional chain-of-thought
	```

	DcOpsObservation — what the agent sees:
	```python
	class DcOpsObservation(Observation):
	dashboard: str # Text-rendered monitoring dashboard
	available_actions: list # Valid commands the agent can issue
	alert: str # Current active alert message
	scenario_type: str # "thermal", "power", etc.
	steps_remaining: int # Steps left in episode budget
	action_result: str # Feedback from last action
	```

	### Available Commands

	\| Command \| Format \| Description \|
	\|---------\|--------\|-------------\|
	\| `diagnose` \| `diagnose <unit_id>` \| Inspect a CRAC/UPS/PDU for faults \|
	\| `adjust_setpoint` \| `adjust_setpoint <crac_id> <temp_c>` \| Change CRAC supply air setpoint \|
	\| `set_fan_speed` \| `set_fan_speed <crac_id> <pct>` \| Set CRAC fan speed (0-100%) \|
	\| `set_rack_load` \| `set_rack_load <rack_id> <kw>` \| Adjust rack IT load (migrate workload) \|
	\| `start_crac` \| `start_crac <crac_id>` \| Start a standby CRAC unit \|
	\| `stop_crac` \| `stop_crac <crac_id>` \| Put a CRAC into standby \|
	\| `start_generator` \| `start_generator` \| Manually start the diesel generator \|
	\| `stop_generator` \| `stop_generator` \| Initiate generator cooldown \|
	\| `set_ups_mode` \| `set_ups_mode <ups_id> <mode>` \| Set UPS mode (eco/double_conversion/bypass) \|
	\| `refuel_generator` \| `refuel_generator [liters]` \| Refuel (default: full tank) \|
	\| `acknowledge_alarm` \| `acknowledge_alarm` \| Acknowledge current alert \|
	\| `check_status` \| `check_status` \| Request full status report \|
	\| `escalate` \| `escalate` \| Escalate to senior engineer \|
	\| `wait` \| `wait` \| Take no action this step \|

	---

	## Using the Client

	### Programmatic Usage (Python)

	```python
	from dc_ops_env import DcOpsAction, DcOpsEnv

	# Connect to a running server
	async with DcOpsEnv(base_url="http://localhost:8000") as env:
	# Reset with a specific scenario
	result = await env.reset(scenario="A2")
	print(result.observation.dashboard)

	# Agent loop
	while not result.done:
	result = await env.step(
	DcOpsAction(
	command="diagnose CRAC-3",
	reasoning="CRAC-3 shows compressor failure, need to investigate"
	)
	)
	print(f"Reward: {result.reward}")
	print(result.observation.dashboard)
	```

	### From Docker Image

	```python
	from dc_ops_env import DcOpsAction, DcOpsEnv

	# Start environment from Docker (auto-manages container lifecycle)
	env = DcOpsEnv.from_docker_image("dc-ops:latest")

	try:
	result = env.reset(scenario="A2")
	for _ in range(15):
	result = env.step(DcOpsAction(command="check_status"))
	if result.done:
	break
	finally:
	env.close()
	```

	### Concurrent Sessions

	The server supports multiple concurrent WebSocket sessions for parallel training:

	```python
	# In server/app.py — adjust max_concurrent_envs
	app = create_app(
	DcOpsEnvironment,
	DcOpsAction,
	DcOpsObservation,
	max_concurrent_envs=16, # Scale up for parallel RL
	)
	```

	```python
	from concurrent.futures import ThreadPoolExecutor
	from dc_ops_env import DcOpsAction, DcOpsEnv

	def run_episode(scenario_id: str):
	with DcOpsEnv(base_url="http://localhost:8000") as env:
	result = env.reset(scenario=scenario_id)
	total_reward = 0.0
	while not result.done:
	result = env.step(DcOpsAction(command="check_status"))
	total_reward += result.reward
	return scenario_id, total_reward

	# Run 8 episodes concurrently
	scenarios = ["A1", "A2", "A4", "B1", "B3", "B4", "A2", "B4"]
	with ThreadPoolExecutor(max_workers=8) as executor:
	results = list(executor.map(run_episode, scenarios))
	```

	---

	## Scenarios

	6 operational scenarios across 3 difficulty levels:

	\| ID \| Scenario \| Difficulty \| Type \| Fault \|
	\|----\|----------\|------------\|------\|-------\|
	\| A1 \| Cooling Setpoint Optimization \| Easy \| Thermal \| CRACs at 15°C (wasteful) \|
	\| A2 \| Thermal Event Response \| Medium \| Thermal \| CRAC-3 compressor failure \|
	\| A4 \| CRAC Failure Cascade \| Hard \| Thermal \| CRAC-1 compressor + CRAC-3 fan \|
	\| B1 \| UPS Alarm Response \| Medium \| Power \| UPS transferred to battery \|
	\| B3 \| Generator Test Protocol \| Easy \| Power \| None (routine test) \|
	\| B4 \| Power Failure Cascade \| Hard \| Power \| Utility loss + extended gen warmup \|

	Reset with a specific scenario:
	```python
	result = env.reset(scenario="A2") # By ID
	result = env.reset(random_scenario=True) # Random
	result = env.reset(random_scenario=True, difficulty="hard") # Random hard
	```

	---

	## Configuration

	### Built-in Facility Configs

	Three YAML configurations are included:

	\| Config \| Zones \| Racks \| IT Load \| CRACs \| Use Case \|
	\|--------\|-------\|-------\|---------\|-------\|----------\|
	\| `default` \| 2 \| 20 \| 160 kW \| 4 × 70 kW \| Standard facility \|
	\| `small` \| 1 \| 10 \| 80 kW \| 2 × 70 kW \| Edge / branch office \|
	\| `large` \| 4 \| 60 \| 600 kW \| 8 × 100 kW \| Multi-zone + GPU (H1) \|

	```python
	from dc_ops_env.config import load_datacenter_config

	# Load a built-in config
	config = load_datacenter_config("small")

	# Load a custom YAML file
	config = load_datacenter_config("/path/to/my_datacenter.yaml")

	# Use with environment
	result = env.reset(scenario="A2", config=config)
	```

	### Custom YAML Configuration

	Create your own datacenter layout:

	```yaml
	name: "My Custom Facility"
	outside_temp_c: 35.0
	outside_humidity_rh: 0.40
	simulation_dt_s: 1.0

	zones:
	- zone_id: zone_a
	containment_type: cold_aisle
	recirculation_factor: 0.08
	air_volume_m3: 500.0
	envelope_r_kw: 0.02
	initial_cold_aisle_temp_c: 20.0
	ashrae_class: A2
	racks:
	- { rack_id: A-01, row: A, position: 1, it_load_kw: 8.0,
	num_servers_2u: 20, server_thermal_mass_jk: 11100.0,
	airflow_cfm_per_kw: 160.0 }
	# ... more racks
	crac_units:
	- { unit_id: CRAC-1, rated_capacity_kw: 70.0,
	rated_return_temp_c: 24.0, capacity_slope_per_c: 0.03,
	max_airflow_cfm: 12000.0, fan_rated_power_kw: 5.0,
	cop_rated: 3.5, initial_setpoint_c: 18.0,
	initial_fan_speed_pct: 100.0, supply_temp_lag_s: 30.0 }

	power:
	utility_voltage_v: 480.0
	utility_available: true
	ups_units:
	- { unit_id: UPS-1, rated_capacity_kw: 500.0,
	loss_c0: 0.013, loss_c1: 0.006, loss_c2: 0.011,
	battery_capacity_kwh: 8.3, battery_discharge_efficiency: 0.90,
	battery_aging_factor: 0.85, recharge_rate_kw: 5.0,
	initial_mode: double_conversion }
	pdus:
	- { pdu_id: PDU-A-01, voltage_ll_v: 208.0,
	max_current_per_phase_a: 24.0, num_phases: 3,
	efficiency: 0.98, continuous_derating: 0.80 }
	generator:
	gen_id: GEN-1
	rated_capacity_kw: 750.0
	start_delay_s: 4.0
	crank_time_s: 5.0
	warmup_time_s: 8.0
	fuel_tank_liters: 2000.0
	consumption_lph_full: 180.0
	cooldown_time_s: 300.0
	ats:
	ats_id: ATS-1
	transfer_time_ms: 100.0
	retransfer_delay_s: 300.0
	```

	See [data/datacenter_configs/](data/datacenter_configs/) for complete examples.

	---

	## TRL / GRPO Training Integration

	DC-Ops integrates directly with HuggingFace TRL's `GRPOTrainer` via the OpenEnv `environment_factory` pattern:

	```python
	from trl import GRPOTrainer, GRPOConfig
	from dc_ops_env import DcOpsAction, DcOpsEnv

	def dc_ops_environment_factory():
	"""Factory that returns a DC-Ops environment instance."""
	env = DcOpsEnv(base_url="http://localhost:8000")
	return env

	config = GRPOConfig(
	model_name_or_path="your-base-model",
	# ... training hyperparameters
	)

	trainer = GRPOTrainer(
	config=config,
	environments=dc_ops_environment_factory,
	# ... other args
	)

	trainer.train()
	```

	For multi-environment parallel training, run multiple servers or increase `max_concurrent_envs` and spawn concurrent clients.

	---

	## Deploy to HuggingFace Spaces

	### Using OpenEnv CLI

	The simplest way to deploy:

	```bash
	# From the dc_ops_env/ directory (where openenv.yaml is located)
	cd dc_ops_env

	# Login to HuggingFace (if not already)
	huggingface-cli login

	# Push to HuggingFace Spaces
	openenv push

	# Or with options
	openenv push --repo-id your-username/dc-ops-env --private
	openenv push --namespace your-org
	```

	### What Gets Deployed

	The `openenv push` command:
	1. Validates the `openenv.yaml` manifest
	2. Builds a Docker Space on HuggingFace
	3. Uploads all environment code

	Your deployed Space will be available at:
	`https://huggingface.co/spaces/<repo-id>`

	The Space includes:
	- Web Interface at `/web` — Interactive scenario browser and dashboard viewer
	- API Documentation at `/docs` — Full OpenAPI/Swagger interface
	- Health Check at `/health` — Container health monitoring
	- WebSocket at `/ws` — Persistent session endpoint for agent connections

	### Connecting to a Deployed Space

	```python
	from dc_ops_env import DcOpsAction, DcOpsEnv

	# Connect to your HuggingFace Space
	space_url = "https://your-username-dc-ops-env.hf.space"

	async with DcOpsEnv(base_url=space_url) as env:
	result = await env.reset(scenario="A2")
	print(result.observation.dashboard)
	```

	### CLI Options

	\| Option \| Description \|
	\|--------\|-------------\|
	\| `--directory`, `-d` \| Directory containing the OpenEnv environment (default: current) \|
	\| `--repo-id`, `-r` \| Repository ID `username/repo-name` (default: from openenv.yaml) \|
	\| `--base-image`, `-b` \| Override base Docker image \|
	\| `--private` \| Deploy as a private Space \|
	\| `--namespace` \| HuggingFace namespace (user or org) \|

	---

	## Development

	### Running Tests

	```bash
	# All tests (256 tests)
	uv run pytest tests/ -v

	# Specific test modules
	uv run pytest tests/test_thermal.py -v # Thermal physics
	uv run pytest tests/test_power.py -v # Power systems
	uv run pytest tests/test_actions.py -v # Command parser
	uv run pytest tests/test_rewards.py -v # Reward function
	uv run pytest tests/test_scenarios.py -v # Scenario framework
	uv run pytest tests/test_integration.py -v # End-to-end episodes

	# With coverage
	uv run pytest tests/ --cov=dc_ops_env --cov-report=term-missing
	```

	### Direct Environment Testing (No Server)

	Test the environment logic without the HTTP/WebSocket layer:

	```python
	from dc_ops_env.server.dc_ops_env_environment import DcOpsEnvironment
	from dc_ops_env.models import DcOpsAction

	env = DcOpsEnvironment()
	obs = env.reset(scenario="A2")
	print(obs.dashboard)

	obs = env.step(DcOpsAction(command="diagnose CRAC-3"))
	print(f"Reward: {obs.reward}")
	print(obs.dashboard)
	```

	### Running the Server Locally

	```bash
	# Via entry point (recommended)
	uv run server

	# With custom port
	uv run server --port 8001

	# Via uvicorn directly (with auto-reload for development)
	uvicorn server.app:app --reload --host 0.0.0.0 --port 8000

	# Production (multi-worker)
	uvicorn server.app:app --host 0.0.0.0 --port 8000 --workers 4
	```

	---

	## Project Structure

	```
	dc_ops_env/
	├── openenv.yaml # OpenEnv manifest
	├── pyproject.toml # Dependencies and metadata
	├── README.md # This file (HF Space README)
	├── __init__.py # Exports: DcOpsEnv, DcOpsAction, DcOpsObservation
	├── config.py # Physical constants, ASHRAE limits, YAML loader
	├── models.py # Pydantic Action/Observation models
	├── client.py # DcOpsEnv (EnvClient subclass)
	├── simulation/
	│ ├── thermal.py # RC thermal network (zones, racks, CRACs)
	│ ├── power.py # UPS, PDU, generator, ATS models
	│ └── types.py # Runtime state dataclasses
	├── scenarios/
	│ ├── base.py # Abstract Scenario + ProcedureRule
	│ ├── registry.py # Scenario registration and selection
	│ ├── thermal_scenarios.py # A1, A2, A4
	│ └── power_scenarios.py # B1, B3, B4
	├── rewards/
	│ └── reward_function.py # 6-component composite reward
	├── rendering/
	│ └── dashboard.py # State → text dashboard
	├── actions/
	│ └── parser.py # Deterministic command parser
	├── server/
	│ ├── dc_ops_env_environment.py # OpenEnv Environment implementation
	│ ├── app.py # FastAPI application
	│ └── Dockerfile # Container image
	├── data/
	│ └── datacenter_configs/ # YAML facility definitions
	│ ├── default.yaml # 2 zones, 20 racks, 160 kW
	│ ├── small_facility.yaml # 1 zone, 10 racks, 80 kW
	│ └── large_facility.yaml # 4 zones, 60 racks, 600 kW
	└── tests/ # 256 tests across 6 modules
	├── test_thermal.py
	├── test_power.py
	├── test_actions.py
	├── test_rewards.py
	├── test_scenarios.py
	└── test_integration.py
	```

	## License

	BSD-style license. See [LICENSE](../LICENSE) for details.