Spaces:

sunil18p31a0101
/

cloud_resource_env

Sleeping

App Files Files Community

cloud_resource_env / README.md

sunil18p31a0101

FEAT: " Added for BOth GPU and CPU utilization with thermal control and best allocation."

fa65b6c 2 months ago

preview code

raw

history blame contribute delete

7.14 kB

	---
	title: Cloud GPU+CPU Resource Management Environment
	emoji: ☁️🔥
	colorFrom: blue
	colorTo: purple
	sdk: docker
	pinned: false
	app_port: 8000
	base_path: /web
	tags:
	- openenv-0.2.3
	- openenv
	---

	## Hugging Face Space Deployment

	This Space runs the Cloud GPU+CPU Resource Management OpenEnv environment.

	- OpenEnv pinned ref: `0.2.3`
	- Hub tag: `openenv`
	- Runs on HF Spaces free tier (2 vCPUs, 16GB RAM — no physical GPU needed)

	### Connecting from Code

	```python
	from cloud_resource_env import CloudResourceClient

	env = CloudResourceClient(base_url="https://huggingface.co/spaces/<your-username>/cloud_resource_env")
	```

	# Cloud GPU+CPU Resource Management Environment

	A real-world OpenEnv environment that simulates **cloud GPU and CPU resource
	management** with three progressively harder tasks covering allocation,
	thermal management, and heuristic fragmentation.

	## Environment Overview

	\| Component \| Description \|
	\|---\|---\|
	\| Domain \| Cloud GPU+CPU infrastructure management \|
	\| State \| GPU/CPU utilisation, VRAM, temperature, fragmentation, cost \|
	\| Actions \| Task-specific (see below) — 4 actions per task \|
	\| Reward \| Multi-objective: utilisation efficiency + thermal safety + fragmentation + cost \|
	\| Score \| Normalised cumulative reward ∈ [0.0, 1.0] \|

	## Tasks (3 difficulty levels)

	\| Task \| Difficulty \| Nodes \| Steps \| Focus \|
	\|---\|---\|---\|---\|---\|
	\| `gpu_cpu_allocation` \| Easy \| 3 \| 8 \| GPU+CPU allocation with cost optimisation \|
	\| `thermal_management` \| Medium \| 4 \| 10 \| Temperature monitoring, cooling, load migration \|
	\| `heuristic_fragmentation` \| Hard \| 5 \| 12 \| Fragmented GPU placement + defragmentation \|

	### Task 1: GPU+CPU Allocation (`gpu_cpu_allocation`)

	Manage a cluster of mixed GPU nodes (T4, A100, H100, V100, L4) with both GPU
	and CPU resources. Optimise throughput while staying within budget.

	\| Action \| Effect \|
	\|---\|---\|
	\| `allocate_high` \| Capacity × 1.5, Cost × 1.5 \|
	\| `allocate_low` \| Capacity ÷ 1.5, Cost ÷ 1.5 \|
	\| `maintain` \| No change \|
	\| `migrate` \| Move 30% load to other nodes \|

	### Task 2: Thermal Management (`thermal_management`)

	Monitor GPU temperatures and ambient temperature. Prevent thermal throttling
	by redistributing load or adjusting cooling levels.

	\| Action \| Effect \|
	\|---\|---\|
	\| `increase_cooling` \| Cooling level +1 (max 3), reduces GPU temp ~5°C \|
	\| `decrease_cooling` \| Cooling level -1 (min 0), saves energy \|
	\| `migrate_load` \| Move 40% load to coolest node \|
	\| `maintain` \| No change \|

	Temperature zones:
	- 🟢 Safe: 55°C – 75°C
	- 🟡 Warning: 75°C – max threshold
	- 🔴 Critical: Above max threshold → thermal throttle!

	### Task 3: Heuristic Fragmentation (`heuristic_fragmentation`)

	Place workloads in a fragmented GPU cluster. Each node has 8 GPU slots;
	workloads need contiguous blocks (1, 2, 4, or 8 slots).

	\| Action \| Effect \|
	\|---\|---\|
	\| `best_fit` \| Place in node with smallest sufficient free block \|
	\| `first_fit` \| Place in first node with enough space \|
	\| `compact` \| Defragment first, then best-fit (10% overhead) \|
	\| `split_workload` \| Split across nodes if needed \|

	## MCP Tools

	\| Tool \| Description \|
	\|---\|---\|
	\| `get_cluster_state()` \| Returns metrics for all GPU+CPU nodes \|
	\| `get_task_info()` \| Returns task description, objectives, valid actions \|
	\| `take_action(decisions)` \| Applies decisions, advances timestep, returns reward \|

	## Observation Space (per node)

	\| Field \| Description \|
	\|---\|---\|
	\| `gpu_utilization_pct` \| GPU compute utilisation (%) \|
	\| `cpu_utilization_pct` \| CPU utilisation (%) \|
	\| `gpu_vram_used_gb` / `gpu_vram_capacity_gb` \| VRAM usage \|
	\| `cpu_usage` / `cpu_capacity` \| CPU cores usage \|
	\| `memory_usage_gb` / `memory_capacity_gb` \| RAM usage \|
	\| `gpu_temp_celsius` \| Current GPU temperature \|
	\| `ambient_temp_celsius` \| Outside/data center temperature \|
	\| `cooling_level` \| Cooling intensity (0-3) \|
	\| `thermal_throttle` \| Whether GPU is throttling \|
	\| `fragmentation_score` \| How fragmented free GPU slots are (0-1) \|
	\| `cost_per_step` \| Running cost \|
	\| `power_draw_watts` \| Power consumption \|

	## Quick Start (Async)

	```python
	import asyncio
	from cloud_resource_env import CloudResourceClient

	async def main():
	client = await CloudResourceClient.from_docker_image("cloud-resource-env:latest")
	async with client:
	# Task 1: GPU+CPU allocation
	await client.reset(task="gpu_cpu_allocation")
	state = await client.call_tool("get_cluster_state")
	result = await client.call_tool(
	"take_action",
	decisions='{"node_0": "allocate_high", "node_1": "maintain", "node_2": "migrate"}'
	)

	# Task 2: Thermal management
	await client.reset(task="thermal_management")
	state = await client.call_tool("get_cluster_state")
	result = await client.call_tool(
	"take_action",
	decisions='{"node_0": "increase_cooling", "node_1": "migrate_load", "node_2": "maintain", "node_3": "maintain"}'
	)

	asyncio.run(main())
	```

	## Quick Start (Sync)

	```python
	from cloud_resource_env import CloudResourceClient

	with CloudResourceClient(base_url="http://localhost:8000").sync() as env:
	env.reset(task="heuristic_fragmentation")
	state = env.call_tool("get_cluster_state")
	result = env.call_tool(
	"take_action",
	decisions='{"node_0": "best_fit", "node_1": "compact", "node_2": "first_fit", "node_3": "best_fit", "node_4": "split_workload"}'
	)
	```

	## GPU Node Types

	\| Node \| GPU \| VRAM \| CPU \| RAM \| Cost/step \| TDP \|
	\|---\|---\|---\|---\|---\|---\|---\|
	\| T4-node \| T4 \| 16 GB \| 4 cores \| 16 GB \| $8 \| 70W \|
	\| A100-node \| A100 \| 40 GB \| 8 cores \| 64 GB \| $30 \| 250W \|
	\| H100-node \| H100 \| 80 GB \| 16 cores \| 128 GB \| $55 \| 350W \|
	\| V100-node \| V100 \| 32 GB \| 8 cores \| 32 GB \| $18 \| 300W \|
	\| L4-node \| L4 \| 24 GB \| 4 cores \| 32 GB \| $12 \| 72W \|

	## Setup & Installation

	```bash
	# Install the environment
	pip install -e .

	# Run the server locally
	uvicorn server.app:app --host 0.0.0.0 --port 8000

	# Or build and run Docker
	docker build -t cloud-resource-env:latest .
	docker run -p 8000:8000 cloud-resource-env:latest

	# Train with PPO
	python train.py --task gpu_cpu_allocation --timesteps 5000
	python train.py --task all # train on all tasks
	```

	## Project Structure

	```
	cloud_resource_env/
	├── __init__.py # Package exports
	├── models.py # CloudAction, CloudObservation
	├── client.py # CloudResourceClient (MCPToolClient)
	├── cloud_env.py # Gymnasium wrapper for RL training
	├── openenv.yaml # OpenEnv manifest
	├── pyproject.toml # Dependencies
	├── Dockerfile # Container image (HF Spaces compatible)
	├── inference.py # LLM inference with task-specific prompts
	├── train.py # PPO training script
	├── README.md # This file
	└── server/
	├── __init__.py # Server exports
	├── app.py # FastAPI application
	└── cloud_environment.py # Core environment logic (3 tasks)
	```