Spaces:
Sleeping
Sleeping
metadata
title: Cloud GPU+CPU Resource Management Environment
emoji: ☁️🔥
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false
app_port: 8000
base_path: /web
tags:
- openenv-0.2.3
- openenv
Hugging Face Space Deployment
This Space runs the Cloud GPU+CPU Resource Management OpenEnv environment.
- OpenEnv pinned ref:
0.2.3 - Hub tag:
openenv - Runs on HF Spaces free tier (2 vCPUs, 16GB RAM — no physical GPU needed)
Connecting from Code
from cloud_resource_env import CloudResourceClient
env = CloudResourceClient(base_url="https://huggingface.co/spaces/<your-username>/cloud_resource_env")
Cloud GPU+CPU Resource Management Environment
A real-world OpenEnv environment that simulates cloud GPU and CPU resource management with three progressively harder tasks covering allocation, thermal management, and heuristic fragmentation.
Environment Overview
| Component | Description |
|---|---|
| Domain | Cloud GPU+CPU infrastructure management |
| State | GPU/CPU utilisation, VRAM, temperature, fragmentation, cost |
| Actions | Task-specific (see below) — 4 actions per task |
| Reward | Multi-objective: utilisation efficiency + thermal safety + fragmentation + cost |
| Score | Normalised cumulative reward ∈ [0.0, 1.0] |
Tasks (3 difficulty levels)
| Task | Difficulty | Nodes | Steps | Focus |
|---|---|---|---|---|
gpu_cpu_allocation |
Easy | 3 | 8 | GPU+CPU allocation with cost optimisation |
thermal_management |
Medium | 4 | 10 | Temperature monitoring, cooling, load migration |
heuristic_fragmentation |
Hard | 5 | 12 | Fragmented GPU placement + defragmentation |
Task 1: GPU+CPU Allocation (gpu_cpu_allocation)
Manage a cluster of mixed GPU nodes (T4, A100, H100, V100, L4) with both GPU and CPU resources. Optimise throughput while staying within budget.
| Action | Effect |
|---|---|
allocate_high |
Capacity × 1.5, Cost × 1.5 |
allocate_low |
Capacity ÷ 1.5, Cost ÷ 1.5 |
maintain |
No change |
migrate |
Move 30% load to other nodes |
Task 2: Thermal Management (thermal_management)
Monitor GPU temperatures and ambient temperature. Prevent thermal throttling by redistributing load or adjusting cooling levels.
| Action | Effect |
|---|---|
increase_cooling |
Cooling level +1 (max 3), reduces GPU temp ~5°C |
decrease_cooling |
Cooling level -1 (min 0), saves energy |
migrate_load |
Move 40% load to coolest node |
maintain |
No change |
Temperature zones:
- 🟢 Safe: 55°C – 75°C
- 🟡 Warning: 75°C – max threshold
- 🔴 Critical: Above max threshold → thermal throttle!
Task 3: Heuristic Fragmentation (heuristic_fragmentation)
Place workloads in a fragmented GPU cluster. Each node has 8 GPU slots; workloads need contiguous blocks (1, 2, 4, or 8 slots).
| Action | Effect |
|---|---|
best_fit |
Place in node with smallest sufficient free block |
first_fit |
Place in first node with enough space |
compact |
Defragment first, then best-fit (10% overhead) |
split_workload |
Split across nodes if needed |
MCP Tools
| Tool | Description |
|---|---|
get_cluster_state() |
Returns metrics for all GPU+CPU nodes |
get_task_info() |
Returns task description, objectives, valid actions |
take_action(decisions) |
Applies decisions, advances timestep, returns reward |
Observation Space (per node)
| Field | Description |
|---|---|
gpu_utilization_pct |
GPU compute utilisation (%) |
cpu_utilization_pct |
CPU utilisation (%) |
gpu_vram_used_gb / gpu_vram_capacity_gb |
VRAM usage |
cpu_usage / cpu_capacity |
CPU cores usage |
memory_usage_gb / memory_capacity_gb |
RAM usage |
gpu_temp_celsius |
Current GPU temperature |
ambient_temp_celsius |
Outside/data center temperature |
cooling_level |
Cooling intensity (0-3) |
thermal_throttle |
Whether GPU is throttling |
fragmentation_score |
How fragmented free GPU slots are (0-1) |
cost_per_step |
Running cost |
power_draw_watts |
Power consumption |
Quick Start (Async)
import asyncio
from cloud_resource_env import CloudResourceClient
async def main():
client = await CloudResourceClient.from_docker_image("cloud-resource-env:latest")
async with client:
# Task 1: GPU+CPU allocation
await client.reset(task="gpu_cpu_allocation")
state = await client.call_tool("get_cluster_state")
result = await client.call_tool(
"take_action",
decisions='{"node_0": "allocate_high", "node_1": "maintain", "node_2": "migrate"}'
)
# Task 2: Thermal management
await client.reset(task="thermal_management")
state = await client.call_tool("get_cluster_state")
result = await client.call_tool(
"take_action",
decisions='{"node_0": "increase_cooling", "node_1": "migrate_load", "node_2": "maintain", "node_3": "maintain"}'
)
asyncio.run(main())
Quick Start (Sync)
from cloud_resource_env import CloudResourceClient
with CloudResourceClient(base_url="http://localhost:8000").sync() as env:
env.reset(task="heuristic_fragmentation")
state = env.call_tool("get_cluster_state")
result = env.call_tool(
"take_action",
decisions='{"node_0": "best_fit", "node_1": "compact", "node_2": "first_fit", "node_3": "best_fit", "node_4": "split_workload"}'
)
GPU Node Types
| Node | GPU | VRAM | CPU | RAM | Cost/step | TDP |
|---|---|---|---|---|---|---|
| T4-node | T4 | 16 GB | 4 cores | 16 GB | $8 | 70W |
| A100-node | A100 | 40 GB | 8 cores | 64 GB | $30 | 250W |
| H100-node | H100 | 80 GB | 16 cores | 128 GB | $55 | 350W |
| V100-node | V100 | 32 GB | 8 cores | 32 GB | $18 | 300W |
| L4-node | L4 | 24 GB | 4 cores | 32 GB | $12 | 72W |
Setup & Installation
# Install the environment
pip install -e .
# Run the server locally
uvicorn server.app:app --host 0.0.0.0 --port 8000
# Or build and run Docker
docker build -t cloud-resource-env:latest .
docker run -p 8000:8000 cloud-resource-env:latest
# Train with PPO
python train.py --task gpu_cpu_allocation --timesteps 5000
python train.py --task all # train on all tasks
Project Structure
cloud_resource_env/
├── __init__.py # Package exports
├── models.py # CloudAction, CloudObservation
├── client.py # CloudResourceClient (MCPToolClient)
├── cloud_env.py # Gymnasium wrapper for RL training
├── openenv.yaml # OpenEnv manifest
├── pyproject.toml # Dependencies
├── Dockerfile # Container image (HF Spaces compatible)
├── inference.py # LLM inference with task-specific prompts
├── train.py # PPO training script
├── README.md # This file
└── server/
├── __init__.py # Server exports
├── app.py # FastAPI application
└── cloud_environment.py # Core environment logic (3 tasks)