---
title: Cloud GPU+CPU Resource Management Environment
emoji: ☁️🔥
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false
app_port: 8000
base_path: /web
tags:
- openenv-0.2.3
- openenv
---
## Hugging Face Space Deployment
This Space runs the **Cloud GPU+CPU Resource Management** OpenEnv environment.
- OpenEnv pinned ref: `0.2.3`
- Hub tag: `openenv`
- **Runs on HF Spaces free tier** (2 vCPUs, 16GB RAM — no physical GPU needed)
### Connecting from Code
```python
from cloud_resource_env import CloudResourceClient
env = CloudResourceClient(base_url="https://huggingface.co/spaces/<your-username>/cloud_resource_env")
```
# Cloud GPU+CPU Resource Management Environment
A real-world OpenEnv environment that simulates **cloud GPU and CPU resource
management** with three progressively harder tasks covering allocation,
thermal management, and heuristic fragmentation.
## Environment Overview
| Component | Description |
|---|---|
| **Domain** | Cloud GPU+CPU infrastructure management |
| **State** | GPU/CPU utilisation, VRAM, temperature, fragmentation, cost |
| **Actions** | Task-specific (see below) — 4 actions per task |
| **Reward** | Multi-objective: utilisation efficiency + thermal safety + fragmentation + cost |
| **Score** | Normalised cumulative reward ∈ [0.0, 1.0] |
## Tasks (3 difficulty levels)
| Task | Difficulty | Nodes | Steps | Focus |
|---|---|---|---|---|
| `gpu_cpu_allocation` | Easy | 3 | 8 | GPU+CPU allocation with cost optimisation |
| `thermal_management` | Medium | 4 | 10 | Temperature monitoring, cooling, load migration |
| `heuristic_fragmentation` | Hard | 5 | 12 | Fragmented GPU placement + defragmentation |
### Task 1: GPU+CPU Allocation (`gpu_cpu_allocation`)
Manage a cluster of mixed GPU nodes (T4, A100, H100, V100, L4) with both GPU
and CPU resources. Optimise throughput while staying within budget.
| Action | Effect |
|---|---|
| `allocate_high` | Capacity × 1.5, Cost × 1.5 |
| `allocate_low` | Capacity ÷ 1.5, Cost ÷ 1.5 |
| `maintain` | No change |
| `migrate` | Move 30% load to other nodes |
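
The action effects in the table above can be sketched as a pure function. This is a minimal illustration, not the environment's actual code: `NodeState`, `apply_allocation_action`, and the three fields are hypothetical names, and how migrated load is shared among the receiving nodes is left open because the table does not say.

```python
from dataclasses import dataclass, replace

SCALE = 1.5           # scaling factor from the action table
MIGRATE_SHARE = 0.30  # fraction of load moved by `migrate`


@dataclass(frozen=True)
class NodeState:
    """Hypothetical per-node state used only for this sketch."""
    capacity: float
    cost: float
    load: float


def apply_allocation_action(node: NodeState, action: str) -> tuple[NodeState, float]:
    """Return the updated node and the amount of load that leaves it."""
    if action == "allocate_high":
        return replace(node, capacity=node.capacity * SCALE, cost=node.cost * SCALE), 0.0
    if action == "allocate_low":
        return replace(node, capacity=node.capacity / SCALE, cost=node.cost / SCALE), 0.0
    if action == "migrate":
        moved = node.load * MIGRATE_SHARE
        return replace(node, load=node.load - moved), moved
    if action == "maintain":
        return node, 0.0
    raise ValueError(f"unknown action: {action!r}")
```

For example, applying `allocate_high` to a node with capacity 100 and cost 30 yields capacity 150 and cost 45, so throughput headroom and spend rise together.
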
### Task 2: Thermal Management (`thermal_management`)
Monitor GPU temperatures and ambient temperature. Prevent thermal throttling
by redistributing load or adjusting cooling levels.
| Action | Effect |
|---|---|
| `increase_cooling` | Cooling level +1 (max 3); reduces GPU temperature by ~5°C |
| `decrease_cooling` | Cooling level -1 (min 0), saves energy |
| `migrate_load` | Move 40% load to coolest node |
| `maintain` | No change |
**Temperature zones:**
- 🟢 Safe: 55°C – 75°C
- 🟡 Warning: 75°C – max threshold
- 🔴 Critical: Above max threshold → thermal throttle!
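
The zones and the cooling-level bounds can be expressed in a few lines. This is a sketch under stated assumptions: the function names are hypothetical, the max threshold is passed in because its value is not given here, a reading of exactly 75°C is treated as Warning, and anything below 75°C (including readings under 55°C) is treated as Safe.

```python
WARNING_START_C = 75.0  # lower edge of the Warning zone, from the list above
MIN_COOLING, MAX_COOLING = 0, 3


def classify_temperature(temp_c: float, max_threshold_c: float) -> str:
    """Map a GPU temperature to 'safe', 'warning', or 'critical'."""
    if temp_c > max_threshold_c:
        return "critical"  # thermal throttling territory
    if temp_c >= WARNING_START_C:
        return "warning"
    return "safe"


def adjust_cooling(level: int, action: str) -> int:
    """Apply a cooling action and clamp the level to the 0-3 range."""
    if action == "increase_cooling":
        return min(level + 1, MAX_COOLING)
    if action == "decrease_cooling":
        return max(level - 1, MIN_COOLING)
    return level  # `maintain` and `migrate_load` leave cooling unchanged
```

A simple policy built on this would call `increase_cooling` for nodes in the Warning zone and reserve `migrate_load` for nodes that are already Critical.
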
### Task 3: Heuristic Fragmentation (`heuristic_fragmentation`)
Place workloads in a fragmented GPU cluster. Each node has 8 GPU slots;
workloads need contiguous blocks (1, 2, 4, or 8 slots).
| Action | Effect |
|---|---|
| `best_fit` | Place in node with smallest sufficient free block |
| `first_fit` | Place in first node with enough space |
| `compact` | Defragment first, then best-fit (10% overhead) |
| `split_workload` | Split across nodes if needed |
## MCP Tools
| Tool | Description |
|---|---|
| `get_cluster_state()` | Returns metrics for all GPU+CPU nodes |
| `get_task_info()` | Returns task description, objectives, valid actions |
| `take_action(decisions)` | Applies decisions, advances timestep, returns reward |
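
The Quick Start examples pass `decisions` to `take_action` as a JSON string mapping node names to actions. A small helper can build that string and catch actions that do not belong to the current task before they reach the server. `build_decisions` and `VALID_ACTIONS` are hypothetical names for this sketch; the action sets are copied from the task tables above.

```python
import json

# Valid actions per task, taken from the task tables in this README.
VALID_ACTIONS = {
    "gpu_cpu_allocation": {"allocate_high", "allocate_low", "maintain", "migrate"},
    "thermal_management": {"increase_cooling", "decrease_cooling", "migrate_load", "maintain"},
    "heuristic_fragmentation": {"best_fit", "first_fit", "compact", "split_workload"},
}


def build_decisions(task: str, decisions: dict[str, str]) -> str:
    """Validate per-node actions for `task` and return them as a JSON string."""
    allowed = VALID_ACTIONS[task]
    invalid = {node: action for node, action in decisions.items() if action not in allowed}
    if invalid:
        raise ValueError(f"invalid actions for {task}: {invalid}")
    return json.dumps(decisions)
```

The returned string can then be passed as the `decisions` argument, for example `decisions=build_decisions("thermal_management", {"node_0": "increase_cooling"})`.
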
## Observation Space (per node)
| Field | Description |
|---|---|
| `gpu_utilization_pct` | GPU compute utilisation (%) |
| `cpu_utilization_pct` | CPU utilisation (%) |
| `gpu_vram_used_gb` / `gpu_vram_capacity_gb` | VRAM used / total (GB) |
| `cpu_usage` / `cpu_capacity` | CPU cores used / total |
| `memory_usage_gb` / `memory_capacity_gb` | RAM used / total (GB) |
| `gpu_temp_celsius` | Current GPU temperature (°C) |
| `ambient_temp_celsius` | Ambient (data-centre) temperature (°C) |
| `cooling_level` | Cooling intensity (0-3) |
| `thermal_throttle` | Whether GPU is throttling |
| `fragmentation_score` | How fragmented free GPU slots are (0-1) |
| `cost_per_step` | Running cost |
| `power_draw_watts` | Power consumption |
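
As a rough illustration of how an agent might consume these fields, the sketch below derives a few ratios from one node's observation. It assumes the observation arrives as a plain dictionary keyed by the field names above, which this README does not confirm; `summarise_node` and the 0.5 fragmentation cut-off are hypothetical.

```python
FRAGMENTATION_ALERT = 0.5  # hypothetical cut-off, not defined by the environment


def summarise_node(obs: dict) -> dict:
    """Reduce one node's observation to utilisation ratios and an alert flag."""
    return {
        "vram_fraction": obs["gpu_vram_used_gb"] / obs["gpu_vram_capacity_gb"],
        "cpu_fraction": obs["cpu_usage"] / obs["cpu_capacity"],
        "memory_fraction": obs["memory_usage_gb"] / obs["memory_capacity_gb"],
        # Flag nodes that are throttling or whose free slots are badly fragmented.
        "needs_attention": bool(
            obs["thermal_throttle"] or obs["fragmentation_score"] > FRAGMENTATION_ALERT
        ),
    }
```
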
## Quick Start (Async)
```python
import asyncio

from cloud_resource_env import CloudResourceClient


async def main():
    client = await CloudResourceClient.from_docker_image("cloud-resource-env:latest")
    async with client:
        # Task 1: GPU+CPU allocation (3 nodes)
        await client.reset(task="gpu_cpu_allocation")
        state = await client.call_tool("get_cluster_state")
        result = await client.call_tool(
            "take_action",
            decisions='{"node_0": "allocate_high", "node_1": "maintain", "node_2": "migrate"}',
        )

        # Task 2: Thermal management (4 nodes)
        await client.reset(task="thermal_management")
        state = await client.call_tool("get_cluster_state")
        result = await client.call_tool(
            "take_action",
            decisions='{"node_0": "increase_cooling", "node_1": "migrate_load", "node_2": "maintain", "node_3": "maintain"}',
        )


asyncio.run(main())
```
## Quick Start (Sync)
```python
from cloud_resource_env import CloudResourceClient

with CloudResourceClient(base_url="http://localhost:8000").sync() as env:
    # Task 3: Heuristic fragmentation (5 nodes)
    env.reset(task="heuristic_fragmentation")
    state = env.call_tool("get_cluster_state")
    result = env.call_tool(
        "take_action",
        decisions='{"node_0": "best_fit", "node_1": "compact", "node_2": "first_fit", "node_3": "best_fit", "node_4": "split_workload"}',
    )
```
## GPU Node Types
| Node | GPU | VRAM | CPU | RAM | Cost/step | TDP |
|---|---|---|---|---|---|---|
| T4-node | T4 | 16 GB | 4 cores | 16 GB | $8 | 70W |
| A100-node | A100 | 40 GB | 8 cores | 64 GB | $30 | 250W |
| H100-node | H100 | 80 GB | 16 cores | 128 GB | $55 | 350W |
| V100-node | V100 | 32 GB | 8 cores | 32 GB | $18 | 300W |
| L4-node | L4 | 24 GB | 4 cores | 32 GB | $12 | 72W |
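
The table can be mirrored as a lookup for quick budgeting. This sketch only restates the figures above; `NODE_TYPES`, `cluster_cost`, and `cost_per_vram_gb` are hypothetical names, and it assumes every node is billed its listed cost on every step, ignoring the effect of `allocate_high` / `allocate_low`.

```python
# Node specifications copied from the table above.
NODE_TYPES = {
    "T4-node":   {"gpu": "T4",   "vram_gb": 16, "cpu_cores": 4,  "ram_gb": 16,  "cost_per_step": 8,  "tdp_watts": 70},
    "A100-node": {"gpu": "A100", "vram_gb": 40, "cpu_cores": 8,  "ram_gb": 64,  "cost_per_step": 30, "tdp_watts": 250},
    "H100-node": {"gpu": "H100", "vram_gb": 80, "cpu_cores": 16, "ram_gb": 128, "cost_per_step": 55, "tdp_watts": 350},
    "V100-node": {"gpu": "V100", "vram_gb": 32, "cpu_cores": 8,  "ram_gb": 32,  "cost_per_step": 18, "tdp_watts": 300},
    "L4-node":   {"gpu": "L4",   "vram_gb": 24, "cpu_cores": 4,  "ram_gb": 32,  "cost_per_step": 12, "tdp_watts": 72},
}


def cluster_cost(node_names: list[str], steps: int) -> int:
    """Baseline cost of running the given nodes for `steps` steps."""
    return steps * sum(NODE_TYPES[name]["cost_per_step"] for name in node_names)


def cost_per_vram_gb(node_name: str) -> float:
    """Cost per step divided by VRAM capacity, a rough value-for-money figure."""
    spec = NODE_TYPES[node_name]
    return spec["cost_per_step"] / spec["vram_gb"]
```

For instance, a three-node cluster of one T4, one A100, and one H100 costs $93 per step at baseline, or $744 over the 8 steps of `gpu_cpu_allocation`.
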
## Setup & Installation
```bash
# Install the environment
pip install -e .
# Run the server locally
uvicorn server.app:app --host 0.0.0.0 --port 8000
# Or build and run Docker
docker build -t cloud-resource-env:latest .
docker run -p 8000:8000 cloud-resource-env:latest
# Train with PPO
python train.py --task gpu_cpu_allocation --timesteps 5000
python train.py --task all # train on all tasks
```
## Project Structure
```
cloud_resource_env/
├── __init__.py              # Package exports
├── models.py                # CloudAction, CloudObservation
├── client.py                # CloudResourceClient (MCPToolClient)
├── cloud_env.py             # Gymnasium wrapper for RL training
├── openenv.yaml             # OpenEnv manifest
├── pyproject.toml           # Dependencies
├── Dockerfile               # Container image (HF Spaces compatible)
├── inference.py             # LLM inference with task-specific prompts
├── train.py                 # PPO training script
├── README.md                # This file
└── server/
    ├── __init__.py          # Server exports
    ├── app.py               # FastAPI application
    └── cloud_environment.py # Core environment logic (3 tasks)
```