
title: Cloud GPU+CPU Resource Management Environment
emoji: ☁️🔥
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false
app_port: 8000
base_path: /web
tags:
  - openenv-0.2.3
  - openenv

Hugging Face Space Deployment

This Space runs the Cloud GPU+CPU Resource Management OpenEnv environment.

OpenEnv pinned ref: 0.2.3
Hub tag: openenv
Runs on the HF Spaces free tier (2 vCPUs, 16 GB RAM); the cluster is simulated, so no physical GPU is needed

Connecting from Code

from cloud_resource_env import CloudResourceClient

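# Point the client at the Space's direct app URL (https://<owner>-<space-name>.hf.space).
# The huggingface.co/spaces/... address serves the Hub web page rather than the API.
env = CloudResourceClient(base_url="https://<your-username>-cloud-resource-env.hf.space")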

Cloud GPU+CPU Resource Management Environment

An OpenEnv environment that simulates real-world cloud GPU and CPU resource management through three progressively harder tasks: allocation, thermal management, and heuristic fragmentation.

Environment Overview

Component	Description
Domain	Cloud GPU+CPU infrastructure management
State	GPU/CPU utilisation, VRAM, temperature, fragmentation, cost
Actions	Task-specific (see below) — 4 actions per task
Reward	Multi-objective: utilisation efficiency + thermal safety + fragmentation + cost
Score	Normalised cumulative reward ∈ [0.0, 1.0]
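The table names the reward objectives and the score range but not the formula. As a rough illustration only, the sketch below combines four per-objective scores with hypothetical weights and normalises the cumulative reward into [0.0, 1.0]. The names `WEIGHTS`, `step_reward`, and `normalised_score` are assumptions made for this example; the actual logic lives in `server/cloud_environment.py` and may differ.

```python
# Hypothetical weights: the README lists the objectives but not their weighting.
WEIGHTS = {"utilisation": 0.4, "thermal": 0.3, "fragmentation": 0.2, "cost": 0.1}


def step_reward(components: dict[str, float]) -> float:
    """Weighted sum of per-objective scores, each expected in [0, 1]."""
    return sum(WEIGHTS[name] * components[name] for name in WEIGHTS)


def normalised_score(step_rewards: list[float], max_step_reward: float = 1.0) -> float:
    """Cumulative reward divided by the best achievable total, clipped to [0, 1]."""
    if not step_rewards:
        return 0.0
    raw = sum(step_rewards) / (max_step_reward * len(step_rewards))
    return min(1.0, max(0.0, raw))
```

Under these assumptions, an episode with step rewards of 0.5 and 1.0 would score 0.75.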

Tasks (3 difficulty levels)

Task	Difficulty	Nodes	Steps	Focus
`gpu_cpu_allocation`	Easy	3	8	GPU+CPU allocation with cost optimisation
`thermal_management`	Medium	4	10	Temperature monitoring, cooling, load migration
`heuristic_fragmentation`	Hard	5	12	Fragmented GPU placement + defragmentation

Task 1: GPU+CPU Allocation (`gpu_cpu_allocation`)

Manage a cluster of mixed GPU nodes (T4, A100, H100, V100, L4) with both GPU and CPU resources. Optimise throughput while staying within budget.

Action	Effect
`allocate_high`	Capacity × 1.5, Cost × 1.5
`allocate_low`	Capacity ÷ 1.5, Cost ÷ 1.5
`maintain`	No change
`migrate`	Move 30% load to other nodes
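To make the action effects above concrete, here is a minimal sketch of how one decision per node could be applied. The `Node` class, the `apply_allocation` function, and the even split of migrated load across the remaining nodes are assumptions made for illustration; only the 1.5 factor and the 30% figure come from the table.

```python
from dataclasses import dataclass

SCALE = 1.5              # allocate_high / allocate_low factor from the table
MIGRATE_FRACTION = 0.30  # share of load moved by `migrate`


@dataclass
class Node:
    """Hypothetical node model holding only what the four actions touch."""
    capacity: float
    cost: float
    load: float


def apply_allocation(nodes: dict[str, Node], decisions: dict[str, str]) -> None:
    """Apply one Task 1 decision per node, mutating `nodes` in place."""
    for name, action in decisions.items():
        node = nodes[name]
        if action == "allocate_high":
            node.capacity *= SCALE
            node.cost *= SCALE
        elif action == "allocate_low":
            node.capacity /= SCALE
            node.cost /= SCALE
        elif action == "migrate":
            others = [n for key, n in nodes.items() if key != name]
            if others:
                moved = node.load * MIGRATE_FRACTION
                node.load -= moved
                for other in others:  # assumption: even split across the rest
                    other.load += moved / len(others)
        elif action != "maintain":
            raise ValueError(f"unknown action: {action}")
```

With three nodes, `migrate` on a node carrying a load of 10 would leave it at 7 and add 1.5 to each of the other two.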

Task 2: Thermal Management (`thermal_management`)

Monitor GPU temperatures and ambient temperature. Prevent thermal throttling by redistributing load or adjusting cooling levels.

Action	Effect
`increase_cooling`	Cooling level +1 (max 3), reduces GPU temp ~5°C
`decrease_cooling`	Cooling level -1 (min 0), saves energy
`migrate_load`	Move 40% load to coolest node
`maintain`	No change
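The thermal actions can be sketched in the same way. This example assumes each node is a plain dict with `cooling_level`, `gpu_temp_celsius`, and a hypothetical `load` field, and that the temperature only drops when the cooling level actually rises. Neither assumption is confirmed by the README.

```python
MIN_COOLING, MAX_COOLING = 0, 3
COOLING_DROP_C = 5.0     # approximate drop per cooling step, per the table
MIGRATE_FRACTION = 0.40  # share of load moved by `migrate_load`


def apply_thermal_action(nodes: dict[str, dict], name: str, action: str) -> None:
    """Apply one Task 2 action to `nodes[name]`, mutating `nodes` in place."""
    node = nodes[name]
    if action == "increase_cooling":
        if node["cooling_level"] < MAX_COOLING:
            node["cooling_level"] += 1
            node["gpu_temp_celsius"] -= COOLING_DROP_C
    elif action == "decrease_cooling":
        node["cooling_level"] = max(MIN_COOLING, node["cooling_level"] - 1)
    elif action == "migrate_load":
        others = [key for key in nodes if key != name]
        if others:
            # Pick the node with the lowest current GPU temperature.
            coolest = min(others, key=lambda key: nodes[key]["gpu_temp_celsius"])
            moved = node["load"] * MIGRATE_FRACTION
            node["load"] -= moved
            nodes[coolest]["load"] += moved
    elif action != "maintain":
        raise ValueError(f"unknown action: {action}")
```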

Temperature zones:

🟢 Safe: 55°C – 75°C
🟡 Warning: 75°C – max threshold
🔴 Critical: Above max threshold → thermal throttle!
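The zones above map onto a small classifier. The README does not give the value of the max threshold, so this sketch takes it as a parameter. It also makes two assumptions: exactly 75°C counts as Warning, and readings below 55°C count as Safe.

```python
WARNING_START_C = 75.0  # lower edge of the Warning zone, from the list above


def temperature_zone(gpu_temp_celsius: float, max_threshold_celsius: float) -> str:
    """Classify a GPU temperature as 'safe', 'warning', or 'critical'."""
    if gpu_temp_celsius > max_threshold_celsius:
        return "critical"  # above the max threshold: thermal throttle
    if gpu_temp_celsius >= WARNING_START_C:
        return "warning"
    return "safe"
```

For example, with a hypothetical max threshold of 85°C, a reading of 80°C falls in the Warning zone and 86°C is Critical.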

Task 3: Heuristic Fragmentation (`heuristic_fragmentation`)

Place workloads in a fragmented GPU cluster. Each node has 8 GPU slots; workloads need contiguous blocks (1, 2, 4, or 8 slots).

Action	Effect
`best_fit`	Place in node with smallest sufficient free block
`first_fit`	Place in first node with enough space
`compact`	Defragment first, then best-fit (10% overhead)
`split_workload`	Split across nodes if needed
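The difference between `first_fit` and `best_fit` is easiest to see on a small cluster. The sketch below models each node as a list of 8 booleans (`True` means the slot is occupied) and reads "enough space" as a contiguous free run of the requested size. The function names and the data layout are assumptions for illustration, not the environment's actual implementation.

```python
def free_runs(slots: list[bool]) -> list[tuple[int, int]]:
    """Return (start, length) for each maximal run of free (False) slots."""
    runs, start = [], None
    for i, used in enumerate(slots + [True]):  # sentinel closes a trailing run
        if not used and start is None:
            start = i
        elif used and start is not None:
            runs.append((start, i - start))
            start = None
    return runs


def first_fit(cluster: list[list[bool]], size: int):
    """Return (node_index, start_slot) of the first free run that fits, else None."""
    for node_idx, slots in enumerate(cluster):
        for start, length in free_runs(slots):
            if length >= size:
                return node_idx, start
    return None


def best_fit(cluster: list[list[bool]], size: int):
    """Return (node_index, start_slot) of the smallest free run that fits, else None."""
    candidates = [
        (length, node_idx, start)
        for node_idx, slots in enumerate(cluster)
        for start, length in free_runs(slots)
        if length >= size
    ]
    if not candidates:
        return None
    _, node_idx, start = min(candidates)  # ties go to the lower node index
    return node_idx, start
```

If node 0 has a free run of 4 slots and node 1 a free run of 2, a 2-slot workload goes to node 0 under `first_fit` but to node 1 under `best_fit`, which keeps the 4-slot run available for a larger workload.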

MCP Tools

Tool	Description
`get_cluster_state()`	Returns metrics for all GPU+CPU nodes
`get_task_info()`	Returns task description, objectives, valid actions
`take_action(decisions)`	Applies decisions, advances timestep, returns reward
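In the Quick Start examples, `take_action` receives its `decisions` argument as a JSON string keyed by `node_0`, `node_1`, and so on. A small helper can build that string and catch invalid actions before a request is sent. This is a sketch: `build_decisions` is a hypothetical helper, and the action sets and node counts are copied from the task tables in this README.

```python
import json

VALID_ACTIONS = {
    "gpu_cpu_allocation": {"allocate_high", "allocate_low", "maintain", "migrate"},
    "thermal_management": {"increase_cooling", "decrease_cooling", "migrate_load", "maintain"},
    "heuristic_fragmentation": {"best_fit", "first_fit", "compact", "split_workload"},
}
NODE_COUNTS = {"gpu_cpu_allocation": 3, "thermal_management": 4, "heuristic_fragmentation": 5}


def build_decisions(task: str, actions: list[str]) -> str:
    """Validate one action per node and return the JSON string for `take_action`."""
    if task not in VALID_ACTIONS:
        raise ValueError(f"unknown task: {task}")
    if len(actions) != NODE_COUNTS[task]:
        raise ValueError(f"{task} expects {NODE_COUNTS[task]} actions, got {len(actions)}")
    for action in actions:
        if action not in VALID_ACTIONS[task]:
            raise ValueError(f"{action!r} is not valid for {task}")
    return json.dumps({f"node_{i}": action for i, action in enumerate(actions)})
```

The returned string can be passed directly as `decisions=` to `call_tool("take_action", ...)`.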

Observation Space (per node)

Field	Description
`gpu_utilization_pct`	GPU compute utilisation (%)
`cpu_utilization_pct`	CPU utilisation (%)
`gpu_vram_used_gb` / `gpu_vram_capacity_gb`	VRAM used / total VRAM (GB)
`cpu_usage` / `cpu_capacity`	CPU cores in use / total CPU cores
`memory_usage_gb` / `memory_capacity_gb`	RAM used / total RAM (GB)
`gpu_temp_celsius`	Current GPU temperature (°C)
`ambient_temp_celsius`	Ambient (data center) temperature (°C)
`cooling_level`	Cooling intensity (0-3)
`thermal_throttle`	Whether GPU is throttling
`fragmentation_score`	How fragmented free GPU slots are (0-1)
`cost_per_step`	Running cost
`power_draw_watts`	Power consumption
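These fields are enough to drive a simple rule-based baseline. The sketch below picks a Task 2 action per node from `gpu_temp_celsius`, `cooling_level`, and `thermal_throttle`. It assumes the cluster state is available as a mapping from node name to a dict of the fields above. The README does not document the exact return shape of `get_cluster_state()`, and the 75°C and 60°C cut-offs are illustrative choices.

```python
import json


def thermal_policy(cluster_state: dict[str, dict],
                   warning_celsius: float = 75.0,
                   cool_celsius: float = 60.0) -> str:
    """Choose one thermal action per node and return the decisions as JSON."""
    decisions = {}
    for name, obs in cluster_state.items():
        hot = obs["thermal_throttle"] or obs["gpu_temp_celsius"] >= warning_celsius
        if hot and obs["cooling_level"] < 3:
            decisions[name] = "increase_cooling"  # cooling headroom left
        elif hot:
            decisions[name] = "migrate_load"      # cooling already at maximum
        elif obs["cooling_level"] > 0 and obs["gpu_temp_celsius"] < cool_celsius:
            decisions[name] = "decrease_cooling"  # save energy on cool nodes
        else:
            decisions[name] = "maintain"
    return json.dumps(decisions)
```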

Quick Start (Async)

import asyncio
from cloud_resource_env import CloudResourceClient

async def main():
    client = await CloudResourceClient.from_docker_image("cloud-resource-env:latest")
    async with client:
        # Task 1: GPU+CPU allocation
        await client.reset(task="gpu_cpu_allocation")
        state = await client.call_tool("get_cluster_state")
        result = await client.call_tool(
            "take_action",
            decisions='{"node_0": "allocate_high", "node_1": "maintain", "node_2": "migrate"}'
        )

        # Task 2: Thermal management
        await client.reset(task="thermal_management")
        state = await client.call_tool("get_cluster_state")
        result = await client.call_tool(
            "take_action",
            decisions='{"node_0": "increase_cooling", "node_1": "migrate_load", "node_2": "maintain", "node_3": "maintain"}'
        )

asyncio.run(main())

Quick Start (Sync)

from cloud_resource_env import CloudResourceClient

# Assumes the server is already running locally on port 8000 (see Setup & Installation).
with CloudResourceClient(base_url="http://localhost:8000").sync() as env:
    env.reset(task="heuristic_fragmentation")
    state = env.call_tool("get_cluster_state")
    result = env.call_tool(
        "take_action",
        decisions='{"node_0": "best_fit", "node_1": "compact", "node_2": "first_fit", "node_3": "best_fit", "node_4": "split_workload"}'
    )

GPU Node Types

Node	GPU	VRAM	CPU	RAM	Cost/step	TDP
T4-node	T4	16 GB	4 cores	16 GB	$8	70W
A100-node	A100	40 GB	8 cores	64 GB	$30	250W
H100-node	H100	80 GB	16 cores	128 GB	$55	350W
V100-node	V100	32 GB	8 cores	32 GB	$18	300W
L4-node	L4	24 GB	4 cores	32 GB	$12	72W
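The table supports some quick cost arithmetic. The sketch below copies the VRAM and cost columns into a dict and computes a baseline episode cost, assuming every node runs at its listed cost for every step with no `allocate_high` or `allocate_low` scaling. The helper names and the choice of nodes in the usage note are hypothetical.

```python
# VRAM (GB) and cost per step ($), copied from the table above.
NODE_TYPES = {
    "T4-node": {"vram_gb": 16, "cost_per_step": 8},
    "A100-node": {"vram_gb": 40, "cost_per_step": 30},
    "H100-node": {"vram_gb": 80, "cost_per_step": 55},
    "V100-node": {"vram_gb": 32, "cost_per_step": 18},
    "L4-node": {"vram_gb": 24, "cost_per_step": 12},
}


def episode_cost(node_names: list[str], steps: int) -> int:
    """Baseline cost of running the given nodes for `steps` steps, unscaled."""
    return steps * sum(NODE_TYPES[name]["cost_per_step"] for name in node_names)


def cost_per_vram_gb(node_name: str) -> float:
    """Cost per step divided by VRAM capacity, a rough value-for-money metric."""
    spec = NODE_TYPES[node_name]
    return spec["cost_per_step"] / spec["vram_gb"]
```

For instance, a hypothetical 3-node cluster of one T4, one A100, and one H100 over the 8 steps of Task 1 costs (8 + 30 + 55) × 8 = $744 at baseline.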

Setup & Installation

# Install the environment
pip install -e .

# Run the server locally
uvicorn server.app:app --host 0.0.0.0 --port 8000

# Or build and run Docker
docker build -t cloud-resource-env:latest .
docker run -p 8000:8000 cloud-resource-env:latest

# Train with PPO
python train.py --task gpu_cpu_allocation --timesteps 5000
python train.py --task all  # train on all tasks

Project Structure

cloud_resource_env/
├── __init__.py            # Package exports
├── models.py              # CloudAction, CloudObservation
├── client.py              # CloudResourceClient (MCPToolClient)
├── cloud_env.py           # Gymnasium wrapper for RL training
├── openenv.yaml           # OpenEnv manifest
├── pyproject.toml         # Dependencies
├── Dockerfile             # Container image (HF Spaces compatible)
├── inference.py           # LLM inference with task-specific prompts
├── train.py               # PPO training script
├── README.md              # This file
└── server/
    ├── __init__.py        # Server exports
    ├── app.py             # FastAPI application
    └── cloud_environment.py  # Core environment logic (3 tasks)