title: Cloud GPU+CPU Resource Management Environment
emoji: ☁️🔥
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false
app_port: 8000
base_path: /web
tags:
  - openenv-0.2.3
  - openenv

Hugging Face Space Deployment

This Space runs the Cloud GPU+CPU Resource Management OpenEnv environment.

  • OpenEnv pinned ref: 0.2.3
  • Hub tag: openenv
  • Runs on the HF Spaces free tier (2 vCPUs, 16 GB RAM; no physical GPU needed)

Connecting from Code

from cloud_resource_env import CloudResourceClient

env = CloudResourceClient(base_url="https://huggingface.co/spaces/<your-username>/cloud_resource_env")

Cloud GPU+CPU Resource Management Environment

An OpenEnv environment that simulates real-world cloud GPU and CPU resource management. It provides three progressively harder tasks: allocation, thermal management, and heuristic fragmentation.

Environment Overview

| Component | Description |
|---|---|
| Domain | Cloud GPU+CPU infrastructure management |
| State | GPU/CPU utilisation, VRAM, temperature, fragmentation, cost |
| Actions | Task-specific (see below); 4 actions per task |
| Reward | Multi-objective: utilisation efficiency + thermal safety + fragmentation + cost |
| Score | Normalised cumulative reward ∈ [0.0, 1.0] |
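
The Reward and Score rows can be read as a weighted sum per step that is then averaged over the episode. The sketch below is only an assumption about how such a score might be computed, not the environment's actual formula: the equal weights, the helper names `step_reward` and `normalised_score`, and the idea that each component score lies in [0, 1] are all hypothetical.

```python
from typing import Mapping, Sequence

# Hypothetical equal weights; the environment's real weighting is not documented here.
WEIGHTS = {"utilisation": 0.25, "thermal": 0.25, "fragmentation": 0.25, "cost": 0.25}


def _clip01(value: float) -> float:
    """Clamp a value into the closed interval [0, 1]."""
    return min(1.0, max(0.0, value))


def step_reward(components: Mapping[str, float]) -> float:
    """Weighted sum of per-objective scores, each clamped to [0, 1] first."""
    return sum(weight * _clip01(components[name]) for name, weight in WEIGHTS.items())


def normalised_score(step_rewards: Sequence[float], max_reward_per_step: float = 1.0) -> float:
    """Cumulative reward divided by the assumed maximum, clamped to [0, 1]."""
    if not step_rewards:
        return 0.0
    total_possible = len(step_rewards) * max_reward_per_step
    return _clip01(sum(step_rewards) / total_possible)
```

Clamping at both stages keeps the final score inside [0.0, 1.0] even if a component or a step reward falls outside its expected range.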

Tasks (3 difficulty levels)

| Task | Difficulty | Nodes | Steps | Focus |
|---|---|---|---|---|
| gpu_cpu_allocation | Easy | 3 | 8 | GPU+CPU allocation with cost optimisation |
| thermal_management | Medium | 4 | 10 | Temperature monitoring, cooling, load migration |
| heuristic_fragmentation | Hard | 5 | 12 | Fragmented GPU placement + defragmentation |

Task 1: GPU+CPU Allocation (gpu_cpu_allocation)

Manage a cluster of mixed GPU nodes (T4, A100, H100, V100, L4) with both GPU and CPU resources. Optimise throughput while staying within budget.

| Action | Effect |
|---|---|
| allocate_high | Capacity × 1.5, Cost × 1.5 |
| allocate_low | Capacity ÷ 1.5, Cost ÷ 1.5 |
| maintain | No change |
| migrate | Move 30% load to other nodes |
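
The effects in the table above can be sketched as a small pure function. This is an illustration under stated assumptions, not the environment's code: the node is modelled as a plain dict with hypothetical `capacity`, `cost`, and `load` keys, and `apply_allocation_action` is an invented helper name.

```python
def apply_allocation_action(node: dict, action: str, migrate_fraction: float = 0.3) -> tuple[dict, float]:
    """Return (updated_node, load_moved_off_node) for one Task 1 action.

    `node` is a hypothetical dict with 'capacity', 'cost' and 'load' keys.
    """
    updated = dict(node)  # copy so the caller's dict is left untouched
    moved = 0.0
    if action == "allocate_high":
        updated["capacity"] = node["capacity"] * 1.5
        updated["cost"] = node["cost"] * 1.5
    elif action == "allocate_low":
        updated["capacity"] = node["capacity"] / 1.5
        updated["cost"] = node["cost"] / 1.5
    elif action == "migrate":
        # 30% of this node's load leaves it; distributing it is up to the caller.
        moved = node["load"] * migrate_fraction
        updated["load"] = node["load"] - moved
    elif action != "maintain":
        raise ValueError(f"unknown action: {action!r}")
    return updated, moved
```

Note that `allocate_high` followed by `allocate_low` returns a node to its starting capacity and cost, since the two scale by reciprocal factors.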

Task 2: Thermal Management (thermal_management)

Monitor GPU temperatures and ambient temperature. Prevent thermal throttling by redistributing load or adjusting cooling levels.

| Action | Effect |
|---|---|
| increase_cooling | Cooling level +1 (max 3), reduces GPU temp ~5°C |
| decrease_cooling | Cooling level -1 (min 0), saves energy |
| migrate_load | Move 40% load to coolest node |
| maintain | No change |
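
A minimal sketch of the cooling and migration effects listed above. Several details are assumptions rather than documented behaviour: that lowering the cooling level raises the temperature by the same ~5°C, that a clamped level leaves the temperature unchanged, and that nodes are dicts with a hypothetical `load` key next to `gpu_temp_celsius`. The helper names are invented for illustration.

```python
MIN_COOLING, MAX_COOLING = 0, 3
COOLING_DELTA_C = 5.0  # approximate effect of one cooling level, per the table


def apply_cooling(level: int, gpu_temp_c: float, action: str) -> tuple[int, float]:
    """Return (new_level, new_temp) after increase_cooling or decrease_cooling."""
    if action == "increase_cooling":
        new_level = min(level + 1, MAX_COOLING)
    elif action == "decrease_cooling":
        new_level = max(level - 1, MIN_COOLING)
    else:
        raise ValueError(f"not a cooling action: {action!r}")
    # Assumption: temperature shifts only if the level actually changed.
    new_temp = gpu_temp_c - COOLING_DELTA_C * (new_level - level)
    return new_level, new_temp


def migrate_load(nodes: dict[str, dict], source: str, fraction: float = 0.4) -> str:
    """Move `fraction` of the source node's load to the coolest other node.

    Mutates `nodes` in place and returns the name of the target node.
    """
    others = [name for name in nodes if name != source]
    target = min(others, key=lambda name: nodes[name]["gpu_temp_celsius"])
    moved = nodes[source]["load"] * fraction
    nodes[source]["load"] -= moved
    nodes[target]["load"] += moved
    return target
```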

Temperature zones:

  • 🟢 Safe: 55°C – 75°C
  • 🟡 Warning: 75°C – max threshold
  • 🔴 Critical: Above max threshold → thermal throttle!
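
The zones above map onto a simple classifier. The default `max_threshold_c` of 85.0 is a placeholder, since the real threshold is not stated here; treating exactly 75°C as "warning" and temperatures below 55°C as "safe" are likewise assumptions.

```python
SAFE_UPPER_C = 75.0  # upper edge of the safe zone, per the list above


def temperature_zone(gpu_temp_c: float, max_threshold_c: float = 85.0) -> str:
    """Classify a GPU temperature as 'safe', 'warning' or 'critical'.

    `max_threshold_c` is a hypothetical default; pass the node's real limit.
    """
    if gpu_temp_c > max_threshold_c:
        return "critical"  # above the max threshold: thermal throttling
    if gpu_temp_c >= SAFE_UPPER_C:
        return "warning"
    return "safe"
```

An agent could call this on `gpu_temp_celsius` for each node and choose `increase_cooling` or `migrate_load` for anything outside the safe zone.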

Task 3: Heuristic Fragmentation (heuristic_fragmentation)

Place workloads in a fragmented GPU cluster. Each node has 8 GPU slots; workloads need contiguous blocks (1, 2, 4, or 8 slots).

| Action | Effect |
|---|---|
| best_fit | Place in node with smallest sufficient free block |
| first_fit | Place in first node with enough space |
| compact | Defragment first, then best-fit (10% overhead) |
| split_workload | Split across nodes if needed |
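
To make the difference between `first_fit` and `best_fit` concrete, here is a sketch of both heuristics over contiguous slot blocks. The representation is an assumption: each node is a list of 8 booleans where `True` means the slot is occupied, and the function names are illustrative rather than the environment's internals.

```python
from typing import Optional


def free_blocks(slots: list[bool]) -> list[int]:
    """Lengths of contiguous free runs; slots[i] is True when occupied."""
    blocks, run = [], 0
    for occupied in slots:
        if occupied:
            if run:
                blocks.append(run)
            run = 0
        else:
            run += 1
    if run:
        blocks.append(run)
    return blocks


def first_fit(cluster: list[list[bool]], size: int) -> Optional[int]:
    """Index of the first node with a free block of at least `size` slots."""
    for index, slots in enumerate(cluster):
        if any(block >= size for block in free_blocks(slots)):
            return index
    return None


def best_fit(cluster: list[list[bool]], size: int) -> Optional[int]:
    """Index of the node whose smallest sufficient free block is tightest."""
    best: Optional[tuple[int, int]] = None  # (block_length, node_index)
    for index, slots in enumerate(cluster):
        fitting = [block for block in free_blocks(slots) if block >= size]
        if fitting:
            candidate = (min(fitting), index)
            if best is None or candidate < best:
                best = candidate
    return None if best is None else best[1]
```

With one node holding a free block of 4 and the next a free block of 2, a 2-slot workload goes to the first node under `first_fit` but to the second under `best_fit`, which preserves the larger block for later 4-slot workloads.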

MCP Tools

| Tool | Description |
|---|---|
| get_cluster_state() | Returns metrics for all GPU+CPU nodes |
| get_task_info() | Returns task description, objectives, valid actions |
| take_action(decisions) | Applies decisions, advances timestep, returns reward |
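
The examples in this README pass `decisions` to `take_action` as a JSON string that maps node names to actions. A small helper can build that string and reject actions that do not belong to the current task before anything is sent. `build_decisions` and `VALID_ACTIONS` are hypothetical names; the action sets are copied from the task tables in this README, and `get_task_info()` remains the authoritative source at runtime.

```python
import json

# Action names copied from the task tables in this README.
VALID_ACTIONS = {
    "gpu_cpu_allocation": {"allocate_high", "allocate_low", "maintain", "migrate"},
    "thermal_management": {"increase_cooling", "decrease_cooling", "migrate_load", "maintain"},
    "heuristic_fragmentation": {"best_fit", "first_fit", "compact", "split_workload"},
}


def build_decisions(task: str, decisions: dict[str, str]) -> str:
    """Validate per-node actions for `task` and serialise them to JSON."""
    if task not in VALID_ACTIONS:
        raise ValueError(f"unknown task: {task!r}")
    invalid = {node: act for node, act in decisions.items() if act not in VALID_ACTIONS[task]}
    if invalid:
        raise ValueError(f"invalid actions for {task}: {invalid}")
    return json.dumps(decisions)
```

The returned string can be passed as the `decisions` argument in the Quick Start examples in place of a hand-written JSON literal.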

Observation Space (per node)

| Field | Description |
|---|---|
| gpu_utilization_pct | GPU compute utilisation (%) |
| cpu_utilization_pct | CPU utilisation (%) |
| gpu_vram_used_gb / gpu_vram_capacity_gb | VRAM usage (GB) |
| cpu_usage / cpu_capacity | CPU core usage |
| memory_usage_gb / memory_capacity_gb | RAM usage (GB) |
| gpu_temp_celsius | Current GPU temperature (°C) |
| ambient_temp_celsius | Ambient (data center) temperature (°C) |
| cooling_level | Cooling intensity (0-3) |
| thermal_throttle | Whether the GPU is throttling |
| fragmentation_score | How fragmented free GPU slots are (0-1) |
| cost_per_step | Running cost per step |
| power_draw_watts | Power consumption (W) |
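
The used/capacity pairs above are easier to compare once reduced to fractions. The sketch below assumes each node's observation is available as a plain dict keyed by the field names in the table; the actual return type of `get_cluster_state` may differ. The helper name `summarise_node` and the 0.5 fragmentation cut-off are hypothetical.

```python
def summarise_node(obs: dict, fragmentation_limit: float = 0.5) -> dict:
    """Reduce one node observation to usage fractions and an attention flag."""

    def ratio(used_key: str, capacity_key: str) -> float:
        capacity = obs[capacity_key]
        return obs[used_key] / capacity if capacity else 0.0  # avoid divide-by-zero

    return {
        "vram_fraction": ratio("gpu_vram_used_gb", "gpu_vram_capacity_gb"),
        "cpu_fraction": ratio("cpu_usage", "cpu_capacity"),
        "memory_fraction": ratio("memory_usage_gb", "memory_capacity_gb"),
        # Flag nodes that are throttling or badly fragmented (assumed cut-off).
        "needs_attention": bool(obs["thermal_throttle"])
        or obs["fragmentation_score"] > fragmentation_limit,
    }
```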

Quick Start (Async)

import asyncio
from cloud_resource_env import CloudResourceClient

async def main():
    client = await CloudResourceClient.from_docker_image("cloud-resource-env:latest")
    async with client:
        # Task 1: GPU+CPU allocation
        await client.reset(task="gpu_cpu_allocation")
        state = await client.call_tool("get_cluster_state")
        result = await client.call_tool(
            "take_action",
            decisions='{"node_0": "allocate_high", "node_1": "maintain", "node_2": "migrate"}'
        )

        # Task 2: Thermal management
        await client.reset(task="thermal_management")
        state = await client.call_tool("get_cluster_state")
        result = await client.call_tool(
            "take_action",
            decisions='{"node_0": "increase_cooling", "node_1": "migrate_load", "node_2": "maintain", "node_3": "maintain"}'
        )

asyncio.run(main())

Quick Start (Sync)

from cloud_resource_env import CloudResourceClient

with CloudResourceClient(base_url="http://localhost:8000").sync() as env:
    env.reset(task="heuristic_fragmentation")
    state = env.call_tool("get_cluster_state")
    result = env.call_tool(
        "take_action",
        decisions='{"node_0": "best_fit", "node_1": "compact", "node_2": "first_fit", "node_3": "best_fit", "node_4": "split_workload"}'
    )

GPU Node Types

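| Node | GPU | VRAM | CPU | RAM | Cost/step | TDP |
|---|---|---|---|---|---|---|
| T4-node | T4 | 16 GB | 4 cores | 16 GB | $8 | 70W |
| A100-node | A100 | 40 GB | 8 cores | 64 GB | $30 | 250W |
| H100-node | H100 | 80 GB | 16 cores | 128 GB | $55 | 350W |
| V100-node | V100 | 32 GB | 8 cores | 32 GB | $18 | 300W |
| L4-node | L4 | 24 GB | 4 cores | 32 GB | $12 | 72W |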
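
As a worked example of the figures in the table, the baseline cost of an episode is the number of steps times the summed cost per step. The sketch assumes costs stay at their listed values (no `allocate_high` or `allocate_low` scaling) and that a 3-node cluster is made of the node types chosen below; the actual composition of each task's cluster is not specified here.

```python
# Cost and TDP figures copied from the table above.
NODE_TYPES = {
    "T4-node": {"cost_per_step": 8, "tdp_watts": 70},
    "A100-node": {"cost_per_step": 30, "tdp_watts": 250},
    "H100-node": {"cost_per_step": 55, "tdp_watts": 350},
    "V100-node": {"cost_per_step": 18, "tdp_watts": 300},
    "L4-node": {"cost_per_step": 12, "tdp_watts": 72},
}


def episode_cost(node_names: list[str], steps: int) -> int:
    """Total baseline cost of running the given nodes for `steps` steps."""
    return steps * sum(NODE_TYPES[name]["cost_per_step"] for name in node_names)


def peak_power_watts(node_names: list[str]) -> int:
    """Sum of TDPs, an upper bound on the cluster's power draw."""
    return sum(NODE_TYPES[name]["tdp_watts"] for name in node_names)


# Hypothetical 3-node cluster run for the 8 steps of gpu_cpu_allocation.
cluster = ["T4-node", "A100-node", "L4-node"]
print(episode_cost(cluster, steps=8))  # 8 * (8 + 30 + 12) = 400
print(peak_power_watts(cluster))       # 70 + 250 + 72 = 392
```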

Setup & Installation

# Install the environment
pip install -e .

# Run the server locally
uvicorn server.app:app --host 0.0.0.0 --port 8000

# Or build and run Docker
docker build -t cloud-resource-env:latest .
docker run -p 8000:8000 cloud-resource-env:latest

# Train with PPO
python train.py --task gpu_cpu_allocation --timesteps 5000
python train.py --task all  # train on all tasks

Project Structure

cloud_resource_env/
├── __init__.py            # Package exports
├── models.py              # CloudAction, CloudObservation
├── client.py              # CloudResourceClient (MCPToolClient)
├── cloud_env.py           # Gymnasium wrapper for RL training
├── openenv.yaml           # OpenEnv manifest
├── pyproject.toml         # Dependencies
├── Dockerfile             # Container image (HF Spaces compatible)
├── inference.py           # LLM inference with task-specific prompts
├── train.py               # PPO training script
├── README.md              # This file
└── server/
    ├── __init__.py        # Server exports
    ├── app.py             # FastAPI application
    └── cloud_environment.py  # Core environment logic (3 tasks)