---
title: Cloud GPU+CPU Resource Management Environment
emoji: ☁️🔥
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false
app_port: 8000
base_path: /web
tags:
  - openenv-0.2.3
  - openenv
---

## Hugging Face Space Deployment

This Space runs the **Cloud GPU+CPU Resource Management** OpenEnv environment.

- OpenEnv pinned ref: `0.2.3`
- Hub tag: `openenv`
- **Runs on HF Spaces free tier** (2 vCPUs, 16 GB RAM; no physical GPU needed)

### Connecting from Code

```python
from cloud_resource_env import CloudResourceClient

env = CloudResourceClient(base_url="https://huggingface.co/spaces//cloud_resource_env")
```

# Cloud GPU+CPU Resource Management Environment

A real-world OpenEnv environment that simulates **cloud GPU and CPU resource management** with three progressively harder tasks covering allocation, thermal management, and heuristic fragmentation.

## Environment Overview

| Component | Description |
|---|---|
| **Domain** | Cloud GPU+CPU infrastructure management |
| **State** | GPU/CPU utilisation, VRAM, temperature, fragmentation, cost |
| **Actions** | Task-specific (see below); 4 actions per task |
| **Reward** | Multi-objective: utilisation efficiency + thermal safety + fragmentation + cost |
| **Score** | Normalised cumulative reward ∈ [0.0, 1.0] |

## Tasks (3 difficulty levels)

| Task | Difficulty | Nodes | Steps | Focus |
|---|---|---|---|---|
| `gpu_cpu_allocation` | Easy | 3 | 8 | GPU+CPU allocation with cost optimisation |
| `thermal_management` | Medium | 4 | 10 | Temperature monitoring, cooling, load migration |
| `heuristic_fragmentation` | Hard | 5 | 12 | Fragmented GPU placement + defragmentation |

### Task 1: GPU+CPU Allocation (`gpu_cpu_allocation`)

Manage a cluster of mixed GPU nodes (T4, A100, H100, V100, L4) with both GPU and CPU resources. Optimise throughput while staying within budget.


| Action | Effect |
|---|---|
| `allocate_high` | Capacity × 1.5, Cost × 1.5 |
| `allocate_low` | Capacity ÷ 1.5, Cost ÷ 1.5 |
| `maintain` | No change |
| `migrate` | Move 30% load to other nodes |

### Task 2: Thermal Management (`thermal_management`)

Monitor GPU temperatures and ambient temperature. Prevent thermal throttling by redistributing load or adjusting cooling levels.

| Action | Effect |
|---|---|
| `increase_cooling` | Cooling level +1 (max 3), reduces GPU temp ~5°C |
| `decrease_cooling` | Cooling level -1 (min 0), saves energy |
| `migrate_load` | Move 40% load to coolest node |
| `maintain` | No change |

**Temperature zones:**

- 🟢 Safe: 55°C – 75°C
- 🟡 Warning: 75°C – max threshold
- 🔴 Critical: Above max threshold → thermal throttle!

### Task 3: Heuristic Fragmentation (`heuristic_fragmentation`)

Place workloads in a fragmented GPU cluster. Each node has 8 GPU slots; workloads need contiguous blocks (1, 2, 4, or 8 slots).

| Action | Effect |
|---|---|
| `best_fit` | Place in node with smallest sufficient free block |
| `first_fit` | Place in first node with enough space |
| `compact` | Defragment first, then best-fit (10% overhead) |
| `split_workload` | Split across nodes if needed |

## MCP Tools

| Tool | Description |
|---|---|
| `get_cluster_state()` | Returns metrics for all GPU+CPU nodes |
| `get_task_info()` | Returns task description, objectives, valid actions |
| `take_action(decisions)` | Applies decisions, advances timestep, returns reward |

## Observation Space (per node)

| Field | Description |
|---|---|
| `gpu_utilization_pct` | GPU compute utilisation (%) |
| `cpu_utilization_pct` | CPU utilisation (%) |
| `gpu_vram_used_gb` / `gpu_vram_capacity_gb` | VRAM usage |
| `cpu_usage` / `cpu_capacity` | CPU cores usage |
| `memory_usage_gb` / `memory_capacity_gb` | RAM usage |
| `gpu_temp_celsius` | Current GPU temperature |
| `ambient_temp_celsius` | Outside/data center temperature |
| `cooling_level` | Cooling intensity (0-3) |
| `thermal_throttle` | Whether GPU is throttling |
| `fragmentation_score` | How fragmented free GPU slots are (0-1) |
| `cost_per_step` | Running cost |
| `power_draw_watts` | Power consumption |

## Quick Start (Async)

```python
import asyncio
from cloud_resource_env import CloudResourceClient

async def main():
    client = await CloudResourceClient.from_docker_image("cloud-resource-env:latest")
    async with client:
        # Task 1: GPU+CPU allocation
        await client.reset(task="gpu_cpu_allocation")
        state = await client.call_tool("get_cluster_state")
        result = await client.call_tool(
            "take_action",
            decisions='{"node_0": "allocate_high", "node_1": "maintain", "node_2": "migrate"}'
        )

        # Task 2: Thermal management
        await client.reset(task="thermal_management")
        state = await client.call_tool("get_cluster_state")
        result = await client.call_tool(
            "take_action",
            decisions='{"node_0": "increase_cooling", "node_1": "migrate_load", "node_2": "maintain", "node_3": "maintain"}'
        )

asyncio.run(main())
```

## Quick Start (Sync)

```python
from cloud_resource_env import CloudResourceClient

with CloudResourceClient(base_url="http://localhost:8000").sync() as env:
    env.reset(task="heuristic_fragmentation")
    state = env.call_tool("get_cluster_state")
    result = env.call_tool(
        "take_action",
        decisions='{"node_0": "best_fit", "node_1": "compact", "node_2": "first_fit", "node_3": "best_fit", "node_4": "split_workload"}'
    )
```

## GPU Node Types

| Node | GPU | VRAM | CPU | RAM | Cost/step | TDP |
|---|---|---|---|---|---|---|
| T4-node | T4 | 16 GB | 4 cores | 16 GB | $8 | 70W |
| A100-node | A100 | 40 GB | 8 cores | 64 GB | $30 | 250W |
| H100-node | H100 | 80 GB | 16 cores | 128 GB | $55 | 350W |
| V100-node | V100 | 32 GB | 8 cores | 32 GB | $18 | 300W |
| L4-node | L4 | 24 GB | 4 cores | 32 GB | $12 | 72W |

## Setup & Installation

```bash
# Install the environment
pip install -e .

# Run the server locally
uvicorn server.app:app --host 0.0.0.0 --port 8000

# Or build and run Docker
docker build -t cloud-resource-env:latest .
docker run -p 8000:8000 cloud-resource-env:latest

# Train with PPO
python train.py --task gpu_cpu_allocation --timesteps 5000
python train.py --task all  # train on all tasks
```

## Project Structure

```
cloud_resource_env/
├── __init__.py              # Package exports
├── models.py                # CloudAction, CloudObservation
├── client.py                # CloudResourceClient (MCPToolClient)
├── cloud_env.py             # Gymnasium wrapper for RL training
├── openenv.yaml             # OpenEnv manifest
├── pyproject.toml           # Dependencies
├── Dockerfile               # Container image (HF Spaces compatible)
├── inference.py             # LLM inference with task-specific prompts
├── train.py                 # PPO training script
├── README.md                # This file
└── server/
    ├── __init__.py          # Server exports
    ├── app.py               # FastAPI application
    └── cloud_environment.py # Core environment logic (3 tasks)
```
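
## Building `decisions` Programmatically

The `decisions` argument in the Quick Start examples is a JSON string that maps node names to actions. Writing that string by hand is error-prone, so here is a minimal sketch of building it in code. `VALID_ACTIONS`, `NODE_COUNTS`, and `build_decisions` are hypothetical helpers, not part of the `cloud_resource_env` package. The action names and node counts are copied from the task tables in this README. The `node_<index>` key format and the one-action-per-node rule are assumptions inferred from the examples; the server may validate differently.

```python
import json

# Action names and node counts copied from the task tables in this README.
VALID_ACTIONS = {
    "gpu_cpu_allocation": {"allocate_high", "allocate_low", "maintain", "migrate"},
    "thermal_management": {"increase_cooling", "decrease_cooling", "migrate_load", "maintain"},
    "heuristic_fragmentation": {"best_fit", "first_fit", "compact", "split_workload"},
}
NODE_COUNTS = {
    "gpu_cpu_allocation": 3,
    "thermal_management": 4,
    "heuristic_fragmentation": 5,
}


def build_decisions(task, actions):
    """Return a JSON string for take_action (hypothetical helper, not a package API)."""
    if task not in VALID_ACTIONS:
        raise ValueError(f"unknown task: {task!r}")
    # Assumption: every node receives exactly one action per step.
    if len(actions) != NODE_COUNTS[task]:
        raise ValueError(f"{task} expects {NODE_COUNTS[task]} actions, got {len(actions)}")
    for action in actions:
        if action not in VALID_ACTIONS[task]:
            raise ValueError(f"{action!r} is not a valid action for {task}")
    # Assumption: keys follow the node_<index> pattern seen in the Quick Start examples.
    return json.dumps({f"node_{i}": action for i, action in enumerate(actions)})


if __name__ == "__main__":
    print(build_decisions("gpu_cpu_allocation", ["allocate_high", "maintain", "migrate"]))
```

The returned string can be passed as `decisions=` to `call_tool("take_action", ...)`. Checking actions locally catches typos such as sending `migrate` to `thermal_management`, which expects `migrate_load`, before a timestep is spent.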