GPU Environment Training Integration Guide
This guide explains how to integrate the new VectorEnvGPU into the existing training pipeline (train_optimized.py) to achieve production-level performance.
1. Replacing the Environment Wrapper
Currently, `train_optimized.py` uses `BatchedSubprocVecEnv`, which manages multiple CPU worker processes. The GPU environment is a single object that manages thousands of environments internally on the device.
Steps:
1. Import `VectorEnvGPU`:

   ```python
   from ai.vector_env_gpu import VectorEnvGPU, HAS_CUDA
   ```

2. Conditional initialization: in the `train()` function, replace the `BatchedSubprocVecEnv` block:

   ```python
   if HAS_CUDA and os.getenv("USE_GPU_ENV") == "1":
       print(" [GPU] Initializing GPU-Resident Environment...")
       # num_envs should be large (e.g., 4096) to saturate the GPU
       env = VectorEnvGPU(num_envs=4096, seed=42)
       # VectorEnvGPU doesn't usually need a VecEnv wrapper,
       # but SB3 expects a specific API, so we use a thin adapter.
       env = SB3CudaAdapter(env)
   else:
       # Existing CPU logic
       env_fns = [...]
       env = BatchedSubprocVecEnv(...)
   ```
2. The SB3CudaAdapter
Stable Baselines 3 expects NumPy arrays on the CPU by default. To fully utilize the GPU env, we must either intercept the data before SB3 converts it, or use a custom policy that accepts torch tensors directly. Note that `MaskablePPO` in `sb3_contrib` may still try to cast inputs to NumPy.
Strategy: Zero-Copy Torch Wrapper
```python
import numpy as np
import torch
from gymnasium import spaces

class SB3CudaAdapter:
    def __init__(self, gpu_env):
        self.env = gpu_env
        self.num_envs = gpu_env.num_envs
        # Define spaces (mocked for SB3)
        self.observation_space = spaces.Box(low=0, high=1, shape=(8192,), dtype=np.float32)
        self.action_space = spaces.Discrete(2000)

    def reset(self):
        # Returns a torch tensor on the GPU
        obs, _ = self.env.reset()
        return torch.as_tensor(obs, device='cuda')

    def step(self, actions):
        # actions come from the policy as a torch tensor on the GPU;
        # pass them directly to the env
        obs, rewards, dones, infos = self.env.step(actions)
        # Wrap outputs in torch tensors (zero-copy: obs is already a
        # CuPy/device array, so no host transfer occurs)
        t_obs = torch.as_tensor(obs, device='cuda')
        t_rewards = torch.as_tensor(rewards, device='cuda')
        t_dones = torch.as_tensor(dones, device='cuda')
        return t_obs, t_rewards, t_dones, infos
```
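The zero-copy claim behind `torch.as_tensor` can be sanity-checked on CPU with NumPy; the same mechanism (`__cuda_array_interface__` / DLPack) is what makes the CuPy-to-torch handoff copy-free on the GPU:

```python
import numpy as np
import torch

# torch.as_tensor shares memory with the source array when dtype and device
# match, so a write through the NumPy array is visible through the tensor,
# proving no copy happened.
a = np.zeros(4, dtype=np.float32)
t = torch.as_tensor(a)
a[0] = 1.0
print(t[0].item())  # -> 1.0: the tensor sees the write
```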
3. PPO Policy Modifications
Standard SB3 algorithms often force `.cpu()` calls. For maximum speed, you might need to subclass `MaskablePPO` or `MlpPolicy` so that GPU tensors are accepted without being moved off-device.
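One low-risk pattern is a pass-through conversion that skips the NumPy round-trip when observations are already torch tensors; in SB3 this logic would go into an override of the policy's `obs_to_tensor` hook. A standalone sketch (the helper name is ours, not SB3's):

```python
import torch

def obs_to_tensor_passthrough(obs, device):
    """Device-aware conversion sketch: if the env already produced a torch
    tensor (as a GPU-resident env would), keep it on-device; otherwise fall
    back to the usual array-to-tensor conversion."""
    if isinstance(obs, torch.Tensor):
        # .to() is a no-op when obs is already on the target device
        return obs.to(device, non_blocking=True)
    return torch.as_tensor(obs, device=device)

t = obs_to_tensor_passthrough(torch.ones(2, 3), torch.device("cpu"))
```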
- Check `rollout_buffer.py`: SB3's rollout buffer stores data in CPU RAM by default.
- Optimization: for "Isaac Gym"-style training, the rollout buffer should live on the GPU.
  - Option A: Use SB3's `DictRolloutBuffer`? No, the standard buffer is the one involved here.
  - Option B: Modify SB3, or use a library designed for GPU-only training such as `skrl` or `cleanrl`.
  - Option C (easiest): Accept that `collect_rollouts` does one copy to CPU RAM for storage, while inference (the forward pass) stays on the GPU.
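To make Option B concrete, here is a minimal sketch of a GPU-resident buffer (a hypothetical class, not SB3's `RolloutBuffer` API; `device='cpu'` is the default here so the example runs anywhere, but in practice it would be `'cuda'`):

```python
import torch

class GpuRolloutBuffer:
    """Sketch of an Isaac-Gym-style rollout buffer: all storage lives on one
    device, so collection never touches host RAM."""
    def __init__(self, n_steps, num_envs, obs_dim, device='cpu'):
        self.obs = torch.zeros(n_steps, num_envs, obs_dim, device=device)
        self.actions = torch.zeros(n_steps, num_envs, dtype=torch.long, device=device)
        self.rewards = torch.zeros(n_steps, num_envs, device=device)
        self.dones = torch.zeros(n_steps, num_envs, dtype=torch.bool, device=device)
        self.pos = 0

    def add(self, obs, action, reward, done):
        # One batched write per step; no per-env loop, no host transfer
        self.obs[self.pos] = obs
        self.actions[self.pos] = action
        self.rewards[self.pos] = reward
        self.dones[self.pos] = done
        self.pos += 1

    def flatten(self):
        # (n_steps, num_envs, ...) -> (n_steps * num_envs, ...) for minibatching
        return (self.obs.flatten(0, 1), self.actions.flatten(),
                self.rewards.flatten(), self.dones.flatten())

buf = GpuRolloutBuffer(n_steps=4, num_envs=2, obs_dim=3)
for _ in range(4):
    buf.add(torch.ones(2, 3), torch.zeros(2, dtype=torch.long),
            torch.zeros(2), torch.zeros(2, dtype=torch.bool))
obs_flat, *_ = buf.flatten()
```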
4. Remaining Logic Gaps
The current `VectorEnvGPU` POC has simplified logic in `resolve_bytecode_device`. Before production:

- Complete Opcode Support: `O_CHARGE`, `O_CHOOSE`, and `O_ADD_H` need full card-movement logic (finding indices, updating arrays).
- Opponent Simulation: `step_kernel` currently simulates a random opponent. The `step_opponent_vectorized` logic from the CPU env needs to be ported to a CUDA kernel.
- Collision Handling: in `resolve_bytecode_device`, use atomic operations or careful logic if multiple effects try to modify the same global state (rare in this game, and `batch_global_ctx` is per-env, so this is safe).
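As a reference point for the port, the random-opponent choice can be expressed branch-free in NumPy (a sketch; `legal_mask` is a hypothetical `(num_envs, n_actions)` boolean array). The same argmax-over-masked-noise pattern maps directly onto a per-env CUDA kernel:

```python
import numpy as np

def random_legal_actions(legal_mask, rng):
    """Pick one uniformly random legal action per env, with no per-env loop."""
    # Draw uniform noise, suppress illegal entries, take the argmax: by
    # symmetry the winner is a uniform sample over the legal actions.
    noise = rng.random(legal_mask.shape)
    noise = np.where(legal_mask, noise, -1.0)
    return noise.argmax(axis=1)

rng = np.random.default_rng(0)
mask = np.array([[True, False, True],
                 [False, True, False]])
acts = random_legal_actions(mask, rng)
```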
5. Performance Expectations
- Current CPU: ~10k SPS (128 envs).
- Target GPU: ~100k-500k SPS (4096+ envs).
- Bottleneck: Will shift from "PCI-E Transfer" to "Policy Network Forward Pass".
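These SPS figures can be validated with a trivial harness (a sketch; `step_fn` stands in for one batched `env.step` call):

```python
import time

def measure_sps(step_fn, num_envs, n_iters=50):
    """Return environment steps per second: each step_fn call is assumed to
    advance all num_envs environments by one step."""
    start = time.perf_counter()
    for _ in range(n_iters):
        step_fn()
    elapsed = time.perf_counter() - start
    return n_iters * num_envs / elapsed

# A no-op step just exercises the harness; pass env.step in practice.
sps = measure_sps(lambda: None, num_envs=4096)
```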