# GPU Migration Guide: Moving VectorEnv to Numba CUDA ## 1. The Core Question: "Aren't we already using Numba?" Yes, the current `VectorEnv` uses **Numba CPU (`@njit`)**. While Numba is famous for compiling Python to machine code, it has two distinct backends: 1. **CPU Backend (`@njit`)**: Compiles to x86/AVX/ARM assembly. Uses OpenMP (`parallel=True`) for multi-threading. Data usually lives in standard RAM (Numpy arrays). 2. **CUDA Backend (`@cuda.jit`)**: Compiles to PTX (NVidia GPU assembly). Runs on the GPU. Data **must** live in VRAM (Device Arrays). **The bottleneck today** is not the execution speed of the logic itself, but the **PCI-E Bus**. * **Current Flow**: `CPU Logic` -> `CPU RAM` -> `PCI-E Copy` -> `GPU VRAM (Policy Net)` -> `PCI-E Copy` -> `CPU RAM` -> ... * **Target Flow (Isaac Gym Style)**: `GPU Logic` -> `GPU VRAM` -> `Policy Net` -> `GPU VRAM` -> ... Porting to Numba CUDA eliminates the PCI-E transfer, potentially unlocking 100k+ steps per second for massive batches. ## 2. Architecture Comparison ### Current (CPU Parallel) ```python @njit(parallel=True) def step_vectorized(...): for i in prange(num_envs): # CPU Threads # Process Env i ``` * **Memory**: Host RAM (Numpy). * **Parallelism**: ~16-64 threads (CPU Cores). * **Observation**: Generated on CPU, copied to GPU. ### Proposed (GPU Massive Parallel) ```python @cuda.jit def step_kernel(...): i = cuda.grid(1) # GPU Thread ID if i < num_envs: # Process Env i ``` * **Memory**: Device VRAM (CuPy / Numba DeviceArray). * **Parallelism**: ~10,000+ threads. * **Observation**: Stays on VRAM. Passed to PyTorch via `__cuda_array_interface__`. ## 3. Implementation Challenges & Solutions ### A. Memory Management (The "Zero Copy" Goal) You cannot pass standard Numpy arrays to `@cuda.jit` kernels efficiently without triggering a transfer. * **Solution**: Use `cupy` arrays or `numba.cuda.device_array` for the master state (`batch_stage`, `batch_hand`, etc.). * **PyTorch Integration**: PyTorch can wrap these arrays zero-copy using `torch.as_tensor(cupy_array)` or `torch.utils.dlpack.from_dlpack`. ### B. The "Warp Divergence" Problem GPUs execute instructions in "Warps" (groups of 32 threads). If Thread 1 executes `if A:` and Thread 2 executes `else B:`, **both** threads execute **both** paths (masking out the inactive one). * **Risk**: The `resolve_bytecode` VM is a giant switch-case loop. If Env 1 runs Opcode 10 (Draw) and Env 2 runs Opcode 20 (Attack), they diverge. * **Mitigation**: The high throughput of GPUs (thousands of cores) usually overcomes this inefficiency. Even at 10% efficiency due to divergence, a 4090 GPU (16k cores) might beat a 32-core CPU. * **Advanced Fix**: Sort environments by "Next Opcode" before execution (sorting on GPU is fast). This ensures threads in a warp execute the same instruction. (Complex to implement). ### C. Random Numbers `np.random` does not work in CUDA kernels. * **Solution**: Use `numba.cuda.random.xoroshiro128p`. * **Requirement**: You must initialize and maintain an array of RNG states (one per thread). ### D. Recursion & Dynamic Allocation Numba CUDA does not support recursion or list allocations (`[]`). * **Status**: The current `fast_logic.py` is already largely iterative and uses fixed arrays, so this is **Ready for Porting**. ## 4. Migration Roadmap ### Phase 1: Data Structures Convert `VectorGameState` to allocate memory on GPU. ```python # ai/vector_env_gpu.py import cupy as cp class VectorGameStateGPU: def __init__(self, num_envs): self.batch_stage = cp.full((num_envs, 3), -1, dtype=cp.int32) # ... all other arrays as cp.ndarray ``` ### Phase 2: Kernel Rewrite Rewrite `step_vectorized` as a kernel. * Replace `prange` with `cuda.grid(1)`. * Move `resolve_bytecode` to a `@cuda.jit(device=True)` function. ### Phase 3: PPO Adapter Update the RL training loop (`train_optimized.py`) to accept GPU tensors. ```python # In PPO Rollout Buffer def collect_rollouts(self): # obs is already on GPU! with torch.no_grad(): action, value, log_prob = self.policy(obs) # action is on GPU. Pass directly to env.step() next_obs = env.step(action) ``` ## 5. Feasibility Verdict **High Feasibility.** The codebase is already "Numba-Friendly" (no objects, flat arrays). The transition is primarily syntactic (`prange` -> `kernel`) and infrastructural (Numpy -> CuPy). **Estimated Effort**: 1-2 weeks for a skilled GPU engineer. **Expected Gain**: 5x-10x throughput scaling for batch sizes > 4096.