# GPU Migration Guide: Moving VectorEnv to Numba CUDA
## 1. The Core Question: "Aren't we already using Numba?"
Yes, the current `VectorEnv` uses **Numba CPU (`@njit`)**. While Numba is famous for compiling Python to machine code, it has two distinct backends:
1. **CPU Backend (`@njit`)**: Compiles to x86/AVX/ARM assembly. Uses OpenMP (`parallel=True`) for multi-threading. Data usually lives in standard RAM (Numpy arrays).
2. **CUDA Backend (`@cuda.jit`)**: Compiles to PTX (NVIDIA GPU assembly). Runs on the GPU. Data **must** live in VRAM (Device Arrays).
**The bottleneck today** is not the execution speed of the logic itself, but the **PCI-E Bus**.
* **Current Flow**: `CPU Logic` -> `CPU RAM` -> `PCI-E Copy` -> `GPU VRAM (Policy Net)` -> `PCI-E Copy` -> `CPU RAM` -> ...
* **Target Flow (Isaac Gym Style)**: `GPU Logic` -> `GPU VRAM` -> `Policy Net` -> `GPU VRAM` -> ...
Porting to Numba CUDA eliminates the PCI-E transfer, potentially unlocking 100k+ steps per second for massive batches.
## 2. Architecture Comparison
### Current (CPU Parallel)
```python
@njit(parallel=True)
def step_vectorized(...):
    for i in prange(num_envs):  # CPU threads
        # Process Env i
```
* **Memory**: Host RAM (Numpy).
* **Parallelism**: ~16-64 threads (CPU Cores).
* **Observation**: Generated on CPU, copied to GPU.
### Proposed (GPU Massive Parallel)
```python
@cuda.jit
def step_kernel(...):
    i = cuda.grid(1)  # GPU thread ID
    if i < num_envs:
        # Process Env i
```
* **Memory**: Device VRAM (CuPy / Numba DeviceArray).
* **Parallelism**: ~10,000+ threads.
* **Observation**: Stays on VRAM. Passed to PyTorch via `__cuda_array_interface__`.
## 3. Implementation Challenges & Solutions
### A. Memory Management (The "Zero Copy" Goal)
You cannot pass standard Numpy arrays to `@cuda.jit` kernels efficiently without triggering a transfer.
* **Solution**: Use `cupy` arrays or `numba.cuda.device_array` for the master state (`batch_stage`, `batch_hand`, etc.).
* **PyTorch Integration**: PyTorch can wrap these arrays zero-copy using `torch.as_tensor(cupy_array)` or `torch.utils.dlpack.from_dlpack`.
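DLPack is the same protocol on CPU and GPU, so the zero-copy behaviour can be illustrated with a NumPy array on a machine without CUDA; a CuPy array would be wrapped identically, with the resulting tensor living on `cuda:0` instead. The buffer here is a hypothetical stand-in for a state array:

```python
import numpy as np
import torch

buf = np.zeros((4, 3), dtype=np.float32)  # stand-in for a CuPy state array
t = torch.from_dlpack(buf)                # wraps the same memory, no copy

buf[0, 0] = 42.0                          # mutate through the source array
print(t[0, 0].item())  # 42.0 -- the tensor sees the write: memory is shared
```

The mutation being visible through the tensor is the whole point: no bytes crossed any bus, the two objects are views of one allocation.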
### B. The "Warp Divergence" Problem
GPUs execute instructions in "warps" (groups of 32 threads). If Thread 1 takes the `if A:` branch and Thread 2 takes the `else:` branch, the warp executes **both** paths serially, masking off the threads that did not take each branch.
* **Risk**: The `resolve_bytecode` VM is a giant switch-case loop. If Env 1 runs Opcode 10 (Draw) and Env 2 runs Opcode 20 (Attack), they diverge.
* **Mitigation**: The sheer thread count of GPUs usually outweighs this inefficiency. Even at 10% SIMT efficiency due to divergence, an RTX 4090 (16,384 CUDA cores) can still beat a 32-core CPU.
* **Advanced Fix**: Sort environments by "Next Opcode" before execution (sorting on GPU is fast). This ensures threads in a warp execute the same instruction. (Complex to implement).
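The sorting idea can be sketched on the host with NumPy (on the GPU one would sort the opcode array with e.g. `cupy.argsort` instead); the per-env opcode values are made up for illustration. The point is only that after sorting, each contiguous warp-sized group of envs shares one (or very few) opcodes:

```python
import numpy as np

rng = np.random.default_rng(0)
next_opcode = rng.integers(0, 4, size=128)  # hypothetical per-env next opcode

# Permutation that groups envs with equal opcodes together.
order = np.argsort(next_opcode, kind="stable")
sorted_ops = next_opcode[order]

# A warp-sized slice of the sorted array is (near-)uniform, so the 32
# threads of a warp mostly execute the same switch-case branch.
first_warp = sorted_ops[:32]
print(len(np.unique(first_warp)), "distinct opcode(s) in the first warp")
```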
### C. Random Numbers
`np.random` does not work in CUDA kernels.
* **Solution**: Use the `xoroshiro128p` generator from `numba.cuda.random`.
* **Requirement**: You must initialize and maintain an array of RNG states (one per thread).
### D. Recursion & Dynamic Allocation
Numba CUDA does not support recursion or list allocations (`[]`).
* **Status**: The current `fast_logic.py` is already largely iterative and uses fixed arrays, so this is **Ready for Porting**.
## 4. Migration Roadmap
### Phase 1: Data Structures
Convert `VectorGameState` to allocate memory on GPU.
```python
# ai/vector_env_gpu.py
import cupy as cp

class VectorGameStateGPU:
    def __init__(self, num_envs):
        self.batch_stage = cp.full((num_envs, 3), -1, dtype=cp.int32)
        # ... all other arrays as cp.ndarray
```
### Phase 2: Kernel Rewrite
Rewrite `step_vectorized` as a kernel.
* Replace `prange` with `cuda.grid(1)`.
* Move `resolve_bytecode` to a `@cuda.jit(device=True)` function.
### Phase 3: PPO Adapter
Update the RL training loop (`train_optimized.py`) to accept GPU tensors.
```python
# In PPO Rollout Buffer
def collect_rollouts(self):
    # obs is already on GPU!
    with torch.no_grad():
        action, value, log_prob = self.policy(obs)
    # action is on GPU. Pass directly to env.step()
    next_obs = env.step(action)
```
## 5. Feasibility Verdict
**High Feasibility.** The codebase is already "Numba-Friendly" (no objects, flat arrays). The transition is primarily syntactic (`prange` -> `kernel`) and infrastructural (Numpy -> CuPy).
**Estimated Effort**: 1-2 weeks for a skilled GPU engineer.
**Expected Gain**: 5x-10x throughput scaling for batch sizes > 4096.