# GPU Migration Guide: Moving VectorEnv to Numba CUDA
## 1. The Core Question: "Aren't we already using Numba?"
Yes, the current `VectorEnv` uses **Numba CPU (`@njit`)**. While Numba is famous for compiling Python to machine code, it has two distinct backends:
1. **CPU Backend (`@njit`)**: Compiles to x86/AVX/ARM assembly. Uses OpenMP (`parallel=True`) for multi-threading. Data usually lives in standard RAM (Numpy arrays).
2. **CUDA Backend (`@cuda.jit`)**: Compiles to PTX (NVIDIA GPU assembly). Runs on the GPU. Data **must** live in VRAM (Device Arrays).
**The bottleneck today** is not the execution speed of the logic itself, but the **PCI-E Bus**.
* **Current Flow**: `CPU Logic` -> `CPU RAM` -> `PCI-E Copy` -> `GPU VRAM (Policy Net)` -> `PCI-E Copy` -> `CPU RAM` -> ...
* **Target Flow (Isaac Gym Style)**: `GPU Logic` -> `GPU VRAM` -> `Policy Net` -> `GPU VRAM` -> ...
Porting to Numba CUDA eliminates the PCI-E transfer, potentially unlocking 100k+ steps per second for massive batches.
## 2. Architecture Comparison
### Current (CPU Parallel)
```python
@njit(parallel=True)
def step_vectorized(...):
    for i in prange(num_envs):  # CPU threads
        # Process Env i
```
* **Memory**: Host RAM (Numpy).
* **Parallelism**: ~16-64 threads (CPU Cores).
* **Observation**: Generated on CPU, copied to GPU.
### Proposed (GPU Massive Parallel)
```python
@cuda.jit
def step_kernel(...):
    i = cuda.grid(1)  # GPU thread ID
    if i < num_envs:
        # Process Env i
```
* **Memory**: Device VRAM (CuPy / Numba DeviceArray).
* **Parallelism**: ~10,000+ threads.
* **Observation**: Stays on VRAM. Passed to PyTorch via `__cuda_array_interface__`.
## 3. Implementation Challenges & Solutions
### A. Memory Management (The "Zero Copy" Goal)
You cannot pass standard Numpy arrays to `@cuda.jit` kernels efficiently without triggering a transfer.
* **Solution**: Use `cupy` arrays or `numba.cuda.device_array` for the master state (`batch_stage`, `batch_hand`, etc.).
* **PyTorch Integration**: PyTorch can wrap these arrays zero-copy using `torch.as_tensor(cupy_array)` or `torch.utils.dlpack.from_dlpack`.
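DLPack is the same protocol on CPU and GPU, so the zero-copy behaviour can be illustrated with a NumPy array on a machine without CUDA; a CuPy array would be wrapped identically, with the resulting tensor living on `cuda:0` instead. The buffer here is a hypothetical stand-in for a state array:

```python
import numpy as np
import torch

buf = np.zeros((4, 3), dtype=np.float32)  # stand-in for a CuPy state array
t = torch.from_dlpack(buf)                # wraps the same memory, no copy

buf[0, 0] = 42.0                          # mutate through the source array
print(t[0, 0].item())  # 42.0 -- the tensor sees the write: memory is shared
```

The mutation being visible through the tensor is the whole point: no bytes crossed any bus, the two objects are views of one allocation.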
### B. The "Warp Divergence" Problem
GPUs execute instructions in "warps" (groups of 32 threads). If Thread 1 takes the `if A:` branch and Thread 2 takes the `else:` branch, the warp executes **both** paths serially, masking off the threads that did not take each branch.
* **Risk**: The `resolve_bytecode` VM is a giant switch-case loop. If Env 1 runs Opcode 10 (Draw) and Env 2 runs Opcode 20 (Attack), they diverge.
* **Mitigation**: The sheer thread count of GPUs usually outweighs this inefficiency. Even at 10% SIMT efficiency due to divergence, an RTX 4090 (16,384 CUDA cores) can still beat a 32-core CPU.
* **Advanced Fix**: Sort environments by "Next Opcode" before execution (sorting on GPU is fast). This ensures threads in a warp execute the same instruction. (Complex to implement).
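The sorting idea can be sketched on the host with NumPy (on the GPU one would sort the opcode array with e.g. `cupy.argsort` instead); the per-env opcode values are made up for illustration. The point is only that after sorting, each contiguous warp-sized group of envs shares one (or very few) opcodes:

```python
import numpy as np

rng = np.random.default_rng(0)
next_opcode = rng.integers(0, 4, size=128)  # hypothetical per-env next opcode

# Permutation that groups envs with equal opcodes together.
order = np.argsort(next_opcode, kind="stable")
sorted_ops = next_opcode[order]

# A warp-sized slice of the sorted array is (near-)uniform, so the 32
# threads of a warp mostly execute the same switch-case branch.
first_warp = sorted_ops[:32]
print(len(np.unique(first_warp)), "distinct opcode(s) in the first warp")
```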
### C. Random Numbers
`np.random` does not work in CUDA kernels.
* **Solution**: Use the `xoroshiro128p` generator from `numba.cuda.random`.
* **Requirement**: You must initialize and maintain an array of RNG states (one per thread).
### D. Recursion & Dynamic Allocation
Numba CUDA does not support recursion or list allocations (`[]`).
* **Status**: The current `fast_logic.py` is already largely iterative and uses fixed arrays, so this is **Ready for Porting**.
## 4. Migration Roadmap
### Phase 1: Data Structures
Convert `VectorGameState` to allocate memory on GPU.
```python
# ai/vector_env_gpu.py
import cupy as cp

class VectorGameStateGPU:
    def __init__(self, num_envs):
        self.batch_stage = cp.full((num_envs, 3), -1, dtype=cp.int32)
        # ... all other arrays as cp.ndarray
```
### Phase 2: Kernel Rewrite
Rewrite `step_vectorized` as a kernel.
* Replace `prange` with `cuda.grid(1)`.
* Move `resolve_bytecode` to a `@cuda.jit(device=True)` function.
### Phase 3: PPO Adapter
Update the RL training loop (`train_optimized.py`) to accept GPU tensors.
```python
# In PPO Rollout Buffer
def collect_rollouts(self):
    # obs is already on GPU!
    with torch.no_grad():
        action, value, log_prob = self.policy(obs)
    # action is on GPU. Pass directly to env.step()
    next_obs = env.step(action)
```
## 5. Feasibility Verdict
**High Feasibility.** The codebase is already "Numba-Friendly" (no objects, flat arrays). The transition is primarily syntactic (`prange` -> `kernel`) and infrastructural (Numpy -> CuPy).
**Estimated Effort**: 1-2 weeks for a skilled GPU engineer.
**Expected Gain**: 5x-10x throughput scaling for batch sizes > 4096.