# GPU Migration Guide: Moving VectorEnv to Numba CUDA

## 1. The Core Question: "Aren't we already using Numba?"

Yes. The current `VectorEnv` uses **Numba's CPU target (`@njit`)**. Numba is best known for compiling Python to native machine code, but it has two distinct backends:

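1.  **CPU Backend (`@njit`)**: Compiles through LLVM to native machine code (x86 with SSE/AVX, or ARM). With `parallel=True`, `prange` loops are multi-threaded through one of Numba's threading layers (TBB, OpenMP, or the built-in workqueue). Data lives in host RAM (NumPy arrays).
2.  **CUDA Backend (`@cuda.jit`)**: Compiles to PTX (NVIDIA's virtual GPU instruction set), which the driver translates for the installed GPU. Kernels run on the GPU and operate on data in VRAM (device arrays); host arrays passed to a kernel are copied to the device and back on every call.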

**The bottleneck today** is not the execution speed of the logic itself, but the round trip over the **PCIe bus** between the CPU-side environment and the GPU-side policy network.
*   **Current Flow**: `CPU Logic` -> `CPU RAM` -> `PCIe Copy` -> `GPU VRAM (Policy Net)` -> `PCIe Copy` -> `CPU RAM` -> ...
*   **Target Flow (Isaac Gym Style)**: `GPU Logic` -> `GPU VRAM` -> `Policy Net` -> `GPU VRAM` -> ...

Porting to Numba CUDA removes the per-step PCIe transfers, potentially unlocking 100k+ steps per second for massive batches.
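
To see why two copies per step matter, the current flow can be put into rough numbers. The sketch below is a back-of-the-envelope model only: the observation width, bandwidth, and per-copy latency are placeholder assumptions, not measurements from `VectorEnv`, and `pcie_step_overhead` is a hypothetical helper name.

```python
def pcie_step_overhead(num_envs, obs_dim, bandwidth_bytes_per_s,
                       latency_s, bytes_per_value=4):
    """Estimate seconds spent on PCIe copies for one batched env step.

    Models the current flow: observations go host -> device,
    actions (one value per env) come back device -> host.
    """
    obs_bytes = num_envs * obs_dim * bytes_per_value
    action_bytes = num_envs * bytes_per_value
    transfer_s = (obs_bytes + action_bytes) / bandwidth_bytes_per_s
    return transfer_s + 2 * latency_s  # two separate copies per step


# Placeholder inputs (assumptions): 4096 envs, 512 float32 values per
# observation, 16 GB/s effective bandwidth, 10 microseconds per copy.
overhead = pcie_step_overhead(4096, 512, 16e9, 10e-6)
max_batched_steps_per_s = 1.0 / overhead
print(f"copy overhead per step: {overhead * 1e3:.3f} ms")
print(f"upper bound from copies alone: {max_batched_steps_per_s:.0f} batched steps/s")
```

In the target flow both terms drop out, because observations and actions never leave VRAM.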

## 2. Architecture Comparison

### Current (CPU Parallel)
```python
@njit(parallel=True)
def step_vectorized(...):
    for i in prange(num_envs):  # CPU Threads
        # Process Env i
```
*   **Memory**: Host RAM (NumPy arrays).
*   **Parallelism**: Roughly 16-64 threads (one per CPU core).
*   **Observation**: Generated on the CPU, then copied to the GPU every step.

### Proposed (GPU Massive Parallel)
```python
@cuda.jit
def step_kernel(...):
    i = cuda.grid(1)  # GPU Thread ID
    if i < num_envs:
        # Process Env i
```
*   **Memory**: Device VRAM (CuPy / Numba DeviceArray).
*   **Parallelism**: 10,000+ threads, one per environment.
*   **Observation**: Stays in VRAM. Exposed to PyTorch through `__cuda_array_interface__`.
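
The `if i < num_envs` guard in the kernel exists because a launch always starts whole blocks of threads, so the grid is usually slightly larger than the batch. The following pure-Python sketch (no GPU needed) shows the arithmetic; `launch_config` and `active_thread_ids` are hypothetical helpers, and 128 threads per block is an assumed starting value rather than a tuned one.

```python
def launch_config(num_envs, threads_per_block=128):
    """Return (blocks, threads_per_block) giving every env its own thread."""
    blocks = (num_envs + threads_per_block - 1) // threads_per_block  # ceil division
    return blocks, threads_per_block


def active_thread_ids(num_envs, blocks, threads_per_block):
    """Emulate the kernel guard: the global thread ids that pass `i < num_envs`."""
    total_threads = blocks * threads_per_block
    return [i for i in range(total_threads) if i < num_envs]


blocks, tpb = launch_config(10_000)
ids = active_thread_ids(10_000, blocks, tpb)
idle = blocks * tpb - len(ids)  # surplus threads that exit at the guard
print(blocks, tpb, len(ids), idle)
```

With Numba, the same pair goes into the launch brackets, as in `step_kernel[blocks, tpb](...)`.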

## 3. Implementation Challenges & Solutions

### A. Memory Management (The "Zero Copy" Goal)
Passing standard NumPy arrays to a `@cuda.jit` kernel works, but Numba then copies each array to the device before the launch and back to the host afterwards, which reintroduces the transfer this port is meant to remove.
*   **Solution**: Use `cupy` arrays or `numba.cuda.device_array` for the master state (`batch_stage`, `batch_hand`, etc.).
*   **PyTorch Integration**: PyTorch can wrap these arrays zero-copy using `torch.as_tensor(cupy_array)` (through `__cuda_array_interface__`) or `torch.utils.dlpack.from_dlpack`.

### B. The "Warp Divergence" Problem
GPUs execute instructions in "warps" (groups of 32 threads that share one instruction stream). If, within the same warp, some threads take `if A:` and others take `else B:`, the warp runs **both** paths one after the other, masking out the threads that did not take the current path.
*   **Risk**: The `resolve_bytecode` VM is a giant switch-case loop. If Env 1 runs Opcode 10 (Draw) and Env 2 runs Opcode 20 (Attack) in the same warp, they diverge.
*   **Mitigation**: The sheer width of a GPU often compensates for this inefficiency. Even at 10% lane utilization due to divergence, an RTX 4090 (16,384 CUDA cores) might still beat a 32-core CPU, although this has to be confirmed by measurement.
*   **Advanced Fix**: Sort environments by "Next Opcode" before execution (sorting on GPU is fast), so that threads in a warp mostly execute the same instruction. This is complex to implement.
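
The effect of sorting by opcode can be illustrated on the host with a deliberately simplified cost model: each warp pays one serialized pass per distinct opcode among its 32 threads. The sketch below is plain Python, not a GPU implementation; `serialized_passes` and `order_by_opcode` are hypothetical names, and a real port would do the sort on the device.

```python
WARP_SIZE = 32


def serialized_passes(opcodes, warp_size=WARP_SIZE):
    """Count passes under the model: one pass per distinct opcode per warp."""
    total = 0
    for start in range(0, len(opcodes), warp_size):
        total += len(set(opcodes[start:start + warp_size]))
    return total


def order_by_opcode(opcodes):
    """Return env indices ordered so that equal opcodes sit next to each other."""
    return sorted(range(len(opcodes)), key=lambda i: opcodes[i])


# 64 envs alternating between opcode 10 (Draw) and opcode 20 (Attack).
next_opcode = [10, 20] * 32
order = order_by_opcode(next_opcode)
sorted_opcodes = [next_opcode[i] for i in order]

print(serialized_passes(next_opcode))     # 4: both warps contain both opcodes
print(serialized_passes(sorted_opcodes))  # 2: each warp contains a single opcode
```

The permutation `order` also has to be applied to the state arrays, or used as an index indirection inside the kernel, which is a large part of the implementation cost.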

### C. Random Numbers
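`np.random` does not work inside CUDA kernels.
*   **Solution**: Use the xoroshiro128+ generator in `numba.cuda.random` (for example `xoroshiro128p_uniform_float32`).
*   **Requirement**: You must create an array of RNG states with `create_xoroshiro128p_states` (one state per thread) and pass it to every kernel that draws random numbers.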

### D. Recursion & Dynamic Allocation
Numba CUDA does not support list allocations (`[]`) or other dynamic allocation inside kernels, and it has only limited support for recursion.
*   **Status**: The current `fast_logic.py` is already largely iterative and uses fixed arrays, so this is **Ready for Porting**.
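
For any helper that is still recursive, the usual rewrite is an explicit stack of fixed capacity. The sketch below shows the pattern in plain Python on a hypothetical example (summing a binary tree stored in flat arrays); it is not taken from `fast_logic.py`, and `MAX_STACK` is an assumed bound. In a kernel, the preallocated list would correspond to a fixed-size local array such as one created with `cuda.local.array`.

```python
MAX_STACK = 16  # assumed upper bound on pending nodes; size it per workload


def sum_tree_iterative(values, left, right, root):
    """Sum node values of a tree stored as flat arrays (-1 means no child)."""
    stack = [0] * MAX_STACK  # preallocated once, never grown
    top = 0
    stack[top] = root
    top += 1
    total = 0
    while top > 0:
        top -= 1
        node = stack[top]
        total += values[node]
        for child in (left[node], right[node]):
            if child >= 0:
                if top == MAX_STACK:
                    return -1  # overflow sentinel; the stack cannot grow
                stack[top] = child
                top += 1
    return total


values = [1, 2, 3, 4]
left = [1, 3, -1, -1]
right = [2, -1, -1, -1]
print(sum_tree_iterative(values, left, right, 0))  # 10
```

The fixed capacity is the design trade-off: the bound has to be chosen up front, and overflow must be reported through a return value or a flag array.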

## 4. Migration Roadmap

### Phase 1: Data Structures
Convert `VectorGameState` to allocate memory on GPU.
```python
# ai/vector_env_gpu.py
import cupy as cp

class VectorGameStateGPU:
    def __init__(self, num_envs):
        self.batch_stage = cp.full((num_envs, 3), -1, dtype=cp.int32)
        # ... all other arrays as cp.ndarray
```

### Phase 2: Kernel Rewrite
Rewrite `step_vectorized` as a kernel.
*   Replace `prange` with `cuda.grid(1)`.
*   Move `resolve_bytecode` to a `@cuda.jit(device=True)` function.

### Phase 3: PPO Adapter
Update the RL training loop (`train_optimized.py`) to accept GPU tensors.
```python
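# In PPO Rollout Buffer (`self.env` and `self.last_obs` are illustrative names)
import torch

def collect_rollouts(self):
    obs = self.last_obs  # already a CUDA tensor, so no host copy is needed
    with torch.no_grad():
        action, value, log_prob = self.policy(obs)

    # action is on the GPU too; pass it directly to env.step()
    self.last_obs = self.env.step(action)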
```

## 5. Feasibility Verdict

**High Feasibility.** The codebase is already "Numba-friendly" (no objects, flat arrays). The transition is primarily syntactic (`prange` loop -> `@cuda.jit` kernel) and infrastructural (NumPy -> CuPy).

*   **Estimated Effort**: 1-2 weeks for a skilled GPU engineer.
*   **Expected Gain**: 5x-10x throughput scaling for batch sizes > 4096.