# GPU Environment Training Integration Guide

This guide explains how to integrate the new `VectorEnvGPU` into the existing training pipeline (`train_optimized.py`) to achieve production-level performance.

## 1. Replacing the Environment Wrapper

Currently, `train_optimized.py` uses `BatchedSubprocVecEnv`, which manages multiple CPU worker processes. The GPU environment is a single object that steps thousands of environments internally.

### Steps:

1.  **Import `VectorEnvGPU`**:

    ```python
    from ai.vector_env_gpu import VectorEnvGPU, HAS_CUDA
    ```

2.  **Conditional Initialization**:
    In the `train()` function, replace the `BatchedSubprocVecEnv` block:

    ```python
    if HAS_CUDA and os.getenv("USE_GPU_ENV") == "1":
        print(" [GPU] Initializing GPU-Resident Environment...")
        # num_envs should be large (e.g., 4096) to saturate the GPU
        env = VectorEnvGPU(num_envs=4096, seed=42)

        # VectorEnvGPU usually doesn't need a VecEnv wrapper,
        # but SB3 expects a specific API, so we may need a thin adapter.
        env = SB3CudaAdapter(env)
    else:
        # Existing CPU logic
        env_fns = [...]
        env = BatchedSubprocVecEnv(...)
    ```
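
With this guard in place, the GPU path is opt-in from the shell (assuming `train_optimized.py` is invoked directly, as named above):

```shell
# Opt in to the GPU-resident environment; omit the variable to keep the CPU path
USE_GPU_ENV=1 python train_optimized.py
```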


## 2. The `SB3CudaAdapter`

Stable Baselines 3 expects numpy arrays on CPU by default. To fully utilize the GPU env, we must intercept the data *before* SB3 tries to convert it, or use a custom Policy that accepts Torch tensors directly.

However, `MaskablePPO` in `sb3_contrib` might try to cast inputs to numpy.

**Strategy: Zero-Copy Torch Wrapper**

```python
import numpy as np
import torch
from gymnasium import spaces


class SB3CudaAdapter:
    def __init__(self, gpu_env):
        self.env = gpu_env
        self.num_envs = gpu_env.num_envs
        # Define spaces (mocked for SB3)
        self.observation_space = spaces.Box(low=0, high=1, shape=(8192,), dtype=np.float32)
        self.action_space = spaces.Discrete(2000)

    def reset(self):
        # Returns a torch tensor on the GPU
        obs, _ = self.env.reset()
        return torch.as_tensor(obs, device='cuda')

    def step(self, actions):
        # Actions come from the policy as a torch tensor already on the GPU;
        # pass them to the env directly.
        obs, rewards, dones, infos = self.env.step(actions)

        # Wrap outputs as torch tensors (zero-copy:
        # obs is already a CuPy/DeviceArray)
        t_obs = torch.as_tensor(obs, device='cuda')
        t_rewards = torch.as_tensor(rewards, device='cuda')
        t_dones = torch.as_tensor(dones, device='cuda')

        return t_obs, t_rewards, t_dones, infos
```

## 3. PPO Policy Modifications

Standard SB3 algorithms often force `cpu()` calls. For maximum speed, you might need to subclass `MaskablePPO` or `MlpPolicy` to ensure it accepts GPU tensors without moving them.

*   **Check `rollout_buffer.py`**: SB3's rollout buffer stores data in CPU RAM by default.
*   **Optimization**: For "Isaac Gym"-style training, the rollout buffer should live on the GPU.
    *   *Option A*: Use SB3's `DictRolloutBuffer`? No; it is still a standard CPU-backed buffer.
    *   *Option B*: Modify SB3, or use a library designed for GPU-only training such as `skrl` or `cleanrl`.
    *   *Option C (easiest)*: Accept that `collect_rollouts` does one copy per step to CPU RAM for storage, while **inference** (the forward pass) stays on the GPU.
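
Option C can be sketched as a generic loop (a random draw stands in for the policy forward pass; the buffer shapes and the CPU fallback are assumptions for illustration):

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Hypothetical Option C rollout loop: inference stays on `device`,
# and each step does exactly one bulk device-to-host copy for storage.
n_steps, num_envs, obs_dim = 8, 4, 16
buffer_obs = torch.zeros((n_steps, num_envs, obs_dim))  # CPU-resident storage

obs = torch.zeros((num_envs, obs_dim), device=device)
for t in range(n_steps):
    with torch.no_grad():
        # Stand-in for policy(obs); a real loop would call the network here
        actions = torch.randint(0, 2000, (num_envs,), device=device)
    # env.step(actions) would return fresh GPU tensors; we reuse `obs` here
    buffer_obs[t].copy_(obs.cpu())  # the single D2H copy per step
```

The point of the sketch is that the per-step cost is one contiguous `(num_envs, obs_dim)` transfer, not thousands of small per-env copies.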

## 4. Remaining Logic Gaps

The current `VectorEnvGPU` POC has simplified logic in `resolve_bytecode_device`. Before production:

1.  **Complete Opcode Support**: `O_CHARGE`, `O_CHOOSE`, `O_ADD_H` need full card movement logic (finding indices, updating arrays).
2.  **Opponent Simulation**: `step_kernel` currently simulates a random opponent. The `step_opponent_vectorized` logic from the CPU env needs to be ported to a CUDA kernel.
3.  **Collision Handling**: In `resolve_bytecode_device`, use atomic operations or careful ordering wherever multiple effects could modify the same global state (rare in this game; since `batch_global_ctx` is per-env, it is currently safe).
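
As a reference point for item 2, here is a NumPy sketch of what a vectorized random-opponent step could look like before porting it to a CUDA kernel. The function name, mask layout, and signature are assumptions for illustration, not the actual `step_opponent_vectorized` API:

```python
import numpy as np

def sample_legal_actions(action_masks: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Pick one legal action per env, uniformly, from a (num_envs, num_actions) bool mask."""
    weights = action_masks.astype(np.float64)
    weights /= weights.sum(axis=1, keepdims=True)  # uniform over legal actions
    cdf = np.cumsum(weights, axis=1)               # inverse-CDF sampling, fully vectorized
    u = rng.random((action_masks.shape[0], 1))     # one uniform draw per env
    return (cdf < u).sum(axis=1)                   # index of first cdf entry >= u
```

Because every step is an array-wide operation with no per-env Python loop, the same structure maps naturally onto a one-thread-per-env CUDA kernel.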

## 5. Performance Expectations

*   **Current CPU**: ~10k SPS (128 envs).
*   **Target GPU**: ~100k-500k SPS (4096+ envs).
*   **Bottleneck**: will shift from PCIe transfers to the policy network's forward pass.
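
To verify these numbers during integration, a small throughput helper is enough (a generic sketch; `step_fn` stands in for one vectorized `env.step` call):

```python
import time

def measure_sps(step_fn, num_envs: int, n_iters: int = 100) -> float:
    """Return environment steps-per-second: each vectorized step advances num_envs transitions."""
    start = time.perf_counter()
    for _ in range(n_iters):
        step_fn()
    elapsed = time.perf_counter() - start
    return (n_iters * num_envs) / elapsed
```

At 4096 envs, even ~25 to ~120 vectorized steps per second lands in the ~100k-500k SPS target range.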