trioskosmos committed
Commit d828770 · verified · 1 Parent(s): 8855e6c

Upload ai/TRAINING_INTEGRATION_GUIDE.md with huggingface_hub

# GPU Environment Training Integration Guide

This guide explains how to integrate the new `VectorEnvGPU` into the existing training pipeline (`train_optimized.py`) to achieve production-level performance.

## 1. Replacing the Environment Wrapper

Currently, `train_optimized.py` uses `BatchedSubprocVecEnv`, which manages multiple CPU worker processes. The GPU environment is a single object that manages thousands of environments internally, so no subprocess machinery is needed.

### Steps:

1. **Import `VectorEnvGPU`**:

   ```python
   from ai.vector_env_gpu import VectorEnvGPU, HAS_CUDA
   ```

2. **Conditional Initialization**:
   In the `train()` function, replace the `BatchedSubprocVecEnv` block:

   ```python
   if HAS_CUDA and os.getenv("USE_GPU_ENV") == "1":
       print(" [GPU] Initializing GPU-Resident Environment...")
       # num_envs should be large (e.g., 4096) to saturate the GPU
       env = VectorEnvGPU(num_envs=4096, seed=42)

       # VectorEnvGPU usually doesn't need a VecEnv wrapper, but SB3
       # expects a specific API, so a thin adapter may be required.
       env = SB3CudaAdapter(env)
   else:
       # Existing CPU logic
       env_fns = [...]
       env = BatchedSubprocVecEnv(...)
   ```

## 2. The `SB3CudaAdapter`

Stable Baselines 3 expects NumPy arrays on the CPU by default. To fully utilize the GPU env, we must intercept the data *before* SB3 tries to convert it, or use a custom policy that accepts Torch tensors directly.

Note that `MaskablePPO` in `sb3_contrib` may try to cast inputs to NumPy.

**Strategy: Zero-Copy Torch Wrapper**

```python
import numpy as np
import torch
from gymnasium import spaces

class SB3CudaAdapter:
    def __init__(self, gpu_env):
        self.env = gpu_env
        self.num_envs = gpu_env.num_envs
        # Define spaces (mocked here so SB3 can introspect them)
        self.observation_space = spaces.Box(low=0, high=1, shape=(8192,), dtype=np.float32)
        self.action_space = spaces.Discrete(2000)

    def reset(self):
        # Returns a torch tensor that stays on the GPU
        obs, _ = self.env.reset()
        return torch.as_tensor(obs, device='cuda')

    def step(self, actions):
        # Actions come from the policy as a torch tensor on the GPU;
        # pass them directly to the env, no host round-trip
        obs, rewards, dones, infos = self.env.step(actions)

        # Wrap outputs as torch tensors (zero copy:
        # obs is already a CuPy/device array)
        t_obs = torch.as_tensor(obs, device='cuda')
        t_rewards = torch.as_tensor(rewards, device='cuda')
        t_dones = torch.as_tensor(dones, device='cuda')

        return t_obs, t_rewards, t_dones, infos
```
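The zero-copy behavior above relies on `torch.as_tensor` adopting the producer's buffer rather than copying it. A minimal CPU-side illustration (NumPy exposes `__array_interface__` the same way CuPy exposes `__cuda_array_interface__`, so the same sharing applies on the device):

```python
import numpy as np
import torch

# torch.as_tensor wraps the existing buffer when dtype and device already
# match, so no data is copied; mutating the source is visible in the tensor.
arr = np.zeros(8, dtype=np.float32)
t = torch.as_tensor(arr)   # shares memory with `arr`
arr[0] = 42.0
print(t[0].item())         # -> 42.0: the buffer is shared, not copied
```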

## 3. PPO Policy Modifications

Standard SB3 algorithms often force `cpu()` calls. For maximum speed, you might need to subclass `MaskablePPO` or `MlpPolicy` to ensure it accepts GPU tensors without moving them.

* **Check `rollout_buffer.py`**: SB3's rollout buffer stores data in CPU RAM by default.
* **Optimization**: For "Isaac Gym"-style training, the rollout buffer should live on the GPU.
  * *Option A*: Use SB3's `DictRolloutBuffer`? No; it has the same CPU-storage limitation as the standard buffer.
  * *Option B*: Modify SB3, or use a library designed for GPU-only training such as `skrl` or `cleanrl`.
  * *Option C (easiest)*: Accept that `collect_rollouts` does one copy to CPU RAM for storage, while **inference** (the forward pass) stays on the GPU.
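Option C can be sketched as follows; `collect_step`, the stub `policy`, and `_Env` are illustrative names, not SB3 internals:

```python
import torch

# Illustrative Option C: the forward pass and env step stay on the device;
# only the rollout-buffer write pays a device-to-host copy each step.
def collect_step(policy, env, buffer, obs):
    with torch.no_grad():
        actions = policy(obs)                  # inference stays on the device
    next_obs, rewards, dones, infos = env.step(actions)
    buffer.append((                            # the one copy to CPU RAM
        obs.cpu().numpy(),
        actions.cpu().numpy(),
        rewards.cpu().numpy(),
        dones.cpu().numpy(),
    ))
    return next_obs

# Tiny CPU stubs so the wiring runs anywhere; on a GPU box the tensors
# would live on 'cuda' and .cpu() would be the only transfer.
class _Env:
    def step(self, actions):
        n = actions.shape[0]
        return torch.zeros(n, 8), torch.zeros(n), torch.zeros(n, dtype=torch.bool), [{}] * n

def policy(obs):
    return torch.randint(0, 2000, (obs.shape[0],))

buffer = []
next_obs = collect_step(policy, _Env(), buffer, torch.zeros(4, 8))
```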

## 4. Remaining Logic Gaps

The current `VectorEnvGPU` POC has simplified logic in `resolve_bytecode_device`. Before production:

1. **Complete Opcode Support**: `O_CHARGE`, `O_CHOOSE`, and `O_ADD_H` need full card-movement logic (finding indices, updating arrays).
2. **Opponent Simulation**: `step_kernel` currently simulates a random opponent. The `step_opponent_vectorized` logic from the CPU env needs to be ported to a CUDA kernel.
3. **Collision Handling**: In `resolve_bytecode_device`, we need atomic operations or careful ordering if multiple effects try to modify the same global state (rare in this game; `batch_global_ctx` is per-env, so it's safe there).
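The collision concern in item 3 is the classic scatter-with-duplicates problem. A host-side NumPy analogue (nothing here is taken from the kernel code) shows why plain writes lose updates while atomic-style accumulation does not:

```python
import numpy as np

# Two effects target the same state slot (index 1). Buffered fancy indexing
# keeps only one update -- exactly the race that atomics prevent on the GPU.
state = np.zeros(3)
idx = np.array([1, 1])
deltas = np.array([5.0, 7.0])

state[idx] += deltas            # buffered scatter: one update is lost
print(state[1])                 # -> 7.0, not 12.0

state = np.zeros(3)
np.add.at(state, idx, deltas)   # unbuffered, like cuda.atomic.add per element
print(state[1])                 # -> 12.0, both updates land
```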

## 5. Performance Expectations

* **Current CPU**: ~10k SPS (128 envs).
* **Target GPU**: ~100k-500k SPS (4096+ envs).
* **Bottleneck**: Shifts from PCI-E transfer to the policy network's forward pass.
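As a sanity check on these targets (assuming all environments step in lockstep), 100k SPS at 4096 envs leaves roughly a 41 ms budget per synchronized batch step, env step plus policy forward pass included:

```python
# Back-of-envelope step budget for the assumed target throughput.
num_envs = 4096
target_sps = 100_000                           # aggregate env-steps per second
batch_steps_per_sec = target_sps / num_envs    # ~24.4 synchronized steps/s
step_budget_ms = 1000 / batch_steps_per_sec
print(round(step_budget_ms, 2))                # -> 40.96 (ms per batched step)
```

At the upper 500k SPS target the same arithmetic leaves about 8 ms, which is why the forward pass becomes the bottleneck.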