| 1 |
+
# GPU Environment Training Integration Guide
|
| 2 |
+
|
| 3 |
+
This guide explains how to integrate the new `VectorEnvGPU` into the existing training pipeline (`train_optimized.py`) to achieve production-level performance.
|
## 1. Replacing the Environment Wrapper

Currently, `train_optimized.py` uses `BatchedSubprocVecEnv`, which manages multiple CPU processes. The GPU environment is a single object that manages thousands of environments internally.

### Steps:

1. **Import `VectorEnvGPU`**:

   ```python
   from ai.vector_env_gpu import VectorEnvGPU, HAS_CUDA
   ```
2. **Conditional Initialization**:
   In the `train()` function, replace the `BatchedSubprocVecEnv` block:

   ```python
   if HAS_CUDA and os.getenv("USE_GPU_ENV") == "1":
       print(" [GPU] Initializing GPU-Resident Environment...")
       # num_envs should be large (e.g., 4096) to saturate the GPU
       env = VectorEnvGPU(num_envs=4096, seed=42)

       # VectorEnvGPU usually doesn't need a VecEnv wrapper,
       # but SB3 expects a specific API, so we add a thin adapter.
       env = SB3CudaAdapter(env)
   else:
       # Existing CPU logic
       env_fns = [...]
       env = BatchedSubprocVecEnv(...)
   ```
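For reference, the `HAS_CUDA` flag imported above is assumed to be a simple capability probe. A minimal sketch of how such a flag might be defined (the CuPy-based probe below is an illustration, not the actual contents of `ai/vector_env_gpu.py`):

```python
# Hypothetical capability probe for HAS_CUDA; the real module may differ.
try:
    import cupy as cp
    # True only if CuPy imports and at least one CUDA device is visible
    HAS_CUDA = cp.cuda.runtime.getDeviceCount() > 0
except Exception:
    # CuPy missing or no usable CUDA runtime: fall back to the CPU path.
    HAS_CUDA = False
```

Probing at import time like this lets the training script stay launchable on CPU-only machines without touching any CUDA code paths.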
## 2. The `SB3CudaAdapter`

Stable Baselines 3 expects NumPy arrays on the CPU by default. To fully utilize the GPU env, we must intercept the data *before* SB3 tries to convert it, or use a custom policy that accepts Torch tensors directly.

However, `MaskablePPO` in `sb3_contrib` might try to cast inputs to NumPy.

**Strategy: Zero-Copy Torch Wrapper**
```python
import numpy as np
import torch
from gymnasium import spaces


class SB3CudaAdapter:
    def __init__(self, gpu_env):
        self.env = gpu_env
        self.num_envs = gpu_env.num_envs
        # Define spaces (mocked for SB3)
        self.observation_space = spaces.Box(low=0, high=1, shape=(8192,), dtype=np.float32)
        self.action_space = spaces.Discrete(2000)

    def reset(self):
        # Returns a torch tensor on the GPU
        obs, _ = self.env.reset()
        return torch.as_tensor(obs, device='cuda')

    def step(self, actions):
        # Actions come from the policy (torch tensor on GPU);
        # pass them directly to the env.
        obs, rewards, dones, infos = self.env.step(actions)

        # Wrap outputs in torch tensors (zero copy:
        # obs is already a CuPy/device array)
        t_obs = torch.as_tensor(obs, device='cuda')
        t_rewards = torch.as_tensor(rewards, device='cuda')
        t_dones = torch.as_tensor(dones, device='cuda')

        return t_obs, t_rewards, t_dones, infos
```
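The zero-copy claim above rests on `torch.as_tensor` wrapping the existing device buffer (via the CUDA array interface) rather than copying it. NumPy's `asarray` exhibits the same share-don't-copy semantics on the CPU, which makes for a quick illustration without needing a GPU:

```python
import numpy as np

# CPU-side analogue of the zero-copy wrap: asarray on an existing
# ndarray returns the same buffer rather than a copy, so writes
# through the "wrapper" are visible in the original array.
buf = np.zeros(4, dtype=np.float32)
view = np.asarray(buf)   # no copy: view and buf share memory
view[0] = 1.0
assert buf[0] == 1.0     # the write landed in the original buffer
```

If the wrap did copy (e.g. `np.array(buf)` or `torch.tensor(obs)`), every `step()` would pay a device-to-device or device-to-host transfer, which is exactly what this adapter is designed to avoid.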
## 3. PPO Policy Modifications

Standard SB3 algorithms often force `cpu()` calls. For maximum speed, you might need to subclass `MaskablePPO` or `MlpPolicy` to ensure it accepts GPU tensors without moving them.

*   **Check `rollout_buffer.py`**: SB3's rollout buffer stores data in CPU RAM by default.
*   **Optimization**: For "Isaac Gym"-style training, the rollout buffer should live on the GPU.
    *   *Option A*: Use SB3's `DictRolloutBuffer`? No — it is still a standard CPU-resident buffer.
    *   *Option B*: Modify SB3, or use a library designed for GPU-only training such as `skrl` or `cleanrl`.
    *   *Option C (Easiest)*: Accept that `collect_rollouts` does one copy to CPU RAM for storage, but keep the **inference** (forward pass) on the GPU.
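The buffer change the options above circle around can be sketched as a minimal device-resident rollout buffer: preallocate the storage once and write rows in place, so collection never reallocates or copies. NumPy stands in here for CuPy/torch-on-CUDA, and the class name and shapes are hypothetical:

```python
import numpy as np


class DeviceRolloutBuffer:
    """Minimal sketch of a device-resident rollout buffer (hypothetical).

    NumPy stands in for CuPy / torch tensors allocated on 'cuda':
    storage is preallocated once, and each step writes in place.
    """

    def __init__(self, n_steps, num_envs, obs_dim):
        self.obs = np.zeros((n_steps, num_envs, obs_dim), dtype=np.float32)
        self.rewards = np.zeros((n_steps, num_envs), dtype=np.float32)
        self.dones = np.zeros((n_steps, num_envs), dtype=bool)
        self.pos = 0

    def add(self, obs, rewards, dones):
        # In-place writes into the preallocated slabs; no new allocations
        self.obs[self.pos] = obs
        self.rewards[self.pos] = rewards
        self.dones[self.pos] = dones
        self.pos += 1

    def full(self):
        return self.pos == self.obs.shape[0]
```

With the GPU version of this layout, advantage/return computation can also run as vectorized tensor ops on the device, so the only host transfer left is whatever Option C tolerates for logging.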
## 4. Remaining Logic Gaps

The current `VectorEnvGPU` POC has simplified logic in `resolve_bytecode_device`. Before production:

1. **Complete Opcode Support**: `O_CHARGE`, `O_CHOOSE`, and `O_ADD_H` need full card-movement logic (finding indices, updating arrays).
2. **Opponent Simulation**: `step_kernel` currently simulates a random opponent. The `step_opponent_vectorized` logic from the CPU env needs to be ported to a CUDA kernel.
3. **Collision Handling**: In `resolve_bytecode_device`, we must use atomic operations or careful ordering if multiple effects try to modify the same global state (rare in this game, and since `batch_global_ctx` is per-env, it is currently safe).
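For gap 2, the random opponent can be expressed as a fully vectorized masked sample, which ports naturally to a CUDA kernel. A CPU sketch under assumed shapes (a `(num_envs, num_actions)` boolean legality mask; the function name is illustrative, not the POC's API):

```python
import numpy as np


def random_opponent_actions(rng, legal_mask):
    """Pick one uniformly-random legal action per env, vectorized.

    legal_mask: (num_envs, num_actions) bool array; each row is assumed
    to have at least one True entry (at minimum a 'pass' action).
    """
    # Draw a random score per action, force illegal actions below any
    # legal score, then take the per-row argmax. Since scores are i.i.d.
    # uniform, the argmax over legal columns is a uniform legal choice.
    scores = rng.random(legal_mask.shape)
    scores[~legal_mask] = -1.0
    return scores.argmax(axis=1)
```

The same pattern (per-thread RNG draw, masked reduce) maps one env per CUDA thread, so no host round-trip is needed between the agent's step and the opponent's reply.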
## 5. Performance Expectations

*   **Current CPU**: ~10k SPS (128 envs).
*   **Target GPU**: ~100k-500k SPS (4096+ envs).
*   **Bottleneck**: Will shift from PCIe transfer to the policy network forward pass.
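Taken at face value, these targets imply a 10x-50x overall speedup even though each individual GPU env actually steps slower than a CPU env; the win comes entirely from batch width:

```python
# Back-of-envelope arithmetic on the numbers above
cpu_sps, cpu_envs = 10_000, 128
gpu_sps_lo, gpu_sps_hi, gpu_envs = 100_000, 500_000, 4096

overall_speedup = (gpu_sps_lo / cpu_sps, gpu_sps_hi / cpu_sps)  # 10x to 50x
per_env_cpu = cpu_sps / cpu_envs     # ~78 steps/s per CPU env
per_env_gpu = gpu_sps_lo / gpu_envs  # ~24 steps/s per GPU env (low end)
```

This is why `num_envs` must be large: the GPU only pays off when enough environments run in lockstep to keep the kernels and the policy forward pass saturated.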