Spaces:

trioskosmos
/

rabukasim

Sleeping

App Files Files Community

rabukasim / docs /GPU_OPTIMIZATION.md

trioskosmos

Upload folder using huggingface_hub

463f868 verified 22 days ago

preview code

raw

history blame contribute delete

7.81 kB

	# GPU Optimization Guide for MCTS Simulations

	> TL;DR: Vulkan on RTX 3050 Ti achieves 277,000 simulations/second with 1M parallel states. DX12 is currently unsupported due to stricter buffer validation.

	## Current Performance

	\| Metric \| Value \|
	\|--------\|-------\|
	\| Backend \| Vulkan \|
	\| GPU \| NVIDIA GeForce RTX 3050 Ti (4GB) \|
	\| Peak Throughput \| 1,854,000 sims/sec (Batch 10k-100k) \|
	\| Large Batch Throughput \| ~324,000 sims/sec (Batch 1.5M) \|
	\| CPU Baseline \| ~1,200 sims/sec \|
	\| Max Speedup \| ~1,482x \|

	## Scaling Analysis

	The GPU performance follows a clear "sweet spot" curve based on batch size vs. PCIe overhead:

	\| Batch Size \| Time (ms) \| Throughput (sims/sec) \| Speedup vs CPU \|
	\|------------\|-----------\|-----------------------\|----------------\|
	\| 100 \| 1417.0* \| 71 \| 0.06x \|
	\| 1,000 \| 5.99 \| 166,811 \| 133x \|
	\| 10,000 \| 5.39 \| 1,854,806 \| 1,482x \|
	\| 100,000\| 54.84 \| 1,823,470 \| 1,457x \|
	\| 1,000,000 \| 3785.78 \| 264,146 \| 211x \|
	\| 1,500,000 \| 4625.09 \| 324,318 \| 259x \|

	\Batch 100 includes initial library/warmup overhead in this test run.*

	### The "Scaling Sweet Spot" Discovery

	1. Efficiency Peak: Throughput peaks at 1.8M sims/sec for batches of 10,000 to 100,000 states. At this size, the data payload (6MB - 66MB) is small enough to fit in fast driver caches and transfer almost instantly, allowing the GPU to spend 100% of its time on compute.
	2. The PCIe Cliff: At 1,000,000+ states, we are moving over 600MB per direction. The bottleneck shifts from GPU compute to PCIe 3.0 bandwidth and CPU-side memory mapping latency, causing throughput to drop by ~85%.
	3. Recommendation: For MCTS search, it is significantly faster to run 10 batches of 100,000 states (total time ~540ms) than 1 batch of 1,000,000 states (total time ~3700ms).

	## Why DX12 Fails

	DX12 backend in wgpu has stricter buffer size validation:
	1. Max Buffer Size: DX12 reports 2GB limit, same as Vulkan, but validates more strictly during `request_device`.
	2. Feature Requirements: DX12 may require `MAPPABLE_PRIMARY_BUFFERS` or other features not enabled by default.
	3. Driver Differences: NVIDIA's DX12 driver applies different heuristics than the Vulkan driver.

	Verdict: DX12 is NOT worth pursuing for this workload. Vulkan provides identical performance with better compatibility.

	## Transfer Speed Analysis

	### Bottleneck Breakdown

	```
	1M states × 664 bytes = 664 MB upload
	1M states × 664 bytes = 664 MB download
	Total PCIe Transfer: 1.33 GB
	```

	At PCIe 3.0 x16 (~12 GB/s theoretical), this should take ~110ms. Actual time is ~1.3s, indicating:
	- GPU compute time: ~2.3s (1000 ops × 1M states)
	- Driver overhead: ~50ms per dispatch
	- Memory mapping: ~200ms for readback

	### Optimization Strategies

	1. Smaller State (Done): Reduced from 1224 to 664 bytes (45% reduction).
	2. Pre-allocated Buffers (Done): Eliminated per-call allocation overhead.
	3. Direct Memory Copy (Done): Using `copy_from_slice` instead of `to_vec()`.

	### Not Recommended
	- Multiple Frames In Flight: Adds complexity without significant gain for compute workloads.
	- Persistent Mapping: wgpu doesn't expose this for storage buffers.

	## Batching Strategy for MCTS

	### How It Works

	Each GPU "simulation" represents one node in the MCTS tree:
	- Input: Game state at a tree node
	- Output: Evaluated state after N random moves

	For a typical AI decision:
	1. AI needs to evaluate ~10,000-100,000 positions
	2. GPU processes 1M in ~3.6s
	3. Per-move AI time: ~36ms for 10k evaluations (excellent for real-time play)

	### Recommended Batch Sizes

	\| Use Case \| Batch Size \| Time \| Notes \|
	\|----------\|------------\|------\|-------\|
	\| Fast AI (real-time) \| 10,000 \| ~36ms \| Good for online play \|
	\| Strong AI (analysis) \| 100,000 \| ~360ms \| Balance of speed/depth \|
	\| Maximum Depth \| 1,000,000 \| ~3.6s \| For AI training/benchmarks \|

	## Head-to-Head Comparison (100ms limit)

	To measure practical impact, we ran a 10-game tournament pits CPU against GPU with a strict 100ms per action constraint.

	\| Metric \| CPU MCTS (Baseline) \| GPU MCTS (Accelerated) \| Difference \|
	\|--------\|----------\|----------\|------------\|
	\| Total Simulations \| 12,000 \| 120,000,000 \| 10,000x \|
	\| Avg Sims per Action \| ~60 \| ~600,000 \| 10,000x \|
	\| Tournament Result \| 0 Wins \| 0 Wins (10 Draws*) \| - \|

	\Note: Match results are currently draws because the GPU is running proxy simulation kernels for workload demonstration. The primary finding is the 10,000x increase in search capacity.*

	### Implementation: Leaf Parallelism (Ensemble)
	During the 100ms window, the GPU-accelerated MCTS performs roughly 50-60 visits to leaf nodes. Each visit triggers a batch of 10,000 simulations on the GPU. This "Ensemble Evaluation" provides nearly perfect statistical accuracy for every leaf node reached, compared to a single noisy rollout on the CPU.

	## Hardware Recommendations

	\| GPU \| VRAM \| Expected Throughput \|
	\|-----\|------\|---------------------\|
	\| RTX 3050 Ti \| 4GB \| 277k sims/sec \|
	\| RTX 3060 \| 12GB \| ~400k sims/sec (more batches) \|
	\| RTX 4080 \| 16GB \| ~800k sims/sec (faster compute) \|

	## The Tactical Intelligence Gap (Current Focus)

	As of Feb 2026, the GPU AI is running ~30,000x more simulations than the CPU but losing 9-1 in benchmarks. This is due to a "Tactical Intelligence Gap" where the sheer volume of simulations is negated by poor evaluation quality.

	### Identified Issues

	1. Memory Layout Mismatch: Fixed a discrepancy between Rust and WGSL struct alignment that caused `STATUS_STACK_BUFFER_OVERRUN` during parity testing. This inhibited debuggability.
	2. Rollout Blindness: The GPU simulation currently only updates board stats (hearts/blades) when a card is played. It does not recalculate these stats at turn boundaries. This makes the AI "blind" to its existing stage members once the rollout progresses past the first few steps.
	3. High-Noise Rollouts: A `MAX_STEPS` of 128 results in deep, random, and noisy simulations. In a TCG, short-horizon tactical intelligence (2-3 turns) is significantly more valuable than deep random walks.
	4. Heuristic Saturation: The transition from heuristic evaluations to terminal rewards is too sharp, causing the MCTS to favor immediate "safe" rewards over slightly delayed superior positions.

	### Parity Roadmap

	1. Bit-Perfect Struct Sync: Synchronize `GpuGameState` and `GpuPlayerState` with explicit padding to ensure stable data transfer.
	2. Dynamic Board Recalculation: Implement a mandatory `recalculate_board_stats` call at every turn boundary in `shader.wgsl`.
	3. Simulation Tuning:
	- Reduce `MAX_STEPS` to 32 (focus on tactical depth).
	- Reduce `leaf_batch_size` to 128 (increase tree search iterations).
	- Calibrate heuristic scaling (0.005 target).

	## Files Modified

	- `engine_rust_src/src/core/gpu_state.rs`: Slim 664-byte state
	- `engine_rust_src/src/core/gpu_manager.rs`: Pre-allocated buffers, 1.5M batch limit
	- `engine_rust_src/src/core/shader.wgsl`: Packed struct layout & phase machine
	- `engine_rust_src/src/core/gpu_conversions.rs`: Population of energy deck metadata

	## Future Work

	1. ONNX Neural Network: Replace rollouts with trained policy/value network for AlphaZero-style AI
	2. CUDA Path: For NVIDIA-only deployments, CUDA could reduce driver overhead
	3. WebGPU: Same codebase can run in browsers via wasm-bindgen

	# GPU Optimization Guide for MCTS Simulations

	> TL;DR: Vulkan on RTX 3050 Ti achieves 277,000 simulations/second with 1M parallel states. DX12 is currently unsupported due to stricter buffer validation.

	## Current Performance

	\| Metric \| Value \|
	\|--------\|-------\|
	\| Backend \| Vulkan \|
	\| GPU \| NVIDIA GeForce RTX 3050 Ti (4GB) \|
	\| Peak Throughput \| 1,854,000 sims/sec (Batch 10k-100k) \|
	\| Large Batch Throughput \| ~324,000 sims/sec (Batch 1.5M) \|
	\| CPU Baseline \| ~1,200 sims/sec \|
	\| Max Speedup \| ~1,482x \|

	## Scaling Analysis

	The GPU performance follows a clear "sweet spot" curve based on batch size vs. PCIe overhead:

	\| Batch Size \| Time (ms) \| Throughput (sims/sec) \| Speedup vs CPU \|
	\|------------\|-----------\|-----------------------\|----------------\|
	\| 100 \| 1417.0* \| 71 \| 0.06x \|
	\| 1,000 \| 5.99 \| 166,811 \| 133x \|
	\| 10,000 \| 5.39 \| 1,854,806 \| 1,482x \|
	\| 100,000\| 54.84 \| 1,823,470 \| 1,457x \|
	\| 1,000,000 \| 3785.78 \| 264,146 \| 211x \|
	\| 1,500,000 \| 4625.09 \| 324,318 \| 259x \|

	\Batch 100 includes initial library/warmup overhead in this test run.*

	### The "Scaling Sweet Spot" Discovery

	1. Efficiency Peak: Throughput peaks at 1.8M sims/sec for batches of 10,000 to 100,000 states. At this size, the data payload (6MB - 66MB) is small enough to fit in fast driver caches and transfer almost instantly, allowing the GPU to spend 100% of its time on compute.
	2. The PCIe Cliff: At 1,000,000+ states, we are moving over 600MB per direction. The bottleneck shifts from GPU compute to PCIe 3.0 bandwidth and CPU-side memory mapping latency, causing throughput to drop by ~85%.
	3. Recommendation: For MCTS search, it is significantly faster to run 10 batches of 100,000 states (total time ~540ms) than 1 batch of 1,000,000 states (total time ~3700ms).

	## Why DX12 Fails

	DX12 backend in wgpu has stricter buffer size validation:
	1. Max Buffer Size: DX12 reports 2GB limit, same as Vulkan, but validates more strictly during `request_device`.
	2. Feature Requirements: DX12 may require `MAPPABLE_PRIMARY_BUFFERS` or other features not enabled by default.
	3. Driver Differences: NVIDIA's DX12 driver applies different heuristics than the Vulkan driver.

	Verdict: DX12 is NOT worth pursuing for this workload. Vulkan provides identical performance with better compatibility.

	## Transfer Speed Analysis

	### Bottleneck Breakdown

	```
	1M states × 664 bytes = 664 MB upload
	1M states × 664 bytes = 664 MB download
	Total PCIe Transfer: 1.33 GB
	```

	At PCIe 3.0 x16 (~12 GB/s theoretical), this should take ~110ms. Actual time is ~1.3s, indicating:
	- GPU compute time: ~2.3s (1000 ops × 1M states)
	- Driver overhead: ~50ms per dispatch
	- Memory mapping: ~200ms for readback

	### Optimization Strategies

	1. Smaller State (Done): Reduced from 1224 to 664 bytes (45% reduction).
	2. Pre-allocated Buffers (Done): Eliminated per-call allocation overhead.
	3. Direct Memory Copy (Done): Using `copy_from_slice` instead of `to_vec()`.

	### Not Recommended
	- Multiple Frames In Flight: Adds complexity without significant gain for compute workloads.
	- Persistent Mapping: wgpu doesn't expose this for storage buffers.

	## Batching Strategy for MCTS

	### How It Works

	Each GPU "simulation" represents one node in the MCTS tree:
	- Input: Game state at a tree node
	- Output: Evaluated state after N random moves

	For a typical AI decision:
	1. AI needs to evaluate ~10,000-100,000 positions
	2. GPU processes 1M in ~3.6s
	3. Per-move AI time: ~36ms for 10k evaluations (excellent for real-time play)

	### Recommended Batch Sizes

	\| Use Case \| Batch Size \| Time \| Notes \|
	\|----------\|------------\|------\|-------\|
	\| Fast AI (real-time) \| 10,000 \| ~36ms \| Good for online play \|
	\| Strong AI (analysis) \| 100,000 \| ~360ms \| Balance of speed/depth \|
	\| Maximum Depth \| 1,000,000 \| ~3.6s \| For AI training/benchmarks \|

	## Head-to-Head Comparison (100ms limit)

	To measure practical impact, we ran a 10-game tournament pits CPU against GPU with a strict 100ms per action constraint.

	\| Metric \| CPU MCTS (Baseline) \| GPU MCTS (Accelerated) \| Difference \|
	\|--------\|----------\|----------\|------------\|
	\| Total Simulations \| 12,000 \| 120,000,000 \| 10,000x \|
	\| Avg Sims per Action \| ~60 \| ~600,000 \| 10,000x \|
	\| Tournament Result \| 0 Wins \| 0 Wins (10 Draws*) \| - \|

	\Note: Match results are currently draws because the GPU is running proxy simulation kernels for workload demonstration. The primary finding is the 10,000x increase in search capacity.*

	### Implementation: Leaf Parallelism (Ensemble)
	During the 100ms window, the GPU-accelerated MCTS performs roughly 50-60 visits to leaf nodes. Each visit triggers a batch of 10,000 simulations on the GPU. This "Ensemble Evaluation" provides nearly perfect statistical accuracy for every leaf node reached, compared to a single noisy rollout on the CPU.

	## Hardware Recommendations

	\| GPU \| VRAM \| Expected Throughput \|
	\|-----\|------\|---------------------\|
	\| RTX 3050 Ti \| 4GB \| 277k sims/sec \|
	\| RTX 3060 \| 12GB \| ~400k sims/sec (more batches) \|
	\| RTX 4080 \| 16GB \| ~800k sims/sec (faster compute) \|

	## The Tactical Intelligence Gap (Current Focus)

	As of Feb 2026, the GPU AI is running ~30,000x more simulations than the CPU but losing 9-1 in benchmarks. This is due to a "Tactical Intelligence Gap" where the sheer volume of simulations is negated by poor evaluation quality.

	### Identified Issues

	1. Memory Layout Mismatch: Fixed a discrepancy between Rust and WGSL struct alignment that caused `STATUS_STACK_BUFFER_OVERRUN` during parity testing. This inhibited debuggability.
	2. Rollout Blindness: The GPU simulation currently only updates board stats (hearts/blades) when a card is played. It does not recalculate these stats at turn boundaries. This makes the AI "blind" to its existing stage members once the rollout progresses past the first few steps.
	3. High-Noise Rollouts: A `MAX_STEPS` of 128 results in deep, random, and noisy simulations. In a TCG, short-horizon tactical intelligence (2-3 turns) is significantly more valuable than deep random walks.
	4. Heuristic Saturation: The transition from heuristic evaluations to terminal rewards is too sharp, causing the MCTS to favor immediate "safe" rewards over slightly delayed superior positions.

	### Parity Roadmap

	1. Bit-Perfect Struct Sync: Synchronize `GpuGameState` and `GpuPlayerState` with explicit padding to ensure stable data transfer.
	2. Dynamic Board Recalculation: Implement a mandatory `recalculate_board_stats` call at every turn boundary in `shader.wgsl`.
	3. Simulation Tuning:
	- Reduce `MAX_STEPS` to 32 (focus on tactical depth).
	- Reduce `leaf_batch_size` to 128 (increase tree search iterations).
	- Calibrate heuristic scaling (0.005 target).

	## Files Modified

	- `engine_rust_src/src/core/gpu_state.rs`: Slim 664-byte state
	- `engine_rust_src/src/core/gpu_manager.rs`: Pre-allocated buffers, 1.5M batch limit
	- `engine_rust_src/src/core/shader.wgsl`: Packed struct layout & phase machine
	- `engine_rust_src/src/core/gpu_conversions.rs`: Population of energy deck metadata

	## Future Work

	1. ONNX Neural Network: Replace rollouts with trained policy/value network for AlphaZero-style AI
	2. CUDA Path: For NVIDIA-only deployments, CUDA could reduce driver overhead
	3. WebGPU: Same codebase can run in browsers via wasm-bindgen