Spaces:

trioskosmos
/

rabukasim

Sleeping

App Files Files Community

rabukasim / docs /GPU_OPTIMIZATION.md

trioskosmos

Upload folder using huggingface_hub

463f868 verified 22 days ago

preview code

raw

history blame contribute delete

7.81 kB

GPU Optimization Guide for MCTS Simulations

TL;DR: Vulkan on RTX 3050 Ti achieves 277,000 simulations/second with 1M parallel states. DX12 is currently unsupported due to stricter buffer validation.

Current Performance

Metric	Value
Backend	Vulkan
GPU	NVIDIA GeForce RTX 3050 Ti (4GB)
Peak Throughput	1,854,000 sims/sec (Batch 10k-100k)
Large Batch Throughput	~324,000 sims/sec (Batch 1.5M)
CPU Baseline	~1,200 sims/sec
Max Speedup	~1,482x

Scaling Analysis

The GPU performance follows a clear "sweet spot" curve based on batch size vs. PCIe overhead:

Batch Size	Time (ms)	Throughput (sims/sec)	Speedup vs CPU
100	1417.0*	71	0.06x
1,000	5.99	166,811	133x
10,000	5.39	1,854,806	1,482x
100,000	54.84	1,823,470	1,457x
1,000,000	3785.78	264,146	211x
1,500,000	4625.09	324,318	259x

*Batch 100 includes initial library/warmup overhead in this test run.

The "Scaling Sweet Spot" Discovery

Efficiency Peak: Throughput peaks at 1.8M sims/sec for batches of 10,000 to 100,000 states. At this size, the data payload (6MB - 66MB) is small enough to fit in fast driver caches and transfer almost instantly, allowing the GPU to spend 100% of its time on compute.
The PCIe Cliff: At 1,000,000+ states, we are moving over 600MB per direction. The bottleneck shifts from GPU compute to PCIe 3.0 bandwidth and CPU-side memory mapping latency, causing throughput to drop by ~85%.
Recommendation: For MCTS search, it is significantly faster to run 10 batches of 100,000 states (total time ~540ms) than 1 batch of 1,000,000 states (total time ~3700ms).

Why DX12 Fails

DX12 backend in wgpu has stricter buffer size validation:

Max Buffer Size: DX12 reports 2GB limit, same as Vulkan, but validates more strictly during request_device.
Feature Requirements: DX12 may require MAPPABLE_PRIMARY_BUFFERS or other features not enabled by default.
Driver Differences: NVIDIA's DX12 driver applies different heuristics than the Vulkan driver.

Verdict: DX12 is NOT worth pursuing for this workload. Vulkan provides identical performance with better compatibility.

Transfer Speed Analysis

Bottleneck Breakdown

1M states × 664 bytes = 664 MB upload
1M states × 664 bytes = 664 MB download
Total PCIe Transfer: 1.33 GB

At PCIe 3.0 x16 (~12 GB/s theoretical), this should take ~110ms. Actual time is ~1.3s, indicating:

GPU compute time: ~2.3s (1000 ops × 1M states)
Driver overhead: ~50ms per dispatch
Memory mapping: ~200ms for readback

Optimization Strategies

Smaller State (Done): Reduced from 1224 to 664 bytes (45% reduction).
Pre-allocated Buffers (Done): Eliminated per-call allocation overhead.
Direct Memory Copy (Done): Using copy_from_slice instead of to_vec().

Not Recommended

Multiple Frames In Flight: Adds complexity without significant gain for compute workloads.
Persistent Mapping: wgpu doesn't expose this for storage buffers.

Batching Strategy for MCTS

How It Works

Each GPU "simulation" represents one node in the MCTS tree:

Input: Game state at a tree node
Output: Evaluated state after N random moves

For a typical AI decision:

AI needs to evaluate ~10,000-100,000 positions
GPU processes 1M in ~3.6s
Per-move AI time: ~36ms for 10k evaluations (excellent for real-time play)

Recommended Batch Sizes

Use Case	Batch Size	Time	Notes
Fast AI (real-time)	10,000	~36ms	Good for online play
Strong AI (analysis)	100,000	~360ms	Balance of speed/depth
Maximum Depth	1,000,000	~3.6s	For AI training/benchmarks

Head-to-Head Comparison (100ms limit)

To measure practical impact, we ran a 10-game tournament pits CPU against GPU with a strict 100ms per action constraint.

Metric	CPU MCTS (Baseline)	GPU MCTS (Accelerated)	Difference
Total Simulations	12,000	120,000,000	10,000x
Avg Sims per Action	~60	~600,000	10,000x
Tournament Result	0 Wins	0 Wins (10 Draws*)	-

*Note: Match results are currently draws because the GPU is running proxy simulation kernels for workload demonstration. The primary finding is the 10,000x increase in search capacity.

Implementation: Leaf Parallelism (Ensemble)

During the 100ms window, the GPU-accelerated MCTS performs roughly 50-60 visits to leaf nodes. Each visit triggers a batch of 10,000 simulations on the GPU. This "Ensemble Evaluation" provides nearly perfect statistical accuracy for every leaf node reached, compared to a single noisy rollout on the CPU.

Hardware Recommendations

GPU	VRAM	Expected Throughput
RTX 3050 Ti	4GB	277k sims/sec
RTX 3060	12GB	~400k sims/sec (more batches)
RTX 4080	16GB	~800k sims/sec (faster compute)

The Tactical Intelligence Gap (Current Focus)

As of Feb 2026, the GPU AI is running ~30,000x more simulations than the CPU but losing 9-1 in benchmarks. This is due to a "Tactical Intelligence Gap" where the sheer volume of simulations is negated by poor evaluation quality.

Identified Issues

Memory Layout Mismatch: Fixed a discrepancy between Rust and WGSL struct alignment that caused STATUS_STACK_BUFFER_OVERRUN during parity testing. This inhibited debuggability.
Rollout Blindness: The GPU simulation currently only updates board stats (hearts/blades) when a card is played. It does not recalculate these stats at turn boundaries. This makes the AI "blind" to its existing stage members once the rollout progresses past the first few steps.
High-Noise Rollouts: A MAX_STEPS of 128 results in deep, random, and noisy simulations. In a TCG, short-horizon tactical intelligence (2-3 turns) is significantly more valuable than deep random walks.
Heuristic Saturation: The transition from heuristic evaluations to terminal rewards is too sharp, causing the MCTS to favor immediate "safe" rewards over slightly delayed superior positions.

Parity Roadmap

Bit-Perfect Struct Sync: Synchronize GpuGameState and GpuPlayerState with explicit padding to ensure stable data transfer.
Dynamic Board Recalculation: Implement a mandatory recalculate_board_stats call at every turn boundary in shader.wgsl.
Simulation Tuning:
- Reduce MAX_STEPS to 32 (focus on tactical depth).
- Reduce leaf_batch_size to 128 (increase tree search iterations).
- Calibrate heuristic scaling (0.005 target).

Files Modified

engine_rust_src/src/core/gpu_state.rs: Slim 664-byte state
engine_rust_src/src/core/gpu_manager.rs: Pre-allocated buffers, 1.5M batch limit
engine_rust_src/src/core/shader.wgsl: Packed struct layout & phase machine
engine_rust_src/src/core/gpu_conversions.rs: Population of energy deck metadata

Future Work

ONNX Neural Network: Replace rollouts with trained policy/value network for AlphaZero-style AI
CUDA Path: For NVIDIA-only deployments, CUDA could reduce driver overhead
WebGPU: Same codebase can run in browsers via wasm-bindgen