feat: real MI300X benchmark results - 2.61x matmul speedup, 4077 GB/s bandwidth
- docs/LIVE_RESULTS.md +38 -12
- docs/benchmark_runs/mi300x_results.txt +17 -0
docs/LIVE_RESULTS.md
CHANGED
@@ -1,14 +1,40 @@
# Live Results — AMD Instinct MI300X (gfx942), ROCm 7.2

All kernels were compiled with `hipcc --offload-arch=gfx942 -O3` and benchmarked on real AMD Developer Cloud hardware. No simulated data.

## Benchmark Results

| Kernel | Input Size | Baseline HIP (ms) | Optimized HIP (ms) | Speedup | Notes |
|--------|------------|-------------------|--------------------|---------|-------|
| matrix_multiply | 512x512 fp32 | 0.068 | 0.026 | **2.61x** | Shared memory tiling |
| reduction | 16M elements fp32 | n/a | 0.019 | n/a | Wavefront-64 fix, correctness check PASS |
| vector_add | 32M elements fp32 | n/a | 0.099 | n/a | 4077.6 GB/s (77% of MI300X peak) |
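
As a cross-check of the matrix_multiply row, the speedup and the achieved arithmetic throughput can be recomputed from the reported times. The following is a host-side sketch, not part of the benchmark harness. It assumes the conventional count of 2 * N^3 floating-point operations for an N x N matmul, and both helper names are hypothetical. With the rounded times from the table, `speedup(0.068, 0.026)` comes out near 2.615, so the reported 2.61x was presumably computed from unrounded timings, and `matmul_gflops(512, 0.026)` is roughly 10,300 GFLOP/s.

```cpp
#include <cstdint>

// Ratio of baseline to optimized kernel time (both in milliseconds).
double speedup(double baseline_ms, double optimized_ms) {
    return baseline_ms / optimized_ms;
}

// Achieved throughput in GFLOP/s for an n x n matmul, assuming the
// conventional count of 2 * n^3 floating-point operations.
double matmul_gflops(std::int64_t n, double kernel_ms) {
    const double flop = 2.0 * static_cast<double>(n) * n * n;
    const double seconds = kernel_ms * 1e-3;
    return flop / seconds / 1e9;
}
```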

## Hardware Configuration

- **GPU**: AMD Instinct MI300X VF (gfx942)
- **VRAM**: 192 GB HBM3
- **Platform**: AMD Developer Cloud (ATL1 region)
- **ROCm**: 7.2
- **Compiler**: hipcc (clang++ --offload-arch=gfx942)
- **data_source**: real_rocm

## Key Findings

**matrix_multiply**: Shared memory tiling with LDS padding (a `[32][33]` tile instead of `[32][32]`, to avoid bank conflicts) delivers a 2.61x speedup over naive global memory access on gfx942. The block size of 256 threads, a multiple of the 64-lane wavefront, is critical for this result.
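
The effect of the `[32][33]` padding can be illustrated with a small host-side model. The sketch below assumes an LDS with 32 banks, each 4 bytes wide; that is an assumption about the hardware, not something stated in this commit. It counts how many distinct banks are hit when 32 lanes each read one element of the same column of an fp32 tile. `distinct_banks_for_column` is a hypothetical helper. Under this model a row stride of 32 maps the whole column to a single bank (a 32-way conflict), while a stride of 33 spreads the same accesses across all 32 banks.

```cpp
#include <set>

// Assumed LDS layout: 32 banks, each one 4-byte word wide.
constexpr int kNumBanks = 32;
constexpr int kTileRows = 32;

// Count the distinct banks touched when lanes 0..31 each read
// tile[lane][col] from an fp32 tile declared as tile[32][row_stride].
int distinct_banks_for_column(int row_stride, int col) {
    std::set<int> banks;
    for (int lane = 0; lane < kTileRows; ++lane) {
        const int word_index = lane * row_stride + col;  // offset in 4-byte words
        banks.insert(word_index % kNumBanks);
    }
    return static_cast<int>(banks.size());
}
```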

**reduction**: The wavefront-64 aware final stage produces correct results. The original CUDA kernel hardcodes a warp size of 32, so on AMD hardware it silently skips lanes 32-63 and returns a wrong sum. ROCmPort AI catches this during its static scan, before any compilation attempt.
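
The failure mode can be reproduced with a host-side model of a shuffle-down tree reduction. The actual kernel is not shown in this commit, so the loop shape here is an assumption, and `wave_reduce_lane0` and `all_ones` are hypothetical names. Starting the tree at offset 16 (the warp-32 assumption) on a 64-lane wavefront leaves lane 0 with the sum of lanes 0-31 only, while starting at offset 32 covers all 64 lanes. With every lane holding 1.0f, the first variant yields 32 and the second yields 64.

```cpp
#include <array>

constexpr int kWavefrontSize = 64;  // lanes per wavefront on gfx942
using Lanes = std::array<float, kWavefrontSize>;

// Host model of a shuffle-down tree reduction. At each step, lane i adds
// the value held by lane i + offset. Lanes whose source would fall outside
// the wavefront are left unchanged here; that choice does not affect lane 0.
float wave_reduce_lane0(Lanes lanes, int first_offset) {
    for (int offset = first_offset; offset > 0; offset /= 2) {
        const Lanes prev = lanes;  // snapshot so all lanes read pre-step values
        for (int lane = 0; lane + offset < kWavefrontSize; ++lane) {
            lanes[lane] = prev[lane] + prev[lane + offset];
        }
    }
    return lanes[0];
}

// Convenience helper: a wavefront where every lane holds 1.0f.
Lanes all_ones() {
    Lanes lanes;
    lanes.fill(1.0f);
    return lanes;
}
```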

**vector_add**: 4077.6 GB/s achieved on a memory-bound kernel, which is 77% of MI300X's 5.3 TB/s theoretical HBM3 peak. The measured figure is also above the H100's 3.35 TB/s theoretical peak, which illustrates the bandwidth advantage of MI300X for memory-bound workloads.
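
The bandwidth figure can be sanity-checked from the element count and kernel time. The sketch below assumes that "32M" means 32 * 2^20 elements and that the kernel moves three fp32 arrays (two reads and one write); neither detail is stated in this commit, and the helper names are hypothetical. With the rounded 0.099 ms this gives about 4067 GB/s, so the reported 4077.6 GB/s would correspond to an unrounded time of about 0.0987 ms.

```cpp
#include <cstdint>

// Effective bandwidth in GB/s (1 GB = 1e9 bytes) for a kernel that streams
// `arrays` buffers of `n` elements, each `bytes_per_element` wide.
double effective_bandwidth_gbs(std::int64_t n, int bytes_per_element,
                               int arrays, double kernel_ms) {
    const double bytes = static_cast<double>(n) * bytes_per_element * arrays;
    return bytes / (kernel_ms * 1e-3) / 1e9;
}

// Fraction of a theoretical peak given in TB/s (1 TB = 1e12 bytes).
double fraction_of_peak(double gbs, double peak_tbs) {
    return gbs / (peak_tbs * 1000.0);
}
```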

## Correctness Verification

All kernels executed without runtime errors on gfx942. The reduction kernel additionally passed an output check (16777216 == 16777216).
docs/benchmark_runs/mi300x_results.txt
ADDED
@@ -0,0 +1,17 @@
Hardware: AMD Instinct MI300X VF (gfx942)
ROCm: 7.2
Date: 2025-05-06
Compiler: hipcc --offload-arch=gfx942 -O3

matrix_multiply (512x512 fp32):
Basic kernel: 0.068 ms
Shared memory kernel: 0.026 ms
Speedup: 2.61x

reduction (16M elements fp32):
Kernel time: 0.019 ms
Correctness: PASS (16777216 == 16777216)

vector_add (32M elements fp32):
Kernel time: 0.099 ms
Memory bandwidth: 4077.6 GB/s (77% of MI300X peak 5.3 TB/s)