Commit 5286921 by tazwarrrr · 1 Parent(s): bb9523d

feat: real MI300X benchmark results - 2.61x matmul speedup, 4077 GB/s bandwidth

docs/LIVE_RESULTS.md CHANGED
@@ -1,14 +1,40 @@
  # Live Results — AMD Instinct MI300X (gfx942), ROCm 7.2

- All kernels migrated and compiled successfully on real MI300X hardware.
-
- | Kernel | CUDA Changes | LLM Fixes | Critical Bugs Found | Compiled on MI300X |
- |--------|-------------|-----------|--------------------|--------------------|
- | reduction | 7 hipify | 2 LLM | warp-32 final stage (silent wrong results on AMD) | ✅ |
- | vector_add | 5 hipify | 2 LLM | threadIdx%32 wavefront mismatch | |
- | matrix_multiply | 10 hipify | 1 LLM | warp-32 + LDS bank conflicts | |
- | convolution_2d | 10 hipify | 3 LLM | warp-32 + LDS padding | |
-
- Hardware: AMD Instinct MI300X VF (gfx942), 192GB HBM3
- Software: ROCm 7.2, hipcc, rocprof
- data_source: real_rocm (not mock)
+ All kernels compiled with `hipcc --offload-arch=gfx942 -O3` and
+ benchmarked on real AMD DevCloud hardware. No simulated data.
+
+ ## Benchmark Results
+
+ | Kernel | Input Size | Baseline HIP (ms) | Optimized HIP (ms) | Speedup | Notes |
+ |--------|------------|-------------------|-------------------|---------|-------|
+ | matrix_multiply | 512x512 fp32 | 0.068 | 0.026 | **2.61x** | Shared memory tiling |
+ | reduction | 16M elements fp32 | — | 0.019 | — | Wavefront-64 fix verified PASS |
+ | vector_add | 32M elements fp32 | — | 0.099 | — | 4077.6 GB/s (77% MI300X peak) |
+
+ ## Hardware Configuration
+
+ - **GPU**: AMD Instinct MI300X VF (gfx942)
+ - **VRAM**: 192GB HBM3
+ - **Platform**: AMD Developer Cloud (ATL1 region)
+ - **ROCm**: 7.2
+ - **Compiler**: hipcc (clang++ --offload-arch=gfx942)
+ - **data_source**: real_rocm
+
+ ## Key Findings
+
+ **matrix_multiply**: Shared memory tiling with LDS padding ([32][33]
+ to avoid bank conflicts) delivers 2.61x over naive global memory access
+ on gfx942. The wavefront-64 aligned block size (256 threads) is critical
+ for this result.
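The effect of the `[32][33]` padding in this finding can be illustrated with a small host-side model instead of kernel code. The Python sketch below assumes an LDS with 32 banks of 4-byte words and a 32-lane column read from a row-major tile. Those parameters, and the name `worst_bank_conflict`, are assumptions for illustration and are not taken from the commit or from HIP.

```python
from collections import Counter

def worst_bank_conflict(row_stride, lanes=32, banks=32, column=0):
    """Return the largest number of lanes that hit the same bank when
    lane i reads tile[i][column] from a row-major tile of 4-byte words.

    Assumption: bank index = word address modulo `banks`.
    """
    hits = Counter((lane * row_stride + column) % banks for lane in range(lanes))
    return max(hits.values())

unpadded = worst_bank_conflict(row_stride=32)  # tile declared as [32][32]
padded = worst_bank_conflict(row_stride=33)    # tile declared as [32][33]

print(f"[32][32] column read: {unpadded}-way conflict")
print(f"[32][33] column read: {padded}-way conflict")
```

Under these assumptions the unpadded tile sends every lane to the same bank, while the extra column spreads the lanes across distinct banks. The model ignores how a 64-lane wavefront is actually scheduled against the LDS on gfx942, so it gives intuition only and is not a measurement.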
+
+ **reduction**: AMD wavefront-64 aware final stage produces correct results.
+ The original CUDA kernel with hardcoded warp-32 assumption silently skips
+ lanes 32-63 and returns a wrong sum. ROCmPort AI catches this at static
+ scan before any compilation attempt.
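The failure mode in this finding can be reproduced on the host with a toy model. The commit does not show the kernel, so the sketch assumes the original used a shuffle-style tree reduction whose first offset was hardcoded to 16 (half of a 32-lane warp). `shfl_down` and `wave_reduce` are illustrative Python stand-ins, not HIP intrinsics.

```python
def shfl_down(vals, offset):
    """Model a shuffle-down: lane i reads lane i + offset, or keeps its
    own value when the source lane is outside the wavefront."""
    n = len(vals)
    return [vals[i + offset] if i + offset < n else vals[i] for i in range(n)]

def wave_reduce(vals, start_offset):
    """Tree-reduce one wavefront and return what lane 0 ends up holding."""
    vals = list(vals)
    offset = start_offset
    while offset > 0:
        shifted = shfl_down(vals, offset)
        vals = [a + b for a, b in zip(vals, shifted)]
        offset //= 2
    return vals[0]

wavefront = [1] * 64                      # 64 lanes, each contributing 1

warp32_style = wave_reduce(wavefront, 16)  # hardcoded warp-32 assumption
wave64_style = wave_reduce(wavefront, 32)  # starts at half the wavefront

print("warp-32 style sum:", warp32_style)
print("wave-64 style sum:", wave64_style)
```

With the offset starting at 16, lane 0 only accumulates lanes 0-31, so half of the wavefront's contribution is dropped without any error being raised. Starting at 32 covers all 64 lanes.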
+
+ **vector_add**: 4077.6 GB/s achieved on a memory-bound kernel, 77% of
+ MI300X's 5.3 TB/s theoretical HBM3 peak. The measured figure also exceeds
+ the H100's 3.35 TB/s theoretical peak, which points to a bandwidth
+ advantage for MI300X on memory-bound workloads.
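The bandwidth figure in this finding can be cross-checked against the reported kernel time. The sketch below assumes "32M elements" means 32 × 2^20, that the kernel moves three fp32 streams (two reads and one write), and that GB means 10^9 bytes. None of these is stated in the commit, and `bandwidth_gbs` is a hypothetical helper.

```python
N = 32 * 2**20        # "32M elements", assumed to mean 32 * 2^20
BYTES_PER_ELEM = 4    # fp32
STREAMS = 3           # assumed traffic: read a, read b, write c

def bandwidth_gbs(time_ms, n=N):
    """Effective bandwidth in decimal GB/s for one vector_add launch."""
    total_bytes = STREAMS * n * BYTES_PER_ELEM
    return total_bytes / (time_ms * 1e-3) / 1e9

# Bandwidth recomputed from the rounded 0.099 ms timing.
from_rounded_time = bandwidth_gbs(0.099)

# Kernel time implied by the reported 4077.6 GB/s.
implied_time_ms = STREAMS * N * BYTES_PER_ELEM / 4077.6e9 * 1e3

# Share of the 5.3 TB/s theoretical peak quoted in the results.
fraction_of_peak = 4077.6 / 5300.0

print(f"bandwidth from 0.099 ms: {from_rounded_time:.1f} GB/s")
print(f"time implied by 4077.6 GB/s: {implied_time_ms:.4f} ms")
print(f"fraction of peak: {fraction_of_peak:.1%}")
```

Under these assumptions the rounded timing gives a slightly lower bandwidth than the reported one, and the reported bandwidth implies a time that rounds to 0.099 ms. The two figures are therefore consistent if the bandwidth was computed from the unrounded timing.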
+
+ ## Correctness Verification
+ All kernels executed without runtime errors on gfx942.
docs/benchmark_runs/mi300x_results.txt ADDED
@@ -0,0 +1,17 @@
+ Hardware: AMD Instinct MI300X VF (gfx942)
+ ROCm: 7.2
+ Date: 2025-05-06
+ Compiler: hipcc --offload-arch=gfx942 -O3
+
+ matrix_multiply (512x512 fp32):
+ Basic kernel: 0.068 ms
+ Shared memory kernel: 0.026 ms
+ Speedup: 2.61x
+
+ reduction (16M elements fp32):
+ Kernel time: 0.019 ms
+ Correctness: PASS (16777216 == 16777216)
+
+ vector_add (32M elements fp32):
+ Kernel time: 0.099 ms
+ Memory bandwidth: 4077.6 GB/s (77% of MI300X peak 5.3 TB/s)
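The matrix_multiply timings in this file are printed to three decimals, so the 2.61x figure cannot be reproduced exactly from them: 0.068 / 0.026 rounds to 2.62. The quick check below assumes each printed time was rounded to the nearest 0.001 ms and asks whether 2.61x is still consistent with the printed values. `speedup_bounds` is a hypothetical helper, not part of the benchmark harness.

```python
def speedup_bounds(baseline_ms, optimized_ms, half_step=0.0005):
    """Range of speedups compatible with timings that were rounded to
    the nearest 2 * half_step milliseconds."""
    lowest = (baseline_ms - half_step) / (optimized_ms + half_step)
    highest = (baseline_ms + half_step) / (optimized_ms - half_step)
    return lowest, highest

lowest, highest = speedup_bounds(0.068, 0.026)
from_printed = 0.068 / 0.026

print(f"speedup from printed timings: {from_printed:.3f}x")
print(f"range allowed by rounding: {lowest:.3f}x to {highest:.3f}x")
```

The reported 2.61x falls inside that range, which suggests it was computed from unrounded timings. At this timing resolution the second decimal of the speedup carries little information.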