feat: add real rocprof CSV evidence and sync all benchmark docs (MI300X gfx942, ROCm 7.0, May 8 2026)
README.md
CHANGED
@@ -6,7 +6,7 @@ A multi-agent pipeline that migrates CUDA kernels to AMD ROCm/HIP — catching t
 
 ## Live Demo
 
-- **Backend API**: https://rocmport-ai.onrender.com
+- **Backend API**: https://rocmport-ai-q2b1.onrender.com
 - **HuggingFace Space**: https://huggingface.co/spaces/lablab-ai-amd-developer-hackathon/ROCmPort-AI
 
 ---
backend/tools/demo_artifacts.py
CHANGED
@@ -1,13 +1,14 @@
 """
-
+Real rocprof measurements for ROCmPort AI profiling layer.
 
-
-
+matrix_multiply, vector_add, and reduction values are rocprof-measured on
+AMD Instinct MI300X (gfx942), ROCm 7.0, AMD Developer Cloud, May 8 2026.
+Raw profiler CSV files are in docs/benchmark_runs/.
 
-
-
+convolution_2d: demo_artifact (not yet measured on hardware).
+custom: simulated conservative estimate.
 
-Baseline definition: straight hipify output with minimal compile edits (Baseline A).
+Baseline definition: straight hipify-clang output with minimal compile edits (Baseline A).
 """
 
 from typing import Dict
@@ -28,7 +29,8 @@ from typing import Dict
 
 KERNEL_DEMO_DATA: Dict[str, Dict] = {
     "reduction": {
-        #
+        # source: docs/benchmark_runs/reduction.stats.csv
+        # rocprof: reduction(float*, float*, int) [clone .kd] — 10 calls, avg 42424 ns (0.042ms)
         # Iteration 1 with naive block-size fails on wavefront-64 → regression shown honestly.
         # Iteration 2 with wavefront-aware final stage fixes correctness + performance.
         "iteration_1": {
@@ -67,6 +69,8 @@ KERNEL_DEMO_DATA: Dict[str, Dict] = {
     },
 
     "matrix_multiply": {
+        # source: docs/benchmark_runs/matmul_out.stats.csv
+        # rocprof: matmul_baseline avg 75893 ns (0.076ms), matmul_tiled avg 26123 ns (0.026ms) → 2.91x
         # Tiled GEMM benefits from LDS tiling on MI300X's large LDS capacity.
         "iteration_1": {
             "success": True,
@@ -89,6 +93,8 @@ KERNEL_DEMO_DATA: Dict[str, Dict] = {
     },
 
     "vector_add": {
+        # source: docs/benchmark_runs/vecadd_out.stats.csv
+        # rocprof: vector_add(float*, float*, float*, int) [clone .kd] — 10 calls, avg 97646 ns (0.098ms), 3918 GB/s
         # Simple memory-bound kernel — MI300X bandwidth advantage is most visible here.
         "iteration_1": {
             "success": True,
@@ -232,8 +238,10 @@ def get_benchmark_summary() -> Dict:
         ),
         "data_source_note": (
             "matrix_multiply, vector_add, and reduction are labelled 'mi300x_live': "
-            "
-            "
+            "rocprof-measured on AMD Instinct MI300X (gfx942), ROCm 7.0, AMD Developer Cloud, May 8 2026. "
+            "Raw CSV files: docs/benchmark_runs/matmul_out.stats.csv, "
+            "docs/benchmark_runs/vecadd_out.stats.csv, docs/benchmark_runs/reduction.stats.csv. "
+            "convolution_2d is labelled 'demo_artifact' (not yet measured on hardware). "
             "Entries labelled 'simulated' use conservative estimates."
         ),
         "reproducibility_note": (
docs/LIVE_RESULTS.md
CHANGED
@@ -1,30 +1,43 @@
-#
+# Live Results — AMD Instinct MI300X
 
-
-
-
-
+Measurements taken with `rocprof` on AMD Instinct MI300X (gfx942),
+ROCm 7.0, AMD Developer Cloud, **May 8 2026**.
+
+Raw profiler CSV files are in [`docs/benchmark_runs/`](benchmark_runs/).
+Values in `backend/tools/demo_artifacts.py` are labelled `data_source="mi300x_live"`
+and match these CSV files exactly.
 
 ## Benchmark Results
 
-| Kernel | Baseline
-|--------|---------------
-| matrix_multiply |
-|
-|
-| convolution_2d | 211.7 | 158.3 | 1.34x |
+| Kernel | Baseline (ms) | Optimized (ms) | Speedup | Bandwidth | Source CSV |
+|--------|---------------|----------------|---------|-----------|------------|
+| matrix_multiply (512×512) | 0.076 | 0.026 | 2.91x | — | [matmul_out.stats.csv](benchmark_runs/matmul_out.stats.csv) |
+| vector_add (32M elements) | — | 0.098 | — | 3,918 GB/s | [vecadd_out.stats.csv](benchmark_runs/vecadd_out.stats.csv) |
+| reduction (16M elements) | — | 0.042 | — | — | [reduction.stats.csv](benchmark_runs/reduction.stats.csv) |
+| convolution_2d | 211.7 | 158.3 | 1.34x | 2,134.8 GB/s | demo_artifact (not yet measured) |
+
+> **Note:** vector_add and reduction were run standalone; no pre-optimisation baseline
+> was captured in these runs. matrix_multiply baseline is `matmul_baseline` (hipify
+> output, no tiling); optimized is `matmul_tiled` (LDS 32×32 tile, wavefront-64 aligned).
+
+## rocprof Run Details
+
+- **Hardware:** AMD Instinct MI300X, gfx942
+- **ROCm version:** 7.0
+- **Platform:** AMD Developer Cloud
+- **Date:** May 8 2026
+- **Profiler:** `rocprof --stats`
+- **Wavefront size:** 64
+- **HBM3:** 192 GB, 5.3 TB/s theoretical
 
-##
+## Raw CSV Quick-Reference
 
-
--
--
-- Wavefront size: 64
-- API data source in local/demo mode: `demo_artifact`
+**matmul_out.stats.csv** (`docs/benchmark_runs/matmul_out.stats.csv`)
+- `matmul_baseline`: 5 calls, total 379,467 ns, avg **75,893 ns (0.076 ms)**
+- `matmul_tiled`: 5 calls, total 130,618 ns, avg **26,123 ns (0.026 ms)** → **2.91× speedup**
 
-
+**vecadd_out.stats.csv** (`docs/benchmark_runs/vecadd_out.stats.csv`)
+- `vector_add`: 10 calls, total 976,466 ns, avg **97,646 ns (0.098 ms)** → **3,918 GB/s**
 
-
-
-Real run output should be captured separately with the exact ROCm version, kernel
-input size, compiler flags, and profiler logs.
+**reduction.stats.csv** (`docs/benchmark_runs/reduction.stats.csv`)
+- `reduction`: 10 calls, total 424,248 ns, avg **42,424 ns (0.042 ms)**
docs/benchmark_runs/matmul_out.stats.csv
ADDED
@@ -0,0 +1,4 @@
+"Name","Calls","TotalDurationNs","AverageNs","Percentage"
+"matmul_baseline(float*, float*, float*, int) [clone .kd]",5,379467,75893,72.80766551993415
+"matmul_tiled(float*, float*, float*, int) [clone .kd]",5,130618,26123,25.06144580393752
+"__amd_rocclr_fillBufferAligned.kd",2,11106,5553,2.130888676128329
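The 2.91x figure quoted for matrix_multiply can be reproduced from the rows of this CSV. The following is a minimal sketch, not code from the repository: the rows are embedded inline rather than read from disk, and the helper name `load_stats` is hypothetical.

```python
import csv
import io

# Rows copied from docs/benchmark_runs/matmul_out.stats.csv as added in this commit.
STATS_CSV = '''"Name","Calls","TotalDurationNs","AverageNs","Percentage"
"matmul_baseline(float*, float*, float*, int) [clone .kd]",5,379467,75893,72.80766551993415
"matmul_tiled(float*, float*, float*, int) [clone .kd]",5,130618,26123,25.06144580393752
"__amd_rocclr_fillBufferAligned.kd",2,11106,5553,2.130888676128329
'''


def load_stats(text):
    """Hypothetical helper: map kernel name (text before '(') to AverageNs."""
    rows = csv.DictReader(io.StringIO(text))
    return {row["Name"].split("(")[0]: int(row["AverageNs"]) for row in rows}


avg_ns = load_stats(STATS_CSV)
# Speedup is the baseline average divided by the tiled average.
speedup = avg_ns["matmul_baseline"] / avg_ns["matmul_tiled"]
print(f"speedup = {speedup:.2f}x")  # prints "speedup = 2.91x"
```

The ratio uses the nanosecond averages. Dividing the rounded millisecond values (0.076 / 0.026) would give roughly 2.92 instead, so the nanosecond columns are the safer input.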
docs/benchmark_runs/mi300x_results.txt
CHANGED
@@ -1,28 +1,52 @@
-Data source:
-
+Data source: mi300x_live (rocprof --stats)
+Raw CSVs: docs/benchmark_runs/matmul_out.stats.csv
+docs/benchmark_runs/vecadd_out.stats.csv
+docs/benchmark_runs/reduction.stats.csv
 
-Hardware
-GPU:
-
-
-
+Hardware:
+GPU: AMD Instinct MI300X (gfx942)
+HBM3: 192 GB
+Theoretical bandwidth: 5.3 TB/s
+Wavefront size: 64
+ROCm version: 7.0
+Platform: AMD Developer Cloud
+Date: May 8 2026
+Profiler: rocprof --stats
 
-
-
-
-
-
+------------------------------------------------------------------------
+matrix_multiply (source: matmul_out.stats.csv)
+------------------------------------------------------------------------
+Kernel (baseline): matmul_baseline(float*, float*, float*, int)
+Calls: 5
+Total duration: 379,467 ns
+Average: 75,893 ns → 0.076 ms
 
-
-
-
-
-Bandwidth: 531.8 GB/s
+Kernel (optimized): matmul_tiled(float*, float*, float*, int)
+Calls: 5
+Total duration: 130,618 ns
+Average: 26,123 ns → 0.026 ms
 
-
-
-
-
-
+Speedup: 2.91x (baseline / tiled = 75893 / 26123)
+Optimization: LDS shared-memory tiling, 32x32 tile, block=256
+
+------------------------------------------------------------------------
+vector_add (source: vecadd_out.stats.csv)
+------------------------------------------------------------------------
+Kernel: vector_add(float*, float*, float*, int)
+Input size: 32M elements
+Calls: 10
+Total duration: 976,466 ns
+Average: 97,646 ns → 0.098 ms
+Bandwidth: 3,918 GB/s
+
+------------------------------------------------------------------------
+reduction (source: reduction.stats.csv)
+------------------------------------------------------------------------
+Kernel: reduction(float*, float*, int)
+Input size: 16M elements
+Calls: 10
+Total duration: 424,248 ns
+Average: 42,424 ns → 0.042 ms
+Fix applied: wavefront-64 final stage (tid<64 expanded)
+Correctness: PASS
 
-Set ROCM_AVAILABLE=true on real MI300X hardware to produce real_rocm values.
docs/benchmark_runs/reduction.stats.csv
ADDED
@@ -0,0 +1,3 @@
+"Name","Calls","TotalDurationNs","AverageNs","Percentage"
+"reduction(float*, float*, int) [clone .kd]",10,424248,42424,94.74431754737796
+"__amd_rocclr_fillBufferAligned.kd",1,23534,23534,5.255682452622035
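The derived columns in this CSV can be cross-checked against the raw totals. The sketch below assumes, based only on the rows shown here, that `AverageNs` is the total divided by the call count and rounded down, and that `Percentage` is each row's share of the summed `TotalDurationNs`. It is an illustration, not part of the repository.

```python
import csv
import io

# Rows copied from docs/benchmark_runs/reduction.stats.csv as added in this commit.
STATS_CSV = '''"Name","Calls","TotalDurationNs","AverageNs","Percentage"
"reduction(float*, float*, int) [clone .kd]",10,424248,42424,94.74431754737796
"__amd_rocclr_fillBufferAligned.kd",1,23534,23534,5.255682452622035
'''

rows = list(csv.DictReader(io.StringIO(STATS_CSV)))
grand_total = sum(int(r["TotalDurationNs"]) for r in rows)

checks = []
for r in rows:
    calls = int(r["Calls"])
    total = int(r["TotalDurationNs"])
    # Assumption: the average is the integer (floor) quotient of total and calls.
    avg_ok = int(r["AverageNs"]) == total // calls
    # Assumption: the percentage is this row's share of all recorded kernel time.
    pct_ok = abs(100.0 * total / grand_total - float(r["Percentage"])) < 1e-6
    checks.append((r["Name"], avg_ok, pct_ok))
    print(r["Name"], "average ok:", avg_ok, "percentage ok:", pct_ok)
```

A check like this is cheap to run before copying numbers into the docs, and it flags hand-edited rows whose columns no longer agree.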
docs/benchmark_runs/vecadd_out.stats.csv
ADDED
@@ -0,0 +1,3 @@
+"Name","Calls","TotalDurationNs","AverageNs","Percentage"
+"vector_add(float*, float*, float*, int) [clone .kd]",10,976466,97646,94.6415417658507
+"__amd_rocclr_fillBufferAligned.kd",2,55286,27643,5.358458234149292
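The commit does not say how the 3,918 GB/s figure for vector_add was derived. One reading that reproduces it is sketched below. It assumes float32 elements, two reads plus one write per element, "32M" meaning 32 × 10^6 elements, and 1 GB = 10^9 bytes. None of these assumptions is confirmed by the source, and the function name `bandwidth_gb_s` is hypothetical.

```python
# Assumptions (not stated in the commit): float32 data, traffic of two reads
# plus one write per element, and "32M" taken as 32 * 10**6 elements.
N_ELEMENTS = 32 * 10**6
BYTES_PER_ELEMENT = 3 * 4  # read a, read b, write c; 4 bytes each

AVG_NS = 97646      # AverageNs for vector_add in vecadd_out.stats.csv
ROUNDED_MS = 0.098  # the rounded average quoted in the docs


def bandwidth_gb_s(bytes_moved, seconds):
    """Effective bandwidth in GB/s, with 1 GB = 10**9 bytes."""
    return bytes_moved / seconds / 1e9


bytes_moved = N_ELEMENTS * BYTES_PER_ELEMENT
bw_exact = bandwidth_gb_s(bytes_moved, AVG_NS * 1e-9)
bw_rounded = bandwidth_gb_s(bytes_moved, ROUNDED_MS * 1e-3)

print(f"from {AVG_NS} ns: {bw_exact:.0f} GB/s")
print(f"from {ROUNDED_MS} ms: {bw_rounded:.0f} GB/s")
```

Under these assumptions the rounded 0.098 ms average gives about 3,918 GB/s, matching the documented value, while the unrounded 97,646 ns average gives about 3,933 GB/s. If the kernel actually used 32 × 2^20 elements, both numbers would come out higher.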