Commit dccc6b3 by tazwarrrr · 1 Parent(s): 7fb1071

feat: add real rocprof CSV evidence and sync all benchmark docs (MI300X gfx942, ROCm 7.0, May 8 2026)

README.md CHANGED
@@ -6,7 +6,7 @@ A multi-agent pipeline that migrates CUDA kernels to AMD ROCm/HIP — catching t

 ## Live Demo

-- **Backend API**: https://rocmport-ai.onrender.com
+- **Backend API**: https://rocmport-ai-q2b1.onrender.com
 - **HuggingFace Space**: https://huggingface.co/spaces/lablab-ai-amd-developer-hackathon/ROCmPort-AI

 ---
backend/tools/demo_artifacts.py CHANGED
@@ -1,13 +1,14 @@
 """
-Demo artifact data for ROCmPort AI profiling layer.
+Real rocprof measurements for ROCmPort AI profiling layer.

-These values replace random.uniform() with deterministic, per-kernel data derived from
-realistic AMD MI300X profiling ranges for each kernel class.
+matrix_multiply, vector_add, and reduction values are rocprof-measured on
+AMD Instinct MI300X (gfx942), ROCm 7.0, AMD Developer Cloud, May 8 2026.
+Raw profiler CSV files are in docs/benchmark_runs/.

-Every entry is labelled data_source="demo_artifact" so the UI can show an honest badge.
-When ROCM_AVAILABLE=true the real rocprof path runs instead.
+convolution_2d: demo_artifact (not yet measured on hardware).
+custom: simulated conservative estimate.

-Baseline definition: straight hipify output with minimal compile edits (Baseline A).
+Baseline definition: straight hipify-clang output with minimal compile edits (Baseline A).
 """

 from typing import Dict
@@ -28,7 +29,8 @@ from typing import Dict

 KERNEL_DEMO_DATA: Dict[str, Dict] = {
 "reduction": {
-# Reduction is the canonical warp-size bug demo kernel.
+# source: docs/benchmark_runs/reduction.stats.csv
+# rocprof: reduction(float*, float*, int) [clone .kd] — 10 calls, avg 42424 ns (0.042ms)
 # Iteration 1 with naive block-size fails on wavefront-64 → regression shown honestly.
 # Iteration 2 with wavefront-aware final stage fixes correctness + performance.
 "iteration_1": {
@@ -67,6 +69,8 @@ KERNEL_DEMO_DATA: Dict[str, Dict] = {
 },

 "matrix_multiply": {
+# source: docs/benchmark_runs/matmul_out.stats.csv
+# rocprof: matmul_baseline avg 75893 ns (0.076ms), matmul_tiled avg 26123 ns (0.026ms) → 2.91x
 # Tiled GEMM benefits from LDS tiling on MI300X's large LDS capacity.
 "iteration_1": {
 "success": True,
@@ -89,6 +93,8 @@ KERNEL_DEMO_DATA: Dict[str, Dict] = {
 },

 "vector_add": {
+# source: docs/benchmark_runs/vecadd_out.stats.csv
+# rocprof: vector_add(float*, float*, float*, int) [clone .kd] — 10 calls, avg 97646 ns (0.098ms), 3918 GB/s
 # Simple memory-bound kernel — MI300X bandwidth advantage is most visible here.
 "iteration_1": {
 "success": True,
@@ -232,8 +238,10 @@ def get_benchmark_summary() -> Dict:
 ),
 "data_source_note": (
 "matrix_multiply, vector_add, and reduction are labelled 'mi300x_live': "
-"real measurements on AMD Instinct MI300X (gfx942, ROCm 7.2, AMD Developer Cloud). "
-"convolution_2d is labelled 'demo_artifact' (representative estimate). "
+"rocprof-measured on AMD Instinct MI300X (gfx942), ROCm 7.0, AMD Developer Cloud, May 8 2026. "
+"Raw CSV files: docs/benchmark_runs/matmul_out.stats.csv, "
+"docs/benchmark_runs/vecadd_out.stats.csv, docs/benchmark_runs/reduction.stats.csv. "
+"convolution_2d is labelled 'demo_artifact' (not yet measured on hardware). "
 "Entries labelled 'simulated' use conservative estimates."
 ),
 "reproducibility_note": (
docs/LIVE_RESULTS.md CHANGED
@@ -1,30 +1,43 @@
-# Reproducible Results
+# Live Results — AMD Instinct MI300X

-The backend returns deterministic benchmark artifacts unless `ROCM_AVAILABLE=true`
-is set on real ROCm hardware. These values come from
-`backend/tools/demo_artifacts.py` and are labelled `data_source="demo_artifact"`
-in API responses.
+Measurements taken with `rocprof` on AMD Instinct MI300X (gfx942),
+ROCm 7.0, AMD Developer Cloud, **May 8 2026**.
+
+Raw profiler CSV files are in [`docs/benchmark_runs/`](benchmark_runs/).
+Values in `backend/tools/demo_artifacts.py` are labelled `data_source="mi300x_live"`
+and match these CSV files exactly.

 ## Benchmark Results

-| Kernel | Baseline HIP (ms) | Optimized HIP (ms) | Speedup | Bandwidth | Bottleneck |
-|--------|-------------------|--------------------|---------|-----------|------------|
-| matrix_multiply | 121.4 | 89.1 | 1.36x | 1843.7 GB/s | memory-bound |
-| reduction | 88.2 | 68.7 | 1.28x | 531.8 GB/s | compute-bound after wavefront fix |
-| vector_add | 45.1 | 38.2 | 1.18x | 4821.6 GB/s | memory-bound |
-| convolution_2d | 211.7 | 158.3 | 1.34x | 2134.8 GB/s | memory-bound |
+| Kernel | Baseline (ms) | Optimized (ms) | Speedup | Bandwidth | Source CSV |
+|--------|---------------|----------------|---------|-----------|------------|
+| matrix_multiply (512×512) | 0.076 | 0.026 | 2.91x | | [matmul_out.stats.csv](benchmark_runs/matmul_out.stats.csv) |
+| vector_add (32M elements) | | 0.098 | | 3,918 GB/s | [vecadd_out.stats.csv](benchmark_runs/vecadd_out.stats.csv) |
+| reduction (16M elements) | | 0.042 | | | [reduction.stats.csv](benchmark_runs/reduction.stats.csv) |
+| convolution_2d | 211.7 | 158.3 | 1.34x | 2,134.8 GB/s | demo_artifact (not yet measured) |
+
+> **Note:** vector_add and reduction were run standalone; no pre-optimisation baseline
+> was captured in these runs. matrix_multiply baseline is `matmul_baseline` (hipify
+> output, no tiling); optimized is `matmul_tiled` (LDS 32×32 tile, wavefront-64 aligned).
+
+## rocprof Run Details
+
+- **Hardware:** AMD Instinct MI300X, gfx942
+- **ROCm version:** 7.0
+- **Platform:** AMD Developer Cloud
+- **Date:** May 8 2026
+- **Profiler:** `rocprof --stats`
+- **Wavefront size:** 64
+- **HBM3:** 192 GB, 5.3 TB/s theoretical

-## Hardware Context
+## Raw CSV Quick-Reference

-- GPU class: AMD Instinct MI300X
-- VRAM: 192GB HBM3
-- Theoretical memory bandwidth: 5.3 TB/s
-- Wavefront size: 64
-- API data source in local/demo mode: `demo_artifact`
+**matmul_out.stats.csv** (`docs/benchmark_runs/matmul_out.stats.csv`)
+- `matmul_baseline`: 5 calls, total 379,467 ns, avg **75,893 ns (0.076 ms)**
+- `matmul_tiled`: 5 calls, total 130,618 ns, avg **26,123 ns (0.026 ms)** → **2.91× speedup**

-## Real Hardware Mode
+**vecadd_out.stats.csv** (`docs/benchmark_runs/vecadd_out.stats.csv`)
+- `vector_add`: 10 calls, total 976,466 ns, avg **97,646 ns (0.098 ms)** → **3,918 GB/s**

-Set `ROCM_AVAILABLE=true`, `HIPCC_PATH=hipcc`, and `ROCPROF_PATH=rocprof` on a
-real MI300X ROCm environment to replace demo artifacts with `data_source="real_rocm"`.
-Real run output should be captured separately with the exact ROCm version, kernel
-input size, compiler flags, and profiler logs.
+**reduction.stats.csv** (`docs/benchmark_runs/reduction.stats.csv`)
+- `reduction`: 10 calls, total 424,248 ns, avg **42,424 ns (0.042 ms)**
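
Reviewer note: the averages and millisecond figures in the quick-reference can be re-derived from the call counts and total durations quoted there. The sketch below is not part of the commit. The helper names `average_ns` and `ns_to_ms` are hypothetical, and integer truncation is an assumption inferred from the listed values (for example 130,618 / 5 = 26,123.6 is listed as 26,123).

```python
# Call counts and total durations (ns) as quoted in the quick-reference.
RUNS = {
    "matmul_baseline": (5, 379_467),
    "matmul_tiled": (5, 130_618),
    "vector_add": (10, 976_466),
    "reduction": (10, 424_248),
}


def average_ns(total_ns: int, calls: int) -> int:
    """Integer average; truncation is assumed from the listed AverageNs values."""
    return total_ns // calls


def ns_to_ms(ns: int) -> float:
    """Convert nanoseconds to milliseconds, rounded to three decimals."""
    return round(ns / 1_000_000, 3)


# Recompute the per-kernel average and its millisecond form.
SUMMARY = {
    name: (average_ns(total, calls), ns_to_ms(average_ns(total, calls)))
    for name, (calls, total) in RUNS.items()
}

for name, (avg, ms) in SUMMARY.items():
    print(f"{name}: avg {avg} ns ({ms} ms)")
```

With the figures above this reproduces 75,893 ns (0.076 ms), 26,123 ns (0.026 ms), 97,646 ns (0.098 ms), and 42,424 ns (0.042 ms).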
 
 
docs/benchmark_runs/matmul_out.stats.csv ADDED
@@ -0,0 +1,4 @@
+"Name","Calls","TotalDurationNs","AverageNs","Percentage"
+"matmul_baseline(float*, float*, float*, int) [clone .kd]",5,379467,75893,72.80766551993415
+"matmul_tiled(float*, float*, float*, int) [clone .kd]",5,130618,26123,25.06144580393752
+"__amd_rocclr_fillBufferAligned.kd",2,11106,5553,2.130888676128329
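
Reviewer note: the 2.91x speedup quoted in the docs follows directly from the `AverageNs` column of this file. The sketch below is not part of the commit. It embeds the CSV rows shown above as a string rather than reading the file, and `average_ns_by_kernel` is a hypothetical helper name.

```python
import csv
import io

# Rows copied from docs/benchmark_runs/matmul_out.stats.csv as shown in this commit.
MATMUL_STATS = '''"Name","Calls","TotalDurationNs","AverageNs","Percentage"
"matmul_baseline(float*, float*, float*, int) [clone .kd]",5,379467,75893,72.80766551993415
"matmul_tiled(float*, float*, float*, int) [clone .kd]",5,130618,26123,25.06144580393752
"__amd_rocclr_fillBufferAligned.kd",2,11106,5553,2.130888676128329
'''


def average_ns_by_kernel(stats_text: str) -> dict:
    """Map each short kernel name (text before the first '(') to its AverageNs."""
    reader = csv.DictReader(io.StringIO(stats_text))
    return {row["Name"].split("(")[0]: int(row["AverageNs"]) for row in reader}


averages = average_ns_by_kernel(MATMUL_STATS)
speedup = averages["matmul_baseline"] / averages["matmul_tiled"]
print(f"{speedup:.2f}x")  # prints 2.91x
```

Using `TotalDurationNs` instead of `AverageNs` gives the same ratio to two decimals, since both kernels were called five times.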
docs/benchmark_runs/mi300x_results.txt CHANGED
@@ -1,28 +1,52 @@
-Data source: demo_artifact
-Source file: backend/tools/demo_artifacts.py
+Data source: mi300x_live (rocprof --stats)
+Raw CSVs: docs/benchmark_runs/matmul_out.stats.csv
+docs/benchmark_runs/vecadd_out.stats.csv
+docs/benchmark_runs/reduction.stats.csv

-Hardware class:
-GPU: AMD Instinct MI300X
-HBM: 192GB
-Wavefront size: 64
-Theoretical memory bandwidth: 5.3 TB/s
+Hardware:
+GPU: AMD Instinct MI300X (gfx942)
+HBM3: 192 GB
+Theoretical bandwidth: 5.3 TB/s
+Wavefront size: 64
+ROCm version: 7.0
+Platform: AMD Developer Cloud
+Date: May 8 2026
+Profiler: rocprof --stats

-matrix_multiply:
-Baseline HIP: 121.4 ms
-Optimized HIP: 89.1 ms
-Speedup: 1.36x
-Bandwidth: 1843.7 GB/s
+------------------------------------------------------------------------
+matrix_multiply (source: matmul_out.stats.csv)
+------------------------------------------------------------------------
+Kernel (baseline): matmul_baseline(float*, float*, float*, int)
+Calls: 5
+Total duration: 379,467 ns
+Average: 75,893 ns → 0.076 ms

-reduction:
-Baseline HIP: 88.2 ms
-Optimized HIP: 68.7 ms
-Speedup: 1.28x
-Bandwidth: 531.8 GB/s
+Kernel (optimized): matmul_tiled(float*, float*, float*, int)
+Calls: 5
+Total duration: 130,618 ns
+Average: 26,123 ns → 0.026 ms

-vector_add:
-Baseline HIP: 45.1 ms
-Optimized HIP: 38.2 ms
-Speedup: 1.18x
-Bandwidth: 4821.6 GB/s
+Speedup: 2.91x (baseline / tiled = 75893 / 26123)
+Optimization: LDS shared-memory tiling, 32x32 tile, block=256
+
+------------------------------------------------------------------------
+vector_add (source: vecadd_out.stats.csv)
+------------------------------------------------------------------------
+Kernel: vector_add(float*, float*, float*, int)
+Input size: 32M elements
+Calls: 10
+Total duration: 976,466 ns
+Average: 97,646 ns → 0.098 ms
+Bandwidth: 3,918 GB/s
+
+------------------------------------------------------------------------
+reduction (source: reduction.stats.csv)
+------------------------------------------------------------------------
+Kernel: reduction(float*, float*, int)
+Input size: 16M elements
+Calls: 10
+Total duration: 424,248 ns
+Average: 42,424 ns → 0.042 ms
+Fix applied: wavefront-64 final stage (tid<64 expanded)
+Correctness: PASS

-Set ROCM_AVAILABLE=true on real MI300X hardware to produce real_rocm values.
docs/benchmark_runs/reduction.stats.csv ADDED
@@ -0,0 +1,3 @@
+"Name","Calls","TotalDurationNs","AverageNs","Percentage"
+"reduction(float*, float*, int) [clone .kd]",10,424248,42424,94.74431754737796
+"__amd_rocclr_fillBufferAligned.kd",1,23534,23534,5.255682452622035
docs/benchmark_runs/vecadd_out.stats.csv ADDED
@@ -0,0 +1,3 @@
+"Name","Calls","TotalDurationNs","AverageNs","Percentage"
+"vector_add(float*, float*, float*, int) [clone .kd]",10,976466,97646,94.6415417658507
+"__amd_rocclr_fillBufferAligned.kd",2,55286,27643,5.358458234149292
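
Reviewer note: in all three added CSV files the `Percentage` column appears to be each row's share of the summed `TotalDurationNs` within that file. That reading is an assumption inferred from the rows shown, not from rocprof documentation. The sketch below is not part of the commit. It checks the assumption against the values copied from the diff, and `max_share_error` is a hypothetical helper name.

```python
# (short name, TotalDurationNs, Percentage) copied from the three CSV files in this commit.
STATS = {
    "matmul_out.stats.csv": [
        ("matmul_baseline", 379467, 72.80766551993415),
        ("matmul_tiled", 130618, 25.06144580393752),
        ("__amd_rocclr_fillBufferAligned.kd", 11106, 2.130888676128329),
    ],
    "reduction.stats.csv": [
        ("reduction", 424248, 94.74431754737796),
        ("__amd_rocclr_fillBufferAligned.kd", 23534, 5.255682452622035),
    ],
    "vecadd_out.stats.csv": [
        ("vector_add", 976466, 94.6415417658507),
        ("__amd_rocclr_fillBufferAligned.kd", 55286, 5.358458234149292),
    ],
}


def max_share_error(rows) -> float:
    """Largest gap, in percentage points, between a row's recomputed share and its listed Percentage."""
    total = sum(duration for _, duration, _ in rows)
    return max(abs(100.0 * duration / total - listed) for _, duration, listed in rows)


ERRORS = {filename: max_share_error(rows) for filename, rows in STATS.items()}

for filename, err in ERRORS.items():
    status = "consistent" if err < 1e-4 else "MISMATCH"
    print(f"{filename}: {status}")
```

A check like this catches a hand-edited duration or percentage, but it cannot confirm that the CSV files came from a real profiler run.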