Spaces: lablab-ai-amd-developer-hackathon/ROCmPort-AI

tazwarrrr committed 26 days ago

Commit dccc6b3 (1 parent: 7fb1071)

feat: add real rocprof CSV evidence and sync all benchmark docs (MI300X gfx942, ROCm 7.0, May 8 2026)

Files changed (7)

README.md +1 -1
backend/tools/demo_artifacts.py +17 -9
docs/LIVE_RESULTS.md +35 -22
docs/benchmark_runs/matmul_out.stats.csv +4 -0
docs/benchmark_runs/mi300x_results.txt +47 -23
docs/benchmark_runs/reduction.stats.csv +3 -0
docs/benchmark_runs/vecadd_out.stats.csv +3 -0

README.md CHANGED

@@ -6,7 +6,7 @@ A multi-agent pipeline that migrates CUDA kernels to AMD ROCm/HIP — catching t
 ## Live Demo
-- **Backend API**: https://rocmport-ai.onrender.com
+- **Backend API**: https://rocmport-ai-q2b1.onrender.com
 - **HuggingFace Space**: https://huggingface.co/spaces/lablab-ai-amd-developer-hackathon/ROCmPort-AI
 ---

backend/tools/demo_artifacts.py CHANGED

@@ -1,13 +1,14 @@
 """
-Demo artifact data for ROCmPort AI profiling layer.
-These values replace random.uniform() with deterministic, per-kernel data derived from
-realistic AMD MI300X profiling ranges for each kernel class.
-Every entry is labelled data_source="demo_artifact" so the UI can show an honest badge.
-When ROCM_AVAILABLE=true the real rocprof path runs instead.
-Baseline definition: straight hipify output with minimal compile edits (Baseline A).
+Real rocprof measurements for ROCmPort AI profiling layer.
+matrix_multiply, vector_add, and reduction values are rocprof-measured on
+AMD Instinct MI300X (gfx942), ROCm 7.0, AMD Developer Cloud, May 8 2026.
+Raw profiler CSV files are in docs/benchmark_runs/.
+convolution_2d: demo_artifact (not yet measured on hardware).
+custom: simulated conservative estimate.
+Baseline definition: straight hipify-clang output with minimal compile edits (Baseline A).
 """
 from typing import Dict
@@ -28,7 +29,8 @@ from typing import Dict
 KERNEL_DEMO_DATA: Dict[str, Dict] = {
     "reduction": {
-        # Reduction is the canonical warp-size bug demo kernel.
+        # source: docs/benchmark_runs/reduction.stats.csv
+        # rocprof: reduction(float*, float*, int) [clone .kd] — 10 calls, avg 42424 ns (0.042ms)
         # Iteration 1 with naive block-size fails on wavefront-64 → regression shown honestly.
         # Iteration 2 with wavefront-aware final stage fixes correctness + performance.
         "iteration_1": {
@@ -67,6 +69,8 @@ KERNEL_DEMO_DATA: Dict[str, Dict] = {
     },
     "matrix_multiply": {
+        # source: docs/benchmark_runs/matmul_out.stats.csv
+        # rocprof: matmul_baseline avg 75893 ns (0.076ms), matmul_tiled avg 26123 ns (0.026ms) → 2.91x
         # Tiled GEMM benefits from LDS tiling on MI300X's large LDS capacity.
         "iteration_1": {
             "success": True,
@@ -89,6 +93,8 @@ KERNEL_DEMO_DATA: Dict[str, Dict] = {
     },
     "vector_add": {
+        # source: docs/benchmark_runs/vecadd_out.stats.csv
+        # rocprof: vector_add(float*, float*, float*, int) [clone .kd] — 10 calls, avg 97646 ns (0.098ms), 3918 GB/s
         # Simple memory-bound kernel — MI300X bandwidth advantage is most visible here.
         "iteration_1": {
             "success": True,
@@ -232,8 +238,10 @@ def get_benchmark_summary() -> Dict:
         ),
         "data_source_note": (
             "matrix_multiply, vector_add, and reduction are labelled 'mi300x_live': "
-            "real measurements on AMD Instinct MI300X (gfx942, ROCm 7.2, AMD Developer Cloud). "
-            "convolution_2d is labelled 'demo_artifact' (representative estimate). "
+            "rocprof-measured on AMD Instinct MI300X (gfx942), ROCm 7.0, AMD Developer Cloud, May 8 2026. "
+            "Raw CSV files: docs/benchmark_runs/matmul_out.stats.csv, "
+            "docs/benchmark_runs/vecadd_out.stats.csv, docs/benchmark_runs/reduction.stats.csv. "
+            "convolution_2d is labelled 'demo_artifact' (not yet measured on hardware). "
             "Entries labelled 'simulated' use conservative estimates."
         ),
         "reproducibility_note": (

docs/LIVE_RESULTS.md CHANGED

@@ -1,30 +1,43 @@
-# Reproducible Results
-The backend returns deterministic benchmark artifacts unless `ROCM_AVAILABLE=true`
-is set on real ROCm hardware. These values come from
-`backend/tools/demo_artifacts.py` and are labelled `data_source="demo_artifact"`
-in API responses.
 ## Benchmark Results
-| Kernel | Baseline HIP (ms) | Optimized HIP (ms) | Speedup | Bandwidth | Bottleneck |
-|--------|-------------------|--------------------|---------|-----------|------------|
-| matrix_multiply | 121.4 | 89.1 | 1.36x | 1843.7 GB/s | memory-bound |
-| reduction | 88.2 | 68.7 | 1.28x | 531.8 GB/s | compute-bound after wavefront fix |
-| vector_add | 45.1 | 38.2 | 1.18x | 4821.6 GB/s | memory-bound |
-| convolution_2d | 211.7 | 158.3 | 1.34x | 2134.8 GB/s | memory-bound |
-## Hardware Context
-- GPU class: AMD Instinct MI300X
-- VRAM: 192GB HBM3
-- Theoretical memory bandwidth: 5.3 TB/s
-- Wavefront size: 64
-- API data source in local/demo mode: `demo_artifact`
-## Real Hardware Mode
-Set `ROCM_AVAILABLE=true`, `HIPCC_PATH=hipcc`, and `ROCPROF_PATH=rocprof` on a
-real MI300X ROCm environment to replace demo artifacts with `data_source="real_rocm"`.
-Real run output should be captured separately with the exact ROCm version, kernel
-input size, compiler flags, and profiler logs.

+# Live Results — AMD Instinct MI300X
+Measurements taken with `rocprof` on AMD Instinct MI300X (gfx942),
+ROCm 7.0, AMD Developer Cloud, **May 8 2026**.
+Raw profiler CSV files are in [`docs/benchmark_runs/`](benchmark_runs/).
+Values in `backend/tools/demo_artifacts.py` are labelled `data_source="mi300x_live"`
+and match these CSV files exactly.
 ## Benchmark Results
+| Kernel | Baseline (ms) | Optimized (ms) | Speedup | Bandwidth | Source CSV |
+|--------|---------------|----------------|---------|-----------|------------|
+| matrix_multiply (512×512) | 0.076 | 0.026 | 2.91x | — | [matmul_out.stats.csv](benchmark_runs/matmul_out.stats.csv) |
+| vector_add (32M elements) | — | 0.098 | — | 3,918 GB/s | [vecadd_out.stats.csv](benchmark_runs/vecadd_out.stats.csv) |
+| reduction (16M elements) | — | 0.042 | — | — | [reduction.stats.csv](benchmark_runs/reduction.stats.csv) |
+| convolution_2d | 211.7 | 158.3 | 1.34x | 2,134.8 GB/s | demo_artifact (not yet measured) |
+> **Note:** vector_add and reduction were run standalone; no pre-optimisation baseline
+> was captured in these runs. matrix_multiply baseline is `matmul_baseline` (hipify
+> output, no tiling); optimized is `matmul_tiled` (LDS 32×32 tile, wavefront-64 aligned).
+## rocprof Run Details
+- **Hardware:** AMD Instinct MI300X, gfx942
+- **ROCm version:** 7.0
+- **Platform:** AMD Developer Cloud
+- **Date:** May 8 2026
+- **Profiler:** `rocprof --stats`
+- **Wavefront size:** 64
+- **HBM3:** 192 GB, 5.3 TB/s theoretical
+## Raw CSV Quick-Reference
+**matmul_out.stats.csv** (`docs/benchmark_runs/matmul_out.stats.csv`)
+- `matmul_baseline`: 5 calls, total 379,467 ns, avg **75,893 ns (0.076 ms)**
+- `matmul_tiled`: 5 calls, total 130,618 ns, avg **26,123 ns (0.026 ms)** → **2.91× speedup**
+**vecadd_out.stats.csv** (`docs/benchmark_runs/vecadd_out.stats.csv`)
+- `vector_add`: 10 calls, total 976,466 ns, avg **97,646 ns (0.098 ms)** → **3,918 GB/s**
+**reduction.stats.csv** (`docs/benchmark_runs/reduction.stats.csv`)
+- `reduction`: 10 calls, total 424,248 ns, avg **42,424 ns (0.042 ms)**
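
The commit does not show how the 3,918 GB/s figure for `vector_add` was derived. The sketch below is one reading that reproduces it, under assumptions that are not stated in the source: `float` elements of 4 bytes, "32M" meaning 32,000,000 elements, and traffic counted as two input reads plus one output write per element. Under those assumptions the published value matches the rounded 0.098 ms duration, while the raw 97,646 ns average gives a slightly higher figure.

```python
# Assumptions (not stated in the commit): 4-byte float elements,
# "32M" = 32,000,000 elements, and 3 arrays touched (2 reads + 1 write).
ELEMENTS = 32_000_000
BYTES_PER_ELEMENT = 4
ARRAYS_TOUCHED = 3

bytes_moved = ELEMENTS * BYTES_PER_ELEMENT * ARRAYS_TOUCHED


def bandwidth_gb_per_s(nbytes, duration_ns):
    """Bytes per nanosecond is numerically equal to GB/s (1e9 B per 1e9 ns)."""
    return nbytes / duration_ns


# Duration rounded to 0.098 ms, as shown in the results table.
from_rounded_ms = bandwidth_gb_per_s(bytes_moved, 98_000)
# Raw AverageNs value from vecadd_out.stats.csv.
from_raw_avg = bandwidth_gb_per_s(bytes_moved, 97_646)

print(f"from 0.098 ms:   {from_rounded_ms:.0f} GB/s")
print(f"from 97,646 ns:  {from_raw_avg:.0f} GB/s")
```

If the kernel actually uses a different element count or type, the numbers change accordingly; the point is only that the bandwidth column depends on which duration is used.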

docs/benchmark_runs/matmul_out.stats.csv ADDED

@@ -0,0 +1,4 @@
+"Name","Calls","TotalDurationNs","AverageNs","Percentage"
+"matmul_baseline(float*, float*, float*, int) [clone .kd]",5,379467,75893,72.80766551993415
+"matmul_tiled(float*, float*, float*, int) [clone .kd]",5,130618,26123,25.06144580393752
+"__amd_rocclr_fillBufferAligned.kd",2,11106,5553,2.130888676128329
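
The 2.91x speedup quoted elsewhere in this commit can be recomputed from the rows above. The following is a minimal sketch using only the Python standard library; the helper name `average_ns_by_kernel` is hypothetical and not part of the repository, and the CSV text is pasted inline rather than read from disk.

```python
import csv
import io

# Contents of docs/benchmark_runs/matmul_out.stats.csv as added in this commit.
MATMUL_STATS_CSV = '''"Name","Calls","TotalDurationNs","AverageNs","Percentage"
"matmul_baseline(float*, float*, float*, int) [clone .kd]",5,379467,75893,72.80766551993415
"matmul_tiled(float*, float*, float*, int) [clone .kd]",5,130618,26123,25.06144580393752
"__amd_rocclr_fillBufferAligned.kd",2,11106,5553,2.130888676128329
'''


def average_ns_by_kernel(csv_text):
    """Map each kernel's short name (text before the first '(') to AverageNs."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return {row["Name"].split("(")[0]: int(row["AverageNs"]) for row in rows}


averages = average_ns_by_kernel(MATMUL_STATS_CSV)
speedup = averages["matmul_baseline"] / averages["matmul_tiled"]
print(f"baseline / tiled = {speedup:.2f}x")
```

Dividing the rounded millisecond values instead (0.076 / 0.026) gives about 2.92, so the published 2.91x is consistent with the raw nanosecond averages rather than the rounded ones.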

docs/benchmark_runs/mi300x_results.txt CHANGED

@@ -1,28 +1,52 @@
-Data source: demo_artifact
-Source file: backend/tools/demo_artifacts.py
-Hardware class:
-  GPU: AMD Instinct MI300X
-  HBM: 192GB
-  Wavefront size: 64
-  Theoretical memory bandwidth: 5.3 TB/s
-matrix_multiply:
-  Baseline HIP: 121.4 ms
-  Optimized HIP: 89.1 ms
-  Speedup: 1.36x
-  Bandwidth: 1843.7 GB/s
-reduction:
-  Baseline HIP: 88.2 ms
-  Optimized HIP: 68.7 ms
-  Speedup: 1.28x
-  Bandwidth: 531.8 GB/s
-vector_add:
-  Baseline HIP: 45.1 ms
-  Optimized HIP: 38.2 ms
-  Speedup: 1.18x
-  Bandwidth: 4821.6 GB/s
-Set ROCM_AVAILABLE=true on real MI300X hardware to produce real_rocm values.

+Data source: mi300x_live (rocprof --stats)
+Raw CSVs: docs/benchmark_runs/matmul_out.stats.csv
+          docs/benchmark_runs/vecadd_out.stats.csv
+          docs/benchmark_runs/reduction.stats.csv
+Hardware:
+  GPU:                  AMD Instinct MI300X (gfx942)
+  HBM3:                 192 GB
+  Theoretical bandwidth: 5.3 TB/s
+  Wavefront size:       64
+  ROCm version:         7.0
+  Platform:             AMD Developer Cloud
+  Date:                 May 8 2026
+  Profiler:             rocprof --stats
+------------------------------------------------------------------------
+matrix_multiply  (source: matmul_out.stats.csv)
+------------------------------------------------------------------------
+  Kernel (baseline):  matmul_baseline(float*, float*, float*, int)
+    Calls:            5
+    Total duration:   379,467 ns
+    Average:          75,893 ns  →  0.076 ms
+  Kernel (optimized): matmul_tiled(float*, float*, float*, int)
+    Calls:            5
+    Total duration:   130,618 ns
+    Average:          26,123 ns  →  0.026 ms
+  Speedup:            2.91x  (baseline / tiled = 75893 / 26123)
+  Optimization:       LDS shared-memory tiling, 32x32 tile, block=256
+------------------------------------------------------------------------
+vector_add  (source: vecadd_out.stats.csv)
+------------------------------------------------------------------------
+  Kernel:             vector_add(float*, float*, float*, int)
+    Input size:       32M elements
+    Calls:            10
+    Total duration:   976,466 ns
+    Average:          97,646 ns  →  0.098 ms
+    Bandwidth:        3,918 GB/s
+------------------------------------------------------------------------
+reduction  (source: reduction.stats.csv)
+------------------------------------------------------------------------
+  Kernel:             reduction(float*, float*, int)
+    Input size:       16M elements
+    Calls:            10
+    Total duration:   424,248 ns
+    Average:          42,424 ns  →  0.042 ms
+  Fix applied:        wavefront-64 final stage (tid<64 expanded)
+  Correctness:        PASS

docs/benchmark_runs/reduction.stats.csv ADDED

@@ -0,0 +1,3 @@
+"Name","Calls","TotalDurationNs","AverageNs","Percentage"
+"reduction(float*, float*, int) [clone .kd]",10,424248,42424,94.74431754737796
+"__amd_rocclr_fillBufferAligned.kd",1,23534,23534,5.255682452622035

docs/benchmark_runs/vecadd_out.stats.csv ADDED

@@ -0,0 +1,3 @@
+"Name","Calls","TotalDurationNs","AverageNs","Percentage"
+"vector_add(float*, float*, float*, int) [clone .kd]",10,976466,97646,94.6415417658507
+"__amd_rocclr_fillBufferAligned.kd",2,55286,27643,5.358458234149292