Live Results — AMD Instinct MI300X

Measurements were taken with rocprof on an AMD Instinct MI300X (gfx942) running ROCm 7.0 on AMD Developer Cloud, on May 8, 2026.

Raw profiler CSV files are in docs/benchmark_runs/. Values in backend/tools/demo_artifacts.py are labelled data_source="mi300x_live" and match these CSV files exactly.

Benchmark Results

Kernel	Baseline (ms)	Optimized (ms)	Speedup	Bandwidth	Source CSV
matrix_multiply (512×512)	0.076	0.026	2.91x	—	matmul_out.stats.csv
vector_add (32M elements)	—	0.098	—	3,918 GB/s	vecadd_out.stats.csv
reduction (16M elements)	—	0.042	—	—	reduction.stats.csv

The convolution_2d kernel was not profiled in this hardware run and is excluded from the evidence table. Only kernels with traceable rocprof CSV output are reported here.

Note: vector_add and reduction were run standalone. No pre-optimisation baseline was captured in these runs, so their Baseline and Speedup columns are empty. For matrix_multiply, the baseline is matmul_baseline (hipify output, no tiling) and the optimized kernel is matmul_tiled (LDS 32×32 tile, wavefront-64 aligned).

rocprof Run Details

Hardware: AMD Instinct MI300X, gfx942
ROCm version: 7.0
Platform: AMD Developer Cloud
Date: May 8, 2026
Profiler: rocprof --stats
Wavefront size: 64
HBM3: 192 GB, 5.3 TB/s theoretical peak bandwidth

Raw CSV Quick-Reference

matmul_out.stats.csv (docs/benchmark_runs/matmul_out.stats.csv)

matmul_baseline: 5 calls, total 379,467 ns, avg 75,893 ns (0.076 ms)
matmul_tiled: 5 calls, total 130,618 ns, avg 26,123 ns (0.026 ms) → 2.91× speedup
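
The averages and the speedup above follow directly from the totals and call counts. A minimal sketch of that arithmetic, using only the numbers listed for matmul_out.stats.csv (the helper name average_ns is illustrative and not part of the repository):

```python
# Totals and call counts copied from the matmul_out.stats.csv summary above.
def average_ns(total_ns: int, calls: int) -> float:
    """Mean kernel duration in nanoseconds."""
    return total_ns / calls

baseline_avg = average_ns(379_467, 5)  # matmul_baseline
tiled_avg = average_ns(130_618, 5)     # matmul_tiled

# Speedup is the ratio of average baseline time to average optimized time.
speedup = baseline_avg / tiled_avg

print(f"baseline: {baseline_avg / 1e6:.3f} ms")  # baseline: 0.076 ms
print(f"tiled:    {tiled_avg / 1e6:.3f} ms")     # tiled:    0.026 ms
print(f"speedup:  {speedup:.2f}x")               # speedup:  2.91x
```

Both kernels were called the same number of times, so the ratio of totals gives the same speedup as the ratio of averages.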

vecadd_out.stats.csv (docs/benchmark_runs/vecadd_out.stats.csv)

vector_add: 10 calls, total 976,466 ns, avg 97,646 ns (0.098 ms) → 3,918 GB/s
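
The source does not state how the 3,918 GB/s figure was derived. One reading that reproduces it, sketched below, assumes float32 elements, 32M meaning 32 × 10^6, three arrays touched per element (two reads and one write), and the rounded 0.098 ms time. All four are assumptions, and the function name is hypothetical.

```python
def effective_bandwidth_gbs(n_elements: int, bytes_per_element: int,
                            arrays_touched: int, elapsed_ms: float) -> float:
    """Effective memory bandwidth in GB/s, with 1 GB = 1e9 bytes."""
    bytes_moved = n_elements * bytes_per_element * arrays_touched
    seconds = elapsed_ms / 1e3
    return bytes_moved / seconds / 1e9

# Assumptions (not confirmed by the source): float32 data, 32M = 32 * 10**6,
# and c[i] = a[i] + b[i] touching three arrays (read a, read b, write c).
bw = effective_bandwidth_gbs(32_000_000, 4, 3, 0.098)
print(f"{bw:.0f} GB/s")  # 3918 GB/s

# Share of the 5.3 TB/s theoretical peak listed under Run Details.
fraction_of_peak = bw / 5300.0
print(f"{fraction_of_peak:.0%} of theoretical peak")
```

Under the same assumptions, the unrounded average of 97,646 ns yields a slightly higher figure, so the reported value appears to be based on the rounded time.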

reduction.stats.csv (docs/benchmark_runs/reduction.stats.csv)

reduction: 10 calls, total 424,248 ns, avg 42,424 ns (0.042 ms)
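
The quick-reference entries can be regenerated from a stats CSV with a few lines of Python. The sketch below is not the project's tooling. It assumes the files carry Name, Calls, and TotalDurationNs columns, which should be checked against the real headers in docs/benchmark_runs/. The sample text is an illustrative stand-in rebuilt from the numbers above, with simplified kernel names, and is not a copy of the actual file.

```python
import csv
import io

# Illustrative stand-in for a stats CSV, rebuilt from the quick-reference
# numbers above. Column names are an assumption about the rocprof --stats
# layout and the kernel names are simplified.
SAMPLE = """\
Name,Calls,TotalDurationNs,AverageNs
matmul_baseline,5,379467,75893
matmul_tiled,5,130618,26123
"""

def summarize(stats_csv: str) -> dict[str, float]:
    """Map each kernel name to its average duration in milliseconds."""
    rows = csv.DictReader(io.StringIO(stats_csv))
    return {
        row["Name"]: int(row["TotalDurationNs"]) / int(row["Calls"]) / 1e6
        for row in rows
    }

averages = summarize(SAMPLE)
for name, ms in averages.items():
    print(f"{name}: {ms:.3f} ms")
```

To run it on a real file, pass the file's contents, for example `summarize(open(path).read())`, where path points at one of the CSVs listed above.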