
Live Results — AMD Instinct MI300X

Measurements were taken with rocprof on an AMD Instinct MI300X (gfx942) running ROCm 7.0 on AMD Developer Cloud, on May 8, 2026.

Raw profiler CSV files are in docs/benchmark_runs/. Values in backend/tools/demo_artifacts.py are labelled data_source="mi300x_live" and match these CSV files exactly.

Benchmark Results

| Kernel | Baseline (ms) | Optimized (ms) | Speedup | Bandwidth | Source CSV |
|---|---|---|---|---|---|
| matrix_multiply (512×512) | 0.076 | 0.026 | 2.91× | n/a | matmul_out.stats.csv |
| vector_add (32M elements) | n/a | 0.098 | n/a | 3,918 GB/s | vecadd_out.stats.csv |
| reduction (16M elements) | n/a | 0.042 | n/a | n/a | reduction.stats.csv |

The convolution_2d kernel was not profiled in this hardware run and is excluded from the evidence table. Only kernels with traceable rocprof CSV output are reported here.

Note: vector_add and reduction were run standalone; no pre-optimisation baseline was captured in these runs, so no speedup is reported for them. The matrix_multiply baseline is matmul_baseline (hipify output, no tiling); the optimized kernel is matmul_tiled (LDS 32×32 tile, wavefront-64 aligned).

rocprof Run Details

  • Hardware: AMD Instinct MI300X, gfx942
  • ROCm version: 7.0
  • Platform: AMD Developer Cloud
  • Date: May 8 2026
  • Profiler: rocprof --stats
  • Wavefront size: 64
  • HBM3: 192 GB, 5.3 TB/s theoretical

Raw CSV Quick-Reference

matmul_out.stats.csv (docs/benchmark_runs/matmul_out.stats.csv)

  • matmul_baseline: 5 calls, total 379,467 ns, avg 75,893 ns (0.076 ms)
  • matmul_tiled: 5 calls, total 130,618 ns, avg 26,123 ns (0.026 ms); 2.91× speedup
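
The 2.91× figure can be reproduced from the totals and call counts quoted above. The following is a minimal sketch; the helper name `speedup` is hypothetical and is not taken from the repository.

```python
# Sketch: recompute the matrix_multiply speedup from the rocprof totals
# quoted above. `speedup` is an illustrative helper, not project code.

def speedup(baseline_total_ns, baseline_calls, optimized_total_ns, optimized_calls):
    """Ratio of average per-call durations (baseline / optimized)."""
    baseline_avg = baseline_total_ns / baseline_calls
    optimized_avg = optimized_total_ns / optimized_calls
    return baseline_avg / optimized_avg

# matmul_baseline: 5 calls, 379,467 ns total
# matmul_tiled:    5 calls, 130,618 ns total
matmul_speedup = speedup(379_467, 5, 130_618, 5)
print(f"{matmul_speedup:.2f}x")

# Dividing the rounded millisecond values instead gives a slightly
# different ratio, so the nanosecond values are the better input.
rounded_ratio = 0.076 / 0.026
print(f"{rounded_ratio:.2f}x")
```

The first line printed matches the reported speedup. The second shows why the ratio should be taken from the nanosecond totals rather than from the rounded millisecond columns.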

vecadd_out.stats.csv (docs/benchmark_runs/vecadd_out.stats.csv)

  • vector_add: 10 calls, total 976,466 ns, avg 97,646 ns (0.098 ms); 3,918 GB/s
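
The source does not state how the 3,918 GB/s figure was derived. The sketch below shows one set of assumptions that reproduces it: 32,000,000 float32 elements, three arrays touched per element (two reads and one write), and the rounded 0.098 ms average. These assumptions and the helper name `effective_bandwidth_gbs` are illustrative and are not confirmed by the repository.

```python
# Sketch: effective memory bandwidth for vector_add under stated assumptions.
# ASSUMPTIONS (not stated in the source): float32 data, "32M" = 32,000,000
# elements, and 3 arrays touched per element (read a, read b, write c).

def effective_bandwidth_gbs(num_elements, bytes_per_element, arrays_touched, time_ms):
    """Bytes moved divided by kernel time, in GB/s (1 GB = 1e9 bytes)."""
    bytes_moved = num_elements * bytes_per_element * arrays_touched
    seconds = time_ms * 1e-3
    return bytes_moved / seconds / 1e9

bw = effective_bandwidth_gbs(32_000_000, 4, 3, 0.098)
print(f"{bw:,.0f} GB/s")

# Fraction of the 5.3 TB/s theoretical HBM3 bandwidth listed under
# rocprof Run Details.
fraction_of_peak = bw / 5300.0
print(f"{fraction_of_peak:.0%} of theoretical peak")
```

Under these assumptions the result matches the reported value and corresponds to roughly three quarters of the theoretical HBM3 bandwidth. Using the unrounded 97,646 ns average instead of 0.098 ms gives a slightly higher figure.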

reduction.stats.csv (docs/benchmark_runs/reduction.stats.csv)

  • reduction: 10 calls, total 424,248 ns, avg 42,424 ns (0.042 ms)
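
The quick-reference figures above can be cross-checked against each other. The sketch below transcribes the call counts, totals, averages and millisecond values from this page; it does not read the CSV files themselves. It checks that each average equals the total divided by the call count, and that each millisecond value is the average rounded to three decimals. The `runs` table and the `is_consistent` helper are hypothetical names used for illustration.

```python
# Sketch: consistency check over the figures transcribed from this page.
# Each entry: kernel name -> (calls, total_ns, avg_ns, reported_ms)
runs = {
    "matmul_baseline": (5, 379_467, 75_893, 0.076),
    "matmul_tiled": (5, 130_618, 26_123, 0.026),
    "vector_add": (10, 976_466, 97_646, 0.098),
    "reduction": (10, 424_248, 42_424, 0.042),
}

def is_consistent(calls, total_ns, avg_ns, reported_ms):
    """True if avg is the truncated mean and ms is avg rounded to 3 places."""
    # The averages on this page match integer (truncating) division,
    # e.g. 130,618 / 5 = 26,123.6 is listed as 26,123.
    avg_ok = total_ns // calls == avg_ns
    ms_ok = round(avg_ns / 1e6, 3) == reported_ms
    return avg_ok and ms_ok

results = {name: is_consistent(*row) for name, row in runs.items()}
for name, ok in results.items():
    print(f"{name}: {'consistent' if ok else 'MISMATCH'}")
```

A check of this kind could be extended to compare the same values against backend/tools/demo_artifacts.py, whose layout is not shown here.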