ROCmPort-AI / docs /FAILURE_CASES.md
tazwarrrr's picture
docs fix
e7a1a69

Failure Cases

This document records known failure modes with reproducible context.

FC-001: Inline PTX in CUDA Kernel

Why this matters

Kernels that embed inline PTX are a realistic migration boundary. hipify can translate CUDA APIs, but it cannot preserve NVIDIA-specific assembly semantics on AMD.

Original CUDA pattern (simplified)

__device__ __forceinline__ unsigned lane_id() {
  unsigned lane;
  asm volatile("mov.u32 %0, %%laneid;" : "=r"(lane));
  return lane;
}

Typical migration output

  • CUDA runtime calls are translated.
  • Inline PTX block is left unchanged or translated into invalid code for HIP compilation.

Observed failure mode

  • Compile error under hipcc due to unsupported PTX instruction syntax.
  • In some cases, compile succeeds after manual edits but semantics differ because lane behavior assumptions are NVIDIA-specific.

Root cause

  • Inline PTX is vendor-specific and outside mechanical translation scope.
  • Warp-level assumptions in PTX often rely on 32-lane behavior and NVIDIA ISA details.

What is required to fix

  1. Replace inline PTX with HIP or portable intrinsics.
  2. Rework lane-level logic for wavefront-64 behavior where required.
  3. Add correctness tests for edge lanes and reduction boundaries.
  4. Re-profile after rewrite to confirm no occupancy regressions.

Trust note

This is a deliberate example of where ROCmPort AI should report risk, not pretend full automation.

Failure Case: Library-Heavy CUDA Code (CUB, Thrust, cuDNN)

Input type: CUDA kernels that call into CUB, Thrust, or cuDNN directly

Example pattern:

#include <cub/cub.cuh>
cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items);

What happens: hipify-clang renames the include to <hipcub/hipcub.hpp> and the namespace to hipcub. ROCmPort AI passes this through. The translation is mechanically correct.

The limitation: hipCUB API coverage is not 1:1 with CUB. Some primitives behave differently under ROCm, and performance characteristics differ significantly due to wavefront width. ROCmPort AI does not currently benchmark library calls against rocPRIM equivalents.

What ROCmPort AI does: flags the library dependency in the static scan, marks it HIGH risk, and recommends manual review by a ROCm-experienced engineer.

What ROCmPort AI does not do: guarantee correctness or performance parity for library-heavy code without human validation.

Fix requirement: Manual comparison of CUB vs hipCUB primitive behavior for the specific use case, or replacement with rocPRIM equivalents.

Failure Case: Flash Attention — Warp Shuffle Intrinsics

Kernel: Simplified Flash Attention forward pass (Dao et al. 2022 style) File: backend/demo_kernels/flash_attention_simplified.cu

Bugs detected by ROCmPort AI static scan:

  • __shfl_down with implicit warp-32 offset=16 — on AMD wavefront-64, the final reduction should use offset=32 first
  • Softmax reduction terminates at 16 lanes — silently wrong on gfx942

What hipify does: renames cudaFree to hipFree, cuda headers to hip headers. Does NOT fix the shuffle semantics.

What ROCmPort AI does: flags __shfl_sync family calls as CRITICAL risk, and flags unsuffixed __shfl_down(..., 16) style reductions as HIGH risk. It identifies the offset=16 assumption and suggests a wavefront-64 aware rewrite.

Status: Compiled and executed on AMD Instinct MI300X (gfx942), ROCm 7.2. Numerical correctness not verified — requires reference CPU implementation.

Fix required: Replace __shfl_down(x, 16) with two-stage reduction: __shfl_down(x, 32) then __shfl_down(x, 16) for wavefront-64.