ROCmPort-AI / docs /FAILURE_CASES.md
tazwarrrr's picture
docs fix
e7a1a69
# Failure Cases
This document records known failure modes with reproducible context.
## FC-001: Inline PTX in CUDA Kernel
### Why this matters
Kernels that embed inline PTX are a realistic migration boundary. hipify can translate CUDA APIs, but it cannot preserve NVIDIA-specific assembly semantics on AMD.
### Original CUDA pattern (simplified)
```cpp
__device__ __forceinline__ unsigned lane_id() {
unsigned lane;
asm volatile("mov.u32 %0, %%laneid;" : "=r"(lane));
return lane;
}
```
### Typical migration output
- CUDA runtime calls are translated.
- Inline PTX block is left unchanged or translated into invalid code for HIP compilation.
### Observed failure mode
- Compile error under hipcc due to unsupported PTX instruction syntax.
- In some cases, compile succeeds after manual edits but semantics differ because lane behavior assumptions are NVIDIA-specific.
### Root cause
- Inline PTX is vendor-specific and outside mechanical translation scope.
- Warp-level assumptions in PTX often rely on 32-lane behavior and NVIDIA ISA details.
### What is required to fix
1. Replace inline PTX with HIP or portable intrinsics.
2. Rework lane-level logic for wavefront-64 behavior where required.
3. Add correctness tests for edge lanes and reduction boundaries.
4. Re-profile after rewrite to confirm no occupancy regressions.
### Trust note
This is a deliberate example of where ROCmPort AI should report risk, not pretend full automation.
## Failure Case: Library-Heavy CUDA Code (CUB, Thrust, cuDNN)
**Input type**: CUDA kernels that call into CUB, Thrust, or cuDNN directly
**Example pattern**:
```cpp
#include <cub/cub.cuh>
cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items);
```
**What happens**: hipify-clang renames the include to `<hipcub/hipcub.hpp>` and the namespace to `hipcub`. ROCmPort AI passes this through. The translation is mechanically correct.
**The limitation**: hipCUB API coverage is not 1:1 with CUB. Some primitives behave differently under ROCm, and performance characteristics differ significantly due to wavefront width. ROCmPort AI does not currently benchmark library calls against rocPRIM equivalents.
**What ROCmPort AI does**: flags the library dependency in the static scan, marks it HIGH risk, and recommends manual review by a ROCm-experienced engineer.
**What ROCmPort AI does not do**: guarantee correctness or performance parity for library-heavy code without human validation.
**Fix requirement**: Manual comparison of CUB vs hipCUB primitive behavior for the specific use case, or replacement with rocPRIM equivalents.
## Failure Case: Flash Attention — Warp Shuffle Intrinsics
**Kernel**: Simplified Flash Attention forward pass (Dao et al. 2022 style)
**File**: backend/demo_kernels/flash_attention_simplified.cu
**Bugs detected by ROCmPort AI static scan**:
- `__shfl_down` with implicit warp-32 offset=16 — on AMD wavefront-64,
the final reduction should use offset=32 first
- Softmax reduction terminates at 16 lanes — silently wrong on gfx942
**What hipify does**: renames cudaFree to hipFree, cuda headers to hip headers.
Does NOT fix the shuffle semantics.
**What ROCmPort AI does**: flags `__shfl_sync` family calls as CRITICAL risk,
and flags unsuffixed `__shfl_down(..., 16)` style reductions as HIGH risk.
It identifies the offset=16 assumption and suggests a wavefront-64 aware rewrite.
**Status**: Compiled and executed on AMD Instinct MI300X (gfx942), ROCm 7.2.
Numerical correctness not verified — requires reference CPU implementation.
**Fix required**: Replace `__shfl_down(x, 16)` with two-stage reduction:
`__shfl_down(x, 32)` then `__shfl_down(x, 16)` for wavefront-64.