# Failure Cases This document records known failure modes with reproducible context. ## FC-001: Inline PTX in CUDA Kernel ### Why this matters Kernels that embed inline PTX are a realistic migration boundary. hipify can translate CUDA APIs, but it cannot preserve NVIDIA-specific assembly semantics on AMD. ### Original CUDA pattern (simplified) ```cpp __device__ __forceinline__ unsigned lane_id() { unsigned lane; asm volatile("mov.u32 %0, %%laneid;" : "=r"(lane)); return lane; } ``` ### Typical migration output - CUDA runtime calls are translated. - Inline PTX block is left unchanged or translated into invalid code for HIP compilation. ### Observed failure mode - Compile error under hipcc due to unsupported PTX instruction syntax. - In some cases, compile succeeds after manual edits but semantics differ because lane behavior assumptions are NVIDIA-specific. ### Root cause - Inline PTX is vendor-specific and outside mechanical translation scope. - Warp-level assumptions in PTX often rely on 32-lane behavior and NVIDIA ISA details. ### What is required to fix 1. Replace inline PTX with HIP or portable intrinsics. 2. Rework lane-level logic for wavefront-64 behavior where required. 3. Add correctness tests for edge lanes and reduction boundaries. 4. Re-profile after rewrite to confirm no occupancy regressions. ### Trust note This is a deliberate example of where ROCmPort AI should report risk, not pretend full automation. ## Failure Case: Library-Heavy CUDA Code (CUB, Thrust, cuDNN) **Input type**: CUDA kernels that call into CUB, Thrust, or cuDNN directly **Example pattern**: ```cpp #include cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items); ``` **What happens**: hipify-clang renames the include to `` and the namespace to `hipcub`. ROCmPort AI passes this through. The translation is mechanically correct. **The limitation**: hipCUB API coverage is not 1:1 with CUB. Some primitives behave differently under ROCm, and performance characteristics differ significantly due to wavefront width. ROCmPort AI does not currently benchmark library calls against rocPRIM equivalents. **What ROCmPort AI does**: flags the library dependency in the static scan, marks it HIGH risk, and recommends manual review by a ROCm-experienced engineer. **What ROCmPort AI does not do**: guarantee correctness or performance parity for library-heavy code without human validation. **Fix requirement**: Manual comparison of CUB vs hipCUB primitive behavior for the specific use case, or replacement with rocPRIM equivalents. ## Failure Case: Flash Attention — Warp Shuffle Intrinsics **Kernel**: Simplified Flash Attention forward pass (Dao et al. 2022 style) **File**: backend/demo_kernels/flash_attention_simplified.cu **Bugs detected by ROCmPort AI static scan**: - `__shfl_down` with implicit warp-32 offset=16 — on AMD wavefront-64, the final reduction should use offset=32 first - Softmax reduction terminates at 16 lanes — silently wrong on gfx942 **What hipify does**: renames cudaFree to hipFree, cuda headers to hip headers. Does NOT fix the shuffle semantics. **What ROCmPort AI does**: flags `__shfl_sync` family calls as CRITICAL risk, and flags unsuffixed `__shfl_down(..., 16)` style reductions as HIGH risk. It identifies the offset=16 assumption and suggests a wavefront-64 aware rewrite. **Status**: Compiled and executed on AMD Instinct MI300X (gfx942), ROCm 7.2. Numerical correctness not verified — requires reference CPU implementation. **Fix required**: Replace `__shfl_down(x, 16)` with two-stage reduction: `__shfl_down(x, 32)` then `__shfl_down(x, 16)` for wavefront-64.