# Failure Cases

This document records known failure modes with reproducible context.

## FC-001: Inline PTX in CUDA Kernel

### Why this matters
Kernels that embed inline PTX are a realistic migration boundary. hipify can translate CUDA APIs, but it cannot preserve NVIDIA-specific assembly semantics on AMD.

### Original CUDA pattern (simplified)
```cpp
__device__ __forceinline__ unsigned lane_id() {
  unsigned lane;
  asm volatile("mov.u32 %0, %%laneid;" : "=r"(lane));
  return lane;
}
```

### Typical migration output
- CUDA runtime calls are translated.
- Inline PTX block is left unchanged or translated into invalid code for HIP compilation.

### Observed failure mode
- Compile error under hipcc due to unsupported PTX instruction syntax.
- In some cases, compile succeeds after manual edits but semantics differ because lane behavior assumptions are NVIDIA-specific.

### Root cause
- Inline PTX is vendor-specific and outside mechanical translation scope.
- Warp-level assumptions in PTX often rely on 32-lane behavior and NVIDIA ISA details.

### What is required to fix
1. Replace inline PTX with HIP or portable intrinsics.
2. Rework lane-level logic for wavefront-64 behavior where required.
3. Add correctness tests for edge lanes and reduction boundaries.
4. Re-profile after rewrite to confirm no occupancy regressions.

### Trust note
This is a deliberate example of where ROCmPort AI should report risk, not pretend full automation.

## Failure Case: Library-Heavy CUDA Code (CUB, Thrust, cuDNN)

**Input type**: CUDA kernels that call into CUB, Thrust, or cuDNN directly

**Example pattern**:
```cpp
#include <cub/cub.cuh>
cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items);
```

**What happens**: hipify-clang renames the include to `<hipcub/hipcub.hpp>` and the namespace to `hipcub`. ROCmPort AI passes this through. The translation is mechanically correct.

**The limitation**: hipCUB API coverage is not 1:1 with CUB. Some primitives behave differently under ROCm, and performance characteristics differ significantly due to wavefront width. ROCmPort AI does not currently benchmark library calls against rocPRIM equivalents.

**What ROCmPort AI does**: flags the library dependency in the static scan, marks it HIGH risk, and recommends manual review by a ROCm-experienced engineer.

**What ROCmPort AI does not do**: guarantee correctness or performance parity for library-heavy code without human validation.

**Fix requirement**: Manual comparison of CUB vs hipCUB primitive behavior for the specific use case, or replacement with rocPRIM equivalents.

## Failure Case: Flash Attention — Warp Shuffle Intrinsics

**Kernel**: Simplified Flash Attention forward pass (Dao et al. 2022 style)
**File**: backend/demo_kernels/flash_attention_simplified.cu

**Bugs detected by ROCmPort AI static scan**:
- `__shfl_down` with implicit warp-32 offset=16 — on AMD wavefront-64, 
  the final reduction should use offset=32 first
- Softmax reduction terminates at 16 lanes — silently wrong on gfx942

**What hipify does**: renames cudaFree to hipFree, cuda headers to hip headers. 
Does NOT fix the shuffle semantics.

**What ROCmPort AI does**: flags `__shfl_sync` family calls as CRITICAL risk,
and flags unsuffixed `__shfl_down(..., 16)` style reductions as HIGH risk.
It identifies the offset=16 assumption and suggests a wavefront-64 aware rewrite.

**Status**: Compiled and executed on AMD Instinct MI300X (gfx942), ROCm 7.2.
Numerical correctness not verified — requires reference CPU implementation.

**Fix required**: Replace `__shfl_down(x, 16)` with two-stage reduction:
  `__shfl_down(x, 32)` then `__shfl_down(x, 16)` for wavefront-64.