File size: 3,711 Bytes
a5be23e
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
3de7600
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
212db16
 
 
 
 
 
 
 
 
 
 
 
 
 
e7a1a69
 
 
212db16
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
# Failure Cases

This document records known failure modes with reproducible context.

## FC-001: Inline PTX in CUDA Kernel

### Why this matters
Kernels that embed inline PTX are a realistic migration boundary. hipify can translate CUDA APIs, but it cannot preserve NVIDIA-specific assembly semantics on AMD.

### Original CUDA pattern (simplified)
```cpp
__device__ __forceinline__ unsigned lane_id() {
  unsigned lane;
  asm volatile("mov.u32 %0, %%laneid;" : "=r"(lane));
  return lane;
}
```

### Typical migration output
- CUDA runtime calls are translated.
- Inline PTX block is left unchanged or translated into invalid code for HIP compilation.

### Observed failure mode
- Compile error under hipcc due to unsupported PTX instruction syntax.
- In some cases, compile succeeds after manual edits but semantics differ because lane behavior assumptions are NVIDIA-specific.

### Root cause
- Inline PTX is vendor-specific and outside mechanical translation scope.
- Warp-level assumptions in PTX often rely on 32-lane behavior and NVIDIA ISA details.

### What is required to fix
1. Replace inline PTX with HIP or portable intrinsics.
2. Rework lane-level logic for wavefront-64 behavior where required.
3. Add correctness tests for edge lanes and reduction boundaries.
4. Re-profile after rewrite to confirm no occupancy regressions.

### Trust note
This is a deliberate example of where ROCmPort AI should report risk, not pretend full automation.

## Failure Case: Library-Heavy CUDA Code (CUB, Thrust, cuDNN)

**Input type**: CUDA kernels that call into CUB, Thrust, or cuDNN directly

**Example pattern**:
```cpp
#include <cub/cub.cuh>
cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items);
```

**What happens**: hipify-clang renames the include to `<hipcub/hipcub.hpp>` and the namespace to `hipcub`. ROCmPort AI passes this through. The translation is mechanically correct.

**The limitation**: hipCUB API coverage is not 1:1 with CUB. Some primitives behave differently under ROCm, and performance characteristics differ significantly due to wavefront width. ROCmPort AI does not currently benchmark library calls against rocPRIM equivalents.

**What ROCmPort AI does**: flags the library dependency in the static scan, marks it HIGH risk, and recommends manual review by a ROCm-experienced engineer.

**What ROCmPort AI does not do**: guarantee correctness or performance parity for library-heavy code without human validation.

**Fix requirement**: Manual comparison of CUB vs hipCUB primitive behavior for the specific use case, or replacement with rocPRIM equivalents.

## Failure Case: Flash Attention — Warp Shuffle Intrinsics

**Kernel**: Simplified Flash Attention forward pass (Dao et al. 2022 style)
**File**: backend/demo_kernels/flash_attention_simplified.cu

**Bugs detected by ROCmPort AI static scan**:
- `__shfl_down` with implicit warp-32 offset=16 — on AMD wavefront-64, 
  the final reduction should use offset=32 first
- Softmax reduction terminates at 16 lanes — silently wrong on gfx942

**What hipify does**: renames cudaFree to hipFree, cuda headers to hip headers. 
Does NOT fix the shuffle semantics.

**What ROCmPort AI does**: flags `__shfl_sync` family calls as CRITICAL risk,
and flags unsuffixed `__shfl_down(..., 16)` style reductions as HIGH risk.
It identifies the offset=16 assumption and suggests a wavefront-64 aware rewrite.

**Status**: Compiled and executed on AMD Instinct MI300X (gfx942), ROCm 7.2.
Numerical correctness not verified — requires reference CPU implementation.

**Fix required**: Replace `__shfl_down(x, 16)` with two-stage reduction:
  `__shfl_down(x, 32)` then `__shfl_down(x, 16)` for wavefront-64.