Spaces:

lablab-ai-amd-developer-hackathon
/

ROCmPort-AI

Running

App Files Files Community

ROCmPort-AI / docs /FAILURE_CASES.md

tazwarrrr

docs fix

e7a1a69 13 days ago

preview code

raw

history blame contribute delete

3.71 kB

	# Failure Cases

	This document records known failure modes with reproducible context.

	## FC-001: Inline PTX in CUDA Kernel

	### Why this matters
	Kernels that embed inline PTX are a realistic migration boundary. hipify can translate CUDA APIs, but it cannot preserve NVIDIA-specific assembly semantics on AMD.

	### Original CUDA pattern (simplified)
	```cpp
	__device__ __forceinline__ unsigned lane_id() {
	unsigned lane;
	asm volatile("mov.u32 %0, %%laneid;" : "=r"(lane));
	return lane;
	}
	```

	### Typical migration output
	- CUDA runtime calls are translated.
	- Inline PTX block is left unchanged or translated into invalid code for HIP compilation.

	### Observed failure mode
	- Compile error under hipcc due to unsupported PTX instruction syntax.
	- In some cases, compile succeeds after manual edits but semantics differ because lane behavior assumptions are NVIDIA-specific.

	### Root cause
	- Inline PTX is vendor-specific and outside mechanical translation scope.
	- Warp-level assumptions in PTX often rely on 32-lane behavior and NVIDIA ISA details.

	### What is required to fix
	1. Replace inline PTX with HIP or portable intrinsics.
	2. Rework lane-level logic for wavefront-64 behavior where required.
	3. Add correctness tests for edge lanes and reduction boundaries.
	4. Re-profile after rewrite to confirm no occupancy regressions.

	### Trust note
	This is a deliberate example of where ROCmPort AI should report risk, not pretend full automation.

	## Failure Case: Library-Heavy CUDA Code (CUB, Thrust, cuDNN)

	Input type: CUDA kernels that call into CUB, Thrust, or cuDNN directly

	Example pattern:
	```cpp
	#include <cub/cub.cuh>
	cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items);
	```

	What happens: hipify-clang renames the include to `<hipcub/hipcub.hpp>` and the namespace to `hipcub`. ROCmPort AI passes this through. The translation is mechanically correct.

	The limitation: hipCUB API coverage is not 1:1 with CUB. Some primitives behave differently under ROCm, and performance characteristics differ significantly due to wavefront width. ROCmPort AI does not currently benchmark library calls against rocPRIM equivalents.

	What ROCmPort AI does: flags the library dependency in the static scan, marks it HIGH risk, and recommends manual review by a ROCm-experienced engineer.

	What ROCmPort AI does not do: guarantee correctness or performance parity for library-heavy code without human validation.

	Fix requirement: Manual comparison of CUB vs hipCUB primitive behavior for the specific use case, or replacement with rocPRIM equivalents.

	## Failure Case: Flash Attention — Warp Shuffle Intrinsics

	Kernel: Simplified Flash Attention forward pass (Dao et al. 2022 style)
	File: backend/demo_kernels/flash_attention_simplified.cu

	Bugs detected by ROCmPort AI static scan:
	- `__shfl_down` with implicit warp-32 offset=16 — on AMD wavefront-64,
	the final reduction should use offset=32 first
	- Softmax reduction terminates at 16 lanes — silently wrong on gfx942

	What hipify does: renames cudaFree to hipFree, cuda headers to hip headers.
	Does NOT fix the shuffle semantics.

	What ROCmPort AI does: flags `__shfl_sync` family calls as CRITICAL risk,
	and flags unsuffixed `__shfl_down(..., 16)` style reductions as HIGH risk.
	It identifies the offset=16 assumption and suggests a wavefront-64 aware rewrite.

	Status: Compiled and executed on AMD Instinct MI300X (gfx942), ROCm 7.2.
	Numerical correctness not verified — requires reference CPU implementation.

	Fix required: Replace `__shfl_down(x, 16)` with two-stage reduction:
	`__shfl_down(x, 32)` then `__shfl_down(x, 16)` for wavefront-64.