AMD R9700 - vLLM crashes on startup

#6
by jmander11 - opened

Hello,

Does anyone have a solution for loading this model in vLLM with ROCm? I'm hitting KeyError: 'layers.0.attn.qkv_proj.output_scale'.
I have only gotten GGUF models working with vLLM, but I really want to test out an FP8 model to see what the R9700 can do. I am using the rocm7.2_navi_ubuntu24.04_py3.12_pytorch_2.9_vllm_0.14.0rc0 docker image.

Thanks!

Edit:
I attempted to follow the deployment tips, which point to two pull requests, but they don't include instructions for applying them. I now get:
alar_t = __hip_bfloat16, THRDS = 64, YTILE = 4, WvPrGrp = 16, A_CHUNK = 8, UNRL = 1, N = 4]: Device-side assertion `false' failed.
GPU coredump: execvp failed: No such file or directory
Failed to write segment data to pipe: Broken pipe
GPU coredump: handler exited with error (status: 1)
GPU core dump failed
:0:rocdevice.cpp :3586: 697995489480 us: Callback: Queue 0x7d42b4200000 aborting with error : HSA_STATUS_ERROR_EXCEPTION: An HSAIL operation resulted in a hardware exception. code: 0x1016

It's pretty disappointing that the deployment instructions don't give any guidance whatsoever :(

jmander11 changed discussion title from AMD R9700 - KeyError: 'layers.0.attn.qkv_proj.output_scale' to AMD R9700 - vLLM crashes on startup
AMD org

Please make sure PR #29008 is reflected in your working vLLM branch. Thanks.

XuebinWang changed discussion status to closed

@XuebinWang What is the suggested method for doing this? Applying that PR is what caused the exception on startup above, so I am not sure how it should be applied. It also upgrades my vLLM to a 0.15 version. Is this not the way to do it?

Do I clone the repo inside the official Navi Docker image and build from that? Which repo/branch do I apply the fixes to, and with what merge strategy?
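For reference, this is roughly what I tried — a sketch only, assuming the upstream vllm-project/vllm GitHub repo and GitHub's read-only pull refs; the branch name pr-29008 is just a local label I picked:

```shell
# Sketch: bring vLLM PR #29008 into a local checkout via GitHub's pull refs.
# Assumes network access and that the PR still exists upstream.
git clone https://github.com/vllm-project/vllm.git
cd vllm
git fetch origin pull/29008/head:pr-29008   # GitHub exposes every PR at refs/pull/<id>/head
git merge pr-29008                          # or rebase it onto the tag my Docker image ships
pip install -e .                            # rebuild vLLM in place inside the container
```

The merge step is where I'm unsure: merging onto main pulls me forward to 0.15, while cherry-picking onto the image's 0.14.0rc0 tag might conflict.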
