Q4 with CUDA
Hi everyone 👋
Has anyone been able to run Q4 with CUDA? It looks like GatherBlockQuantized does not work with CUDA for now. The F16 version works, but it would be nice to compare it with Q4 and Q8 on the GPU. Besides that, Q8 seems to be much slower than Q4 on the CPU for some reason.
Any advice is appreciated. Thank you!
Hey, I got the same error, and when I asked Gemini, it said: "This is a known issue with onnxruntime-gpu when handling specific quantized operators like GatherBlockQuantized (used in your embed_tokens model). The CUDA kernel configuration for this specific operator sometimes calculates invalid block/grid dimensions depending on your specific GPU or driver version, causing the crash."
Try forcing embed_tokens_session onto the CPU; that worked for me. Also, with the quantized (q8) model I got very garbled generated speech. I think it depends on each person's CPU and GPU; some hardware might handle q4 better than q8, I guess. Here is the code for the session part:
import onnxruntime

# Prefer CUDA, fall back to CPU for all sessions except embed_tokens.
providers = ["CUDAExecutionProvider", "CPUExecutionProvider"]
speech_encoder_session = onnxruntime.InferenceSession(speech_encoder_path, providers=providers)
# GatherBlockQuantized crashes on CUDA, so pin embed_tokens to the CPU provider only.
embed_tokens_session = onnxruntime.InferenceSession(embed_tokens_path, providers=["CPUExecutionProvider"])
language_model_session = onnxruntime.InferenceSession(language_model_path, providers=providers)
cond_decoder_session = onnxruntime.InferenceSession(conditional_decoder_path, providers=providers)