Q4 with CUDA
Hi everyone 👋
Has anyone been able to run Q4 with CUDA? It looks like GatherBlockQuantized does not work with CUDA for now. The F16 version works, but it would be nice to compare it with Q4 and Q8 on the GPU. Besides that, Q8 seems to be much slower than Q4 on the CPU for some reason.
Any advice is appreciated. Thank you!
Hey, I got the same error, and when I asked Gemini, it said: "This is a known issue with onnxruntime-gpu when handling specific quantized operators like GatherBlockQuantized (used in your embed_tokens model). The CUDA kernel configuration for this specific operator sometimes calculates invalid block/grid dimensions depending on your specific GPU or driver version, causing the crash."
Try forcing embed_tokens_session onto the CPU; that worked for me. Also, with the quantized (q8) model I got very garbled generated speech. I think it depends on each person's CPU and GPU; some hardware might handle q4 better than q8, I guess. Here is the code for the session part:
import onnxruntime

# Prefer CUDA, fall back to CPU for all sessions except embed_tokens.
providers = ["CUDAExecutionProvider", "CPUExecutionProvider"]
speech_encoder_session = onnxruntime.InferenceSession(speech_encoder_path, providers=providers)
# GatherBlockQuantized crashes on CUDA, so pin embed_tokens to the CPU provider only.
embed_tokens_session = onnxruntime.InferenceSession(embed_tokens_path, providers=["CPUExecutionProvider"])
language_model_session = onnxruntime.InferenceSession(language_model_path, providers=providers)
cond_decoder_session = onnxruntime.InferenceSession(conditional_decoder_path, providers=providers)