[ISSUE] W4A16 Kernel Selection Failure on Ampere A100 with TP > 1
#1
by brownyeyez - opened
I am encountering a critical issue when attempting to run a quantized model (W4A16) on NVIDIA A100 (Ampere, CC 80) using vLLM with Tensor Parallelism (TP) enabled (TP > 1). The system fails to find a suitable kernel to implement the linear layer, and manual overrides lead to corrupted outputs.
Environment
- GPU: NVIDIA A100 (Compute Capability 80)
- Framework: vLLM
- Quantization: W4A16 (GPTQ/AWQ)
- Configuration: Tensor Parallelism (TP) > 1
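For reference, a launch along these lines reproduces the failure (the model path and TP degree here are placeholders, not the exact values used; flags are standard vLLM engine arguments):

```shell
# Hypothetical reproduction command; substitute the actual W4A16 checkpoint.
vllm serve ./my-w4a16-model \
    --tensor-parallel-size 4 \
    --quantization gptq
```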
The Error Log
When launching the model, the `multiproc_executor` worker raises the following `ValueError`:
```
(Worker pid=765658) (Worker_TP3 pid=765658) ERROR 03-04 08:48:38 [multiproc_executor.py:800] ValueError: Failed to find a kernel that can implement the WNA16 linear layer. Reasons:
(Worker pid=765658) (Worker_TP3 pid=765658) ERROR 03-04 08:48:38 [multiproc_executor.py:800] CutlassW4A8LinearKernel requires capability 90, current compute capability is 80
(Worker pid=765658) (Worker_TP3 pid=765658) ERROR 03-04 08:48:38 [multiproc_executor.py:800] MacheteLinearKernel requires capability 90, current compute capability is 80
(Worker pid=765658) (Worker_TP3 pid=765658) ERROR 03-04 08:48:38 [multiproc_executor.py:800] AllSparkLinearKernel cannot implement due to: For Ampere GPU, AllSpark does not support group_size = 128. Only group_size = -1 are supported.
(Worker pid=765658) (Worker_TP3 pid=765658) ERROR 03-04 08:48:38 [multiproc_executor.py:800] MarlinLinearKernel cannot implement due to: Weight output_size_per_partition = 16 is not divisible by min_thread_n = 64. Consider reducing tensor_parallel_size or running with --quantization gptq.
(Worker pid=765658) (Worker_TP3 pid=765658) ERROR 03-04 08:48:38 [multiproc_executor.py:800] ConchLinearKernel cannot implement due to: conch-triton-kernels is not installed, please install it via `pip install conch-triton-kernels` and try again!
(Worker pid=765658) (Worker_TP3 pid=765658) ERROR 03-04 08:48:38 [multiproc_executor.py:800] ExllamaLinearKernel cannot implement due to: Exllama only supports float16 activations
```
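A minimal sketch of the arithmetic behind the Marlin rejection (this is illustrative, not vLLM's actual code): with TP > 1, a column-parallel linear layer's output dimension is split across ranks, and Marlin requires each shard's output size to be a multiple of `min_thread_n = 64`. The 64-wide layer below is a hypothetical example consistent with the reported `output_size_per_partition = 16` at TP = 4.

```python
MIN_THREAD_N = 64  # Marlin's constraint, per the error log


def marlin_shard_ok(output_size: int, tp_size: int) -> bool:
    """Check whether each TP shard's output size is divisible by min_thread_n."""
    per_partition = output_size // tp_size
    return per_partition % MIN_THREAD_N == 0


# A narrow 64-wide projection split across TP=4 leaves 16 columns per rank:
assert not marlin_shard_ok(64, 4)   # per-partition = 16 -> rejected
assert marlin_shard_ok(4096, 4)     # per-partition = 1024 -> accepted
```

This is why the log suggests reducing `tensor_parallel_size`: a smaller TP degree makes each shard wider, which can restore divisibility by 64.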
Attempted Workarounds & Results
- Exllama kernel with `--dtype float16`:
  - Result: The model loads without the "kernel not found" error.
  - Issue: The model generates gibberish/garbled text. Forcing the Exllama kernel with TP > 1 on Ampere appears to cause weight misalignment or numerical instability.
- `conch-triton-kernels` (v1.3):
  - Result: Installation succeeded and the kernel was recognized.
  - Issue: As with Exllama, the output remains corrupted.
Question
Is there a recommended way to run this W4A16 quantized model on A100 with TP > 1 without triggering these kernel failures or output corruption?
- Is there a flag to relax the `min_thread_n` constraint in Marlin?
- Are there specific versions of `conch-triton-kernels` or vLLM that stabilize this behavior on Ampere?
- Should we avoid certain `group_size` configurations during quantization to remain compatible with AllSpark?
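On the last point, the error log itself gives the rule: AllSpark on Ampere only accepts channelwise quantization (`group_size = -1`), so a checkpoint quantized with `group_size = 128` can never select it there. A trivial illustrative check (not an AllSpark or vLLM API):

```python
def allspark_ampere_ok(group_size: int) -> bool:
    """True if AllSpark on an Ampere GPU accepts this quantization group size,
    per the constraint quoted in the error log."""
    return group_size == -1


assert allspark_ampere_ok(-1)       # channelwise: accepted
assert not allspark_ampere_ok(128)  # the reported config: rejected
```

If re-quantizing is an option, producing a channelwise (`group_size = -1`) checkpoint would make AllSpark a candidate kernel on this hardware, though whether it then behaves correctly under TP > 1 is exactly what this issue is asking.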