[ISSUE] W4A16 Kernel Selection Failure on Ampere A100 with TP > 1
#1
by brownyeyez - opened
I am encountering a critical issue when attempting to run a quantized model (W4A16) on NVIDIA A100 (Ampere, CC 80) using vLLM with Tensor Parallelism (TP) enabled (TP > 1). The system fails to find a suitable kernel to implement the linear layer, and manual overrides lead to corrupted outputs.
Environment
- GPU: NVIDIA A100 (Compute Capability 80)
- Framework: vLLM
- Quantization: W4A16 (GPTQ/AWQ)
- Configuration: Tensor Parallelism (TP) > 1
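For reference, a launch along these lines reproduces the failure (the model path and TP degree here are placeholders, not the exact values used; flags are standard vLLM engine arguments):

```shell
# Hypothetical reproduction command; substitute the actual W4A16 checkpoint.
vllm serve ./my-w4a16-model \
    --tensor-parallel-size 4 \
    --quantization gptq
```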
The Error Log
When launching the model, the `multiproc_executor` worker raises the following `ValueError`:
```
(Worker pid=765658) (Worker_TP3 pid=765658) ERROR 03-04 08:48:38 [multiproc_executor.py:800] ValueError: Failed to find a kernel that can implement the WNA16 linear layer. Reasons:
(Worker pid=765658) (Worker_TP3 pid=765658) ERROR 03-04 08:48:38 [multiproc_executor.py:800] CutlassW4A8LinearKernel requires capability 90, current compute capability is 80
(Worker pid=765658) (Worker_TP3 pid=765658) ERROR 03-04 08:48:38 [multiproc_executor.py:800] MacheteLinearKernel requires capability 90, current compute capability is 80
(Worker pid=765658) (Worker_TP3 pid=765658) ERROR 03-04 08:48:38 [multiproc_executor.py:800] AllSparkLinearKernel cannot implement due to: For Ampere GPU, AllSpark does not support group_size = 128. Only group_size = -1 are supported.
(Worker pid=765658) (Worker_TP3 pid=765658) ERROR 03-04 08:48:38 [multiproc_executor.py:800] MarlinLinearKernel cannot implement due to: Weight output_size_per_partition = 16 is not divisible by min_thread_n = 64. Consider reducing tensor_parallel_size or running with --quantization gptq.
(Worker pid=765658) (Worker_TP3 pid=765658) ERROR 03-04 08:48:38 [multiproc_executor.py:800] ConchLinearKernel cannot implement due to: conch-triton-kernels is not installed, please install it via `pip install conch-triton-kernels` and try again!
(Worker pid=765658) (Worker_TP3 pid=765658) ERROR 03-04 08:48:38 [multiproc_executor.py:800] ExllamaLinearKernel cannot implement due to: Exllama only supports float16 activations
```
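A minimal sketch of the arithmetic behind the Marlin rejection (this is illustrative, not vLLM's actual code): with TP > 1, a column-parallel linear layer's output dimension is split across ranks, and Marlin requires each shard's output size to be a multiple of `min_thread_n = 64`. The 64-wide layer below is a hypothetical example consistent with the reported `output_size_per_partition = 16` at TP = 4.

```python
MIN_THREAD_N = 64  # Marlin's constraint, per the error log


def marlin_shard_ok(output_size: int, tp_size: int) -> bool:
    """Check whether each TP shard's output size is divisible by min_thread_n."""
    per_partition = output_size // tp_size
    return per_partition % MIN_THREAD_N == 0


# A narrow 64-wide projection split across TP=4 leaves 16 columns per rank:
assert not marlin_shard_ok(64, 4)   # per-partition = 16 -> rejected
assert marlin_shard_ok(4096, 4)     # per-partition = 1024 -> accepted
```

This is why the log suggests reducing `tensor_parallel_size`: a smaller TP degree makes each shard wider, which can restore divisibility by 64.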
Attempted Workarounds & Results
- Exllama kernel with `--dtype float16`:
  - Result: The model loads without the "kernel not found" error.
  - Issue: The model generates gibberish/garbled text. Forcing the Exllama kernel with TP > 1 on Ampere appears to cause weight misalignment or numerical instability.
- `conch-triton-kernels` (v1.3):
  - Result: Installation succeeded and the kernel was recognized.
  - Issue: As with Exllama, the output remains corrupted.
Question
Is there a recommended way to run this W4A16 quantized model on A100 with TP > 1 without triggering these kernel failures or output corruption?
- Is there a flag to relax the `min_thread_n` constraint in Marlin?
- Are there specific versions of `conch-triton-kernels` or vLLM that stabilize this behavior on Ampere?
- Should we avoid certain `group_size` configurations during quantization to remain compatible with AllSpark?
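On the last point, the error log itself gives the rule: AllSpark on Ampere only accepts channelwise quantization (`group_size = -1`), so a checkpoint quantized with `group_size = 128` can never select it there. A trivial illustrative check (not an AllSpark or vLLM API):

```python
def allspark_ampere_ok(group_size: int) -> bool:
    """True if AllSpark on an Ampere GPU accepts this quantization group size,
    per the constraint quoted in the error log."""
    return group_size == -1


assert allspark_ampere_ok(-1)       # channelwise: accepted
assert not allspark_ampere_ok(128)  # the reported config: rejected
```

If re-quantizing is an option, producing a channelwise (`group_size = -1`) checkpoint would make AllSpark a candidate kernel on this hardware, though whether it then behaves correctly under TP > 1 is exactly what this issue is asking.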