[ISSUE] W4A16 Kernel Selection Failure on Ampere A100 with TP > 1

#1
by brownyeyez - opened

I am hitting a blocking issue when running a W4A16-quantized model on an NVIDIA A100 (Ampere, compute capability 8.0) with vLLM and tensor parallelism enabled (TP > 1). vLLM fails to find a kernel that can implement the WNA16 linear layer, and the manual overrides I have tried all produce corrupted output.

Environment

  • GPU: NVIDIA A100 (compute capability 8.0)
  • Framework: vLLM
  • Quantization: W4A16 (GPTQ/AWQ)
  • Configuration: Tensor Parallelism (TP) > 1

The Error Log

When launching the model, the multiproc_executor throws the following ValueError:

(Worker pid=765658) (Worker_TP3 pid=765658) ERROR 03-04 08:48:38 [multiproc_executor.py:800] ValueError: Failed to find a kernel that can implement the WNA16 linear layer. Reasons: 
(Worker pid=765658) (Worker_TP3 pid=765658) ERROR 03-04 08:48:38 [multiproc_executor.py:800] CutlassW4A8LinearKernel requires capability 90, current compute  capability is 80
(Worker pid=765658) (Worker_TP3 pid=765658) ERROR 03-04 08:48:38 [multiproc_executor.py:800] MacheteLinearKernel requires capability 90, current compute  capability is 80
(Worker pid=765658) (Worker_TP3 pid=765658) ERROR 03-04 08:48:38 [multiproc_executor.py:800]  AllSparkLinearKernel cannot implement due to: For Ampere GPU, AllSpark does not support group_size = 128. Only group_size = -1 are supported.
(Worker pid=765658) (Worker_TP3 pid=765658) ERROR 03-04 08:48:38 [multiproc_executor.py:800]  MarlinLinearKernel cannot implement due to: Weight output_size_per_partition = 16 is not divisible by  min_thread_n = 64. Consider reducing tensor_parallel_size or running with --quantization gptq.
(Worker pid=765658) (Worker_TP3 pid=765658) ERROR 03-04 08:48:38 [multiproc_executor.py:800]  ConchLinearKernel cannot implement due to: conch-triton-kernels is not installed, please install it via `pip install conch-triton-kernels` and try again!
(Worker pid=765658) (Worker_TP3 pid=765658) ERROR 03-04 08:48:38 [multiproc_executor.py:800]  ExllamaLinearKernel cannot implement due to: Exllama only supports float16 activations
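For context, the MarlinLinearKernel rejection above is a pure shape constraint: after the weight matrix is sharded across TP ranks, the per-rank output dimension must remain divisible by min_thread_n = 64, and the log shows a partition of only 16. A minimal sketch of that check (the constant comes from the error message; this is an illustration, not vLLM's actual code):

```python
# Illustration of the shape check behind the MarlinLinearKernel rejection.
# Per the log, min_thread_n = 64 and output_size_per_partition = 16.

MIN_THREAD_N = 64  # value reported in the error message


def marlin_shape_ok(output_size: int, tp_size: int) -> bool:
    """True if the output dim, sharded over tp_size ranks, divides by 64."""
    per_partition, rem = divmod(output_size, tp_size)
    return rem == 0 and per_partition % MIN_THREAD_N == 0


# A 64-wide output split over 4 ranks leaves 16 per rank -> rejected:
print(marlin_shape_ok(64, 4))    # False
# A 4096-wide output over 4 ranks leaves 1024 per rank -> accepted:
print(marlin_shape_ok(4096, 4))  # True
```

This is consistent with the log's own suggestion: reducing tensor_parallel_size makes each partition wider, which can restore divisibility.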

Attempted Workarounds & Results

  1. Exllama kernel with --dtype float16:
  • Result: The model loads without the "kernel not found" error.
  • Issue: The model generates gibberish/garbled text. Forcing the Exllama kernel with TP > 1 on Ampere appears to cause weight misalignment or numerical instability.
  2. conch-triton-kernels (v1.3):
  • Result: Installation succeeded and the ConchLinearKernel was selected.
  • Issue: As with Exllama, the output remains corrupted.

Question

Is there a recommended way to run this W4A16 quantized model on A100 with TP > 1 without triggering these kernel failures or output corruption?

  • Is there a flag to relax the min_thread_n constraint in Marlin?
  • Are there specific versions of conch-triton-kernels or vLLM that stabilize this behavior for Ampere?
  • Should we avoid certain group_size configurations during quantization to remain compatible with AllSpark?
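In the meantime, one way to sidestep at least the Marlin rejection may be to pick a TP degree under which every linear layer's sharded output dimension stays divisible by 64. A rough helper, assuming you have collected the per-layer output sizes from the model config (the sizes below are illustrative placeholders, not this model's real dimensions):

```python
MIN_THREAD_N = 64  # Marlin's minimum tile width, per the error log


def usable_tp_sizes(output_sizes, max_tp=8):
    """TP degrees under which every sharded output dim stays divisible by 64."""
    return [
        tp for tp in range(1, max_tp + 1)
        if all(n % tp == 0 and (n // tp) % MIN_THREAD_N == 0
               for n in output_sizes)
    ]


# Illustrative per-layer output sizes (read these from your model's config):
sizes = [4096, 11008, 1024]
print(usable_tp_sizes(sizes))  # [1, 2, 4]
```

With sizes like these, TP = 8 would leave a layer whose partition is not a multiple of 64, which is exactly the failure mode in the log.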
