Model startup using vLLM failed.

#5 opened by beausoft

I followed the vLLM installation method provided in the documentation:

# install vllm
pip install vllm==0.11.2
# install deep_gemm
git clone https://github.com/deepseek-ai/DeepGEMM.git
cd DeepGEMM/third-party
git clone https://github.com/NVIDIA/cutlass.git
git clone https://github.com/fmtlib/fmt.git
cd ../
git checkout v2.1.1.post3
pip install . --no-build-isolation

An error occurred when starting vLLM:

ValueError: No valid attention backend found for cuda with head_size: 576, dtype: torch.bfloat16, kv_cache_dtype: auto, block_size: 64, use_mla: True, has_sink: False, use_sparse: True. Reasons: {FLASH_ATTN_MLA: [sparse not supported, compute capability not supported, FlashAttention MLA not supported on this device], FLASHMLA: [sparse not supported, compute capability not supported, FlashMLA Sparse is only supported on Hopper and Blackwell devices.], FLASHINFER_MLA: [sparse not supported, compute capability not supported], TRITON_MLA: [sparse not supported], FLASHMLA_SPARSE: [compute capability not supported]}.

When using vLLM v0.13.0, the following error occurred during startup:

ValueError: No valid attention backend found for cuda with AttentionSelectorConfig(head_size=576, dtype=torch.bfloat16, kv_cache_dtype=auto, block_size=None, use_mla=True, has_sink=False, use_sparse=True, use_mm_prefix=False, attn_type=decoder). Reasons: {FLASH_ATTN_MLA: [sparse not supported, compute capability not supported, FlashAttention MLA not supported on this device], FLASHMLA: [sparse not supported, compute capability not supported, FlashMLA Sparse is only supported on Hopper and Blackwell devices.], FLASHINFER_MLA: [sparse not supported, compute capability not supported], TRITON_MLA: [sparse not supported], FLASHMLA_SPARSE: [compute capability not supported]}.
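
For context, the repeated "compute capability not supported" reasons indicate that the GPU architecture itself is being rejected rather than a broken install. A quick way to see which architecture is present (a small check added here for illustration; it only uses the PyTorch that vLLM already depends on):

python -c "import torch; print(torch.cuda.get_device_name(0), torch.cuda.get_device_capability(0))"
# Hopper (H100/H200/H800) reports (9, 0) and data-center Blackwell (B200) reports (10, 0);
# Ampere (A100) reports (8, 0) and RTX Blackwell (RTX PRO 6000) reports (12, 0).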

The startup command for vLLM is as follows (--enable-expert-parallel and --speculative-config are both optional; with the speculative config, a throughput increase of roughly 50% is observed):

export VLLM_USE_DEEP_GEMM=0  # ATM, this line is a "must" for Hopper devices
export TORCH_ALLOW_TF32_CUBLAS_OVERRIDE=1
export VLLM_USE_FLASHINFER_MOE_FP16=1
export VLLM_USE_FLASHINFER_SAMPLER=0
export OMP_NUM_THREADS=4

vllm serve \
    __YOUR_PATH__/QuantTrio/DeepSeek-V3.2-AWQ \
    --served-model-name MY_MODEL_NAME \
    --enable-auto-tool-choice \
    --tool-call-parser deepseek_v31 \
    --reasoning-parser deepseek_v3 \
    --swap-space 16 \
    --max-num-seqs 32 \
    --gpu-memory-utilization 0.9 \
    --tensor-parallel-size 8 \
    --enable-expert-parallel \
    --speculative-config '{"model": "__YOUR_PATH__/QuantTrio/DeepSeek-V3.2-AWQ", "num_speculative_tokens": 1}' \
    --trust-remote-code \
    --host 0.0.0.0 \
    --port 8000
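
If the server does come up, a minimal request against vLLM's OpenAI-compatible endpoint (using the placeholder served model name from the command above) is enough to confirm it is serving:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "MY_MODEL_NAME", "messages": [{"role": "user", "content": "hello"}], "max_tokens": 16}'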

Please help me. How can I properly start it? Thank you.

What GPU device are you using?
This model uses a sparse‑attention mechanism, and for now it can only run on H‑series and B‑series cards. Older GPUs are not yet supported.

I have the same problem:
ValueError: No valid attention backend found for cuda with AttentionSelectorConfig(head_size=576, dtype=torch.bfloat16, kv_cache_dtype=auto, block_size=None, use_mla=True, has_sink=False, use_sparse=True, use_mm_prefix=False, attn_type=decoder). Reasons: {FLASH_ATTN_MLA: [sparse not supported, compute capability not supported, FlashAttention MLA not supported on this device], FLASHMLA: [sparse not supported, compute capability not supported, FlashMLA Sparse is only supported on Hopper and Blackwell devices.], FLASHINFER_MLA: [sparse not supported, compute capability not supported], TRITON_MLA: [sparse not supported], FLASHMLA_SPARSE: [compute capability not supported]}.

Using 8× RTX PRO 6000 Blackwell

Parameters:

VLLM_USE_DEEP_GEMM=1
TORCH_ALLOW_TF32_CUBLAS_OVERRIDE=1
VLLM_USE_FLASHINFER_MOE_FP16=1
VLLM_USE_FLASHINFER_SAMPLER=0
OMP_NUM_THREADS=4

vllm serve QuantTrio/DeepSeek-V3.2-AWQ \
    --host 192.168.xxx.yyy \
    --port 8000 \
    --enable-auto-tool-choice \
    --tool-call-parser deepseek_v31 \
    --reasoning-parser deepseek_v3 \
    --swap-space 16 \
    --max-num-seqs 32 \
    --gpu-memory-utilization 0.9 \
    --trust-remote-code \
    --served-model-name "vllm_thinkingparam" \
    --tensor-parallel-size 8 \
    --enable-expert-parallel \
    --speculative-config '{"model": "QuantTrio/DeepSeek-V3.2-AWQ", "num_speculative_tokens": 1}' \
    --max_model_len $token

QuantTrio org

Have you all tried the one from the official vLLM guide for DeepSeek-V3.2?

source .venv/bin/activate
uv pip install -U vllm --torch-backend auto
uv pip install git+https://github.com/deepseek-ai/DeepGEMM.git@v2.1.1.post3 --no-build-isolation # Other versions may also work. We recommend using the latest released version from https://github.com/deepseek-ai/DeepGEMM/releases
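
A quick post-install sanity check (assuming DeepGEMM installs as the `deep_gemm` Python package, which is my assumption rather than something stated in the guide):

python -c "import vllm, deep_gemm; print('vllm', vllm.__version__)"  # fails fast if either package is missing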

Yeah, I tried this, and it ends in the same problem:
ValueError: No valid attention backend found for cuda with AttentionSelectorConfig(head_size=576, dtype=torch.bfloat16, kv_cache_dtype=auto, block_size=None, use_mla=True, has_sink=False, use_sparse=True, use_mm_prefix=False, attn_type=decoder). Reasons: {FLASH_ATTN_MLA: [sparse not supported, compute capability not supported, FlashAttention MLA not supported on this device], FLASHMLA: [sparse not supported, compute capability not supported, FlashMLA Sparse is only supported on Hopper and Blackwell devices.], FLASHINFER_MLA: [sparse not supported, compute capability not supported], TRITON_MLA: [sparse not supported], FLASHMLA_SPARSE: [compute capability not supported]}.

Are SM120 cards (RTX Blackwell) supported? It seems to me they aren't.

QuantTrio org

Could you try editing the config.json file, changing "torch_dtype": "bfloat16" to "torch_dtype": "float16"?
Then try one more time. If that still doesn't work, then it probably really doesn't work 🥲
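
For reference, that edit can be applied in place with something like the following (the model path is the placeholder used earlier in this thread; adjust it to wherever the weights actually live, e.g. the Hugging Face cache if the model was pulled by repo id):

sed -i 's/"torch_dtype": "bfloat16"/"torch_dtype": "float16"/' __YOUR_PATH__/QuantTrio/DeepSeek-V3.2-AWQ/config.json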

Yeah, I tried it, but unfortunately a pretty similar error message occurs:
ValueError: No valid attention backend found for cuda with AttentionSelectorConfig(head_size=576, dtype=torch.float16, kv_cache_dtype=auto, block_size=None, use_mla=True, has_sink=False, use_sparse=True, use_mm_prefix=False, attn_type=decoder). Reasons: {FLASH_ATTN_MLA: [sparse not supported, compute capability not supported, FlashAttention MLA not supported on this device], FLASHMLA: [sparse not supported, compute capability not supported, FlashMLA Sparse is only supported on Hopper and Blackwell devices.], FLASHINFER_MLA: [sparse not supported, compute capability not supported], TRITON_MLA: [sparse not supported], FLASHMLA_SPARSE: [dtype not supported, compute capability not supported]}.

:(

QuantTrio org

🥲

What GPU device are you using?
This model uses a sparse‑attention mechanism, and for now it can only run on H‑series and B‑series cards. Older GPUs are not yet supported.

I'm using 8*A100.

QuantTrio org

As above, I tested with 8×A100 and encountered the same issue. We need to wait for vLLM to support Sparse Attention on the Ampere architecture.

Same issue with 4× RTX PRO 6000.
