vLLM fails to serve Intel/GLM-5-int4-mixed-AutoRound on NVIDIA DGX Spark (GB10, sm121) due to no valid MLA attention backend (qk_nope_head_dim 192)

#2
by oliverjohnwilson - opened

First off, thank you for taking the time to quantize this model. I am excited to get Intel/GLM-5-int4-mixed-AutoRound served via vLLM on my 4x DGX Spark cluster; however, I have not been successful so far and am seeking help/guidance.

I am trying to serve Intel/GLM-5-int4-mixed-AutoRound with vLLM on a 4-node NVIDIA DGX Spark cluster (GB10). The cluster is correctly configured for multi-node tensor parallelism (Ray + NCCL over RoCE), and other models run fine with -tp 4. However, GLM-5-int4-mixed-AutoRound fails during model initialization with “No valid attention backend found”.

Hardware

  • 4x NVIDIA DGX Spark Founders Edition (GB10 Grace Blackwell)
  • 1 GPU per node, total 4 GPUs across 4 nodes
  • GPU compute capability reported by runtime: 12.1 (sm121)
  • Interconnect: ConnectX-7 RoCE, switched fabric at 200G, MTU 4200 (RoCE active_mtu 4096)
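For reference, the sm121 tag above comes from the CUDA compute capability tuple (12, 1). A minimal sketch of how that tag is derived (the sm_tag helper is my own, not a vLLM or PyTorch API):

```python
def sm_tag(major: int, minor: int) -> str:
    """Format a CUDA compute capability tuple as an 'smNN' tag."""
    return f"sm{major}{minor}"

# With PyTorch available, the capability of device 0 can be read as:
#   import torch
#   major, minor = torch.cuda.get_device_capability(0)
# On a GB10 DGX Spark the runtime reports (12, 1):
print(sm_tag(12, 1))  # -> sm121
```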

Software versions (container)

  • vLLM: 0.16.1rc1.dev160+g6521ccf28.d20260303
  • transformers: 5.3.0.dev0 (installed from source)
  • NCCL: vLLM reports nccl==2.29.2
  • Base container: nvcr.io/nvidia/pytorch:26.01-py3
  • FlashInfer built from source / wheels (can provide exact version if needed)

Serving command:

vllm serve "<local_snapshot_path>" \
  --served-model-name "Intel/GLM-5-int4-mixed-AutoRound" \
  --distributed-executor-backend ray \
  --gpu-memory-utilization 0.85 \
  --tensor-parallel-size 4 \
  --max-model-len 32768 \
  --enable-auto-tool-choice \
  --tool-call-parser glm47 \
  --reasoning-parser glm45 \
  --host 0.0.0.0 \
  --port 8000

Error:

ValueError: No valid attention backend found for cuda with AttentionSelectorConfig(head_size=576, dtype=torch.bfloat16, kv_cache_dtype=auto, block_size=None, use_mla=True, has_sink=False, use_sparse=True, use_mm_prefix=False, use_per_head_quant_scales=False, attn_type=AttentionType.DECODER). Reasons: {FLASH_ATTN_MLA: [sparse not supported, compute capability not supported, FlashAttention MLA not supported on this device], FLASHMLA: [sparse not supported, compute capability not supported, vllm._flashmla_C is not available, likely was not compiled due to insufficient nvcc version or a supported arch was not in the list of target arches to compile for.], FLASHINFER_MLA: [sparse not supported, compute capability not supported, FlashInfer MLA kernel requires qk_nope_head_dim == 128, but got 192], TRITON_MLA: [sparse not supported], FLASHMLA_SPARSE: [compute capability not supported]}. [repeated 3x across cluster]

Error, but more readable:

ValueError: No valid attention backend found for CUDA

AttentionSelectorConfig:
  head_size: 576
  dtype: torch.bfloat16
  kv_cache_dtype: auto
  block_size: None
  use_mla: true
  has_sink: false
  use_sparse: true
  use_mm_prefix: false
  use_per_head_quant_scales: false
  attn_type: AttentionType.DECODER

Backends tried (and why each failed):

  1) FLASH_ATTN_MLA
     - sparse not supported
     - compute capability not supported
     - FlashAttention MLA not supported on this device

  2) FLASHMLA
     - sparse not supported
     - compute capability not supported
     - vllm._flashmla_C is not available
       (likely not compiled due to insufficient NVCC version
        or target GPU arch not included in compilation)

  3) FLASHINFER_MLA
     - sparse not supported
     - compute capability not supported
     - FlashInfer MLA kernel constraint violated:
       requires qk_nope_head_dim == 128, but got 192

  4) TRITON_MLA
     - sparse not supported

  5) FLASHMLA_SPARSE
     - compute capability not supported

Cluster note:
  - repeated 3x across cluster

Questions / requested guidance

  1. Is Intel/GLM-5-int4-mixed-AutoRound validated on NVIDIA Blackwell sm121 (DGX Spark),
    or only on other Blackwell variants (e.g. sm120)?
  2. If not yet supported, can you provide:
    • a recommended workaround (e.g. disable sparse MLA if safe), or
    • a patch/mod/PR to vLLM/FlashInfer to support this model’s MLA dimensions?
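For context on the first workaround: vLLM honors a VLLM_ATTENTION_BACKEND environment variable to pin a specific backend. As the error shows, every MLA backend currently rejects this model (TRITON_MLA only because of sparse MLA), so pinning alone does not help today; the sketch below is only what such a pin would look like if sparse MLA could be disabled safely.

```shell
# Hypothetical: pin the Triton MLA backend. Only viable if vLLM gains a way
# to run this model without sparse MLA; today TRITON_MLA still rejects it.
export VLLM_ATTENTION_BACKEND=TRITON_MLA
vllm serve "<local_snapshot_path>" \
  --tensor-parallel-size 4 \
  --max-model-len 32768
```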

Thank you in advance.

Intel org

This issue is unlikely to be related to the quantized model, since we have not quantized the attention layers. You may want to seek support from the vLLM or CUDA side for further investigation.

AutoRound is mainly for Intel devices, but we have tested it on sm100 before publishing it.
