Couldn't get it to work...

#1
by kbuettner - opened

...but I didn't really expect to either. Here's my report:

Tested on: Dual NVIDIA RTX Pro 6000 Blackwell (2×96GB), vLLM v0.16.0rc2.dev69+gc4b9e6778, Fedora, Python 3.12

Command:

vllm serve Firworks/Step-3.5-Flash-nvfp4 -tp 2 --max-num-seqs 1 --gpu-memory-utilization 0.96 --trust_remote_code --max-model-len 32000

Result: Fails during model load with NotImplementedError: No NvFp4 MoE backend supports the deployment configuration, raised from vllm/model_executor/layers/fused_moe/oracle/nvfp4.py. The root cause is that Step 3.5's SharedFusedMoE (shared-expert) architecture has no compatible NVFP4 MoE kernel backend in vLLM.

Additional options tested (all failed with the same error):

  • VLLM_NVFP4_MOE_BACKEND=FLASHINFER_CUTLASS
  • VLLM_NVFP4_MOE_BACKEND=VLLM_CUTLASS
  • --quantization compressed-tensors (explicit)

Conclusion: This appears to require new code in vLLM to support NVFP4 quantization for SharedFusedMoE layers, not just a configuration change. The non-MoE NVFP4 linear layers resolve fine (FLASHINFER_CUTLASS), but the shared expert MoE variant has no supported backend.

I was able to get it loaded, but it just produces garbage. Something is likely broken in the quant too. I'll try making one myself later to see if it behaves the same.

Changes:

  • vllm/model_executor/layers/fused_moe/cutlass_moe.py — Added swiglustep to the CUTLASS FP4 activation allowlist. The fallback path (non-fused apply_moe_activation + separate fp4 quant) already handles it.

  • vllm/model_executor/layers/fused_moe/fused_marlin_moe.py — Same for Marlin backend.

  • vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors_moe.py — The apply() method had a blanket assert layer.activation == "silu" that blocked every non-silu activation for every backend. Moved it into the FlashInfer TRTLLM branch only, since the modular kernel path (VLLM_CUTLASS, MARLIN) already validates activation support internally.

  • vllm/model_executor/models/step3p5.py — The weight loader iterates loaded_weight[expert_id] for each expert, but compressed-tensors NVFP4 stores weight_global_scale and input_global_scale as per-tensor with shape [1], not [num_experts, ...]. Added expand() to broadcast shape [1] to [num_experts, ...] before the loop.
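The compressed_tensors_moe.py change above amounts to scoping the activation check to the one backend that needs it. A minimal sketch of that pattern, with hypothetical function and backend names (not vLLM's actual API):

```python
# Sketch: move a silu-only assert out of the shared apply() path and
# into the one branch that requires it. apply_moe and the backend
# strings are illustrative, not vLLM's real signatures.
def apply_moe(activation: str, backend: str) -> str:
    if backend == "flashinfer_trtllm":
        # Only the FlashInfer TRTLLM kernel is silu-only, so the
        # assert lives here instead of guarding every backend.
        assert activation == "silu", "TRTLLM MoE kernel supports silu only"
        return f"trtllm:{activation}"
    # Modular kernel backends (VLLM_CUTLASS, MARLIN) validate their own
    # activation support, so swiglustep passes through this path.
    return f"{backend}:{activation}"
```

With the assert scoped like this, the CUTLASS and Marlin paths can accept swiglustep while TRTLLM still rejects anything but silu.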
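The step3p5.py shape fix, as a standalone sketch (the scale value and num_experts are made up; the real loader indexes actual checkpoint tensors):

```python
import torch

num_experts = 4

# compressed-tensors NVFP4 stores the global scales per-tensor: shape [1]
weight_global_scale = torch.tensor([0.0123])

# The expert loop does loaded_weight[expert_id], which only works for
# expert_id == 0 on a [1]-shaped tensor. expand() broadcasts the single
# scale to [num_experts, 1] as a view, without copying data.
if weight_global_scale.dim() == 1 and weight_global_scale.shape[0] == 1:
    weight_global_scale = weight_global_scale.expand(num_experts, -1)

# Now every expert sees the same per-tensor scale.
per_expert = [weight_global_scale[e] for e in range(num_experts)]
```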
