Couldn't get it to work...

#1
by kbuettner - opened

...but I didn't really expect to either. Here's my report:

Tested on: Dual NVIDIA RTX Pro 6000 Blackwell (2×96GB), vLLM v0.16.0rc2.dev69+gc4b9e6778, Fedora, Python 3.12

Command:

vllm serve Firworks/Step-3.5-Flash-nvfp4 -tp 2 --max-num-seqs 1 --gpu-memory-utilization 0.96 --trust_remote_code --max-model-len 32000

Result: Fails during model load with NotImplementedError: No NvFp4 MoE backend supports the deployment configuration, raised from vllm/model_executor/layers/fused_moe/oracle/nvfp4.py. The root cause is that Step 3.5's SharedFusedMoE (shared-expert) architecture has no compatible NVFP4 MoE kernel backend in vLLM.

Additional options tested (all failed with the same error):

  • VLLM_NVFP4_MOE_BACKEND=FLASHINFER_CUTLASS
  • VLLM_NVFP4_MOE_BACKEND=VLLM_CUTLASS
  • --quantization compressed-tensors (explicit)

Conclusion: This appears to require new code in vLLM to support NVFP4 quantization for SharedFusedMoE layers, not just a configuration change. The non-MoE NVFP4 linear layers resolve fine (FLASHINFER_CUTLASS), but the shared expert MoE variant has no supported backend.

I was able to get it loaded, but it just produces garbage. Something is likely broken in the quant too. I'll try making one myself later to see if it behaves the same.

Changes:

  • vllm/model_executor/layers/fused_moe/cutlass_moe.py — Added swiglustep to the CUTLASS FP4 activation allowlist. The fallback path (non-fused apply_moe_activation + separate fp4 quant) already handles it.

  • vllm/model_executor/layers/fused_moe/fused_marlin_moe.py — Same for Marlin backend.

  • vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors_moe.py — The apply() method had a blanket assert layer.activation == "silu" that blocked every non-silu activation for every backend. Moved it into the FlashInfer TRTLLM branch only, since the modular kernel path (VLLM_CUTLASS, MARLIN) already validates activation support internally.

  • vllm/model_executor/models/step3p5.py — The weight loader iterates loaded_weight[expert_id] for each expert, but compressed-tensors NVFP4 stores weight_global_scale and input_global_scale as per-tensor with shape [1], not [num_experts, ...]. Added expand() to broadcast shape [1] to [num_experts, ...] before the loop.
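The compressed_tensors_moe.py change above amounts to scoping the activation check to the one backend that needs it. A minimal sketch of that pattern, with hypothetical function and backend names (not vLLM's actual API):

```python
# Sketch: move a silu-only assert out of the shared apply() path and
# into the one branch that requires it. apply_moe and the backend
# strings are illustrative, not vLLM's real signatures.
def apply_moe(activation: str, backend: str) -> str:
    if backend == "flashinfer_trtllm":
        # Only the FlashInfer TRTLLM kernel is silu-only, so the
        # assert lives here instead of guarding every backend.
        assert activation == "silu", "TRTLLM MoE kernel supports silu only"
        return f"trtllm:{activation}"
    # Modular kernel backends (VLLM_CUTLASS, MARLIN) validate their own
    # activation support, so swiglustep passes through this path.
    return f"{backend}:{activation}"
```

With the assert scoped like this, the CUTLASS and Marlin paths can accept swiglustep while TRTLLM still rejects anything but silu.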
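The step3p5.py shape fix, as a standalone sketch (the scale value and num_experts are made up; the real loader indexes actual checkpoint tensors):

```python
import torch

num_experts = 4

# compressed-tensors NVFP4 stores the global scales per-tensor: shape [1]
weight_global_scale = torch.tensor([0.0123])

# The expert loop does loaded_weight[expert_id], which only works for
# expert_id == 0 on a [1]-shaped tensor. expand() broadcasts the single
# scale to [num_experts, 1] as a view, without copying data.
if weight_global_scale.dim() == 1 and weight_global_scale.shape[0] == 1:
    weight_global_scale = weight_global_scale.expand(num_experts, -1)

# Now every expert sees the same per-tensor scale.
per_expert = [weight_global_scale[e] for e in range(num_experts)]
```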
