Couldn't get it to work...
...but I didn't really expect to either. Here's my report:
Tested on: Dual NVIDIA RTX Pro 6000 Blackwell (2×96GB), vLLM v0.16.0rc2.dev69+gc4b9e6778, Fedora, Python 3.12
Command:

```
vllm serve Firworks/Step-3.5-Flash-nvfp4 -tp 2 --max-num-seqs 1 --gpu-memory-utilization 0.96 --trust_remote_code --max-model-len 32000
```
Result: Fails during model load with `NotImplementedError: No NvFp4 MoE backend supports the deployment configuration`, raised from `vllm/model_executor/layers/fused_moe/oracle/nvfp4.py`. The issue is that Step 3.5's `SharedFusedMoE` (shared-expert) architecture has no compatible NVFP4 MoE kernel backend in vLLM.
Additional options tested (all failed with the same error):
- `VLLM_NVFP4_MOE_BACKEND=FLASHINFER_CUTLASS`
- `VLLM_NVFP4_MOE_BACKEND=VLLM_CUTLASS`
- `--quantization compressed-tensors` (explicit)
Conclusion: This appears to require new code in vLLM to support NVFP4 quantization for `SharedFusedMoE` layers, not just a configuration change. The non-MoE NVFP4 linear layers resolve fine (`FLASHINFER_CUTLASS`), but the shared-expert MoE variant has no supported backend.
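For context, the failure mode can be illustrated with a toy version of the backend oracle. This is a hypothetical sketch, not vLLM's actual code: the `SUPPORTED_BACKENDS` table, its entries, and the function name are invented for illustration.

```python
# Hypothetical sketch of an NVFP4 MoE backend oracle (not vLLM's real code).
# It illustrates why every tested option fails identically: selection is keyed
# on the deployment configuration, and no entry matches SharedFusedMoE.
SUPPORTED_BACKENDS = {
    # (MoE layer type, activation) -> backend name (illustrative entries only)
    ("FusedMoE", "silu"): "FLASHINFER_CUTLASS",
    ("FusedMoE", "swiglustep"): "VLLM_CUTLASS",
}

def select_nvfp4_moe_backend(layer_type: str, activation: str) -> str:
    backend = SUPPORTED_BACKENDS.get((layer_type, activation))
    if backend is None:
        # Mirrors the error reported above: no backend supports the config.
        raise NotImplementedError(
            "No NvFp4 MoE backend supports the deployment configuration"
        )
    return backend
```

Under this model, forcing `VLLM_NVFP4_MOE_BACKEND` can't help: the oracle still finds no kernel for the `SharedFusedMoE` layer type, so only a code change adding such a backend resolves it.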
I was able to get it to load, but it just produces garbage, so something is likely broken in the quant too. I'll try making a quant myself later to see if it behaves the same.
Changes:
- `vllm/model_executor/layers/fused_moe/cutlass_moe.py` — Added `swiglustep` to the CUTLASS FP4 activation allowlist. The fallback path (non-fused `apply_moe_activation` + separate fp4 quant) already handles it.
- `vllm/model_executor/layers/fused_moe/fused_marlin_moe.py` — Same for the Marlin backend.
- `vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors_moe.py` — The `apply()` method had a blanket `assert layer.activation == "silu"` that blocked ALL activations for ALL backends. Moved it into the FlashInfer TRTLLM branch only, since the modular kernel path (VLLM_CUTLASS, MARLIN) already validates activation support internally.
- `vllm/model_executor/models/step3p5.py` — The weight loader iterates `loaded_weight[expert_id]` for each expert, but compressed-tensors NVFP4 stores `weight_global_scale` and `input_global_scale` as per-tensor with shape `[1]`, not `[num_experts, ...]`. Added `expand()` to broadcast shape `[1]` to `[num_experts, ...]` before the loop.
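The scale-broadcast fix in `step3p5.py` boils down to something like the following minimal sketch (the shapes, values, and variable names here are made up for illustration; only `torch.Tensor.expand` and the per-expert indexing pattern come from the change described above):

```python
import torch

num_experts = 4

# compressed-tensors NVFP4 stores the global scales per-tensor: shape [1],
# while the weight loader expects one entry per expert.
weight_global_scale = torch.tensor([0.5])

# Broadcast shape [1] to [num_experts] so a loop that indexes
# loaded_weight[expert_id] sees a valid entry for every expert.
# expand() returns a view, so no data is copied.
if weight_global_scale.shape[0] == 1:
    weight_global_scale = weight_global_scale.expand(num_experts)

for expert_id in range(num_experts):
    scale = weight_global_scale[expert_id]  # valid for every expert now
```

Without the broadcast, indexing `weight_global_scale[expert_id]` for `expert_id >= 1` would fail (or silently load the wrong element), which is exactly the mismatch the loader hit with per-tensor scales.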