| | --- |
| | license: other |
| | library_name: transformers |
| | tags: |
| | - step3p5 |
| | - moe |
| | - nvfp4 |
| | - fp4 |
| | - modelopt |
| | - quantized |
| | base_model: stepfun-ai/Step3p5 |
| | quantized_by: modelopt |
| | pipeline_tag: text-generation |
| | model_type: step3p5 |
| | --- |
| | |
| | # Step3p5 NVFP4 |
| |
|
| | NVIDIA FP4 (NVFP4) quantized version of the Step3p5 Mixture-of-Experts model, with MoE router/gate weights dequantized to bfloat16 for vLLM compatibility. |
| |
|
| | ## Quantization Details |
| |
|
| | - **Quantization method**: NVIDIA ModelOpt 0.41.0, NVFP4 (W4A4) |
| | - **Weight format**: FP4 E2M1, packed 2 values per uint8 byte |
| | - **Group size**: 16 |
| | - **Excluded from quantization**: `lm_head`, `*.moe.gate*` (router/gate) |
| |
|
| | The MoE router/gate weights are stored in bfloat16 (not quantized) following NVIDIA ModelOpt best practices — quantizing the router degrades routing quality with negligible memory savings. |
| |
|
| | ## Serving with vLLM |
| |
|
| | ```bash |
| | VLLM_USE_FLASHINFER_MOE_FP4=0 vllm serve apandacoding/step3p5-nvfp4 \ |
| | --quantization modelopt_fp4 \ |
| | --trust-remote-code \ |
| | --host 0.0.0.0 --port 8000 |
| | ``` |
| |
|
| | > **Note**: `VLLM_USE_FLASHINFER_MOE_FP4=0` is required to use the VLLM_CUTLASS MoE backend. The FlashInfer TRTLLM monolithic MoE kernel has a known issue with 288-expert models. |
| | |
| | ## Model Architecture |
| | |
| | - **Type**: Mixture of Experts (MoE) with shared experts |
| | - **Experts**: 288 routed + shared expert per layer |
| | - **Top-K**: 8 experts per token |
| | - **Hidden size**: 4096 |
| | - **MoE intermediate size**: 1280 |
| | - **MoE layers**: 42 (layers 3–44) |
| | - **Attention**: GQA with 96 heads, 8 KV heads |
| | - **Context length**: 262,144 tokens |
| | |