---
license: other
library_name: transformers
tags:
  - step3p5
  - moe
  - nvfp4
  - fp4
  - modelopt
  - quantized
base_model: stepfun-ai/Step3p5
quantized_by: modelopt
pipeline_tag: text-generation
model_type: step3p5
---

# Step3p5 NVFP4

NVIDIA FP4 (NVFP4) quantized version of the Step3p5 Mixture-of-Experts model, with MoE router/gate weights dequantized to bfloat16 for vLLM compatibility.

## Quantization Details

- Quantization method: NVIDIA ModelOpt 0.41.0, NVFP4 (W4A4)
- Weight format: FP4 E2M1, packed 2 values per `uint8` byte
- Group size: 16
- Excluded from quantization: `lm_head`, `*.moe.gate*` (router/gate)

The MoE router/gate weights are stored in bfloat16 (not quantized) following NVIDIA ModelOpt best practices — quantizing the router degrades routing quality with negligible memory savings.
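For intuition, NVFP4 stores each weight as a 4-bit E2M1 code (1 sign, 2 exponent, 1 mantissa bit), two codes per byte, with one scale per group of 16 values. A rough pure-Python sketch of the quantize/pack/dequantize round trip — the codebook values are the standard E2M1 magnitudes, but the nibble order and scale encoding here are illustrative, not the exact on-disk layout:

```python
# Positive magnitudes representable in FP4 E2M1.
E2M1 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_group(vals, group=16):
    """Quantize one group of floats to 4-bit codes plus a per-group scale."""
    assert len(vals) == group
    scale = max(abs(v) for v in vals) / 6.0 or 1.0   # map the max into E2M1 range
    codes = []
    for v in vals:
        mag = abs(v) / scale
        idx = min(range(8), key=lambda i: abs(E2M1[i] - mag))  # nearest code
        codes.append((8 if v < 0 else 0) | idx)                # sign in bit 3
    return codes, scale

def pack(codes):
    """Pack two 4-bit codes per uint8 byte (low nibble first; order illustrative)."""
    return bytes(codes[i] | (codes[i + 1] << 4) for i in range(0, len(codes), 2))

def unpack_dequant(packed, scale):
    """Unpack nibbles and rescale back to floats."""
    out = []
    for b in packed:
        for c in (b & 0xF, b >> 4):
            out.append((-1.0 if c & 8 else 1.0) * E2M1[c & 7] * scale)
    return out
```

This is why the group size matters: each group of 16 weights shares one scale, so values far from the group maximum lose precision.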

## Serving with vLLM

```bash
VLLM_USE_FLASHINFER_MOE_FP4=0 vllm serve apandacoding/step3p5-nvfp4 \
  --quantization modelopt_fp4 \
  --trust-remote-code \
  --host 0.0.0.0 --port 8000
```

Note: `VLLM_USE_FLASHINFER_MOE_FP4=0` is required to use the `VLLM_CUTLASS` MoE backend. The FlashInfer TRTLLM monolithic MoE kernel has a known issue with 288-expert models.
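Once the server is up, vLLM exposes an OpenAI-compatible API. A minimal stdlib-only sketch of a chat request — the host and port are taken from the serve command above, and the prompt is just an example:

```python
import json
from urllib.request import Request, urlopen

# Chat-completions payload for the OpenAI-compatible vLLM endpoint.
payload = {
    "model": "apandacoding/step3p5-nvfp4",
    "messages": [{"role": "user", "content": "Explain NVFP4 in one sentence."}],
    "max_tokens": 128,
}

req = Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# With the server running, uncomment to send the request:
# resp = json.load(urlopen(req))
# print(resp["choices"][0]["message"]["content"])
```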

## Model Architecture

- Type: Mixture of Experts (MoE) with shared experts
- Experts: 288 routed + shared expert per layer
- Top-K: 8 experts per token
- Hidden size: 4096
- MoE intermediate size: 1280
- MoE layers: 42 (layers 3–44)
- Attention: GQA with 96 heads, 8 KV heads
- Context length: 262,144 tokens
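The routing numbers above can be sketched as: the gate scores all 288 experts per token, the top 8 are selected, and their weights are renormalized. This is a generic MoE top-k sketch in NumPy — renormalizing over the selected experts is a common convention, and Step3p5's exact gating function may differ:

```python
import numpy as np

NUM_EXPERTS, TOP_K = 288, 8

def route(gate_logits):
    """Pick top-k experts per token and renormalize their softmax weights."""
    # Numerically stable softmax over the expert dimension.
    probs = np.exp(gate_logits - gate_logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    topk_idx = np.argsort(probs, axis=-1)[:, -TOP_K:]    # indices of the top-8 experts
    topk_w = np.take_along_axis(probs, topk_idx, axis=-1)
    topk_w /= topk_w.sum(axis=-1, keepdims=True)         # renormalize over selected experts
    return topk_idx, topk_w

rng = np.random.default_rng(0)
idx, w = route(rng.standard_normal((4, NUM_EXPERTS)))    # 4 tokens, 288 expert logits each
```

Each token therefore activates only 8 of the 288 routed experts (plus the always-on shared expert), which is what makes the per-token compute much smaller than the total parameter count.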