---
license: other
library_name: transformers
tags:
  - step3p5
  - moe
  - nvfp4
  - fp4
  - modelopt
  - quantized
base_model: stepfun-ai/Step3p5
quantized_by: modelopt
pipeline_tag: text-generation
model_type: step3p5
---

# Step3p5 NVFP4

NVIDIA FP4 (NVFP4) quantized version of the Step3p5 Mixture-of-Experts model, with MoE router/gate weights dequantized to bfloat16 for vLLM compatibility.

## Quantization Details

- **Quantization method**: NVIDIA ModelOpt 0.41.0, NVFP4 (W4A4)
- **Weight format**: FP4 E2M1, packed 2 values per uint8 byte
- **Group size**: 16
- **Excluded from quantization**: `lm_head`, `*.moe.gate*` (router/gate)

The MoE router/gate weights are stored in bfloat16 (not quantized), following NVIDIA ModelOpt best practices: quantizing the router degrades routing quality for negligible memory savings.
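To make the weight format above concrete, here is an illustrative sketch of decoding NVFP4-packed weights in NumPy. The FP4 E2M1 value set (±0, 0.5, 1, 1.5, 2, 3, 4, 6) is standard; the nibble ordering (low nibble first) and the per-row scale layout are assumptions for illustration, not the exact on-disk layout of this checkpoint.

```python
import numpy as np

# FP4 E2M1: 1 sign bit, 2 exponent bits, 1 mantissa bit.
# Codes 0..7 are the positive values, 8..15 their negatives.
E2M1_VALUES = np.array(
    [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
     -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0],
    dtype=np.float32,
)

def unpack_nvfp4(packed: np.ndarray, scales: np.ndarray,
                 group_size: int = 16) -> np.ndarray:
    """Decode uint8-packed FP4 weights (2 nibbles per byte) and apply
    per-group scale factors. `packed` is (rows, cols // 2) uint8;
    `scales` is (rows, cols // group_size). Nibble order is assumed
    low-nibble-first."""
    lo = packed & 0x0F
    hi = packed >> 4
    # Interleave the two nibbles of each byte back into element order.
    codes = np.stack([lo, hi], axis=-1).reshape(packed.shape[0], -1)
    vals = E2M1_VALUES[codes]
    # Every consecutive run of `group_size` values shares one scale.
    grouped = vals.reshape(vals.shape[0], -1, group_size)
    return (grouped * scales[..., None]).reshape(vals.shape[0], -1)
```

With group size 16, each block of 16 FP4 values occupies 8 bytes plus one shared scale, which is where the roughly 4x memory reduction over bfloat16 comes from.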

## Serving with vLLM

```bash
VLLM_USE_FLASHINFER_MOE_FP4=0 vllm serve apandacoding/step3p5-nvfp4 \
  --quantization modelopt_fp4 \
  --trust-remote-code \
  --host 0.0.0.0 --port 8000
```

> **Note**: `VLLM_USE_FLASHINFER_MOE_FP4=0` is required to use the VLLM_CUTLASS MoE backend. The FlashInfer TRTLLM monolithic MoE kernel has a known issue with 288-expert models.
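Once the server is up, it exposes vLLM's OpenAI-compatible API. A minimal stdlib-only client sketch, assuming the default host/port from the command above (the helper names `build_chat_request` and `chat` are illustrative):

```python
import json
import urllib.request

def build_chat_request(prompt: str,
                       model: str = "apandacoding/step3p5-nvfp4") -> dict:
    """Build an OpenAI-compatible /v1/chat/completions payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 128,
    }

def chat(prompt: str, base_url: str = "http://localhost:8000") -> str:
    """Send a chat completion request and return the reply text."""
    payload = json.dumps(build_chat_request(prompt)).encode()
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Any OpenAI-compatible client (e.g. the `openai` Python package pointed at `http://localhost:8000/v1`) works the same way.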

## Model Architecture

- **Type**: Mixture of Experts (MoE) with shared experts
- **Experts**: 288 routed experts plus a shared expert per layer
- **Top-K**: 8 experts per token
- **Hidden size**: 4096
- **MoE intermediate size**: 1280
- **MoE layers**: 42 (layers 3–44)
- **Attention**: GQA with 96 heads, 8 KV heads
- **Context length**: 262,144 tokens