apandacoding
/

step3p5-nvfp4

Text Generation

Mixture of Experts

8-bit precision

Model card Files Files and versions

step3p5-nvfp4 / README.md

apandacoding's picture

Upload README.md with huggingface_hub

ddbc7e0 verified 20 days ago

|

history blame contribute delete

1.55 kB

	---
	license: other
	library_name: transformers
	tags:
	- step3p5
	- moe
	- nvfp4
	- fp4
	- modelopt
	- quantized
	base_model: stepfun-ai/Step3p5
	quantized_by: modelopt
	pipeline_tag: text-generation
	model_type: step3p5
	---

	# Step3p5 NVFP4

	NVIDIA FP4 (NVFP4) quantized version of the Step3p5 Mixture-of-Experts model, with MoE router/gate weights dequantized to bfloat16 for vLLM compatibility.

	## Quantization Details

	- Quantization method: NVIDIA ModelOpt 0.41.0, NVFP4 (W4A4)
	- Weight format: FP4 E2M1, packed 2 values per uint8 byte
	- Group size: 16
	- Excluded from quantization: `lm_head`, `.moe.gate` (router/gate)

	The MoE router/gate weights are stored in bfloat16 (not quantized) following NVIDIA ModelOpt best practices — quantizing the router degrades routing quality with negligible memory savings.

	## Serving with vLLM

	```bash
	VLLM_USE_FLASHINFER_MOE_FP4=0 vllm serve apandacoding/step3p5-nvfp4 \
	--quantization modelopt_fp4 \
	--trust-remote-code \
	--host 0.0.0.0 --port 8000
	```

	> Note: `VLLM_USE_FLASHINFER_MOE_FP4=0` is required to use the VLLM_CUTLASS MoE backend. The FlashInfer TRTLLM monolithic MoE kernel has a known issue with 288-expert models.

	## Model Architecture

	- Type: Mixture of Experts (MoE) with shared experts
	- Experts: 288 routed + shared expert per layer
	- Top-K: 8 experts per token
	- Hidden size: 4096
	- MoE intermediate size: 1280
	- MoE layers: 42 (layers 3–44)
	- Attention: GQA with 96 heads, 8 KV heads
	- Context length: 262,144 tokens