---
license: apache-2.0
base_model: stepfun-ai/Step-3.5-Flash
tags:
- nvfp4
- fp4
- quantized
- moe
- compressed-tensors
- vllm
- step3p5
library_name: transformers
quantized_by: tacos4me
pipeline_tag: text-generation
---

# Step-3.5-Flash-NVFP4

NVFP4-quantized version of [stepfun-ai/Step-3.5-Flash](https://huggingface.co/stepfun-ai/Step-3.5-Flash), an open-source frontier-level reasoning model by StepFun with 196.81B total parameters and ~11B active parameters per token.

## Model Description

[Step 3.5 Flash](https://huggingface.co/stepfun-ai/Step-3.5-Flash) is an open-source foundation model designed for frontier-level reasoning and agentic capabilities with exceptional efficiency. Key highlights from the base model:

- **AIME 2025**: 97.3%
- **SWE-bench Verified**: 74.4%
- **LiveCodeBench-V6**: 86.4%
- **Terminal-Bench 2.0**: 51.0%
- **GAIA (no file)**: 84.5

This NVFP4 quantization reduces the model size from ~372 GB (BF16) to ~105 GB while preserving quality, making it practical to deploy on just two GPUs.
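
As a rough sanity check on those numbers: NVFP4 stores each weight in 4 bits plus one FP8 scale per 16-weight group, i.e. about 4.5 bits per parameter. A minimal back-of-envelope sketch (it ignores the unquantized embeddings, `lm_head`, router gates, and norms, so it is only approximate):

```python
# Rough NVFP4 footprint estimate: 4-bit weights plus one 8-bit scale
# per group of 16 weights; unquantized tensors are ignored.
params = 196.81e9
bits_per_weight = 4 + 8 / 16                # = 4.5 bits per parameter
weight_bytes = params * bits_per_weight / 8
print(f"~{weight_bytes / 1e9:.0f} GB")      # ~111 GB, in the ballpark of the ~105 GB on disk
```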

## Quantization Details

| Property | Value |
|----------|-------|
| **Format** | NVFP4 (`nvfp4-pack-quantized`) |
| **Weight precision** | FP4 E2M1 with FP8 E4M3 block scales (group_size=16) |
| **Input activations** | FP8 E4M3 dynamic per-tensor-group (group_size=16) |
| **Quant method** | `compressed-tensors` |
| **Calibration data** | 512 samples from [HuggingFaceH4/ultrachat_200k](https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k) |
| **Max calibration seq length** | 2048 |
| **Quantization tool** | [llm-compressor](https://github.com/vllm-project/llm-compressor) |
| **Excluded from quantization** | `lm_head`, all MoE router gates (`moe.gate`) |

During calibration, a custom `Step3p5MoEMLP` calibration module activated all 288 experts in every MoE layer, ensuring each expert received calibration data.
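
For reference, a minimal sketch of what the one-shot quantization run might look like with llm-compressor. This is a simplified reconstruction rather than the exact script: the real run additionally swapped in the custom `Step3p5MoEMLP` calibration module, and the `ignore` patterns below are illustrative guesses at the module names.

```python
# Sketch of an NVFP4 one-shot run with llm-compressor (simplified; the
# actual run also used a custom Step3p5MoEMLP calibration module so
# that all 288 experts per layer received calibration data).
from transformers import AutoModelForCausalLM
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

model = AutoModelForCausalLM.from_pretrained(
    "stepfun-ai/Step-3.5-Flash", torch_dtype="auto", trust_remote_code=True
)

recipe = QuantizationModifier(
    targets="Linear",
    scheme="NVFP4",  # FP4 E2M1 weights, FP8 E4M3 block scales, group_size=16
    ignore=["lm_head", "re:.*moe.gate"],  # illustrative patterns for the excluded modules
)

oneshot(
    model=model,
    dataset="ultrachat_200k",  # calibration set named on this card
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
    output_dir="Step-3.5-Flash-NVFP4",
)
```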

## Architecture

| Component | Details |
|-----------|---------|
| **Architecture** | 45-layer sparse Mixture-of-Experts (MoE) Transformer |
| **Total parameters** | 196.81B |
| **Active parameters** | ~11B per token |
| **Experts** | 288 routed + 1 shared per MoE layer, top-8 selection |
| **Hidden size** | 4096 |
| **MoE intermediate size** | 1280 |
| **Dense intermediate size** | 11264 |
| **MoE layers** | 3-44 (42 layers) |
| **Attention** | GQA with 64 heads, 8 KV groups, head dim 128 |
| **Attention pattern** | 3:1 ratio of sliding-window (512-token) to full-attention layers |
| **Context window** | 256K tokens (llama3-style RoPE scaling) |
| **Vocabulary** | 128,896 tokens |
| **Multi-Token Prediction** | MTP-3 (predicts 4 tokens simultaneously) |

Layers 43-44 use a **swiglustep** activation (clipped SwiGLU with limit=7.0) in their MoE experts; all other MoE layers use standard SiLU. This is why the checkpoint needs vLLM support for swiglustep in the NVFP4 MoE kernels (see Requirements below).
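
To make the activation concrete, here is an illustrative PyTorch sketch of a clipped SwiGLU. The exact clamping placement in Step 3.5 Flash is an assumption here; the authoritative definition is the kernel added in the vLLM PR referenced below.

```python
# Illustrative clipped-SwiGLU ("swiglustep") sketch. The clamping
# placement is assumed, not taken from the Step 3.5 Flash source.
import torch
import torch.nn.functional as F

def clipped_swiglu(gate: torch.Tensor, up: torch.Tensor, limit: float = 7.0) -> torch.Tensor:
    gate = gate.clamp(max=limit)            # bound the gated branch from above
    up = up.clamp(min=-limit, max=limit)    # bound the linear branch on both sides
    return F.silu(gate) * up                # standard SwiGLU combination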

## Requirements

This model requires vLLM with swiglustep MoE activation support, which is added in the following PR:

**[vllm-project/vllm#34478](https://github.com/vllm-project/vllm/pull/34478)** -- Add swiglustep activation support for NVFP4 MoE backends

Until the PR is merged, install vLLM from the PR branch or build from source with the changes applied.

## Usage with vLLM

### Serving

```bash
vllm serve tacos4me/Step-3.5-Flash-NVFP4 \
  --served-model-name step3p5-flash \
  --tensor-parallel-size 2 \
  --trust-remote-code \
  --reasoning-parser step3p5 \
  --enable-auto-tool-choice \
  --tool-call-parser step3p5 \
  --disable-cascade-attn
```
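
Once the server is up, it exposes vLLM's OpenAI-compatible API. A minimal client sketch, assuming the default host and port (`localhost:8000`):

```python
# Query the server started above via the OpenAI-compatible endpoint.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="step3p5-flash",  # matches --served-model-name above
    messages=[{"role": "user", "content": "Explain the significance of the number 42."}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```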

### Offline Inference

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="tacos4me/Step-3.5-Flash-NVFP4",
    tensor_parallel_size=2,
    trust_remote_code=True,
    max_model_len=4096,
    gpu_memory_utilization=0.95,
)

output = llm.generate(
    "Explain the significance of the number 42.",
    SamplingParams(max_tokens=256),
)
print(output[0].outputs[0].text)
```

## Performance

| Metric | Value |
|--------|-------|
| **Model size on disk** | ~105 GB (23 safetensors shards) |
| **Decode throughput** | ~108 tok/s |
| **Hardware tested** | 2x NVIDIA RTX PRO 6000 Blackwell (TP=2) |
| **CUDA graphs** | Enabled |

## Known Issues

1. **FlashInfer MoE backend on Blackwell**: The FlashInfer CUTLASS MoE backend may crash with an illegal memory access on Blackwell GPUs (sm_120). Set `VLLM_USE_FLASHINFER_MOE_FP4=0` as a workaround (see the sketch after this list).

2. **MTP weights not included**: Speculative-decoding (Multi-Token Prediction) weights from the base model are not included in this quantized checkpoint.

3. **Minimum 2 GPUs required**: At ~105 GB, the model does not fit on a single 80/96 GB GPU. Use `--tensor-parallel-size 2` or higher.
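
For offline (Python) use, the workaround for issue 1 should be applied before vLLM is imported; when serving, export the variable in the shell instead. A minimal sketch:

```python
# Disable the FlashInfer CUTLASS MoE FP4 backend (Blackwell workaround).
# Set the variable before vLLM is imported or the engine is created.
import os
os.environ["VLLM_USE_FLASHINFER_MOE_FP4"] = "0"

from vllm import LLM  # import after setting the environment variable
```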

## Acknowledgments

- Based on [stepfun-ai/Step-3.5-Flash](https://huggingface.co/stepfun-ai/Step-3.5-Flash) by StepFun
- Quantized with [llm-compressor](https://github.com/vllm-project/llm-compressor), a vLLM-project tool
- NVFP4 MoE swiglustep activation support contributed to [vLLM](https://github.com/vllm-project/vllm)

## Citation

If you use this model, please cite the original Step 3.5 Flash paper:

```bibtex
@misc{huang2026step35flashopen,
  title={Step 3.5 Flash: Open Frontier-Level Intelligence with 11B Active Parameters},
  author={Huang, Ailin and Li, Ang and others},
  year={2026},
  eprint={2602.10604},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2602.10604}
}
```

## License

This model is released under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0), the same license as the base model.