---
license: apache-2.0
base_model: stepfun-ai/Step-3.5-Flash
tags:
- nvfp4
- fp4
- quantized
- moe
- compressed-tensors
- vllm
- step3p5
library_name: transformers
quantized_by: tacos4me
pipeline_tag: text-generation
---

# Step-3.5-Flash-NVFP4

NVFP4-quantized version of [stepfun-ai/Step-3.5-Flash](https://huggingface.co/stepfun-ai/Step-3.5-Flash), an open-source frontier-level reasoning model by StepFun with 196.81B total parameters and ~11B active parameters per token.

## Model Description

[Step 3.5 Flash](https://huggingface.co/stepfun-ai/Step-3.5-Flash) is an open-source foundation model designed for frontier-level reasoning and agentic capabilities with exceptional efficiency. Key highlights from the base model:

- **AIME 2025**: 97.3%
- **SWE-bench Verified**: 74.4%
- **LiveCodeBench-V6**: 86.4%
- **Terminal-Bench 2.0**: 51.0%
- **GAIA (no file)**: 84.5

This NVFP4 quantization reduces the model size from ~372 GB (BF16) to ~105 GB while preserving quality, making it practical to deploy on just 2 GPUs.

## Quantization Details

| Property | Value |
|----------|-------|
| **Format** | NVFP4 (`nvfp4-pack-quantized`) |
| **Weight precision** | FP4 E2M1 with FP8 E4M3 block scales (group_size=16) |
| **Input activations** | FP8 E4M3 dynamic per-tensor-group (group_size=16) |
| **Quant method** | `compressed-tensors` |
| **Calibration data** | 512 samples from [HuggingFaceH4/ultrachat_200k](https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k) |
| **Max calibration seq length** | 2048 |
| **Quantization tool** | [llm-compressor](https://github.com/vllm-project/llm-compressor) |
| **Excluded from quantization** | `lm_head`, all MoE router gates (`moe.gate`) |
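
For intuition, here is a toy sketch of what the weight format in the table means: each group of 16 FP4 E2M1 codes shares a single FP8 E4M3 scale, so recovering a weight is just sign times magnitude times group scale. This is an illustration only, not the actual `compressed-tensors` kernel.

```python
import numpy as np

# The eight non-negative magnitudes representable in FP4 E2M1;
# the sign lives in a separate bit.
E2M1_MAGNITUDES = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
GROUP_SIZE = 16  # weights sharing one FP8 E4M3 scale

def dequantize_group(codes, signs, scale):
    """Recover weights from one group: w = sign * magnitude * group_scale."""
    magnitudes = E2M1_MAGNITUDES[np.asarray(codes)]
    return np.where(np.asarray(signs, dtype=bool), -magnitudes, magnitudes) * scale

# Three of a group's sixteen weights, with a group scale of 0.25:
print(dequantize_group([7, 2, 0], [0, 1, 0], 0.25))  # [ 1.5  -0.25  0.  ]
```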

During calibration, a custom `Step3p5MoEMLP` calibration module activated all 288 experts in every MoE layer, ensuring each expert received calibration data.
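
As a rough guide, a minimal sketch of what the quantization run may have looked like with llm-compressor's built-in `NVFP4` scheme. The router-gate ignore pattern and the dataset preprocessing are assumptions, and the actual run additionally swapped in the custom calibration module described above, which is not shown here.

```python
from datasets import load_dataset
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "stepfun-ai/Step-3.5-Flash"
NUM_SAMPLES = 512
MAX_LEN = 2048

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype="auto", trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

# 512 chat samples rendered through the model's chat template, then tokenized.
ds = load_dataset("HuggingFaceH4/ultrachat_200k", split=f"train_sft[:{NUM_SAMPLES}]")
ds = ds.map(
    lambda s: tokenizer(
        tokenizer.apply_chat_template(s["messages"], tokenize=False),
        max_length=MAX_LEN,
        truncation=True,
        add_special_tokens=False,
    ),
    remove_columns=ds.column_names,
)

# NVFP4 everywhere except the LM head and the MoE router gates
# (the gate ignore regex is a guess at the module naming).
recipe = QuantizationModifier(
    targets="Linear",
    scheme="NVFP4",
    ignore=["lm_head", "re:.*moe.gate$"],
)

oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_LEN,
    num_calibration_samples=NUM_SAMPLES,
)

model.save_pretrained("Step-3.5-Flash-NVFP4", save_compressed=True)
tokenizer.save_pretrained("Step-3.5-Flash-NVFP4")
```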

## Architecture

| Component | Details |
|-----------|---------|
| **Architecture** | 45-layer Sparse Mixture-of-Experts (MoE) Transformer |
| **Total parameters** | 196.81B |
| **Active parameters** | ~11B per token |
| **Experts** | 288 routed + 1 shared per MoE layer, top-8 selection |
| **Hidden size** | 4096 |
| **MoE intermediate size** | 1280 |
| **Dense intermediate size** | 11264 |
| **MoE layers** | 3-44 (42 layers) |
| **Attention** | GQA with 64 heads, 8 KV groups, head dim 128 |
| **Attention pattern** | 3:1 ratio of sliding-window (512 tokens) to full-attention layers |
| **Context window** | 256K tokens (with llama3-style RoPE scaling) |
| **Vocabulary** | 128,896 tokens |
| **Multi-Token Prediction** | MTP-3 (predicts 4 tokens simultaneously) |

Layers 43-44 use a **swiglustep** activation (clipped SwiGLU with limit=7.0) on their MoE experts. All other MoE layers use standard SiLU. This requires vLLM support for swiglustep in the NVFP4 MoE kernels (see Requirements below).
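
Since a "clipped SwiGLU" can be formulated a few ways, here is a minimal sketch of one plausible reading, with the clamp applied to both halves of the expert projection; the authoritative definition is the one in the vLLM PR linked under Requirements.

```python
import torch
import torch.nn.functional as F

def swiglustep(gate: torch.Tensor, up: torch.Tensor, limit: float = 7.0) -> torch.Tensor:
    """Clipped SwiGLU sketch: clamp both halves, then the usual SiLU(gate) * up."""
    gate = gate.clamp(min=-limit, max=limit)  # the clipping is what sets this apart
    up = up.clamp(min=-limit, max=limit)      # from plain SiLU gating
    return F.silu(gate) * up
```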

## Requirements

This model requires vLLM with swiglustep MoE activation support, which is added by the following PR:

**[vllm-project/vllm#34478](https://github.com/vllm-project/vllm/pull/34478)** -- Add swiglustep activation support for NVFP4 MoE backends

Until the PR is merged, install vLLM from the PR branch or from source with the changes applied.
76
+ ## Usage with vLLM
77
+
78
+ ### Serving
79
+
80
+ ```bash
81
+ vllm serve tacos4me/Step-3.5-Flash-NVFP4 \
82
+ --served-model-name step3p5-flash \
83
+ --tensor-parallel-size 2 \
84
+ --trust-remote-code \
85
+ --reasoning-parser step3p5 \
86
+ --enable-auto-tool-choice \
87
+ --tool-call-parser step3p5 \
88
+ --disable-cascade-attn
89
+ ```
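
Once the server is up, any OpenAI-compatible client can talk to it. A minimal example with the `openai` Python package, assuming vLLM's default host and port:

```python
from openai import OpenAI

# vLLM serves an OpenAI-compatible API; adjust host/port if you changed them.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="step3p5-flash",  # matches --served-model-name above
    messages=[{"role": "user", "content": "Explain the significance of the number 42."}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```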

### Offline Inference

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="tacos4me/Step-3.5-Flash-NVFP4",
    tensor_parallel_size=2,
    trust_remote_code=True,
    max_model_len=4096,
    gpu_memory_utilization=0.95,
)

output = llm.generate(
    "Explain the significance of the number 42.",
    SamplingParams(max_tokens=256),
)
print(output[0].outputs[0].text)
```

## Performance

| Metric | Value |
|--------|-------|
| **Model size on disk** | ~105 GB (23 safetensors shards) |
| **Decode throughput** | ~108 tok/s |
| **Hardware tested** | 2x NVIDIA RTX PRO 6000 Blackwell (TP=2) |
| **CUDA graphs** | Enabled |

## Known Issues

1. **FlashInfer MoE backend on Blackwell**: The FlashInfer CUTLASS MoE backend may crash with an illegal memory access on Blackwell GPUs (sm_120). Set `VLLM_USE_FLASHINFER_MOE_FP4=0` as a workaround (see the snippet after this list).

2. **MTP weights not included**: The speculative decoding (Multi-Token Prediction) weights from the base model are not included in this quantized checkpoint.

3. **Minimum 2 GPUs required**: The model requires ~105 GB, so it does not fit on a single 80/96 GB GPU. Use `--tensor-parallel-size 2` or higher.
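
For offline Python use, one way to apply the workaround from issue 1 is to set the variable before vLLM is imported; with `vllm serve`, export it in the shell instead.

```python
import os

# Disable the FlashInfer CUTLASS MoE FP4 backend before vLLM initializes
# (workaround for the Blackwell sm_120 crash described in issue 1).
os.environ["VLLM_USE_FLASHINFER_MOE_FP4"] = "0"

from vllm import LLM  # import only after the environment variable is set
```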

## Acknowledgments

- Based on [stepfun-ai/Step-3.5-Flash](https://huggingface.co/stepfun-ai/Step-3.5-Flash) by StepFun
- Quantized with [llm-compressor](https://github.com/vllm-project/llm-compressor) from the vLLM project
- NVFP4 MoE swiglustep activation support contributed to [vLLM](https://github.com/vllm-project/vllm)

## Citation

If you use this model, please cite the original Step 3.5 Flash paper:

```bibtex
@misc{huang2026step35flashopen,
  title={Step 3.5 Flash: Open Frontier-Level Intelligence with 11B Active Parameters},
  author={Huang, Ailin and Li, Ang and others},
  year={2026},
  eprint={2602.10604},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2602.10604}
}
```

## License

This model is released under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0), the same license as the base model.