---
license: apache-2.0
base_model:
- stepfun-ai/Step-3.7-Flash
- stepfun-ai/Step-3.7-Flash-NVFP4
tags:
- speculative-decoding
- mtp
- multi-token-prediction
- vllm
- nvfp4
- step3
language:
- en
- zh
- ja
library_name: vllm
pipeline_tag: text-generation
---

# Step-3.7-Flash MTP draft (for the NVFP4 checkpoint)

A tiny **Multi-Token-Prediction (MTP / nextn) draft** for **`stepfun-ai/Step-3.7-Flash-NVFP4`**, so you can run
**speculative decoding** on the NVFP4 checkpoint in vLLM.

> **Why this exists:** the official `Step-3.7-Flash-NVFP4` checkpoint **declares**
> `num_nextn_predict_layers: 3` in its config but **ships zero MTP weights**. The
> 3 nextn layers were dropped during quantization, and the per-layer config arrays
> were truncated to 45 entries, so loading layers 45-47 would raise `IndexError`
> even if the weights were present. The BF16 and FP8 releases keep the MTP weights,
> but **the NVFP4 one (the smallest, SM120-friendly release) cannot do speculative
> decoding out of the box.** This repo is the missing piece: the 3 MTP layers
> extracted from the BF16 release, kept in BF16 (they're tiny), packaged as a
> vLLM-loadable draft.
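
The mismatch described above can be checked mechanically. The following is a minimal sketch, not the check used to find the problem: it assumes the config stores the main-layer count under `num_hidden_layers` (a common Hugging Face convention, not confirmed for this checkpoint), and the helper name and toy config values are hypothetical.

```python
def short_per_layer_arrays(config, array_keys=("layer_types", "partial_rotary_factors")):
    """Return {key: length} for per-layer arrays too short to index the nextn layers."""
    # Assumption: main layers are counted by num_hidden_layers, nextn layers come after them.
    needed = config["num_hidden_layers"] + config["num_nextn_predict_layers"]
    return {
        key: len(config[key])
        for key in array_keys
        if key in config and len(config[key]) < needed
    }


# Toy configs mirroring the situation above; the list contents are placeholders.
nvfp4_like = {
    "num_hidden_layers": 45,
    "num_nextn_predict_layers": 3,
    "layer_types": ["placeholder"] * 45,
    "partial_rotary_factors": [1.0] * 45,
}
bf16_like = dict(
    nvfp4_like,
    layer_types=["placeholder"] * 48,
    partial_rotary_factors=[1.0] * 48,
)

print(short_per_layer_arrays(nvfp4_like))  # {'layer_types': 45, 'partial_rotary_factors': 45}
print(short_per_layer_arrays(bf16_like))   # {}
```

An empty result means every listed array is long enough to be indexed at the nextn layer positions.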

- **~5.9 GB**, BF16. Base = NVFP4 (mixed precision is fine; the draft is small).
- **Lossless** in the speculative sense: vLLM's rejection sampling provably matches
  the target distribution; at `temperature=0` it follows the target's greedy tokens.
- Drop-in: point vLLM's `--speculative-config` at this directory.
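
To make the `temperature=0` case concrete, here is a simplified sketch of greedy verification. It is an illustration of the idea only, not vLLM's implementation; the function name and token ids are made up.

```python
def greedy_verify(draft_tokens, target_greedy_tokens):
    """Return the tokens emitted by one speculative step at temperature 0.

    draft_tokens: K tokens proposed by the draft.
    target_greedy_tokens: K + 1 argmax tokens from the target, where entry i is
        the target's choice given the context plus draft_tokens[:i].
    """
    if len(target_greedy_tokens) != len(draft_tokens) + 1:
        raise ValueError("target must score every draft position plus one more")
    emitted = []
    for proposed, wanted in zip(draft_tokens, target_greedy_tokens):
        emitted.append(wanted)  # the target's token is always what gets emitted
        if proposed != wanted:
            return emitted  # first mismatch: later target entries are invalid
    emitted.append(target_greedy_tokens[-1])  # every draft token matched: bonus token
    return emitted


# K=1 examples with made-up token ids.
print(greedy_verify([7], [7, 9]))  # [7, 9]  draft accepted, two tokens in one step
print(greedy_verify([7], [8, 9]))  # [8]     draft rejected, still one correct token
```

Either way the emitted tokens are exactly what the target alone would have produced; the draft only changes how many arrive per step.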

## Usage (vLLM `stepfun37` image, or any vLLM build that includes `Step3p5MTP`)

The draft is auto-routed to vLLM's native `Step3p5MTP` + `Step3p5MTPProposer`
because its config is `model_type: step3p7` with `num_nextn_predict_layers > 0`.

```bash
# Serve the NVFP4 target with the BF16 MTP draft mounted at /draft.
docker run -d --gpus all --ipc=host --shm-size=64g --network host \
  -v /path/to/Step-3.7-Flash-NVFP4:/model:ro \
  -v /path/to/Step-3.7-Flash-MTP-draft:/draft:ro \
  vllm/vllm-openai:stepfun37 \
  /model \
  --served-model-name step3p7 --port 8000 \
  --trust-remote-code --tensor-parallel-size 2 --enable-expert-parallel \
  --quantization modelopt --kv-cache-dtype fp8 \
  --max-model-len 262144 --gpu-memory-utilization 0.92 --async-scheduling \
  --speculative-config '{"method":"mtp","model":"/draft","num_speculative_tokens":1}'
```

JSON for `--speculative-config` must have **no spaces** (brace-expansion safety).
**`num_speculative_tokens: 1` (K=1) is the sweet spot**; see the benchmarks below.
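
If the flag is generated by a script, a compact serializer avoids stray spaces. This is a small sketch using only the Python standard library; the helper name `speculative_config_arg` is hypothetical, and the keys are the ones used in the command above.

```python
import json
import shlex


def speculative_config_arg(draft_path, num_speculative_tokens=1):
    """Build the --speculative-config flag with space-free JSON, quoted for a shell."""
    config = {
        "method": "mtp",
        "model": draft_path,
        "num_speculative_tokens": num_speculative_tokens,
    }
    # separators=(",", ":") drops the spaces json.dumps inserts by default.
    compact = json.dumps(config, separators=(",", ":"))
    return "--speculative-config " + shlex.quote(compact)


print(speculative_config_arg("/draft"))
# --speculative-config '{"method":"mtp","model":"/draft","num_speculative_tokens":1}'
```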

## Benchmarks (2× RTX PRO 6000 Blackwell, SM120, TP=2)

Measured on the NVFP4 base + this draft, K=1, vs. NVFP4 with speculation off.
`per_req` = decode tok/s a single user feels (prefill excluded). Acceptance ≈ **0.80** in production traffic.

**Single-stream decode (short context, `per_req` tok/s):**

| | workload | base | + MTP K=1 | speedup | accept | |
| |---|---|---|---|---| |
| | free-form | 106.8 | **125.5** | +17.5% | 0.77 | |
| | code | 106.7 | **133.7** | +25.3% | 0.88 | |
| | Japanese | 107.0 | **129.3** | +20.9% | 0.80 | |
| | tool-call | 106.9 | **135.4** | +26.6% | 0.90 | |

**Decode speedup grows with context length** (longer KV → base is more
memory-bound → bigger speculative win):

| | context | C=1 | C=2 | C=4 | C=8 | |
| |---|---|---|---|---| |
| | 1K | +20% | +8% | +32% | +34% | |
| | 8K | +22% | +24% | +25% | **+44%** | |
| | 32K | +22% | +26% | +20% | +17% | |
| | **128K** | **+28%** | **+33%** | **+38%** | — | |

Net-positive across the whole concurrency range we tested (the MoE stays memory-bound
up to high batch sizes). Best `K`: **K=1**. K=2 and K=3 lose to draft cost: later positions
have lower acceptance and add forward cost. NaN-free on SM120 (Gate0 5/5).
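
The K trade-off above can be reasoned about with a back-of-envelope model. The sketch below is an assumption-laden illustration, not the benchmark harness: only the first acceptance rate (0.80) comes from the measurements; the later-position rates and the per-draft-token cost are hypothetical values chosen to show the shape of the curve.

```python
def expected_tokens_per_step(accept_rates):
    """Expected tokens emitted per target step.

    accept_rates[i] is the chance that draft position i + 1 is accepted, given
    that all earlier positions were accepted. The target always adds one token.
    """
    expected = 1.0
    survive = 1.0
    for rate in accept_rates:
        survive *= rate      # probability the chain of acceptances reaches this position
        expected += survive
    return expected


def estimated_speedup(accept_rates, cost_per_draft_token):
    """Tokens per step divided by relative step cost (a base decode step costs 1.0)."""
    relative_cost = 1.0 + cost_per_draft_token * len(accept_rates)
    return expected_tokens_per_step(accept_rates) / relative_cost


# 0.80 is the measured K=1 acceptance; 0.55 and 0.35 are HYPOTHETICAL later positions.
rates = [0.80, 0.55, 0.35]
cost = 0.5  # HYPOTHETICAL extra cost per speculative token, relative to a base step

speedups = {k: estimated_speedup(rates[:k], cost) for k in (1, 2, 3)}
for k, value in speedups.items():
    print(f"K={k}: estimated speedup x{value:.2f}")
```

Under these assumed numbers each extra position adds fewer expected tokens than it costs, which is the pattern the measurements point to.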

## How it was built (reproducible)

The draft is **not retrained**; it's the original StepFun MTP layers, extracted verbatim:

1. From `stepfun-ai/Step-3.7-Flash` (BF16), take 52 tensors: the 51 tensors under
   `model.layers.{45,46,47}.*` (the 3 nextn layers, dense-MLP, 17 tensors each)
   plus `model.embed_tokens.weight`. They all live in one shard
   (`model-00024.safetensors`).
2. Keep the **original BF16 weight names**: vLLM's `Step3p5MTP` loader does its own
   renaming (`.transformer.` strip, `shared_head.output→head`, `.mtp_block.` insert).
3. `config.json` = the **BF16 original** config (NOT the NVFP4 one): its per-layer
   arrays (`layer_types`, `partial_rotary_factors`, …) are length 48 and cover the
   MTP layer indices 45-47. **Strip `quantization_config`** so the draft loads as BF16.
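
The selection and config steps above can be sketched with the standard library alone. This is not `build_draft.py`: it only shows the name filter and the config edit, leaves tensor reading and writing out, and uses toy tensor names and hypothetical helper names.

```python
import copy
import re

# Matches model.layers.45.*, .46.* and .47.* only; the trailing dot excludes e.g. layer 450.
NEXTN_LAYER = re.compile(r"^model\.layers\.(45|46|47)\.")
EMBEDDING = "model.embed_tokens.weight"


def select_draft_tensor_names(all_names):
    """Keep the nextn-layer tensors and the embedding, with names left unchanged."""
    return [name for name in all_names if name == EMBEDDING or NEXTN_LAYER.match(name)]


def make_draft_config(bf16_config):
    """Copy the BF16 config and drop quantization_config if it is present."""
    config = copy.deepcopy(bf16_config)
    config.pop("quantization_config", None)
    return config


# Toy names for illustration; the suffixes are placeholders, not real tensor names.
names = [
    "model.layers.44.placeholder.weight",
    "model.layers.45.placeholder.weight",
    "model.layers.47.placeholder.weight",
    "model.layers.450.placeholder.weight",
    "model.embed_tokens.weight",
    "lm_head.weight",
]
print(select_draft_tensor_names(names))

toy_config = {"num_nextn_predict_layers": 3, "quantization_config": {"placeholder": True}}
print(make_draft_config(toy_config))  # {'num_nextn_predict_layers': 3}
```

Because names are kept verbatim, the renaming stays the loader's job, as step 2 describes.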

Full scripts + benchmark harness: **[GitHub repo](#)** (`build_draft.py`,
`launch_mtp.sh`, `eval_mtp.py`, `bench_matrix.py`).

## License & attribution

Apache-2.0, inherited from the base model **`stepfun-ai/Step-3.7-Flash`**. These are
StepFun's weights, redistributed unchanged (only re-sharded/re-packaged as a draft).
All credit for the model and the MTP layers goes to StepFun.