---
license: apache-2.0
base_model:
- stepfun-ai/Step-3.7-Flash
- stepfun-ai/Step-3.7-Flash-NVFP4
tags:
- speculative-decoding
- mtp
- multi-token-prediction
- vllm
- nvfp4
- step3
language:
- en
- zh
- ja
library_name: vllm
pipeline_tag: text-generation
---

# Step-3.7-Flash MTP draft (for the NVFP4 checkpoint)

A tiny **Multi-Token-Prediction (MTP / nextn) draft** for **`stepfun-ai/Step-3.7-Flash-NVFP4`**, so you can run **speculative decoding** on the NVFP4 checkpoint in vLLM.

> **Why this exists:** the official `Step-3.7-Flash-NVFP4` checkpoint **declares**
> `num_nextn_predict_layers: 3` in its config but **ships zero MTP weights**. The
> 3 nextn layers were dropped during quantization, and the per-layer config arrays
> were truncated to 45 entries (so even loading them would raise `IndexError`). The
> BF16 and FP8 releases keep the MTP weights, but **the NVFP4 one, the smallest and
> most SM120-friendly, cannot do speculative decoding out of the box.** This repo is
> the missing piece: the 3 MTP layers extracted from the BF16 release, kept in BF16
> (they're tiny), packaged as a vLLM-loadable draft.

- **~5.9 GB**, BF16. The target stays NVFP4; mixing precisions is fine because the draft is small.
- **Lossless** in the speculative sense: vLLM's rejection sampling provably matches the target distribution; at `temperature=0` it follows the target's greedy tokens.
- Drop-in: point vLLM's `--speculative-config` at this directory.

## Usage (vLLM: the `stepfun37` image, or any build that includes `Step3p5MTP`)

The draft is auto-routed to vLLM's native `Step3p5MTP` + `Step3p5MTPProposer` because its config is `model_type: step3p7` with `num_nextn_predict_layers > 0`.
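Before launching, it can help to confirm that the draft's `config.json` meets the conditions described above. The following is a minimal sketch, not vLLM's actual routing code: it only re-checks the properties this README states (model type, nextn count, no `quantization_config`, per-layer arrays long enough to cover the MTP layer indices). The function name `check_draft_config` is hypothetical, and the use of a `num_hidden_layers` key to count the main decoder layers is an assumption about the config layout.

```python
from typing import Any

# Per-layer arrays named in this README; the real config may contain others.
PER_LAYER_KEYS = ("layer_types", "partial_rotary_factors")


def check_draft_config(cfg: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the config looks usable."""
    problems = []
    if cfg.get("model_type") != "step3p7":
        problems.append("model_type is not 'step3p7'")
    nextn = cfg.get("num_nextn_predict_layers", 0)
    if nextn <= 0:
        problems.append("num_nextn_predict_layers must be > 0")
    if "quantization_config" in cfg:
        problems.append("quantization_config present; draft should load as BF16")
    # ASSUMPTION: `num_hidden_layers` counts the main decoder layers, so the
    # MTP layers sit at indices num_hidden_layers .. num_hidden_layers + nextn - 1.
    needed = cfg.get("num_hidden_layers", 0) + nextn
    for key in PER_LAYER_KEYS:
        values = cfg.get(key)
        if values is not None and len(values) < needed:
            problems.append(f"{key} has {len(values)} entries, need {needed}")
    return problems
```

In practice one would load the file with `json.load(open("/path/to/draft/config.json"))` and print the returned list. A config with per-layer arrays truncated to 45 entries, as described for the NVFP4 release, would be flagged here instead of failing later at load time.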
```bash
# Mount the NVFP4 target at /model and this draft at /draft (both read-only),
# then serve the target with the draft as its MTP speculator.
docker run -d --gpus all --ipc=host --shm-size=64g --network host \
  -v /path/to/Step-3.7-Flash-NVFP4:/model:ro \
  -v /path/to/Step-3.7-Flash-MTP-draft:/draft:ro \
  vllm/vllm-openai:stepfun37 \
  /model \
  --served-model-name step3p7 --port 8000 \
  --trust-remote-code --tensor-parallel-size 2 --enable-expert-parallel \
  --quantization modelopt --kv-cache-dtype fp8 \
  --max-model-len 262144 --gpu-memory-utilization 0.92 --async-scheduling \
  --speculative-config '{"method":"mtp","model":"/draft","num_speculative_tokens":1}'
```

Write the `--speculative-config` JSON with **no spaces** (brace-expansion safety) and keep it single-quoted so the shell passes it through as one argument. **`num_speculative_tokens: 1` (K=1) is the sweet spot**; see the benchmarks below.

## Benchmarks (2× RTX PRO 6000 Blackwell, SM120, TP=2)

Measured on the NVFP4 base + this draft at K=1, against the same NVFP4 base with speculation off. Throughput figures are `per_req`: the decode tok/s a single user feels, with prefill excluded. Acceptance ≈ **0.80** in production traffic.

**Single-stream decode (short context):**

| workload | base (tok/s) | + MTP K=1 (tok/s) | speedup | accept |
|---|---|---|---|---|
| free-form | 106.8 | **125.5** | +17.5% | 0.77 |
| code | 106.7 | **133.7** | +25.3% | 0.88 |
| Japanese | 107.0 | **129.3** | +20.9% | 0.80 |
| tool-call | 106.9 | **135.4** | +26.6% | 0.90 |

**Decode speedup generally grows with context length** (longer KV → base is more memory-bound → bigger speculative win). Columns are concurrency levels `C`; `-` marks a cell with no result:

| context | C=1 | C=2 | C=4 | C=8 |
|---|---|---|---|---|
| 1K | +20% | +8% | +32% | +34% |
| 8K | +22% | +24% | +25% | **+44%** |
| 32K | +22% | +26% | +20% | +17% |
| **128K** | **+28%** | **+33%** | **+38%** | - |

The draft is net-positive across the whole concurrency range we tested (the MoE stays memory-bound up to high batch sizes). Best `K`: **K=1**. K=2 and K=3 lose to draft cost: later positions have lower acceptance and add forward cost. NaN-free on SM120 (Gate0 5/5).

## How it was built (reproducible)

The draft is **not retrained**. It is the original StepFun MTP layers, extracted verbatim:

1. From `stepfun-ai/Step-3.7-Flash` (BF16), take 52 tensors: the 51 under `model.layers.{45,46,47}.*` (the 3 nextn layers, dense-MLP, 17 tensors each) plus `model.embed_tokens.weight`. They all live in one shard (`model-00024.safetensors`).
2. Keep the **original BF16 weight names**. vLLM's `Step3p5MTP` loader does its own renaming (`.transformer.` strip, `shared_head.output→head`, `.mtp_block.` insert).
3. `config.json` = the **BF16 original** config (NOT the NVFP4 one): its per-layer arrays (`layer_types`, `partial_rotary_factors`, …) are length 48 and cover the MTP layer indices 45-47. **Strip `quantization_config`** so the draft loads as BF16.

Full scripts + benchmark harness: **[GitHub repo](#)** (`build_draft.py`, `launch_mtp.sh`, `eval_mtp.py`, `bench_matrix.py`).

## License & attribution

Apache-2.0, inherited from the base model **`stepfun-ai/Step-3.7-Flash`**. These are StepFun's weights, redistributed unchanged (only re-sharded/re-packaged as a draft). All credit for the model and the MTP layers goes to StepFun.
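
Returning to the build steps in "How it was built": the tensor selection and config clean-up can be sketched roughly as follows. This is an illustrative outline under stated assumptions, not the contents of `build_draft.py`. The helper names `select_draft_tensors` and `strip_quantization` are hypothetical, and the sketch works on plain name lists and dicts so it runs without the checkpoint.

```python
import re

# Layer indices of the 3 nextn layers, as stated in the build steps.
MTP_LAYERS = (45, 46, 47)
EMBED_NAME = "model.embed_tokens.weight"
_LAYER_RE = re.compile(r"^model\.layers\.(\d+)\.")


def select_draft_tensors(names):
    """Return the tensor names that belong in the draft, in input order.

    Names are kept verbatim (no renaming), matching step 2 of the build.
    """
    keep = []
    for name in names:
        match = _LAYER_RE.match(name)
        if name == EMBED_NAME or (match and int(match.group(1)) in MTP_LAYERS):
            keep.append(name)
    return keep


def strip_quantization(cfg):
    """Return a copy of the config dict without `quantization_config`."""
    return {k: v for k, v in cfg.items() if k != "quantization_config"}
```

With the real checkpoint, the name list would come from the shard itself (for example the keys exposed by the `safetensors` library), and the selected tensors would then be written to a new file alongside the cleaned `config.json`. On the BF16 release the selection should yield 52 names, which is an easy count to assert before saving.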