---
license: apache-2.0
base_model:
- stepfun-ai/Step-3.7-Flash
- stepfun-ai/Step-3.7-Flash-NVFP4
tags:
- speculative-decoding
- mtp
- multi-token-prediction
- vllm
- nvfp4
- step3
language:
- en
- zh
- ja
library_name: vllm
pipeline_tag: text-generation
---
# Step-3.7-Flash MTP draft (for the NVFP4 checkpoint)
A tiny **Multi-Token-Prediction (MTP / nextn) draft** for **`stepfun-ai/Step-3.7-Flash-NVFP4`**, so you can run
**speculative decoding** on the NVFP4 checkpoint in vLLM.
> **Why this exists:** the official `Step-3.7-Flash-NVFP4` checkpoint **declares**
> `num_nextn_predict_layers: 3` in its config but **ships zero MTP weights** — the
> 3 nextn layers were dropped during quantization, and the per-layer config arrays
> were truncated to 45 entries (so indexing MTP layers 45-47 would raise an `IndexError` even with the weights present). The BF16 and FP8
> releases keep the MTP weights, but **the NVFP4 one — the SM120-friendly, smallest
> one — cannot do speculative decoding out of the box.** This repo is the missing
> piece: the 3 MTP layers extracted from the BF16 release, kept in BF16 (they're
> tiny), packaged as a vLLM-loadable draft.
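
The array-length mismatch described above can be checked mechanically. The following is a minimal sketch, not the author's tooling: the helper name `uncovered_mtp_layers` is hypothetical, the key `num_hidden_layers` is assumed to hold the main-layer count (45), and the toy configs use placeholder values shaped like the description.

```python
def uncovered_mtp_layers(config, array_keys=("layer_types", "partial_rotary_factors")):
    """Return, per per-layer array, the MTP layer indices it is too short to cover."""
    n_main = config["num_hidden_layers"]               # assumed key name
    n_mtp = config.get("num_nextn_predict_layers", 0)
    mtp_indices = range(n_main, n_main + n_mtp)        # e.g. 45, 46, 47
    problems = {}
    for key in array_keys:
        values = config.get(key)
        if values is None:
            continue                                   # array not present: nothing to check
        missing = [i for i in mtp_indices if i >= len(values)]
        if missing:
            problems[key] = missing
    return problems


# Toy configs with placeholder values, shaped like the two checkpoints described above.
nvfp4_like = {
    "num_hidden_layers": 45,
    "num_nextn_predict_layers": 3,
    "layer_types": ["placeholder"] * 45,
    "partial_rotary_factors": [1.0] * 45,
}
bf16_like = dict(
    nvfp4_like,
    layer_types=["placeholder"] * 48,
    partial_rotary_factors=[1.0] * 48,
)

print(uncovered_mtp_layers(nvfp4_like))  # arrays of length 45 miss indices 45-47
print(uncovered_mtp_layers(bf16_like))   # arrays of length 48 cover everything
```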
- **~5.9 GB**, BF16. Base = NVFP4 (mixed precision is fine; the draft is small).
- **Lossless** in the speculative sense: vLLM's rejection sampling provably matches
the target distribution; at `temperature=0` it follows the target's greedy tokens.
- Drop-in: point vLLM's `--speculative-config` at this directory.
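
To make the "lossless" bullet concrete, here is a sketch of the textbook speculative-sampling accept/reject rule for one drafted token. It illustrates the general technique under simple assumptions (plain Python lists as distributions, a hypothetical `speculative_step` helper) and is not vLLM's actual implementation.

```python
import random


def speculative_step(p, q, draft_token, rng):
    """Accept or reject one drafted token.

    p: target-model probabilities, q: draft-model probabilities (same vocabulary).
    draft_token must have been sampled from q, so q[draft_token] > 0.
    Returns (emitted_token, accepted).
    """
    accept_prob = min(1.0, p[draft_token] / q[draft_token])
    if rng.random() < accept_prob:
        return draft_token, True
    # On rejection, resample from the normalized residual max(0, p - q).
    # A rejection implies p < q somewhere, so the residual has positive mass.
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    token = rng.choices(range(len(p)), weights=residual)[0]
    return token, False


# Greedy case (temperature=0): both distributions are one-hot.
rng = random.Random(0)
target_greedy = [0.0, 0.0, 1.0]   # target's argmax is token 2
draft_greedy = [1.0, 0.0, 0.0]    # draft proposes token 0
print(speculative_step(target_greedy, draft_greedy, 0, rng))  # (2, False)
```

With one-hot distributions a wrong draft is always rejected and replaced by the target's argmax, and a matching draft is always accepted, which is why greedy output follows the target's tokens. For general `p` and `q`, the emitted token is distributed according to `p`.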
## Usage (vLLM `stepfun37` image, or any vLLM build that includes `Step3p5MTP`)
The draft is auto-routed to vLLM's native `Step3p5MTP` + `Step3p5MTPProposer`
because its config is `model_type: step3p7` with `num_nextn_predict_layers > 0`.
```bash
docker run -d --gpus all --ipc=host --shm-size=64g --network host \
-v /path/to/Step-3.7-Flash-NVFP4:/model:ro \
-v /path/to/Step-3.7-Flash-MTP-draft:/draft:ro \
vllm/vllm-openai:stepfun37 \
/model \
--served-model-name step3p7 --port 8000 \
--trust-remote-code --tensor-parallel-size 2 --enable-expert-parallel \
--quantization modelopt --kv-cache-dtype fp8 \
--max-model-len 262144 --gpu-memory-utilization 0.92 --async-scheduling \
--speculative-config '{"method":"mtp","model":"/draft","num_speculative_tokens":1}'
```
Write the `--speculative-config` JSON with **no spaces** and keep it single-quoted, so the shell passes it through as one argument without word splitting or brace expansion.
**`num_speculative_tokens: 1` (K=1) is the sweet spot** — see below.
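
If the launch command is generated from a script, the space-free JSON can be produced rather than typed by hand. A small sketch using only the standard library; the helper name `compact_spec_config` is hypothetical, and the field values mirror the command above.

```python
import json


def compact_spec_config(draft_path, num_speculative_tokens=1):
    """Serialize the speculative config without any whitespace."""
    spec = {
        "method": "mtp",
        "model": draft_path,
        "num_speculative_tokens": num_speculative_tokens,
    }
    # separators=(",", ":") drops the spaces json.dumps inserts by default.
    return json.dumps(spec, separators=(",", ":"))


print(compact_spec_config("/draft"))
# {"method":"mtp","model":"/draft","num_speculative_tokens":1}
```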
## Benchmarks (2× RTX PRO 6000 Blackwell, SM120, TP=2)
Measured on the NVFP4 base + this draft, K=1, vs. NVFP4 with speculation off.
`per_req` = decode tok/s a single user feels (prefill excluded). Acceptance ≈ **0.80** in production traffic.
**Single-stream decode (short context):**
| workload | base (tok/s) | + MTP K=1 (tok/s) | speedup | accept |
|---|---|---|---|---|
| free-form | 106.8 | **125.5** | +17.5% | 0.77 |
| code | 106.7 | **133.7** | +25.3% | 0.88 |
| Japanese | 107.0 | **129.3** | +20.9% | 0.80 |
| tool-call | 106.9 | **135.4** | +26.6% | 0.90 |
**Decode speedup tends to grow with context length**, most clearly at low concurrency (longer KV → base is more
memory-bound → bigger speculative win). `C` = number of concurrent requests:
| context | C=1 | C=2 | C=4 | C=8 |
|---|---|---|---|---|
| 1K | +20% | +8% | +32% | +34% |
| 8K | +22% | +24% | +25% | **+44%** |
| 32K | +22% | +26% | +20% | +17% |
| **128K** | **+28%** | **+33%** | **+38%** | — |
Net-positive across the whole concurrency range we tested (the MoE model stays memory-bound
up to high batch sizes). Best `K`: **K=1**. K=2 and K=3 lose to draft cost: later positions
have lower acceptance and add forward cost. NaN-free on SM120 (Gate0 5/5).
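
The K trade-off can be illustrated with a toy cost model. This is a sketch under stated assumptions, not a fit to the measurements: only the first-position acceptance (0.80) comes from this card, while the later-position acceptances and the relative per-draft cost `DRAFT_COST` are hypothetical values chosen for illustration.

```python
def expected_tokens_per_step(accept_by_position):
    """Expected tokens emitted per target step.

    The target always contributes one token; draft position i adds a token
    only if it and every earlier drafted position were accepted.
    """
    expected, survive = 1.0, 1.0
    for accept in accept_by_position:
        survive *= accept
        expected += survive
    return expected


def modeled_speedup(accept_by_position, draft_cost):
    """Tokens per step divided by step cost, relative to plain decoding (cost 1)."""
    k = len(accept_by_position)
    return expected_tokens_per_step(accept_by_position) / (1.0 + k * draft_cost)


ACCEPT = [0.80, 0.60, 0.45]  # position 1 from this card; positions 2-3 hypothetical
DRAFT_COST = 0.5             # hypothetical extra cost per drafted token, relative to one step

speedups = {k: modeled_speedup(ACCEPT[:k], DRAFT_COST) for k in (1, 2, 3)}
best_k = max(speedups, key=speedups.get)
for k, s in speedups.items():
    print(f"K={k}: modeled speedup {s:.2f}x")
print("best K:", best_k)
```

Under these assumed numbers each extra draft position adds less than one expected token while adding a fixed cost, so the modeled speedup falls as K grows.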
## How it was built (reproducible)
The draft is **not retrained** — it's the original StepFun MTP layers, extracted verbatim:
1. From `stepfun-ai/Step-3.7-Flash` (BF16), take 52 tensors: the 51 tensors of
`model.layers.{45,46,47}.*` (the 3 nextn layers, dense-MLP, 17 tensors each)
plus `model.embed_tokens.weight`. They all live in one shard
(`model-00024.safetensors`).
2. Keep the **original BF16 weight names** — vLLM's `Step3p5MTP` loader does its own
renaming (`.transformer.` strip, `shared_head.output→head`, `.mtp_block.` insert).
3. `config.json` = the **BF16 original** config (NOT the NVFP4 one): its per-layer
arrays (`layer_types`, `partial_rotary_factors`, …) are length 48 and cover the
MTP layer indices 45-47. **Strip `quantization_config`** so the draft loads as BF16.
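
The three steps above can be sketched as follows. This is an illustrative outline, not the repository's `build_draft.py`: the helper names are hypothetical, and `extract_draft` assumes the `safetensors` and `torch` packages are installed.

```python
import re

MTP_LAYER_IDS = (45, 46, 47)
_LAYER_RE = re.compile(r"^model\.layers\.(\d+)\.")


def is_draft_tensor(name):
    """True for the shared embedding and for any tensor in MTP layers 45-47."""
    if name == "model.embed_tokens.weight":
        return True
    match = _LAYER_RE.match(name)
    return bool(match) and int(match.group(1)) in MTP_LAYER_IDS


def make_draft_config(bf16_config):
    """Copy the BF16 config and drop quantization_config so the draft loads as BF16."""
    config = dict(bf16_config)
    config.pop("quantization_config", None)
    return config


def extract_draft(shard_path, out_path):
    """Copy the selected tensors from one shard, keeping the original names."""
    from safetensors import safe_open          # imported lazily: optional dependency
    from safetensors.torch import save_file

    kept = {}
    with safe_open(shard_path, framework="pt") as shard:
        for name in shard.keys():
            if is_draft_tensor(name):
                kept[name] = shard.get_tensor(name)
    save_file(kept, out_path, metadata={"format": "pt"})
    return len(kept)                           # the card reports 52 tensors


# Example call (paths are placeholders):
# extract_draft("model-00024.safetensors", "draft/model.safetensors")
```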
Full scripts + benchmark harness: **[GitHub repo](#)** (`build_draft.py`,
`launch_mtp.sh`, `eval_mtp.py`, `bench_matrix.py`).
## License & attribution
Apache-2.0, inherited from the base model **`stepfun-ai/Step-3.7-Flash`**. These are
StepFun's weights, redistributed unchanged (only re-sharded/re-packaged as a draft).
All credit for the model and the MTP layers goes to StepFun.