	---
	license: apache-2.0
	base_model:
	- stepfun-ai/Step-3.7-Flash
	- stepfun-ai/Step-3.7-Flash-NVFP4
	tags:
	- speculative-decoding
	- mtp
	- multi-token-prediction
	- vllm
	- nvfp4
	- step3
	language:
	- en
	- zh
	- ja
	library_name: vllm
	pipeline_tag: text-generation
	---

	# Step-3.7-Flash MTP draft (for the NVFP4 checkpoint)

	A tiny Multi-Token-Prediction (MTP / nextn) draft for `stepfun-ai/Step-3.7-Flash-NVFP4`, so you can run
	speculative decoding on the NVFP4 checkpoint in vLLM.

	> Why this exists: the official `Step-3.7-Flash-NVFP4` checkpoint declares
	> `num_nextn_predict_layers: 3` in its config but ships zero MTP weights. The
	> 3 nextn layers were dropped during quantization, and the per-layer config arrays
	> were truncated to 45 entries, so indexing them for the MTP layers would raise an
	> `IndexError` even if the weights were present. The BF16 and FP8 releases keep
	> the MTP weights, but **the NVFP4 one (the smallest and the SM120-friendly one)
	> cannot do speculative decoding out of the box.** This repo is the missing
	> piece: the 3 MTP layers extracted from the BF16 release, kept in BF16 (they're
	> tiny), packaged as a vLLM-loadable draft.
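
One way to confirm the gap described above is to compare what the config declares with the tensor names a checkpoint actually ships. The sketch below assumes the usual Hugging Face sharded layout (a `model.safetensors.index.json` containing a `weight_map`) and that the MTP layers sit at indices 45 to 47, as in the BF16 release. The helper names are hypothetical.

```python
import json
import re
from pathlib import Path


def count_layer_tensors(weight_map, layer_indices):
    """Count tensor names per layer index in a safetensors index weight_map."""
    counts = {i: 0 for i in layer_indices}
    pattern = re.compile(r"^model\.layers\.(\d+)\.")
    for name in weight_map:
        m = pattern.match(name)
        if m and int(m.group(1)) in counts:
            counts[int(m.group(1))] += 1
    return counts


def report(checkpoint_dir, mtp_layers=(45, 46, 47)):
    """Print declared vs. shipped MTP layers for a local checkpoint directory."""
    root = Path(checkpoint_dir)
    config = json.loads((root / "config.json").read_text())
    # Assumption: the checkpoint is sharded and has the standard index file.
    index = json.loads((root / "model.safetensors.index.json").read_text())
    declared = config.get("num_nextn_predict_layers", 0)
    shipped = count_layer_tensors(index["weight_map"], mtp_layers)
    print(f"declared nextn layers: {declared}")
    for layer, n in shipped.items():
        print(f"layer {layer}: {n} tensors")
```

Calling `report("/path/to/Step-3.7-Flash-NVFP4")` on a checkpoint with the problem described above should show a nonzero declared count next to zero tensors for each MTP layer.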

	- ~5.9 GB, BF16. Base = NVFP4 (mixed precision is fine; the draft is small).
	- Lossless in the speculative sense: vLLM's rejection sampling provably matches
	the target distribution; at `temperature=0` it follows the target's greedy tokens.
	- Drop-in: point vLLM's `--speculative-config` at this directory.
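
The lossless claim rests on the standard speculative-sampling accept/reject rule. The sketch below is not vLLM's code. For a single drafted token it computes the exact output distribution of that rule, so you can check that it equals the target distribution `p` whatever the draft distribution `q` is. The function name and the toy distributions are made up for illustration.

```python
def speculative_output_distribution(p, q):
    """Exact output distribution of one speculative step with one draft token.

    p: target probabilities, q: draft probabilities (same vocabulary order).
    """
    # The draft proposes token x with probability q[x]; it is accepted
    # with probability min(1, p[x] / q[x]).
    accept = [min(1.0, pi / qi) if qi > 0 else 0.0 for pi, qi in zip(p, q)]
    p_accept = [qi * a for qi, a in zip(q, accept)]  # equals min(p, q)
    reject_mass = 1.0 - sum(p_accept)
    # On rejection, resample from the residual max(p - q, 0), renormalized.
    residual = [max(pi - qi, 0.0) for pi, qi in zip(p, q)]
    z = sum(residual)
    out = []
    for pa, r in zip(p_accept, residual):
        out.append(pa + (reject_mass * r / z if z > 0 else 0.0))
    return out


# Sampling case: the result matches the target distribution p.
print(speculative_output_distribution([0.5, 0.3, 0.2], [0.2, 0.2, 0.6]))
# Greedy case (temperature=0): p is one-hot, so the output is the target's token.
print(speculative_output_distribution([0.0, 1.0, 0.0], [0.7, 0.2, 0.1]))
```

A poor draft only lowers the acceptance rate, and with it the speedup. It does not change what is sampled.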

	## Usage (vLLM `stepfun37` image, or any vLLM build that includes `Step3p5MTP`)

	The draft is auto-routed to vLLM's native `Step3p5MTP` + `Step3p5MTPProposer`
	because its config is `model_type: step3p7` with `num_nextn_predict_layers > 0`.

	```bash
	docker run -d --gpus all --ipc=host --shm-size=64g --network host \
	-v /path/to/Step-3.7-Flash-NVFP4:/model:ro \
	-v /path/to/Step-3.7-Flash-MTP-draft:/draft:ro \
	vllm/vllm-openai:stepfun37 \
	/model \
	--served-model-name step3p7 --port 8000 \
	--trust-remote-code --tensor-parallel-size 2 --enable-expert-parallel \
	--quantization modelopt --kv-cache-dtype fp8 \
	--max-model-len 262144 --gpu-memory-utilization 0.92 --async-scheduling \
	--speculative-config '{"method":"mtp","model":"/draft","num_speculative_tokens":1}'
	```

	Write the JSON for `--speculative-config` without spaces and keep it single-quoted,
	so the shell passes it as a single argument and does not brace-expand it.
	`num_speculative_tokens: 1` (K=1) is the sweet spot; see the benchmarks below.
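
If you generate the launch command from a script, a small sketch like the following produces the space-free JSON and a shell-safe quoted form. It uses only the Python standard library, and the dictionary simply mirrors the values from the command above.

```python
import json
import shlex

spec = {"method": "mtp", "model": "/draft", "num_speculative_tokens": 1}

# separators=(",", ":") removes the spaces json.dumps inserts by default.
compact = json.dumps(spec, separators=(",", ":"))

# shlex.quote wraps the string in single quotes for POSIX shells.
arg = shlex.quote(compact)

print("--speculative-config", arg)
```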

	## Benchmarks (2× RTX PRO 6000 Blackwell, SM120, TP=2)

	Measured on the NVFP4 base + this draft at K=1, compared with the same NVFP4 base with
	speculation off. Throughput figures are `per_req`: decode tok/s as seen by a single
	request, prefill excluded. Acceptance is about 0.80 in production traffic.

	Single-stream decode (short context):

	| workload | base (tok/s) | + MTP K=1 (tok/s) | speedup | accept |
	|---|---|---|---|---|
	| free-form | 106.8 | 125.5 | +17.5% | 0.77 |
	| code | 106.7 | 133.7 | +25.3% | 0.88 |
	| Japanese | 107.0 | 129.3 | +20.9% | 0.80 |
	| tool-call | 106.9 | 135.4 | +26.6% | 0.90 |
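
The speedup column is the ratio of the two throughput columns. The sketch below recomputes it from the rounded figures in the table, so the last digit can differ by 0.1 from the published value. It also shows the usual K=1 relation between acceptance and tokens emitted per verification step, assuming `accept` is the per-draft-token acceptance rate (the card does not define the column more precisely).

```python
# (base tok/s, MTP tok/s, acceptance) copied from the table above.
rows = {
    "free-form": (106.8, 125.5, 0.77),
    "code": (106.7, 133.7, 0.88),
    "Japanese": (107.0, 129.3, 0.80),
    "tool-call": (106.9, 135.4, 0.90),
}


def speedup_pct(base, mtp):
    """Relative decode speedup in percent."""
    return (mtp / base - 1.0) * 100.0


def tokens_per_step_k1(accept):
    """With one draft token, a step emits 1 token, plus 1 more if accepted."""
    return 1.0 + accept


for name, (base, mtp, accept) in rows.items():
    print(f"{name:10s} {speedup_pct(base, mtp):+.1f}%  "
          f"{tokens_per_step_k1(accept):.2f} tok/step")
```

The gap between tokens per step and the measured speedup reflects what speculation costs per step: the draft forward pass and the slightly larger verification.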

	Decode speedup tends to grow with context length, most consistently at C=1 and C=2
	(longer KV → base is more memory-bound → bigger speculative win). `C` is the
	concurrency level:

	| context | C=1 | C=2 | C=4 | C=8 |
	|---|---|---|---|---|
	| 1K | +20% | +8% | +32% | +34% |
	| 8K | +22% | +24% | +25% | +44% |
	| 32K | +22% | +26% | +20% | +17% |
	| 128K | +28% | +33% | +38% | n/a |

	Net-positive across the whole concurrency range we tested (MoE stays memory-bound
	up to high batch sizes). Best `K`: K=1. K=2 and K=3 lose to draft cost, because later
	positions have lower acceptance and each one adds forward cost. NaN-free on SM120
	(Gate0 5/5).
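
A simple cost model illustrates why a larger K can lose. This is a sketch under stated assumptions, not a measurement. The per-position acceptance values and the draft cost below are hypothetical, and the model ignores the extra cost of verifying more tokens per step.

```python
def expected_tokens(accept_by_position):
    """Expected tokens emitted per verification step.

    accept_by_position[k] is the chance that draft position k is accepted,
    given that all earlier positions were accepted.
    """
    total, survive = 1.0, 1.0  # the target always contributes one token
    for a in accept_by_position:
        survive *= a
        total += survive
    return total


def relative_throughput(accept_by_position, draft_cost):
    """Tokens per unit time relative to decoding without speculation.

    draft_cost is the cost of one draft forward pass as a fraction of one
    target step (hypothetical value, not measured here).
    """
    k = len(accept_by_position)
    return expected_tokens(accept_by_position) / (1.0 + k * draft_cost)


# Hypothetical numbers, for illustration only.
accept = [0.80, 0.55, 0.35]
for k in (1, 2, 3):
    print(k, round(relative_throughput(accept[:k], draft_cost=0.45), 3))
```

With falling acceptance, each extra position adds less than one expected token but a full draft forward pass, so the ratio can peak at K=1.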

	## How it was built (reproducible)

	The draft is not retrained — it's the original StepFun MTP layers, extracted verbatim:

	1. From `stepfun-ai/Step-3.7-Flash` (BF16), take 52 tensors: the 51 tensors of
	`model.layers.{45,46,47}.*` (the 3 nextn layers, dense-MLP, 17 tensors each)
	plus `model.embed_tokens.weight`. They all live in one shard
	(`model-00024.safetensors`).
	2. Keep the original BF16 weight names — vLLM's `Step3p5MTP` loader does its own
	renaming (`.transformer.` strip, `shared_head.output→head`, `.mtp_block.` insert).
	3. `config.json` = the BF16 original config (NOT the NVFP4 one): its per-layer
	arrays (`layer_types`, `partial_rotary_factors`, …) are length 48 and cover the
	MTP layer indices 45-47. Strip `quantization_config` so the draft loads as BF16.
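
The three steps above can be sketched roughly as follows. This is not the repo's `build_draft.py`, only a minimal illustration that assumes the `safetensors` and `torch` packages are installed. The function names and the output file name are hypothetical.

```python
import copy
import re

MTP_KEY = re.compile(r"^model\.layers\.(45|46|47)\.")
EMBED_KEY = "model.embed_tokens.weight"


def is_draft_tensor(name):
    """Step 1: select the nextn layers plus the token embedding."""
    return name == EMBED_KEY or MTP_KEY.match(name) is not None


def make_draft_config(bf16_config, n_layers_total=48):
    """Step 3: start from the BF16 config and drop quantization_config."""
    cfg = copy.deepcopy(bf16_config)
    cfg.pop("quantization_config", None)
    # The per-layer arrays must cover the MTP indices 45 to 47.
    for key in ("layer_types", "partial_rotary_factors"):
        if key in cfg and len(cfg[key]) != n_layers_total:
            raise ValueError(
                f"{key} has {len(cfg[key])} entries, expected {n_layers_total}"
            )
    return cfg


def extract_draft(shard_path, out_path):
    """Steps 1 and 2: copy the selected tensors with their names unchanged."""
    # Imported lazily so the helpers above work without these packages.
    from safetensors import safe_open
    from safetensors.torch import save_file

    tensors = {}
    with safe_open(shard_path, framework="pt") as f:
        for name in f.keys():
            if is_draft_tensor(name):
                tensors[name] = f.get_tensor(name)
    save_file(tensors, out_path, metadata={"format": "pt"})
    return sorted(tensors)
```

For example, `extract_draft("model-00024.safetensors", "model.safetensors")` should return 52 names if the shard matches the description in step 1.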

	Full scripts + benchmark harness: [GitHub repo](#) (`build_draft.py`,
	`launch_mtp.sh`, `eval_mtp.py`, `bench_matrix.py`).

	## License & attribution

	Apache-2.0, inherited from the base model `stepfun-ai/Step-3.7-Flash`. These are
	StepFun's weights, redistributed unchanged (only re-sharded/re-packaged as a draft).
	All credit for the model and the MTP layers goes to StepFun.