File size: 4,727 Bytes
6ecd459
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
---
license: apache-2.0
base_model:
  - stepfun-ai/Step-3.7-Flash
  - stepfun-ai/Step-3.7-Flash-NVFP4
tags:
  - speculative-decoding
  - mtp
  - multi-token-prediction
  - vllm
  - nvfp4
  - step3
language:
  - en
  - zh
  - ja
library_name: vllm
pipeline_tag: text-generation
---

# Step-3.7-Flash MTP draft (for the NVFP4 checkpoint)

A tiny **Multi-Token-Prediction (MTP / nextn) draft** for **`stepfun-ai/Step-3.7-Flash-NVFP4`**, so you can run
**speculative decoding** on the NVFP4 checkpoint in vLLM.

> **Why this exists:** the official `Step-3.7-Flash-NVFP4` checkpoint **declares**
> `num_nextn_predict_layers: 3` in its config but **ships zero MTP weights** — the
> 3 nextn layers were dropped during quantization, and the per-layer config arrays
> were truncated to 45 (so even loading them would `IndexError`). The BF16 and FP8
> releases keep the MTP weights, but **the NVFP4 one — the SM120-friendly, smallest
> one — cannot do speculative decoding out of the box.** This repo is the missing
> piece: the 3 MTP layers extracted from the BF16 release, kept in BF16 (they're
> tiny), packaged as a vLLM-loadable draft.

- **~5.9 GB**, BF16. Base = NVFP4 (mixed precision is fine; the draft is small).
- **Lossless** in the speculative sense: vLLM's rejection sampling provably matches
  the target distribution; at `temperature=0` it follows the target's greedy tokens.
- Drop-in: point vLLM's `--speculative-config` at this directory.

## Usage (vLLM, stepfun37 image / vLLM ≥ the build with `Step3p5MTP`)

The draft is auto-routed to vLLM's native `Step3p5MTP` + `Step3p5MTPProposer`
because its config is `model_type: step3p7` with `num_nextn_predict_layers > 0`.

```bash
docker run -d --gpus all --ipc=host --shm-size=64g --network host \
  -v /path/to/Step-3.7-Flash-NVFP4:/model:ro \
  -v /path/to/Step-3.7-Flash-MTP-draft:/draft:ro \
  vllm/vllm-openai:stepfun37 \
  /model \
    --served-model-name step3p7 --port 8000 \
    --trust-remote-code --tensor-parallel-size 2 --enable-expert-parallel \
    --quantization modelopt --kv-cache-dtype fp8 \
    --max-model-len 262144 --gpu-memory-utilization 0.92 --async-scheduling \
    --speculative-config '{"method":"mtp","model":"/draft","num_speculative_tokens":1}'
```

JSON for `--speculative-config` must have **no spaces** (brace-expansion safety).
**`num_speculative_tokens: 1` (K=1) is the sweet spot** — see below.

## Benchmarks (2× RTX PRO 6000 Blackwell, SM120, TP=2)

Measured on the NVFP4 base + this draft, K=1, vs. NVFP4 with speculation off.
`per_req` = decode tok/s a single user feels (prefill excluded). Acceptance ≈ **0.80** in production traffic.

**Single-stream decode (short context):**

| workload | base | + MTP K=1 | speedup | accept |
|---|---|---|---|---|
| free-form | 106.8 | **125.5** | +17.5% | 0.77 |
| code | 106.7 | **133.7** | +25.3% | 0.88 |
| Japanese | 107.0 | **129.3** | +20.9% | 0.80 |
| tool-call | 106.9 | **135.4** | +26.6% | 0.90 |

**Decode speedup grows with context length** (longer KV → base is more
memory-bound → bigger speculative win):

| context | C=1 | C=2 | C=4 | C=8 |
|---|---|---|---|---|
| 1K | +20% | +8% | +32% | +34% |
| 8K | +22% | +24% | +25% | **+44%** |
| 32K | +22% | +26% | +20% | +17% |
| **128K** | **+28%** | **+33%** | **+38%** | — |

Net-positive across the whole concurrency range we tested (MoE stays memory-bound
to high batch). Best `K`: **K=1** (K=2/K=3 lose to draft cost — later positions
have lower acceptance and add forward cost). NaN-free on SM120 (Gate0 5/5).

## How it was built (reproducible)

The draft is **not retrained** — it's the original StepFun MTP layers, extracted verbatim:

1. From `stepfun-ai/Step-3.7-Flash` (BF16), take the 52 tensors of
   `model.layers.{45,46,47}.*` (the 3 nextn layers, dense-MLP, 17 tensors each)
   plus `model.embed_tokens.weight`. They all live in one shard
   (`model-00024.safetensors`).
2. Keep the **original BF16 weight names** — vLLM's `Step3p5MTP` loader does its own
   renaming (`.transformer.` strip, `shared_head.output→head`, `.mtp_block.` insert).
3. `config.json` = the **BF16 original** config (NOT the NVFP4 one): its per-layer
   arrays (`layer_types`, `partial_rotary_factors`, …) are length 48 and cover the
   MTP layer indices 45-47. **Strip `quantization_config`** so the draft loads as BF16.

Full scripts + benchmark harness: **[GitHub repo](#)** (`build_draft.py`,
`launch_mtp.sh`, `eval_mtp.py`, `bench_matrix.py`).

## License & attribution

Apache-2.0, inherited from the base model **`stepfun-ai/Step-3.7-Flash`**. These are
StepFun's weights, redistributed unchanged (only re-sharded/re-packaged as a draft).
All credit for the model and the MTP layers goes to StepFun.