# Qwen3-Next-80B FastMTP Speculator
FastMTP speculator head extracted from Qwen/Qwen3-Next-80B-A3B-Instruct, packaged in the Speculators format for plug-and-play deployment in vLLM.
FastMTP (Multi-Token Prediction) applies a single shared transformer layer recursively to predict multiple future tokens; see the FastMTP paper (arXiv:2509.18362) for details.
## Model Details
| Property | Value |
|---|---|
| Base model | Qwen/Qwen3-Next-80B-A3B-Instruct |
| Algorithm | FastMTP (Multi-Token Prediction) |
| Speculative tokens | 3 |
| MTP layer hidden size | 2048 |
| MTP layer type | Full-attention + Sparse MoE (512 experts) |
| Vocab size | 151,936 |
| Dtype | bfloat16 |
| Speculators version | 0.4.0 |
## Architecture
FastMTP uses a single shared transformer layer (with the same architecture as the last decoder block of Qwen3-Next) applied recursively. At step k:

- The verifier hidden state from the previous step (or the base model's last hidden state for step 0) is layer-normed and projected together with the ground-truth token embedding via `input_proj`.
- The projected representation passes through the shared attention + MoE MLP block.
- The output is passed to the shared `lm_head` to produce logits for position t+k+2.
The checkpoint is self-contained: `embed_tokens` and `lm_head` weights are included, so the speculator loads without requiring the full base model.
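The recursive draft loop above can be sketched in miniature. This is an illustrative toy only: the dimensions, the random weights, and the names `mtp_block`, `W_proj`, and `draft` are hypothetical stand-ins, not the real Qwen3-Next modules.

```python
import numpy as np

rng = np.random.default_rng(0)
HIDDEN, VOCAB, K = 8, 16, 3  # toy sizes; the real model uses 2048 / 151,936

embed_tokens = rng.standard_normal((VOCAB, HIDDEN)) * 0.1
lm_head = rng.standard_normal((HIDDEN, VOCAB)) * 0.1
W_proj = rng.standard_normal((2 * HIDDEN, HIDDEN)) * 0.1  # stands in for input_proj
W_block = rng.standard_normal((HIDDEN, HIDDEN)) * 0.1     # stands in for the shared block

def layer_norm(x):
    return (x - x.mean()) / (x.std() + 1e-6)

def mtp_block(x):
    # placeholder for the shared full-attention + sparse-MoE layer
    return np.tanh(x @ W_block)

def draft(last_hidden, last_token):
    """Recursively predict K draft tokens from the verifier's last hidden state."""
    h, tok, drafted = last_hidden, last_token, []
    for _ in range(K):
        # layer-normed hidden state is projected together with the token embedding
        x = np.concatenate([layer_norm(h), embed_tokens[tok]]) @ W_proj
        h = mtp_block(x)                   # the single shared layer, applied recursively
        tok = int(np.argmax(h @ lm_head))  # the shared lm_head produces the logits
        drafted.append(tok)
    return drafted

tokens = draft(rng.standard_normal(HIDDEN), last_token=1)
print(tokens)  # a list of three draft token ids
```

The key point the sketch shows is that the same `mtp_block` weights are reused at every step; only the hidden state and the previously drafted token change between iterations.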
## Usage
### Load with Speculators
```python
from speculators import FastMTPSpeculator

model = FastMTPSpeculator.from_pretrained(
    "inference-optimization/Qwen3-Next-80B-A3B-Instruct_mtp_speculator"
)
```
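At serving time, the 3 drafted tokens are checked by the verifier in a single forward pass. The acceptance rule can be sketched generically as follows (this is a simplified greedy-verification illustration, not vLLM's actual implementation):

```python
def accept_prefix(drafted, verifier_tokens):
    """Accept drafted tokens until the first mismatch with the verifier's
    own predictions, then substitute the verifier's token at that position."""
    accepted = []
    for d, v in zip(drafted, verifier_tokens):
        if d != v:
            accepted.append(v)  # correct the first wrong draft and stop
            return accepted
        accepted.append(d)
    return accepted

print(accept_prefix([5, 9, 2], [5, 9, 2]))  # -> [5, 9, 2]: all drafts accepted
print(accept_prefix([5, 9, 2], [5, 7, 2]))  # -> [5, 7]: one accepted plus the correction
```

Each verifier pass therefore emits between 1 and K+1 tokens, which is where the speedup comes from when the draft head agrees with the verifier often.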
### Convert your own checkpoint
To extract the MTP head from a Qwen3-Next checkpoint yourself:
```bash
pip install speculators

speculators convert Qwen/Qwen3-Next-80B-A3B-Instruct \
    --algorithm mtp \
    --verifier Qwen/Qwen3-Next-80B-A3B-Instruct \
    --output-path ./qwen3_next_mtp_speculators
```
Or use the example script from the `speculators` repository.
## Paper
Cai et al., FastMTP: Accelerating LLM Inference with Enhanced Multi-Token Prediction, arXiv:2509.18362, 2025.
```bibtex
@article{cai2025fastmtp,
  title={FastMTP: Accelerating LLM Inference with Enhanced Multi-Token Prediction},
  author={Cai, Yuxuan and Liang, Xiaozhuan and Wang, Xinghua and Ma, Jin and
          Liang, Haijin and Luo, Jinwen and Zuo, Xinyu and Duan, Lisheng and
          Yin, Yuyang and Chen, Xi},
  journal={arXiv preprint arXiv:2509.18362},
  year={2025}
}
```