# Qwen3-Next-80B FastMTP Speculator

A FastMTP speculator head extracted from Qwen/Qwen3-Next-80B-A3B-Instruct, packaged in the Speculators format for plug-and-play deployment in vLLM.

FastMTP (Multi-Token Prediction) applies a single shared transformer layer recursively to predict multiple future tokens; see the FastMTP paper (arXiv:2509.18362) for details.

## Model Details

| Property | Value |
|---|---|
| Base model | Qwen/Qwen3-Next-80B-A3B-Instruct |
| Algorithm | FastMTP (Multi-Token Prediction) |
| Speculative tokens | 3 |
| MTP layer hidden size | 2048 |
| MTP layer type | Full-attention + Sparse MoE (512 experts) |
| Vocab size | 151,936 |
| Dtype | bfloat16 |
| Speculators version | 0.4.0 |

## Architecture

FastMTP uses a single shared transformer layer (with the same architecture as the last decoder block of Qwen3-Next) applied recursively. At step k:

  1. The verifier hidden state from the previous step (or the base model's last hidden state for step 0) is layer-normed and projected together with the ground-truth token embedding via input_proj.
  2. The projected representation passes through the shared attention + MoE MLP block.
  3. The output is passed to the shared lm_head to produce logits for position t+k+2.

The checkpoint is self-contained: embed_tokens and lm_head weights are included so the speculator loads without requiring the full base model.
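The recursive loop above can be sketched in plain Python. This is a toy illustration, not the Speculators implementation: the shared attention + MoE block is replaced by a single random matrix, the dimensions are shrunk, and all weight names (`embed_tokens`, `input_proj`, `shared_block`, `lm_head`) merely mirror the description above. Note that at inference time the previously drafted token stands in for the ground-truth token used during training.

```python
import math
import random

random.seed(0)
H, V = 8, 32   # toy hidden size / vocab (real model: 2048 / 151,936)

def rand_matrix(rows, cols):
    return [[random.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

def matvec(v, m):
    # v (len n) times m (n x k) -> len k
    return [sum(v[i] * m[i][j] for i in range(len(v))) for j in range(len(m[0]))]

def rms_norm(v, eps=1e-6):
    scale = math.sqrt(sum(x * x for x in v) / len(v) + eps)
    return [x / scale for x in v]

# Hypothetical stand-ins for the shared FastMTP weights
embed_tokens = rand_matrix(V, H)      # token id -> embedding row
input_proj = rand_matrix(2 * H, H)    # concat(hidden, embedding) -> H
shared_block = rand_matrix(H, H)      # toy stand-in for the attn + MoE layer
lm_head = rand_matrix(H, V)           # shared output head

def speculate(last_hidden, last_token, num_speculative_tokens=3):
    """Recursively apply the single shared MTP layer to draft future tokens."""
    drafts = []
    h, tok = last_hidden, last_token
    for _ in range(num_speculative_tokens):
        x = rms_norm(h) + embed_tokens[tok]       # step 1: norm, concat embedding
        h = [math.tanh(y) for y in matvec(matvec(x, input_proj), shared_block)]  # step 2
        logits = matvec(h, lm_head)               # step 3: shared lm_head
        tok = max(range(V), key=lambda j: logits[j])  # greedy draft token
        drafts.append(tok)
    return drafts

drafts = speculate([random.gauss(0, 1) for _ in range(H)], last_token=5)
```

In the real model the verifier then checks the drafted tokens in a single forward pass and accepts the longest matching prefix.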

## Usage

### Load with Speculators

```python
from speculators import FastMTPSpeculator

model = FastMTPSpeculator.from_pretrained(
    "inference-optimization/Qwen3-Next-80B-A3B-Instruct_mtp_speculator"
)
```
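For the vLLM deployment mentioned above, recent vLLM versions accept a speculative decoding configuration via `--speculative-config`. The exact JSON keys vary across vLLM versions, so treat this as a sketch and check the docs for your installed version:

```shell
pip install vllm
vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct \
    --speculative-config '{"model": "inference-optimization/Qwen3-Next-80B-A3B-Instruct_mtp_speculator", "num_speculative_tokens": 3}'
```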

### Convert your own checkpoint

To extract the MTP head from a Qwen3-Next checkpoint yourself:

```shell
pip install speculators
speculators convert Qwen/Qwen3-Next-80B-A3B-Instruct \
    --algorithm mtp \
    --verifier Qwen/Qwen3-Next-80B-A3B-Instruct \
    --output-path ./qwen3_next_mtp_speculators
```

Or use the example script from the speculators repository.

## Paper

Cai et al., FastMTP: Accelerating LLM Inference with Enhanced Multi-Token Prediction, arXiv:2509.18362, 2025.

```bibtex
@article{cai2025fastmtp,
  title={FastMTP: Accelerating LLM Inference with Enhanced Multi-Token Prediction},
  author={Cai, Yuxuan and Liang, Xiaozhuan and Wang, Xinghua and Ma, Jin and
          Liang, Haijin and Luo, Jinwen and Zuo, Xinyu and Duan, Lisheng and
          Yin, Yuyang and Chen, Xi},
  journal={arXiv preprint arXiv:2509.18362},
  year={2025}
}
```