# Qwen3-Next-80B FastMTP Speculator

A FastMTP speculator head extracted from Qwen/Qwen3-Next-80B-A3B-Instruct, packaged in the Speculators format for plug-and-play deployment in vLLM.

FastMTP (Multi-Token Prediction) applies a single shared transformer layer recursively to predict multiple future tokens; see the FastMTP paper (arXiv:2509.18362) for details.

## Model Details

| Property | Value |
|---|---|
| Base model | Qwen/Qwen3-Next-80B-A3B-Instruct |
| Algorithm | FastMTP (Multi-Token Prediction) |
| Speculative tokens | 3 |
| MTP layer hidden size | 2048 |
| MTP layer type | Full-attention + Sparse MoE (512 experts) |
| Vocab size | 151,936 |
| Dtype | bfloat16 |
| Speculators version | 0.4.0 |

## Architecture

FastMTP uses a single shared transformer layer (with the same architecture as the last decoder block of Qwen3-Next) applied recursively. At step k:

  1. The verifier hidden state from the previous step (or the base model's last hidden state for step 0) is layer-normed and projected together with the ground-truth token embedding via input_proj.
  2. The projected representation passes through the shared attention + MoE MLP block.
  3. The output is passed to the shared lm_head to produce logits for position t+k+2.

The checkpoint is self-contained: embed_tokens and lm_head weights are included so the speculator loads without requiring the full base model.
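The recursive loop above can be sketched in plain Python. This is a toy illustration, not the Speculators implementation: the shared attention + MoE block is replaced by a single random matrix, the dimensions are shrunk, and all weight names (`embed_tokens`, `input_proj`, `shared_block`, `lm_head`) merely mirror the description above. Note that at inference time the previously drafted token stands in for the ground-truth token used during training.

```python
import math
import random

random.seed(0)
H, V = 8, 32   # toy hidden size / vocab (real model: 2048 / 151,936)

def rand_matrix(rows, cols):
    return [[random.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

def matvec(v, m):
    # v (len n) times m (n x k) -> len k
    return [sum(v[i] * m[i][j] for i in range(len(v))) for j in range(len(m[0]))]

def rms_norm(v, eps=1e-6):
    scale = math.sqrt(sum(x * x for x in v) / len(v) + eps)
    return [x / scale for x in v]

# Hypothetical stand-ins for the shared FastMTP weights
embed_tokens = rand_matrix(V, H)      # token id -> embedding row
input_proj = rand_matrix(2 * H, H)    # concat(hidden, embedding) -> H
shared_block = rand_matrix(H, H)      # toy stand-in for the attn + MoE layer
lm_head = rand_matrix(H, V)           # shared output head

def speculate(last_hidden, last_token, num_speculative_tokens=3):
    """Recursively apply the single shared MTP layer to draft future tokens."""
    drafts = []
    h, tok = last_hidden, last_token
    for _ in range(num_speculative_tokens):
        x = rms_norm(h) + embed_tokens[tok]       # step 1: norm, concat embedding
        h = [math.tanh(y) for y in matvec(matvec(x, input_proj), shared_block)]  # step 2
        logits = matvec(h, lm_head)               # step 3: shared lm_head
        tok = max(range(V), key=lambda j: logits[j])  # greedy draft token
        drafts.append(tok)
    return drafts

drafts = speculate([random.gauss(0, 1) for _ in range(H)], last_token=5)
```

In the real model the verifier then checks the drafted tokens in a single forward pass and accepts the longest matching prefix.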

## Usage

### Load with Speculators

```python
from speculators import FastMTPSpeculator

model = FastMTPSpeculator.from_pretrained(
    "inference-optimization/Qwen3-Next-80B-A3B-Instruct_mtp_speculator"
)
```
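For the vLLM deployment mentioned above, recent vLLM versions accept a speculative decoding configuration via `--speculative-config`. The exact JSON keys vary across vLLM versions, so treat this as a sketch and check the docs for your installed version:

```shell
pip install vllm
vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct \
    --speculative-config '{"model": "inference-optimization/Qwen3-Next-80B-A3B-Instruct_mtp_speculator", "num_speculative_tokens": 3}'
```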

### Convert your own checkpoint

To extract the MTP head from a Qwen3-Next checkpoint yourself:

```shell
pip install speculators
speculators convert Qwen/Qwen3-Next-80B-A3B-Instruct \
    --algorithm mtp \
    --verifier Qwen/Qwen3-Next-80B-A3B-Instruct \
    --output-path ./qwen3_next_mtp_speculators
```

Or use the example script from the speculators repository.

## Paper

Cai et al., FastMTP: Accelerating LLM Inference with Enhanced Multi-Token Prediction, arXiv:2509.18362, 2025.

```bibtex
@article{cai2025fastmtp,
  title={FastMTP: Accelerating LLM Inference with Enhanced Multi-Token Prediction},
  author={Cai, Yuxuan and Liang, Xiaozhuan and Wang, Xinghua and Ma, Jin and
          Liang, Haijin and Luo, Jinwen and Zuo, Xinyu and Duan, Lisheng and
          Yin, Yuyang and Chen, Xi},
  journal={arXiv preprint arXiv:2509.18362},
  year={2025}
}
```