Qwen3.6-27B MTP head (BF16, single file)

Just the multi-token-prediction (MTP) draft head from the official Qwen/Qwen3.6-27B checkpoint, packed into a single ~810 MB safetensors file.

This repo is only useful as a drafter for some other quantized Qwen3.6-27B checkpoint that is missing its MTP weights (e.g. z-lab/Qwen3.6-27B-PARO, which declares mtp_num_hidden_layers: 1 in its config but ships zero mtp.* tensors in its safetensors). With the MTP head re-attached, vLLM's speculative decoding works out of the box.
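
To check whether a given checkpoint is in this state (MTP layers declared in the config, no mtp.* tensors on disk), something like the following sketch may help. It assumes the checkpoint is sharded with a standard model.safetensors.index.json, and that mtp_num_hidden_layers sits either at the top level of config.json or under a text_config block; both are assumptions, and the function names are illustrative only.

```python
import json
from pathlib import Path


def declared_mtp_layers(config: dict) -> int:
    """Return mtp_num_hidden_layers from a parsed config.json, or 0 if absent.

    Assumption: the key is either top-level or nested under "text_config".
    """
    for scope in (config, config.get("text_config") or {}):
        if "mtp_num_hidden_layers" in scope:
            return int(scope["mtp_num_hidden_layers"])
    return 0


def shipped_mtp_tensors(model_dir: Path) -> list[str]:
    """List tensor names starting with "mtp." in the shard index."""
    index_path = model_dir / "model.safetensors.index.json"
    index = json.loads(index_path.read_text())
    return sorted(n for n in index["weight_map"] if n.startswith("mtp."))


def is_missing_mtp(model_dir: Path) -> bool:
    """True when the config declares MTP layers but no mtp.* tensor is indexed."""
    config = json.loads((model_dir / "config.json").read_text())
    return declared_mtp_layers(config) > 0 and not shipped_mtp_tensors(model_dir)
```

Calling is_missing_mtp(Path("./Qwen3.6-27B-PARO")) on a local download should return True for a checkpoint in the situation described above, provided the layout assumptions hold.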

Contents

File                        Purpose
mtp.safetensors             15 BF16 tensors, 811 MB, lifted verbatim from the base
SHA256SUMS                  Per-tensor full SHA256 for audit / authenticity check
extract_from_qwen_base.py   Reproducible extraction script (the file was made with this)

Provenance

Extracted from Qwen/Qwen3.6-27B/model-00013-of-00015.safetensors and model-00015-of-00015.safetensors (the two shards that contain the mtp.* weight map entries) on a local copy of the public Qwen base model.

To verify authenticity yourself:

sha256sum -c SHA256SUMS  # against your own download of Qwen/Qwen3.6-27B's mtp.* tensors
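
Since SHA256SUMS is described as per-tensor, a per-tensor check can also be done in plain Python by reading the safetensors header directly (an 8-byte little-endian header length, a JSON header, then the raw tensor bytes). The sketch below is not the script used for this repo. It assumes each manifest line has the form "<hex digest> <tensor name>" and that digests are taken over the raw tensor bytes; check the actual file before relying on either assumption.

```python
import hashlib
import json
import struct
from pathlib import Path


def tensor_sha256(path: Path) -> dict[str, str]:
    """Map each tensor name in a .safetensors file to the SHA256 of its raw bytes."""
    digests = {}
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))  # little-endian u64
        header = json.loads(f.read(header_len))
        data_start = 8 + header_len
        for name, info in header.items():
            if name == "__metadata__":  # optional metadata entry, not a tensor
                continue
            begin, end = info["data_offsets"]  # relative to data_start
            f.seek(data_start + begin)
            digests[name] = hashlib.sha256(f.read(end - begin)).hexdigest()
    return digests


def check_manifest(path: Path, manifest: Path) -> list[str]:
    """Return tensor names whose digest is missing or differs from the manifest.

    Assumed manifest line format (hypothetical): "<hex digest> <tensor name>".
    """
    actual = tensor_sha256(path)
    mismatched = []
    for line in manifest.read_text().splitlines():
        if not line.strip():
            continue
        digest, name = line.split()
        if actual.get(name) != digest:
            mismatched.append(name)
    return mismatched
```

To audit against the base model, run tensor_sha256 on your own copies of the two base shards named above, keep only the mtp.* entries, and compare them with the digests computed from mtp.safetensors.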

We additionally verified by SHA256 that these BF16 tensors are bit-identical to the BF16 MTP tensors shipped in three independently published quantized checkpoints:

Source                                        Matches?
Qwen/Qwen3.6-27B-FP8 (BF16 norms portion)     bit-equal
Qwen3.6-27B-GPTQ-8bit (BF16 throughout)       bit-equal
Qwen3.6-27B-AWQ-BF16-INT4 (BF16 norms)        bit-equal

Linear projections in the FP8 source dequantize to within ~5e-4 mean absolute error of the BF16 originals here, which is consistent with pure FP8 round-trip noise. So this MTP head is canonical Qwen output, identical to what is already redistributed inside several existing quantized checkpoints.

Tensor manifest

mtp.fc.weight                                         shape=[5120, 10240]   bf16
mtp.norm.weight                                       shape=[5120]          bf16
mtp.pre_fc_norm_embedding.weight                      shape=[5120]          bf16
mtp.pre_fc_norm_hidden.weight                         shape=[5120]          bf16
mtp.layers.0.input_layernorm.weight                   shape=[5120]          bf16
mtp.layers.0.post_attention_layernorm.weight          shape=[5120]          bf16
mtp.layers.0.self_attn.q_norm.weight                  shape=[256]           bf16
mtp.layers.0.self_attn.k_norm.weight                  shape=[256]           bf16
mtp.layers.0.self_attn.q_proj.weight                  shape=[12288, 5120]   bf16
mtp.layers.0.self_attn.k_proj.weight                  shape=[1024, 5120]    bf16
mtp.layers.0.self_attn.v_proj.weight                  shape=[1024, 5120]    bf16
mtp.layers.0.self_attn.o_proj.weight                  shape=[5120, 6144]    bf16
mtp.layers.0.mlp.gate_proj.weight                     shape=[17408, 5120]   bf16
mtp.layers.0.mlp.up_proj.weight                       shape=[17408, 5120]   bf16
mtp.layers.0.mlp.down_proj.weight                     shape=[5120, 17408]   bf16
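
As a sanity check on the file size, the parameter count and BF16 payload can be recomputed from the shapes in the manifest above. This is plain arithmetic on the listed shapes (2 bytes per BF16 element) and ignores the small safetensors header.

```python
from math import prod

# Shapes copied from the tensor manifest above.
MTP_SHAPES = {
    "mtp.fc.weight": (5120, 10240),
    "mtp.norm.weight": (5120,),
    "mtp.pre_fc_norm_embedding.weight": (5120,),
    "mtp.pre_fc_norm_hidden.weight": (5120,),
    "mtp.layers.0.input_layernorm.weight": (5120,),
    "mtp.layers.0.post_attention_layernorm.weight": (5120,),
    "mtp.layers.0.self_attn.q_norm.weight": (256,),
    "mtp.layers.0.self_attn.k_norm.weight": (256,),
    "mtp.layers.0.self_attn.q_proj.weight": (12288, 5120),
    "mtp.layers.0.self_attn.k_proj.weight": (1024, 5120),
    "mtp.layers.0.self_attn.v_proj.weight": (1024, 5120),
    "mtp.layers.0.self_attn.o_proj.weight": (5120, 6144),
    "mtp.layers.0.mlp.gate_proj.weight": (17408, 5120),
    "mtp.layers.0.mlp.up_proj.weight": (17408, 5120),
    "mtp.layers.0.mlp.down_proj.weight": (5120, 17408),
}

BYTES_PER_BF16 = 2

n_tensors = len(MTP_SHAPES)
n_params = sum(prod(shape) for shape in MTP_SHAPES.values())
n_bytes = n_params * BYTES_PER_BF16

print(n_tensors)                  # 15
print(n_params)                   # 424699392
print(n_bytes)                    # 849398784
print(round(n_bytes / 2**20, 2))  # 810.05
```

The payload comes to about 849 million bytes, or about 810 when divided by 2^20, so the "~810 MB" figure quoted in this README appears to be in binary megabytes (MiB).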

Usage with inject_mtp.py

inject_mtp.py is a small CLI that wires this MTP head into a paroquant-quantized model directory without copying the LM weights (the output is a sharded safetensors layout in which the original LM safetensors files are symlinked):

# 1. Pull the upstream paroquant model
hf download z-lab/Qwen3.6-27B-PARO --local-dir Qwen3.6-27B-PARO

# 2. Pull this MTP head
hf download guru87/Qwen3.6-27B-MTP --local-dir Qwen3.6-27B-MTP

# 3. Inject
inject_mtp.py \
    --paro     ./Qwen3.6-27B-PARO \
    --mtp-from ./Qwen3.6-27B-MTP/mtp.safetensors \
    --output   ./Qwen3.6-27B-PARO-MTP

# 4. Verify
inject_mtp.py --verify ./Qwen3.6-27B-PARO-MTP
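
For readers who want to see what such an injection involves, here is a minimal sketch of the symlink-plus-index approach described above. It is not the source of inject_mtp.py and may differ from it in every detail. It assumes the input directory is sharded with a model.safetensors.index.json, that the output directory does not exist yet, and that adding the head as an extra shard named mtp.safetensors is acceptable to the loader.

```python
import json
import os
import shutil
import struct
from pathlib import Path

INDEX_NAME = "model.safetensors.index.json"


def safetensors_names(path: Path) -> list[str]:
    """Read tensor names from a .safetensors header without loading any data."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))  # little-endian u64
        header = json.loads(f.read(header_len))
    return [name for name in header if name != "__metadata__"]


def inject(paro_dir: Path, mtp_file: Path, out_dir: Path) -> None:
    """Build out_dir with symlinked LM shards plus the MTP head as an extra shard."""
    out_dir.mkdir(parents=True)
    index = json.loads((paro_dir / INDEX_NAME).read_text())

    for src in paro_dir.iterdir():
        dst = out_dir / src.name
        if src.suffix == ".safetensors":
            os.symlink(src.resolve(), dst)  # link the LM shards, do not copy them
        elif src.is_file() and src.name != INDEX_NAME:
            shutil.copy2(src, dst)  # config, tokenizer files, and so on

    # The head is small, so it is copied rather than linked.
    shutil.copy2(mtp_file, out_dir / "mtp.safetensors")
    for name in safetensors_names(mtp_file):
        index["weight_map"][name] = "mtp.safetensors"

    (out_dir / INDEX_NAME).write_text(json.dumps(index, indent=2))
```

Because only the index and the small files are rewritten, the output directory costs roughly the size of the MTP head on disk, while the original model directory must stay in place for the symlinks to resolve.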

Usage with vLLM (after injection)

vllm serve ./Qwen3.6-27B-PARO-MTP \
    --tensor-parallel-size 2 \
    --disable-custom-all-reduce \
    --speculative-config '{"method": "mtp", "num_speculative_tokens": 2}'
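
Once the server is up, a quick smoke test can be sent to its OpenAI-compatible completions endpoint. The sketch below uses only the standard library. It assumes vLLM's default port 8000 and that the served model name is the path passed to vllm serve (the default when --served-model-name is not set); adjust both if your setup differs.

```python
import json
import urllib.request


def build_completion_request(
    prompt: str,
    model: str = "./Qwen3.6-27B-PARO-MTP",      # assumed served model name
    base_url: str = "http://localhost:8000",    # assumed default host and port
    max_tokens: int = 64,
) -> urllib.request.Request:
    """Build a POST request for the /v1/completions endpoint."""
    body = json.dumps(
        {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    ).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt: str) -> str:
    """Send the request to a running server and return the generated text."""
    with urllib.request.urlopen(build_completion_request(prompt)) as resp:
        return json.load(resp)["choices"][0]["text"]
```

With the server running, complete("The capital of France is") returns the generated continuation. Speculative decoding is transparent to the client, so the request looks the same with or without the MTP head.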

On 2× RTX 3090 with paroquant and this MTP head, num_speculative_tokens=2 lands 81-91% draft acceptance at position 0 and 71-79% at position 1, for an acceptance length of about 2.5 out of a possible 3. Sustained generation throughput is ~580 tok/s on vllm bench serve with 1k-in/2k-out requests at parallel=24 (steady-state filtered).
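
The acceptance length follows from the per-position rates by simple arithmetic. The sketch below assumes each rate is the fraction of all draft steps in which that position was accepted (not a rate conditional on the previous position), so every step emits one token from the target model plus the expected number of accepted drafts.

```python
def mean_acceptance_length(per_position_rates: list[float]) -> float:
    """One token from the target model plus the expected accepted draft tokens.

    Assumption: each rate is unconditional, i.e. measured over all draft steps.
    """
    return 1.0 + sum(per_position_rates)


low = mean_acceptance_length([0.81, 0.71])   # lower end of the reported ranges
high = mean_acceptance_length([0.91, 0.79])  # upper end of the reported ranges
max_len = 1 + 2                              # num_speculative_tokens = 2

print(f"{low:.2f} to {high:.2f} of a maximum {max_len}")
# 2.52 to 2.70 of a maximum 3
```

Under that assumption the reported per-position rates imply roughly 2.5 to 2.7 tokens per target-model step, in line with the figure quoted above.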

License

Apache 2.0, inherited from the upstream Qwen/Qwen3.6-27B LICENSE. This repo only redistributes a subset of upstream weights with no modification.

Why does this exist as a separate repo?

Qwen/Qwen3.6-27B-FP8 already publishes a standalone mtp.safetensors (456 MB), and that's a perfectly good source. This repo exists specifically to:

  1. Keep the MTP head fully BF16 (in the FP8 release, the MTP linear projections are stored in FP8) for the cleanest transplant onto bases such as paroquant, whose hidden-state distribution differs from that of the FP8-quantized model.
  2. Be a small, self-describing repo (~810 MB total) that doesn't require downloading a full quantized model just to grab the head.
  3. Document the provenance and audit chain explicitly via SHA256SUMS and the reproducible extraction script.

Authors: guru87 (GitHub: guru1987) and Claude Opus 4.7 (Anthropic, 1M context). Diagnosis, patches, scripting, and docs were developed collaboratively over a single session in May 2026.
