Qwen3.6-27B MTP head (BF16, single file)
This repo contains only the multi-token-prediction (MTP) draft head from the
official Qwen/Qwen3.6-27B checkpoint, packed into a single ~810 MB safetensors file.
This repo is useful only as a drafter for a quantized Qwen3.6-27B
checkpoint that is missing its MTP weights (e.g. z-lab/Qwen3.6-27B-PARO,
which declares mtp_num_hidden_layers: 1 in its config but ships zero
mtp.* tensors in its safetensors). With the MTP head re-attached, vLLM's
speculative decoding works out of the box.
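To check whether a given checkpoint directory is missing its MTP weights, it is enough to read the safetensors headers, which list every tensor name without loading any data. The following is a minimal sketch that assumes all weights sit as `*.safetensors` files directly in the model directory; the function names are illustrative and not part of this repo.

```python
import json
import struct
from pathlib import Path


def read_safetensors_header(path):
    """Return the JSON header of a safetensors file without reading tensor data."""
    with open(path, "rb") as f:
        # A safetensors file starts with an unsigned little-endian 64-bit
        # integer giving the length of the JSON header that follows.
        (header_len,) = struct.unpack("<Q", f.read(8))
        return json.loads(f.read(header_len))


def list_mtp_tensors(model_dir):
    """Return the sorted names of all tensors starting with 'mtp.' in model_dir."""
    names = []
    for shard in sorted(Path(model_dir).glob("*.safetensors")):
        header = read_safetensors_header(shard)
        names.extend(k for k in header if k.startswith("mtp."))
    return sorted(names)


# Example (path is a placeholder): an empty list means the head is missing.
# print(list_mtp_tensors("./Qwen3.6-27B-PARO"))
```

An empty result for a checkpoint whose config declares an MTP layer is the situation this repo is meant to address.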
Contents
| File | Purpose |
|---|---|
| mtp.safetensors | 15 BF16 tensors, 811 MB, lifted verbatim from the base |
| SHA256SUMS | Per-tensor full SHA256 for audit / authenticity check |
| extract_from_qwen_base.py | Reproducible extraction script (the file was made with this) |
Provenance
Extracted from Qwen/Qwen3.6-27B/model-00013-of-00015.safetensors and
model-00015-of-00015.safetensors (the two shards that contain the mtp.*
weight map entries) on a local copy of the public Qwen base model.
To verify authenticity yourself:
sha256sum -c SHA256SUMS # against your own download of Qwen/Qwen3.6-27B's mtp.* tensors
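For a per-tensor comparison that does not depend on how the tensors are packed into files, the digests can be recomputed directly from any safetensors shard. The sketch below assumes each digest covers the raw tensor bytes exactly as stored in the file; the hashing scheme and layout of SHA256SUMS are not specified here, so treat `tensor_sha256` as an illustrative helper and adjust it if the published digests were computed differently.

```python
import hashlib
import json
import struct


def tensor_sha256(path):
    """Map each tensor name in a safetensors file to the SHA256 of its raw bytes."""
    digests = {}
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
        # data_offsets are relative to the first byte after the JSON header.
        data_start = 8 + header_len
        for name, info in header.items():
            if name == "__metadata__":  # optional free-form metadata, not a tensor
                continue
            begin, end = info["data_offsets"]
            f.seek(data_start + begin)
            digests[name] = hashlib.sha256(f.read(end - begin)).hexdigest()
    return digests


# Example (paths are placeholders): compare this repo's file against a base shard.
# ours = tensor_sha256("mtp.safetensors")
# base = tensor_sha256("model-00015-of-00015.safetensors")
# print({k: ours[k] == base.get(k) for k in ours})
```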
We additionally verified by SHA256 that these BF16 tensors are bit-identical to the MTP weights shipped in three independently published quantized checkpoints:
| Source | Matches? |
|---|---|
| Qwen/Qwen3.6-27B-FP8 (BF16 norms portion) | bit-equal |
| Qwen3.6-27B-GPTQ-8bit (BF16 throughout) | bit-equal |
| Qwen3.6-27B-AWQ-BF16-INT4 (BF16 norms) | bit-equal |
Linear projections in the FP8 source dequantize to within ~5e-4 mean abs error of the BF16 originals here, which is pure FP8 round-trip noise. So this MTP head is canonical Qwen output, identical to what is already redistributed inside several existing quantized checkpoints.
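The mean-absolute-error figure can be illustrated with a small, dependency-free sketch. It decodes raw BF16 bytes (BF16 is the upper 16 bits of an IEEE-754 float32) and averages the absolute differences between two value lists. The FP8 dequantization step itself is not shown, because the scale layout of the FP8 checkpoint is not described here; both helper names are illustrative, and for full-size tensors a vectorized library would be the practical choice.

```python
import struct


def bf16_to_floats(raw):
    """Decode little-endian BF16 bytes into a list of Python floats."""
    values = []
    for (bits,) in struct.iter_unpack("<H", raw):
        # Shift the 16 BF16 bits into the top half of a float32 bit pattern.
        values.append(struct.unpack("<f", struct.pack("<I", bits << 16))[0])
    return values


def mean_abs_error(a, b):
    """Average of |a_i - b_i| over two equally long sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


reference = bf16_to_floats(b"\x80\x3f\x00\xc0")  # BF16 encodings of 1.0 and -2.0
print(mean_abs_error(reference, [1.0, -2.0]))    # prints 0.0 for identical values
```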
Tensor manifest
mtp.fc.weight shape=[5120, 10240] bf16
mtp.norm.weight shape=[5120] bf16
mtp.pre_fc_norm_embedding.weight shape=[5120] bf16
mtp.pre_fc_norm_hidden.weight shape=[5120] bf16
mtp.layers.0.input_layernorm.weight shape=[5120] bf16
mtp.layers.0.post_attention_layernorm.weight shape=[5120] bf16
mtp.layers.0.self_attn.q_norm.weight shape=[256] bf16
mtp.layers.0.self_attn.k_norm.weight shape=[256] bf16
mtp.layers.0.self_attn.q_proj.weight shape=[12288, 5120] bf16
mtp.layers.0.self_attn.k_proj.weight shape=[1024, 5120] bf16
mtp.layers.0.self_attn.v_proj.weight shape=[1024, 5120] bf16
mtp.layers.0.self_attn.o_proj.weight shape=[5120, 6144] bf16
mtp.layers.0.mlp.gate_proj.weight shape=[17408, 5120] bf16
mtp.layers.0.mlp.up_proj.weight shape=[17408, 5120] bf16
mtp.layers.0.mlp.down_proj.weight shape=[5120, 17408] bf16
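As a sanity check on the manifest, the tensor payload size follows directly from the shapes at 2 bytes per BF16 element. The sketch below only redoes that arithmetic; the small JSON header of the safetensors file is not counted.

```python
from math import prod

# Shapes copied from the tensor manifest; every tensor is BF16 (2 bytes per element).
SHAPES = {
    "mtp.fc.weight": [5120, 10240],
    "mtp.norm.weight": [5120],
    "mtp.pre_fc_norm_embedding.weight": [5120],
    "mtp.pre_fc_norm_hidden.weight": [5120],
    "mtp.layers.0.input_layernorm.weight": [5120],
    "mtp.layers.0.post_attention_layernorm.weight": [5120],
    "mtp.layers.0.self_attn.q_norm.weight": [256],
    "mtp.layers.0.self_attn.k_norm.weight": [256],
    "mtp.layers.0.self_attn.q_proj.weight": [12288, 5120],
    "mtp.layers.0.self_attn.k_proj.weight": [1024, 5120],
    "mtp.layers.0.self_attn.v_proj.weight": [1024, 5120],
    "mtp.layers.0.self_attn.o_proj.weight": [5120, 6144],
    "mtp.layers.0.mlp.gate_proj.weight": [17408, 5120],
    "mtp.layers.0.mlp.up_proj.weight": [17408, 5120],
    "mtp.layers.0.mlp.down_proj.weight": [5120, 17408],
}

total_params = sum(prod(shape) for shape in SHAPES.values())
total_bytes = 2 * total_params  # BF16 payload only, header excluded

print(len(SHAPES), total_params, total_bytes, round(total_bytes / 2**20, 2))
# prints: 15 424699392 849398784 810.05
```

The payload comes to about 810 MiB (about 849 MB in decimal units), which is consistent with the file size quoted in this README if that figure is read in binary units.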
Usage with inject_mtp.py
inject_mtp.py is a small CLI that wires this MTP head into a paroquant-quantized
model dir without copying the LM weights (it produces a sharded safetensors
layout with the original LM safetensors symlinked):
# 1. Pull the upstream paroquant model
hf download z-lab/Qwen3.6-27B-PARO --local-dir Qwen3.6-27B-PARO
# 2. Pull this MTP head
hf download guru87/Qwen3.6-27B-MTP --local-dir Qwen3.6-27B-MTP
# 3. Inject
inject_mtp.py \
--paro ./Qwen3.6-27B-PARO \
--mtp-from ./Qwen3.6-27B-MTP/mtp.safetensors \
--output ./Qwen3.6-27B-PARO-MTP
# 4. Verify
inject_mtp.py --verify ./Qwen3.6-27B-PARO-MTP
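For readers curious what a symlinked sharded layout of this kind can look like, here is a rough sketch. It is not the actual inject_mtp.py and makes several assumptions: the source directory holds its weights as top-level `*.safetensors` files, a Hugging Face style `model.safetensors.index.json` with a `weight_map` is enough to describe the result, and no tensor renaming or config changes are needed. The function names `tensor_names` and `build_layout` are hypothetical.

```python
import json
import os
import shutil
import struct
from pathlib import Path


def tensor_names(path):
    """Read tensor names from a safetensors header (no tensor data is loaded)."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
    return [k for k in header if k != "__metadata__"]


def build_layout(paro_dir, mtp_file, out_dir):
    """Symlink the LM shards, copy the MTP file, and write a combined index."""
    paro_dir, mtp_file, out_dir = Path(paro_dir), Path(mtp_file), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    weight_map = {}
    for src in sorted(paro_dir.iterdir()):
        if not src.is_file() or src.name == "model.safetensors.index.json":
            continue  # skip subdirectories; the index is regenerated below
        dst = out_dir / src.name
        if src.suffix == ".safetensors":
            os.symlink(src.resolve(), dst)  # link instead of copying LM weights
            for name in tensor_names(src):
                weight_map[name] = src.name
        else:
            shutil.copy2(src, dst)  # config, tokenizer files, and similar
    shutil.copy2(mtp_file, out_dir / "mtp.safetensors")
    for name in tensor_names(mtp_file):
        weight_map[name] = "mtp.safetensors"
    index = {"metadata": {}, "weight_map": weight_map}
    (out_dir / "model.safetensors.index.json").write_text(json.dumps(index, indent=2))
    return weight_map
```

Symlinking keeps the output directory small, since only the MTP file and the metadata files are duplicated.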
Usage with vLLM (after injection)
vllm serve ./Qwen3.6-27B-PARO-MTP \
--tensor-parallel-size 2 \
--disable-custom-all-reduce \
--speculative-config '{"method": "mtp", "num_speculative_tokens": 2}'
On 2× RTX 3090 with paroquant and this MTP head, num_speculative_tokens=2 reaches
81-91% draft acceptance at position 0 and 71-79% at position 1, for an
acceptance length of about 2.5 out of a maximum of 3. Sustained generation
throughput is ~580 tok/s on vllm bench serve with 1k-in/2k-out at parallel=24
(steady-state filtered).
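The relation between the per-position acceptance rates and the acceptance length can be checked with simple arithmetic. The sketch assumes each per-position rate is the fraction of all draft steps in which that position was accepted, so every step emits one token from the target model plus the accepted drafts; `mean_acceptance_length` is an illustrative helper, not a vLLM API.

```python
def mean_acceptance_length(per_position_rates):
    """Expected tokens per verification step: 1 target token plus accepted drafts."""
    return 1.0 + sum(per_position_rates)


# Lower and upper ends of the acceptance ranges reported for positions 0 and 1.
low = mean_acceptance_length([0.81, 0.71])
high = mean_acceptance_length([0.91, 0.79])
print(f"{low:.2f} to {high:.2f} of a maximum {1 + 2}")
# prints: 2.52 to 2.70 of a maximum 3
```

Under that assumption the reported rates imply roughly 2.5 to 2.7 tokens per step, in line with the acceptance length quoted for this setup.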
License
Apache 2.0, inherited from the upstream
Qwen/Qwen3.6-27B LICENSE.
This repo only redistributes a subset of upstream weights with no modification.
Why does this exist as a separate repo?
Qwen/Qwen3.6-27B-FP8 already publishes a standalone mtp.safetensors (456 MB),
and that's a perfectly good source. This repo exists specifically to:
- Keep the MTP head fully BF16 (the FP8 release stores its linear projections in FP8) for the cleanest transplant onto bases like paroquant, whose hidden-state distribution differs from that of the FP8-quantized model.
- Be a small, self-describing repo (~810 MB total) that doesn't require downloading a full quantized model just to grab the head.
- Document the provenance and audit chain explicitly via SHA256SUMS and the reproducible extraction script.
Authors: guru87 (GitHub: guru1987) and Claude Opus 4.7 (Anthropic, 1M context). Diagnosis, patches, scripting, and docs were developed collaboratively over a single session in May 2026.