# Qwen3.5-9B MTP head (BF16, single file)

Just the multi-token-prediction (MTP) draft head from the official
Qwen/Qwen3.5-9B checkpoint, packed into a single ~464 MiB safetensors file.
## Why this exists

`z-lab/Qwen3.5-9B-PARO` declares `mtp_num_hidden_layers: 1` in its
`config.json` but ships no `mtp.*` tensors in its safetensors shards. When
`--speculative-config '{"method": "mtp"}'` is passed, vLLM therefore
initializes the draft module with random weights: the verifier rejects every
drafted token (0% acceptance), and speculative decoding becomes pure overhead.
This repo provides the missing weights, extracted verbatim from the upstream BF16 base model with no modification.
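You can check for the missing tensors yourself. The sketch below (stdlib only) reads a checkpoint's `model.safetensors.index.json`, whose standard `weight_map` field maps each tensor name to its shard file, and reports which `mtp.*` tensors exist and where they live; an empty result is the failure mode described above.

```python
import json

def find_mtp_shards(index_path: str) -> tuple[list[str], list[str]]:
    """Return (mtp.* tensor names, shard files containing them) from a
    model.safetensors.index.json. An empty name list means the checkpoint
    ships no MTP weights and the draft head will be random-initialized."""
    with open(index_path) as f:
        weight_map = json.load(f)["weight_map"]  # tensor name -> shard file
    mtp_keys = sorted(k for k in weight_map if k.startswith("mtp."))
    shards = sorted({weight_map[k] for k in mtp_keys})
    return mtp_keys, shards
```

Running this against `z-lab/Qwen3.5-9B-PARO` returns an empty key list; against the upstream base model it returns the `mtp.*` keys and the two shards they sit in.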
## Contents

| File | Purpose |
|---|---|
| `mtp.safetensors` | 15 BF16 tensors, ~464 MiB |
| `SHA256SUMS` | Per-tensor full SHA-256 hashes for audit |
## Provenance
Extracted verbatim from Qwen/Qwen3.5-9B (the public BF16 base model) by reading
the two safetensors shards that contain mtp.* keys per
model.safetensors.index.json and saving them to a single file. No
weights were modified.
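To inspect the extracted file without pulling in torch or the safetensors library, you can parse the header by hand. Per the safetensors format, a file begins with a little-endian u64 header length followed by that many bytes of JSON mapping each tensor name to its `dtype`, `shape`, and `data_offsets`. A minimal stdlib-only reader:

```python
import json
import struct

def safetensors_header(path: str) -> dict:
    """Parse a .safetensors header with the stdlib only. The file starts
    with a u64 little-endian header length, then that many bytes of JSON
    mapping tensor name -> {dtype, shape, data_offsets}."""
    with open(path, "rb") as f:
        (hdr_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(hdr_len))
    header.pop("__metadata__", None)  # optional metadata entry, not a tensor
    return header
```

Applied to `mtp.safetensors`, this lets you cross-check the "15 BF16 tensors" claim: `len(header)` should be 15 and every entry's `dtype` should be `BF16`.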
To verify yourself:

```shell
# against your own download of Qwen/Qwen3.5-9B's mtp.* tensors
sha256sum -c SHA256SUMS
```
## Usage with `paroquant-inject-mtp`

`paroquant-inject-mtp` is a small CLI that wires this MTP head into the
paroquant model directory without copying its 14 GB of weights: it symlinks
the upstream LM safetensors and writes only the MTP shard.
```shell
hf download z-lab/Qwen3.5-9B-PARO --local-dir Qwen3.5-9B-PARO
hf download guru87/Qwen3.5-9B-MTP --local-dir Qwen3.5-9B-MTP

paroquant-inject-mtp \
  --paro ./Qwen3.5-9B-PARO \
  --mtp-from ./Qwen3.5-9B-MTP/mtp.safetensors \
  --output ./Qwen3.5-9B-PARO-MTP

vllm serve ./Qwen3.5-9B-PARO-MTP \
  --speculative-config '{"method": "mtp", "num_speculative_tokens": 2}'
```
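The symlink-and-inject step can be sketched roughly as follows. This is a hypothetical simplification of what the CLI does, not its actual source: the real tool also has to patch `model.safetensors.index.json` so `mtp.*` tensor names resolve to `mtp.safetensors`, which is omitted here.

```python
import shutil
from pathlib import Path

def inject_mtp(paro_dir: str, mtp_file: str, out_dir: str) -> None:
    """Build an output model dir that symlinks the large LM shards,
    copies the small auxiliary files, and adds the MTP shard."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for f in Path(paro_dir).iterdir():
        if not f.is_file():
            continue
        if f.suffix == ".safetensors":
            (out / f.name).symlink_to(f.resolve())  # no 14 GB copy
        else:
            shutil.copy2(f, out / f.name)  # config.json, tokenizer, index
    shutil.copy2(mtp_file, out / "mtp.safetensors")
```

Symlinking keeps the output directory a few hundred MiB instead of duplicating the quantized LM shards; deleting the upstream download would of course break the links.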
## Measured results (single 24 GB GPU, paroquant + this MTP, num_spec=2)
- Output throughput (single-stream greedy): ~120 tok/s
- Draft acceptance: ~72% averaged across positions
- Multimodal vision encoder: verified working
Acceptance is lower than on larger siblings (the 27B variant reaches ~92%): the MTP head was trained against BF16 hidden states, and paroquant's INT4 quantization noise perturbs those states proportionally more on smaller models. Empirically, the net throughput gain from speculative decoding may be marginal at the smallest sizes; benchmark your workload before relying on it.
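A back-of-the-envelope model shows why the acceptance gap matters. Assuming an independent per-position acceptance probability `alpha` (a simplification; the ~72% above is an average across positions), the expected number of tokens emitted per verifier forward pass with `k` drafted tokens follows the standard geometric-series formula:

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per verifier forward pass with k drafted
    tokens: if the first i drafts are accepted (probability alpha**i), the
    verifier emits those i tokens plus one of its own, so the mean is
    sum(alpha**i for i in 0..k) = (1 - alpha**(k+1)) / (1 - alpha)."""
    if alpha >= 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)

small = expected_tokens_per_step(0.72, 2)  # ~2.24 tokens per verify step
large = expected_tokens_per_step(0.92, 2)  # ~2.77 tokens per verify step
```

Note that ~2.24 expected tokens per step does not translate directly into a 2.24x speedup: the draft head's own forward passes and the verification overhead must be subtracted, which is why the net gain can be marginal here.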
## License

Apache 2.0, inherited from the upstream Qwen/Qwen3.5-9B LICENSE. This repo
redistributes only a subset of the upstream weights, with no modification.
Authors: guru87 (GitHub: guru1987) and Claude Opus 4.7 (Anthropic, 1M context). Diagnosis, patches, scripting, and docs were developed collaboratively over a single session in May 2026.