Qwen3.5-9B MTP head (BF16, single file)

Just the multi-token-prediction (MTP) draft head from the official Qwen/Qwen3.5-9B checkpoint, packed into a single ~464 MiB safetensors file.

Why this exists

z-lab/Qwen3.5-9B-PARO declares mtp_num_hidden_layers: 1 in its config.json but ships zero mtp.* tensors in its safetensors shards. vLLM therefore initializes the draft module with random weights when --speculative-config '{"method": "mtp"}' is passed: the verifier rejects every drafted token (0% acceptance), so speculative decoding becomes pure overhead.

This repo provides the missing weights, extracted verbatim from the upstream BF16 base model with no modification.
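The config/checkpoint mismatch is easy to detect programmatically. A minimal sketch (the helper name is hypothetical, not a vLLM API; it assumes the standard model.safetensors.index.json weight_map layout):

```python
# Hypothetical diagnostic helper (not part of vLLM): returns True when the
# config requests an MTP draft head but the checkpoint ships no mtp.* tensors.
def mtp_weights_missing(config: dict, weight_map: dict) -> bool:
    """config: parsed config.json; weight_map: tensor-name -> shard-file map."""
    wants_mtp = config.get("mtp_num_hidden_layers", 0) > 0
    has_mtp = any(name.startswith("mtp.") for name in weight_map)
    return wants_mtp and not has_mtp
```

Run against the PARO repo's config.json and index, this returns True, which is exactly the broken state described above.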

Contents

File             Purpose
mtp.safetensors  15 BF16 tensors, ~464 MiB
SHA256SUMS       Per-tensor full SHA-256 checksums for audit

Provenance

Extracted verbatim from Qwen/Qwen3.5-9B (the public BF16 base model) by reading the two safetensors shards that contain mtp.* keys per model.safetensors.index.json and saving them to a single file. No weights were modified.
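The shard-lookup step can be sketched as follows (function name is mine; the actual extraction script is not published). It only parses the index; the verbatim copy itself would then use the safetensors library (safe_open per shard, save_file for the combined output):

```python
# Sketch of the extraction described above. Assumes the standard index layout:
# {"weight_map": {tensor_name: shard_filename, ...}}.
from collections import defaultdict

def find_mtp_shards(index: dict) -> dict[str, list[str]]:
    """Map each shard file to the mtp.* tensor names it contains."""
    shards: dict[str, list[str]] = defaultdict(list)
    for name, shard in index["weight_map"].items():
        if name.startswith("mtp."):
            shards[shard].append(name)
    return dict(shards)
```

For Qwen/Qwen3.5-9B this resolves to the two shards mentioned above; reading only the mtp.* keys from them and saving to one file yields mtp.safetensors.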

To verify yourself:

sha256sum -c SHA256SUMS  # against your own download of Qwen/Qwen3.5-9B's mtp.* tensors

Usage with paroquant-inject-mtp

paroquant-inject-mtp is a small CLI that wires this MTP head into the paroquant model directory without copying its 14 GB of weights: it symlinks the upstream LM safetensors and writes only the MTP shard:

hf download z-lab/Qwen3.5-9B-PARO --local-dir Qwen3.5-9B-PARO
hf download guru87/Qwen3.5-9B-MTP --local-dir Qwen3.5-9B-MTP

paroquant-inject-mtp \
    --paro     ./Qwen3.5-9B-PARO \
    --mtp-from ./Qwen3.5-9B-MTP/mtp.safetensors \
    --output   ./Qwen3.5-9B-PARO-MTP

vllm serve ./Qwen3.5-9B-PARO-MTP \
    --speculative-config '{"method": "mtp", "num_speculative_tokens": 2}'

Measured results (single 24 GB GPU, paroquant + this MTP, num_spec=2)

  • Output throughput (single-stream greedy): ~120 tok/s
  • Draft acceptance: ~72% averaged across positions
  • Multimodal vision encoder: verified working

Acceptance is lower than for larger siblings (the 27B reaches ~92%): the MTP head was trained against BF16 hidden states, and paroquant INT4 quantization noise perturbs those hidden states proportionally more on smaller models. Empirically, the net throughput gain from speculative decoding may be marginal at the smallest sizes; benchmark your own workload before relying on it.
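A back-of-envelope model shows why ~72% acceptance with num_spec=2 caps the gain around 2x. Assuming (this is my simplification, not a measurement) an independent per-position acceptance rate a, where a drafted token survives only if all earlier drafted tokens did, and the verifier always contributes one token of its own per pass:

```python
# Expected tokens emitted per verifier pass under chained acceptance:
# 1 (verifier's own token) + a + a^2 + ... + a^num_spec.
def expected_tokens_per_step(a: float, num_spec: int) -> float:
    return 1.0 + sum(a ** (i + 1) for i in range(num_spec))

# a = 0.72, num_spec = 2 -> ~2.24 tokens per pass, i.e. an upper bound of
# roughly 2.2x speedup before subtracting draft-head overhead.
```

Real acceptance is position-dependent and correlated, so treat this only as a ceiling estimate.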

License

Apache 2.0, inherited from the upstream Qwen/Qwen3.5-9B LICENSE. This repo only redistributes a subset of upstream weights with no modification.


Authors: guru87 (GitHub: guru1987) and Claude Opus 4.7 (Anthropic, 1M context). Diagnosis, patches, scripting, and docs were developed collaboratively over a single session in May 2026.
