Qwen3.5-9B MTP head (BF16, single file)

Just the multi-token-prediction (MTP) draft head from the official Qwen/Qwen3.5-9B checkpoint, packed into a single ~464 MiB safetensors file.

Why this exists

z-lab/Qwen3.5-9B-PARO declares mtp_num_hidden_layers: 1 in its config.json but ships zero mtp.* tensors in its safetensors shards. vLLM therefore initializes the draft module with random weights when --speculative-config '{"method": "mtp"}' is passed: the verifier rejects every drafted token (0% acceptance), so speculative decoding becomes pure overhead.

This repo provides the missing weights, extracted verbatim from the upstream BF16 base model with no modification.
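The config/checkpoint mismatch is easy to detect programmatically. A minimal sketch (the helper name is hypothetical, not a vLLM API; it assumes the standard model.safetensors.index.json weight_map layout):

```python
# Hypothetical diagnostic helper (not part of vLLM): returns True when the
# config requests an MTP draft head but the checkpoint ships no mtp.* tensors.
def mtp_weights_missing(config: dict, weight_map: dict) -> bool:
    """config: parsed config.json; weight_map: tensor-name -> shard-file map."""
    wants_mtp = config.get("mtp_num_hidden_layers", 0) > 0
    has_mtp = any(name.startswith("mtp.") for name in weight_map)
    return wants_mtp and not has_mtp
```

Run against the PARO repo's config.json and index, this returns True, which is exactly the broken state described above.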

Contents

File             Purpose
mtp.safetensors  15 BF16 tensors, ~464 MiB
SHA256SUMS       Per-tensor full SHA-256 checksums for audit

Provenance

Extracted verbatim from Qwen/Qwen3.5-9B (the public BF16 base model) by reading the two safetensors shards that contain mtp.* keys per model.safetensors.index.json and saving them to a single file. No weights were modified.
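The shard-lookup step can be sketched as follows (function name is mine; the actual extraction script is not published). It only parses the index; the verbatim copy itself would then use the safetensors library (safe_open per shard, save_file for the combined output):

```python
# Sketch of the extraction described above. Assumes the standard index layout:
# {"weight_map": {tensor_name: shard_filename, ...}}.
from collections import defaultdict

def find_mtp_shards(index: dict) -> dict[str, list[str]]:
    """Map each shard file to the mtp.* tensor names it contains."""
    shards: dict[str, list[str]] = defaultdict(list)
    for name, shard in index["weight_map"].items():
        if name.startswith("mtp."):
            shards[shard].append(name)
    return dict(shards)
```

For Qwen/Qwen3.5-9B this resolves to the two shards mentioned above; reading only the mtp.* keys from them and saving to one file yields mtp.safetensors.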

To verify yourself:

sha256sum -c SHA256SUMS  # against your own download of Qwen/Qwen3.5-9B's mtp.* tensors

Usage with paroquant-inject-mtp

paroquant-inject-mtp is a small CLI that wires this MTP head into the paroquant model directory without copying its 14 GB of weights: it symlinks the upstream LM safetensors and writes only the MTP shard:

hf download z-lab/Qwen3.5-9B-PARO --local-dir Qwen3.5-9B-PARO
hf download guru87/Qwen3.5-9B-MTP --local-dir Qwen3.5-9B-MTP

paroquant-inject-mtp \
    --paro     ./Qwen3.5-9B-PARO \
    --mtp-from ./Qwen3.5-9B-MTP/mtp.safetensors \
    --output   ./Qwen3.5-9B-PARO-MTP

vllm serve ./Qwen3.5-9B-PARO-MTP \
    --speculative-config '{"method": "mtp", "num_speculative_tokens": 2}'

Measured results (single 24 GB GPU, paroquant + this MTP, num_spec=2)

  • Output throughput (single-stream greedy): ~120 tok/s
  • Draft acceptance: ~72% averaged across positions
  • Multimodal vision encoder: verified working

Acceptance is lower than for larger siblings (the 27B reaches ~92%): the MTP head was trained against BF16 hidden states, and paroquant INT4 quantization noise perturbs those hidden states proportionally more on smaller models. Empirically, the net throughput gain from speculative decoding may be marginal at the smallest sizes; benchmark your own workload before relying on it.
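A back-of-envelope model shows why ~72% acceptance with num_spec=2 caps the gain around 2x. Assuming (this is my simplification, not a measurement) an independent per-position acceptance rate a, where a drafted token survives only if all earlier drafted tokens did, and the verifier always contributes one token of its own per pass:

```python
# Expected tokens emitted per verifier pass under chained acceptance:
# 1 (verifier's own token) + a + a^2 + ... + a^num_spec.
def expected_tokens_per_step(a: float, num_spec: int) -> float:
    return 1.0 + sum(a ** (i + 1) for i in range(num_spec))

# a = 0.72, num_spec = 2 -> ~2.24 tokens per pass, i.e. an upper bound of
# roughly 2.2x speedup before subtracting draft-head overhead.
```

Real acceptance is position-dependent and correlated, so treat this only as a ceiling estimate.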

License

Apache 2.0, inherited from the upstream Qwen/Qwen3.5-9B LICENSE. This repo only redistributes a subset of upstream weights with no modification.


Authors: guru87 (GitHub: guru1987) and Claude Opus 4.7 (Anthropic, 1M context). Diagnosis, patches, scripting, and docs were developed collaboratively over a single session in May 2026.
