# Qwen3.5-27B-NVFP4-MTP
NVFP4-quantized Qwen/Qwen3.5-27B with the original bf16 MTP (Multi-Token Prediction) head grafted back in for speculative decoding.
~19 GB total, so it fits on a single GPU with room to spare. Achieves ~19.7 tok/s on an NVIDIA DGX Spark (GB10, 128 GB unified memory), 3.4x faster than the bf16 model served with tensor parallelism across two nodes.
## What's Different About This Quant
Standard NVFP4 quantization via nvidia-modelopt discards the MTP head because AutoModelForCausalLM doesn't load it. This checkpoint restores the MTP weights from the original model in bf16 and adds them to the quantization ignore list, giving you working speculative decoding out of the box.
## Quantization Details
| Property | Value |
|---|---|
| Method | NVFP4 via nvidia-modelopt 0.41.0 |
| Group size | 16 |
| Calibration | 256 samples from CNN/DailyMail (train split), max_seq_len=2048 |
| Excluded from quantization | lm_head, all conv1d layers (Mamba blocks), all MTP linear layers |
| MTP head | bf16 (15 tensors, ~850 MB, grafted from original) |
| Total checkpoint size | ~19 GB |
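As a rough sanity check on the checkpoint size: NVFP4 stores 4-bit (FP4) weight values plus one FP8 scale per group of 16 weights, i.e. about 4.5 effective bits per quantized weight. A back-of-envelope sketch (parameter counts are approximations, not read from the checkpoint; treating all 27B parameters as quantized slightly overestimates the quantized portion):

```python
# Back-of-envelope NVFP4 size estimate.
# NVFP4: 4-bit weight values + one 8-bit (FP8) scale per group of 16 weights.
GROUP_SIZE = 16                       # group size from the table above
bits_per_weight = 4 + 8 / GROUP_SIZE  # 4.5 effective bits per weight

params = 27e9  # ~27B parameters (approximation)
quantized_gb = params * bits_per_weight / 8 / 1e9
print(f"quantized weights: ~{quantized_gb:.1f} GB")  # ~15.2 GB

mtp_head_gb = 0.85  # bf16 MTP head, from the table
print(f"plus MTP head: ~{quantized_gb + mtp_head_gb:.1f} GB")
```

The remaining gap up to the ~19 GB total comes from the other layers kept in higher precision (lm_head, embeddings, and the Mamba conv1d layers excluded from quantization).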
## Usage (vLLM)
Requires vLLM with Qwen3.5 support (tested with vllm/vllm-openai:qwen3_5-cu130, vLLM 0.16.0rc2).
### With MTP speculative decoding (recommended)

```bash
vllm serve osoleve/Qwen3.5-27B-NVFP4-MTP \
  --trust-remote-code \
  --gpu-memory-utilization 0.85 \
  --max-model-len 8192 \
  --quantization modelopt \
  --language-model-only \
  --speculative-config '{"method":"mtp","num_speculative_tokens":1}'
```
### Without MTP (still faster than bf16)

```bash
vllm serve osoleve/Qwen3.5-27B-NVFP4-MTP \
  --trust-remote-code \
  --gpu-memory-utilization 0.85 \
  --max-model-len 8192 \
  --quantization modelopt \
  --language-model-only
```
### Important flags

- `--language-model-only` — Required. Qwen3.5 is natively multimodal; this skips the vision encoder (no vision weights are included).
- `--quantization modelopt` — Required. Tells vLLM to use the NVFP4 weight format.
- `num_speculative_tokens: 1` — The model has a single MTP layer. Setting this to 2 reuses the same layer twice, with degraded acceptance rates.
### Disabling thinking mode

Qwen3.5 defaults to thinking/reasoning mode. To disable it per-request:

```json
{
  "chat_template_kwargs": {"enable_thinking": false}
}
```
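For example, a minimal sketch of a full chat-completions request body with thinking disabled, aimed at vLLM's OpenAI-compatible route (`POST /v1/chat/completions` on the server started above; the prompt is illustrative):

```python
import json

# Request body for vLLM's OpenAI-compatible chat endpoint.
# "chat_template_kwargs" is passed through to the chat template,
# where Qwen3.5 reads enable_thinking.
payload = {
    "model": "osoleve/Qwen3.5-27B-NVFP4-MTP",
    "messages": [
        {"role": "user", "content": "Summarize NVFP4 in one sentence."}
    ],
    "chat_template_kwargs": {"enable_thinking": False},
}

body = json.dumps(payload)
print(body)
```

POST `body` to `http://localhost:8000/v1/chat/completions` with `Content-Type: application/json`, or pass `extra_body={"chat_template_kwargs": {"enable_thinking": False}}` when using an OpenAI-style client.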
## Benchmarks
Tested on NVIDIA DGX Spark (GB10 SoC, 128 GB unified CPU/GPU memory). Task: generate 692 tokens with thinking disabled.
| Configuration | Nodes | tok/s | Speedup |
|---|---|---|---|
| bf16, TP=2 | 2 | ~5.8 | 1.0x |
| bf16, TP=2 + MTP(2) | 2 | ~13.0 | 2.2x |
| NVFP4 (this model) | 1 | ~10.5 | 1.8x |
| NVFP4 + MTP(1) (this model) | 1 | ~19.7 | 3.4x |
Memory footprint: ~18.5 GiB with MTP, leaving ~80 GiB for KV cache on a 128 GB system.
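The speedup column is simply each configuration's throughput divided by the bf16 TP=2 baseline; a quick check of the table's arithmetic:

```python
# Throughputs (tok/s) from the benchmark table above.
baseline = 5.8  # bf16, TP=2

configs = {
    "bf16 TP=2 + MTP(2)": 13.0,
    "NVFP4": 10.5,
    "NVFP4 + MTP(1)": 19.7,
}

speedups = {name: round(tps / baseline, 1) for name, tps in configs.items()}
for name, x in speedups.items():
    print(f"{name}: {x}x")
```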
## How This Was Made
- Loaded `Qwen/Qwen3.5-27B` in bf16 via `AutoModelForCausalLM`
- Quantized to NVFP4 using `modelopt.torch.quantization.quantize()` with `NVFP4_DEFAULT_CFG`
- Exported via `modelopt.torch.export.unified_export_hf.export_hf_checkpoint()`
- Grafted 15 MTP tensors from the original checkpoint into the quantized safetensors
- Added MTP linear layers to `quantization_config.ignore` in `config.json`
- Used the original model's `config.json` structure (`ForConditionalGeneration` with `text_config`/`vision_config`) to match weight naming conventions
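The ignore-list step above amounts to a small `config.json` edit. A minimal sketch, assuming the modelopt export wrote a `quantization_config.ignore` list; the MTP layer names below are illustrative placeholders, not the actual tensor names from this checkpoint:

```python
import json

# Illustrative fragment of the exported config.json; the real file
# contains many more fields.
config = {
    "quantization_config": {
        "quant_method": "modelopt",
        "ignore": ["lm_head"],
    }
}

# Hypothetical MTP layer names -- the real names depend on the checkpoint's
# weight naming conventions.
mtp_layers = ["model.mtp.fc", "model.mtp.norm"]

# Append each MTP layer to the ignore list so vLLM loads them as bf16
# instead of expecting NVFP4-packed weights.
ignore = config["quantization_config"]["ignore"]
ignore.extend(name for name in mtp_layers if name not in ignore)

print(json.dumps(config, indent=2))
```

In the real workflow the same edit is applied to the exported `config.json` on disk before uploading the checkpoint.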
## License
Apache 2.0, following the original Qwen3.5-27B license.