# Qwen3.5-27B-NVFP4-MTP
NVFP4-quantized Qwen/Qwen3.5-27B with the original bf16 MTP (Multi-Token Prediction) head grafted back in for speculative decoding.
~19 GB total, so it fits on a single GPU with room to spare. Achieves ~19.7 tok/s on an NVIDIA DGX Spark (GB10, 128 GB unified memory), 3.4x faster than the bf16 model served with tensor parallelism across two nodes.
## What's Different About This Quant
Standard NVFP4 quantization via nvidia-modelopt discards the MTP head because AutoModelForCausalLM doesn't load it. This checkpoint restores the MTP weights from the original model in bf16 and adds them to the quantization ignore list, giving you working speculative decoding out of the box.
## Quantization Details
| Property | Value |
|---|---|
| Method | NVFP4 via nvidia-modelopt 0.41.0 |
| Group size | 16 |
| Calibration | 256 samples from CNN/DailyMail (train split), max_seq_len=2048 |
| Excluded from quantization | lm_head, all conv1d layers (Mamba blocks), all MTP linear layers |
| MTP head | bf16 (15 tensors, ~850 MB, grafted from original) |
| Total checkpoint size | ~19 GB |
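As a rough sanity check on the checkpoint size: NVFP4 stores 4-bit (FP4) weight values plus one FP8 scale per group of 16 weights, i.e. about 4.5 effective bits per quantized weight. A back-of-envelope sketch (parameter counts are approximations, not read from the checkpoint; treating all 27B parameters as quantized slightly overestimates the quantized portion):

```python
# Back-of-envelope NVFP4 size estimate.
# NVFP4: 4-bit weight values + one 8-bit (FP8) scale per group of 16 weights.
GROUP_SIZE = 16                       # group size from the table above
bits_per_weight = 4 + 8 / GROUP_SIZE  # 4.5 effective bits per weight

params = 27e9  # ~27B parameters (approximation)
quantized_gb = params * bits_per_weight / 8 / 1e9
print(f"quantized weights: ~{quantized_gb:.1f} GB")  # ~15.2 GB

mtp_head_gb = 0.85  # bf16 MTP head, from the table
print(f"plus MTP head: ~{quantized_gb + mtp_head_gb:.1f} GB")
```

The remaining gap up to the ~19 GB total comes from the other layers kept in higher precision (lm_head, embeddings, and the Mamba conv1d layers excluded from quantization).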
## Usage (vLLM)
Requires vLLM with Qwen3.5 support (tested with vllm/vllm-openai:qwen3_5-cu130, vLLM 0.16.0rc2).
### With MTP speculative decoding (recommended)

```bash
vllm serve osoleve/Qwen3.5-27B-NVFP4-MTP \
  --trust-remote-code \
  --gpu-memory-utilization 0.85 \
  --max-model-len 8192 \
  --quantization modelopt \
  --language-model-only \
  --speculative-config '{"method":"mtp","num_speculative_tokens":1}'
```
### Without MTP (still faster than bf16)

```bash
vllm serve osoleve/Qwen3.5-27B-NVFP4-MTP \
  --trust-remote-code \
  --gpu-memory-utilization 0.85 \
  --max-model-len 8192 \
  --quantization modelopt \
  --language-model-only
```
### Important flags

- `--language-model-only` — Required. Qwen3.5 is natively multimodal; this skips the vision encoder (no vision weights are included).
- `--quantization modelopt` — Required. Tells vLLM to use the NVFP4 weight format.
- `num_speculative_tokens: 1` — The model has a single MTP layer. Setting this to 2 reuses the same layer twice, with degraded acceptance rates.
### Disabling thinking mode

Qwen3.5 defaults to thinking/reasoning mode. To disable it per-request:

```json
{
  "chat_template_kwargs": {"enable_thinking": false}
}
```
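For example, a minimal sketch of a full chat-completions request body with thinking disabled, aimed at vLLM's OpenAI-compatible route (`POST /v1/chat/completions` on the server started above; the prompt is illustrative):

```python
import json

# Request body for vLLM's OpenAI-compatible chat endpoint.
# "chat_template_kwargs" is passed through to the chat template,
# where Qwen3.5 reads enable_thinking.
payload = {
    "model": "osoleve/Qwen3.5-27B-NVFP4-MTP",
    "messages": [
        {"role": "user", "content": "Summarize NVFP4 in one sentence."}
    ],
    "chat_template_kwargs": {"enable_thinking": False},
}

body = json.dumps(payload)
print(body)
```

POST `body` to `http://localhost:8000/v1/chat/completions` with `Content-Type: application/json`, or pass `extra_body={"chat_template_kwargs": {"enable_thinking": False}}` when using an OpenAI-style client.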
## Benchmarks
Tested on NVIDIA DGX Spark (GB10 SoC, 128 GB unified CPU/GPU memory). Task: generate 692 tokens with thinking disabled.
| Configuration | Nodes | tok/s | Speedup |
|---|---|---|---|
| bf16, TP=2 | 2 | ~5.8 | 1.0x |
| bf16, TP=2 + MTP(2) | 2 | ~13.0 | 2.2x |
| NVFP4 (this model) | 1 | ~10.5 | 1.8x |
| NVFP4 + MTP(1) (this model) | 1 | ~19.7 | 3.4x |
Memory footprint: ~18.5 GiB with MTP, leaving ~80 GiB for KV cache on a 128 GB system.
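The speedup column is simply each configuration's throughput divided by the bf16 TP=2 baseline; a quick check of the table's arithmetic:

```python
# Throughputs (tok/s) from the benchmark table above.
baseline = 5.8  # bf16, TP=2

configs = {
    "bf16 TP=2 + MTP(2)": 13.0,
    "NVFP4": 10.5,
    "NVFP4 + MTP(1)": 19.7,
}

speedups = {name: round(tps / baseline, 1) for name, tps in configs.items()}
for name, x in speedups.items():
    print(f"{name}: {x}x")
```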
## How This Was Made
- Loaded `Qwen/Qwen3.5-27B` in bf16 via `AutoModelForCausalLM`
- Quantized to NVFP4 using `modelopt.torch.quantization.quantize()` with `NVFP4_DEFAULT_CFG`
- Exported via `modelopt.torch.export.unified_export_hf.export_hf_checkpoint()`
- Grafted 15 MTP tensors from the original checkpoint into the quantized safetensors
- Added MTP linear layers to `quantization_config.ignore` in `config.json`
- Used the original model's `config.json` structure (`ForConditionalGeneration` with `text_config`/`vision_config`) to match weight naming conventions
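The ignore-list step above amounts to a small `config.json` edit. A minimal sketch, assuming the modelopt export wrote a `quantization_config.ignore` list; the MTP layer names below are illustrative placeholders, not the actual tensor names from this checkpoint:

```python
import json

# Illustrative fragment of the exported config.json; the real file
# contains many more fields.
config = {
    "quantization_config": {
        "quant_method": "modelopt",
        "ignore": ["lm_head"],
    }
}

# Hypothetical MTP layer names -- the real names depend on the checkpoint's
# weight naming conventions.
mtp_layers = ["model.mtp.fc", "model.mtp.norm"]

# Append each MTP layer to the ignore list so vLLM loads them as bf16
# instead of expecting NVFP4-packed weights.
ignore = config["quantization_config"]["ignore"]
ignore.extend(name for name in mtp_layers if name not in ignore)

print(json.dumps(config, indent=2))
```

In the real workflow the same edit is applied to the exported `config.json` on disk before uploading the checkpoint.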
## License
Apache 2.0, following the original Qwen3.5-27B license.