# Qwen3.5-27B-NVFP4-MTP

NVFP4-quantized Qwen/Qwen3.5-27B with the original bf16 MTP (Multi-Token Prediction) head grafted back in for speculative decoding.

~19 GB total, so it fits on a single GPU with room to spare. Achieves ~19.7 tok/s on an NVIDIA DGX Spark (GB10, 128 GB unified memory), 3.4x faster than the bf16 model served with tensor parallelism across two nodes.

## What's Different About This Quant

Standard NVFP4 quantization via nvidia-modelopt discards the MTP head because AutoModelForCausalLM doesn't load it. This checkpoint restores the MTP weights from the original model in bf16 and adds them to the quantization ignore list, giving you working speculative decoding out of the box.

## Quantization Details

| Property | Value |
|---|---|
| Method | NVFP4 via nvidia-modelopt 0.41.0 |
| Group size | 16 |
| Calibration | 256 samples from CNN/DailyMail (train split), `max_seq_len=2048` |
| Excluded from quantization | `lm_head`, all conv1d layers (Mamba blocks), all MTP linear layers |
| MTP head | bf16 (15 tensors, ~850 MB, grafted from the original model) |
| Total checkpoint size | ~19 GB |

## Usage (vLLM)

Requires vLLM with Qwen3.5 support (tested with `vllm/vllm-openai:qwen3_5-cu130`, vLLM 0.16.0rc2).

### With MTP speculative decoding (recommended)

```bash
vllm serve osoleve/Qwen3.5-27B-NVFP4-MTP \
    --trust-remote-code \
    --gpu-memory-utilization 0.85 \
    --max-model-len 8192 \
    --quantization modelopt \
    --language-model-only \
    --speculative-config '{"method":"mtp","num_speculative_tokens":1}'
```

### Without MTP (still faster than bf16)

```bash
vllm serve osoleve/Qwen3.5-27B-NVFP4-MTP \
    --trust-remote-code \
    --gpu-memory-utilization 0.85 \
    --max-model-len 8192 \
    --quantization modelopt \
    --language-model-only
```

### Important flags

- `--language-model-only` — **Required.** Qwen3.5 is natively multimodal; this flag skips the vision encoder (no vision weights are included in this checkpoint).
- `--quantization modelopt` — **Required.** Tells vLLM to use the NVFP4 weight format.
- `"num_speculative_tokens": 1` — The model has a single MTP layer; setting this to 2 reuses the same layer twice, with degraded acceptance rates.

### Disabling thinking mode

Qwen3.5 defaults to thinking/reasoning mode. To disable it per-request:

```json
{
    "chat_template_kwargs": {"enable_thinking": false}
}
```
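A minimal sketch of a complete request body carrying this option, aimed at vLLM's OpenAI-compatible `/v1/chat/completions` endpoint (the prompt is illustrative):

```python
import json

# Request body for vLLM's OpenAI-compatible /v1/chat/completions endpoint.
# chat_template_kwargs is forwarded to the chat template, where
# enable_thinking=False suppresses the reasoning block.
payload = {
    "model": "osoleve/Qwen3.5-27B-NVFP4-MTP",
    "messages": [{"role": "user", "content": "Summarize NVFP4 in one sentence."}],
    "chat_template_kwargs": {"enable_thinking": False},
}

body = json.dumps(payload)
# POST `body` to http://localhost:8000/v1/chat/completions with
# Content-Type: application/json (e.g. via curl or the openai client).
print(body)
```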

## Benchmarks

Tested on NVIDIA DGX Spark (GB10 SoC, 128 GB unified CPU/GPU memory). Task: generate 692 tokens with thinking disabled.

| Configuration | Nodes | tok/s | Speedup |
|---|---|---|---|
| bf16, TP=2 | 2 | ~5.8 | 1.0x |
| bf16, TP=2 + MTP(2) | 2 | ~13.0 | 2.2x |
| NVFP4 (this model) | 1 | ~10.5 | 1.8x |
| NVFP4 + MTP(1) (this model) | 1 | ~19.7 | 3.4x |
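The speedup column is each configuration's throughput divided by the bf16 TP=2 baseline; a quick check of the arithmetic:

```python
# Throughputs (tok/s) from the benchmark table; baseline is bf16 with TP=2.
baseline = 5.8
configs = {
    "bf16, TP=2 + MTP(2)": 13.0,
    "NVFP4 (this model)": 10.5,
    "NVFP4 + MTP(1) (this model)": 19.7,
}
speedups = {name: round(rate / baseline, 1) for name, rate in configs.items()}
print(speedups)
# {'bf16, TP=2 + MTP(2)': 2.2, 'NVFP4 (this model)': 1.8, 'NVFP4 + MTP(1) (this model)': 3.4}
```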

Memory footprint: ~18.5 GiB with MTP, leaving ~80 GiB for KV cache on a 128 GB system.

## How This Was Made

1. Loaded Qwen/Qwen3.5-27B in bf16 via `AutoModelForCausalLM`
2. Quantized to NVFP4 using `modelopt.torch.quantization.quantize()` with `NVFP4_DEFAULT_CFG`
3. Exported via `modelopt.torch.export.unified_export_hf.export_hf_checkpoint()`
4. Grafted 15 MTP tensors from the original checkpoint into the quantized safetensors
5. Added the MTP linear layers to `quantization_config.ignore` in `config.json`
6. Used the original model's `config.json` structure (`ForConditionalGeneration` with `text_config`/`vision_config`) to match weight naming conventions
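The graft in steps 4–5 can be sketched with plain dictionaries standing in for the safetensors state dict and `config.json` (the tensor and module names below are illustrative placeholders, not the checkpoint's exact keys):

```python
# Toy state dicts: the quantized checkpoint (MTP head dropped during export)
# and the original bf16 checkpoint, which still contains the MTP tensors.
quantized = {"model.layers.0.mlp.down_proj.weight": "nvfp4-packed"}
original = {
    "model.layers.0.mlp.down_proj.weight": "bf16",
    "model.mtp.fc.weight": "bf16",    # illustrative MTP key
    "model.mtp.norm.weight": "bf16",  # illustrative MTP key
}

# Step 4: copy every MTP tensor from the original into the quantized dict.
mtp_keys = [k for k in original if ".mtp." in k]
for k in mtp_keys:
    quantized[k] = original[k]

# Step 5: extend the quantization ignore list so the serving engine loads
# the MTP linears as plain bf16 weights instead of expecting NVFP4 scales.
config = {"quantization_config": {"ignore": ["lm_head"]}}
config["quantization_config"]["ignore"].extend(
    k.rsplit(".weight", 1)[0] for k in mtp_keys
)
print(sorted(config["quantization_config"]["ignore"]))
```

In the real pipeline the same merge is done on safetensors shards, and the updated ignore list is written back into the exported `config.json`.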

## License

Apache 2.0, following the original Qwen3.5-27B license.
