# Qwen3.5-27B-NVFP4-Full (W4A4)

NVFP4 quantization of Qwen/Qwen3.5-27B with all linear layers quantized, including the DeltaNet linear attention projections that are typically excluded.

## Key differences from standard NVFP4 checkpoints

| Component | Standard NVFP4 (e.g., Sehyo) | This checkpoint |
|---|---|---|
| MoE experts | FP4 | FP4 |
| Shared experts | FP4 | FP4 |
| Self-attention (q/k/v/o) | FP4 | FP4 |
| DeltaNet (in_proj_qkv, in_proj_z, out_proj) | BF16 | FP4 |
| DeltaNet (in_proj_a, in_proj_b) | BF16 | BF16 (N=48, below CUTLASS tile minimum) |
| Model size | 27 GB | 20 GB |

## Performance (DGX Spark / GB10 / SM121)

Measured with vLLM 0.19.1 + FlashInfer 0.6.7, CUTLASS W4A4 backend, no MTP:

| Metric | Standard NVFP4 | This checkpoint | Improvement |
|---|---|---|---|
| Decode (tg32) | 7.93 tok/s | 11.98 tok/s | +51% |
| Decode @ d4096 | 7.66 tok/s | 11.90 tok/s | +55% |
| Decode @ d8192 | 7.92 tok/s | 11.80 tok/s | +49% |
| Prefill (pp2048) | 1855 tok/s | 2383 tok/s | +28% |

The speedup comes from eliminating ~5 GB of BF16 weight loads per token for the DeltaNet layers, replacing them with ~1.4 GB of FP4 loads.
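As a sanity check, if decode were purely memory-bandwidth bound, tokens/s would scale inversely with the bytes of weights streamed per token, and the measured speedup lets us back out the implied per-token weight traffic. This is a rough order-of-magnitude sketch (real traffic also includes KV cache reads and only the active MoE experts), not a measurement:

```python
# Back-of-envelope check, assuming decode throughput scales inversely with
# bytes of weights read per token. Figures come from the tables above.
saved_gb = 5.0 - 1.4                     # DeltaNet BF16 -> FP4 savings per token
speedup = 11.98 / 7.93                   # measured tg32 ratio, ~1.51x

# If b_old / (b_old - saved_gb) == speedup, solve for b_old:
b_old = saved_gb * speedup / (speedup - 1.0)
b_new = b_old - saved_gb
print(f"implied weight traffic: {b_old:.1f} GB/token -> {b_new:.1f} GB/token")
```

The implied baseline traffic of roughly 10-11 GB per token is plausible for a MoE model whose total weights are 27 GB but where only a subset of experts is active per token.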

## Quality benchmarks (0-shot, 200-sample subsets)

| Benchmark | Metric | This checkpoint | BF16 typical | Recovery |
|---|---|---|---|---|
| ARC-Challenge | acc_norm | 63.5% | ~66% | ~96% |
| HellaSwag | acc_norm | 74.0% | ~78% | ~95% |
| TruthfulQA MC2 | acc | 54.2% | ~55% | ~99% |
| Winogrande | acc | 51.5% | ~52% | ~99% |

The checkpoint recovers 95-99% of BF16 quality across knowledge and reasoning benchmarks, indicating that quantizing the DeltaNet linear attention layers to FP4 is near-lossless.

Note: GSM8k results are excluded as the model's thinking/reasoning output format interferes with lm-eval-harness answer extraction, producing unreliable scores. Subjective quality in interactive use (Open WebUI, chat API) is excellent with reasoning intact.

## Quantization details

- Method: llm-compressor oneshot with calibrated NVFP4 (W4A4)
- Calibration: 256 samples from HuggingFaceH4/ultrachat_200k, max_seq_length=4096
- Format: compressed-tensors nvfp4-pack-quantized with calibrated input_global_scale
- Excluded layers: in_proj_a, in_proj_b (N=48, CUTLASS FP4 requires N%64==0), conv1d (3D), norms, A_log, dt_bias, lm_head, embed_tokens
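The N%64 exclusion rule above can be expressed as a simple eligibility check when building an ignore list. A minimal sketch (the layer names and the large N value are illustrative, not the model's actual shapes):

```python
# The CUTLASS FP4 GEMM tiles the output dimension N in multiples of 64, so
# any linear layer with N % 64 != 0 must stay in BF16.
def fp4_eligible(out_features: int, tile_n: int = 64) -> bool:
    """True if a linear layer's output dimension fits the CUTLASS FP4 tile."""
    return out_features % tile_n == 0

# Illustrative layer shapes; in_proj_a / in_proj_b really have N=48.
layers = {
    "in_proj_qkv": 12288,   # divisible by 64 -> quantize to FP4
    "in_proj_a": 48,        # 48 % 64 != 0   -> keep in BF16
    "in_proj_b": 48,
}
ignore = [name for name, n in layers.items() if not fp4_eligible(n)]
print(ignore)  # -> ['in_proj_a', 'in_proj_b']
```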

## Usage

### vLLM (recommended)

Requires vLLM >= 0.19.1 with PR #38423 (W4A4 SM120/SM121 support) and FlashInfer >= 0.6.7.

```shell
vllm serve rdtand/Qwen3.5-27B-NVFP4-DeltaNet-Included \
  --trust-remote-code \
  --kv-cache-dtype fp8 \
  --attention-backend flashinfer \
  --speculative-config '{"method":"mtp","num_speculative_tokens":3}'
```
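Once the server is up, any OpenAI-compatible client can talk to it. A minimal sketch of the chat-completions request payload (the localhost URL and sampling parameters are assumptions, not part of this card):

```python
import json

# Request body for the vLLM OpenAI-compatible endpoint, normally served at
# http://localhost:8000/v1/chat/completions by default.
payload = {
    "model": "rdtand/Qwen3.5-27B-NVFP4-DeltaNet-Included",
    "messages": [
        {"role": "user", "content": "Summarize NVFP4 quantization in one sentence."}
    ],
    "max_tokens": 256,      # illustrative sampling settings
    "temperature": 0.7,
}
body = json.dumps(payload)
# e.g. requests.post("http://localhost:8000/v1/chat/completions",
#                    data=body, headers={"Content-Type": "application/json"})
print(body[:60])
```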

## Quality notes

FP4 activation quantization on DeltaNet layers was widely assumed to be destructive to model quality. Our analysis shows the quantization error on the DeltaNet projections (SNR ~24 dB, relative error ~26%) is in line with the other quantized layer types (also SNR ~24 dB, relative error ~26%), and the model produces coherent output with reasoning capabilities intact.

## Required llm-compressor fix

Quantizing the DeltaNet layers requires vllm-project/llm-compressor#2566, which fixes model_free_ptq for models with non-contiguous fused attention layers (Qwen3.5's interleaved self_attn + linear_attn architecture).

## Acknowledgments

- Sehyo for the original Qwen3.5 NVFP4 quantization work and llm-compressor PR #2383
- eugr for spark-vllm-docker infrastructure
- Built on DGX Spark (GB10, SM121)