# Qwen3.5-27B-NVFP4-Full (W4A4)
NVFP4 quantization of Qwen/Qwen3.5-27B with all linear layers quantized, including the DeltaNet linear attention projections that are typically excluded.
## Key differences from standard NVFP4 checkpoints

| Layer group | Standard NVFP4 (e.g., Sehyo) | This checkpoint |
|---|---|---|
| MoE experts | FP4 | FP4 |
| Shared experts | FP4 | FP4 |
| Self-attention (q/k/v/o) | FP4 | FP4 |
| DeltaNet (in_proj_qkv, in_proj_z, out_proj) | BF16 | FP4 |
| DeltaNet (in_proj_a, in_proj_b) | BF16 | BF16 (N=48, below CUTLASS tile minimum) |
| Model size | 27 GB | 20 GB |
## Performance (DGX Spark / GB10 / SM121)
Measured with vLLM 0.19.1 + FlashInfer 0.6.7, CUTLASS W4A4 backend, no MTP:
| Metric | Standard NVFP4 | This checkpoint | Improvement |
|---|---|---|---|
| Decode (tg32) | 7.93 tok/s | 11.98 tok/s | +51% |
| Decode @ d4096 | 7.66 tok/s | 11.90 tok/s | +55% |
| Decode @ d8192 | 7.92 tok/s | 11.80 tok/s | +49% |
| Prefill (pp2048) | 1855 tok/s | 2383 tok/s | +28% |
The speedup comes from eliminating ~5 GB of BF16 weight loads per token for the DeltaNet layers, replacing them with ~1.4 GB of FP4 loads.
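If decode is memory-bandwidth bound (tokens/s inversely proportional to bytes read per token), the measured speedup lets us back out the per-token read traffic. A quick sketch using only the figures above; the bandwidth-bound model itself is an assumption:

```python
# Back-of-envelope: if decode is bandwidth bound, tok/s ~ 1 / bytes_per_token.
# All inputs come from the tables/prose above.
bf16_deltanet_gb = 5.0   # BF16 DeltaNet weight reads per token (approx, from text)
fp4_deltanet_gb = 1.4    # same layers after NVFP4 (approx, from text)
old_toks = 7.93          # measured decode, standard NVFP4 checkpoint
new_toks = 11.98         # measured decode, this checkpoint

saved_gb = bf16_deltanet_gb - fp4_deltanet_gb  # ~3.6 GB/token saved
ratio = new_toks / old_toks                    # measured speedup, ~1.51x
# new/old = old_bytes / (old_bytes - saved)  =>  solve for old_bytes:
old_bytes_gb = saved_gb * ratio / (ratio - 1.0)
print(f"implied per-token read traffic (standard checkpoint): {old_bytes_gb:.1f} GB")
```

The implied ~10-11 GB per token is well below the 27 GB checkpoint size, which is plausible for a MoE model where only the active experts' weights are read each decode step.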
## Quality benchmarks (0-shot, 200-sample subsets)
| Benchmark | Metric | This checkpoint | BF16 typical | Recovery |
|---|---|---|---|---|
| ARC-Challenge | acc_norm | 63.5% | ~66% | ~96% |
| HellaSwag | acc_norm | 74.0% | ~78% | ~95% |
| TruthfulQA MC2 | acc | 54.2% | ~55% | ~99% |
| Winogrande | acc | 51.5% | ~52% | ~99% |
95-99% quality recovery across knowledge and reasoning benchmarks. Quantizing the DeltaNet linear attention layers to FP4 is near-lossless.
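The Recovery column is simply the quantized score divided by the BF16 reference; it can be reproduced directly from the table values:

```python
# (quantized score, typical BF16 score) pairs taken from the table above
results = {
    "ARC-Challenge": (63.5, 66.0),
    "HellaSwag": (74.0, 78.0),
    "TruthfulQA MC2": (54.2, 55.0),
    "Winogrande": (51.5, 52.0),
}
for name, (quant, bf16) in results.items():
    print(f"{name}: {100 * quant / bf16:.0f}% recovery")
```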
Note: GSM8k results are excluded as the model's thinking/reasoning output format interferes with lm-eval-harness answer extraction, producing unreliable scores. Subjective quality in interactive use (Open WebUI, chat API) is excellent with reasoning intact.
## Quantization details
- Method: llm-compressor `oneshot` with calibrated NVFP4 (W4A4)
- Calibration: 256 samples from HuggingFaceH4/ultrachat_200k, max_seq_length=4096
- Format: compressed-tensors `nvfp4-pack-quantized` with calibrated `input_global_scale`
- Excluded layers: `in_proj_a`, `in_proj_b` (N=48, CUTLASS FP4 requires N%64==0), `conv1d` (3D), norms, `A_log`, `dt_bias`, `lm_head`, `embed_tokens`
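The N % 64 rule can be checked per projection before deciding what to exclude. A hypothetical helper (the layer shapes below are illustrative, not read from the model config):

```python
def fp4_eligible(out_features: int) -> bool:
    """CUTLASS NVFP4 GEMM requires the output dimension N to be a
    multiple of 64 (per the exclusion rule above)."""
    return out_features % 64 == 0

# Illustrative: the narrow DeltaNet gate projections (N=48) fail the
# check and stay in BF16, while wide projections pass.
print(fp4_eligible(48))    # False -> keep in_proj_a / in_proj_b in BF16
print(fp4_eligible(4096))  # True  -> safe to quantize
```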
## Usage

### vLLM (recommended)
Requires vLLM >= 0.19.1 with PR #38423 (W4A4 SM120/SM121 support) and FlashInfer >= 0.6.7.
```bash
vllm serve rdtand/Qwen3.5-27B-NVFP4-DeltaNet-Included \
  --trust-remote-code \
  --kv-cache-dtype fp8 \
  --attention-backend flashinfer \
  --speculative-config '{"method":"mtp","num_speculative_tokens":3}'
```
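Once up, the server speaks the OpenAI-compatible API. A minimal client sketch (port 8000 is vLLM's default; the prompt is illustrative):

```python
import json
import urllib.request

payload = {
    "model": "rdtand/Qwen3.5-27B-NVFP4-DeltaNet-Included",
    "messages": [{"role": "user", "content": "Summarize NVFP4 in one sentence."}],
    "max_tokens": 256,
}
req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# Uncomment with the server running:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```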
## Quality notes
FP4 activation quantization of DeltaNet layers was widely assumed to be destructive to model quality. Our analysis shows the quantization error on these layers (SNR ~24 dB, relative error ~26%) is comparable to that of the other quantized layer types. The model produces coherent output with reasoning capabilities intact.
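A minimal numpy sketch of blockwise E2M1 (FP4) quantization gives a feel for where error figures of this magnitude come from. It uses 16-element blocks with absmax scaling; real NVFP4 additionally stores the block scales in E4M3 plus a per-tensor scale, which is omitted here for simplicity:

```python
import numpy as np

# Signed E2M1 grid: the 15 distinct values representable in FP4.
E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
GRID = np.concatenate([-E2M1[:0:-1], E2M1])

def quantize_fp4_blocks(x, block=16):
    """Absmax-scale each 16-element block into [-6, 6], snap to the
    E2M1 grid, then rescale. Scales stay in float64 for simplicity."""
    xb = x.reshape(-1, block)
    scale = np.abs(xb).max(axis=1, keepdims=True) / 6.0
    scale = np.where(scale == 0.0, 1.0, scale)
    idx = np.abs(xb / scale - GRID[:, None, None]).argmin(axis=0)
    return (GRID[idx] * scale).reshape(x.shape)

rng = np.random.default_rng(0)
x = rng.standard_normal(1 << 14)
dq = quantize_fp4_blocks(x)
snr_db = 10 * np.log10(np.sum(x**2) / np.sum((x - dq) ** 2))
print(f"SNR on Gaussian data: {snr_db:.1f} dB")
```

On Gaussian inputs this lands in the same ballpark (roughly 20 dB); real activations and the calibrated global scale shift the exact figure.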
## Required llm-compressor fix
Quantizing the DeltaNet layers requires vllm-project/llm-compressor#2566, which fixes `model_free_ptq` for models with non-contiguous fused attention layers (Qwen3.5's interleaved `self_attn` + `linear_attn` architecture).
## Acknowledgments