Qwen3.6-27B-NVFP4

Mixed-precision NVFP4 + FP8 quantization of Qwen/Qwen3.6-27B targeting native Blackwell (SM120) deployment — RTX 5090 32 GB, RTX 6000 Pro 96 GB.

The original BF16 checkpoint needs ~52 GiB of VRAM. This build fits a single RTX 5090 32 GB at 32k context with usable KV cache, multi-turn reasoning, and tool-call support.

Variants

The repository hosts three branches, each tuned for a different deployment profile. Pick one with revision= in from_pretrained or --revision in the HF CLI.

Branch lm_head embed_tokens Use case Status
main BF16 BF16 Workflow / tool-call / structured-output Recommended default
fp8-head FP8_BLOCK [128, 128] BF16 Free-form text generation, more concurrency Stable
fp8-head-embed FP8_BLOCK [128, 128] FP8_BLOCK [128, 128] Maximum VRAM saving / max concurrency Lab only — see caveat below

All three branches share the same inner-layer quantization scheme (NVFP4 on the large Linears, FP8 per-tensor on accuracy-sensitive Linears, BF16 on normalization / GDN sub-projections / MTP / visual tower). They differ only in the precision of the embedding boundary layers.

VRAM and concurrency comparison

Measured on a single RTX 5090 32 GB at max_model_len=32768 with --kv-cache-dtype turboquant_4bit_nc, graph capture on, gpu_memory_utilization=0.92. Decode tok/s is single-request steady-state.

Build Model load KV budget KV tokens Concurrency @ 32k Decode tok/s
Upstream BF16 base (reference) ~52 GiB n/a n/a does not fit
main (bf16head) 22.07 GiB 4.47 GiB 70,656 5.47× 55.2
fp8-head 20.88 GiB ~6 GiB 89,088 6.88× 58.6
fp8-head-embed (lab) 19.70 GiB 6.83 GiB 107,520 8.35× 58.4

For reference, the same main build with --kv-cache-dtype fp8_e4m3 (no TurboQuant — works on stock vLLM 0.20.x without PR #39931) reaches roughly 3.4× concurrency at 32k. TurboQuant gives the bulk of the concurrency gain; the FP8 head / embed tier only adds incremental room.

VRAM saving from the boundary-layer precision step (all relative to the main BF16-head build):

Step VRAM saved Concurrency tax / benefit Notes
mainfp8-head (FP8_BLOCK lm_head) −1.19 GiB +1.41× concurrency, −0.6 tok/s small precision risk on output projection (token logits)
fp8-headfp8-head-embed (FP8_BLOCK embed_tokens) −1.18 GiB +1.47× concurrency, ≤−0.5 tok/s input-side precision risk; see lab-only caveat below

The decode throughput delta between branches is small (≤ 6 %) because the NVFP4 inner-layer GEMM is the actual compute bottleneck on Blackwell, not the lm_head matmul.

Branch selection guidance

  • main (BF16 head + BF16 embed) preserves full output-projection precision. Recommended for structured output where a single mis-projected token can break a tool call (JSON brace, enum value, ID, date).
  • fp8-head (FP8 lm_head + BF16 embed) trades a small amount of output precision for ~+25 % concurrency at the same context length. Safe for free-form text generation in local evaluation.
  • fp8-head-embed (FP8 lm_head + FP8 embed) gives the highest concurrency on the same hardware. A targeted regression test surfaced a stable semantic drift on a short arithmetic prompt — block-FP8 quantization of the embedding table appears to corrupt prompt understanding on number-comparison cases. Kept as a published lab artefact for reproducibility; not recommended for production.

Architecture / quantization summary

This is a compressed-tensors checkpoint with the following per-group scheme:

  • NVFP4 (W4A4, group_size 16, FP8 e4m3 scales) on mlp.{gate_proj, up_proj} and linear_attn.{in_proj_qkv, in_proj_z, out_proj}. These are the largest Linears and where the bulk of VRAM savings comes from on Blackwell GEMM hardware.
  • FP8 W8A8 dynamic, per-tensor strategy on mlp.down_proj and self_attn.{q_proj, k_proj, v_proj, o_proj}. Uses vLLM's CutlassFP8ScaledMMLinearKernel at runtime.
  • FP8_BLOCK [128, 128] on lm_head (and on embed_tokens for the fp8-head-embed branch). Uses vLLM's CutlassFp8BlockScaledMMKernel.
  • BF16 (unquantized) on all normalization layers, GDN conv1d / A_log / dt_bias / in_proj_a / in_proj_b, MTP head, and the full visual tower.

The visual tower is kept in the checkpoint for completeness; for text-only deployments the encoder cache reservation is unused.

Recipe and calibration

The quantization recipe is based on the Red Hat / Neural Magic NVFP4 recipe published in llm-compressor, extended with:

  • FP8 per-tensor strategy for the down/self_attn projections.
  • FP8_BLOCK [128, 128] on lm_head / embed_tokens (variant branches only).
  • BF16 ignore list covering normalization layers, GDN sub-projections, MTP head, and the visual tower.

Activation scales were calibrated one-shot using the LLM Compressor pipeline with sequential subgraph tracing. The calibration corpus is a 1280-sample mix weighted for non-English coverage (Czech and central European multilingual, legal / formal prose, code, math reasoning, instruction following). Default Red Hat calibration corpora are predominantly English; the corpus mix was modified to reduce drift on multilingual reasoning and on legal / structured output tasks. Calibration ran on RTX 6000 Pro 96 GB.

The recipe.yaml shipped with each branch is the exact LLM-Compressor recipe used for that variant.

Files

File Purpose
model.safetensors Quantized weights (compressed-tensors format)
model.safetensors.index.json Tensor index
model_mtp.safetensors BF16 Multi-Token-Prediction head (preserved for speculative decoding)
config.json Model config with quantization_config block
generation_config.json Default generation params
chat_template.jinja Qwen3 chat template with thinking-mode markers
tokenizer*.json / tokenizer_config.json Tokenizer
processor_config.json Processor config
recipe.yaml LLM-Compressor recipe used for the export

Recommended vLLM serve config

Requires vLLM ≥ 0.20.0 with compressed-tensors and NVFP4 support.

The fp8-head and fp8-head-embed branches need an additional small dispatch patch for FP8 ParallelLMHead / VocabParallelEmbedding on the compressed-tensors path; the standard compressed_tensors dispatcher in vLLM 0.20.x routes only LinearBase and ParallelLMHead through FP8 schemes and does not yet have a VocabParallelEmbedding branch. The main branch (BF16 head + BF16 embed) needs no extra patches.

If --kv-cache-dtype is set to a TurboQuant preset (e.g. turboquant_4bit_nc), make sure your vLLM build includes PR #39931 for hybrid attention support (Qwen3.6 mixes standard self-attention with linear-attention / Mamba-style layers). The PR was merged to vllm-project/vllm:main on 2026-05-05, so current vLLM main / nightly builds after that date should already include it. Pinned releases and older vendor images still need either the PR applied or a newer nightly/base image.

Example serve command (single RTX 5090 32 GB, 32k context):

vllm serve /path/to/checkpoint \
  --served-model-name qwen3.6-27b \
  --max-model-len 32768 \
  --max-num-batched-tokens 4096 \
  --gpu-memory-utilization 0.92 \
  --kv-cache-dtype turboquant_4bit_nc \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --max-num-seqs 32 \
  --tensor-parallel-size 1 \
  --dtype bfloat16 \
  --trust-remote-code

For a simpler config without TurboQuant, use --kv-cache-dtype fp8_e4m3 (supported on vLLM 0.20+ without extra patches). Concurrency will be lower.

Runtime requirements per branch

Two independent patches may be needed depending on which branch you serve and which KV cache dtype you choose:

Patch Source When required
TurboQuant hybrid attention vLLM PR #39931 Any branch, if --kv-cache-dtype turboquant_* is set. Qwen3.6 is a hybrid (self-attn + linear-attn / Mamba) architecture; the in-tree TurboQuant rejects hybrid models before this PR. The PR was merged to vllm-project/vllm:main on 2026-05-05, so current vLLM main/nightly builds after that date should already include it. Stock releases cut before that date, including v0.20.x, still need either the PR applied or a newer nightly/base image.
Compressed-tensors FP8 head/embed dispatch vllm_patches/ fp8-head and fp8-head-embed branches only. The in-tree compressed-tensors dispatcher in vLLM 0.20.x routes only LinearBase, ParallelLMHead, Attention, and FusedMoE modules through quant schemes; FP8 weight loading on lm_head (and embed_tokens for fp8-head-embed) needs the additional dispatch patch shipped here. An upstream PR for this is in progress.

Combined matrix:

Branch KV cache dtype Needs PR #39931? Needs vllm_patches/?
main fp8_e4m3 (or default) no no
main turboquant_4bit_nc (recommended for max concurrency) yes no
fp8-head fp8_e4m3 no yes
fp8-head turboquant_4bit_nc yes yes
fp8-head-embed (lab) fp8_e4m3 no yes
fp8-head-embed (lab) turboquant_4bit_nc yes yes

If your vLLM build or base image already includes PR #39931 — for example a vLLM main/nightly build from after 2026-05-05 — you only need the vllm_patches/ overlay for the FP8 head/embed branches. The main branch runs out of the box on such builds.

What vllm_patches/ adds

The fp8-head and fp8-head-embed branches need a small dispatch patch because the in-tree compressed-tensors dispatcher in vLLM 0.20.x routes only LinearBase, ParallelLMHead, Attention, and FusedMoE modules through quant schemes. To activate FP8 weight loading on the LM head and embedding layers on the compressed-tensors path, the dispatcher needs:

  • a quant_config pass-through into the ParallelLMHead constructor (the upstream Fp8Config path already has this since PR #41000 — the compressed-tensors port needs the same wire-up in the model file);
  • a generalized scale-companion loader in VocabParallelEmbedding.weight_loader (analogous to PR #41365 for the legacy Fp8Config path);
  • (for fp8-head-embed only) a new CompressedTensorsFp8EmbeddingMethod plus a VocabParallelEmbedding dispatch branch in CompressedTensorsConfig.get_quant_method.

The full patch ships in this repo under the vllm_patches/ folder. An upstream PR for the compressed-tensors port is in progress; once landed in vllm-project/vllm you can drop the local patch.

Apply the patch (local dev)

# Clone a vLLM source tree for the compressed-tensors FP8 dispatch patch.
# Stock v0.20.2 is fine for fp8_e4m3 KV-cache testing, but it does NOT include
# TurboQuant hybrid attention support. For --kv-cache-dtype turboquant_* use
# vLLM main/nightly from 2026-05-05 or later, or a base image with PR #39931.
git clone --depth 1 --branch v0.20.2 https://github.com/vllm-project/vllm.git vllm-src
cd vllm-src

# Grab the patcher + embedding method file from this repo
curl -O https://huggingface.co/inferRouter/Qwen3.6-27B-NVFP4/resolve/main/vllm_patches/apply_ct_fp8_lmhead_patch.py
curl -O https://huggingface.co/inferRouter/Qwen3.6-27B-NVFP4/resolve/main/vllm_patches/compressed_tensors_embedding.py

# Apply
python3 apply_ct_fp8_lmhead_patch.py .

# Verify
python3 -m py_compile \
  vllm/model_executor/models/qwen3_5.py \
  vllm/model_executor/layers/vocab_parallel_embedding.py \
  vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py \
  vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors_embedding.py

# Build local image (or pip install -e .) per your usual flow

The patcher is idempotent and detects when PR #41000 has already wired quant_config into ParallelLMHead; in that case it skips the first patch point. For the compressed-tensors FP8 head/embed dispatch itself, it works on both stock v0.20.2 and on Red Hat / Neural Magic vLLM nightlies that already include the upstream lm_head FP8 work. TurboQuant hybrid support is separate: current vLLM main/nightly builds after 2026-05-05 should already include PR #39931, while older pinned releases need the PR applied or a newer base image.

Docker overlay (recommended)

If you already have a base image with PR #39931 (TurboQuant hybrid) and PR #41000 (FP8 lm_head on the legacy Fp8Config path), the simplest path is to overlay just the dispatch patch on top:

FROM <your-vllm-image-with-PR-39931-and-PR-41000>

ADD https://huggingface.co/inferRouter/Qwen3.6-27B-NVFP4/resolve/main/vllm_patches/apply_ct_fp8_lmhead_patch.py /tmp/
ADD https://huggingface.co/inferRouter/Qwen3.6-27B-NVFP4/resolve/main/vllm_patches/compressed_tensors_embedding.py /tmp/

RUN python3 /tmp/apply_ct_fp8_lmhead_patch.py /usr/local/lib/python3.12/site-packages

Verify dispatch at runtime

After the patch is applied, the engine startup log should include:

Selected CutlassFp8BlockScaledMMKernel for CompressedTensorsW8A8Fp8   ← fp8-head (lm_head FP8_BLOCK)
Selected CutlassFP8ScaledMMLinearKernel for CompressedTensorsW8A8Fp8  ← shared (down_proj / self_attn FP8 per-tensor)
Using FlashInferCutlassNvFp4LinearKernel for NVFP4 GEMM               ← shared (MLP / linear_attn NVFP4)

For fp8-head-embed additionally:

Using CompressedTensorsFp8EmbeddingMethod for language_model.model.embed_tokens
with scheme CompressedTensorsW8A16Fp8

If the engine still logs UnquantizedEmbeddingMethod for embed_tokens on the fp8-head-embed branch, the patcher didn't reach the in-container compressed_tensors.py — re-check the install path.

Constraints (fp8-head, fp8-head-embed)

  • TP = 1 only. The patch raises NotImplementedError on tp_size > 1 for the scale-companion path. TP > 1 needs block-scale sharding on the vocab axis, not yet implemented.
  • FA2 only when TurboQuant is enabled. TurboQuant is not yet compatible with FlashAttention ≥ 3; vLLM auto-overrides to FA2.
  • CUDA graph capture validated for fp8-head (graph capture preserves the FP8_BLOCK lm_head dispatch). For fp8-head-embed, capture works in smoke testing but has not been stress-tested.

Inference example

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

tok = AutoTokenizer.from_pretrained("inferRouter/Qwen3.6-27B-NVFP4")
# Default = main branch (BF16 head + BF16 embed). For other variants:
#   revision="fp8-head"        FP8 lm_head, BF16 embed
#   revision="fp8-head-embed"  FP8 lm_head + FP8 embed (lab only)
model = AutoModelForCausalLM.from_pretrained(
    "inferRouter/Qwen3.6-27B-NVFP4",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

The intended runtime for this checkpoint is vLLM. The compressed-tensors NVFP4 path through transformers is not the optimized path on Blackwell; use vLLM for production deploys.

Credits

License

Apache 2.0 (inherited from the Qwen3.6 base model). See the upstream Qwen model card for any base-model-specific usage notes.

Downloads last month
1,496
Safetensors
Model size
20B params
Tensor type
F32
·
BF16
·
F8_E4M3
·
U8
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for inferRouter/Qwen3.6-27B-NVFP4

Base model

Qwen/Qwen3.6-27B
Quantized
(336)
this model