Instructions to use inferRouter/Qwen3.6-27B-NVFP4 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use inferRouter/Qwen3.6-27B-NVFP4 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="inferRouter/Qwen3.6-27B-NVFP4")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForMultimodalLM

processor = AutoProcessor.from_pretrained("inferRouter/Qwen3.6-27B-NVFP4")
model = AutoModelForMultimodalLM.from_pretrained("inferRouter/Qwen3.6-27B-NVFP4")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use inferRouter/Qwen3.6-27B-NVFP4 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "inferRouter/Qwen3.6-27B-NVFP4"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "inferRouter/Qwen3.6-27B-NVFP4",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker

docker model run hf.co/inferRouter/Qwen3.6-27B-NVFP4

SGLang

How to use inferRouter/Qwen3.6-27B-NVFP4 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "inferRouter/Qwen3.6-27B-NVFP4" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "inferRouter/Qwen3.6-27B-NVFP4",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "inferRouter/Qwen3.6-27B-NVFP4" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "inferRouter/Qwen3.6-27B-NVFP4",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Docker Model Runner
How to use inferRouter/Qwen3.6-27B-NVFP4 with Docker Model Runner:
```
docker model run hf.co/inferRouter/Qwen3.6-27B-NVFP4
```

Qwen3.6-27B-NVFP4

Mixed-precision NVFP4 + FP8 quantization of Qwen/Qwen3.6-27B targeting native Blackwell (SM120) deployment — RTX 5090 32 GB, RTX 6000 Pro 96 GB.

The original BF16 checkpoint needs ~52 GiB of VRAM. This build fits a single RTX 5090 32 GB at 32k context with usable KV cache, multi-turn reasoning, and tool-call support.

Variants

The repository hosts three branches, each tuned for a different deployment profile. Pick one with revision= in from_pretrained or --revision in the HF CLI.

| Branch | `lm_head` | `embed_tokens` | Use case | Status |
| --- | --- | --- | --- | --- |
| `main` | BF16 | BF16 | Workflow / tool-call / structured-output | Recommended default |
| `fp8-head` | FP8_BLOCK [128, 128] | BF16 | Free-form text generation, more concurrency | Stable |
| `fp8-head-embed` | FP8_BLOCK [128, 128] | FP8_BLOCK [128, 128] | Maximum VRAM saving / max concurrency | Lab only; see caveat below |

All three branches share the same inner-layer quantization scheme (NVFP4 on the large Linears, FP8 per-tensor on accuracy-sensitive Linears, BF16 on normalization / GDN sub-projections / MTP / visual tower). They differ only in the precision of the embedding boundary layers.

VRAM and concurrency comparison

Measured on a single RTX 5090 32 GB at max_model_len=32768 with --kv-cache-dtype turboquant_4bit_nc, graph capture on, gpu_memory_utilization=0.92. Decode tok/s is single-request steady-state.

| Build | Model load | KV budget | KV tokens | Concurrency @ 32k | Decode tok/s |
| --- | --- | --- | --- | --- | --- |
| Upstream BF16 base (reference) | ~52 GiB | n/a | n/a | does not fit | n/a |
| `main` (bf16head) | 22.07 GiB | 4.47 GiB | 70,656 | 5.47× | 55.2 |
| `fp8-head` | 20.88 GiB | ~6 GiB | 89,088 | 6.88× | 58.6 |
| `fp8-head-embed` (lab) | 19.70 GiB | 6.83 GiB | 107,520 | 8.35× | 58.4 |

For reference, the same main build with --kv-cache-dtype fp8_e4m3 (no TurboQuant — works on stock vLLM 0.20.x without PR #39931) reaches roughly 3.4× concurrency at 32k. TurboQuant gives the bulk of the concurrency gain; the FP8 head / embed tier only adds incremental room.

VRAM saved by each boundary-layer precision step (each row is relative to the build on the left of the arrow):

| Step | VRAM saved | Concurrency tax / benefit | Notes |
| --- | --- | --- | --- |
| `main` → `fp8-head` (FP8_BLOCK lm_head) | −1.19 GiB | +1.41× concurrency, −0.6 tok/s | small precision risk on output projection (token logits) |
| `fp8-head` → `fp8-head-embed` (FP8_BLOCK embed_tokens) | −1.18 GiB | +1.47× concurrency, ≤−0.5 tok/s | input-side precision risk; see lab-only caveat below |
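
The VRAM and concurrency deltas in the step table can be re-derived from the comparison table above. The following is a small arithmetic sketch, not part of the published tooling; the `BUILDS` dictionary and `step_delta` helper are illustrative names, and the numbers are copied from the table.

```python
# Figures copied from the comparison table:
# (model load in GiB, concurrency multiplier at 32k context).
BUILDS = {
    "main": (22.07, 5.47),
    "fp8-head": (20.88, 6.88),
    "fp8-head-embed": (19.70, 8.35),
}


def step_delta(src: str, dst: str) -> dict:
    """Return VRAM saved and concurrency gained when moving from src to dst."""
    src_load, src_conc = BUILDS[src]
    dst_load, dst_conc = BUILDS[dst]
    return {
        "vram_saved_gib": round(src_load - dst_load, 2),
        "concurrency_gain": round(dst_conc - src_conc, 2),
        "concurrency_gain_pct": round(100 * (dst_conc / src_conc - 1), 1),
    }


for src, dst in [("main", "fp8-head"), ("fp8-head", "fp8-head-embed")]:
    print(f"{src} -> {dst}: {step_delta(src, dst)}")
```

The relative gain for `main` → `fp8-head` comes out at about 25.8 %, which is where the "roughly 25 %" figure in the branch guidance comes from.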

The decode throughput delta between branches is small (≤ 6 %) because the NVFP4 inner-layer GEMM is the actual compute bottleneck on Blackwell, not the lm_head matmul.

Branch selection guidance

- main (BF16 head + BF16 embed) preserves full output-projection precision. Recommended for structured output, where a single mis-projected token (JSON brace, enum value, ID, date) can break a tool call.
- fp8-head (FP8 lm_head + BF16 embed) trades a small amount of output precision for roughly 25 % more concurrency at the same context length (6.88× vs. 5.47× at 32k). Safe for free-form text generation in local evaluation.
- fp8-head-embed (FP8 lm_head + FP8 embed) gives the highest concurrency on the same hardware. A targeted regression test surfaced a stable semantic drift on a short arithmetic prompt: block-FP8 quantization of the embedding table appears to corrupt prompt understanding on number-comparison cases. Kept as a published lab artefact for reproducibility; not recommended for production.
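
The guidance above can be condensed into a small selection helper. This is only a sketch of the decision rule as described in this card; `pick_revision` and its parameters are hypothetical names, and the returned string is meant to be passed as `revision=` to `from_pretrained`.

```python
def pick_revision(structured_output: bool, allow_lab: bool = False) -> str:
    """Map a deployment profile to a branch name of this repository.

    structured_output: True for tool calls, JSON, or other output where a
        single wrong token breaks the result.
    allow_lab: opt in to the lab-only fp8-head-embed branch (not recommended
        for production).
    """
    if structured_output:
        # Full-precision lm_head protects exact tokens such as braces and IDs.
        return "main"
    if allow_lab:
        # Highest concurrency, but known drift on number-comparison prompts.
        return "fp8-head-embed"
    return "fp8-head"


print(pick_revision(structured_output=True))
print(pick_revision(structured_output=False))
```

Structured-output workloads always resolve to `main` here, even when `allow_lab` is set, because the lab branch is the one with the reported prompt-understanding drift.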

Architecture / quantization summary

This is a compressed-tensors checkpoint with the following per-group scheme:

- NVFP4 (W4A4, group_size 16, FP8 e4m3 scales) on mlp.{gate_proj, up_proj} and linear_attn.{in_proj_qkv, in_proj_z, out_proj}. These are the largest Linears and where the bulk of VRAM savings comes from on Blackwell GEMM hardware.
- FP8 W8A8 dynamic, per-tensor strategy on mlp.down_proj and self_attn.{q_proj, k_proj, v_proj, o_proj}. Uses vLLM's CutlassFP8ScaledMMLinearKernel at runtime.
- FP8_BLOCK [128, 128] on lm_head (and on embed_tokens for the fp8-head-embed branch). Uses vLLM's CutlassFp8BlockScaledMMKernel.
- BF16 (unquantized) on all normalization layers, GDN conv1d / A_log / dt_bias / in_proj_a / in_proj_b, MTP head, and the full visual tower.

The visual tower is kept in the checkpoint for completeness; for text-only deployments the encoder cache reservation is unused.
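
To make the per-group scheme above concrete, here is a minimal sketch that maps a module name to the precision this card describes for it. It is not taken from the shipped `recipe.yaml`. The suffixes come from the list above, while the `visual` and `mtp` name components used to detect the visual tower and MTP head are assumptions about how those modules are named.

```python
NVFP4_SUFFIXES = (
    "mlp.gate_proj", "mlp.up_proj",
    "linear_attn.in_proj_qkv", "linear_attn.in_proj_z", "linear_attn.out_proj",
)
FP8_TENSOR_SUFFIXES = (
    "mlp.down_proj",
    "self_attn.q_proj", "self_attn.k_proj", "self_attn.v_proj", "self_attn.o_proj",
)
# Assumption: visual tower and MTP modules carry one of these name components.
BF16_COMPONENTS = {"visual", "mtp"}


def scheme_for(module_name: str, branch: str = "main") -> str:
    """Return 'NVFP4', 'FP8', 'FP8_BLOCK', or 'BF16' for a module name."""
    if BF16_COMPONENTS & set(module_name.split(".")):
        return "BF16"
    if module_name.endswith("lm_head"):
        return "FP8_BLOCK" if branch in ("fp8-head", "fp8-head-embed") else "BF16"
    if module_name.endswith("embed_tokens"):
        return "FP8_BLOCK" if branch == "fp8-head-embed" else "BF16"
    if module_name.endswith(NVFP4_SUFFIXES):
        return "NVFP4"
    if module_name.endswith(FP8_TENSOR_SUFFIXES):
        return "FP8"
    # Norms, GDN sub-projections (in_proj_a, in_proj_b, ...) and the rest.
    return "BF16"


print(scheme_for("language_model.model.layers.0.mlp.gate_proj"))
print(scheme_for("lm_head", branch="fp8-head"))
```

Exact suffix matching matters: `linear_attn.in_proj_a` and `linear_attn.in_proj_b` must fall through to BF16 rather than being caught by a broader `in_proj` pattern.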

Recipe and calibration

The quantization recipe is based on the Red Hat / Neural Magic NVFP4 recipe published in llm-compressor, extended with:

- FP8 per-tensor strategy for the down/self_attn projections.
- FP8_BLOCK [128, 128] on lm_head / embed_tokens (variant branches only).
- BF16 ignore list covering normalization layers, GDN sub-projections, MTP head, and the visual tower.

Activation scales were calibrated one-shot using the LLM Compressor pipeline with sequential subgraph tracing. The calibration corpus is a 1280-sample mix weighted for non-English coverage (Czech and central European multilingual, legal / formal prose, code, math reasoning, instruction following). Default Red Hat calibration corpora are predominantly English; the corpus mix was modified to reduce drift on multilingual reasoning and on legal / structured output tasks. Calibration ran on RTX 6000 Pro 96 GB.

The recipe.yaml shipped with each branch is the exact LLM-Compressor recipe used for that variant.

Files

| File | Purpose |
| --- | --- |
| `model.safetensors` | Quantized weights (compressed-tensors format) |
| `model.safetensors.index.json` | Tensor index |
| `model_mtp.safetensors` | BF16 Multi-Token-Prediction head (preserved for speculative decoding) |
| `config.json` | Model config with `quantization_config` block |
| `generation_config.json` | Default generation params |
| `chat_template.jinja` | Qwen3 chat template with thinking-mode markers |
| `tokenizer*.json` / `tokenizer_config.json` | Tokenizer |
| `processor_config.json` | Processor config |
| `recipe.yaml` | LLM-Compressor recipe used for the export |

Recommended vLLM serve config

Requires vLLM ≥ 0.20.0 with compressed-tensors and NVFP4 support.

The fp8-head and fp8-head-embed branches need an additional small dispatch patch for FP8 ParallelLMHead / VocabParallelEmbedding on the compressed-tensors path; the standard compressed_tensors dispatcher in vLLM 0.20.x routes only LinearBase and ParallelLMHead through FP8 schemes and does not yet have a VocabParallelEmbedding branch. The main branch (BF16 head + BF16 embed) needs no extra patches.

If --kv-cache-dtype is set to a TurboQuant preset (e.g. turboquant_4bit_nc), make sure your vLLM build includes PR #39931 for hybrid attention support (Qwen3.6 mixes standard self-attention with linear-attention / Mamba-style layers). The PR was merged to vllm-project/vllm:main on 2026-05-05, so current vLLM main / nightly builds after that date should already include it. Pinned releases and older vendor images still need either the PR applied or a newer nightly/base image.

Example serve command (single RTX 5090 32 GB, 32k context):

vllm serve /path/to/checkpoint \
  --served-model-name qwen3.6-27b \
  --max-model-len 32768 \
  --max-num-batched-tokens 4096 \
  --gpu-memory-utilization 0.92 \
  --kv-cache-dtype turboquant_4bit_nc \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --max-num-seqs 32 \
  --tensor-parallel-size 1 \
  --dtype bfloat16 \
  --trust-remote-code

For a simpler config without TurboQuant, use --kv-cache-dtype fp8_e4m3 (supported on vLLM 0.20+ without extra patches). Concurrency will be lower.
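
Once the server is up, it can be queried through the OpenAI-compatible chat endpoint. The sketch below uses only the Python standard library. It assumes the server listens on vLLM's default `localhost:8000` (as in the curl example earlier on this page) and that the model is registered as `qwen3.6-27b` via `--served-model-name`; `build_chat_request` and `chat` are illustrative helper names.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # assumption: default vLLM host and port
MODEL_NAME = "qwen3.6-27b"             # matches --served-model-name above


def build_chat_request(prompt: str, max_tokens: int = 256) -> urllib.request.Request:
    """Build a POST request for the /chat/completions endpoint."""
    payload = {
        "model": MODEL_NAME,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str) -> str:
    """Send the prompt to the running server and return the reply text."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=120) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(chat("Return a JSON object with keys a and b."))` issues one request. Nothing is sent until `chat` is called.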

Runtime requirements per branch

Two independent patches may be needed depending on which branch you serve and which KV cache dtype you choose:

| Patch | Source | When required |
| --- | --- | --- |
| TurboQuant hybrid attention | vLLM PR #39931 | Any branch, if `--kv-cache-dtype turboquant_*` is set. Qwen3.6 is a hybrid (self-attn + linear-attn / Mamba) architecture; the in-tree TurboQuant rejects hybrid models before this PR. The PR was merged to `vllm-project/vllm:main` on 2026-05-05, so current vLLM main/nightly builds after that date should already include it. Stock releases cut before that date, including v0.20.x, still need either the PR applied or a newer nightly/base image. |
| Compressed-tensors FP8 head/embed dispatch | `vllm_patches/` | `fp8-head` and `fp8-head-embed` branches only. The in-tree `compressed-tensors` dispatcher in vLLM 0.20.x routes only `LinearBase`, `ParallelLMHead`, `Attention`, and `FusedMoE` modules through quant schemes; FP8 weight loading on `lm_head` (and `embed_tokens` for `fp8-head-embed`) needs the additional dispatch patch shipped here. An upstream PR for this is in progress. |

Combined matrix:

| Branch | KV cache dtype | Needs PR #39931? | Needs `vllm_patches/`? |
| --- | --- | --- | --- |
| `main` | `fp8_e4m3` (or default) | no | no |
| `main` | `turboquant_4bit_nc` (recommended for max concurrency) | yes | no |
| `fp8-head` | `fp8_e4m3` | no | yes |
| `fp8-head` | `turboquant_4bit_nc` | yes | yes |
| `fp8-head-embed` (lab) | `fp8_e4m3` | no | yes |
| `fp8-head-embed` (lab) | `turboquant_4bit_nc` | yes | yes |

If your vLLM build or base image already includes PR #39931 — for example a vLLM main/nightly build from after 2026-05-05 — you only need the vllm_patches/ overlay for the FP8 head/embed branches. The main branch runs out of the box on such builds.
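
The combined matrix reduces to two independent checks, which can be written as a short preflight helper. This is a sketch of the rule stated in this section, not a utility shipped with the repo; `required_patches` is a hypothetical name.

```python
KNOWN_BRANCHES = ("main", "fp8-head", "fp8-head-embed")


def required_patches(branch: str, kv_cache_dtype: str = "fp8_e4m3") -> list:
    """List the patches a vLLM 0.20.x build needs for this branch and KV dtype."""
    if branch not in KNOWN_BRANCHES:
        raise ValueError(f"unknown branch: {branch!r}")
    patches = []
    # Any TurboQuant preset needs hybrid-attention support, on every branch.
    if kv_cache_dtype.startswith("turboquant_"):
        patches.append("vLLM PR #39931")
    # FP8 lm_head / embed_tokens need the compressed-tensors dispatch patch.
    if branch != "main":
        patches.append("vllm_patches/")
    return patches


for branch in KNOWN_BRANCHES:
    for dtype in ("fp8_e4m3", "turboquant_4bit_nc"):
        print(branch, dtype, required_patches(branch, dtype) or "none")
```

An empty list means the combination runs on stock vLLM 0.20.x. If your build already contains PR #39931, treat that entry as satisfied.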

What `vllm_patches/` adds

The fp8-head and fp8-head-embed branches need a small dispatch patch because the in-tree compressed-tensors dispatcher in vLLM 0.20.x routes only LinearBase, ParallelLMHead, Attention, and FusedMoE modules through quant schemes. To activate FP8 weight loading on the LM head and embedding layers on the compressed-tensors path, the dispatcher needs:

- a quant_config pass-through into the ParallelLMHead constructor (the upstream Fp8Config path already has this since PR #41000; the compressed-tensors port needs the same wire-up in the model file);
- a generalized scale-companion loader in VocabParallelEmbedding.weight_loader (analogous to PR #41365 for the legacy Fp8Config path);
- (for fp8-head-embed only) a new CompressedTensorsFp8EmbeddingMethod plus a VocabParallelEmbedding dispatch branch in CompressedTensorsConfig.get_quant_method.

The full patch ships in this repo under the vllm_patches/ folder. An upstream PR for the compressed-tensors port is in progress; once landed in vllm-project/vllm you can drop the local patch.

Apply the patch (local dev)

# Clone a vLLM source tree for the compressed-tensors FP8 dispatch patch.
# Stock v0.20.2 is fine for fp8_e4m3 KV-cache testing, but it does NOT include
# TurboQuant hybrid attention support. For --kv-cache-dtype turboquant_* use
# vLLM main/nightly from 2026-05-05 or later, or a base image with PR #39931.
git clone --depth 1 --branch v0.20.2 https://github.com/vllm-project/vllm.git vllm-src
cd vllm-src

# Grab the patcher + embedding method file from this repo
curl -O https://huggingface.co/inferRouter/Qwen3.6-27B-NVFP4/resolve/main/vllm_patches/apply_ct_fp8_lmhead_patch.py
curl -O https://huggingface.co/inferRouter/Qwen3.6-27B-NVFP4/resolve/main/vllm_patches/compressed_tensors_embedding.py

# Apply
python3 apply_ct_fp8_lmhead_patch.py .

# Verify
python3 -m py_compile \
  vllm/model_executor/models/qwen3_5.py \
  vllm/model_executor/layers/vocab_parallel_embedding.py \
  vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py \
  vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors_embedding.py

# Build local image (or pip install -e .) per your usual flow

The patcher is idempotent and detects when PR #41000 has already wired quant_config into ParallelLMHead; in that case it skips the first patch point. For the compressed-tensors FP8 head/embed dispatch itself, it works on both stock v0.20.2 and on Red Hat / Neural Magic vLLM nightlies that already include the upstream lm_head FP8 work. TurboQuant hybrid support is separate: current vLLM main/nightly builds after 2026-05-05 should already include PR #39931, while older pinned releases need the PR applied or a newer base image.

Docker overlay (recommended)

If you already have a base image with PR #39931 (TurboQuant hybrid) and PR #41000 (FP8 lm_head on the legacy Fp8Config path), the simplest path is to overlay just the dispatch patch on top:

FROM <your-vllm-image-with-PR-39931-and-PR-41000>

ADD https://huggingface.co/inferRouter/Qwen3.6-27B-NVFP4/resolve/main/vllm_patches/apply_ct_fp8_lmhead_patch.py /tmp/
ADD https://huggingface.co/inferRouter/Qwen3.6-27B-NVFP4/resolve/main/vllm_patches/compressed_tensors_embedding.py /tmp/

RUN python3 /tmp/apply_ct_fp8_lmhead_patch.py /usr/local/lib/python3.12/site-packages

Verify dispatch at runtime

After the patch is applied, the engine startup log should include:

Selected CutlassFp8BlockScaledMMKernel for CompressedTensorsW8A8Fp8   ← fp8-head (lm_head FP8_BLOCK)
Selected CutlassFP8ScaledMMLinearKernel for CompressedTensorsW8A8Fp8  ← shared (down_proj / self_attn FP8 per-tensor)
Using FlashInferCutlassNvFp4LinearKernel for NVFP4 GEMM               ← shared (MLP / linear_attn NVFP4)

For fp8-head-embed additionally:

Using CompressedTensorsFp8EmbeddingMethod for language_model.model.embed_tokens
with scheme CompressedTensorsW8A16Fp8

If the engine still logs UnquantizedEmbeddingMethod for embed_tokens on the fp8-head-embed branch, the patcher didn't reach the in-container compressed_tensors.py — re-check the install path.
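
Checking the startup log by eye is easy to get wrong, so the expected lines can be matched programmatically. The sketch below searches captured log text for the strings quoted above. `missing_dispatch_lines` is a hypothetical helper, and the assumption that the two shared lines also appear when serving `main` is not confirmed by this card.

```python
SHARED_LINES = (
    "Selected CutlassFP8ScaledMMLinearKernel for CompressedTensorsW8A8Fp8",
    "Using FlashInferCutlassNvFp4LinearKernel for NVFP4 GEMM",
)
FP8_HEAD_LINE = "Selected CutlassFp8BlockScaledMMKernel for CompressedTensorsW8A8Fp8"
FP8_EMBED_LINE = (
    "Using CompressedTensorsFp8EmbeddingMethod for language_model.model.embed_tokens"
)


def missing_dispatch_lines(log_text: str, branch: str) -> list:
    """Return the expected startup-log lines that are absent from log_text."""
    expected = list(SHARED_LINES)
    if branch in ("fp8-head", "fp8-head-embed"):
        expected.append(FP8_HEAD_LINE)
    if branch == "fp8-head-embed":
        expected.append(FP8_EMBED_LINE)
    return [line for line in expected if line not in log_text]


sample_log = "\n".join(SHARED_LINES + (FP8_HEAD_LINE,))
print(missing_dispatch_lines(sample_log, "fp8-head"))        # nothing missing
print(missing_dispatch_lines(sample_log, "fp8-head-embed"))  # embed line missing
```

In practice `log_text` would be the captured server output, for example read from a file you redirected the engine log to. A non-empty result on `fp8-head-embed` that names the embedding line points at the install-path problem described above.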

Constraints (`fp8-head`, `fp8-head-embed`)

- TP = 1 only. The patch raises NotImplementedError on tp_size > 1 for the scale-companion path. TP > 1 needs block-scale sharding on the vocab axis, not yet implemented.
- FA2 only when TurboQuant is enabled. TurboQuant is not yet compatible with FlashAttention ≥ 3; vLLM auto-overrides to FA2.
- CUDA graph capture validated for fp8-head (graph capture preserves the FP8_BLOCK lm_head dispatch). For fp8-head-embed, capture works in smoke testing but has not been stress-tested.

Inference example

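from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

tok = AutoTokenizer.from_pretrained("inferRouter/Qwen3.6-27B-NVFP4")
# Default = main branch (BF16 head + BF16 embed). For other variants, pass
# one of these as an extra keyword argument to from_pretrained:
#   revision="fp8-head"        FP8 lm_head, BF16 embed
#   revision="fp8-head-embed"  FP8 lm_head + FP8 embed (lab only)
model = AutoModelForCausalLM.from_pretrained(
    "inferRouter/Qwen3.6-27B-NVFP4",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

# Run one short chat turn to confirm the checkpoint loads and decodes.
messages = [{"role": "user", "content": "Summarize NVFP4 quantization in one sentence."}]
inputs = tok.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
# Decode only the newly generated tokens, skipping the prompt.
print(tok.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))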

The intended runtime for this checkpoint is vLLM. The compressed-tensors NVFP4 path through transformers is not the optimized path on Blackwell; use vLLM for production deploys.

Credits

- Base model: Qwen/Qwen3.6-27B.
- Quantization recipe scaffolding: Red Hat / Neural Magic LLM Compressor, specifically the published NVFP4 W4A4 recipe family.
- Runtime: vLLM, with the compressed-tensors format.
- Hybrid TurboQuant KV cache: vLLM PR #39931.

License

Apache 2.0 (inherited from the Qwen3.6 base model). See the upstream Qwen model card for any base-model-specific usage notes.

Safetensors

Model size: 20B params
Tensor types: F32, BF16, F8_E4M3

Model tree for inferRouter/Qwen3.6-27B-NVFP4

Base model: Qwen/Qwen3.6-27B
Quantized (591): this model