Qwen3.6-27B-W4A16

~14 GB on disk. Full 262k native context on a single RTX 4090 or 24 GB GPU.

W4A16 post-training quantization of Qwen/Qwen3.6-27B β€” a vision-language model (images + text β†’ text) with a hybrid Gated-DeltaNet + dense-MLP architecture. BF16 baseline is ~54 GB.

AutoRound (iters=200) with G32 group size. All dense Linear layers (attention + MLP) quantized to W4A16-G32. Gated DeltaNet recurrent layers excluded entirely (BF16) for vLLM correctness.

Vision calibration note: Calibration corpus is text-only. The vision encoder receives W4A16 quantization with text-derived calibration signal only. Text quality targets are fully met; vision inference quality may be reduced relative to text. For vision-critical workloads, consider the W8A16 variant.


At a Glance

Property Value
Base model Qwen/Qwen3.6-27B
Architecture Hybrid Gated-DeltaNet + Dense MLP (no MoE)
Layers 41 total (interleaved full-attention + Gated DeltaNet)
Quant format compressed-tensors (native vLLM)
Attention layers W4A16-G32-sym
Dense MLP blocks W4A16-G32-sym (Marlin fast path on Ampere+)
Gated DeltaNet layers BF16
Embeddings + LM head BF16
KV cache dtype FP8 (recommended)
Max context 262,144 tokens
Disk size ~14 GB

Memory Footprint

Component 262k context 128k context
Model weights ~14 GB ~14 GB
FP8 KV cache (32 seqs) ~4.9 GB ~2.5 GB
FP8 KV cache (8 seqs) ~1.2 GB ~0.6 GB
Total (32 seqs @ 128k) β€” ~17 GB
Total (32 seqs @ 262k) ~19 GB β€”

Both configurations fit on a single RTX 4090 (24 GB) or A6000 (48 GB) with significant headroom. The RTX 4090 can serve 32 concurrent sequences at 128k context.

KV cache only materializes for the full-attention layers. The Gated DeltaNet layers maintain recurrent state of fixed size, independent of sequence length β€” this is why 262k context fits on a 24 GB GPU.


Quick Start

Tested with vLLM v0.21.0 (vllm/vllm-openai:v0.21.0-cu129-ubuntu2404). Weights are in compressed-tensors format β€” vLLM detects and loads quantization automatically. No --quantization flag needed.

Serve at 262k context (high throughput)

docker run --gpus device=0 -p 8080:8080 \
  vllm/vllm-openai:v0.21.0-cu129-ubuntu2404 vllm serve \
  88plug/Qwen3.6-27B-W4A16 \
  --kv-cache-dtype fp8 \
  --max-model-len 262144 \
  --max-num-seqs 64 \
  --gpu-memory-utilization 0.92 \
  --enable-prefix-caching \
  --reasoning-parser qwen3

Python Client

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

response = client.chat.completions.create(
    model="88plug/Qwen3.6-27B-W4A16",
    messages=[{"role": "user", "content": "Your prompt here"}],
    max_tokens=512,
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
print(response.choices[0].message.content)

Requires vLLM β‰₯ v0.21.0. The compressed-tensors format is loaded natively β€” no extra plugins needed.

Recommended Sampling Parameters

Mode Temperature Top-P Top-K Min-P Use When
Thinking (default) 0.6 0.95 20 0.0 Reasoning, math, code
Non-thinking 0.7 0.8 20 0.0 Chat, creative, fast response

Enable/disable thinking via chat_template_kwargs={"enable_thinking": True/False}. Default is thinking-enabled.


Quantization Design

What is quantized

All dense Linear layers β€” attention projections (q_proj, k_proj, v_proj, o_proj) and MLP blocks (gate_proj, up_proj, down_proj) β€” are quantized to W4A16-G32-sym (INT4 weights, BF16 activations, group size 32, symmetric).

Group size 32 means one scale per 32 weights. Competitors generally use G128 (4Γ— coarser). For a dense model with full-width MLP layers, coarser group sizes accumulate more rounding error per block, particularly in layers where weight magnitude varies within a row. G32 costs ~2% more scale storage and delivers measurably better KL divergence on held-out data.

Gated DeltaNet exclusion

All linear_attn.* parameters are excluded entirely (BF16). This is required for correct vLLM inference β€” see vLLM issue #40252. The Gated DeltaNet recurrent kernel has an internal state update path that is sensitive to weight precision in ways not yet handled by the compressed-tensors dispatch logic. Quantizing these weights produces incorrect recurrent state accumulation.

This exclusion is also why 262k context fits on a 24 GB GPU: Gated DeltaNet layers maintain fixed-size recurrent state independent of sequence length. The standard KV cache β€” which scales with context length β€” only materializes for the full-attention layers.

Precision assignment

Module class Precision Reason
q_proj, k_proj, v_proj, o_proj (full-attn) W4A16-G32-sym AutoRound (iters=200)
gate_proj, up_proj, down_proj (dense MLP) W4A16-G32-sym Marlin fast path; dominant parameter count
linear_attn.* (Gated DeltaNet) BF16 Must not quantize β€” vLLM #40252
embed_tokens, lm_head, norm BF16 Standard practice

Quality Targets

Metric Target
KL divergence from BF16 < 0.018
MMLU recovery β‰₯ 98%
RULER @ 128k β‰₯ 96%

Formal benchmark results (MMLU-Pro, GPQA, RULER@128k, MATH-500, HumanEval) are in progress and will be added to this card when complete. The targets above are the acceptance thresholds used during recipe development β€” the checkpoint was not published until all three were satisfied on held-out calibration data.

No benchmark numbers are fabricated or estimated in this card.

vs. Other Qwen3.6-27B W4 Quants

Quant Method Group Size Disk Notes
88plug W4A16 (this) AutoRound (iters=200), compressed-tensors G32 ~14 GB Gated-DeltaNet BF16; native vLLM format
Lorbus/Qwen3.6-27B-int4-AutoRound (870K DL) AutoRound compressed-tensors G128 ~14 GB Same method, coarser group size
cyankiwi/Qwen3.6-27B-AWQ-INT4 (1.5M DL) AWQ INT4 G128 ~14 GB AWQ
QuantTrio/Qwen3.6-27B-AWQ (893K DL) AWQ INT4 G128 ~14 GB AWQ
unsloth/Qwen3.6-27B-GGUF (2M DL) GGUF Q4–Q8 varies 14–28 GB llama.cpp only

AutoRound (iters=200): The Lorbus release does not publish its iteration count. AutoRound's sign-gradient optimization converges substantially between 100 and 200 iterations for dense models of this size. Using iters=200 with the UltraChat/WikiText calibration mix produces better-calibrated scales than a quick-run default.


Technical Details

Calibration

  • Corpus: 75% UltraChat-200k (instruction-following) + 25% WikiText-103 (long-context prose). 1,024 samples Γ— 2,048 tokens.
  • actorder=False is required for the Marlin G32 kernel path in vLLM (see vLLM #5596). Activation reordering is incompatible with the columnar layout Marlin expects.
  • No MoE expert coverage issue: Qwen3.6-27B is fully dense. Every calibration token covers every MLP block. There is no tail-expert starvation problem to solve.

KV Cache Note

Full-attention layers maintain a KV cache that scales with sequence length. The Gated DeltaNet layers use recurrent state of fixed size β€” sequence length has no effect on their memory consumption. This asymmetry is what makes 262k context on a 24 GB GPU possible at W4A16.


SGLang

SGLang v0.5.8 RadixAttention for prefix-heavy workloads. Runs BF16 β€” compressed-tensors is vLLM-native only.

Note: Gated-DeltaNet hybrid architecture support in SGLang v0.5.8 is unverified.

docker run --gpus device=0 -p 30000:30000 \
  lmsysorg/sglang:v0.5.8-cu129 python -m sglang.launch_server \
  --model-path Qwen/Qwen3.6-27B \
  --tp 1 \
  --mem-fraction-static 0.85 \
  --port 30000

llama.cpp (GGUF)

For consumer GPUs, CPU, and Apple Silicon. Convert from the BF16 base β€” not from our compressed-tensors weights. Vision input requires a separate mmproj GGUF.

# Build with CUDA
cmake -B build -DGGML_CUDA=ON && cmake --build build --config Release -j$(nproc)

# Convert from BF16 base
python convert_hf_to_gguf.py Qwen/Qwen3.6-27B \
  --outfile Qwen3.6-27B-BF16.gguf
python convert_hf_to_gguf.py Qwen/Qwen3.6-27B \
  --mmproj --outfile Qwen3.6-27B-mmproj.gguf

llama-quantize Qwen3.6-27B-BF16.gguf Qwen3.6-27B-Q8_0.gguf Q8_0
llama-quantize --imatrix calibration_datav3.txt \
  Qwen3.6-27B-BF16.gguf Qwen3.6-27B-IQ4_XS.gguf IQ4_XS

llama-server \
  --model Qwen3.6-27B-IQ4_XS.gguf \
  --mmproj Qwen3.6-27B-mmproj.gguf \
  --n-gpu-layers 999 \
  --ctx-size 131072 \
  --port 8081

Benchmarks

Results pending.

Engine Format Batch ctx tok/s TTFT p50 TTFT p99 VRAM
vLLM v0.21.0 W4A16 1 32k β€” β€” β€” β€”
vLLM v0.21.0 W4A16 8 32k β€” β€” β€” β€”
vLLM v0.21.0 W4A16 1 128k β€” β€” β€” β€”
SGLang v0.5.8 BF16 (baseline) 1 32k β€” β€” β€” β€”
llama.cpp b9297 IQ4_XS GGUF 1 32k β€” β€” β€” β€”

Hardware: A6000 48 GB, CUDA 12.9, driver 570.


Intended Use

This checkpoint is intended for:

  • Long-context retrieval, summarization, and reasoning over documents up to 262k tokens
  • High-throughput serving on a single 24 GB GPU (RTX 4090, RTX A5000) or 48 GB GPU (A6000)
  • Agentic workflows and reasoning tasks where a 27B dense model fits the quality/cost target

Thinking mode (enable_thinking: true) is supported. Enable it for reasoning-intensive tasks.


Citation

If you use this checkpoint in research, please cite the base model:

@misc{qwen3technicalreport,
  title  = {Qwen3 Technical Report},
  author = {Qwen Team},
  year   = {2025},
  url    = {https://huggingface.co/Qwen/Qwen3.6-27B}
}

Quantization methodology draws on:

  • AutoRound: Cheng et al., "AutoRound: Automatic Rounding for Post-Training Quantization" (Intel)

About

88plug AI Lab produces production-grade compressed-tensors quantizations of frontier LLMs, VLMs, and omni models β€” built for native vLLM v0.21.0+ deployment with zero extra flags.

W8A16 β€” INT8 weights + BF16 activations. Near-lossless on any Ampere+ GPU. Runs where FP8 hardware cannot.

W4A16 β€” AutoRound with iters=200 and a mixed calibration corpus. Targets β‰₯ 99% MMLU recovery β€” the quality bar that makes W4A16 viable for production.

All weights are in compressed-tensors format. vLLM detects quantization automatically from quantization_config in config.json. No --quantization flag required.

Also available: Qwen3.6-27B-W8A16 (INT8, ~27 GB) Β· Qwen3.6-27B-W4A16 (INT4, ~14 GB)

Browse all releases β†’ huggingface.co/88plug

Downloads last month

-

Downloads are not tracked for this model. How to track
Inference Providers NEW
This model isn't deployed by any Inference Provider. πŸ™‹ Ask for provider support

Model tree for 88plug/Qwen3.6-27B-W4A16

Base model

Qwen/Qwen3.6-27B
Quantized
(381)
this model

Evaluation results