Qwen3.6-27B-W8A16

INT8 post-training quantization of Qwen/Qwen3.6-27B, a vision-language model (images + text → text) with a hybrid Gated-DeltaNet + dense-MLP architecture. ~27 GB on disk. Serves the full 262k-token context on a single 48 GB A6000.

The official FP8 release requires Blackwell or H100 hardware. This quant fills the gap: near-lossless W8A16 that runs on any Ampere-or-newer GPU with room for the ~27 GB of weights, such as the A6000, L40S, or A100-40.


What Makes This Different

The Gap This Fills

The official Qwen3.6-27B release ships in BF16 (54 GB) and FP8 (28 GB). The FP8 checkpoint is great — but FP8 inference kernels require Blackwell (RTX 50xx) or Hopper (H100/H200) GPU generation. On Ampere or Ada hardware (RTX 3090/4090, A6000, A100), FP8 activations are not natively supported.

This W8A16 checkpoint lands at the same ~27–28 GB footprint as the official FP8 model, but uses INT8 weights and BF16 activations, a format that Marlin kernels in vLLM serve natively on any Ampere-or-newer GPU. No FP8-capable hardware is required.

The Solutions Applied Here

AutoRound W8A16: INT8 weights, BF16 activations. Leaving activations in BF16 removes activation quantization error, the dominant source of quality loss in W8A8 methods. Instead of rounding every weight to the nearest level, AutoRound tunes each weight's rounding direction and clipping range with signed gradient descent; its authors report better accuracy than GPTQ or AWQ at the same bit width, with less worst-case outlier distortion.
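The signed-gradient rounding idea can be illustrated with a toy example. This is a minimal sketch for a single linear unit, not the AutoRound library: the function names, the random data, the straight-through gradient estimate, and the step size of 1/iters are all assumptions made for illustration.

```python
import random

def quantize(w, scale, v):
    """Symmetric INT8 fake-quantization with a per-weight rounding offset v."""
    out = []
    for wi, vi in zip(w, v):
        q = max(-127, min(127, round(wi / scale + vi)))
        out.append(q * scale)
    return out

def loss_and_grad(w_q, xs, ys, scale):
    """Squared output error and its straight-through gradient w.r.t. v."""
    grad = [0.0] * len(w_q)
    loss = 0.0
    for x, y in zip(xs, ys):
        err = sum(a * b for a, b in zip(x, w_q)) - y
        loss += err * err
        for i, xi in enumerate(x):
            grad[i] += 2.0 * err * xi * scale  # STE: d(w_q)/d(v) is treated as scale
    return loss, grad

def tune_rounding(w, xs, iters=200):
    scale = max(abs(wi) for wi in w) / 127.0
    ys = [sum(a * b for a, b in zip(x, w)) for x in xs]  # full-precision outputs
    v = [0.0] * len(w)  # v = 0 is plain round-to-nearest (RTN)
    lr = 1.0 / iters
    rtn_loss, _ = loss_and_grad(quantize(w, scale, v), xs, ys, scale)
    best_loss, best_v = rtn_loss, list(v)
    for _ in range(iters):
        loss, grad = loss_and_grad(quantize(w, scale, v), xs, ys, scale)
        if loss < best_loss:
            best_loss, best_v = loss, list(v)
        # Signed step: only the sign of the gradient is used, then v is clipped.
        v = [max(-0.5, min(0.5, vi - lr * ((g > 0) - (g < 0))))
             for vi, g in zip(v, grad)]
    return rtn_loss, best_loss, best_v

rng = random.Random(0)
weights = [rng.gauss(0.0, 1.0) for _ in range(32)]
inputs = [[rng.gauss(0.0, 1.0) for _ in range(32)] for _ in range(64)]
rtn_loss, tuned_loss, offsets = tune_rounding(weights, inputs)
print(f"RTN loss {rtn_loss:.6f} -> tuned loss {tuned_loss:.6f}")
```

Because the loop keeps the best offsets seen and starts from RTN, the tuned loss can never be worse than the RTN loss. At 8 bits the gap is usually small, which is consistent with the card treating RTN INT8 as near-lossless.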

Group size G32: fine-grained scale resolution. One scale factor per 32 weights gives 4× as many scales as the common G128, so an outlier distorts only the 31 weights that share its group rather than 127, at the cost of a modest overhead in scale storage. For a 27B dense model this is the right tradeoff: the extra scale storage is small next to the weights themselves, and the quality improvement on long-context tasks is expected to be measurable.
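The group-size tradeoff can be sketched in a few lines. This is a minimal illustration assuming symmetric INT8 with one 16-bit scale per group; the scale width is an assumption, and the weights are synthetic values chosen to contain a single outlier.

```python
def quantize_groups(weights, group_size=32):
    """Symmetric INT8 quantization with one scale per group of weights."""
    codes, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        scale = max(abs(w) for w in group) / 127.0 or 1.0  # guard all-zero groups
        scales.append(scale)
        codes.extend(max(-127, min(127, round(w / scale))) for w in group)
    return codes, scales

def dequantize_groups(codes, scales, group_size=32):
    return [c * scales[i // group_size] for i, c in enumerate(codes)]

def bits_per_weight(group_size, weight_bits=8, scale_bits=16):
    """Effective storage per weight; scale_bits=16 is an assumption."""
    return weight_bits + scale_bits / group_size

def mean_abs_error(weights, group_size):
    codes, scales = quantize_groups(weights, group_size)
    restored = dequantize_groups(codes, scales, group_size)
    return sum(abs(a - b) for a, b in zip(weights, restored)) / len(weights)

# Synthetic data: one outlier at index 0 inflates the scale of its whole group.
weights = [8.0] + [0.01 * ((i % 7) - 3) for i in range(1, 128)]

for g in (128, 32):
    print(g, bits_per_weight(g), mean_abs_error(weights, g))
```

Under these assumptions G32 costs 8.5 bits per weight against 8.125 for G128, and the outlier's damage is confined to one group of 32 instead of all 128 weights.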

Mixed calibration corpus. 75% UltraChat-200k (instruction-following fidelity) + 25% WikiText-103 (long-context fidelity). 1,024 samples at 2,048 tokens each. Text-only calibration data.
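The sample budget implied by that mix works out as follows. This is a small sketch; the helper name is hypothetical and the dataset keys are borrowed from the recipe section.

```python
def calibration_counts(total_samples, mix):
    """Split a sample budget across datasets according to fractional weights."""
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("mix fractions must sum to 1")
    counts = {name: int(total_samples * frac) for name, frac in mix.items()}
    # Hand any rounding remainder to the largest component.
    largest = max(mix, key=mix.get)
    counts[largest] += total_samples - sum(counts.values())
    return counts

counts = calibration_counts(1024, {"ultrachat_200k": 0.75, "wikitext_103_raw": 0.25})
total_tokens = 1024 * 2048  # upper bound: every sample fills the full sequence length
print(counts, total_tokens)
```

That is 768 UltraChat samples and 256 WikiText samples, for at most about 2.1M calibration tokens.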

What Stays at BF16

| Layer | Reason |
|---|---|
| linear_attn.* | Gated DeltaNet; must stay BF16 per vLLM #40252 |
| embed_tokens | Embedding table; standard practice, disproportionate perplexity impact |
| lm_head | Output projection; standard practice |
| norm | Layer norms; negligible size, high sensitivity |

Vision calibration note: Calibration corpus is text-only. The vision encoder (ViT) receives RTN-style INT8 quantization with no calibration signal, which is near-lossless at 8-bit. Text quality is fully calibrated; vision quality is RTN INT8.


Architecture Notes

Qwen3.6-27B is a vision-language model built on a hybrid Gated-DeltaNet + dense-MLP backbone. Key characteristics relevant to serving:

  • Modalities: Image + Text → Text (pipeline tag: image-text-to-text)
  • Model type: qwen3_5 (dense, no MoE)
  • 41 LLM layers: interleaved full-attention and Gated DeltaNet linear-attention layers in a dense stack
  • No MoE: no shared experts, no router gate — standard dense MLP blocks throughout
  • Native context: 262,144 tokens
  • KV cache: only full-attention layers maintain a KV cache; Gated DeltaNet layers use fixed recurrent state independent of sequence length
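The KV cache point above comes down to which terms depend on sequence length. The sketch below uses the standard KV cache size formula and a per-head matrix state for the linear-attention layers; every shape in it is a hypothetical placeholder, not the real Qwen3.6-27B configuration.

```python
def kv_cache_bytes(tokens, full_attn_layers, kv_heads, head_dim, bytes_per_value=1):
    """Keys and values for every token in every full-attention layer."""
    return 2 * tokens * full_attn_layers * kv_heads * head_dim * bytes_per_value

def recurrent_state_bytes(linear_layers, heads, key_dim, value_dim, bytes_per_value=2):
    """One key_dim x value_dim state matrix per head; no token term at all."""
    return linear_layers * heads * key_dim * value_dim * bytes_per_value

# HYPOTHETICAL shapes for illustration only; read the real ones from config.json.
KV_SHAPE = {"full_attn_layers": 10, "kv_heads": 2, "head_dim": 256}
STATE_SHAPE = {"linear_layers": 30, "heads": 16, "key_dim": 128, "value_dim": 128}

for tokens in (32_768, 131_072, 262_144):
    kv_gib = kv_cache_bytes(tokens, **KV_SHAPE) / 2**30
    state_gib = recurrent_state_bytes(**STATE_SHAPE) / 2**30
    print(f"{tokens:>7} tokens: KV {kv_gib:.2f} GiB, recurrent state {state_gib:.3f} GiB")
```

The KV term grows linearly with context while the recurrent term stays constant per sequence, which is why a hybrid stack needs far less cache than a full-attention model of the same depth.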

Because there is no MoE, there is no tail-expert calibration problem. Every parameter in the dense MLP blocks is calibrated with full coverage. The only special-case exclusions are the Gated DeltaNet layers (BF16 by requirement) and the standard embedding/head layers.


Memory Requirements

| Configuration | BF16 | This Quant (W8A16) | Official FP8 |
|---|---|---|---|
| Weights (disk/VRAM) | ~54 GB | ~27 GB | ~28 GB |
| KV cache @ 32k ctx (fp8) | ~0.3 GB | ~0.3 GB | ~0.3 GB |
| KV cache @ 128k ctx (fp8) | ~1.2 GB | ~1.2 GB | ~1.2 GB |
| KV cache @ 262k ctx (fp8) | ~2.4 GB | ~2.4 GB | ~2.4 GB |
| Total VRAM @ 32k ctx | ~55 GB | ~28 GB | ~29 GB |
| Total VRAM @ 262k ctx | ~57 GB | ~30 GB | ~31 GB |
| Minimum GPU | 1× A100 80GB | 1× A6000 / A100-40 (a 24 GB RTX 4090 cannot hold ~28 GB alone) | H100 / Blackwell only |

KV cache figures are for the full-attention layers only. Linear-attention layers carry state in a fixed recurrent buffer independent of sequence length, which is why the full 262k context adds only ~2.4 GB and fits on a single 48 GB A6000 with room to spare.
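As a rough check on the table, the headroom on a 48 GB card can be estimated from the figures above. This is a back-of-the-envelope sketch: the helper is hypothetical, the 0.92 fraction is taken from the serve command in Quick Start, and activations, CUDA graphs, and the vision encoder's working memory are ignored.

```python
def vram_headroom_gb(gpu_gb, weights_gb, kv_cache_gb, gpu_memory_utilization=0.92):
    """GB left after weights and KV cache within the configured memory budget."""
    budget = gpu_gb * gpu_memory_utilization
    return budget - (weights_gb + kv_cache_gb)

# Table values for this quant on a 48 GB A6000 at 262k context.
headroom = vram_headroom_gb(gpu_gb=48, weights_gb=27, kv_cache_gb=2.4)
print(f"{headroom:.1f} GB headroom")
```

The remaining budget is what vLLM can spend on additional concurrent sequences and prefix-cache blocks.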


Quick Start

Tested with vLLM v0.21.0 (vllm/vllm-openai:v0.21.0-cu129-ubuntu2404). Weights are in compressed-tensors format — vLLM detects and loads quantization automatically. No --quantization flag needed.
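To confirm what a loader will see, the quantization method can be read straight from config.json. This is a minimal sketch assuming the usual Hugging Face layout, where quantization_config sits at the top level and carries a quant_method key; the JSON fragment is illustrative, not a copy of this checkpoint's config.

```python
import json

def detect_quant_method(config_text):
    """Return quantization_config.quant_method from a config.json string, or None."""
    config = json.loads(config_text)
    return config.get("quantization_config", {}).get("quant_method")

# Illustrative fragment only; a real config.json carries many more fields.
sample = '{"model_type": "qwen3_5", "quantization_config": {"quant_method": "compressed-tensors"}}'
print(detect_quant_method(sample))
```

A result of None means the checkpoint declares no quantization and would load as a plain BF16 model.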

Footguns to avoid: Do NOT use --quantization turboquant (vLLM #41560). Do NOT use --tensor-parallel-size > 1 on a single GPU.

262k Context — Full Native Context (Recommended)

Native context window, no rope scaling, maximum quality.

docker run --gpus device=0 -p 8080:8080 \
  vllm/vllm-openai:v0.21.0-cu129-ubuntu2404 vllm serve \
  88plug/Qwen3.6-27B-W8A16 \
  --port 8080 \
  --kv-cache-dtype fp8 \
  --max-model-len 262144 \
  --max-num-seqs 32 \
  --gpu-memory-utilization 0.92 \
  --enable-prefix-caching \
  --reasoning-parser qwen3

Python Client

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

response = client.chat.completions.create(
    model="88plug/Qwen3.6-27B-W8A16",
    messages=[{"role": "user", "content": "Your prompt here"}],
    max_tokens=512,
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
print(response.choices[0].message.content)
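Since the model accepts images, a request with an image can be built the same way. This sketch uses the OpenAI chat format for image input, which vLLM's OpenAI-compatible server accepts for multimodal models; the helper name and the example URL are placeholders.

```python
def build_image_request(prompt, image_url, model="88plug/Qwen3.6-27B-W8A16"):
    """Keyword arguments for client.chat.completions.create with one image."""
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": image_url}},
                {"type": "text", "text": prompt},
            ],
        }],
        "max_tokens": 512,
        "extra_body": {"chat_template_kwargs": {"enable_thinking": False}},
    }

# Placeholder URL; substitute a reachable image or a base64 data URL.
request = build_image_request("Describe this image.", "https://example.com/cat.png")
# With a running server: response = client.chat.completions.create(**request)
```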

Recommended Sampling Parameters

| Mode | Temperature | Top-P | Top-K | Min-P | Use When |
|---|---|---|---|---|---|
| Thinking (default) | 0.6 | 0.95 | 20 | 0.0 | Reasoning, math, code |
| Non-thinking | 0.7 | 0.8 | 20 | 0.0 | Chat, creative, fast response |

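Toggle thinking per request with chat_template_kwargs={"enable_thinking": True} or chat_template_kwargs={"enable_thinking": False}. With the OpenAI Python client, pass it inside extra_body as in the client example. Thinking is enabled by default.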
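The two presets can be wrapped in a small helper. This is a sketch: the helper and dictionary names are hypothetical, and top_k and min_p are sent through extra_body because they are vLLM extensions rather than standard OpenAI parameters.

```python
SAMPLING_PRESETS = {
    "thinking": {"temperature": 0.6, "top_p": 0.95, "top_k": 20, "min_p": 0.0},
    "non_thinking": {"temperature": 0.7, "top_p": 0.8, "top_k": 20, "min_p": 0.0},
}

def sampling_kwargs(mode):
    """Keyword arguments to merge into client.chat.completions.create(...)."""
    preset = SAMPLING_PRESETS[mode]
    return {
        "temperature": preset["temperature"],
        "top_p": preset["top_p"],
        # Not OpenAI parameters; vLLM reads these from the request body.
        "extra_body": {
            "top_k": preset["top_k"],
            "min_p": preset["min_p"],
            "chat_template_kwargs": {"enable_thinking": mode == "thinking"},
        },
    }

print(sampling_kwargs("non_thinking"))
```

Usage would look like client.chat.completions.create(model=..., messages=..., **sampling_kwargs("thinking")).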


Quality

Targets

| Metric | Target |
|---|---|
| KL divergence | KL(quant‖BF16) < 0.005 |
| MMLU recovery vs BF16 | ≥ 99.7% |
| RULER@128k recovery vs BF16 | ≥ 99% |

These targets drove the recipe design — W8A16 instead of W8A8 to eliminate activation quantization error, G32 group size for fine-grained scale resolution, and mixed calibration corpus to preserve both instruction-following and long-context fidelity.
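For anyone running their own evals, the two kinds of target above can be computed as follows. This is a minimal sketch that assumes KL is measured per token position in nats and averaged, and that recovery is the quantized score divided by the BF16 score; the card does not state either convention, and the logits and scores below are made-up toy values.

```python
import math

def log_softmax(logits):
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def kl_divergence(quant_logits, ref_logits):
    """KL(quant || reference) in nats for one token position."""
    log_p = log_softmax(quant_logits)
    log_q = log_softmax(ref_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))

def mean_kl(quant_rows, ref_rows):
    """Average KL over token positions."""
    pairs = list(zip(quant_rows, ref_rows))
    return sum(kl_divergence(a, b) for a, b in pairs) / len(pairs)

def recovery(quant_score, ref_score):
    """Quantized benchmark score as a percentage of the BF16 score."""
    return 100.0 * quant_score / ref_score

# Toy logits for two positions over a four-token vocabulary.
bf16_rows = [[2.0, 1.0, 0.5, -1.0], [0.1, 3.0, -0.5, 0.0]]
quant_rows = [[2.02, 0.98, 0.5, -1.0], [0.1, 2.97, -0.48, 0.0]]
print(f"mean KL = {mean_kl(quant_rows, bf16_rows):.6f}")
print(f"recovery = {recovery(79.76, 80.0):.1f}%")
```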

Full benchmark results will be added after publication. If you run evals, please open an issue or PR.

vs. Other Qwen3.6-27B Quants

This is the first compressed-tensors W8A16 checkpoint for Qwen3.6-27B. It fills the gap between the 54 GB BF16 base and the hardware-restricted official FP8.

| Quant | Method | Size | GPU Compatibility | Notes |
|---|---|---|---|---|
| 88plug W8A16 (this) | compressed-tensors AutoRound | ~27 GB | Any Ampere+ (A6000, RTX 4090, A100) | First W8A16 for this model |
| Qwen/Qwen3.6-27B-FP8 (6.7M DL) | FP8 | ~28 GB | Blackwell / H100 only | Official; hardware restricted |
| unsloth/Qwen3.6-27B-GGUF (2M DL) | GGUF Q4–Q8 | 14–28 GB | CPU, Apple Silicon, any GPU | llama.cpp, no vLLM |
| cyankiwi/Qwen3.6-27B-AWQ-INT4 (1.5M DL) | AWQ INT4 | ~14 GB | Any GPU | 4-bit only |
| QuantTrio/Qwen3.6-27B-AWQ (893K DL) | AWQ INT4 | ~14 GB | Any GPU | 4-bit only |
| Lorbus/Qwen3.6-27B-int4-AutoRound (870K DL) | compressed-tensors W4G128 | ~14 GB | Any GPU | W4 only, G128 |

Why W8A16 over FP8: FP8 activation quantization on Ampere hardware falls back to BF16 dispatch silently, negating the memory savings. W8A16 is the correct format for Ampere/Ada inference: INT8 weights (Marlin kernel), BF16 activations — no hardware generation requirement, predictable behavior.

Why compressed-tensors over GGUF at this size: at batch > 1, Marlin INT8 kernel throughput is expected to exceed llama.cpp GGUF on GPU by a wide margin (measurements are pending; see Benchmarks). Weights stay on GPU, so there is no CPU↔GPU transfer overhead. The vLLM OpenAI-compatible API, prefix caching, and chunked prefill all work natively.


SGLang

SGLang v0.5.8 offers RadixAttention for prefix-heavy workloads. Run against the BF16 base model — compressed-tensors is vLLM-native only.

Note: Gated-DeltaNet hybrid architecture support in SGLang v0.5.8 is unverified. Confirm before production use.

docker run --gpus device=0 -p 30000:30000 \
  lmsysorg/sglang:v0.5.8-cu129 python -m sglang.launch_server \
  --model-path Qwen/Qwen3.6-27B \
  --tp 1 \
  --mem-fraction-static 0.85 \
  --host 0.0.0.0 \
  --port 30000

llama.cpp (GGUF)

For consumer GPUs, CPU, and Apple Silicon. Convert from the BF16 base checkpoint — not from compressed-tensors weights. Vision requires a separate mmproj GGUF (libmtmd).

# Build with CUDA
cmake -B build -DGGML_CUDA=ON && cmake --build build --config Release -j$(nproc)

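# Download the base checkpoint (the converter reads a local directory)
huggingface-cli download Qwen/Qwen3.6-27B --local-dir Qwen3.6-27B

# Convert base model (text trunk) to a BF16 GGUF
python convert_hf_to_gguf.py Qwen3.6-27B \
  --outtype bf16 --outfile Qwen3.6-27B-BF16.gguf

# Vision projector (required for image input)
python convert_hf_to_gguf.py Qwen3.6-27B \
  --mmproj --outfile Qwen3.6-27B-mmproj.gguf

# Quantize text trunk
llama-quantize Qwen3.6-27B-BF16.gguf Qwen3.6-27B-Q8_0.gguf Q8_0

# IQ4_XS needs an importance matrix computed from the calibration text first
llama-imatrix -m Qwen3.6-27B-BF16.gguf -f calibration_datav3.txt -o imatrix.dat
llama-quantize --imatrix imatrix.dat \
  Qwen3.6-27B-BF16.gguf Qwen3.6-27B-IQ4_XS.gguf IQ4_XS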

# Serve (text + vision)
llama-server \
  --model Qwen3.6-27B-Q8_0.gguf \
  --mmproj Qwen3.6-27B-mmproj.gguf \
  --n-gpu-layers 999 \
  --ctx-size 131072 \
  --port 8081

Benchmarks

Results pending. The rows below list the planned runs.

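| Engine | Format | Batch | ctx | tok/s | TTFT p50 | TTFT p99 | VRAM |
|---|---|---|---|---|---|---|---|
| vLLM v0.21.0 | W8A16 | 1 | 32k | | | | |
| vLLM v0.21.0 | W8A16 | 8 | 32k | | | | |
| vLLM v0.21.0 | W8A16 | 1 | 128k | | | | |
| SGLang v0.5.8 | BF16 (baseline) | 1 | 32k | | | | |
| llama.cpp b9297 | Q8_0 GGUF | 1 | 32k | | | | |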

Hardware: A6000 48 GB, CUDA 12.9, driver 570.


Limitations

Dense model — no MoE savings on KV. Unlike the 35B-A3B sparse MoE variant, every one of the 41 layers is active on every token. KV cache memory scales with all full-attention layers in the stack, not a reduced subset. The 262k context window is still achievable on a single A6000, but the headroom is tighter than with the MoE variant.

No activation quantization. W8A16 means activations run in BF16. Memory bandwidth savings are on the weight side only; activation memory is unchanged versus BF16. This is a deliberate quality-vs-compression tradeoff.

Linear attention state reset. The Gated-DeltaNet layers maintain recurrent state, not a KV cache. This state is reset between requests. Stateful multi-turn inference within a session works correctly; cross-request state continuity is not supported by the current vLLM serving path.

Calibration distribution. Calibration used UltraChat-200k and WikiText-103. Tasks with significantly different token distributions (e.g., code-heavy, mathematical, or non-English) may see slightly higher KL divergence than the headline targets. Recalibration with domain-specific data is straightforward using the recipe below.


Quantization Recipe (Reproducibility)

# Core configuration
quantization_method = "autoround"
w_bits = 8
w_dtype = "int8"
a_dtype = "bf16"   # activations NOT quantized
group_size = 32
iters = 200
scale_method = "neural_max"  

calibration_dataset = {
    "ultrachat_200k": 0.75,
    "wikitext_103_raw": 0.25,
}
calibration_samples = 1024
calibration_seqlen = 2048

# BF16-preserved layers
skip_layers = [
    "linear_attn.*",  # Gated DeltaNet — required by vLLM
    "lm_head",        # output projection
    "embed_tokens",   # embedding table
    "norm",           # layer norms
]
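How the skip list partitions a model can be previewed with a small matcher. This is a sketch under an assumption: each entry is treated as a regular expression searched anywhere in the module name, which may differ from how the quantization tool matches in practice. The module names are hypothetical examples.

```python
import re

SKIP_PATTERNS = ["linear_attn.*", "lm_head", "embed_tokens", "norm"]

def stays_bf16(module_name, patterns=SKIP_PATTERNS):
    """True if any skip pattern matches somewhere in the module name."""
    return any(re.search(p, module_name) for p in patterns)

# Hypothetical module names, for illustration only.
for name in [
    "model.layers.0.linear_attn.in_proj",
    "model.layers.3.self_attn.q_proj",
    "model.layers.3.mlp.down_proj",
    "model.layers.3.input_layernorm",
    "lm_head",
]:
    print(f"{name}: {'BF16' if stays_bf16(name) else 'INT8'}")
```

Running this against the real module names from the checkpoint is a quick way to confirm that only the full-attention and MLP projections end up in INT8.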


Citation

If you use this model, please cite the base model:

@misc{qwen3technicalreport,
  title  = {Qwen3 Technical Report},
  author = {Qwen Team},
  year   = {2025},
  url    = {https://huggingface.co/Qwen/Qwen3.6-27B}
}

About

88plug AI Lab produces production-grade compressed-tensors quantizations of frontier LLMs, VLMs, and omni models — built for native vLLM v0.21.0+ deployment with zero extra flags.

W8A16 — INT8 weights + BF16 activations. Near-lossless on any Ampere+ GPU. Runs where FP8 hardware cannot.

W4A16 — AutoRound with iters=200 and a mixed calibration corpus. Targets ≥ 99% MMLU recovery — the quality bar that makes W4A16 viable for production.

All weights are in compressed-tensors format. vLLM detects quantization automatically from quantization_config in config.json. No --quantization flag required.

Also available: Qwen3.6-27B-W4A16 (INT4, ~14 GB) · Qwen3.6-27B-W8A16 (INT8, ~27 GB)

Browse all releases → huggingface.co/88plug
