Instructions for using 88plug/Qwen3.6-27B-W4A16 with libraries, inference servers, and local apps.

Libraries

How to use 88plug/Qwen3.6-27B-W4A16 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="88plug/Qwen3.6-27B-W4A16")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
# An image-text-to-text checkpoint needs the processor (tokenizer + image preprocessing),
# not the bare tokenizer, so that the image in the message is actually encoded.
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("88plug/Qwen3.6-27B-W4A16")
model = AutoModelForImageTextToText.from_pretrained("88plug/Qwen3.6-27B-W4A16")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

# Decode only the newly generated tokens, skipping the prompt
outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

vLLM

How to use 88plug/Qwen3.6-27B-W4A16 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "88plug/Qwen3.6-27B-W4A16"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "88plug/Qwen3.6-27B-W4A16",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker

docker model run hf.co/88plug/Qwen3.6-27B-W4A16

SGLang

How to use 88plug/Qwen3.6-27B-W4A16 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "88plug/Qwen3.6-27B-W4A16" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "88plug/Qwen3.6-27B-W4A16",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "88plug/Qwen3.6-27B-W4A16" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "88plug/Qwen3.6-27B-W4A16",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Qwen3.6-27B-W4A16

~14 GB on disk. Full 262k native context on a single 24 GB GPU such as the RTX 4090.

W4A16 post-training quantization of Qwen/Qwen3.6-27B — a vision-language model (images + text → text) with a hybrid Gated-DeltaNet + dense-MLP architecture. BF16 baseline is ~54 GB.

AutoRound (iters=200) with G32 group size. All dense Linear layers (attention + MLP) quantized to W4A16-G32. Gated DeltaNet recurrent layers excluded entirely (BF16) for vLLM correctness.

Vision calibration note: Calibration corpus is text-only. The vision encoder receives W4A16 quantization with text-derived calibration signal only. Text quality targets are fully met; vision inference quality may be reduced relative to text. For vision-critical workloads, consider the W8A16 variant.

At a Glance

Property	Value
Base model	`Qwen/Qwen3.6-27B`
Release tier	Gold (AutoRound iters=200)
Quant method	AutoRound W4A16-G32 iters=200
FLAC status	Not measured (T+7d milestone)
Architecture	Hybrid Gated-DeltaNet + Dense MLP (no MoE)
Layers	41 total (interleaved full-attention + Gated DeltaNet)
Quant format	compressed-tensors (native vLLM)
Attention layers	W4A16-G32-sym
Dense MLP blocks	W4A16-G32-sym (Marlin fast path on Ampere+)
Gated DeltaNet layers	BF16
Embeddings + LM head	BF16
KV cache dtype	FP8 (recommended)
Max context	262,144 tokens
Disk size	~14 GB

Memory Footprint

Component	262k context	128k context
Model weights	~14 GB	~14 GB
FP8 KV cache (32 seqs)	~4.9 GB	~2.5 GB
FP8 KV cache (8 seqs)	~1.2 GB	~0.6 GB
Total (32 seqs @ 128k)	—	~17 GB
Total (32 seqs @ 262k)	~19 GB	—

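Both configurations fit on a single RTX 4090 (24 GB) or A6000 (48 GB). The RTX 4090 can serve 32 concurrent sequences at 128k context. On a 24 GB card the totals above leave roughly 5 GB (262k context) to 7 GB (128k context) for activations and runtime overhead, which the table does not count.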
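The fit claim can be checked with simple arithmetic. The sketch below adds the weight and KV-cache figures from the table and compares them with the share of a 24 GB card that vLLM may use at --gpu-memory-utilization 0.92 (the value used in the Quick Start command). The function name is hypothetical, and activations and runtime overhead are deliberately left unmodeled, so treat the result as a rough upper bound on headroom.

```python
def fits_on_gpu(weights_gb, kv_cache_gb, gpu_gb, utilization=0.92):
    """Return (fits, headroom_gb) for weights plus KV cache inside the usable budget.

    Activations, CUDA graphs and other runtime overhead are not modeled (assumption).
    """
    budget_gb = gpu_gb * utilization
    used_gb = weights_gb + kv_cache_gb
    return used_gb <= budget_gb, round(budget_gb - used_gb, 2)


# Figures from the Memory Footprint table: 32 sequences, FP8 KV cache.
print("262k context:", fits_on_gpu(14.0, 4.9, 24.0))
print("128k context:", fits_on_gpu(14.0, 2.5, 24.0))
```

Lowering the utilization argument shows how quickly the margin shrinks when the GPU is shared with other processes.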

KV cache only materializes for the full-attention layers. The Gated DeltaNet layers maintain recurrent state of fixed size, independent of sequence length — this is why 262k context fits on a 24 GB GPU.

Quick Start

Tested with vLLM v0.21.0 (vllm/vllm-openai:v0.21.0-cu129-ubuntu2404). Weights are in compressed-tensors format — vLLM detects and loads quantization automatically. No --quantization flag needed.

Serve at 262k context (high throughput)

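# vLLM listens on port 8000 by default; --port 8080 matches the published port
# and the base_url used by the Python client.
docker run --gpus device=0 -p 8080:8080 \
  vllm/vllm-openai:v0.21.0-cu129-ubuntu2404 vllm serve \
  88plug/Qwen3.6-27B-W4A16 \
  --port 8080 \
  --kv-cache-dtype fp8 \
  --max-model-len 262144 \
  --max-num-seqs 64 \
  --gpu-memory-utilization 0.92 \
  --enable-prefix-caching \
  --reasoning-parser qwen3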

Python Client

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

response = client.chat.completions.create(
    model="88plug/Qwen3.6-27B-W4A16",
    messages=[{"role": "user", "content": "Your prompt here"}],
    max_tokens=512,
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
print(response.choices[0].message.content)

Requires vLLM ≥ v0.21.0. The compressed-tensors format is loaded natively — no extra plugins needed.

Recommended Sampling Parameters

Mode	Temperature	Top-P	Top-K	Min-P	Use When
Thinking (default)	0.6	0.95	20	0.0	Reasoning, math, code
Non-thinking	0.7	0.8	20	0.0	Chat, creative, fast response

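Enable or disable thinking with chat_template_kwargs={"enable_thinking": True} or chat_template_kwargs={"enable_thinking": False}, passed through extra_body in the OpenAI client as in the Python Client example. Thinking is enabled by default.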
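As a minimal sketch, the two rows of the table can be turned into request arguments for the OpenAI client. The helper name request_kwargs is hypothetical. The sketch assumes the vLLM server accepts top_k, min_p and chat_template_kwargs in the request body, which is why they are placed in extra_body rather than passed as regular OpenAI parameters.

```python
# Values copied from the Recommended Sampling Parameters table.
SAMPLING = {
    True: {"temperature": 0.6, "top_p": 0.95, "top_k": 20, "min_p": 0.0},   # thinking
    False: {"temperature": 0.7, "top_p": 0.8, "top_k": 20, "min_p": 0.0},   # non-thinking
}


def request_kwargs(thinking: bool) -> dict:
    """Build keyword arguments for client.chat.completions.create()."""
    params = SAMPLING[thinking]
    return {
        "temperature": params["temperature"],
        "top_p": params["top_p"],
        # Not part of the OpenAI schema, so these travel in extra_body
        # (assumed to be accepted by the vLLM server).
        "extra_body": {
            "top_k": params["top_k"],
            "min_p": params["min_p"],
            "chat_template_kwargs": {"enable_thinking": thinking},
        },
    }


print(request_kwargs(thinking=False))
```

The result can be splatted into a call such as client.chat.completions.create(model=..., messages=..., **request_kwargs(True)).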

Quantization Design

What is quantized

All dense Linear layers — attention projections (q_proj, k_proj, v_proj, o_proj) and MLP blocks (gate_proj, up_proj, down_proj) — are quantized to W4A16-G32-sym (INT4 weights, BF16 activations, group size 32, symmetric).
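To make the W4A16-G32-sym notation concrete, here is a small pure-Python sketch of symmetric 4-bit quantization with one scale per group of 32 weights. It uses plain round-to-nearest, not AutoRound's learned rounding, and the function names are hypothetical; it only illustrates the storage layout (integer codes in [-8, 7] plus per-group scales) and the dequantization step.

```python
import random


def quantize_group_sym_int4(weights, group_size=32):
    """Round-to-nearest symmetric INT4 quantization, one scale per group.

    This is NOT AutoRound: no rounding offsets are learned here.
    """
    codes, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        scale = max_abs / 7 if max_abs > 0 else 1.0  # map the largest weight to +/-7
        scales.append(scale)
        codes.extend(max(-8, min(7, round(w / scale))) for w in group)
    return codes, scales


def dequantize(codes, scales, group_size=32):
    """Recover approximate weights: code times the scale of its group."""
    return [c * scales[i // group_size] for i, c in enumerate(codes)]


random.seed(0)
row = [random.gauss(0.0, 0.02) for _ in range(256)]  # toy weight row
codes, scales = quantize_group_sym_int4(row, group_size=32)
restored = dequantize(codes, scales, group_size=32)
max_err = max(abs(a - b) for a, b in zip(row, restored))
print("scales stored:", len(scales), "max abs error:", max_err)
```

Each weight's error is bounded by half the scale of its own group, which is why groups that track local weight magnitude more closely tend to round more accurately.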

Group size 32 means one scale per 32 weights. Competing W4 releases generally use G128 (4× coarser). For a dense model with full-width MLP layers, coarser groups accumulate more rounding error per block, particularly in layers where weight magnitude varies within a row. G32 stores 4× as many scales as G128; assuming 16-bit scales, that is about 0.375 extra bits per weight, or roughly 2% of the BF16 footprint, and it delivers measurably lower KL divergence on held-out data.
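The storage cost of a finer group size can be estimated with a one-line formula: payload bits plus one scale shared by each group. The sketch below assumes 16-bit scales and no stored zero points (symmetric quantization); both are assumptions, not values read from the checkpoint.

```python
def bits_per_weight(group_size, weight_bits=4, scale_bits=16):
    """Effective storage per weight: payload plus one scale amortized over the group.

    scale_bits=16 is an assumption; symmetric quantization stores no zero point.
    """
    return weight_bits + scale_bits / group_size


g32 = bits_per_weight(32)
g128 = bits_per_weight(128)
extra = g32 - g128
print("G32:", g32, "bits/weight")
print("G128:", g128, "bits/weight")
print("extra:", extra, "bits/weight,", round(100 * extra / 16, 1), "% of a BF16 weight")
```

Under these assumptions the gap is small in absolute terms, while the number of scales (and so the granularity of the quantization grid) is four times higher.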

Gated DeltaNet exclusion

All linear_attn.* parameters are excluded entirely (BF16). This is required for correct vLLM inference — see vLLM issue #40252. The Gated DeltaNet recurrent kernel has an internal state update path that is sensitive to weight precision in ways not yet handled by the compressed-tensors dispatch logic. Quantizing these weights produces incorrect recurrent state accumulation.

Separately, the Gated DeltaNet architecture itself is why 262k context fits on a 24 GB GPU: these layers maintain fixed-size recurrent state independent of sequence length, so the standard KV cache, which scales with context length, only materializes for the full-attention layers.

Precision assignment

Module class	Precision	Reason
`q_proj`, `k_proj`, `v_proj`, `o_proj` (full-attn)	W4A16-G32-sym	AutoRound (iters=200)
`gate_proj`, `up_proj`, `down_proj` (dense MLP)	W4A16-G32-sym	Marlin fast path; dominant parameter count
`linear_attn.*` (Gated DeltaNet)	BF16	Must not quantize — vLLM #40252
`embed_tokens`, `lm_head`, `norm`	BF16	Standard practice
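The table can be read as a first-match rule list over module names. The sketch below encodes it that way; the example module paths are hypothetical (the real checkpoint may name its modules differently), and falling back to BF16 for unmatched modules is an assumption made for illustration.

```python
import re

# Rules mirror the Precision assignment table; the first matching rule wins.
RULES = [
    (re.compile(r"\.linear_attn\."), "BF16"),                 # Gated DeltaNet: never quantized
    (re.compile(r"(embed_tokens|lm_head|norm)"), "BF16"),     # embeddings, head, norms
    (re.compile(r"\.(q_proj|k_proj|v_proj|o_proj|gate_proj|up_proj|down_proj)$"),
     "W4A16-G32-sym"),                                        # full-attention and dense MLP
]


def precision_for(module_name, default="BF16"):
    """Return the precision assigned to a module name (default is an assumption)."""
    for pattern, precision in RULES:
        if pattern.search(module_name):
            return precision
    return default


# Hypothetical module paths, for illustration only.
for name in (
    "model.layers.3.self_attn.q_proj",
    "model.layers.0.mlp.down_proj",
    "model.layers.0.linear_attn.out_proj",
    "model.layers.3.input_layernorm",
    "lm_head",
):
    print(name, "->", precision_for(name))
```

Putting the linear_attn rule first guarantees that a projection inside a Gated DeltaNet block is never picked up by the generic projection rule.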

Quality Targets

Metric	Target
KL divergence from BF16	< 0.018
MMLU recovery	≥ 98%
RULER @ 128k	≥ 96%

Formal benchmark results (MMLU-Pro, GPQA, RULER@128k, MATH-500, HumanEval) are in progress and will be added to this card when complete. The targets above are the acceptance thresholds used during recipe development — the checkpoint was not published until all three were satisfied on held-out calibration data.

No benchmark numbers are fabricated or estimated in this card.
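For readers who want to reproduce the KL-divergence check on their own data, here is a minimal sketch of the metric. It assumes the divergence is KL(BF16 || quantized) over next-token distributions, averaged over token positions and measured in nats; the card does not state the direction or the averaging, so both are assumptions. The logits below are toy values, not measurements.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(ref_logits, test_logits):
    """KL(P_ref || P_test) in nats for one token position."""
    p = softmax(ref_logits)
    q = softmax(test_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def mean_kl(ref_batch, test_batch):
    """Average the per-position divergence over a batch of positions."""
    pairs = list(zip(ref_batch, test_batch))
    return sum(kl_divergence(r, t) for r, t in pairs) / len(pairs)


# Toy logits for two token positions; NOT results from this checkpoint.
bf16_logits = [[2.0, 1.0, 0.1], [0.5, 0.5, 3.0]]
quant_logits = [[1.9, 1.1, 0.1], [0.4, 0.6, 2.9]]
print("mean KL (nats):", mean_kl(bf16_logits, quant_logits))
```

In practice the two logit batches would come from running the BF16 base model and the quantized checkpoint on the same held-out prompts.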

vs. Other Qwen3.6-27B W4 Quants

Quant	Method	Group Size	Disk	Notes
88plug W4A16 (this)	AutoRound (iters=200), compressed-tensors	G32	~14 GB	Gated-DeltaNet BF16; native vLLM format
Lorbus/Qwen3.6-27B-int4-AutoRound (870K DL)	AutoRound compressed-tensors	G128	~14 GB	Same method, coarser group size
cyankiwi/Qwen3.6-27B-AWQ-INT4 (1.5M DL)	AWQ INT4	G128	~14 GB	AWQ
QuantTrio/Qwen3.6-27B-AWQ (893K DL)	AWQ INT4	G128	~14 GB	AWQ
unsloth/Qwen3.6-27B-GGUF (2M DL)	GGUF Q4–Q8	varies	14–28 GB	llama.cpp only

AutoRound (iters=200): The Lorbus release does not publish its iteration count. AutoRound's sign-gradient optimization converges substantially between 100 and 200 iterations for dense models of this size. Using iters=200 with the UltraChat/WikiText calibration mix produces better-calibrated scales than a quick-run default.

Technical Details

Calibration

Corpus: 75% UltraChat-200k (instruction-following) + 25% WikiText-103 (long-context prose). 1,024 samples × 2,048 tokens.
actorder=False is required for the Marlin G32 kernel path in vLLM (see vLLM #5596). Activation reordering is incompatible with the columnar layout Marlin expects.
No MoE expert coverage issue: Qwen3.6-27B is fully dense. Every calibration token covers every MLP block. There is no tail-expert starvation problem to solve.
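The calibration budget above works out as follows. This sketch assumes the 75/25 split applies to sample counts (rather than token counts) and that every sample is filled to the full 2,048 tokens; the function name is hypothetical.

```python
def calibration_plan(n_samples=1024, seq_len=2048, ultrachat_share=0.75):
    """Split the sample budget between the two corpora and count total tokens."""
    n_ultrachat = round(n_samples * ultrachat_share)
    n_wikitext = n_samples - n_ultrachat
    return {
        "ultrachat_samples": n_ultrachat,
        "wikitext_samples": n_wikitext,
        "total_tokens": n_samples * seq_len,
    }


print(calibration_plan())
```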

KV Cache Note

Full-attention layers maintain a KV cache that scales with sequence length. The Gated DeltaNet layers use recurrent state of fixed size — sequence length has no effect on their memory consumption. This asymmetry is what makes 262k context on a 24 GB GPU possible at W4A16.
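The asymmetry can be sketched numerically. In the example below, KV-cache size grows linearly with the number of tokens while the recurrent state does not depend on it at all. The layer count, head count, head dimension and state size are hypothetical placeholders chosen for round numbers; they are not taken from the model's configuration.

```python
def kv_cache_bytes(tokens, n_full_attn_layers, n_kv_heads, head_dim, bytes_per_elem=1):
    """Bytes of K and V for one sequence across the full-attention layers (FP8 = 1 byte)."""
    return 2 * n_full_attn_layers * n_kv_heads * head_dim * bytes_per_elem * tokens


def recurrent_state_bytes(n_recurrent_layers, state_elems_per_layer, bytes_per_elem=2):
    """Recurrent state size: no dependence on sequence length."""
    return n_recurrent_layers * state_elems_per_layer * bytes_per_elem


# Hypothetical shapes for illustration only; NOT read from the model config.
cfg = dict(n_full_attn_layers=8, n_kv_heads=2, head_dim=256)
state = recurrent_state_bytes(n_recurrent_layers=24, state_elems_per_layer=1_048_576)

for tokens in (131_072, 262_144):
    kv_gib = kv_cache_bytes(tokens, **cfg) / 2**30
    print(tokens, "tokens:", kv_gib, "GiB KV cache,", state / 2**20, "MiB recurrent state")
```

Doubling the context doubles the first number and leaves the second unchanged, which is the behavior the paragraph above describes.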

SGLang

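SGLang v0.5.8 provides RadixAttention for prefix-heavy workloads. The command below serves the BF16 base model (Qwen/Qwen3.6-27B, ~54 GB of weights) rather than this checkpoint, because compressed-tensors is vLLM-native only.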

Note: Gated-DeltaNet hybrid architecture support in SGLang v0.5.8 is unverified.

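# --host 0.0.0.0 is needed so the server is reachable through the published port;
# SGLang binds to 127.0.0.1 by default.
docker run --gpus device=0 -p 30000:30000 \
  lmsysorg/sglang:v0.5.8-cu129 python -m sglang.launch_server \
  --model-path Qwen/Qwen3.6-27B \
  --tp 1 \
  --mem-fraction-static 0.85 \
  --host 0.0.0.0 \
  --port 30000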

llama.cpp (GGUF)

For consumer GPUs, CPU, and Apple Silicon. Convert from the BF16 base — not from our compressed-tensors weights. Vision input requires a separate mmproj GGUF.

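# Build with CUDA
cmake -B build -DGGML_CUDA=ON && cmake --build build --config Release -j$(nproc)

# Convert from BF16 base. The positional argument is a local model directory,
# so this assumes the base model was downloaded to ./Qwen/Qwen3.6-27B.
python convert_hf_to_gguf.py Qwen/Qwen3.6-27B \
  --outtype bf16 --outfile Qwen3.6-27B-BF16.gguf
python convert_hf_to_gguf.py Qwen/Qwen3.6-27B \
  --mmproj --outfile Qwen3.6-27B-mmproj.gguf

llama-quantize Qwen3.6-27B-BF16.gguf Qwen3.6-27B-Q8_0.gguf Q8_0

# --imatrix expects an importance-matrix file, not raw calibration text,
# so compute it from the calibration text first.
llama-imatrix -m Qwen3.6-27B-BF16.gguf -f calibration_datav3.txt -o imatrix.dat
llama-quantize --imatrix imatrix.dat \
  Qwen3.6-27B-BF16.gguf Qwen3.6-27B-IQ4_XS.gguf IQ4_XS

llama-server \
  --model Qwen3.6-27B-IQ4_XS.gguf \
  --mmproj Qwen3.6-27B-mmproj.gguf \
  --n-gpu-layers 999 \
  --ctx-size 131072 \
  --port 8081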

Benchmarks

Metric	Status
Throughput (tok/s)	In progress — T+7d milestone
MMLU delta vs BF16	In progress — T+7d milestone
RULER@128k	In progress — T+30d milestone

No fabricated numbers. Results will be published to this card when measured.

Intended Use

This checkpoint is intended for:

Long-context retrieval, summarization, and reasoning over documents up to 262k tokens
High-throughput serving on a single 24 GB GPU (RTX 4090, RTX A5000) or 48 GB GPU (A6000)
Agentic workflows and reasoning tasks where a 27B dense model fits the quality/cost target

Thinking mode (enable_thinking: true) is supported and on by default. Keep it enabled for reasoning-intensive tasks, and disable it when fast responses matter more.

Citation

If you use this checkpoint in research, please cite the base model:

@misc{qwen3technicalreport,
  title  = {Qwen3 Technical Report},
  author = {Qwen Team},
  year   = {2025},
  url    = {https://huggingface.co/Qwen/Qwen3.6-27B}
}

Quantization methodology draws on:

AutoRound: Cheng et al., "Optimize Weight Rounding via Signed Gradient Descent for the Quantization of LLMs" (Intel)

About

88plug AI Lab ships FLAC-target compressed-tensors quantizations — AutoRound iters=200, native vLLM v0.21.0+, no extra flags.

This release: Gold tier, with full AutoRound calibration (1,024 samples, UltraChat + WikiText-103). Targets ≥ 98% MMLU recovery, matching the Quality Targets table.

All weights use compressed-tensors format. vLLM reads quantization_config automatically.

Browse all releases → huggingface.co/88plug
