Instructions to use 88plug/Qwen3.6-27B-W8A16 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use 88plug/Qwen3.6-27B-W8A16 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="88plug/Qwen3.6-27B-W8A16")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
# The checkpoint is tagged image-text-to-text, so use a processor (tokenizer + image
# preprocessing) and the image-text-to-text auto class rather than a text-only tokenizer.
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("88plug/Qwen3.6-27B-W8A16")
model = AutoModelForImageTextToText.from_pretrained("88plug/Qwen3.6-27B-W8A16")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
# Render the chat template and preprocess the image in one call
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
# Decode only the newly generated tokens, skipping the prompt
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

vLLM

How to use 88plug/Qwen3.6-27B-W8A16 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "88plug/Qwen3.6-27B-W8A16"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "88plug/Qwen3.6-27B-W8A16",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker

docker model run hf.co/88plug/Qwen3.6-27B-W8A16

SGLang

How to use 88plug/Qwen3.6-27B-W8A16 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "88plug/Qwen3.6-27B-W8A16" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "88plug/Qwen3.6-27B-W8A16",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "88plug/Qwen3.6-27B-W8A16" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "88plug/Qwen3.6-27B-W8A16",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Docker Model Runner
How to use 88plug/Qwen3.6-27B-W8A16 with Docker Model Runner:
```
docker model run hf.co/88plug/Qwen3.6-27B-W8A16
```

Qwen3.6-27B-W8A16

INT8 post-training quantization of Qwen/Qwen3.6-27B, a vision-language model (images + text → text) with a hybrid Gated-DeltaNet + dense-MLP architecture. ~27 GB on disk. Runs the full 262k-token context on one 48 GB A6000.

At a Glance

| Property | Value |
|---|---|
| Base model | `Qwen/Qwen3.6-27B` |
| Release tier | Gold (AutoRound iters=200) |
| Quant method | AutoRound W8A16 iters=200 |
| FLAC status | Not measured (T+7d milestone) |
| Architecture | Hybrid Gated-DeltaNet + dense MLP |
| Quant format | compressed-tensors (native vLLM v0.21.0+) |
| Weights | INT8 (W8A16), BF16 activations |
| Max context | 262,144 tokens |
| Disk size | ~27 GB |

The official FP8 release requires Blackwell or H100 hardware. This quant fills the gap: near-lossless W8A16 that runs on any FP16-capable GPU — A6000, RTX 4090, L40S, A100-40.

What Makes This Different

The Gap This Fills

The official Qwen3.6-27B release ships in BF16 (~54 GB) and FP8 (~28 GB). The FP8 checkpoint is great, but FP8 inference kernels require a Blackwell (RTX 50xx) or Hopper (H100/H200) generation GPU. On Ampere or Ada hardware (RTX 3090/4090, A6000, A100), FP8 activations are not natively supported.

This W8A16 checkpoint lands at the same ~27–28 GB footprint as the official FP8 model, but uses INT8 weights and BF16 activations — a format that Marlin kernels in vLLM serve natively on any Ampere+ GPU. No hardware generation restriction.

The Solutions Applied Here

AutoRound W8A16 (INT8 weights, BF16 activations). Leaving activations in BF16 removes activation quantization error, the dominant source of quality loss in W8A8 methods. AutoRound's sign-gradient rounding optimization tunes how each weight is rounded, which is intended to give better per-weight calibration than GPTQ or AWQ at the same bit width and to reduce worst-case outlier distortion.

Group size G32 — fine-grained scale resolution. A group size of 32 weights per scale factor provides 4× finer quantization resolution than the common G128, at the cost of a modest overhead in scale storage. For a 27B dense model this is the right tradeoff: scale storage is negligible, and the quality improvement on long-context tasks is measurable.
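The storage side of this tradeoff can be worked out directly. The sketch below assumes one 16-bit scale per group and no zero points (symmetric quantization); the card does not state the exact scale dtype, so treat `scale_bits=16` as an assumption.

```python
def scale_overhead_bits(group_size: int, scale_bits: int = 16) -> float:
    """Extra bits stored per weight when each group shares one scale.

    scale_bits=16 is an assumption (a BF16/FP16 scale per group).
    """
    return scale_bits / group_size


def effective_bits_per_weight(weight_bits: int, group_size: int,
                              scale_bits: int = 16) -> float:
    """Weight bits plus the amortized per-group scale overhead."""
    return weight_bits + scale_overhead_bits(group_size, scale_bits)


for group_size in (32, 128):
    overhead = scale_overhead_bits(group_size)
    total = effective_bits_per_weight(8, group_size)
    print(f"G{group_size}: +{overhead} bits/weight for scales, {total} bits/weight total")
```

Under these assumptions G32 stores four times as many scales as G128, which is the same factor as the gain in scale resolution.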

Mixed calibration corpus. 75% UltraChat-200k (instruction-following fidelity) + 25% WikiText-103 (long-context fidelity). 1,024 samples at 2,048 tokens each. Text-only calibration data.
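As a quick sanity check on the corpus size, the mix above works out as follows. This is only arithmetic on the stated ratios; how the samples were actually drawn and shuffled is not described in this card, and the helper name is hypothetical.

```python
CALIBRATION_MIX = {"ultrachat_200k": 0.75, "wikitext_103_raw": 0.25}
NUM_SAMPLES = 1024
SEQ_LEN = 2048


def samples_per_source(mix: dict, num_samples: int) -> dict:
    """Split num_samples across sources according to their share of the mix."""
    counts = {name: int(num_samples * share) for name, share in mix.items()}
    # Give any rounding remainder to the largest source so the counts sum exactly.
    remainder = num_samples - sum(counts.values())
    counts[max(mix, key=mix.get)] += remainder
    return counts


counts = samples_per_source(CALIBRATION_MIX, NUM_SAMPLES)
total_tokens = NUM_SAMPLES * SEQ_LEN
print(counts)
print(f"{total_tokens:,} calibration tokens")
```

That is 768 UltraChat samples and 256 WikiText samples, or roughly 2.1 million calibration tokens in total.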

What Stays at BF16

| Layer | Reason |
|---|---|
| `linear_attn.*` | Gated DeltaNet; must stay BF16 per vLLM #40252 |
| `embed_tokens` | Embedding table; standard practice, disproportionate perplexity impact |
| `lm_head` | Output projection; standard practice |
| `norm` | Layer norms; negligible size, high sensitivity |

Vision calibration note: Calibration corpus is text-only. The vision encoder (ViT) receives RTN-style INT8 quantization with no calibration signal, which is near-lossless at 8-bit. Text quality is fully calibrated; vision quality is RTN INT8.
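For readers unfamiliar with the term, round-to-nearest (RTN) quantization can be sketched in a few lines. This is a generic symmetric group-wise INT8 illustration in plain Python, not the AutoRound code path and not necessarily the exact settings applied to the vision encoder; all function names are hypothetical.

```python
import random


def quantize_group_rtn(weights, bits=8):
    """Symmetric round-to-nearest: one scale per group, integers in [-qmax, qmax]."""
    qmax = 2 ** (bits - 1) - 1  # 127 for INT8
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize_group(q, scale):
    """Map integers back to floats; this is what a W8A16 kernel does before the matmul."""
    return [v * scale for v in q]


def mean_abs_error(weights, group_size):
    """Quantize in groups of group_size, dequantize, and measure the mean error."""
    total = 0.0
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        q, scale = quantize_group_rtn(group)
        total += sum(abs(w - r) for w, r in zip(group, dequantize_group(q, scale)))
    return total / len(weights)


# Toy data: small weights plus a single outlier in the first group.
rng = random.Random(0)
weights = [rng.gauss(0.0, 0.02) for _ in range(128)]
weights[5] = 1.0

err_g32 = mean_abs_error(weights, 32)
err_g128 = mean_abs_error(weights, 128)
print(f"mean abs error  G32: {err_g32:.6f}  G128: {err_g128:.6f}")
```

With smaller groups the outlier inflates the scale of only one group of 32 weights, so the remaining groups keep a fine step size. With a single group of 128, every weight shares the coarse scale.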

Architecture Notes

Qwen3.6-27B is a vision-language model built on a hybrid Gated-DeltaNet + dense-MLP backbone. Key characteristics relevant to serving:

Modalities: Image + Text → Text (pipeline tag: image-text-to-text)
Model type: qwen3_5 (dense, no MoE)
41 LLM layers: interleaved full-attention and Gated DeltaNet linear-attention layers in a dense stack
No MoE: no shared experts, no router gate — standard dense MLP blocks throughout
Native context: 262,144 tokens
KV cache: only full-attention layers maintain a KV cache; Gated DeltaNet layers use fixed recurrent state independent of sequence length

Because there is no MoE, there is no tail-expert calibration problem. Every parameter in the dense MLP blocks is calibrated with full coverage. The only special-case exclusions are the Gated DeltaNet layers (BF16 by requirement) and the standard embedding/head layers.

Memory Requirements

| Configuration | BF16 | This Quant (W8A16) | Official FP8 |
|---|---|---|---|
| Weights (disk/VRAM) | ~54 GB | ~27 GB | ~28 GB |
| KV cache @ 32k ctx (fp8) | ~0.3 GB | ~0.3 GB | ~0.3 GB |
| KV cache @ 128k ctx (fp8) | ~1.2 GB | ~1.2 GB | ~1.2 GB |
| KV cache @ 262k ctx (fp8) | ~2.4 GB | ~2.4 GB | ~2.4 GB |
| Total VRAM @ 32k ctx | ~55 GB | ~28 GB | ~29 GB |
| Total VRAM @ 262k ctx | ~57 GB | ~30 GB | ~31 GB |
| Minimum GPU | 1× A100 80GB | 1× RTX 4090 / A6000 / A100-40 | H100 / Blackwell only |

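KV cache figures are for the full-attention layers only. Linear-attention layers carry state in a fixed recurrent buffer independent of sequence length, which is why going from 32k to 262k context adds only about 2 GB. Note that the weights alone (~27 GB) exceed the 24 GB of a single RTX 4090, so single-GPU 262k serving needs a 40 GB or larger card such as the A100-40, A6000, or L40S.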
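The KV cache rows scale linearly with context length, so they can be reproduced from the 262k figure alone. The sketch below takes the weight and KV numbers from the table for this quant and adds a rough fit check. It ignores activation memory and runtime overhead, which is presumably why the table's totals are slightly higher, and the `utilization` default simply mirrors the `--gpu-memory-utilization 0.92` setting used elsewhere in this card.

```python
WEIGHTS_GB = 27.0        # "This Quant (W8A16)" weights row
KV_GB_AT_262K = 2.4      # fp8 KV cache at the full native context
MAX_CONTEXT = 262_144


def kv_cache_gb(tokens: int) -> float:
    """Linear interpolation of the table's fp8 KV cache figures."""
    return KV_GB_AT_262K * tokens / MAX_CONTEXT


def total_vram_gb(tokens: int) -> float:
    """Weights plus KV cache only; a lower bound on real usage."""
    return WEIGHTS_GB + kv_cache_gb(tokens)


def fits(gpu_gb: float, tokens: int, utilization: float = 0.92) -> bool:
    """Rough check: does weights + KV fit in the memory vLLM is allowed to use?"""
    return total_vram_gb(tokens) <= gpu_gb * utilization


for tokens in (32_768, 131_072, 262_144):
    print(f"{tokens:>7} tokens: KV ~{kv_cache_gb(tokens):.1f} GB, "
          f"weights + KV ~{total_vram_gb(tokens):.1f} GB")
```

Calling `fits(48, 262_144)` gives a quick yes/no for a 48 GB card at full context; swap in the memory size of your own GPU.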

Quick Start

Tested with vLLM v0.21.0 (vllm/vllm-openai:v0.21.0-cu129-ubuntu2404). Weights are in compressed-tensors format — vLLM detects and loads quantization automatically. No --quantization flag needed.

Footguns to avoid: Do NOT use --quantization turboquant (vLLM #41560). Do NOT use --tensor-parallel-size > 1 on a single GPU.

262k Context — Full Native Context (Recommended)

Native context window, no rope scaling, maximum quality.

docker run --gpus device=0 -p 8080:8080 \
  vllm/vllm-openai:v0.21.0-cu129-ubuntu2404 vllm serve \
  88plug/Qwen3.6-27B-W8A16 \
  --port 8080 \
  --kv-cache-dtype fp8 \
  --max-model-len 262144 \
  --max-num-seqs 32 \
  --gpu-memory-utilization 0.92 \
  --enable-prefix-caching \
  --reasoning-parser qwen3

Python Client

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

response = client.chat.completions.create(
    model="88plug/Qwen3.6-27B-W8A16",
    messages=[{"role": "user", "content": "Your prompt here"}],
    max_tokens=512,
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
print(response.choices[0].message.content)

Recommended Sampling Parameters

| Mode | Temperature | Top-P | Top-K | Min-P | Use When |
|---|---|---|---|---|---|
| Thinking (default) | 0.6 | 0.95 | 20 | 0.0 | Reasoning, math, code |
| Non-thinking | 0.7 | 0.8 | 20 | 0.0 | Chat, creative, fast response |

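Enable or disable thinking by passing chat_template_kwargs={"enable_thinking": True} or chat_template_kwargs={"enable_thinking": False} in the request body (with the OpenAI Python client, inside extra_body). Thinking is enabled by default.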
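One way to keep these settings in one place is a small helper that builds the request arguments for either mode. This is a sketch: `chat_kwargs` is a hypothetical helper name, and it assumes the vLLM OpenAI-compatible server, which accepts `top_k` and `min_p` as extra request fields alongside `chat_template_kwargs`.

```python
MODEL_ID = "88plug/Qwen3.6-27B-W8A16"

# Values copied from the sampling table; keyed by whether thinking is enabled.
SAMPLING = {
    True: {"temperature": 0.6, "top_p": 0.95, "top_k": 20, "min_p": 0.0},
    False: {"temperature": 0.7, "top_p": 0.8, "top_k": 20, "min_p": 0.0},
}


def chat_kwargs(prompt: str, thinking: bool, max_tokens: int = 512) -> dict:
    """Build keyword arguments for client.chat.completions.create()."""
    params = SAMPLING[thinking]
    return {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": params["temperature"],
        "top_p": params["top_p"],
        # Fields outside the standard OpenAI schema travel in extra_body.
        "extra_body": {
            "top_k": params["top_k"],
            "min_p": params["min_p"],
            "chat_template_kwargs": {"enable_thinking": thinking},
        },
    }


kwargs = chat_kwargs("Summarize the delta rule in two sentences.", thinking=False)
print(kwargs["temperature"], kwargs["extra_body"])
```

With the client from the Python Client section, the call becomes `client.chat.completions.create(**chat_kwargs("Your prompt here", thinking=True))`.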

Quality

Targets

| Metric | Target |
|---|---|
| KL divergence KL(quant‖BF16) | < 0.005 |
| MMLU recovery vs BF16 | ≥ 99.7% |
| RULER@128k recovery vs BF16 | ≥ 99% |

These targets drove the recipe design — W8A16 instead of W8A8 to eliminate activation quantization error, G32 group size for fine-grained scale resolution, and mixed calibration corpus to preserve both instruction-following and long-context fidelity.
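For anyone running their own evals, the two kinds of metric in the table can be computed as sketched below. This is a generic plain-Python illustration that assumes you already have per-token logits from both models for the same inputs. The card does not describe the author's evaluation harness, so the function names and the averaging over tokens are assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one token's vocabulary logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats; eps guards against log(0)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def mean_token_kl(quant_logits, ref_logits):
    """Average KL(quant || BF16) over token positions."""
    kls = [kl_divergence(softmax(a), softmax(b))
           for a, b in zip(quant_logits, ref_logits)]
    return sum(kls) / len(kls)


def recovery_percent(quant_score, ref_score):
    """Benchmark recovery: quantized score as a percentage of the BF16 score."""
    return 100.0 * quant_score / ref_score


# Toy logits for two token positions over a 3-word vocabulary.
ref = [[2.0, 1.0, 0.1], [0.5, 0.5, 3.0]]
quant = [[2.0, 1.02, 0.1], [0.5, 0.48, 3.0]]
print(f"mean KL: {mean_token_kl(quant, ref):.6f}")
```

A mean KL below 0.005 and a recovery at or above the table's thresholds would meet the stated targets.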

Full benchmark results will be added after publication. If you run evals, please open an issue or PR.

vs. Other Qwen3.6-27B Quants

This is the first compressed-tensors W8A16 checkpoint for Qwen3.6-27B. It fills the gap between the 54 GB BF16 base and the hardware-restricted official FP8.

| Quant | Method | Size | GPU Compatibility | Notes |
|---|---|---|---|---|
| 88plug W8A16 (this) | compressed-tensors AutoRound | ~27 GB | Any Ampere+ (A6000, RTX 4090, A100) | First W8A16 for this model |
| Qwen/Qwen3.6-27B-FP8 (6.7M DL) | FP8 | ~28 GB | Blackwell / H100 only | Official; hardware restricted |
| unsloth/Qwen3.6-27B-GGUF (2M DL) | GGUF Q4–Q8 | 14–28 GB | CPU, Apple Silicon, any GPU | llama.cpp, no vLLM |
| cyankiwi/Qwen3.6-27B-AWQ-INT4 (1.5M DL) | AWQ INT4 | ~14 GB | Any GPU | 4-bit only |
| QuantTrio/Qwen3.6-27B-AWQ (893K DL) | AWQ INT4 | ~14 GB | Any GPU | 4-bit only |
| Lorbus/Qwen3.6-27B-int4-AutoRound (870K DL) | compressed-tensors W4G128 | ~14 GB | Any GPU | W4 only, G128 |

Why W8A16 over FP8: FP8 activation quantization on Ampere hardware falls back to BF16 dispatch silently, negating the memory savings. W8A16 is the correct format for Ampere/Ada inference: INT8 weights (Marlin kernel), BF16 activations — no hardware generation requirement, predictable behavior.

Why compressed-tensors over GGUF at this size: Marlin INT8 kernel throughput at batch > 1 significantly exceeds llama.cpp GGUF on GPU. Weights stay on GPU; no CPU↔GPU transfer overhead. vLLM OpenAI-compatible API, prefix caching, chunked prefill — all work natively.

SGLang

SGLang v0.5.8 offers RadixAttention for prefix-heavy workloads. Run against the BF16 base model — compressed-tensors is vLLM-native only.

Note: Gated-DeltaNet hybrid architecture support in SGLang v0.5.8 is unverified. Confirm before production use.

docker run --gpus device=0 -p 30000:30000 \
  lmsysorg/sglang:v0.5.8-cu129 python -m sglang.launch_server \
  --model-path Qwen/Qwen3.6-27B \
  --tp 1 \
  --mem-fraction-static 0.85 \
  --port 30000

llama.cpp (GGUF)

For consumer GPUs, CPU, and Apple Silicon. Convert from the BF16 base checkpoint — not from compressed-tensors weights. Vision requires a separate mmproj GGUF (libmtmd).

# Build with CUDA
cmake -B build -DGGML_CUDA=ON && cmake --build build --config Release -j$(nproc)

# Convert base model (text trunk)
python convert_hf_to_gguf.py Qwen/Qwen3.6-27B \
  --outfile Qwen3.6-27B-BF16.gguf

# Vision projector (required for image input)
python convert_hf_to_gguf.py Qwen/Qwen3.6-27B \
  --mmproj --outfile Qwen3.6-27B-mmproj.gguf

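# Quantize text trunk
llama-quantize Qwen3.6-27B-BF16.gguf Qwen3.6-27B-Q8_0.gguf Q8_0
# IQ4_XS uses an importance matrix: build it from the calibration text first,
# then pass the resulting file (not the raw text) to --imatrix
llama-imatrix -m Qwen3.6-27B-BF16.gguf -f calibration_datav3.txt -o imatrix.dat
llama-quantize --imatrix imatrix.dat \
  Qwen3.6-27B-BF16.gguf Qwen3.6-27B-IQ4_XS.gguf IQ4_XS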

# Serve (text + vision)
llama-server \
  --model Qwen3.6-27B-Q8_0.gguf \
  --mmproj Qwen3.6-27B-mmproj.gguf \
  --n-gpu-layers 999 \
  --ctx-size 131072 \
  --port 8081

Benchmarks

| Metric | Status |
|---|---|
| Throughput (tok/s) | In progress (T+7d milestone) |
| MMLU delta vs BF16 | In progress (T+7d milestone) |
| RULER@128k | In progress (T+30d milestone) |

No fabricated numbers. Results will be published to this card when measured.

Limitations

Dense model — no MoE savings on KV. Unlike the 35B-A3B sparse MoE variant, every one of the 41 layers is active on every token. KV cache memory scales with all full-attention layers in the stack, not a reduced subset. The 262k context window is still achievable on a single A6000, but the headroom is tighter than with the MoE variant.

No activation quantization. W8A16 means activations run in BF16. Memory bandwidth savings are on the weight side only; activation memory is unchanged versus BF16. This is a deliberate quality-vs-compression tradeoff.

Linear attention state reset. The Gated-DeltaNet layers maintain recurrent state, not a KV cache. This state is reset between requests. Stateful multi-turn inference within a session works correctly; cross-request state continuity is not supported by the current vLLM serving path.

Calibration distribution. Calibration used UltraChat-200k and WikiText-103. Tasks with significantly different token distributions (e.g., code-heavy, mathematical, or non-English) may see slightly higher KL divergence than the headline targets. Recalibration with domain-specific data is straightforward using the recipe below.

Quantization Recipe (Reproducibility)

# Core configuration
quantization_method = "autoround"
w_bits = 8
w_dtype = "int8"
a_dtype = "bf16"   # activations NOT quantized
group_size = 32
iters = 200
scale_method = "neural_max"  

calibration_dataset = {
    "ultrachat_200k": 0.75,
    "wikitext_103_raw": 0.25,
}
calibration_samples = 1024
calibration_seqlen = 2048

# BF16-preserved layers
skip_layers = [
    "linear_attn.*",  # Gated DeltaNet — required by vLLM
    "lm_head",        # output projection
    "embed_tokens",   # embedding table
    "norm",           # layer norms
]
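To see what the skip list does in practice, the sketch below classifies a few module names. The names are hypothetical examples in the usual Hugging Face naming style, and the matching rule (each entry treated as a regular expression searched anywhere in the module name) is an assumption; check the behaviour of the quantization tool you use before relying on it.

```python
import re

# Same entries as the recipe's skip_layers list.
SKIP_PATTERNS = ["linear_attn.*", "lm_head", "embed_tokens", "norm"]


def stays_bf16(module_name: str, patterns=SKIP_PATTERNS) -> bool:
    """True if any skip pattern matches somewhere in the module name."""
    return any(re.search(pattern, module_name) for pattern in patterns)


# Hypothetical module names, for illustration only.
example_modules = [
    "model.embed_tokens",
    "model.layers.0.linear_attn.in_proj",
    "model.layers.3.self_attn.q_proj",
    "model.layers.3.mlp.down_proj",
    "model.layers.3.input_layernorm",
    "model.norm",
    "lm_head",
]

for name in example_modules:
    label = "BF16 (skipped)" if stays_bf16(name) else "INT8 (quantized)"
    print(f"{name:<40} {label}")
```

Under this matching rule only the attention and MLP projections in the example are quantized; everything matched by the skip list is left in BF16.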

Related Work

Qwen/Qwen3.6-27B — base model
AutoRound — sign gradient-based weight rounding optimization
vLLM compressed-tensors — inference backend
vLLM #40252 — Gated DeltaNet BF16 requirement

Citation

If you use this model, please cite the base model:

@misc{qwen3technicalreport,
  title  = {Qwen3 Technical Report},
  author = {Qwen Team},
  year   = {2025},
  url    = {https://huggingface.co/Qwen/Qwen3.6-27B}
}

About

88plug AI Lab ships FLAC-target compressed-tensors quantizations — AutoRound iters=200, native vLLM v0.21.0+, no extra flags.

This release: Gold tier — full AutoRound calibration (1024 samples, UltraChat + WikiText-103). Targets ≥99% MMLU recovery.

All weights use compressed-tensors format. vLLM reads quantization_config automatically.

Browse all releases → huggingface.co/88plug
