Qwen3.6-27B-W8A16
INT8 post-training quantization of Qwen/Qwen3.6-27B, a vision-language model (images + text → text) with a hybrid Gated-DeltaNet + dense-MLP architecture. ~27 GB on disk. Runs the full 262k-token context on a single 48 GB A6000.
The official FP8 release requires Blackwell or H100 hardware. This quant fills the gap: near-lossless W8A16 that runs on Ampere-or-newer GPUs with enough memory for the ~27 GB of weights, such as the A6000, L40S, and A100-40.
What Makes This Different
The Gap This Fills
The official Qwen3.6-27B release ships in BF16 (54 GB) and FP8 (28 GB). The FP8 checkpoint is great — but FP8 inference kernels require Blackwell (RTX 50xx) or Hopper (H100/H200) GPU generation. On Ampere or Ada hardware (RTX 3090/4090, A6000, A100), FP8 activations are not natively supported.
This W8A16 checkpoint lands at the same ~27–28 GB footprint as the official FP8 model, but uses INT8 weights and BF16 activations — a format that Marlin kernels in vLLM serve natively on any Ampere+ GPU. No hardware generation restriction.
The Solutions Applied Here
AutoRound W8A16: INT8 weights, BF16 activations. Leaving activations in BF16 removes activation quantization error, the dominant source of quality loss in W8A8 methods. Instead of plain round-to-nearest, AutoRound tunes each weight's rounding decision and clipping range with signed gradient descent, which is intended to reduce worst-case outlier distortion relative to GPTQ or AWQ at the same bit width.
Group size G32 — fine-grained scale resolution. A group size of 32 weights per scale factor provides 4× finer quantization resolution than the common G128, at the cost of a modest overhead in scale storage. For a 27B dense model this is the right tradeoff: scale storage is negligible, and the quality improvement on long-context tasks is measurable.
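The storage cost of a smaller group size can be estimated with simple arithmetic. The sketch below assumes one 16-bit scale per group and symmetric quantization with no zero-point; both are assumptions for illustration, not details confirmed by this card.

```python
def scale_overhead_bits(group_size: int, scale_bits: int = 16) -> float:
    """Extra bits stored per weight when each group carries one scale."""
    return scale_bits / group_size


def effective_bits(weight_bits: int, group_size: int, scale_bits: int = 16) -> float:
    """Weight bits plus the amortized cost of the per-group scale."""
    return weight_bits + scale_overhead_bits(group_size, scale_bits)


for g in (32, 128):
    # G32 stores 4x as many scales as G128 for the same tensor.
    print(g, scale_overhead_bits(g), effective_bits(8, g))
```

Under these assumptions G32 costs 0.5 extra bits per quantized weight versus 0.125 for G128, roughly 6% on top of the 8-bit payload.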
Mixed calibration corpus. 75% UltraChat-200k (instruction-following fidelity) + 25% WikiText-103 (long-context fidelity). 1,024 samples at 2,048 tokens each. Text-only calibration data.
What Stays at BF16
| Layer | Reason |
|---|---|
| `linear_attn.*` | Gated DeltaNet; must stay BF16 per vLLM #40252 |
| `embed_tokens` | Embedding table; standard practice, disproportionate perplexity impact |
| `lm_head` | Output projection; standard practice |
| `norm` | Layer norms; negligible size, high sensitivity |
Vision calibration note: Calibration corpus is text-only. The vision encoder (ViT) receives RTN-style INT8 quantization with no calibration signal, which is near-lossless at 8-bit. Text quality is fully calibrated; vision quality is RTN INT8.
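For readers unfamiliar with the term, round-to-nearest (RTN) group-wise quantization can be sketched in a few lines of plain Python. This is a generic illustration of symmetric INT8 RTN with a group size of 32, not the AutoRound procedure and not the code used to produce this checkpoint; the function names are hypothetical.

```python
import math


def quantize_group_rtn(group, bits=8):
    """Symmetric round-to-nearest quantization of one group of weights."""
    qmax = 2 ** (bits - 1) - 1  # 127 for INT8
    max_abs = max(abs(w) for w in group)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in group]
    return codes, scale


def quantize_rtn(weights, group_size=32, bits=8):
    """Split a flat weight list into groups and quantize each one."""
    return [quantize_group_rtn(weights[i:i + group_size], bits)
            for i in range(0, len(weights), group_size)]


def dequantize(groups):
    """Reconstruct approximate weights from integer codes and scales."""
    return [c * scale for codes, scale in groups for c in codes]


weights = [0.1 * math.sin(i) for i in range(64)]  # toy data
groups = quantize_rtn(weights)
restored = dequantize(groups)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(len(groups), max_err)
```

The reconstruction error per weight is bounded by half a quantization step, which at 8 bits is small relative to the group's largest weight.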
Architecture Notes
Qwen3.6-27B is a vision-language model built on a hybrid Gated-DeltaNet + dense-MLP backbone. Key characteristics relevant to serving:
- Modalities: Image + Text → Text (pipeline tag: `image-text-to-text`)
- Model type: `qwen3_5` (dense, no MoE)
- 41 LLM layers: interleaved full-attention and Gated DeltaNet linear-attention layers in a dense stack
- No MoE: no shared experts, no router gate — standard dense MLP blocks throughout
- Native context: 262,144 tokens
- KV cache: only full-attention layers maintain a KV cache; Gated DeltaNet layers use fixed recurrent state independent of sequence length
Because there is no MoE, there is no tail-expert calibration problem. Every parameter in the dense MLP blocks is calibrated with full coverage. The only special-case exclusions are the Gated DeltaNet layers (BF16 by requirement) and the standard embedding/head layers.
Memory Requirements
| Configuration | BF16 | This Quant (W8A16) | Official FP8 |
|---|---|---|---|
| Weights (disk/VRAM) | ~54 GB | ~27 GB | ~28 GB |
| KV cache @ 32k ctx (fp8) | ~0.3 GB | ~0.3 GB | ~0.3 GB |
| KV cache @ 128k ctx (fp8) | ~1.2 GB | ~1.2 GB | ~1.2 GB |
| KV cache @ 262k ctx (fp8) | ~2.4 GB | ~2.4 GB | ~2.4 GB |
| Total VRAM @ 32k ctx | ~55 GB | ~28 GB | ~29 GB |
| Total VRAM @ 262k ctx | ~57 GB | ~30 GB | ~31 GB |
| Minimum GPU | 1× A100 80GB | 1× A6000 / A100-40 (32 GB+ VRAM) | H100 / Blackwell only |
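KV cache figures are for the full-attention layers only. Linear-attention layers carry state in a fixed recurrent buffer independent of sequence length, which is why going from 32k to 262k context adds only about 2 GB. The weights alone (~27 GB) exceed the 24 GB of an RTX 4090, so the single-GPU totals above apply to cards with more memory, such as the 48 GB A6000.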
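The scaling behind these figures can be checked with a rough estimate. In the sketch below the layer count, KV-head count, and head dimension are placeholders chosen for illustration, not values read from the model's config; only the weight and KV totals passed to `fits_on_gpu` come from the table. The budget check ignores activations and runtime overhead.

```python
def kv_cache_bytes(seq_len, n_full_attn_layers, n_kv_heads, head_dim, bytes_per_elem=1):
    """KV cache bytes for one sequence; bytes_per_elem=1 models an fp8 cache."""
    # Factor 2: one K tensor and one V tensor per full-attention layer.
    return 2 * n_full_attn_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


def fits_on_gpu(weights_gb, kv_gb, gpu_gb, utilization=0.92):
    """Rough check of weights + KV cache against the usable share of GPU memory."""
    return weights_gb + kv_gb <= gpu_gb * utilization


# Placeholder shape, NOT taken from the model config.
shape = dict(n_full_attn_layers=8, n_kv_heads=4, head_dim=128)
short = kv_cache_bytes(32_768, **shape)
long = kv_cache_bytes(262_144, **shape)
print(long / short)               # 8.0: cache grows linearly with context length
print(fits_on_gpu(27, 2.4, 48))   # True: table totals on a 48 GB card
```

The linear-attention layers do not appear in the formula because their state does not grow with `seq_len`.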
Quick Start
Tested with vLLM v0.21.0 (vllm/vllm-openai:v0.21.0-cu129-ubuntu2404). Weights are in compressed-tensors format — vLLM detects and loads quantization automatically. No --quantization flag needed.
Footguns to avoid: Do NOT use `--quantization turboquant` (vLLM #41560). Do NOT use `--tensor-parallel-size` > 1 on a single GPU.
262k Context — Full Native Context (Recommended)
Native context window, no rope scaling, maximum quality.
# vLLM listens on port 8000 by default; --port 8080 makes it match the published port.
docker run --gpus device=0 -p 8080:8080 \
  vllm/vllm-openai:v0.21.0-cu129-ubuntu2404 vllm serve \
  88plug/Qwen3.6-27B-W8A16 \
  --port 8080 \
  --kv-cache-dtype fp8 \
  --max-model-len 262144 \
  --max-num-seqs 32 \
  --gpu-memory-utilization 0.92 \
  --enable-prefix-caching \
  --reasoning-parser qwen3
Python Client
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")
response = client.chat.completions.create(
    model="88plug/Qwen3.6-27B-W8A16",
    messages=[{"role": "user", "content": "Your prompt here"}],
    max_tokens=512,
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
print(response.choices[0].message.content)
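Since the model also accepts images, a request can carry an image alongside the text. The helper below is a minimal sketch that builds such a message in the OpenAI-compatible `image_url` format; the function name and the example URL are hypothetical.

```python
def image_message(prompt: str, image_url: str) -> list:
    """Build a single-turn chat message with one text part and one image part."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }
    ]


# Placeholder URL for illustration only.
messages = image_message("Describe this image in one sentence.",
                         "https://example.com/photo.jpg")
print(messages[0]["content"][1]["type"])
```

The resulting list can be passed as `messages=` to `client.chat.completions.create(...)` in place of the text-only message.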
Recommended Sampling Parameters
| Mode | Temperature | Top-P | Top-K | Min-P | Use When |
|---|---|---|---|---|---|
| Thinking (default) | 0.6 | 0.95 | 20 | 0.0 | Reasoning, math, code |
| Non-thinking | 0.7 | 0.8 | 20 | 0.0 | Chat, creative, fast response |
Enable/disable thinking via chat_template_kwargs={"enable_thinking": True/False}. Default is thinking-enabled.
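One way to keep the two presets consistent across calls is to build the request arguments from the table. This is a sketch, assuming the server accepts `top_k` and `min_p` as extra request fields passed through `extra_body`; the helper name `request_kwargs` is hypothetical.

```python
SAMPLING = {
    "thinking": {"temperature": 0.6, "top_p": 0.95, "top_k": 20, "min_p": 0.0},
    "non_thinking": {"temperature": 0.7, "top_p": 0.8, "top_k": 20, "min_p": 0.0},
}


def request_kwargs(prompt: str, thinking: bool = True, max_tokens: int = 512) -> dict:
    """Return keyword arguments for client.chat.completions.create()."""
    preset = SAMPLING["thinking" if thinking else "non_thinking"]
    return {
        "model": "88plug/Qwen3.6-27B-W8A16",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": preset["temperature"],
        "top_p": preset["top_p"],
        # Fields outside the standard OpenAI schema go through extra_body.
        "extra_body": {
            "top_k": preset["top_k"],
            "min_p": preset["min_p"],
            "chat_template_kwargs": {"enable_thinking": thinking},
        },
    }


kwargs = request_kwargs("Summarize this paragraph.", thinking=False)
print(kwargs["temperature"], kwargs["top_p"])
```

Usage would then be `client.chat.completions.create(**kwargs)` with the client from the previous section.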
Quality
Targets
| Metric | Target |
|---|---|
| KL divergence KL(quant‖BF16) | < 0.005 |
| MMLU recovery vs BF16 | ≥ 99.7% |
| RULER@128k recovery vs BF16 | ≥ 99% |
These targets drove the recipe design — W8A16 instead of W8A8 to eliminate activation quantization error, G32 group size for fine-grained scale resolution, and mixed calibration corpus to preserve both instruction-following and long-context fidelity.
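For anyone running their own evals, the two kinds of metric can be computed as follows. This is a minimal sketch for a single token position, assuming access to the logits of both models; a real measurement would average the KL value over many positions. The function names and the toy logits are illustrative and are not measurements of this checkpoint.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def kl_quant_vs_ref(quant_logits, ref_logits):
    """KL(quant || ref) for one next-token distribution."""
    log_q = log_softmax(quant_logits)
    log_p = log_softmax(ref_logits)
    return sum(math.exp(lq) * (lq - lp) for lq, lp in zip(log_q, log_p))


def recovery_pct(quant_score, ref_score):
    """Benchmark score of the quantized model as a percentage of the reference."""
    return 100.0 * quant_score / ref_score


# Toy logits, not taken from either model.
print(kl_quant_vs_ref([2.0, 1.0, 0.1], [2.0, 1.0, 0.1]))
print(kl_quant_vs_ref([2.0, 1.0, 0.1], [2.1, 0.9, 0.1]))
```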
Full benchmark results will be added after publication. If you run evals, please open an issue or PR.
vs. Other Qwen3.6-27B Quants
This is the first compressed-tensors W8A16 checkpoint for Qwen3.6-27B. It fills the gap between the 54 GB BF16 base and the hardware-restricted official FP8.
| Quant | Method | Size | GPU Compatibility | Notes |
|---|---|---|---|---|
| 88plug W8A16 (this) | compressed-tensors AutoRound | ~27 GB | Any Ampere+ (A6000, RTX 4090, A100) | First W8A16 for this model |
| Qwen/Qwen3.6-27B-FP8 (6.7M DL) | FP8 | ~28 GB | Blackwell / H100 only | Official; hardware restricted |
| unsloth/Qwen3.6-27B-GGUF (2M DL) | GGUF Q4–Q8 | 14–28 GB | CPU, Apple Silicon, any GPU | llama.cpp, no vLLM |
| cyankiwi/Qwen3.6-27B-AWQ-INT4 (1.5M DL) | AWQ INT4 | ~14 GB | Any GPU | 4-bit only |
| QuantTrio/Qwen3.6-27B-AWQ (893K DL) | AWQ INT4 | ~14 GB | Any GPU | 4-bit only |
| Lorbus/Qwen3.6-27B-int4-AutoRound (870K DL) | compressed-tensors W4G128 | ~14 GB | Any GPU | W4 only, G128 |
Why W8A16 over FP8: FP8 activation quantization on Ampere hardware falls back to BF16 dispatch silently, negating the memory savings. W8A16 is the correct format for Ampere/Ada inference: INT8 weights (Marlin kernel), BF16 activations — no hardware generation requirement, predictable behavior.
Why compressed-tensors over GGUF at this size: Marlin INT8 kernel throughput at batch > 1 significantly exceeds llama.cpp GGUF on GPU. Weights stay on GPU; no CPU↔GPU transfer overhead. vLLM OpenAI-compatible API, prefix caching, chunked prefill — all work natively.
SGLang
SGLang v0.5.8 offers RadixAttention for prefix-heavy workloads. Run against the BF16 base model — compressed-tensors is vLLM-native only.
Note: Gated-DeltaNet hybrid architecture support in SGLang v0.5.8 is unverified. Confirm before production use.
# --host 0.0.0.0 makes the server reachable through the published port.
docker run --gpus device=0 -p 30000:30000 \
  lmsysorg/sglang:v0.5.8-cu129 python -m sglang.launch_server \
  --model-path Qwen/Qwen3.6-27B \
  --tp 1 \
  --mem-fraction-static 0.85 \
  --host 0.0.0.0 \
  --port 30000
llama.cpp (GGUF)
For consumer GPUs, CPU, and Apple Silicon. Convert from the BF16 base checkpoint — not from compressed-tensors weights. Vision requires a separate mmproj GGUF (libmtmd).
# Build with CUDA
cmake -B build -DGGML_CUDA=ON && cmake --build build --config Release -j$(nproc)
# Convert base model (text trunk); --outtype bf16 keeps the output in BF16 to match the filename
python convert_hf_to_gguf.py Qwen/Qwen3.6-27B \
  --outtype bf16 \
  --outfile Qwen3.6-27B-BF16.gguf
# Vision projector (required for image input)
python convert_hf_to_gguf.py Qwen/Qwen3.6-27B \
--mmproj --outfile Qwen3.6-27B-mmproj.gguf
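# Quantize text trunk
llama-quantize Qwen3.6-27B-BF16.gguf Qwen3.6-27B-Q8_0.gguf Q8_0
# --imatrix expects an importance matrix file, so build one from the calibration text first
llama-imatrix -m Qwen3.6-27B-BF16.gguf -f calibration_datav3.txt -o imatrix.dat
llama-quantize --imatrix imatrix.dat \
  Qwen3.6-27B-BF16.gguf Qwen3.6-27B-IQ4_XS.gguf IQ4_XS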
# Serve (text + vision)
llama-server \
--model Qwen3.6-27B-Q8_0.gguf \
--mmproj Qwen3.6-27B-mmproj.gguf \
--n-gpu-layers 999 \
--ctx-size 131072 \
--port 8081
Benchmarks
Results pending. Will be published before first HuggingFace release.
| Engine | Format | Batch | ctx | tok/s | TTFT p50 | TTFT p99 | VRAM |
|---|---|---|---|---|---|---|---|
| vLLM v0.21.0 | W8A16 | 1 | 32k | — | — | — | — |
| vLLM v0.21.0 | W8A16 | 8 | 32k | — | — | — | — |
| vLLM v0.21.0 | W8A16 | 1 | 128k | — | — | — | — |
| SGLang v0.5.8 | BF16 (baseline) | 1 | 32k | — | — | — | — |
| llama.cpp b9297 | Q8_0 GGUF | 1 | 32k | — | — | — | — |
Hardware: A6000 48 GB, CUDA 12.9, driver 570.
Limitations
Dense model — no MoE savings on KV. Unlike the 35B-A3B sparse MoE variant, every one of the 41 layers is active on every token. KV cache memory scales with all full-attention layers in the stack, not a reduced subset. The 262k context window is still achievable on a single A6000, but the headroom is tighter than with the MoE variant.
No activation quantization. W8A16 means activations run in BF16. Memory bandwidth savings are on the weight side only; activation memory is unchanged versus BF16. This is a deliberate quality-vs-compression tradeoff.
Linear attention state reset. The Gated-DeltaNet layers maintain recurrent state, not a KV cache. This state is reset between requests. Stateful multi-turn inference within a session works correctly; cross-request state continuity is not supported by the current vLLM serving path.
Calibration distribution. Calibration used UltraChat-200k and WikiText-103. Tasks with significantly different token distributions (e.g., code-heavy, mathematical, or non-English) may see slightly higher KL divergence than the headline targets. Recalibration with domain-specific data is straightforward using the recipe below.
Quantization Recipe (Reproducibility)
# Core configuration
quantization_method = "autoround"
w_bits = 8
w_dtype = "int8"
a_dtype = "bf16" # activations NOT quantized
group_size = 32
iters = 200
scale_method = "neural_max"
calibration_dataset = {
    "ultrachat_200k": 0.75,
    "wikitext_103_raw": 0.25,
}
calibration_samples = 1024
calibration_seqlen = 2048
# BF16-preserved layers
skip_layers = [
    "linear_attn.*",  # Gated DeltaNet, required by vLLM
    "lm_head",        # output projection
    "embed_tokens",   # embedding table
    "norm",           # layer norms
]
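To see what the recipe implies in concrete numbers, the sketch below derives the per-dataset sample counts and the total calibration tokens, and shows one way the skip patterns could be matched against module names. It assumes the patterns are treated as regular expressions searched anywhere in the name, which may differ from how the quantization tool interprets them; the module names are hypothetical examples.

```python
import re

calibration_dataset = {"ultrachat_200k": 0.75, "wikitext_103_raw": 0.25}
calibration_samples = 1024
calibration_seqlen = 2048
skip_layers = ["linear_attn.*", "lm_head", "embed_tokens", "norm"]


def split_samples(total, mix):
    """Number of calibration samples drawn from each dataset."""
    return {name: round(total * frac) for name, frac in mix.items()}


def stays_bf16(module_name, patterns=skip_layers):
    """True if any skip pattern matches somewhere in the module name."""
    return any(re.search(p, module_name) for p in patterns)


counts = split_samples(calibration_samples, calibration_dataset)
total_tokens = calibration_samples * calibration_seqlen
print(counts, total_tokens)

# Hypothetical module names for illustration.
for name in ("model.layers.0.linear_attn.in_proj",
             "model.layers.3.mlp.down_proj",
             "lm_head"):
    print(name, stays_bf16(name))
```

With the values above, the mix works out to 768 UltraChat samples and 256 WikiText samples, about 2.1M calibration tokens in total.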
Related Work
- Qwen/Qwen3.6-27B — base model
- AutoRound — sign gradient-based weight rounding optimization
- vLLM compressed-tensors — inference backend
- vLLM #40252 — Gated DeltaNet BF16 requirement
Citation
If you use this model, please cite the base model:
@misc{qwen3technicalreport,
title = {Qwen3 Technical Report},
author = {Qwen Team},
year = {2025},
url = {https://huggingface.co/Qwen/Qwen3.6-27B}
}
About
88plug AI Lab produces production-grade compressed-tensors quantizations of frontier LLMs, VLMs, and omni models — built for native vLLM v0.21.0+ deployment with zero extra flags.
W8A16 — INT8 weights + BF16 activations. Near-lossless on any Ampere+ GPU. Runs where FP8 hardware cannot.
W4A16 — AutoRound with iters=200 and a mixed calibration corpus. Targets ≥ 99% MMLU recovery — the quality bar that makes W4A16 viable for production.
All weights are in compressed-tensors format. vLLM detects quantization automatically from quantization_config in config.json. No --quantization flag required.
Also available: Qwen3.6-27B-W4A16 (INT4, ~14 GB) · Qwen3.6-27B-W8A16 (INT8, ~27 GB)
Browse all releases → huggingface.co/88plug