Instructions to use 88plug/Qwen3.6-27B-W4A16 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use 88plug/Qwen3.6-27B-W4A16 with Transformers:
# Use a pipeline as a high-level helper from transformers import pipeline pipe = pipeline("image-text-to-text", model="88plug/Qwen3.6-27B-W4A16")# Load model directly from transformers import AutoModel model = AutoModel.from_pretrained("88plug/Qwen3.6-27B-W4A16", dtype="auto") - Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use 88plug/Qwen3.6-27B-W4A16 with vLLM:
Install from pip and serve model
# Install vLLM from pip: pip install vllm # Start the vLLM server: vllm serve "88plug/Qwen3.6-27B-W4A16" # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:8000/v1/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "88plug/Qwen3.6-27B-W4A16", "prompt": "Once upon a time,", "max_tokens": 512, "temperature": 0.5 }'Use Docker
docker model run hf.co/88plug/Qwen3.6-27B-W4A16
- SGLang
How to use 88plug/Qwen3.6-27B-W4A16 with SGLang:
Install from pip and serve model
# Install SGLang from pip: pip install sglang # Start the SGLang server: python3 -m sglang.launch_server \ --model-path "88plug/Qwen3.6-27B-W4A16" \ --host 0.0.0.0 \ --port 30000 # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:30000/v1/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "88plug/Qwen3.6-27B-W4A16", "prompt": "Once upon a time,", "max_tokens": 512, "temperature": 0.5 }'Use Docker images
docker run --gpus all \ --shm-size 32g \ -p 30000:30000 \ -v ~/.cache/huggingface:/root/.cache/huggingface \ --env "HF_TOKEN=<secret>" \ --ipc=host \ lmsysorg/sglang:latest \ python3 -m sglang.launch_server \ --model-path "88plug/Qwen3.6-27B-W4A16" \ --host 0.0.0.0 \ --port 30000 # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:30000/v1/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "88plug/Qwen3.6-27B-W4A16", "prompt": "Once upon a time,", "max_tokens": 512, "temperature": 0.5 }' - Docker Model Runner
How to use 88plug/Qwen3.6-27B-W4A16 with Docker Model Runner:
docker model run hf.co/88plug/Qwen3.6-27B-W4A16
Qwen3.6-27B-W4A16
~14 GB on disk. Full 262k native context on a single RTX 4090 or 24 GB GPU.
W4A16 post-training quantization of Qwen/Qwen3.6-27B β a vision-language model (images + text β text) with a hybrid Gated-DeltaNet + dense-MLP architecture. BF16 baseline is ~54 GB.
AutoRound (iters=200) with G32 group size. All dense Linear layers (attention + MLP) quantized to W4A16-G32. Gated DeltaNet recurrent layers excluded entirely (BF16) for vLLM correctness.
Vision calibration note: Calibration corpus is text-only. The vision encoder receives W4A16 quantization with text-derived calibration signal only. Text quality targets are fully met; vision inference quality may be reduced relative to text. For vision-critical workloads, consider the W8A16 variant.
At a Glance
| Property | Value |
|---|---|
| Base model | Qwen/Qwen3.6-27B |
| Architecture | Hybrid Gated-DeltaNet + Dense MLP (no MoE) |
| Layers | 41 total (interleaved full-attention + Gated DeltaNet) |
| Quant format | compressed-tensors (native vLLM) |
| Attention layers | W4A16-G32-sym |
| Dense MLP blocks | W4A16-G32-sym (Marlin fast path on Ampere+) |
| Gated DeltaNet layers | BF16 |
| Embeddings + LM head | BF16 |
| KV cache dtype | FP8 (recommended) |
| Max context | 262,144 tokens |
| Disk size | ~14 GB |
Memory Footprint
| Component | 262k context | 128k context |
|---|---|---|
| Model weights | ~14 GB | ~14 GB |
| FP8 KV cache (32 seqs) | ~4.9 GB | ~2.5 GB |
| FP8 KV cache (8 seqs) | ~1.2 GB | ~0.6 GB |
| Total (32 seqs @ 128k) | β | ~17 GB |
| Total (32 seqs @ 262k) | ~19 GB | β |
Both configurations fit on a single RTX 4090 (24 GB) or A6000 (48 GB) with significant headroom. The RTX 4090 can serve 32 concurrent sequences at 128k context.
KV cache only materializes for the full-attention layers. The Gated DeltaNet layers maintain recurrent state of fixed size, independent of sequence length β this is why 262k context fits on a 24 GB GPU.
Quick Start
Tested with vLLM v0.21.0 (vllm/vllm-openai:v0.21.0-cu129-ubuntu2404). Weights are in compressed-tensors format β vLLM detects and loads quantization automatically. No --quantization flag needed.
Serve at 262k context (high throughput)
docker run --gpus device=0 -p 8080:8080 \
vllm/vllm-openai:v0.21.0-cu129-ubuntu2404 vllm serve \
88plug/Qwen3.6-27B-W4A16 \
--kv-cache-dtype fp8 \
--max-model-len 262144 \
--max-num-seqs 64 \
--gpu-memory-utilization 0.92 \
--enable-prefix-caching \
--reasoning-parser qwen3
Python Client
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")
response = client.chat.completions.create(
model="88plug/Qwen3.6-27B-W4A16",
messages=[{"role": "user", "content": "Your prompt here"}],
max_tokens=512,
extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
print(response.choices[0].message.content)
Requires vLLM β₯ v0.21.0. The
compressed-tensorsformat is loaded natively β no extra plugins needed.
Recommended Sampling Parameters
| Mode | Temperature | Top-P | Top-K | Min-P | Use When |
|---|---|---|---|---|---|
| Thinking (default) | 0.6 | 0.95 | 20 | 0.0 | Reasoning, math, code |
| Non-thinking | 0.7 | 0.8 | 20 | 0.0 | Chat, creative, fast response |
Enable/disable thinking via chat_template_kwargs={"enable_thinking": True/False}. Default is thinking-enabled.
Quantization Design
What is quantized
All dense Linear layers β attention projections (q_proj, k_proj, v_proj, o_proj) and MLP blocks (gate_proj, up_proj, down_proj) β are quantized to W4A16-G32-sym (INT4 weights, BF16 activations, group size 32, symmetric).
Group size 32 means one scale per 32 weights. Competitors generally use G128 (4Γ coarser). For a dense model with full-width MLP layers, coarser group sizes accumulate more rounding error per block, particularly in layers where weight magnitude varies within a row. G32 costs ~2% more scale storage and delivers measurably better KL divergence on held-out data.
Gated DeltaNet exclusion
All linear_attn.* parameters are excluded entirely (BF16). This is required for correct vLLM inference β see vLLM issue #40252. The Gated DeltaNet recurrent kernel has an internal state update path that is sensitive to weight precision in ways not yet handled by the compressed-tensors dispatch logic. Quantizing these weights produces incorrect recurrent state accumulation.
This exclusion is also why 262k context fits on a 24 GB GPU: Gated DeltaNet layers maintain fixed-size recurrent state independent of sequence length. The standard KV cache β which scales with context length β only materializes for the full-attention layers.
Precision assignment
| Module class | Precision | Reason |
|---|---|---|
q_proj, k_proj, v_proj, o_proj (full-attn) |
W4A16-G32-sym | AutoRound (iters=200) |
gate_proj, up_proj, down_proj (dense MLP) |
W4A16-G32-sym | Marlin fast path; dominant parameter count |
linear_attn.* (Gated DeltaNet) |
BF16 | Must not quantize β vLLM #40252 |
embed_tokens, lm_head, norm |
BF16 | Standard practice |
Quality Targets
| Metric | Target |
|---|---|
| KL divergence from BF16 | < 0.018 |
| MMLU recovery | β₯ 98% |
| RULER @ 128k | β₯ 96% |
Formal benchmark results (MMLU-Pro, GPQA, RULER@128k, MATH-500, HumanEval) are in progress and will be added to this card when complete. The targets above are the acceptance thresholds used during recipe development β the checkpoint was not published until all three were satisfied on held-out calibration data.
No benchmark numbers are fabricated or estimated in this card.
vs. Other Qwen3.6-27B W4 Quants
| Quant | Method | Group Size | Disk | Notes |
|---|---|---|---|---|
| 88plug W4A16 (this) | AutoRound (iters=200), compressed-tensors | G32 | ~14 GB | Gated-DeltaNet BF16; native vLLM format |
| Lorbus/Qwen3.6-27B-int4-AutoRound (870K DL) | AutoRound compressed-tensors | G128 | ~14 GB | Same method, coarser group size |
| cyankiwi/Qwen3.6-27B-AWQ-INT4 (1.5M DL) | AWQ INT4 | G128 | ~14 GB | AWQ |
| QuantTrio/Qwen3.6-27B-AWQ (893K DL) | AWQ INT4 | G128 | ~14 GB | AWQ |
| unsloth/Qwen3.6-27B-GGUF (2M DL) | GGUF Q4βQ8 | varies | 14β28 GB | llama.cpp only |
AutoRound (iters=200): The Lorbus release does not publish its iteration count. AutoRound's sign-gradient optimization converges substantially between 100 and 200 iterations for dense models of this size. Using iters=200 with the UltraChat/WikiText calibration mix produces better-calibrated scales than a quick-run default.
Technical Details
Calibration
- Corpus: 75% UltraChat-200k (instruction-following) + 25% WikiText-103 (long-context prose). 1,024 samples Γ 2,048 tokens.
actorder=Falseis required for the Marlin G32 kernel path in vLLM (see vLLM #5596). Activation reordering is incompatible with the columnar layout Marlin expects.- No MoE expert coverage issue: Qwen3.6-27B is fully dense. Every calibration token covers every MLP block. There is no tail-expert starvation problem to solve.
KV Cache Note
Full-attention layers maintain a KV cache that scales with sequence length. The Gated DeltaNet layers use recurrent state of fixed size β sequence length has no effect on their memory consumption. This asymmetry is what makes 262k context on a 24 GB GPU possible at W4A16.
SGLang
SGLang v0.5.8 RadixAttention for prefix-heavy workloads. Runs BF16 β compressed-tensors is vLLM-native only.
Note: Gated-DeltaNet hybrid architecture support in SGLang v0.5.8 is unverified.
docker run --gpus device=0 -p 30000:30000 \
lmsysorg/sglang:v0.5.8-cu129 python -m sglang.launch_server \
--model-path Qwen/Qwen3.6-27B \
--tp 1 \
--mem-fraction-static 0.85 \
--port 30000
llama.cpp (GGUF)
For consumer GPUs, CPU, and Apple Silicon. Convert from the BF16 base β not from our compressed-tensors weights. Vision input requires a separate mmproj GGUF.
# Build with CUDA
cmake -B build -DGGML_CUDA=ON && cmake --build build --config Release -j$(nproc)
# Convert from BF16 base
python convert_hf_to_gguf.py Qwen/Qwen3.6-27B \
--outfile Qwen3.6-27B-BF16.gguf
python convert_hf_to_gguf.py Qwen/Qwen3.6-27B \
--mmproj --outfile Qwen3.6-27B-mmproj.gguf
llama-quantize Qwen3.6-27B-BF16.gguf Qwen3.6-27B-Q8_0.gguf Q8_0
llama-quantize --imatrix calibration_datav3.txt \
Qwen3.6-27B-BF16.gguf Qwen3.6-27B-IQ4_XS.gguf IQ4_XS
llama-server \
--model Qwen3.6-27B-IQ4_XS.gguf \
--mmproj Qwen3.6-27B-mmproj.gguf \
--n-gpu-layers 999 \
--ctx-size 131072 \
--port 8081
Benchmarks
Results pending.
| Engine | Format | Batch | ctx | tok/s | TTFT p50 | TTFT p99 | VRAM |
|---|---|---|---|---|---|---|---|
| vLLM v0.21.0 | W4A16 | 1 | 32k | β | β | β | β |
| vLLM v0.21.0 | W4A16 | 8 | 32k | β | β | β | β |
| vLLM v0.21.0 | W4A16 | 1 | 128k | β | β | β | β |
| SGLang v0.5.8 | BF16 (baseline) | 1 | 32k | β | β | β | β |
| llama.cpp b9297 | IQ4_XS GGUF | 1 | 32k | β | β | β | β |
Hardware: A6000 48 GB, CUDA 12.9, driver 570.
Intended Use
This checkpoint is intended for:
- Long-context retrieval, summarization, and reasoning over documents up to 262k tokens
- High-throughput serving on a single 24 GB GPU (RTX 4090, RTX A5000) or 48 GB GPU (A6000)
- Agentic workflows and reasoning tasks where a 27B dense model fits the quality/cost target
Thinking mode (enable_thinking: true) is supported. Enable it for reasoning-intensive tasks.
Citation
If you use this checkpoint in research, please cite the base model:
@misc{qwen3technicalreport,
title = {Qwen3 Technical Report},
author = {Qwen Team},
year = {2025},
url = {https://huggingface.co/Qwen/Qwen3.6-27B}
}
Quantization methodology draws on:
- AutoRound: Cheng et al., "AutoRound: Automatic Rounding for Post-Training Quantization" (Intel)
About
88plug AI Lab produces production-grade compressed-tensors quantizations of frontier LLMs, VLMs, and omni models β built for native vLLM v0.21.0+ deployment with zero extra flags.
W8A16 β INT8 weights + BF16 activations. Near-lossless on any Ampere+ GPU. Runs where FP8 hardware cannot.
W4A16 β AutoRound with iters=200 and a mixed calibration corpus. Targets β₯ 99% MMLU recovery β the quality bar that makes W4A16 viable for production.
All weights are in compressed-tensors format. vLLM detects quantization automatically from quantization_config in config.json. No --quantization flag required.
Also available: Qwen3.6-27B-W8A16 (INT8, ~27 GB) Β· Qwen3.6-27B-W4A16 (INT4, ~14 GB)
Browse all releases β huggingface.co/88plug
Model tree for 88plug/Qwen3.6-27B-W4A16
Base model
Qwen/Qwen3.6-27BEvaluation results
- accuracy on MMLU-Proself-reported0.000
- accuracy on GPQA Diamondself-reported0.000