๐Ÿ”Œ 88plug AI Lab

Production-grade compressed-tensors quantizations of frontier LLMs, VLMs, and omni models โ€” engineered for native vLLM v0.9.0+ deployment.

Why compressed-tensors

Most quantization formats (AWQ, GPTQ, GGUF) target a single inference backend and ship a frozen weight layout that cannot be further composed or modified at load time. compressed-tensors is the format developed by Neural Magic and maintained as a first-class vLLM citizen.

AWQ and GPTQ remain fine for llama.cpp and older toolchains. If you are deploying on vLLM in production, compressed-tensors is the correct choice.

Quality Standard

W8A16
RTN / AutoRound iters=200
>99.5% MMLU recovery

Ampere+ (A100, A6000, RTX 30xx+)

W4A16
AutoRound iters=200 (SignSGD)
โ‰ฅ99% MMLU recovery

Ampere+ (A100, A6000, RTX 30xx+)

AutoRound at iters=200 runs sign-gradient optimization over a calibration set to minimize weight rounding error. At W4A16, this closes most of the gap between naive round-to-nearest and GPTQ/AWQ, while producing a checkpoint that vLLM can load natively.

Model Catalog

All 16 models in compressed-tensors format, validated for vLLM v0.9.0+.

Qwen3.6-35B-A3B โ€” Mixed-Precision MoE, 1M context

PrecisionRepoArchitecture
W8A1688plug/Qwen3.6-35B-A3B-W8A16MoE, 35B total / 3.6B active
W4A1688plug/Qwen3.6-35B-A3B-W4A16MoE, 35B total / 3.6B active

Qwen3.6-27B โ€” Dense Hybrid, 262k context

PrecisionRepoArchitecture
W8A1688plug/Qwen3.6-27B-W8A16Dense, 27B
W4A1688plug/Qwen3.6-27B-W4A16Dense, 27B

Qwen3-Omni-30B-A3B โ€” Audio + Vision + Speech

PrecisionRepoArchitecture
W8A1688plug/Qwen3-Omni-30B-A3B-W8A16Omni MoE, 30B / 3B active
W4A1688plug/Qwen3-Omni-30B-W4A16Omni MoE, 30B / 3B active

Qwen2.5-Omni-7B โ€” Efficient Omni

PrecisionRepoArchitecture
W8A1688plug/Qwen2.5-Omni-7B-W8A16Omni dense, 7B
W4A1688plug/Qwen2.5-Omni-7B-W4A16Omni dense, 7B

Gemma4-E4B-it โ€” Vision-Language Model

PrecisionRepoArchitecture
W8A1688plug/Gemma4-E4B-it-W8A16VLM MoE, 4B active / 28B total
W4A1688plug/Gemma4-E4B-it-W4A16VLM MoE, 4B active / 28B total

Gemma4-E2B-it โ€” Ultra-Efficient VLM

PrecisionRepoArchitecture
W8A1688plug/Gemma4-E2B-it-W8A16VLM MoE, 2B active / 26B total
W4A1688plug/Gemma4-E2B-it-W4A16VLM MoE, 2B active / 26B total

MiniCPM-o-4.5 โ€” Omni Model

PrecisionRepoArchitecture
W8A1688plug/MiniCPM-o-4.5-W8A16Omni dense
W4A1688plug/MiniCPM-o-4.5-W4A16Omni dense

Nemotron-3-Nano-30B-A3B โ€” Hybrid SSM/Attention

PrecisionRepoArchitecture
W8A1688plug/Nemotron-3-Nano-30B-A3B-W8A16Hybrid Mamba2 SSM + Attention MoE
W4A1688plug/Nemotron-3-Nano-30B-A3B-W4A16Hybrid Mamba2 SSM + Attention MoE

Quickstart

Requires vLLM v0.9.0+ and an Ampere-class GPU (A100, A6000, RTX 3090/4090, or equivalent).

Install

pip install vllm>=0.9.0

Offline inference

from vllm import LLM, SamplingParams

llm = LLM(
    model="88plug/Qwen3.6-35B-A3B-W4A16",
    max_model_len=131072,
    tensor_parallel_size=1,
)

sampling_params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=512)
outputs = llm.generate(["Explain W4A16 vs W8A16 tradeoffs."], sampling_params)
print(outputs[0].outputs[0].text)

OpenAI-compatible server

vllm serve 88plug/Qwen3.6-35B-A3B-W4A16 \
    --max-model-len 131072 \
    --port 8000

Hardware Requirements

Model SizeW8A16 VRAMW4A16 VRAMRecommended
2Bโ€“7B8โ€“16 GB6โ€“10 GBSingle A6000 / RTX 4090
27Bโ€“35B (dense)32โ€“40 GB20โ€“28 GBSingle A100 80G or 2ร— A6000
30Bโ€“35B (MoE, 3B active)28โ€“36 GB18โ€“24 GBSingle A100 80G or 2ร— A6000

Contact
Developer: Andrew Mello  ยท  88plug.com
Issues and model requests: open a Discussion on the relevant model repo.
Uploads automated via 88plug-bot.