AI & ML interests
None defined yet.
Recent Activity
๐ 88plug AI Lab
Production-grade compressed-tensors quantizations of frontier LLMs, VLMs, and omni models โ engineered for native vLLM v0.9.0+ deployment.
Why compressed-tensors
Most quantization formats (AWQ, GPTQ, GGUF) target a single inference backend and ship a frozen weight layout that cannot be further composed or modified at load time. compressed-tensors is the format developed by Neural Magic and maintained as a first-class vLLM citizen.
- Native vLLM integration. No format conversion, no plugin shims. vLLM reads compressed-tensors models directly via its built-in
CompressedTensorsWorker. Full PagedAttention, continuous batching, and tensor parallelism work without modification. - Composable precision. A single checkpoint can carry per-layer or per-group precision assignments. Mixed-precision MoE configurations are expressed in the same file.
- Reproducible calibration metadata. The quantization config, calibration scheme, and per-channel scales are stored inside the checkpoint.
- Forward compatibility. As vLLM adds new kernel support (FP8, INT8, sparse), compressed-tensors models gain that support without re-quantizing.
AWQ and GPTQ remain fine for llama.cpp and older toolchains. If you are deploying on vLLM in production, compressed-tensors is the correct choice.
Quality Standard
Ampere+ (A100, A6000, RTX 30xx+)
Ampere+ (A100, A6000, RTX 30xx+)
AutoRound at iters=200 runs sign-gradient optimization over a calibration set to minimize weight rounding error. At W4A16, this closes most of the gap between naive round-to-nearest and GPTQ/AWQ, while producing a checkpoint that vLLM can load natively.
Model Catalog
All 16 models in compressed-tensors format, validated for vLLM v0.9.0+.
Qwen3.6-35B-A3B โ Mixed-Precision MoE, 1M context
| Precision | Repo | Architecture |
|---|---|---|
| W8A16 | 88plug/Qwen3.6-35B-A3B-W8A16 | MoE, 35B total / 3.6B active |
| W4A16 | 88plug/Qwen3.6-35B-A3B-W4A16 | MoE, 35B total / 3.6B active |
Qwen3.6-27B โ Dense Hybrid, 262k context
| Precision | Repo | Architecture |
|---|---|---|
| W8A16 | 88plug/Qwen3.6-27B-W8A16 | Dense, 27B |
| W4A16 | 88plug/Qwen3.6-27B-W4A16 | Dense, 27B |
Qwen3-Omni-30B-A3B โ Audio + Vision + Speech
| Precision | Repo | Architecture |
|---|---|---|
| W8A16 | 88plug/Qwen3-Omni-30B-A3B-W8A16 | Omni MoE, 30B / 3B active |
| W4A16 | 88plug/Qwen3-Omni-30B-W4A16 | Omni MoE, 30B / 3B active |
Qwen2.5-Omni-7B โ Efficient Omni
| Precision | Repo | Architecture |
|---|---|---|
| W8A16 | 88plug/Qwen2.5-Omni-7B-W8A16 | Omni dense, 7B |
| W4A16 | 88plug/Qwen2.5-Omni-7B-W4A16 | Omni dense, 7B |
Gemma4-E4B-it โ Vision-Language Model
| Precision | Repo | Architecture |
|---|---|---|
| W8A16 | 88plug/Gemma4-E4B-it-W8A16 | VLM MoE, 4B active / 28B total |
| W4A16 | 88plug/Gemma4-E4B-it-W4A16 | VLM MoE, 4B active / 28B total |
Gemma4-E2B-it โ Ultra-Efficient VLM
| Precision | Repo | Architecture |
|---|---|---|
| W8A16 | 88plug/Gemma4-E2B-it-W8A16 | VLM MoE, 2B active / 26B total |
| W4A16 | 88plug/Gemma4-E2B-it-W4A16 | VLM MoE, 2B active / 26B total |
MiniCPM-o-4.5 โ Omni Model
| Precision | Repo | Architecture |
|---|---|---|
| W8A16 | 88plug/MiniCPM-o-4.5-W8A16 | Omni dense |
| W4A16 | 88plug/MiniCPM-o-4.5-W4A16 | Omni dense |
Nemotron-3-Nano-30B-A3B โ Hybrid SSM/Attention
| Precision | Repo | Architecture |
|---|---|---|
| W8A16 | 88plug/Nemotron-3-Nano-30B-A3B-W8A16 | Hybrid Mamba2 SSM + Attention MoE |
| W4A16 | 88plug/Nemotron-3-Nano-30B-A3B-W4A16 | Hybrid Mamba2 SSM + Attention MoE |
Quickstart
Requires vLLM v0.9.0+ and an Ampere-class GPU (A100, A6000, RTX 3090/4090, or equivalent).
Install
pip install vllm>=0.9.0
Offline inference
from vllm import LLM, SamplingParams
llm = LLM(
model="88plug/Qwen3.6-35B-A3B-W4A16",
max_model_len=131072,
tensor_parallel_size=1,
)
sampling_params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=512)
outputs = llm.generate(["Explain W4A16 vs W8A16 tradeoffs."], sampling_params)
print(outputs[0].outputs[0].text)
OpenAI-compatible server
vllm serve 88plug/Qwen3.6-35B-A3B-W4A16 \
--max-model-len 131072 \
--port 8000
Hardware Requirements
| Model Size | W8A16 VRAM | W4A16 VRAM | Recommended |
|---|---|---|---|
| 2Bโ7B | 8โ16 GB | 6โ10 GB | Single A6000 / RTX 4090 |
| 27Bโ35B (dense) | 32โ40 GB | 20โ28 GB | Single A100 80G or 2ร A6000 |
| 30Bโ35B (MoE, 3B active) | 28โ36 GB | 18โ24 GB | Single A100 80G or 2ร A6000 |
Developer: Andrew Mello ยท 88plug.com
Issues and model requests: open a Discussion on the relevant model repo.
Uploads automated via 88plug-bot.