---
title: 88plug AI Lab
emoji: π
colorFrom: indigo
colorTo: purple
sdk: static
pinned: false
---

# 88plug AI Lab

Production-grade compressed-tensors quantizations of frontier LLMs, VLMs, and omni models, engineered for native vLLM v0.9.0+ deployment. Every model is validated against the baseline on MMLU and ships with a complete vLLM-ready configuration.

---

## Why compressed-tensors

Most quantization formats (AWQ, GPTQ, GGUF) were designed around a specific kernel or runtime and ship a frozen weight layout that cannot be recomposed or modified at load time. `compressed-tensors` is the format originally developed by Neural Magic and supported natively by vLLM. Key differences:

- **Native vLLM integration.** No format conversion, no plugin shims. vLLM reads compressed-tensors models directly through its built-in compressed-tensors quantization support, selected from the checkpoint's `quantization_config`. Full PagedAttention, continuous batching, and tensor parallelism work without modification.
- **Composable precision.** A single checkpoint can carry per-layer or per-group precision assignments. Mixed-precision MoE configurations (e.g., FP8 attention + INT4 experts) are expressed in the same file, not hacked around.
- **Reproducible calibration metadata.** The quantization config, calibration scheme, and per-channel scales are stored inside the checkpoint. What you see in the config is exactly what ran.
- **Forward compatibility.** As vLLM adds new kernel support (FP8, INT8, sparse), compressed-tensors models can pick up that support without re-quantizing.

GGUF remains the right fit for llama.cpp, and AWQ and GPTQ remain fine for older toolchains. If you are deploying on vLLM in production, compressed-tensors is the correct choice.
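
The per-group precision assignments described above live in the `quantization_config` block of a checkpoint's `config.json`. The sketch below summarizes such a block using only the standard library. The sample config is hand-written to follow the general compressed-tensors layout (`config_groups`, `targets`, `weights`, `ignore`) and is not copied from any 88plug repo; the `re:` target patterns and the helper name `summarize_quant_config` are assumptions made for illustration.

```python
import json

# Illustrative quantization_config, hand-written for this example.
# It mirrors the mixed-precision MoE case: FP8 attention + INT4 experts.
SAMPLE_CONFIG_JSON = """
{
  "quantization_config": {
    "quant_method": "compressed-tensors",
    "config_groups": {
      "group_0": {
        "targets": ["re:.*self_attn.*"],
        "weights": {"num_bits": 8, "type": "float", "strategy": "channel"}
      },
      "group_1": {
        "targets": ["re:.*experts.*"],
        "weights": {"num_bits": 4, "type": "int", "strategy": "group", "group_size": 128}
      }
    },
    "ignore": ["lm_head"]
  }
}
"""


def summarize_quant_config(config):
    """Return one (group name, targets, weight format) tuple per config group."""
    quant = config.get("quantization_config", {})
    summary = []
    for name, group in sorted(quant.get("config_groups", {}).items()):
        weights = group.get("weights") or {}
        fmt = f"{weights.get('type', '?')}{weights.get('num_bits', '?')}"
        summary.append((name, group.get("targets", []), fmt))
    return summary


config = json.loads(SAMPLE_CONFIG_JSON)
for name, targets, fmt in summarize_quant_config(config):
    print(f"{name}: {fmt} weights on {targets}")
```

To inspect a real checkpoint, load its `config.json` in place of the sample string.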

---

## Quality Standard

All models are quantized with AutoRound (iters=200) or RTN where noted.

| Tier | Method | Target Recovery | Hardware Floor |
|------|--------|----------------|----------------|
| W8A16 | RTN / AutoRound iters=200 | Near-lossless (>99.5% MMLU) | Ampere (A100, A6000, RTX 30xx+) |
| W4A16 | AutoRound iters=200 | ≥99% MMLU vs FP16 baseline | Ampere (A100, A6000, RTX 30xx+) |

AutoRound at iters=200 runs 200 steps of signed gradient descent over a calibration set to minimize weight rounding error. At W4A16, this closes most of the gap between naive round-to-nearest and GPTQ/AWQ, while producing a checkpoint that vLLM can load natively.
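
For reference, the round-to-nearest (RTN) baseline that AutoRound improves on can be sketched in a few lines. This is a minimal pure-Python illustration of symmetric, group-wise RTN; it is not AutoRound and not the code used to produce these checkpoints. The group size of 128 and the random Gaussian weights are assumptions chosen for the example.

```python
import random


def rtn_quantize(weights, num_bits, group_size):
    """Symmetric round-to-nearest quantization with one scale per group.

    Returns the dequantized weights so the rounding error can be measured.
    """
    qmax = 2 ** (num_bits - 1) - 1  # 7 for INT4, 127 for INT8
    dequantized = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        scale = max_abs / qmax if max_abs > 0 else 1.0
        for w in group:
            # Round to the nearest integer level, then clamp to the signed range.
            q = max(-qmax - 1, min(qmax, round(w / scale)))
            dequantized.append(q * scale)
    return dequantized


def mean_abs_error(a, b):
    """Average absolute difference between two equal-length sequences."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(1024)]  # stand-in weight row

err_w4 = mean_abs_error(weights, rtn_quantize(weights, num_bits=4, group_size=128))
err_w8 = mean_abs_error(weights, rtn_quantize(weights, num_bits=8, group_size=128))
print(f"W4 RTN mean abs error: {err_w4:.6f}")
print(f"W8 RTN mean abs error: {err_w8:.6f}")
```

The 4-bit error is much larger than the 8-bit error because each group has only 16 levels to share, which is why the W4A16 tier relies on AutoRound rather than plain RTN.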

---

## Model Catalog

All 16 models are in compressed-tensors format, validated for vLLM v0.9.0+.

### Qwen3.6-35B-A3B: Mixed-Precision MoE, 1M context

| Precision | Repo | Architecture |
|-----------|------|-------------|
| W8A16 | [88plug/Qwen3.6-35B-A3B-W8A16](https://huggingface.co/88plug/Qwen3.6-35B-A3B-W8A16) | MoE, 35B total / 3.6B active |
| W4A16 | [88plug/Qwen3.6-35B-A3B-W4A16](https://huggingface.co/88plug/Qwen3.6-35B-A3B-W4A16) | MoE, 35B total / 3.6B active |

### Qwen3.6-27B: Dense Hybrid, 262k context

| Precision | Repo | Architecture |
|-----------|------|-------------|
| W8A16 | [88plug/Qwen3.6-27B-W8A16](https://huggingface.co/88plug/Qwen3.6-27B-W8A16) | Dense, 27B |
| W4A16 | [88plug/Qwen3.6-27B-W4A16](https://huggingface.co/88plug/Qwen3.6-27B-W4A16) | Dense, 27B |

### Qwen3-Omni-30B-A3B: Audio + Vision + Speech

| Precision | Repo | Architecture |
|-----------|------|-------------|
| W8A16 | [88plug/Qwen3-Omni-30B-A3B-W8A16](https://huggingface.co/88plug/Qwen3-Omni-30B-A3B-W8A16) | Omni MoE, 30B / 3B active |
| W4A16 | [88plug/Qwen3-Omni-30B-A3B-W4A16](https://huggingface.co/88plug/Qwen3-Omni-30B-A3B-W4A16) | Omni MoE, 30B / 3B active |

### Qwen2.5-Omni-7B: Efficient Omni

| Precision | Repo | Architecture |
|-----------|------|-------------|
| W8A16 | [88plug/Qwen2.5-Omni-7B-W8A16](https://huggingface.co/88plug/Qwen2.5-Omni-7B-W8A16) | Omni dense, 7B |
| W4A16 | [88plug/Qwen2.5-Omni-7B-W4A16](https://huggingface.co/88plug/Qwen2.5-Omni-7B-W4A16) | Omni dense, 7B |

### Gemma4-E4B-it: Vision-Language Model

| Precision | Repo | Architecture |
|-----------|------|-------------|
| W8A16 | [88plug/Gemma4-E4B-it-W8A16](https://huggingface.co/88plug/Gemma4-E4B-it-W8A16) | VLM, 4B |
| W4A16 | [88plug/Gemma4-E4B-it-W4A16](https://huggingface.co/88plug/Gemma4-E4B-it-W4A16) | VLM, 4B |

### Gemma4-E2B-it: Ultra-Efficient VLM

| Precision | Repo | Architecture |
|-----------|------|-------------|
| W8A16 | [88plug/Gemma4-E2B-it-W8A16](https://huggingface.co/88plug/Gemma4-E2B-it-W8A16) | VLM, 2B |
| W4A16 | [88plug/Gemma4-E2B-it-W4A16](https://huggingface.co/88plug/Gemma4-E2B-it-W4A16) | VLM, 2B |

### MiniCPM-o-4.5: Omni Model

| Precision | Repo | Architecture |
|-----------|------|-------------|
| W8A16 | [88plug/MiniCPM-o-4.5-W8A16](https://huggingface.co/88plug/MiniCPM-o-4.5-W8A16) | Omni dense |
| W4A16 | [88plug/MiniCPM-o-4.5-W4A16](https://huggingface.co/88plug/MiniCPM-o-4.5-W4A16) | Omni dense |

### Nemotron-3-Nano-30B-A3B: Hybrid SSM/Attention

| Precision | Repo | Architecture |
|-----------|------|-------------|
| W8A16 | [88plug/Nemotron-3-Nano-30B-A3B-W8A16](https://huggingface.co/88plug/Nemotron-3-Nano-30B-A3B-W8A16) | Hybrid SSM/Attention MoE |
| W4A16 | [88plug/Nemotron-3-Nano-30B-A3B-W4A16](https://huggingface.co/88plug/Nemotron-3-Nano-30B-A3B-W4A16) | Hybrid SSM/Attention MoE |

---

## Quickstart

Requires vLLM v0.9.0+ and an Ampere-or-newer GPU (A100, A6000, RTX 3090/4090, or equivalent).

### Install

```bash
# Quote the requirement so the shell does not treat ">" as a redirect.
pip install "vllm>=0.9.0"
```

### Launch (offline inference)

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="88plug/Qwen3.6-35B-A3B-W4A16",
    max_model_len=131072,    # adjust to available VRAM
    tensor_parallel_size=1,  # increase for multi-GPU
)

sampling_params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=512)

outputs = llm.generate(
    ["Explain the tradeoffs between W4A16 and W8A16 quantization for production inference."],
    sampling_params,
)

print(outputs[0].outputs[0].text)
```

### Launch (OpenAI-compatible server)

```bash
vllm serve 88plug/Qwen3.6-35B-A3B-W4A16 \
  --max-model-len 131072 \
  --tensor-parallel-size 1 \
  --port 8000
```

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "88plug/Qwen3.6-35B-A3B-W4A16",
    "messages": [{"role": "user", "content": "What is compressed-tensors?"}],
    "max_tokens": 256
  }'
```
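
The same chat request can be issued from Python without extra dependencies. The sketch below mirrors the `curl` call using only the standard library; the helper names `build_chat_request` and `send_chat_request` are hypothetical, and the response parsing assumes the usual OpenAI-style `choices[0].message.content` shape.

```python
import json
import urllib.request


def build_chat_request(prompt, model, base_url="http://localhost:8000", max_tokens=256):
    """Build (but do not send) a POST request for /v1/chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(request, timeout=60):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


request = build_chat_request("What is compressed-tensors?", "88plug/Qwen3.6-35B-A3B-W4A16")
print(request.full_url)
```

With the server running, `print(send_chat_request(request))` prints the model's reply.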

---

## Hardware Requirements

| Model Size | W8A16 VRAM | W4A16 VRAM | Recommended |
|-----------|-----------|-----------|-------------|
| 2B–7B | 8–16 GB | 6–10 GB | Single A6000 / RTX 4090 |
| 27B–35B (dense) | 32–40 GB | 20–28 GB | Single A100 80G or 2x A6000 |
| 30B–35B (MoE, 3B active) | 28–36 GB | 18–24 GB | Single A100 80G or 2x A6000 |

Sparse MoE models load every expert's weights into VRAM but route each token through only a subset of them. The VRAM requirement is therefore determined by total parameters, not active parameters.
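
A rough weights-only estimate shows why total parameters drive the requirement. The sketch below is a back-of-the-envelope calculation, not a measurement: it assumes every parameter is quantized, one 16-bit scale per group of 128 weights, and it ignores KV cache, activations, and any layers kept in higher precision, so real usage will be higher. The function name `estimate_weight_gib` is hypothetical.

```python
def estimate_weight_gib(total_params_billion, weight_bits, group_size=128, scale_bits=16):
    """Estimate quantized weight storage in GiB (weights and scales only).

    Assumes all parameters are quantized and each group of `group_size`
    weights stores one `scale_bits`-wide scale.
    """
    bits_per_weight = weight_bits + scale_bits / group_size
    total_bytes = total_params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


# Use total parameters for MoE models: all experts are resident in VRAM.
for label, params in [("27B dense", 27), ("35B MoE (all experts resident)", 35)]:
    w8 = estimate_weight_gib(params, weight_bits=8)
    w4 = estimate_weight_gib(params, weight_bits=4)
    print(f"{label}: W8A16 ~{w8:.1f} GiB, W4A16 ~{w4:.1f} GiB")
```

Plugging in the 3B active-parameter count instead would understate the footprint by roughly an order of magnitude.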

---

## Contact

- Developer: Andrew Mello
- Organization: [huggingface.co/88plug](https://huggingface.co/88plug)
- Issues and model requests: open a discussion on the relevant model repo.

Model uploads are automated via the [88plug-bot](https://huggingface.co/88plug-bot) account.