---
title: 88plug AI Lab
emoji: 🔌
colorFrom: indigo
colorTo: purple
sdk: static
pinned: false
---
# 88plug AI Lab
Production-grade compressed-tensors quantizations of frontier LLMs, VLMs, and omni models, engineered for native vLLM v0.9.0+ deployment. Every model is validated against the baseline on MMLU and ships with a complete vLLM-ready configuration.
---
## Why compressed-tensors
Most quantization formats (AWQ, GPTQ, GGUF) target a single inference backend and ship a frozen weight layout that cannot be further composed or modified at load time. `compressed-tensors` is the format developed by Neural Magic and maintained as a first-class vLLM citizen. Key differences:
- **Native vLLM integration.** No format conversion, no plugin shims. vLLM reads compressed-tensors models directly through its built-in compressed-tensors quantization support, which it selects from the checkpoint's quantization config. PagedAttention, continuous batching, and tensor parallelism therefore work without modification.
- **Composable precision.** A single checkpoint can carry per-layer or per-group precision assignments. Mixed-precision MoE configurations (e.g., FP8 attention + INT4 experts) are expressed in the same file, not hacked around.
- **Reproducible calibration metadata.** The quantization config, calibration scheme, and per-channel scales are stored inside the checkpoint. What you see in the config is exactly what ran.
- **Forward compatibility.** As vLLM adds new kernel support (FP8, INT8, sparse), compressed-tensors models gain that support without re-quantizing.
GGUF remains the right format for llama.cpp, and AWQ and GPTQ remain fine for older toolchains. If you are deploying on vLLM in production, compressed-tensors is the correct choice.
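To make the composable-precision and stored-metadata points above concrete, here is a minimal sketch that summarizes a quantization config of the kind compressed-tensors writes under `quantization_config` in `config.json`. The sample dictionary is hand-written for illustration and is not copied from any 88plug repo. Its field names (`config_groups`, `targets`, `weights`, `num_bits`, `group_size`, `ignore`) are assumed to match the compressed-tensors schema and should be checked against a real checkpoint. `summarize_quant_config` is a hypothetical helper.

```python
# Illustrative only: a hand-written config in the shape compressed-tensors is
# assumed to store under "quantization_config". All values are assumptions.
SAMPLE_CONFIG = {
    "quant_method": "compressed-tensors",
    "format": "pack-quantized",
    "ignore": ["lm_head"],
    "config_groups": {
        "group_0": {
            "targets": ["Linear"],
            "weights": {"num_bits": 4, "type": "int", "strategy": "group", "group_size": 128},
            "input_activations": None,
        },
        "group_1": {
            "targets": ["re:.*self_attn.*"],
            "weights": {"num_bits": 8, "type": "int", "strategy": "channel", "group_size": None},
            "input_activations": None,
        },
    },
}


def summarize_quant_config(quant_config: dict) -> list[str]:
    """Return one readable line per precision group (hypothetical helper)."""
    lines = []
    for name, group in sorted(quant_config.get("config_groups", {}).items()):
        weights = group.get("weights") or {}
        acts = group.get("input_activations")
        # Activations left unquantized stay in 16-bit, hence the "A16" suffix.
        act_bits = acts["num_bits"] if acts else 16
        lines.append(
            f"{name}: targets={group.get('targets', [])} "
            f"W{weights.get('num_bits')}A{act_bits} "
            f"group_size={weights.get('group_size')}"
        )
    return lines


for line in summarize_quant_config(SAMPLE_CONFIG):
    print(line)
```

Running the same function on the `quantization_config` of a downloaded checkpoint is a quick way to confirm which layers carry which precision before deploying.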
---
## Quality Standard
All models are quantized with AutoRound (iters=200) or RTN where noted.
| Tier | Method | Target Recovery | Hardware Floor |
|------|--------|----------------|----------------|
| W8A16 | RTN / AutoRound iters=200 | Near-lossless (>99.5% MMLU) | Ampere (A100, A6000, RTX 30xx+) |
| W4A16 | AutoRound iters=200 | ≥99% MMLU vs FP16 baseline | Ampere (A100, A6000, RTX 30xx+) |
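The recovery targets in the table can be checked with a few lines of arithmetic. This sketch assumes "recovery" means the quantized MMLU score as a percentage of the baseline score, which is one reading of the table. The function names are hypothetical and the scores are made up for illustration, not measured results.

```python
def mmlu_recovery(quantized_score: float, baseline_score: float) -> float:
    """Quantized accuracy expressed as a percentage of baseline accuracy."""
    if baseline_score <= 0:
        raise ValueError("baseline_score must be positive")
    return 100.0 * quantized_score / baseline_score


def meets_target(quantized_score: float, baseline_score: float, target_pct: float) -> bool:
    """True if recovery reaches the tier's target (e.g. 99.0 for W4A16)."""
    return mmlu_recovery(quantized_score, baseline_score) >= target_pct


# Hypothetical scores, for illustration only.
baseline, quantized = 80.0, 79.4
print(f"recovery: {mmlu_recovery(quantized, baseline):.2f}%")
print("meets W4A16 target (99%):", meets_target(quantized, baseline, 99.0))
print("meets W8A16 target (99.5%):", meets_target(quantized, baseline, 99.5))
```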
AutoRound at iters=200 runs 200 steps of signed gradient descent over a calibration set to tune how each weight is rounded, minimizing the error that rounding introduces. At W4A16, this closes most of the gap between naive round-to-nearest and GPTQ/AWQ, while producing a checkpoint that vLLM can load natively.
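For intuition about the baseline that AutoRound improves on, the following is a minimal sketch of symmetric per-group round-to-nearest (RTN) in plain Python. It is not AutoRound and not the pipeline used for these checkpoints. The function names, the group size of 128, and the synthetic weights are all assumptions chosen for illustration.

```python
import math


def rtn_quantize(weights: list[float], num_bits: int, group_size: int) -> list[float]:
    """Symmetric per-group round-to-nearest; returns the dequantized weights."""
    qmax = 2 ** (num_bits - 1) - 1  # 7 for 4-bit, 127 for 8-bit
    restored = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        # One scale per group maps the largest magnitude onto qmax.
        scale = max_abs / qmax if max_abs > 0 else 1.0
        for w in group:
            q = max(-qmax, min(qmax, round(w / scale)))
            restored.append(q * scale)
    return restored


def max_abs_error(original: list[float], restored: list[float]) -> float:
    """Largest absolute difference between original and dequantized weights."""
    return max(abs(a - b) for a, b in zip(original, restored))


# Synthetic weights in [-0.1, 0.1], purely for demonstration.
weights = [0.1 * math.sin(i) for i in range(256)]
err4 = max_abs_error(weights, rtn_quantize(weights, num_bits=4, group_size=128))
err8 = max_abs_error(weights, rtn_quantize(weights, num_bits=8, group_size=128))
print(f"4-bit max error: {err4:.5f}")
print(f"8-bit max error: {err8:.5f}")
```

RTN always rounds each weight to its nearest grid point. AutoRound instead learns, per weight, whether rounding up or down gives lower error on calibration data, which is why it matters most at 4 bits where the grid is coarse.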
---
## Model Catalog
All 16 models are in compressed-tensors format, validated for vLLM v0.9.0+.
### Qwen3.6-35B-A3B: Mixed-Precision MoE, 1M context
| Precision | Repo | Architecture |
|-----------|------|-------------|
| W8A16 | [88plug/Qwen3.6-35B-A3B-W8A16](https://huggingface.co/88plug/Qwen3.6-35B-A3B-W8A16) | MoE, 35B total / 3B active |
| W4A16 | [88plug/Qwen3.6-35B-A3B-W4A16](https://huggingface.co/88plug/Qwen3.6-35B-A3B-W4A16) | MoE, 35B total / 3B active |
### Qwen3.6-27B: Dense Hybrid, 262k context
| Precision | Repo | Architecture |
|-----------|------|-------------|
| W8A16 | [88plug/Qwen3.6-27B-W8A16](https://huggingface.co/88plug/Qwen3.6-27B-W8A16) | Dense, 27B |
| W4A16 | [88plug/Qwen3.6-27B-W4A16](https://huggingface.co/88plug/Qwen3.6-27B-W4A16) | Dense, 27B |
### Qwen3-Omni-30B-A3B: Audio + Vision + Speech
| Precision | Repo | Architecture |
|-----------|------|-------------|
| W8A16 | [88plug/Qwen3-Omni-30B-A3B-W8A16](https://huggingface.co/88plug/Qwen3-Omni-30B-A3B-W8A16) | Omni MoE, 30B / 3B active |
| W4A16 | [88plug/Qwen3-Omni-30B-A3B-W4A16](https://huggingface.co/88plug/Qwen3-Omni-30B-A3B-W4A16) | Omni MoE, 30B / 3B active |
### Qwen2.5-Omni-7B: Efficient Omni
| Precision | Repo | Architecture |
|-----------|------|-------------|
| W8A16 | [88plug/Qwen2.5-Omni-7B-W8A16](https://huggingface.co/88plug/Qwen2.5-Omni-7B-W8A16) | Omni dense, 7B |
| W4A16 | [88plug/Qwen2.5-Omni-7B-W4A16](https://huggingface.co/88plug/Qwen2.5-Omni-7B-W4A16) | Omni dense, 7B |
### Gemma4-E4B-it: Vision-Language Model
| Precision | Repo | Architecture |
|-----------|------|-------------|
| W8A16 | [88plug/Gemma4-E4B-it-W8A16](https://huggingface.co/88plug/Gemma4-E4B-it-W8A16) | VLM, 4B |
| W4A16 | [88plug/Gemma4-E4B-it-W4A16](https://huggingface.co/88plug/Gemma4-E4B-it-W4A16) | VLM, 4B |
### Gemma4-E2B-it: Ultra-Efficient VLM
| Precision | Repo | Architecture |
|-----------|------|-------------|
| W8A16 | [88plug/Gemma4-E2B-it-W8A16](https://huggingface.co/88plug/Gemma4-E2B-it-W8A16) | VLM, 2B |
| W4A16 | [88plug/Gemma4-E2B-it-W4A16](https://huggingface.co/88plug/Gemma4-E2B-it-W4A16) | VLM, 2B |
### MiniCPM-o-4.5: Omni Model
| Precision | Repo | Architecture |
|-----------|------|-------------|
| W8A16 | [88plug/MiniCPM-o-4.5-W8A16](https://huggingface.co/88plug/MiniCPM-o-4.5-W8A16) | Omni dense |
| W4A16 | [88plug/MiniCPM-o-4.5-W4A16](https://huggingface.co/88plug/MiniCPM-o-4.5-W4A16) | Omni dense |
### Nemotron-3-Nano-30B-A3B: Hybrid SSM/Attention
| Precision | Repo | Architecture |
|-----------|------|-------------|
| W8A16 | [88plug/Nemotron-3-Nano-30B-A3B-W8A16](https://huggingface.co/88plug/Nemotron-3-Nano-30B-A3B-W8A16) | Hybrid SSM/Attention MoE |
| W4A16 | [88plug/Nemotron-3-Nano-30B-A3B-W4A16](https://huggingface.co/88plug/Nemotron-3-Nano-30B-A3B-W4A16) | Hybrid SSM/Attention MoE |
---
## Quickstart
Requires vLLM v0.9.0+ and an Ampere-class or newer GPU (A100, A6000, RTX 3090, RTX 4090, or equivalent).
### Install
```bash
pip install "vllm>=0.9.0"
```
### Launch (offline inference)
```python
from vllm import LLM, SamplingParams

# Load the quantized checkpoint; vLLM reads the quantization config from the repo.
llm = LLM(
    model="88plug/Qwen3.6-35B-A3B-W4A16",
    max_model_len=131072,    # adjust to available VRAM
    tensor_parallel_size=1,  # increase for multi-GPU
)

sampling_params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=512)

outputs = llm.generate(
    ["Explain the tradeoffs between W4A16 and W8A16 quantization for production inference."],
    sampling_params,
)
print(outputs[0].outputs[0].text)
```
### Launch (OpenAI-compatible server)
```bash
vllm serve 88plug/Qwen3.6-35B-A3B-W4A16 \
  --max-model-len 131072 \
  --tensor-parallel-size 1 \
  --port 8000
```
```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "88plug/Qwen3.6-35B-A3B-W4A16",
    "messages": [{"role": "user", "content": "What is compressed-tensors?"}],
    "max_tokens": 256
  }'
```
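The same request can be sent from Python using only the standard library. This is a sketch that mirrors the curl call above. `build_chat_request` and `ask` are hypothetical helper names, and the response parsing assumes the usual OpenAI-style shape (`choices[0].message.content`).

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str,
                       max_tokens: int = 256) -> urllib.request.Request:
    """Build a POST request for the /v1/chat/completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(base_url: str, model: str, prompt: str) -> str:
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(base_url, model, prompt)
    with urllib.request.urlopen(request, timeout=60) as response:
        body = json.load(response)
    # Assumes an OpenAI-style response body.
    return body["choices"][0]["message"]["content"]


# Usage, with the server from the previous step running:
# print(ask("http://localhost:8000", "88plug/Qwen3.6-35B-A3B-W4A16",
#           "What is compressed-tensors?"))
```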
---
## Hardware Requirements
| Model Size | W8A16 VRAM | W4A16 VRAM | Recommended |
|-----------|-----------|-----------|-------------|
| 2B–7B | 8–16 GB | 6–10 GB | Single A6000 / RTX 4090 |
| 27B–35B (dense) | 32–40 GB | 20–28 GB | Single A100 80G or 2x A6000 |
| 30B–35B (MoE, 3B active) | 28–36 GB | 18–24 GB | Single A100 80G or 2x A6000 |
Sparse MoE models load all expert weights into VRAM but route each token through only a subset of experts, so the VRAM requirement is determined by total parameters, not active parameters.
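As a rough back-of-the-envelope check, weight storage scales with total parameters times bits per weight. The sketch below is a lower bound only: it ignores quantization scales, layers kept at higher precision, KV cache, activations, and runtime overhead, so real usage is higher. `weight_memory_gb` is a hypothetical helper, and 1 GB is taken as 10^9 bytes.

```python
def weight_memory_gb(total_params_billion: float, weight_bits: int) -> float:
    """Approximate weight storage in GB: parameters x bits / 8 bits per byte."""
    return total_params_billion * weight_bits / 8


# A 35B-total MoE needs all experts resident, regardless of active parameters.
for bits in (16, 8, 4):
    print(f"35B total params at {bits}-bit weights: ~{weight_memory_gb(35, bits):.1f} GB")
```

Note that the estimate uses the 35B total, not the roughly 3B active parameters, which is why MoE and dense models of similar total size land in similar VRAM ranges.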
---
## Contact
Developer: Andrew Mello
Organization: [huggingface.co/88plug](https://huggingface.co/88plug)
Issues and model requests: open a discussion on the relevant model repo.
Model uploads are automated via the [88plug-bot](https://huggingface.co/88plug-bot) account.