	---
	title: 88plug AI Lab
	emoji: 🔌
	colorFrom: indigo
	colorTo: purple
	sdk: static
	pinned: false
	---

	# 88plug AI Lab

	Production-grade compressed-tensors quantizations of frontier LLMs, VLMs, and omni models, engineered for native vLLM v0.9.0+ deployment. Every model is validated against the baseline on MMLU and ships with a complete vLLM-ready configuration.

	---

	## Why compressed-tensors

	Most quantization formats (AWQ, GPTQ, GGUF) target a single inference backend and ship a frozen weight layout that cannot be further composed or modified at load time. `compressed-tensors` is the format developed by Neural Magic and maintained as a first-class vLLM citizen. Key differences:

	- Native vLLM integration. No format conversion, no plugin shims. vLLM reads compressed-tensors models directly through its built-in `compressed-tensors` quantization method, so PagedAttention, continuous batching, and tensor parallelism work without modification.
	- Composable precision. A single checkpoint can carry per-layer or per-group precision assignments. Mixed-precision MoE configurations (e.g., FP8 attention + INT4 experts) are expressed in the same file, not hacked around.
	- Reproducible calibration metadata. The quantization config, calibration scheme, and per-channel scales are stored inside the checkpoint. What you see in the config is exactly what ran.
	- Forward compatibility. As vLLM adds new kernel support (FP8, INT8, sparse), compressed-tensors models gain that support without re-quantizing.
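The calibration metadata mentioned in the list above is stored in the checkpoint's `config.json`. As a minimal sketch, assuming the `quantization_config` block follows the usual compressed-tensors layout (a `config_groups` mapping whose entries carry `targets` and a `weights` description), the per-group precision can be summarized like this. The sample dictionary and the helper name `summarize_quant_config` are illustrative and are not copied from any 88plug checkpoint.

```python
import json

# Illustrative quantization_config; the values are assumptions for this example,
# not read from a real 88plug checkpoint.
SAMPLE_CONFIG = json.loads("""
{
  "quantization_config": {
    "quant_method": "compressed-tensors",
    "config_groups": {
      "group_0": {
        "targets": ["Linear"],
        "weights": {"num_bits": 4, "type": "int", "strategy": "group", "group_size": 128}
      }
    },
    "ignore": ["lm_head"]
  }
}
""")


def summarize_quant_config(config):
    """Return one (group, targets, num_bits, strategy) tuple per config group."""
    quant = config.get("quantization_config", {})
    summary = []
    for name, group in sorted(quant.get("config_groups", {}).items()):
        weights = group.get("weights") or {}
        summary.append(
            (name, tuple(group.get("targets", [])), weights.get("num_bits"), weights.get("strategy"))
        )
    return summary


for row in summarize_quant_config(SAMPLE_CONFIG):
    print(row)
```

For a real checkpoint, load the downloaded repo's `config.json` with `json.load` and pass the resulting dictionary to the helper.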

	GGUF remains the right fit for llama.cpp, and AWQ and GPTQ remain fine for older toolchains. If you are deploying on vLLM in production, compressed-tensors is the correct choice.

	---

	## Quality Standard

	All models are quantized with AutoRound (iters=200) or RTN where noted.

	| Tier | Method | Target Recovery | Hardware Floor |
	|------|--------|-----------------|----------------|
	| W8A16 | RTN / AutoRound iters=200 | Near-lossless (>99.5% MMLU) | Ampere (A100, A6000, RTX 30xx+) |
	| W4A16 | AutoRound iters=200 | ≥99% MMLU vs FP16 baseline | Ampere (A100, A6000, RTX 30xx+) |

	AutoRound at iters=200 runs 200 iterations of signed-gradient optimization over a calibration set to minimize the error introduced by weight rounding. At W4A16, this closes most of the gap between naive round-to-nearest (RTN) and GPTQ/AWQ, while producing a checkpoint that vLLM can load natively.
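To make the round-to-nearest baseline concrete, here is a minimal pure-Python sketch of symmetric RTN quantization for a single weight group. It is not AutoRound and not the pipeline used for these checkpoints: the function names and the sample weights are hypothetical, and AutoRound additionally tunes the rounding decisions with gradient signals instead of always picking the nearest level.

```python
import math


def rtn_quantize(weights, num_bits=4):
    """Symmetric round-to-nearest quantization of one weight group."""
    qmax = 2 ** (num_bits - 1) - 1   # 7 for INT4
    qmin = -(2 ** (num_bits - 1))    # -8 for INT4
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # floor(x + 0.5) rounds half up; clamp to the representable integer range.
    q = [min(qmax, max(qmin, math.floor(w / scale + 0.5))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Map integer levels back to floating-point weights."""
    return [v * scale for v in q]


# Hypothetical weight group, chosen only for illustration.
group = [0.12, -0.53, 0.91, -0.07, 0.33, -0.88, 0.45, 0.02]
q, scale = rtn_quantize(group)
restored = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(group, restored))
print("levels:", q)
print("scale:", scale, "max abs error:", max_err)
```

With this scheme the per-weight error is bounded by half the scale, which is the rounding error that calibration-based methods try to reduce where it matters most for model outputs.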

	---

	## Model Catalog

	All 16 models are in compressed-tensors format, validated for vLLM v0.9.0+.

	### Qwen3.6-35B-A3B — Mixed-Precision MoE, 1M context

	| Precision | Repo | Architecture |
	|-----------|------|--------------|
	| W8A16 | [88plug/Qwen3.6-35B-A3B-W8A16](https://huggingface.co/88plug/Qwen3.6-35B-A3B-W8A16) | MoE, 35B total / 3B active |
	| W4A16 | [88plug/Qwen3.6-35B-A3B-W4A16](https://huggingface.co/88plug/Qwen3.6-35B-A3B-W4A16) | MoE, 35B total / 3B active |

	### Qwen3.6-27B — Dense Hybrid, 262k context

	| Precision | Repo | Architecture |
	|-----------|------|--------------|
	| W8A16 | [88plug/Qwen3.6-27B-W8A16](https://huggingface.co/88plug/Qwen3.6-27B-W8A16) | Dense, 27B |
	| W4A16 | [88plug/Qwen3.6-27B-W4A16](https://huggingface.co/88plug/Qwen3.6-27B-W4A16) | Dense, 27B |

	### Qwen3-Omni-30B-A3B — Audio + Vision + Speech

	| Precision | Repo | Architecture |
	|-----------|------|--------------|
	| W8A16 | [88plug/Qwen3-Omni-30B-A3B-W8A16](https://huggingface.co/88plug/Qwen3-Omni-30B-A3B-W8A16) | Omni MoE, 30B / 3B active |
	| W4A16 | [88plug/Qwen3-Omni-30B-A3B-W4A16](https://huggingface.co/88plug/Qwen3-Omni-30B-A3B-W4A16) | Omni MoE, 30B / 3B active |

	### Qwen2.5-Omni-7B — Efficient Omni

	| Precision | Repo | Architecture |
	|-----------|------|--------------|
	| W8A16 | [88plug/Qwen2.5-Omni-7B-W8A16](https://huggingface.co/88plug/Qwen2.5-Omni-7B-W8A16) | Omni dense, 7B |
	| W4A16 | [88plug/Qwen2.5-Omni-7B-W4A16](https://huggingface.co/88plug/Qwen2.5-Omni-7B-W4A16) | Omni dense, 7B |

	### Gemma4-E4B-it — Vision-Language Model

	| Precision | Repo | Architecture |
	|-----------|------|--------------|
	| W8A16 | [88plug/Gemma4-E4B-it-W8A16](https://huggingface.co/88plug/Gemma4-E4B-it-W8A16) | VLM, 4B |
	| W4A16 | [88plug/Gemma4-E4B-it-W4A16](https://huggingface.co/88plug/Gemma4-E4B-it-W4A16) | VLM, 4B |

	### Gemma4-E2B-it — Ultra-Efficient VLM

	| Precision | Repo | Architecture |
	|-----------|------|--------------|
	| W8A16 | [88plug/Gemma4-E2B-it-W8A16](https://huggingface.co/88plug/Gemma4-E2B-it-W8A16) | VLM, 2B |
	| W4A16 | [88plug/Gemma4-E2B-it-W4A16](https://huggingface.co/88plug/Gemma4-E2B-it-W4A16) | VLM, 2B |

	### MiniCPM-o-4.5 — Omni Model

	| Precision | Repo | Architecture |
	|-----------|------|--------------|
	| W8A16 | [88plug/MiniCPM-o-4.5-W8A16](https://huggingface.co/88plug/MiniCPM-o-4.5-W8A16) | Omni dense |
	| W4A16 | [88plug/MiniCPM-o-4.5-W4A16](https://huggingface.co/88plug/MiniCPM-o-4.5-W4A16) | Omni dense |

	### Nemotron-3-Nano-30B-A3B — Hybrid SSM/Attention

	| Precision | Repo | Architecture |
	|-----------|------|--------------|
	| W8A16 | [88plug/Nemotron-3-Nano-30B-A3B-W8A16](https://huggingface.co/88plug/Nemotron-3-Nano-30B-A3B-W8A16) | Hybrid SSM/Attention MoE |
	| W4A16 | [88plug/Nemotron-3-Nano-30B-A3B-W4A16](https://huggingface.co/88plug/Nemotron-3-Nano-30B-A3B-W4A16) | Hybrid SSM/Attention MoE |

	---

	## Quickstart

	Requires vLLM v0.9.0+ and an Ampere-class or newer GPU (A100, A6000, RTX 3090/4090, or equivalent).

	### Install

	```bash
	pip install "vllm>=0.9.0"
	```

	### Launch (offline inference)

	```python
	from vllm import LLM, SamplingParams

	llm = LLM(
	    model="88plug/Qwen3.6-35B-A3B-W4A16",
	    max_model_len=131072,  # adjust to available VRAM
	    tensor_parallel_size=1,  # increase for multi-GPU
	)

	sampling_params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=512)

	outputs = llm.generate(
	    ["Explain the tradeoffs between W4A16 and W8A16 quantization for production inference."],
	    sampling_params,
	)

	print(outputs[0].outputs[0].text)
	```

	### Launch (OpenAI-compatible server)

	```bash
	vllm serve 88plug/Qwen3.6-35B-A3B-W4A16 \
	  --max-model-len 131072 \
	  --tensor-parallel-size 1 \
	  --port 8000
	```

	```bash
	curl http://localhost:8000/v1/chat/completions \
	  -H "Content-Type: application/json" \
	  -d '{
	    "model": "88plug/Qwen3.6-35B-A3B-W4A16",
	    "messages": [{"role": "user", "content": "What is compressed-tensors?"}],
	    "max_tokens": 256
	  }'
	```
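The same request can be issued from Python. The sketch below uses only the standard library and assumes the server from the previous step is listening on `localhost:8000`; the helper names `build_chat_request` and `chat` are hypothetical, and the response parsing assumes the standard OpenAI chat-completions shape.

```python
import json
import urllib.request


def build_chat_request(prompt, model="88plug/Qwen3.6-35B-A3B-W4A16",
                       base_url="http://localhost:8000", max_tokens=256):
    """Build a POST request for the OpenAI-compatible chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, **kwargs):
    """Send the request and return the first reply; needs a running server."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Calling `chat("What is compressed-tensors?")` while the server is up should return the generated text as a string.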

	---

	## Hardware Requirements

	| Model Size | W8A16 VRAM | W4A16 VRAM | Recommended |
	|------------|------------|------------|-------------|
	| 2B–7B | 8–16 GB | 6–10 GB | Single A6000 / RTX 4090 |
	| 27B–35B (dense) | 32–40 GB | 20–28 GB | Single A100 80G or 2x A6000 |
	| 30B–35B (MoE, 3B active) | 28–36 GB | 18–24 GB | Single A100 80G or 2x A6000 |

	MoE models load all expert weights into VRAM but route each token through only a subset of experts, so the VRAM requirement is determined by total parameters, not active parameters.
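As a back-of-the-envelope check on why total parameters drive the requirement, weight memory can be approximated as total parameters times bits per weight divided by 8. The helper below is a rough sketch with a hypothetical name; it ignores quantization scales, layers left unquantized, the KV cache, and activations, so real usage is higher.

```python
def estimate_weight_gb(total_params_billions, weight_bits):
    """Approximate weight memory in GB (1 GB = 1e9 bytes) from total parameters."""
    total_bytes = total_params_billions * 1e9 * weight_bits / 8
    return total_bytes / 1e9


# A 35B-total MoE needs all 35B weights resident, regardless of active parameters.
for bits in (16, 8, 4):
    print(f"35B total params at {bits}-bit weights: ~{estimate_weight_gb(35, bits):.1f} GB")
```

The remaining headroom in the table's ranges goes to the KV cache, which grows with `max_model_len`, and to runtime overhead.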

	---

	## Contact

	- Developer: Andrew Mello
	- Organization: [huggingface.co/88plug](https://huggingface.co/88plug)
	- Issues and model requests: open a discussion on the relevant model repo.

	Model uploads are automated via the [88plug-bot](https://huggingface.co/88plug-bot) account.