| <!DOCTYPE html> |
| <html lang="en"> |
| <head> |
| <meta charset="UTF-8"> |
| <meta name="viewport" content="width=device-width, initial-scale=1.0"> |
| <title>88plug AI Lab</title> |
| <style> |
| :root { |
| --bg: #0f1117; |
| --surface: #1a1d27; |
| --border: #2a2d3e; |
| --accent: #6366f1; |
| --accent2: #818cf8; |
| --text: #e2e8f0; |
| --muted: #94a3b8; |
| --code-bg: #0d1117; |
| --green: #22c55e; |
| } |
| * { box-sizing: border-box; margin: 0; padding: 0; } |
| body { |
| background: var(--bg); |
| color: var(--text); |
| font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', system-ui, sans-serif; |
| font-size: 15px; |
| line-height: 1.7; |
| max-width: 900px; |
| margin: 0 auto; |
| padding: 40px 24px 80px; |
| } |
| h1 { font-size: 2rem; font-weight: 700; color: #fff; margin-bottom: 4px; } |
| h2 { font-size: 1.25rem; font-weight: 600; color: #fff; margin: 40px 0 12px; padding-bottom: 8px; border-bottom: 1px solid var(--border); } |
| h3 { font-size: 1rem; font-weight: 600; color: var(--accent2); margin: 28px 0 10px; } |
| p { margin-bottom: 14px; color: var(--text); } |
| a { color: var(--accent2); text-decoration: none; } |
| a:hover { text-decoration: underline; } |
| hr { border: none; border-top: 1px solid var(--border); margin: 32px 0; } |
| .header { margin-bottom: 32px; } |
| .subtitle { color: var(--muted); font-size: 0.95rem; margin-top: 6px; } |
| .badge { |
| display: inline-block; |
| background: rgba(99, 102, 241, 0.15); |
| color: var(--accent2); |
| border: 1px solid rgba(99, 102, 241, 0.3); |
| border-radius: 4px; |
| font-size: 0.75rem; |
| font-weight: 600; |
| padding: 2px 8px; |
| margin-right: 6px; |
| letter-spacing: 0.05em; |
| text-transform: uppercase; |
| } |
| table { width: 100%; border-collapse: collapse; margin: 14px 0 24px; font-size: 0.9rem; } |
| th { background: var(--surface); color: var(--muted); font-weight: 600; text-align: left; padding: 8px 12px; border-bottom: 1px solid var(--border); font-size: 0.8rem; letter-spacing: 0.04em; text-transform: uppercase; } |
| td { padding: 8px 12px; border-bottom: 1px solid var(--border); vertical-align: top; } |
| tr:last-child td { border-bottom: none; } |
| tr:hover td { background: rgba(255,255,255,0.02); } |
| code { |
| font-family: 'JetBrains Mono', 'Fira Code', 'Cascadia Code', monospace; |
| font-size: 0.85em; |
| background: var(--code-bg); |
| border: 1px solid var(--border); |
| border-radius: 4px; |
| padding: 1px 5px; |
| color: #e879f9; |
| } |
| pre { |
| background: var(--code-bg); |
| border: 1px solid var(--border); |
| border-radius: 8px; |
| padding: 16px 20px; |
| overflow-x: auto; |
| margin: 12px 0 20px; |
| font-size: 0.85rem; |
| line-height: 1.6; |
| } |
| pre code { |
| background: none; |
| border: none; |
| padding: 0; |
| color: #a5f3fc; |
| font-size: inherit; |
| } |
| .model-family { |
| background: var(--surface); |
| border: 1px solid var(--border); |
| border-radius: 10px; |
| padding: 20px 24px; |
| margin-bottom: 16px; |
| } |
| .model-family h3 { margin-top: 0; } |
| .quality-grid { display: grid; grid-template-columns: 1fr 1fr; gap: 12px; margin: 16px 0 24px; } |
| .quality-card { |
| background: var(--surface); |
| border: 1px solid var(--border); |
| border-radius: 8px; |
| padding: 16px 20px; |
| } |
| .quality-card .tier { font-size: 1.1rem; font-weight: 700; color: #fff; margin-bottom: 4px; } |
| .quality-card .method { color: var(--muted); font-size: 0.85rem; margin-bottom: 8px; } |
| .quality-card .recovery { color: var(--green); font-weight: 600; font-size: 0.9rem; } |
| .contact { background: var(--surface); border: 1px solid var(--border); border-radius: 10px; padding: 20px 24px; } |
| ul { padding-left: 20px; margin-bottom: 14px; } |
| li { margin-bottom: 6px; } |
| @media (max-width: 600px) { |
| .quality-grid { grid-template-columns: 1fr; } |
| body { padding: 24px 16px 60px; } |
| h1 { font-size: 1.5rem; } |
| } |
| </style> |
| </head> |
| <body> |
|
|
| <div class="header"> |
| <h1>π 88plug AI Lab</h1> |
| <p class="subtitle">Production-grade compressed-tensors quantizations of frontier LLMs, VLMs, and omni models β engineered for native vLLM v0.9.0+ deployment.</p> |
| </div> |
|
|
| <h2>Why compressed-tensors</h2> |
| <p>Most quantization formats (AWQ, GPTQ, GGUF) target a single inference backend and ship a frozen weight layout that cannot be further composed or modified at load time. <code>compressed-tensors</code> is the format developed by Neural Magic and maintained as a first-class vLLM citizen.</p> |
| <ul> |
| <li><strong>Native vLLM integration.</strong> No format conversion, no plugin shims. vLLM reads compressed-tensors models directly via its built-in <code>CompressedTensorsWorker</code>. Full PagedAttention, continuous batching, and tensor parallelism work without modification.</li> |
| <li><strong>Composable precision.</strong> A single checkpoint can carry per-layer or per-group precision assignments. Mixed-precision MoE configurations are expressed in the same file.</li> |
| <li><strong>Reproducible calibration metadata.</strong> The quantization config, calibration scheme, and per-channel scales are stored inside the checkpoint.</li> |
| <li><strong>Forward compatibility.</strong> As vLLM adds new kernel support (FP8, INT8, sparse), compressed-tensors models gain that support without re-quantizing.</li> |
| </ul> |
| <p>AWQ and GPTQ remain fine for llama.cpp and older toolchains. If you are deploying on vLLM in production, compressed-tensors is the correct choice.</p> |
|
|
| <h2>Quality Standard</h2> |
| <div class="quality-grid"> |
| <div class="quality-card"> |
| <div class="tier">W8A16</div> |
| <div class="method">RTN / AutoRound iters=200</div> |
| <div class="recovery">>99.5% MMLU recovery</div> |
| <p style="font-size:0.85rem;color:var(--muted);margin:8px 0 0">Ampere+ (A100, A6000, RTX 30xx+)</p> |
| </div> |
| <div class="quality-card"> |
| <div class="tier">W4A16</div> |
| <div class="method">AutoRound iters=200 (SignSGD)</div> |
| <div class="recovery">β₯99% MMLU recovery</div> |
| <p style="font-size:0.85rem;color:var(--muted);margin:8px 0 0">Ampere+ (A100, A6000, RTX 30xx+)</p> |
| </div> |
| </div> |
| <p style="color:var(--muted);font-size:0.875rem">AutoRound at iters=200 runs sign-gradient optimization over a calibration set to minimize weight rounding error. At W4A16, this closes most of the gap between naive round-to-nearest and GPTQ/AWQ, while producing a checkpoint that vLLM can load natively.</p> |
|
|
| <h2>Model Catalog</h2> |
| <p style="color:var(--muted);font-size:0.875rem">All 16 models in compressed-tensors format, validated for vLLM v0.9.0+.</p> |
|
|
| <div class="model-family"> |
| <h3>Qwen3.6-35B-A3B β Mixed-Precision MoE, 1M context</h3> |
| <table> |
| <tr><th>Precision</th><th>Repo</th><th>Architecture</th></tr> |
| <tr><td><span class="badge">W8A16</span></td><td><a href="https://huggingface.co/88plug/Qwen3.6-35B-A3B-W8A16">88plug/Qwen3.6-35B-A3B-W8A16</a></td><td>MoE, 35B total / 3.6B active</td></tr> |
| <tr><td><span class="badge">W4A16</span></td><td><a href="https://huggingface.co/88plug/Qwen3.6-35B-A3B-W4A16">88plug/Qwen3.6-35B-A3B-W4A16</a></td><td>MoE, 35B total / 3.6B active</td></tr> |
| </table> |
| </div> |
|
|
| <div class="model-family"> |
| <h3>Qwen3.6-27B β Dense Hybrid, 262k context</h3> |
| <table> |
| <tr><th>Precision</th><th>Repo</th><th>Architecture</th></tr> |
| <tr><td><span class="badge">W8A16</span></td><td><a href="https://huggingface.co/88plug/Qwen3.6-27B-W8A16">88plug/Qwen3.6-27B-W8A16</a></td><td>Dense, 27B</td></tr> |
| <tr><td><span class="badge">W4A16</span></td><td><a href="https://huggingface.co/88plug/Qwen3.6-27B-W4A16">88plug/Qwen3.6-27B-W4A16</a></td><td>Dense, 27B</td></tr> |
| </table> |
| </div> |
|
|
| <div class="model-family"> |
| <h3>Qwen3-Omni-30B-A3B β Audio + Vision + Speech</h3> |
| <table> |
| <tr><th>Precision</th><th>Repo</th><th>Architecture</th></tr> |
| <tr><td><span class="badge">W8A16</span></td><td><a href="https://huggingface.co/88plug/Qwen3-Omni-30B-A3B-W8A16">88plug/Qwen3-Omni-30B-A3B-W8A16</a></td><td>Omni MoE, 30B / 3B active</td></tr> |
| <tr><td><span class="badge">W4A16</span></td><td><a href="https://huggingface.co/88plug/Qwen3-Omni-30B-W4A16">88plug/Qwen3-Omni-30B-W4A16</a></td><td>Omni MoE, 30B / 3B active</td></tr> |
| </table> |
| </div> |
|
|
| <div class="model-family"> |
| <h3>Qwen2.5-Omni-7B β Efficient Omni</h3> |
| <table> |
| <tr><th>Precision</th><th>Repo</th><th>Architecture</th></tr> |
| <tr><td><span class="badge">W8A16</span></td><td><a href="https://huggingface.co/88plug/Qwen2.5-Omni-7B-W8A16">88plug/Qwen2.5-Omni-7B-W8A16</a></td><td>Omni dense, 7B</td></tr> |
| <tr><td><span class="badge">W4A16</span></td><td><a href="https://huggingface.co/88plug/Qwen2.5-Omni-7B-W4A16">88plug/Qwen2.5-Omni-7B-W4A16</a></td><td>Omni dense, 7B</td></tr> |
| </table> |
| </div> |
|
|
| <div class="model-family"> |
| <h3>Gemma4-E4B-it β Vision-Language Model</h3> |
| <table> |
| <tr><th>Precision</th><th>Repo</th><th>Architecture</th></tr> |
| <tr><td><span class="badge">W8A16</span></td><td><a href="https://huggingface.co/88plug/Gemma4-E4B-it-W8A16">88plug/Gemma4-E4B-it-W8A16</a></td><td>VLM MoE, 4B active / 28B total</td></tr> |
| <tr><td><span class="badge">W4A16</span></td><td><a href="https://huggingface.co/88plug/Gemma4-E4B-it-W4A16">88plug/Gemma4-E4B-it-W4A16</a></td><td>VLM MoE, 4B active / 28B total</td></tr> |
| </table> |
| </div> |
|
|
| <div class="model-family"> |
| <h3>Gemma4-E2B-it β Ultra-Efficient VLM</h3> |
| <table> |
| <tr><th>Precision</th><th>Repo</th><th>Architecture</th></tr> |
| <tr><td><span class="badge">W8A16</span></td><td><a href="https://huggingface.co/88plug/Gemma4-E2B-it-W8A16">88plug/Gemma4-E2B-it-W8A16</a></td><td>VLM MoE, 2B active / 26B total</td></tr> |
| <tr><td><span class="badge">W4A16</span></td><td><a href="https://huggingface.co/88plug/Gemma4-E2B-it-W4A16">88plug/Gemma4-E2B-it-W4A16</a></td><td>VLM MoE, 2B active / 26B total</td></tr> |
| </table> |
| </div> |
|
|
| <div class="model-family"> |
| <h3>MiniCPM-o-4.5 β Omni Model</h3> |
| <table> |
| <tr><th>Precision</th><th>Repo</th><th>Architecture</th></tr> |
| <tr><td><span class="badge">W8A16</span></td><td><a href="https://huggingface.co/88plug/MiniCPM-o-4.5-W8A16">88plug/MiniCPM-o-4.5-W8A16</a></td><td>Omni dense</td></tr> |
| <tr><td><span class="badge">W4A16</span></td><td><a href="https://huggingface.co/88plug/MiniCPM-o-4.5-W4A16">88plug/MiniCPM-o-4.5-W4A16</a></td><td>Omni dense</td></tr> |
| </table> |
| </div> |
|
|
| <div class="model-family"> |
| <h3>Nemotron-3-Nano-30B-A3B β Hybrid SSM/Attention</h3> |
| <table> |
| <tr><th>Precision</th><th>Repo</th><th>Architecture</th></tr> |
| <tr><td><span class="badge">W8A16</span></td><td><a href="https://huggingface.co/88plug/Nemotron-3-Nano-30B-A3B-W8A16">88plug/Nemotron-3-Nano-30B-A3B-W8A16</a></td><td>Hybrid Mamba2 SSM + Attention MoE</td></tr> |
| <tr><td><span class="badge">W4A16</span></td><td><a href="https://huggingface.co/88plug/Nemotron-3-Nano-30B-A3B-W4A16">88plug/Nemotron-3-Nano-30B-A3B-W4A16</a></td><td>Hybrid Mamba2 SSM + Attention MoE</td></tr> |
| </table> |
| </div> |
|
|
| <h2>Quickstart</h2> |
| <p>Requires vLLM v0.9.0+ and an Ampere-class GPU (A100, A6000, RTX 3090/4090, or equivalent).</p> |
|
|
| <h3>Install</h3> |
| <pre><code>pip install vllm>=0.9.0</code></pre> |
|
|
| <h3>Offline inference</h3> |
| <pre><code>from vllm import LLM, SamplingParams |
|
|
| llm = LLM( |
| model="88plug/Qwen3.6-35B-A3B-W4A16", |
| max_model_len=131072, |
| tensor_parallel_size=1, |
| ) |
|
|
| sampling_params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=512) |
| outputs = llm.generate(["Explain W4A16 vs W8A16 tradeoffs."], sampling_params) |
| print(outputs[0].outputs[0].text)</code></pre> |
|
|
| <h3>OpenAI-compatible server</h3> |
| <pre><code>vllm serve 88plug/Qwen3.6-35B-A3B-W4A16 \ |
| --max-model-len 131072 \ |
| --port 8000</code></pre> |
|
|
| <h2>Hardware Requirements</h2> |
| <table> |
| <tr><th>Model Size</th><th>W8A16 VRAM</th><th>W4A16 VRAM</th><th>Recommended</th></tr> |
| <tr><td>2Bβ7B</td><td>8β16 GB</td><td>6β10 GB</td><td>Single A6000 / RTX 4090</td></tr> |
| <tr><td>27Bβ35B (dense)</td><td>32β40 GB</td><td>20β28 GB</td><td>Single A100 80G or 2Γ A6000</td></tr> |
| <tr><td>30Bβ35B (MoE, 3B active)</td><td>28β36 GB</td><td>18β24 GB</td><td>Single A100 80G or 2Γ A6000</td></tr> |
| </table> |
|
|
| <hr> |
|
|
| <div class="contact"> |
| <strong>Contact</strong><br> |
| Developer: Andrew Mello Β· <a href="https://88plug.com">88plug.com</a><br> |
| Issues and model requests: open a Discussion on the relevant model repo.<br> |
| <span style="color:var(--muted);font-size:0.85rem">Uploads automated via <a href="https://huggingface.co/88plug-bot">88plug-bot</a>.</span> |
| </div> |
|
|
| </body> |
| </html> |
|
|