---
language:
- en
library_name: vllm
pipeline_tag: text-generation
tags:
- text-generation
- conversational
- moe
- quantized
- compressed-tensors
- awq
- w4a16
- nvfp4
base_model: MiniMaxAI/MiniMax-M2.5
base_model_relation: quantized
quantized_by: TheHouseOfTheDude
license: other
---
# MiniMax-M2.5 — Quantized (compressed-tensors for vLLM)
This repository contains **quantized inference builds** of **MiniMaxAI/MiniMax-M2.5** exported in the **compressed-tensors** layout for **vLLM**.
MiniMax-M2.5 is a large **Mixture-of-Experts (MoE)** model. The attached quant scripts calibrate **all experts** (not just router top-k) to produce more robust scales across the full mixture.
---
## Variants / Branches
This repo publishes **two quant variants**:
- **AWQ-INT4** — weight-only AWQ (**INT4 weights**, FP16/BF16 activations at runtime)
- **NVFP4** — NVFP4 quant (**FP4 weights + FP4 activations**), intended for runtimes that support NVFP4 kernels
> The `main` branch is typically a landing page. The runnable artifacts live under the **AWQ-INT4** and **NVFP4** branches.
---
## What’s inside (per variant)
Each variant branch includes:
- Sharded quantized weights (`*.safetensors`) + `model.safetensors.index.json`
- `config.json` with compressed-tensors quant metadata
- Tokenizer artifacts (and chat template assets if present)
Exports are written with `save_compressed=True` so vLLM can load them as **compressed-tensors**.
---
## Critical MoE detail: all experts are activated during calibration
Calibration is **MoE-aware**:
1. Each MoE block is wrapped/replaced during calibration so **ALL experts execute** for calibration forward passes.
2. The oneshot quant call is configured to **calibrate all experts** end-to-end.
**Why it matters:** If only the router's top-k experts are exercised during calibration, rarely selected experts receive poorly estimated scales and quantize badly, leading to instability whenever those experts fire at inference time.
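To make this concrete, here is a toy, dependency-free sketch (not the actual quant script) of why a top-k calibration pass leaves some experts with no statistics at all, while an all-experts wrapper covers every expert:

```python
# Toy illustration (not the actual quant script) of top-k vs. all-experts
# calibration coverage. "Experts" are simple closures that record the
# activations they see, standing in for quantization observers.

def make_expert(scale):
    observed = []                      # stand-in for calibration stats
    def expert(x):
        observed.append(x)
        return x * scale
    expert.observed = observed
    return expert

experts = [make_expert(s) for s in (0.5, 1.0, 2.0, 4.0)]

def forward_topk(x, k=1):
    # Inference-style MoE path: only the router's top-k experts run.
    # (Toy router: always prefers the first k experts.)
    return sum(experts[i](x) for i in range(k)) / k

def forward_all_experts(x):
    # Calibration wrapper: every expert runs, so every expert
    # accumulates statistics and can receive a usable quant scale.
    return sum(e(x) for e in experts) / len(experts)

for x in (1.0, 2.0, 3.0):
    forward_topk(x)
uncalibrated = [i for i, e in enumerate(experts) if not e.observed]
print(uncalibrated)  # -> [1, 2, 3]: three experts never saw any data

for x in (1.0, 2.0, 3.0):
    forward_all_experts(x)
uncalibrated = [i for i, e in enumerate(experts) if not e.observed]
print(uncalibrated)  # -> []: all experts now have statistics
```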
---
## Quantization scope: what is and is not quantized
### Shared rule (both variants)
The scripts are designed to quantize **only the MoE expert MLP weights**, e.g.:
- `block_sparse_moe.experts.*.w1`
- `block_sparse_moe.experts.*.w2`
- `block_sparse_moe.experts.*.w3`
Everything else is excluded for stability (embeddings, attention, router/gate, norms, rotary, `lm_head`, etc.).
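A minimal sketch of how such a scope can be expressed as name patterns (the patterns below are illustrative; the real scripts carry their own target/ignore configuration):

```python
# Illustrative expert-only quantization scope via glob patterns.
from fnmatch import fnmatch

QUANT_PATTERNS = [
    "*.block_sparse_moe.experts.*.w1",
    "*.block_sparse_moe.experts.*.w2",
    "*.block_sparse_moe.experts.*.w3",
]

def should_quantize(param_name):
    """True only for expert MLP weights; everything else stays high-precision."""
    return any(fnmatch(param_name, p) for p in QUANT_PATTERNS)

# Expert MLP weights are in scope...
assert should_quantize("model.layers.0.block_sparse_moe.experts.7.w1")
# ...while attention, the router/gate, and lm_head are not.
assert not should_quantize("model.layers.0.self_attn.q_proj")
assert not should_quantize("model.layers.0.block_sparse_moe.gate")
assert not should_quantize("lm_head")
```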
---
## AWQ-INT4 (W4A16) details
- **Weights:** INT4 (`num_bits=4`, symmetric)
- **Activations:** A16 runtime (FP16/BF16)
- **Grouping:** group-wise AWQ; group size is configured by the script/CLI
- **Targets:** linear layers (restricted to expert MLP linears per scope)
- **Ignored:** attention/embeddings/router/norms/`lm_head` (kept higher precision)
- **Smoothing:** the script builds scaling maps that pair post-attention norms with the expert MLP weights, improving stability of the chosen scales
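As intuition for the W4A16 scheme, here is a minimal pure-Python sketch of group-wise symmetric INT4 rounding (real AWQ additionally applies activation-aware smoothing before choosing scales):

```python
# Minimal sketch of group-wise symmetric INT4 weight quantization:
# one scale per group, values rounded into the INT4 range [-8, 7].
# Pure-Python illustration, not the production kernel.

def quantize_groupwise_int4(row, group_size=4):
    q_row, scales = [], []
    for start in range(0, len(row), group_size):
        group = row[start:start + group_size]
        # Symmetric scheme: scale maps the group's max magnitude to 7.
        scale = max(abs(w) for w in group) / 7.0 or 1.0
        scales.append(scale)
        q_row.extend(max(-8, min(7, round(w / scale))) for w in group)
    return q_row, scales

def dequantize(q_row, scales, group_size=4):
    return [q * scales[i // group_size] for i, q in enumerate(q_row)]

weights = [0.10, -0.70, 0.35, 0.02, 2.0, -1.5, 0.5, 0.25]
q, s = quantize_groupwise_int4(weights)
recon = dequantize(q, s)

# Per-group scales keep the rounding error within half a step
# of each group's own scale.
max_err = max(abs(w - r) for w, r in zip(weights, recon))
assert max_err <= max(s) / 2 + 1e-9
```

Smaller group sizes track outliers more closely at the cost of storing more scales; the scripts expose this trade-off via the group-size knob.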
---
## NVFP4 details
- **Weights:** FP4
- **Activations:** FP4
- **Targets:** linear layers (restricted to expert MLP linears per scope)
- **Ignored:** attention/embeddings/router/norms/`lm_head`
- **Runtime:** requires NVFP4-capable kernels (often newer GPU + software stack)
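For intuition, FP4 in the E2M1 encoding can represent only eight magnitudes (0, 0.5, 1, 1.5, 2, 3, 4, 6); NVFP4 pairs these with small per-block scales (FP8 scales over 16-element blocks in the real format). A simplified sketch using a plain float scale:

```python
# Sketch of FP4 (E2M1) rounding, the element format behind NVFP4.
# The real format stores an FP8 scale per 16-element block; a plain
# float scale is used here to keep the illustration simple.

# All magnitudes representable in E2M1 (2 exponent bits, 1 mantissa bit).
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def round_to_e2m1(x):
    """Round x to the nearest representable FP4 value, preserving sign."""
    mag = min(E2M1_VALUES, key=lambda v: abs(abs(x) - v))
    return -mag if x < 0 else mag

def quantize_fp4_block(block):
    """Scale the block so its max magnitude lands on 6.0, then round."""
    scale = max(abs(w) for w in block) / 6.0 or 1.0
    return [round_to_e2m1(w / scale) for w in block], scale

block = [0.9, -0.1, 0.45, 0.3]
codes, scale = quantize_fp4_block(block)   # codes -> [6.0, -0.5, 3.0, 2.0]
recon = [c * scale for c in codes]
```

The coarse 8-value grid is why FP4 activations demand dedicated kernel support rather than a simple cast.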
---
## Calibration data, sample count, and sequence length
Both scripts use a **dataset recipe YAML/config** that controls:
- `max_seq_length`
- shuffle + seed
- optional `num_samples`
- dataset sources with formatter/column mapping and per-source sample counts
**Tokenization behavior**
- `padding=False`
- `truncation=True`
- `max_length=MAX_SEQUENCE_LENGTH`
- `add_special_tokens=False`
> The exact dataset names/counts live in your recipe file; this README documents the pipeline and knobs.
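As an illustration only, a recipe of this shape might look like the following (field names, dataset names, and counts are placeholders, not the actual calibration set):

```yaml
max_seq_length: 4096
shuffle: true
seed: 42
num_samples: 512                        # optional global cap
datasets:
  - name: example/chat-calibration      # placeholder source
    formatter: chat                     # maps columns into chat text
    num_samples: 256
  - name: example/plain-calibration     # placeholder source
    formatter: plain
    column: text
    num_samples: 256
```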
---
## FP8 compatibility handling (base stored as FP8)
If the base ships FP8 parameters, the scripts:
- load in BF16,
- convert FP8 parameters to BF16 for quantization compatibility,
- sanitize quantization-related config fields to avoid serialization/tracing issues.
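The sanitization step can be pictured as a small config-dict cleanup; the key names below are illustrative assumptions, not the scripts' actual field list:

```python
# Sketch of config sanitization: before re-serializing, drop the base
# checkpoint's stale FP8 quantization metadata so the exported
# compressed-tensors config is the only quant config present.
# Key names are illustrative assumptions.

def sanitize_config(config: dict) -> dict:
    cleaned = dict(config)              # leave the original untouched
    for key in ("quantization_config", "compression_config"):
        cleaned.pop(key, None)          # remove if present, ignore if not
    return cleaned

base_cfg = {
    "model_type": "minimax",
    "torch_dtype": "bfloat16",
    "quantization_config": {"quant_method": "fp8"},  # stale FP8 metadata
}
clean_cfg = sanitize_config(base_cfg)
assert "quantization_config" not in clean_cfg
assert clean_cfg["model_type"] == "minimax"
```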
---
## Quickstart (vLLM)
### AWQ-INT4 branch
```bash
pip install -U vllm
vllm serve TheHouseOfTheDude/MiniMax-M2.5 --revision AWQ-INT4 \
--quantization compressed-tensors \
--tensor-parallel-size 8 \
--enable-expert-parallel \
--dtype bfloat16
```
### NVFP4 branch
```bash
pip install -U vllm
vllm serve TheHouseOfTheDude/MiniMax-M2.5 --revision NVFP4 \
--quantization compressed-tensors \
--tensor-parallel-size 8 \
--enable-expert-parallel
```
**Notes**
- MiniMax-M2.5 is extremely large; multi-GPU + expert parallel is strongly recommended.
- Long context is KV-cache heavy; tune `--max-model-len`, batch size, and GPU memory utilization accordingly.
- Serving from a local path also works: point `vllm serve` at the downloaded variant directory (e.g., `.../AWQ-INT4` or `.../NVFP4`).
---
## Intended use
- High-throughput instruction/chat inference where MoE efficiency matters
- Large-scale serving stacks that benefit from reduced weight bandwidth and memory
- Long-context workloads (subject to your hardware limits)
Quantization changes **weight representation only**. It does not modify tokenizer, chat template, or safety behavior. Apply your own safety policies/filters as appropriate.
---
## Lineage
- **Base model:** https://huggingface.co/MiniMaxAI/MiniMax-M2.5
- **This repo:** quantized inference variants exported to **compressed-tensors** for vLLM:
- **AWQ-INT4**
- **NVFP4**
---
## Changelog
- **v1 (current)** — Initial release with two quant variants:
- **AWQ-INT4** (expert-only W4A16 AWQ; all-experts calibration; group size configurable in script)
- **NVFP4** (FP4 weights + FP4 activations; expert-only scope; all-experts calibration; requires NVFP4-capable runtime)