Fixing Model Card
README.md
CHANGED

# MiniMax-M2.5 — Quantized (compressed-tensors for vLLM)

This repository contains **quantized inference builds** of **MiniMaxAI/MiniMax-M2.5** exported in the **compressed-tensors** layout for **vLLM**.

MiniMax-M2.5 is a large **Mixture-of-Experts (MoE)** model. The attached quant scripts calibrate **all experts** (not just the router's top-k) to produce more robust scales across the full mixture.

---

This repo publishes **two quant variants**:

- **AWQ-INT4** — weight-only AWQ (**INT4 weights**, FP16/BF16 activations at runtime)
- **NVFP4** — NVFP4 quant (**FP4 weights + FP4 activations**), intended for runtimes that support NVFP4 kernels

> The `main` branch is typically a landing page. The runnable artifacts live under the **AWQ-INT4** and **NVFP4** branches.

---

Each variant branch includes:

- `config.json` with compressed-tensors quant metadata
- Tokenizer artifacts (and chat template assets if present)

Exports are written with `save_compressed=True` so vLLM can load them as **compressed-tensors**; a minimal export sketch follows.
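
For orientation only, the export step described above might look like the following sketch. The variable names and output path are placeholders, and the `save_compressed=True` keyword is assumed to be the llmcompressor-style `save_pretrained` argument referenced above, not a verbatim excerpt of the release script:

```python
# Hedged sketch of the export step; `model` / `tokenizer` are the calibrated
# objects produced by the quant script, and the output path is a placeholder.
OUTPUT_DIR = "MiniMax-M2.5-AWQ-INT4"

model.save_pretrained(OUTPUT_DIR, save_compressed=True)  # compressed-tensors shards + quant metadata in config.json
tokenizer.save_pretrained(OUTPUT_DIR)                     # tokenizer and chat template assets
```
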

---

## Critical MoE detail: all experts are activated during calibration

Calibration is **MoE-aware**:

1. Each MoE block is wrapped/replaced during calibration so **all experts execute** on every calibration forward pass.
2. The oneshot quant call is configured to **calibrate all experts** end-to-end.

**Why it matters:** if only the top-k experts are exercised, rarely-routed experts can receive poor scales and quantize badly, leading to instability when those experts do trigger at inference time. A sketch of the wrapping idea follows.
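
Purely as an illustration of that wrapping idea (the release scripts are not reproduced here), the class and attribute names below (`AllExpertCalibrationWrapper`, `experts`) are hypothetical stand-ins for the real MiniMax module structure:

```python
import torch
import torch.nn as nn


class AllExpertCalibrationWrapper(nn.Module):
    """Hypothetical calibration-time wrapper: run EVERY expert on the hidden
    states so each expert's linears see calibration activations, while the
    block's normal (top-k routed) output is what actually flows onward."""

    def __init__(self, moe_block: nn.Module):
        super().__init__()
        self.moe_block = moe_block

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Touch every expert so quantization observers collect statistics for all of them.
        for expert in self.moe_block.experts:  # assumed attribute name
            _ = expert(hidden_states)
        # Preserve normal routing semantics for the rest of the forward pass.
        return self.moe_block(hidden_states)
```

In this sketch each MoE block would be swapped for the wrapper before the calibration passes and swapped back before export.
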

---

## Quantization scope: what is and is not quantized

### Shared rule (both variants)

The scripts are designed to quantize **only the MoE expert MLP weights**, e.g.:

- `block_sparse_moe.experts.*.w1`
- `block_sparse_moe.experts.*.w2`
- `block_sparse_moe.experts.*.w3`

Everything else is excluded for stability (embeddings, attention, router/gate, norms, rotary, `lm_head`, etc.).

---

## AWQ-INT4 (W4A16) details

- **Weights:** INT4 (`num_bits=4`, symmetric)
- **Activations:** A16 runtime (FP16/BF16)
- **Grouping:** group-wise AWQ; the group size is configured by the script/CLI
- **Targets:** linear layers (restricted to the expert MLP linears per the scope above)
- **Ignored:** attention/embeddings/router/norms/`lm_head` (kept at higher precision)
- **Smoothing:** the script sets up scaling maps around the post-attention norms and expert MLP weights to improve stability (a recipe sketch follows)
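
A minimal sketch, assuming the variant is produced with llmcompressor's `AWQModifier` and `oneshot`. The exact release script, its CLI group-size plumbing, smoothing mappings, and calibration data are not reproduced; the regex patterns simply mirror the module names listed above, and the dataset/length values are placeholders:

```python
# Hedged sketch of an expert-only W4A16 AWQ pass -- assumes llmcompressor is installed.
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.awq import AWQModifier

MODEL_ID = "MiniMaxAI/MiniMax-M2.5"
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="bfloat16", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

recipe = AWQModifier(
    scheme="W4A16",  # INT4 weights, 16-bit activations; the real script sets its own group size
    targets=[r"re:.*block_sparse_moe\.experts\.\d+\.(w1|w2|w3)$"],  # expert MLP linears only
    ignore=[
        "lm_head",
        r"re:.*embed_tokens.*",
        r"re:.*self_attn.*",               # attention stack stays high precision
        r"re:.*block_sparse_moe\.gate.*",  # MoE router
    ],
)

oneshot(
    model=model,
    dataset="open_platypus",     # placeholder; the release scripts build data from a recipe YAML (see below)
    recipe=recipe,
    max_seq_length=4096,         # placeholder values
    num_calibration_samples=256,
)
# Export then follows the `save_compressed=True` step shown earlier.
```
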

---

## NVFP4 details

- **Weights:** FP4
- **Activations:** FP4
- **Targets:** linear layers (restricted to the expert MLP linears per the scope above)
- **Ignored:** attention/embeddings/router/norms/`lm_head`
- **Runtime:** requires NVFP4-capable kernels (typically a newer GPU plus a recent software stack); a sketch follows
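
As with the AWQ sketch above, this is an illustration only, assuming llmcompressor's `QuantizationModifier` with its `NVFP4` scheme; `model` and the calibration data are prepared the same way as in that sketch:

```python
# Hedged sketch of an expert-only NVFP4 pass (FP4 weights + FP4 activations).
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

recipe = QuantizationModifier(
    scheme="NVFP4",
    targets=[r"re:.*block_sparse_moe\.experts\.\d+\.(w1|w2|w3)$"],  # expert MLP linears only
    ignore=["lm_head", r"re:.*self_attn.*", r"re:.*block_sparse_moe\.gate.*"],
)

oneshot(
    model=model,                 # prepared as in the AWQ sketch above
    dataset="open_platypus",     # placeholder; real data comes from the recipe YAML
    recipe=recipe,
    max_seq_length=4096,
    num_calibration_samples=256,
)
```
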

---

## Calibration data, sample count, and sequence length

Both scripts use a **dataset recipe YAML/config** that controls:

- `max_seq_length`
- shuffle + seed
- an optional `num_samples` cap
- dataset sources with formatter / column mapping and per-source sample counts (an illustrative sketch follows)
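
The recipe schema itself lives with the quant scripts and is not published here; purely to illustrate the knobs listed above, a recipe could be represented along these lines (all field names and values below are hypothetical):

```python
# Hypothetical recipe structure mirroring the knobs listed above (not the real schema).
calibration_recipe = {
    "max_seq_length": 4096,
    "shuffle": True,
    "seed": 42,
    "num_samples": 512,  # optional global cap
    "datasets": [
        {
            "name": "<hf-dataset-id>",   # placeholder
            "formatter": "sharegpt",     # e.g. ShareGPT / prompt-answer / chat-completion / raw text
            "columns": {"conversations": "conversations"},
            "num_samples": 256,          # per-source sample count
        },
    ],
}
```
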
**Tokenization behavior** (sketched below):

- `padding=False`
- `truncation=True`
- `max_length=MAX_SEQUENCE_LENGTH`
- `add_special_tokens=False`
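
Concretely, that corresponds to a tokenizer call like the following sketch, where `texts` is the formatted calibration text produced by the recipe-driven loading stage and `MAX_SEQUENCE_LENGTH` comes from `max_seq_length` in the recipe:

```python
# Sketch of the tokenization step listed above; `tokenizer` and `texts` come from
# the recipe-driven loading/formatting stage, and the length value is illustrative.
MAX_SEQUENCE_LENGTH = 4096

encodings = tokenizer(
    texts,
    padding=False,
    truncation=True,
    max_length=MAX_SEQUENCE_LENGTH,
    add_special_tokens=False,
)
```
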

> The exact dataset names/counts live in your recipe file; this README documents the pipeline and knobs.

---

## FP8 compatibility handling (base stored as FP8)

If the base ships FP8 parameters, the scripts (see the sketch below):

- load in BF16,
- convert FP8 parameters to BF16 for quantization compatibility,
- sanitize quantization-related config fields to avoid serialization/tracing issues.
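
A minimal sketch of that compatibility pass, assuming the FP8 tensors use PyTorch's `float8` dtypes and the stale metadata lives in `config.quantization_config` (both assumptions, not a statement about the release scripts):

```python
# Hedged sketch of the FP8 -> BF16 compatibility pass described above.
import torch
import torch.nn as nn


def upcast_fp8_to_bf16(model: nn.Module) -> None:
    """Convert FP8-stored parameters to BF16 in place and drop stale FP8
    quantization metadata so downstream tooling sees a uniform BF16 model."""
    fp8_dtypes = {torch.float8_e4m3fn, torch.float8_e5m2}
    for _, param in model.named_parameters():
        if param.dtype in fp8_dtypes:
            param.data = param.data.to(torch.bfloat16)

    config = getattr(model, "config", None)
    if config is not None and hasattr(config, "quantization_config"):
        config.quantization_config = None  # assumed field name for the FP8 metadata
```
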

---

## Quickstart (vLLM)

### AWQ-INT4 branch

```bash
pip install -U vllm

vllm serve TheHouseOfTheDude/MiniMax-M2.5:AWQ-INT4 \
  --quantization compressed-tensors \
  --tensor-parallel-size 8 \
  --enable-expert-parallel \
  --dtype bfloat16
```

### NVFP4 branch

```bash
pip install -U vllm

vllm serve TheHouseOfTheDude/MiniMax-M2.5:NVFP4 \
  --quantization compressed-tensors \
  --tensor-parallel-size 8 \
  --enable-expert-parallel
```

**Notes**

- MiniMax-M2.5 is extremely large; multi-GPU serving with expert parallelism is strongly recommended.
- Long context is KV-cache heavy; tune `--max-model-len`, batch size, and GPU memory utilization accordingly.
- Serving from a local path works too: point `vllm serve` at the variant directory (e.g., `.../AWQ-INT4` or `.../NVFP4`); a Python-API sketch follows.
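
For a quick offline sanity check (as opposed to `vllm serve`), something like the following sketch can be used. The local path, context length, and memory fraction are placeholders, and availability of some options (e.g. `enable_expert_parallel`) depends on your vLLM version:

```python
# Hedged offline-inference sketch with vLLM's Python API; values are placeholders.
from vllm import LLM, SamplingParams

llm = LLM(
    model="/path/to/MiniMax-M2.5/AWQ-INT4",   # local checkout of the variant branch
    quantization="compressed-tensors",
    tensor_parallel_size=8,
    enable_expert_parallel=True,
    max_model_len=8192,                        # trim to fit your KV-cache budget
    gpu_memory_utilization=0.90,
    dtype="bfloat16",
)

outputs = llm.generate(
    ["Summarize what expert-only quantization means in one sentence."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```
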

---

## Intended use

- High-throughput instruction/chat inference where MoE efficiency matters
- Large-scale serving stacks that benefit from reduced weight bandwidth and memory footprint
- Long-context workloads (subject to your hardware limits)

Quantization changes the **weight representation only**. It does not modify the tokenizer, chat template, or safety behavior. Apply your own safety policies/filters as appropriate.

---

## Lineage

- **Base model:** https://huggingface.co/MiniMaxAI/MiniMax-M2.5
- **This repo:** quantized inference variants exported to **compressed-tensors** for vLLM:
  - **AWQ-INT4**
  - **NVFP4**

---

## Changelog

- **v1 (current)** — Initial release with two quant variants:
  - **AWQ-INT4** (expert-only W4A16 AWQ; all-experts calibration; group size configurable in the script)
  - **NVFP4** (FP4 weights + FP4 activations; expert-only scope; all-experts calibration; requires an NVFP4-capable runtime)
|