---
language:
- en
library_name: vllm
pipeline_tag: text-generation
tags:
  - text-generation
  - conversational
  - moe
  - quantized
  - compressed-tensors
  - awq
  - w4a16
  - nvfp4
base_model: MiniMaxAI/MiniMax-M2.5
base_model_relation: quantized
quantized_by: TheHouseOfTheDude
license: other
---

# MiniMax-M2.5 — Quantized (compressed-tensors for vLLM)

This repository contains **quantized inference builds** of **MiniMaxAI/MiniMax-M2.5** exported in the **compressed-tensors** layout for **vLLM**.

MiniMax-M2.5 is a large **Mixture-of-Experts (MoE)** model. The accompanying quantization scripts calibrate **all experts** (not just the router's top-k) to produce more robust scales across the full mixture.

---

## Variants / Branches

This repo publishes **two quant variants**:

- **AWQ-INT4** — weight-only AWQ (**INT4 weights**, FP16/BF16 activations at runtime)
- **NVFP4** — NVFP4 quant (**FP4 weights + FP4 activations**), intended for runtimes that support NVFP4 kernels

> The `main` branch is typically a landing page. The runnable artifacts live under the **AWQ-INT4** and **NVFP4** branches.

---

## What’s inside (per variant)

Each variant branch includes:

- Sharded quantized weights (`*.safetensors`) + `model.safetensors.index.json`
- `config.json` with compressed-tensors quant metadata
- Tokenizer artifacts (and chat template assets if present)

Exports are written with `save_compressed=True` so vLLM can load them as **compressed-tensors**.

---

## Critical MoE detail: all experts are activated during calibration

Calibration is **MoE-aware**:

1. Each MoE block is wrapped/replaced during calibration so **ALL experts execute** for calibration forward passes.
2. The oneshot quant call is configured to **calibrate all experts** end-to-end.

**Why it matters:** If only top-k experts are exercised, rare experts can receive poor scales and quantize badly—leading to instability when those experts trigger at inference time.
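
The failure mode above can be illustrated with a toy sketch (pure Python, all names hypothetical): a degenerate router that collapses onto a few experts leaves the rest with zero calibration statistics unless calibration forces all experts to run.

```python
# Toy illustration (not the repo's actual script): top-k-only calibration vs.
# all-experts calibration in a tiny MoE layer.

class ToyMoE:
    """Tracks how many calibration samples each expert sees."""

    def __init__(self, num_experts=8, top_k=2):
        self.num_experts = num_experts
        self.top_k = top_k
        self.calib_counts = [0] * num_experts

    def route(self, token_id):
        # Degenerate router for illustration: it always prefers the
        # lowest-indexed experts, so experts top_k..N-1 are never chosen.
        return list(range(self.top_k))

    def calibrate(self, token_id, all_experts=False):
        chosen = range(self.num_experts) if all_experts else self.route(token_id)
        for e in chosen:
            self.calib_counts[e] += 1  # a real pass would record activation stats

topk_only, all_experts = ToyMoE(), ToyMoE()
for tok in range(100):
    topk_only.calibrate(tok, all_experts=False)
    all_experts.calibrate(tok, all_experts=True)

# Rarely-routed experts get zero statistics under top-k-only calibration,
# so any quant scales derived for them would be meaningless.
print(min(topk_only.calib_counts), min(all_experts.calib_counts))  # 0 100
```

With all-experts calibration, every expert accumulates the full sample count, so every expert's scales are grounded in real activations.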

---

## Quantization scope: what is and is not quantized

### Shared rule (both variants)

The scripts are designed to quantize **only the MoE expert MLP weights**, e.g.:

- `block_sparse_moe.experts.*.w1`
- `block_sparse_moe.experts.*.w2`
- `block_sparse_moe.experts.*.w3`

Everything else is excluded for stability (embeddings, attention, router/gate, norms, rotary, `lm_head`, etc.).
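
As a sketch of how such a scope can be enforced, the expert MLP module names listed above can be matched by pattern while everything else is skipped (the regex below is illustrative, not necessarily the exact filter the scripts use):

```python
# Illustrative scope filter: quantize only MoE expert MLP linears
# (block_sparse_moe.experts.*.w1 / w2 / w3), exclude everything else.
import re

EXPERT_MLP_RE = re.compile(r"block_sparse_moe\.experts\.\d+\.w[123]$")

def should_quantize(module_name: str) -> bool:
    """True only for MoE expert MLP weight modules."""
    return bool(EXPERT_MLP_RE.search(module_name))

names = [
    "model.layers.0.block_sparse_moe.experts.3.w1",
    "model.layers.0.block_sparse_moe.gate",          # router: excluded
    "model.layers.0.self_attn.q_proj",               # attention: excluded
    "lm_head",                                       # output head: excluded
]
print([n for n in names if should_quantize(n)])
```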

---

## AWQ-INT4 (W4A16) details

- **Weights:** INT4 (`num_bits=4`, symmetric)
- **Activations:** A16 runtime (FP16/BF16)
- **Grouping:** group-wise AWQ; group size is configured by the script/CLI
- **Targets:** linear layers (restricted to expert MLP linears per scope)
- **Ignored:** attention/embeddings/router/norms/`lm_head` (kept higher precision)
- **Smoothing:** script sets up scaling maps around post-attn norms and expert MLP weights to improve stability
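
The group-wise W4A16 scheme above can be sketched in a few lines of pure Python (a toy round-trip, not the actual packed kernel layout; the group size here is arbitrary):

```python
# Minimal sketch of group-wise symmetric INT4 weight quantization:
# one scale per group, integers in the symmetric range, weights round-trip
# with error bounded by half a quantization step.

def quantize_int4_groupwise(weights, group_size=128):
    """Quantize a flat list of weights to symmetric INT4, one scale per group."""
    q, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        amax = max(abs(w) for w in group) or 1.0
        scale = amax / 7.0                      # symmetric: use +/-7 of [-8, 7]
        scales.append(scale)
        q.extend(max(-8, min(7, round(w / scale))) for w in group)
    return q, scales

def dequantize(q, scales, group_size=128):
    return [v * scales[i // group_size] for i, v in enumerate(q)]

w = [0.05 * ((-1) ** i) * (i % 13) for i in range(256)]
qw, s = quantize_int4_groupwise(w, group_size=128)
wd = dequantize(qw, s, group_size=128)
max_err = max(abs(a - b) for a, b in zip(w, wd))
print(max_err <= max(s) / 2 + 1e-9)  # True: error <= half a quant step
```

At runtime the INT4 weights are dequantized (per group) back to FP16/BF16 before the matmul, which is why activations stay at A16 precision.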

---

## NVFP4 details

- **Weights:** FP4
- **Activations:** FP4
- **Targets:** linear layers (restricted to expert MLP linears per scope)
- **Ignored:** attention/embeddings/router/norms/`lm_head`
- **Runtime:** requires NVFP4-capable kernels (often newer GPU + software stack)
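
To make the FP4 format concrete: E2M1 has only 16 code points, so quantization amounts to scaling a value into range and snapping it to the nearest representable number. The toy below uses one scale for a whole vector; real NVFP4 additionally applies small per-block scale factors, which this sketch omits.

```python
# Illustrative FP4 (E2M1) round-to-nearest. The representable magnitudes are
# {0, 0.5, 1, 1.5, 2, 3, 4, 6}; with a sign bit that is 16 code points total.

FP4_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
FP4_VALUES = sorted({s * m for m in FP4_MAGNITUDES for s in (-1.0, 1.0)})

def quantize_fp4(values):
    """Scale a vector so its max maps to +/-6, then snap to the FP4 grid."""
    amax = max(abs(v) for v in values) or 1.0
    scale = amax / 6.0
    out = []
    for v in values:
        x = v / scale
        out.append(min(FP4_VALUES, key=lambda g: abs(g - x)) * scale)
    return out

vals = [0.9, -0.31, 0.07, -1.0]
print(quantize_fp4(vals))
```

Because both weights and activations are reduced to this 16-point grid, NVFP4 needs hardware kernels that operate on FP4 directly; without them there is no speed or memory benefit.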

---

## Calibration data, sample count, and sequence length

Both scripts use a **dataset recipe YAML/config** that controls:

- `max_seq_length`
- shuffle + seed
- optional `num_samples`
- dataset sources with formatter/column mapping and per-source sample counts

**Tokenization behavior**

- `padding=False`
- `truncation=True`
- `max_length=MAX_SEQUENCE_LENGTH`
- `add_special_tokens=False`

> The exact dataset names/counts live in your recipe file; this README documents the pipeline and knobs.

---

## FP8 compatibility handling (base stored as FP8)

If the base ships FP8 parameters, the scripts:

- load in BF16,
- convert FP8 parameters to BF16 for quantization compatibility,
- sanitize quantization-related config fields to avoid serialization/tracing issues.
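
Schematically, that preprocessing looks like the sketch below. This is a pure-Python stand-in (dtypes represented as strings, config as a plain dict); the real scripts perform the equivalent operations on torch tensors and the model's `config.json`.

```python
# Schematic FP8-base preparation: upcast FP8 parameters to BF16 and strip the
# base model's own quantization metadata so the exporter doesn't serialize a
# conflicting scheme. Names and structure here are illustrative.

def prepare_fp8_base(param_dtypes: dict, config: dict):
    """Return (dtypes, config) ready for re-quantization."""
    upcast = {
        name: ("bfloat16" if dt.startswith("float8") else dt)
        for name, dt in param_dtypes.items()
    }
    # Drop stale quantization metadata inherited from the FP8 checkpoint.
    clean = {k: v for k, v in config.items() if k != "quantization_config"}
    return upcast, clean

dtypes = {"experts.0.w1": "float8_e4m3fn", "norm.weight": "bfloat16"}
cfg = {"hidden_size": 6144, "quantization_config": {"fmt": "fp8"}}
print(prepare_fp8_base(dtypes, cfg))
```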

---

## Quickstart (vLLM)

### AWQ-INT4 branch

```bash
pip install -U vllm

vllm serve TheHouseOfTheDude/MiniMax-M2.5:AWQ-INT4 \
  --quantization compressed-tensors \
  --tensor-parallel-size 8 \
  --enable-expert-parallel \
  --dtype bfloat16
```

### NVFP4 branch

```bash
pip install -U vllm

vllm serve TheHouseOfTheDude/MiniMax-M2.5:NVFP4 \
  --quantization compressed-tensors \
  --tensor-parallel-size 8 \
  --enable-expert-parallel
```

**Notes**

- MiniMax-M2.5 is extremely large; multi-GPU + expert parallel is strongly recommended.
- Long context is KV-cache heavy; tune `--max-model-len`, batch size, and GPU memory utilization accordingly.
- Serving from a local path works too—point `vllm serve` at the variant directory (e.g., `.../AWQ-INT4` or `.../NVFP4`).

---

## Intended use

- High-throughput instruction/chat inference where MoE efficiency matters
- Large-scale serving stacks that benefit from reduced weight bandwidth and memory
- Long-context workloads (subject to your hardware limits)

Quantization changes **weight representation only**. It does not modify tokenizer, chat template, or safety behavior. Apply your own safety policies/filters as appropriate.

---

## Lineage

- **Base model:** https://huggingface.co/MiniMaxAI/MiniMax-M2.5
- **This repo:** quantized inference variants exported to **compressed-tensors** for vLLM:
  - **AWQ-INT4**
  - **NVFP4**

---

## Changelog

- **v1 (current)** — Initial release with two quant variants:
  - **AWQ-INT4** (expert-only W4A16 AWQ; all-experts calibration; group size configurable in script)
  - **NVFP4** (FP4 weights + FP4 activations; expert-only scope; all-experts calibration; requires NVFP4-capable runtime)