phaedawg committed on
Commit 37419ca · verified · 1 Parent(s): 8edd70b

Fixing Model Card

Files changed (1):
  1. README.md +97 -43

README.md CHANGED
@@ -20,9 +20,9 @@ license: other
 
 # MiniMax-M2.5 — Quantized (compressed-tensors for vLLM)
 
- This repository contains **quantized inference builds** of **[MiniMaxAI/MiniMax-M2.5](https://huggingface.co/MiniMaxAI/MiniMax-M2.5)** exported in the **compressed-tensors** layout for **vLLM**.
 
- MiniMax-M2.5 is a large **MoE** model (script notes: **229B params**, **256 experts** with **8 activated per token**, **62 layers**, using `block_sparse_moe` expert MLPs with `w1/w2/w3` in a SwiGLU-style structure).
 
 ---
 
@@ -30,10 +30,10 @@ MiniMax-M2.5 is a large **MoE** model (script notes: **229B params**, **256 expe
 
 This repo publishes **two quant variants**:
 
- - **AWQ-INT4** — weight-only AWQ (INT4 weights, FP16/BF16 activations)
- - **NVFP4** — NVFP4 (FP4 weights + FP4 activations), optimized for Blackwell-class GPUs
 
- > The `main` branch is typically used as a landing page; the runnable artifacts are under the variant branches above.
 
 ---
 
@@ -45,86 +45,140 @@ Each variant branch includes:
 
 - `config.json` with compressed-tensors quant metadata
 - Tokenizer artifacts (and chat template assets if present)
 
- Export is done with `save_compressed=True` for vLLM compatibility.
 
 ---
 
- ## Critical MoE detail: **All experts activated during calibration**
 
- MoE calibration is **not** performed with router top-k only. Instead, the scripts replace `MiniMaxM2SparseMoeBlock` with a calibration wrapper that runs **ALL experts** for **every sample**, ensuring reliable scale/activation statistics for every expert.
 
- The scripts also pass `moe_calibrate_all_experts=True` into the `oneshot(...)` call to enforce this behavior end-to-end.
 
 ---
 
- ## Quantization scope: what *is* and *is not* quantized
 
 ### Shared rule (both variants)
- Only the **MoE expert MLP weights** are intended to be quantized:
 - `block_sparse_moe.experts.*.w1`
 - `block_sparse_moe.experts.*.w2`
 - `block_sparse_moe.experts.*.w3`
 
- Everything else is excluded for stability (attention, routing/gate, norms, embeddings, lm_head, etc.).
 
- ### AWQ-INT4 (W4A16)
- AWQ is configured as:
- - **INT4 weights** (`num_bits=4`, `symmetric=True`)
- - **Group-wise quantization** (`strategy="group"`) with the **group size provided by CLI argument**
- - Targets: `["Linear"]`
- - Activations are not quantized (A16 runtime)
 
- The AWQ ignore list explicitly excludes:
- - `lm_head`, embeddings
- - MoE router (`gate`, `e_score_correction_bias`)
- - attention stack (`self_attn.*`)
- - norms / rotary / MTP (if present)
 
- AWQ smoothing/balancing mappings are set up around `post_attention_layernorm` and the expert MLP layers (`w1/w2/w3`) with `duo_scaling=True`.
 
- ### NVFP4
- NVFP4 is configured as:
- - `QuantizationModifier(targets="Linear", scheme="NVFP4")`
- - Ignore list excludes the same non-expert components (router, attention, norms, lm_head, etc.)
- - NVFP4 is explicitly described in-script as **FP4 weights + FP4 activations**, "per-group-16 (fixed), optimized for Blackwell."
 
 ---
 
- ## Calibration data, sampling, and sequence length
 
- Both scripts load a **dataset recipe YAML** that specifies:
- - `max_seq_length` (required)
- - `shuffle` and `seed`
- - optional `num_samples` cap
- - a list of datasets (with formatter + column mapping + per-dataset sample counts)
 
- Datasets are loaded according to the YAML config, formatted into text using formatter functions (ShareGPT / prompt-answer / chat-completion / raw text), concatenated, optionally shuffled, then tokenized with:
 - `padding=False`
 - `truncation=True`
 - `max_length=MAX_SEQUENCE_LENGTH`
- - `add_special_tokens=False`
 
- > The exact dataset names and per-source sample counts come from your YAML recipe file. This model card intentionally describes the pipeline (and the knobs) rather than hardcoding recipe contents.
 
 ---
 
- ## FP8 compatibility handling (source model stored as FP8)
 
- The scripts load the model in **BF16** and include safeguards to:
- - convert any FP8 parameters (e.g., `float8_e4m3fn`) to BF16 for quantization compatibility
- - sanitize `quantization_config` to avoid FX-tracing serialization issues
 
 ---
 
 ## Quickstart (vLLM)
 
 ### AWQ-INT4 branch
- Use vLLM with compressed-tensors enabled. (Adjust TP/expert-parallel settings to your cluster.)
 
 ```bash
 pip install -U vllm
 vllm serve TheHouseOfTheDude/MiniMax-M2.5:AWQ-INT4 \
   --quantization compressed-tensors \
   --tensor-parallel-size 8 \
   --enable-expert-parallel \
   --dtype bfloat16
 ```
 
 
 # MiniMax-M2.5 — Quantized (compressed-tensors for vLLM)
 
+ This repository contains **quantized inference builds** of **[MiniMaxAI/MiniMax-M2.5](https://huggingface.co/MiniMaxAI/MiniMax-M2.5)** exported in the **compressed-tensors** layout for **vLLM**.
 
+ MiniMax-M2.5 is a large **Mixture-of-Experts (MoE)** model. The quantization scripts calibrate **all experts** (not just the router's top-k) to produce robust scales across the full mixture.
 
 ---
 
 This repo publishes **two quant variants**:
 
+ - **AWQ-INT4** — weight-only AWQ (**INT4 weights**, FP16/BF16 activations at runtime)
+ - **NVFP4** — NVFP4 (**FP4 weights + FP4 activations**), intended for runtimes that support NVFP4 kernels
 
+ > The `main` branch is typically a landing page. The runnable artifacts live under the **AWQ-INT4** and **NVFP4** branches.
 
 ---
 
 
 - `config.json` with compressed-tensors quant metadata
 - Tokenizer artifacts (and chat template assets if present)
 
+ Exports are written with `save_compressed=True` so vLLM can load them as **compressed-tensors**.
 
 ---
 
+ ## Critical MoE detail: all experts are activated during calibration
 
+ Calibration is **MoE-aware**:
 
+ 1. Each MoE block is wrapped/replaced during calibration so **all experts execute** on every calibration forward pass.
+ 2. The oneshot quantization call is configured to **calibrate all experts** end-to-end.
+ 
+ **Why it matters:** if only the top-k experts are exercised, rarely routed experts receive poor scale/activation statistics and quantize badly, leading to instability whenever those experts fire at inference time.
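The effect of that calibration switch can be illustrated with a toy, framework-free MoE block (hypothetical names; this is not the repo's actual `MiniMaxM2SparseMoeBlock` wrapper, just a sketch of the routing behavior):

```python
import math

# Toy MoE block: forward() can run either the router's top-k experts
# (inference) or ALL experts (calibration), so every expert accumulates
# activation statistics. All names here are illustrative.

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

class ToyMoE:
    def __init__(self, expert_weights, router_logits, top_k=2):
        self.experts = expert_weights          # one scalar "expert" per slot
        self.router = router_logits            # router logit per expert
        self.top_k = top_k
        self.seen = [0] * len(expert_weights)  # per-expert calibration hits

    def forward(self, x, calibrate_all_experts=False):
        probs = softmax(self.router)
        if calibrate_all_experts:
            idxs = range(len(self.experts))    # run every expert
        else:                                  # run only router top-k
            idxs = sorted(range(len(self.experts)),
                          key=lambda i: probs[i], reverse=True)[: self.top_k]
        out = 0.0
        for i in idxs:
            self.seen[i] += 1                  # expert i observed this input
            out += probs[i] * (self.experts[i] * x)
        return out

moe = ToyMoE(expert_weights=[1.0, 2.0, 3.0, 4.0],
             router_logits=[9.0, 8.0, -9.0, -9.0])
moe.forward(1.0)                               # top-k: experts 2 and 3 never run
assert moe.seen == [1, 1, 0, 0]
moe.forward(1.0, calibrate_all_experts=True)   # calibration: all experts run
assert moe.seen == [2, 2, 1, 1]
```

With top-k routing alone, the cold experts (here indices 2 and 3) would contribute no samples at all to scale estimation; the all-experts pass guarantees every expert is observed.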
 
 ---
 
+ ## Quantization scope: what is and is not quantized
 
 ### Shared rule (both variants)
+ 
+ The scripts are designed to quantize **only the MoE expert MLP weights**:
+ 
 - `block_sparse_moe.experts.*.w1`
 - `block_sparse_moe.experts.*.w2`
 - `block_sparse_moe.experts.*.w3`
 
+ Everything else is excluded for stability (embeddings, attention, router/gate, norms, rotary, `lm_head`, etc.).
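As an illustration of this scope rule, the include/exclude logic can be expressed with glob patterns (the helper and patterns below are hypothetical, not the scripts' actual ignore list):

```python
from fnmatch import fnmatch

# Hypothetical expert-only scope filter: quantize a module only when its name
# matches an expert-MLP pattern and hits none of the exclusion patterns.

QUANT_PATTERNS = [
    "*.block_sparse_moe.experts.*.w1",
    "*.block_sparse_moe.experts.*.w2",
    "*.block_sparse_moe.experts.*.w3",
]
EXCLUDE_PATTERNS = [
    "*lm_head*", "*embed*", "*gate*", "*e_score_correction_bias*",
    "*self_attn*", "*norm*", "*rotary*",
]

def should_quantize(name: str) -> bool:
    if any(fnmatch(name, p) for p in EXCLUDE_PATTERNS):
        return False                       # excluded for stability
    return any(fnmatch(name, p) for p in QUANT_PATTERNS)

assert should_quantize("model.layers.0.block_sparse_moe.experts.17.w1")
assert not should_quantize("model.layers.0.block_sparse_moe.gate")
assert not should_quantize("model.layers.0.self_attn.q_proj")
assert not should_quantize("lm_head")
```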
 
+ ---
 
+ ## AWQ-INT4 (W4A16) details
 
+ - **Weights:** INT4 (`num_bits=4`, symmetric)
+ - **Activations:** not quantized; A16 runtime (FP16/BF16)
+ - **Grouping:** group-wise AWQ (`strategy="group"`); group size is supplied as a CLI argument
+ - **Targets:** linear layers, restricted to the expert MLP linears by the scope above
+ - **Ignored:** attention / embeddings / router / norms / `lm_head` (kept at higher precision)
+ - **Smoothing:** balancing mappings around `post_attention_layernorm` and the expert `w1/w2/w3` layers, with `duo_scaling=True`
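The W4A16 numerics can be sketched in a few lines of plain Python (illustrative only; real kernels pack two 4-bit values per byte and typically use groups of 64 or 128 values):

```python
# Hypothetical sketch of group-wise symmetric INT4 weight quantization:
# each group shares one scale chosen so the group's max magnitude maps to 7.

def quantize_group(weights, num_bits=4):
    """Quantize one group of weights to symmetric INT4; return (ints, scale)."""
    qmax = 2 ** (num_bits - 1) - 1                    # 7 for INT4
    scale = max(abs(w) for w in weights) / qmax or 1.0  # guard all-zero groups
    q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequantize_group(q, scale):
    return [v * scale for v in q]

group = [0.70, -0.35, 0.10, 0.05]                     # one small quant group
q, scale = quantize_group(group)
deq = dequantize_group(q, scale)
assert all(v in range(-8, 8) for v in q)              # INT4 range [-8, 7]
assert max(abs(a - b) for a, b in zip(group, deq)) <= scale / 2
```

Smaller groups give tighter scales (lower error) at the cost of more scale metadata, which is the trade-off the CLI group-size argument controls.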
 
+ ---
 
+ ## NVFP4 details
 
+ - **Weights:** FP4
+ - **Activations:** FP4, with fixed per-group-16 scaling
+ - **Targets:** linear layers, restricted to the expert MLP linears by the scope above
+ - **Ignored:** attention / embeddings / router / norms / `lm_head`
+ - **Runtime:** requires NVFP4-capable kernels (optimized for Blackwell-class GPUs and a recent software stack)
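FP4 in the E2M1 layout can represent only eight magnitudes (0, 0.5, 1, 1.5, 2, 3, 4, 6), so quantization amounts to scaling each group of 16 values and snapping to that grid. A hypothetical sketch (real NVFP4 also stores the per-group scale in FP8, which is skipped here for clarity):

```python
# Hypothetical FP4 (E2M1) rounding sketch: per-group-16 scaling, then
# nearest-value snap onto the E2M1 grid. Names are illustrative.

E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
E2M1_VALUES = sorted({s * v for v in E2M1_GRID for s in (1, -1)})

def fp4_round(x, scale):
    """Scale x into FP4 range, snap to the nearest E2M1 value, scale back."""
    scaled = x / scale
    nearest = min(E2M1_VALUES, key=lambda v: abs(v - scaled))
    return nearest * scale

def quantize_group16(group):
    scale = max(abs(v) for v in group) / 6.0 or 1.0   # map the max onto +/-6
    return [fp4_round(v, scale) for v in group]

group = [0.6, -0.3, 0.15, 0.05] * 4                   # one group of 16 values
deq = quantize_group16(group)
assert abs(deq[0] - 0.6) < 1e-9                       # group max lands on the grid
assert len({abs(v) for v in deq}) <= 8                # at most 8 distinct magnitudes
```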
 
 ---
 
+ ## Calibration data, sample count, and sequence length
 
+ Both scripts read a **dataset recipe YAML** that controls:
 
+ - `max_seq_length`
+ - `shuffle` and `seed`
+ - an optional `num_samples` cap
+ - dataset sources, each with a formatter, a column mapping, and a per-source sample count
 
+ Datasets are formatted into text (ShareGPT / prompt-answer / chat-completion / raw text), concatenated, optionally shuffled, then tokenized with:
 
 - `padding=False`
 - `truncation=True`
 - `max_length=MAX_SEQUENCE_LENGTH`
+ - `add_special_tokens=False`
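The tokenization step above can be sketched as follows. The real scripts call a Hugging Face tokenizer; here `tokenizer` is any callable with the same keyword interface, and the sequence-length value is illustrative:

```python
# Hypothetical calibration tokenization sketch; the stub stands in for a
# Hugging Face tokenizer so the keyword behavior is visible without model files.

MAX_SEQUENCE_LENGTH = 8192  # illustrative; the real value comes from the YAML

def tokenize_for_calibration(tokenizer, texts):
    return [
        tokenizer(
            t,
            padding=False,               # no padding between calibration samples
            truncation=True,             # clip overly long samples
            max_length=MAX_SEQUENCE_LENGTH,
            add_special_tokens=False,    # text already carries the chat template
        )
        for t in texts
    ]

# Stub tokenizer: whitespace "tokens", truncated to max_length when requested.
def stub_tokenizer(text, padding, truncation, max_length, add_special_tokens):
    ids = list(range(len(text.split())))
    return {"input_ids": ids[:max_length] if truncation else ids}

out = tokenize_for_calibration(stub_tokenizer, ["a b c", "d e"])
assert [len(o["input_ids"]) for o in out] == [3, 2]
```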
 
+ > The exact dataset names and per-source counts live in your recipe file; this README documents the pipeline and its knobs rather than hardcoding recipe contents.
 
 ---
 
+ ## FP8 compatibility handling (base model stored as FP8)
 
+ The scripts load the model in **BF16** and include safeguards to:
 
+ - convert any FP8 parameters (e.g., `float8_e4m3fn`) to BF16 for quantization compatibility,
+ - sanitize the `quantization_config` to avoid FX-tracing/serialization issues.
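The upcast pass selects parameters by dtype. As a hypothetical stand-in for the real torch-tensor logic, the same selection can be shown over a `{name: dtype-string}` mapping:

```python
# Hypothetical sketch of the FP8 -> BF16 upcast pass; the real scripts operate
# on torch tensors, this stand-in just shows the dtype-based selection.

FP8_DTYPES = {"float8_e4m3fn", "float8_e5m2"}

def upcast_fp8_params(param_dtypes):
    """Return a new mapping with every FP8 parameter promoted to bfloat16."""
    return {
        name: ("bfloat16" if dtype in FP8_DTYPES else dtype)
        for name, dtype in param_dtypes.items()
    }

state = {
    "model.layers.0.block_sparse_moe.experts.0.w1.weight": "float8_e4m3fn",
    "model.layers.0.input_layernorm.weight": "bfloat16",
    "lm_head.weight": "bfloat16",
}
fixed = upcast_fp8_params(state)
assert fixed["model.layers.0.block_sparse_moe.experts.0.w1.weight"] == "bfloat16"
assert set(fixed.values()) == {"bfloat16"}
```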
 
 ---
 
 ## Quickstart (vLLM)
 
 ### AWQ-INT4 branch
 
 ```bash
 pip install -U vllm
+ 
 vllm serve TheHouseOfTheDude/MiniMax-M2.5:AWQ-INT4 \
   --quantization compressed-tensors \
   --tensor-parallel-size 8 \
   --enable-expert-parallel \
   --dtype bfloat16
+ ```
+ 
+ ### NVFP4 branch
+ 
+ ```bash
+ pip install -U vllm
+ 
+ vllm serve TheHouseOfTheDude/MiniMax-M2.5:NVFP4 \
+   --quantization compressed-tensors \
+   --tensor-parallel-size 8 \
+   --enable-expert-parallel
+ ```
+ 
+ **Notes**
+ 
+ - MiniMax-M2.5 is extremely large; multi-GPU serving with expert parallelism is strongly recommended.
+ - Long context is KV-cache heavy; tune `--max-model-len`, batch size, and GPU memory utilization accordingly.
+ - Serving from a local path also works: point `vllm serve` at the variant directory (e.g., `.../AWQ-INT4` or `.../NVFP4`).
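Once a variant is being served, it can be exercised through vLLM's OpenAI-compatible endpoint. A minimal sketch (port 8000 is vLLM's default; the model name must match what `vllm serve` was given):

```python
import json

# Build the JSON body for an OpenAI-style chat-completion request against a
# vLLM server started as above. URL/port are assumptions about your setup.

URL = "http://localhost:8000/v1/chat/completions"

def chat_payload(model: str, prompt: str, max_tokens: int = 256) -> str:
    return json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    })

body = chat_payload("TheHouseOfTheDude/MiniMax-M2.5:AWQ-INT4", "Say hello.")
# POST `body` to URL with Content-Type: application/json when the server is up.
assert json.loads(body)["model"].endswith("AWQ-INT4")
```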
+ 
+ ---
+ 
+ ## Intended use
+ 
+ - High-throughput instruction/chat inference where MoE efficiency matters
+ - Large-scale serving stacks that benefit from reduced weight bandwidth and memory footprint
+ - Long-context workloads (subject to your hardware limits)
+ 
+ Quantization changes the **weight representation only**. It does not modify the tokenizer, chat template, or safety behavior; apply your own safety policies/filters as appropriate.
+ 
+ ---
+ 
+ ## Lineage
+ 
+ - **Base model:** https://huggingface.co/MiniMaxAI/MiniMax-M2.5
+ - **This repo:** quantized inference variants exported to **compressed-tensors** for vLLM:
+   - **AWQ-INT4**
+   - **NVFP4**
+ 
+ ---
+ 
+ ## Changelog
+ 
+ - **v1 (current)** — Initial release with two quant variants:
+   - **AWQ-INT4** (expert-only W4A16 AWQ; all-experts calibration; configurable group size)
+   - **NVFP4** (FP4 weights + FP4 activations; expert-only scope; all-experts calibration; requires an NVFP4-capable runtime)