---
language:
- en
library_name: vllm
pipeline_tag: text-generation
tags:
- text-generation
- conversational
- moe
- quantized
- compressed-tensors
- awq
- w4a16
- nvfp4
base_model: MiniMaxAI/MiniMax-M2.5
base_model_relation: quantized
quantized_by: TheHouseOfTheDude
license: other
---

# MiniMax-M2.5 — Quantized (compressed-tensors for vLLM)

This repository contains **quantized inference builds** of **[MiniMaxAI/MiniMax-M2.5](https://huggingface.co/MiniMaxAI/MiniMax-M2.5)**, exported in the **compressed-tensors** layout for **vLLM**.

MiniMax-M2.5 is a large **MoE** model (per the quantization scripts: **229B parameters**, **256 experts** with **8 activated per token**, **62 layers**, with `block_sparse_moe` expert MLPs using `w1/w2/w3` in a SwiGLU-style structure).

---

## Variants / Branches

This repo publishes **two quant variants**:

- **AWQ-INT4** — weight-only AWQ (INT4 weights, FP16/BF16 activations)
- **NVFP4** — NVFP4 (FP4 weights + FP4 activations), optimized for Blackwell-class GPUs

> The `main` branch is typically used as a landing page; the runnable artifacts are under the variant branches above.

---

## What’s inside (per variant)

Each variant branch includes:

- Sharded quantized weights (`*.safetensors`) + `model.safetensors.index.json`
- `config.json` with compressed-tensors quant metadata
- Tokenizer artifacts (and chat-template assets, if present)

Export is done with `save_compressed=True` for vLLM compatibility.

---

## Critical MoE detail: **All experts activated during calibration**

MoE calibration is **not** performed with only the router's top-k experts. Instead, the scripts replace `MiniMaxM2SparseMoeBlock` with a calibration wrapper that runs **ALL experts** for **every sample**, ensuring reliable scale/activation statistics for every expert.

The scripts also pass `moe_calibrate_all_experts=True` into the `oneshot(...)` call to enforce this behavior end to end.

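To make the distinction concrete, here is a toy pure-Python sketch (not the scripts' actual `MiniMaxM2SparseMoeBlock` wrapper) contrasting top-k-only routing with all-experts calibration: under top-k routing, many experts may never see a calibration sample, so their quantization scales would be estimated from no data at all.

```python
import random

def topk_route(router_logits, k=8):
    """Pick the top-k expert indices for one token (the normal inference path)."""
    return sorted(range(len(router_logits)),
                  key=lambda i: router_logits[i], reverse=True)[:k]

class CalibMoE:
    """Toy MoE block: top-k experts at inference, ALL experts during calibration.

    `experts` is a list of callables; `hits` counts how often each expert
    actually sees data -- the statistic that per-expert scales depend on.
    """
    def __init__(self, experts, k=8, calibrate_all=False):
        self.experts, self.k, self.calibrate_all = experts, k, calibrate_all
        self.hits = [0] * len(experts)

    def forward(self, x, router_logits):
        if self.calibrate_all:
            active = range(len(self.experts))  # every expert sees every sample
        else:
            active = topk_route(router_logits, self.k)
        out = 0.0
        for i in active:
            self.hits[i] += 1
            out += self.experts[i](x)
        return out

random.seed(0)
experts = [lambda x, w=w: w * x for w in range(256)]  # 256 toy experts
topk = CalibMoE(experts, k=8, calibrate_all=False)
allx = CalibMoE(experts, k=8, calibrate_all=True)
for _ in range(4):  # four calibration samples
    logits = [random.random() for _ in range(256)]
    topk.forward(1.0, logits)
    allx.forward(1.0, logits)

print(sum(h > 0 for h in topk.hits))  # at most 4 * 8 = 32 experts exercised
print(sum(h > 0 for h in allx.hits))  # all 256 experts exercised
```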
---

## Quantization scope: what *is* and *is not* quantized

### Shared rule (both variants)
Only the **MoE expert MLP weights** are intended to be quantized:
- `block_sparse_moe.experts.*.w1`
- `block_sparse_moe.experts.*.w2`
- `block_sparse_moe.experts.*.w3`

Everything else is excluded for stability (attention, routing/gate, norms, embeddings, `lm_head`, etc.).

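The scope rule above amounts to simple pattern matching over module names. A minimal sketch (the module names here are illustrative, not read from the checkpoint):

```python
from fnmatch import fnmatch

# Hypothetical module names following the naming scheme described above.
MODULES = [
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.block_sparse_moe.gate",
    "model.layers.0.block_sparse_moe.experts.17.w1",
    "model.layers.0.block_sparse_moe.experts.17.w2",
    "model.layers.0.block_sparse_moe.experts.17.w3",
    "model.layers.0.post_attention_layernorm",
    "lm_head",
]

# Only the expert MLP projections match; everything else stays in high precision.
QUANT_PATTERNS = [
    "*.block_sparse_moe.experts.*.w1",
    "*.block_sparse_moe.experts.*.w2",
    "*.block_sparse_moe.experts.*.w3",
]

def should_quantize(name):
    return any(fnmatch(name, p) for p in QUANT_PATTERNS)

quantized = [m for m in MODULES if should_quantize(m)]
print(quantized)  # only the three experts.17.w* modules
```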
### AWQ-INT4 (W4A16)
AWQ is configured as:
- **INT4 weights** (`num_bits=4`, `symmetric=True`)
- **Group-wise quantization** (`strategy="group"`), with the group size supplied as a CLI argument
- Targets: `["Linear"]`
- Activations are not quantized (A16 at runtime)

The AWQ ignore list explicitly excludes:
- `lm_head`, embeddings
- the MoE router (`gate`, `e_score_correction_bias`)
- the attention stack (`self_attn.*`)
- norms / rotary / MTP (if present)

AWQ smoothing/balancing mappings are set up around `post_attention_layernorm` and the expert MLP layers (`w1/w2/w3`), with `duo_scaling=True`.

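To show what group-wise symmetric INT4 means numerically, here is a minimal pure-Python sketch of the rounding step alone (this is not the AWQ algorithm, which additionally searches for activation-aware smoothing scales before rounding):

```python
def quantize_w4a16_group(weights, group_size=128):
    """Symmetric group-wise INT4 quantization of a flat weight row.

    Each group of `group_size` weights shares one full-precision scale;
    weights are rounded to integers in [-8, 7] (the 4-bit signed range).
    Dequantization is codes[i] * scales[i // group_size].
    """
    codes, scales = [], []
    for g in range(0, len(weights), group_size):
        group = weights[g:g + group_size]
        amax = max(abs(x) for x in group) or 1.0
        scale = amax / 7.0  # symmetric: map the max magnitude to +/-7
        scales.append(scale)
        codes += [max(-8, min(7, round(x / scale))) for x in group]
    return codes, scales

# Toy example: 8 weights, group size 4 -> two groups, two scales.
w = [0.06, -0.7, 0.31, 0.02, 1.4, -1.2, 0.88, 0.12]
codes, scales = quantize_w4a16_group(w, group_size=4)
deq = [c * scales[i // 4] for i, c in enumerate(codes)]
print(codes)
print([round(d, 3) for d in deq])
```

Note the trade-off the group size controls: smaller groups mean more scales (more metadata) but a tighter fit to local weight magnitudes.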
### NVFP4
NVFP4 is configured as:
- `QuantizationModifier(targets="Linear", scheme="NVFP4")`
- The ignore list excludes the same non-expert components (router, attention, norms, `lm_head`, etc.)
- NVFP4 is explicitly described in-script as **FP4 weights + FP4 activations**, “per-group-16 (fixed), optimized for Blackwell”

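A minimal sketch of the per-group-16 FP4 idea, assuming the standard E2M1 value grid; real NVFP4 additionally stores each group scale in FP8, which this toy version skips:

```python
# Representable non-negative magnitudes of FP4 E2M1 (sign handled separately).
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_nvfp4_group(values, group_size=16):
    """Round each value to the nearest signed FP4 (E2M1) point, with one
    shared scale per fixed group of 16 (the NVFP4 micro-block size).
    """
    out = []
    for g in range(0, len(values), group_size):
        group = values[g:g + group_size]
        amax = max(abs(v) for v in group) or 1.0
        scale = amax / 6.0  # map the group max onto FP4's max magnitude (6)
        for v in group:
            mag = min(FP4_GRID, key=lambda p: abs(abs(v) / scale - p))
            out.append((mag if v >= 0 else -mag) * scale)
    return out

# One group of 16 values spanning 0 .. 0.15.
vals = [0.01 * i for i in range(16)]
deq = quantize_nvfp4_group(vals)
print([round(d, 4) for d in deq])
```

Because the group size is fixed at 16 (versus a tunable 128-ish for AWQ), the scales track local magnitude much more closely, which is part of why FP4 activations stay usable.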
---

## Calibration data, sampling, and sequence length

Both scripts load a **dataset recipe YAML** that specifies:
- `max_seq_length` (required)
- `shuffle` and `seed`
- an optional `num_samples` cap
- a list of datasets (each with a formatter, a column mapping, and a per-dataset sample count)

Datasets are loaded according to the YAML config, formatted into text using formatter functions (ShareGPT / prompt-answer / chat-completion / raw text), concatenated, optionally shuffled, then tokenized with:
- `padding=False`
- `truncation=True`
- `max_length=MAX_SEQUENCE_LENGTH`
- `add_special_tokens=False`

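As an example of what a formatter does, here is a hypothetical ShareGPT-style formatter; the role tags and output layout are assumptions for illustration, not the scripts' actual implementation:

```python
def format_sharegpt(sample, column="conversations"):
    """Flatten one ShareGPT-style record into a single calibration string.

    Hypothetical sketch: maps ShareGPT's `from`/`value` turn fields onto
    plain "Role: text" lines joined by newlines.
    """
    role_map = {"human": "User", "gpt": "Assistant", "system": "System"}
    turns = sample[column]
    lines = [f"{role_map.get(t['from'], t['from'])}: {t['value']}" for t in turns]
    return "\n".join(lines)

sample = {"conversations": [
    {"from": "human", "value": "What is AWQ?"},
    {"from": "gpt", "value": "Activation-aware weight quantization."},
]}
text = format_sharegpt(sample)
print(text)
```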
> The exact dataset names and per-source sample counts come from your YAML recipe file. This model card intentionally describes the pipeline (and its knobs) rather than hardcoding recipe contents.

---

## FP8 compatibility handling (source model stored as FP8)

The scripts load the model in **BF16** and include safeguards to:
- convert any FP8 parameters (e.g., `float8_e4m3fn`) to BF16 for quantization compatibility
- sanitize `quantization_config` to avoid FX-tracing serialization issues

---

## Quickstart (vLLM)

### AWQ-INT4 branch
Use vLLM with compressed-tensors enabled, selecting the variant branch via `--revision`. (Adjust tensor-parallel / expert-parallel settings to your cluster.)

```bash
pip install -U vllm
vllm serve TheHouseOfTheDude/MiniMax-M2.5 \
  --revision AWQ-INT4 \
  --quantization compressed-tensors \
  --tensor-parallel-size 8 \
  --enable-expert-parallel \
  --dtype bfloat16
```