---
language:
- en
library_name: vllm
pipeline_tag: text-generation
tags:
- text-generation
- conversational
- moe
- quantized
- compressed-tensors
- awq
- w4a16
- nvfp4
base_model: MiniMaxAI/MiniMax-M2.5
base_model_relation: quantized
quantized_by: TheHouseOfTheDude
license: other
---

# MiniMax-M2.5 — Quantized (compressed-tensors for vLLM)

This repository contains **quantized inference builds** of **[MiniMaxAI/MiniMax-M2.5](https://huggingface.co/MiniMaxAI/MiniMax-M2.5)**, exported in the **compressed-tensors** layout for **vLLM**.

MiniMax-M2.5 is a large **MoE** model (per the quantization scripts: **229B params**, **256 experts** with **8 activated per token**, **62 layers**, using `block_sparse_moe` expert MLPs with `w1/w2/w3` in a SwiGLU-style structure).
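
As a quick sanity check on the sparsity, the figures above imply that only a small fraction of expert weights is active for any given token (back-of-the-envelope arithmetic only, not a measured memory or throughput number):

```python
# Back-of-the-envelope MoE sparsity from the figures quoted above:
# 256 experts per MoE layer, 8 routed per token.
total_experts = 256
active_experts = 8

active_fraction = active_experts / total_experts
print(f"{active_fraction:.2%} of experts active per token")  # → 3.12%
```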

---

## Variants / Branches

This repo publishes **two quant variants**:

- **AWQ-INT4** — weight-only AWQ (INT4 weights, FP16/BF16 activations)
- **NVFP4** — NVFP4 (FP4 weights + FP4 activations), optimized for Blackwell-class GPUs

> The `main` branch is typically used as a landing page; the runnable artifacts live on the variant branches above.

---

## What’s inside (per variant)

Each variant branch includes:

- Sharded quantized weights (`*.safetensors`) plus `model.safetensors.index.json`
- `config.json` with compressed-tensors quant metadata
- Tokenizer artifacts (and chat-template assets, if present)

Export is done with `save_compressed=True` for vLLM compatibility.

---

## Critical MoE detail: **All experts activated during calibration**

MoE calibration is **not** performed with router top-k only. Instead, the scripts replace `MiniMaxM2SparseMoeBlock` with a calibration wrapper that runs **all experts** on **every sample**, ensuring reliable scale/activation statistics for every expert.

The scripts also pass `moe_calibrate_all_experts=True` into the `oneshot(...)` call to enforce this behavior end to end.
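
The idea can be illustrated with a dependency-free toy (class and attribute names here are hypothetical; the real wrapper replaces the model's actual `MiniMaxM2SparseMoeBlock`): at inference only the router's top-k experts run, while the calibration wrapper evaluates every expert so that each one accumulates activation statistics.

```python
# Toy sketch of "run ALL experts during calibration" (no torch; hypothetical names).
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

class ToySparseMoeBlock:
    """Inference-time behavior: only the top-k routed experts run."""
    def __init__(self, experts, top_k=2):
        self.experts = experts          # list of callables: float -> float
        self.top_k = top_k

    def forward(self, x, router_logits):
        probs = softmax(router_logits)
        top = sorted(range(len(probs)), key=probs.__getitem__)[-self.top_k:]
        return sum(probs[i] * self.experts[i](x) for i in top)

class CalibrationAllExpertsBlock(ToySparseMoeBlock):
    """Calibration wrapper: every expert runs on every sample, so every
    expert gathers activation statistics (here, a simple call count)."""
    def __init__(self, experts, top_k=2):
        super().__init__(experts, top_k)
        self.calls_per_expert = [0] * len(experts)

    def forward(self, x, router_logits):
        probs = softmax(router_logits)
        out = 0.0
        for i, expert in enumerate(self.experts):
            self.calls_per_expert[i] += 1   # stats collected for ALL experts
            out += probs[i] * expert(x)
        return out

experts = [lambda x, k=k: (k + 1) * x for k in range(4)]
calib = CalibrationAllExpertsBlock(experts, top_k=2)
calib.forward(1.0, router_logits=[0.1, 2.0, -1.0, 0.5])
print(calib.calls_per_expert)  # → [1, 1, 1, 1]: every expert was exercised
```

With top-k-only routing, rarely selected experts would see few or no calibration samples, leaving their quantization scales poorly estimated — which is exactly what the all-experts wrapper avoids.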
57
+
58
+ ---
59
+
60
+ ## Quantization scope: what *is* and *is not* quantized
61
+
62
+ ### Shared rule (both variants)
63
+ Only the **MoE expert MLP weights** are intended to be quantized:
64
+ - `block_sparse_moe.experts.*.w1`
65
+ - `block_sparse_moe.experts.*.w2`
66
+ - `block_sparse_moe.experts.*.w3`
67
+
68
+ Everything else is excluded for stability (attention, routing/gate, norms, embeddings, lm_head, etc.). :contentReference[oaicite:9]{index=9} :contentReference[oaicite:10]{index=10}
69
+
70
+ ### AWQ-INT4 (W4A16)
71
+ AWQ is configured as:
72
+ - **INT4 weights** (`num_bits=4`, `symmetric=True`)
73
+ - **Group-wise quantization** (`strategy="group"`) with the **group size provided by CLI argument**
74
+ - Targets: `["Linear"]`
75
+ - Activations are not quantized (A16 runtime) :contentReference[oaicite:11]{index=11}
76
+
77
+ The AWQ ignore list explicitly excludes:
78
+ - `lm_head`, embeddings
79
+ - MoE router (`gate`, `e_score_correction_bias`)
80
+ - attention stack (`self_attn.*`)
81
+ - norms / rotary / MTP (if present) :contentReference[oaicite:12]{index=12}
82
+
83
+ AWQ smoothing/balancing mappings are set up around `post_attention_layernorm` and the expert MLP layers (`w1/w2/w3`) with `duo_scaling=True`. :contentReference[oaicite:13]{index=13}
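
For reference, a W4A16 group-wise scheme like the one above typically surfaces in the exported `config.json` roughly as follows. This is an illustrative fragment only — field values such as the group size and the exact ignore patterns are assumptions, so check the actual file on the branch:

```json
{
  "quantization_config": {
    "quant_method": "compressed-tensors",
    "format": "pack-quantized",
    "config_groups": {
      "group_0": {
        "targets": ["Linear"],
        "weights": {
          "num_bits": 4,
          "type": "int",
          "symmetric": true,
          "strategy": "group",
          "group_size": 128
        },
        "input_activations": null
      }
    },
    "ignore": ["lm_head", "re:.*self_attn.*", "re:.*gate$"]
  }
}
```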

### NVFP4
NVFP4 is configured as:
- `QuantizationModifier(targets="Linear", scheme="NVFP4")`
- An ignore list excluding the same non-expert components (router, attention, norms, `lm_head`, etc.)
- NVFP4 is described in-script as **FP4 weights + FP4 activations**, “per-group-16 (fixed), optimized for Blackwell.”
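
In llmcompressor terms, that pass could be expressed as a recipe along these lines. This is a sketch, not the exact script — the stage name and ignore patterns are illustrative assumptions:

```yaml
# Hypothetical llmcompressor recipe sketch for the NVFP4 pass.
quant_stage:
  quant_modifiers:
    QuantizationModifier:
      targets: "Linear"
      scheme: "NVFP4"        # FP4 weights + FP4 activations, group size 16 (fixed)
      ignore:
        - "lm_head"
        - "re:.*embed_tokens.*"
        - "re:.*self_attn.*"
        - "re:.*block_sparse_moe\\.gate.*"
        - "re:.*norm.*"
```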

---

## Calibration data, sampling, and sequence length

Both scripts load a **dataset recipe YAML** that specifies:
- `max_seq_length` (required)
- `shuffle` and `seed`
- an optional `num_samples` cap
- a list of datasets (with formatter + column mapping + per-dataset sample counts)
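
The actual recipe ships separately, but a minimal recipe in this shape might look like the following — dataset names, counts, and field names here are purely illustrative:

```yaml
# Illustrative dataset recipe only — the real YAML is supplied at run time.
max_seq_length: 8192
shuffle: true
seed: 42
num_samples: 512                            # optional global cap
datasets:
  - name: example/sharegpt-style-dataset    # hypothetical dataset id
    formatter: sharegpt
    column: conversations
    num_samples: 256
  - name: example/raw-text-corpus           # hypothetical dataset id
    formatter: text
    column: text
    num_samples: 256
```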

Datasets are loaded according to the YAML config, formatted into text by formatter functions (ShareGPT / prompt-answer / chat-completion / raw text), concatenated, optionally shuffled, then tokenized with:
- `padding=False`
- `truncation=True`
- `max_length=MAX_SEQUENCE_LENGTH`
- `add_special_tokens=False`
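
A ShareGPT-style formatter of the kind referenced above can be sketched with no dependencies. The role tags and field names below are assumptions for illustration; the real scripts may render through the tokenizer's chat template instead:

```python
# Minimal ShareGPT-style formatter sketch (hypothetical tag format).
ROLE_MAP = {"human": "user", "gpt": "assistant", "system": "system"}

def format_sharegpt(sample: dict) -> str:
    """Flatten a ShareGPT record ({'conversations': [{'from': ..., 'value': ...}]})
    into a single plain-text calibration string."""
    lines = []
    for turn in sample["conversations"]:
        role = ROLE_MAP.get(turn["from"], turn["from"])
        lines.append(f"<{role}>\n{turn['value'].strip()}")
    return "\n".join(lines)

record = {"conversations": [
    {"from": "human", "value": "What is AWQ?"},
    {"from": "gpt", "value": "A weight-only quantization method."},
]}
text = format_sharegpt(record)
print(text)
```

The resulting strings are what get tokenized with the `padding`/`truncation`/`max_length` settings listed above.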

> The exact dataset names and per-source sample counts come from your YAML recipe file. This model card intentionally describes the pipeline (and its knobs) rather than hardcoding recipe contents.

---

## FP8 compatibility handling (source model stored as FP8)

The scripts load the model in **BF16** and include safeguards that:
- convert any FP8 parameters (e.g., `float8_e4m3fn`) to BF16 for quantization compatibility
- sanitize `quantization_config` to avoid FX-tracing serialization issues
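
The config-sanitizing step can be illustrated with plain dictionaries. This is a sketch under the assumption that the stale FP8 `quantization_config` entry is simply dropped before re-quantization; the FP8→BF16 parameter cast itself would be done with something like `tensor.to(torch.bfloat16)` in the real scripts:

```python
# Sketch: drop a stale FP8 quantization_config from a model config dict so the
# re-quantized export doesn't carry conflicting metadata (assumed behavior).
import json

def sanitize_config(config: dict) -> dict:
    cleaned = dict(config)                      # shallow copy; original untouched
    cleaned.pop("quantization_config", None)    # remove FP8-era metadata if present
    return cleaned

config = {
    "model_type": "minimax",
    "torch_dtype": "bfloat16",
    "quantization_config": {"quant_method": "fp8"},  # stale, from the source repo
}
cleaned = sanitize_config(config)
print(json.dumps(cleaned, indent=2))  # no "quantization_config" key remains
```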

---

## Quickstart (vLLM)

### AWQ-INT4 branch
Use vLLM with compressed-tensors support. (Adjust tensor-parallel / expert-parallel settings to your cluster. Branch selection uses vLLM's `--revision` flag.)

```bash
pip install -U vllm
vllm serve TheHouseOfTheDude/MiniMax-M2.5 \
  --revision AWQ-INT4 \
  --quantization compressed-tensors \
  --tensor-parallel-size 8 \
  --enable-expert-parallel \
  --dtype bfloat16
```