|
|
--- |
|
|
pipeline_tag: text-generation |
|
|
license: other |
|
|
license_name: modified-mit |
|
|
license_link: https://github.com/MiniMax-AI/MiniMax-M2.1/blob/main/LICENSE |
|
|
library_name: llm-compressor |
|
|
tags: |
|
|
- fp8 |
|
|
- awq |
|
|
- conversational |
|
|
- vllm |
|
|
- code |
|
|
- devops |
|
|
- software engineering |
|
|
- engineer |
|
|
- developer |
|
|
- architect |
|
|
- stem |
|
|
- agent |
|
|
datasets: |
|
|
- HuggingFaceH4/ultrachat_200k |
|
|
- databricks/databricks-dolly-15k |
|
|
- neuralmagic/calibration |
|
|
- HuggingFaceH4/no_robots |
|
|
- nvidia/HelpSteer |
|
|
- garage-bAInd/Open-Platypus |
|
|
- PJMixers/grimulkan_physical-reasoning-ShareGPT |
|
|
- PJMixers/grimulkan_theory-of-mind-ShareGPT |
|
|
- HuggingFaceH4/Multilingual-Thinking |
|
|
- ServiceNow-AI/M2Lingual |
|
|
- interstellarninja/hermes_reasoning_tool_use |
|
|
- deepmind/code_contests |
|
|
- dh02391735/stackoverflow-kubernetes-questions |
|
|
- diversoailab/humaneval-rust |
|
|
- ammarnasr/the-stack-rust-clean |
|
|
- CSJianYang/CodeArena |
|
|
- nvidia/OpenCodeInstruct |
|
|
- nvidia/Llama-Nemotron-Post-Training-Dataset |
|
|
- nvidia/Nemotron-Competitive-Programming-v1 |
|
|
- rombodawg/code_bagel_hermes-2.5 |
|
|
- MathArena/project_euler |
|
|
- nvidia/Nemotron-Math-Proofs-v1 |
|
|
- nvidia/OpenMathInstruct-2 |
|
|
- nvidia/OpenScienceReasoning-2 |
|
|
- MegaScience/MegaScience |
|
|
- OpenMed/Medical-Reasoning-SFT-GPT-OSS-120B |
|
|
- ccdv/pubmed-summarization |
|
|
- gbharti/finance-alpaca |
|
|
- vladlen32230/summarization-yahoo-stock-finance-article-text |
|
|
- fka/awesome-chatgpt-prompts |
|
|
- theoldmandthesea/17k_business_book |
|
|
- ruggsea/stanford-encyclopedia-of-philosophy_instruct |
|
|
- mlfoundations-dev/stackexchange_philosophy |
|
|
- FreedomIntelligence/SocraticChat |
|
|
- Gryphe/Opus-WritingPrompts |
|
|
- anthracite-org/nopm_claude_writing_fixed |
|
|
- zerofata/Roleplay-Anime-Characters |
|
|
- zerofata/Instruct-Anime |
|
|
- zerofata/Instruct-Anime-CreativeWriting |
|
|
- sam-paech/gutenberg3-generalfiction-scifi-fantasy-romance-adventure-dpo |
|
|
- PocketDoc/Dans-Prosemaxx-Adventure |
|
|
- anthracite-org/stheno-filtered-v1.1 |
|
|
- KaraKaraWitch/TvTroper-2025 |
|
|
- AquaV/US-Army-Survival-Sharegpt |
|
|
- AquaV/Interrogation-Sharegpt |
|
|
- AquaV/Multi-Environment-Operations-Sharegpt |
|
|
- AquaV/Resistance-Sharegpt |
|
|
- PocketDoc/Dans-Kinomaxx-VanillaBackrooms |
|
|
base_model: |
|
|
- MiniMaxAI/MiniMax-M2.1 |
|
|
--- |
|
|
|
|
|
# MiniMax M2.1 (Mixed-Precision FP8 + INT4 AWQ FrankenQuant) |
|
|
|
|
|
This strives to be the highest-quality quant of MiniMax M2.1 that can run in 192 GiB of VRAM.
|
|
|
|
|
> [!TIP] |
|
|
> 💡 A non-FP8 version is available at [mratsim/MiniMax-M2.1-BF16-INT4-AWQ](https://huggingface.co/mratsim/MiniMax-M2.1-BF16-INT4-AWQ) \
> That version is compatible with 8x RTX 3090s and with SGLang (which doesn't support mixed quantization yet), at the cost of an extra 3 GiB of VRAM. \
> This FP8+INT4 AWQ was built by merging the original FP8 self-attention weights with the [mratsim/MiniMax-M2.1-BF16-INT4-AWQ](https://huggingface.co/mratsim/MiniMax-M2.1-BF16-INT4-AWQ) experts.
|
|
|
|
|
It features: |
|
|
- Guaranteed calibration of all experts; skipping this is extremely detrimental, see PR: https://github.com/vllm-project/llm-compressor/pull/2171
|
|
<details> |
|
|
<summary>Visual showcase of why ensuring quantization of all MoE experts is important</summary> |
|
|
|
|
|
- Source: https://avtc.github.io/aquarium-side-by-side/ |
|
|
- Context: https://github.com/ModelCloud/GPTQModel/pull/2235 |
|
|
|
|
|
 |
|
|
|
|
|
</details> |
|
|
- Mixed precision with: |
|
|
- self-attention weights copied directly from the official release (the default FP8 with 2D-block scaling)
|
|
- expert weights quantized using the AWQ W4A16G32 scheme (4-bit weights, 16-bit activations, one scaling factor per group of 32 weights)
|
|
- High-quality, large and diverse calibration dataset with a programming and devops focus, as well as domain-specific knowledge (math, sciences, medical, finance, business, humanities, philosophy, creative writing), general knowledge, pop culture and behavioral situations, because we never code in a vacuum and we want all experts calibrated across the full range of their activations.
|
|
- Calibration explicitly tests multilingual capabilities: |
|
|
- Asia: Chinese, Hindi, Korean, Japanese |
|
|
- Europe: French, German, Portuguese, Russian, Spanish |
|
|
- Middle-East: Arabic, Hebrew, Turkish |
|
|
- Calibration explicitly tests 60 programming languages, not just Python:
|
|
- Imperative programming: C, C++, Go, Zig, ... |
|
|
- Functional programming: Haskell, F#, OCaml, Erlang, Lisp, Clojure ... |
|
|
- Web-focused: HTML/CSS, TypeScript, PHP, ...
|
|
- Mixed paradigm: D, Kotlin, Nim, Rust, Swift, ... |
|
|
- Theorem provers: Coq, Lean |
|
|
- Low-level: ARM64 assembly, x86-64 assembly, LLVM IR |
|
|
- GPU Programming: CUDA, Vulkan, Apple Metal
|
|
- Game Programming: GDScript, GLSL |
|
|
- Domain-specific: MATLAB, Julia, Solidity, R |
|
|
- Calibration tries to ensure coverage of a wide variety of experiences (from explaining concepts to your grandmother to debugging Kubernetes logs)
|
|
- Built by a dev, for devs (and it looks very good for STEM as well) |
|
|
|
|
|
It uses my new declarative quantization framework, https://github.com/mratsim/quantizers, which facilitates highly-tuned calibration sets: [calibrate_software_engineer.yaml](./calibrate_software_engineer.yaml)
|
|
|
|
|
<details> |
|
|
<summary>This has taken several days of work plus contributions and bug reports to the ecosystem; I hope you find it useful.</summary>
|
|
|
|
|
- https://github.com/vllm-project/llm-compressor/pull/2171 |
|
|
- https://github.com/vllm-project/llm-compressor/issues/2172 |
|
|
- https://github.com/vllm-project/vllm/issues/31623 |
|
|
- https://github.com/sgl-project/sglang/issues/16276 |
|
|
- https://github.com/sgl-project/sglang/issues/16295 |
|
|
|
|
|
</details> |
|
|
|
|
|
## 📥 Usage & Running Instructions |
|
|
|
|
|
The model was tested with vLLM on 2x RTX Pro 6000; here is a script suitable for that configuration with the maximum 196,608-token context length. It uses 92.5 GiB of VRAM with the flashinfer backend.
|
|
|
|
|
> [!WARNING] |
|
|
> ⚠️ Due to the `rope_parameters` change, this model is currently incompatible with transformers v5.\
> This makes it incompatible with GLM-4.6V, which requires transformers v5. Use different Docker images.
|
|
|
|
|
> [!WARNING] |
|
|
> ⚠️ SGLang does not support this model due to missing mixed precision support. Feature request raised at https://github.com/sgl-project/sglang/issues/16276.\ |
|
|
> Please use [mratsim/MiniMax-M2.1-BF16-INT4-AWQ](https://huggingface.co/mratsim/MiniMax-M2.1-BF16-INT4-AWQ) in the meantime. |
|
|
|
|
|
### Running script |
|
|
|
|
|
`--trust-remote-code` is necessary until the transformers team merges https://github.com/huggingface/transformers/pull/42028.
|
|
|
|
|
Two reasoning parsers are available:
|
|
- `minimax_m2` puts the reasoning content in a dedicated field, like DeepSeek models do; frontends usually render it in a specific manner.
|
|
- `minimax_m2_append_think` puts the reasoning into `<think>reasoning_content</think>`, sent as normal text. Few frontends render that properly; I'm aware of [Cherry Studio](https://github.com/CherryHQ/cherry-studio) on desktop and [ChatterUI](https://github.com/Vali-98/ChatterUI) on Android.
|
|
|
|
|
`minimax_m2_append_think` was introduced for interleaved thinking, letting the model build upon its previous reasoning (frontends usually discard the thinking trace).
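For illustration, here is a minimal sketch of where the reasoning lands on the client side, assuming the vLLM server from the script below is running on `localhost:8000` and using the standard `openai` Python client:

```python
# Minimal sketch, assuming the vLLM server below runs on localhost:8000;
# the model name must match --served-model-name.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

resp = client.chat.completions.create(
    model="MiniMax-M2.1",
    messages=[{"role": "user", "content": "Why is the sky blue?"}],
)
msg = resp.choices[0].message

# With --reasoning-parser minimax_m2, the trace is exposed in a dedicated
# field and `content` holds only the final answer.
print(getattr(msg, "reasoning_content", None))

# With minimax_m2_append_think, `content` instead starts with
# <think>...</think> followed by the answer, as plain text.
print(msg.content)
```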
|
|
|
|
|
> [!TIP] |
|
|
> 💡 With the recommended parameters the model tends to get stuck in repetition loops.\
> Setting `repetition_penalty: 1.10` and `frequency_penalty: 0.40` seems to avoid that.
|
|
|
|
|
```bash |
|
|
# Model configuration (Mandatory) |
|
|
MODEL="mratsim/MiniMax-M2.1-FP8-INT4-AWQ" |
|
|
MODELNAME="MiniMax-M2.1" |
|
|
GPU_UTIL=0.93 |
|
|
SAMPLER_OVERRIDE='{"temperature": 1, "top_p": 0.95, "top_k": 40, "repetition_penalty": 1.1, "frequency_penalty": 0.40}' |
|
|
|
|
|
# Prevent memory fragmentation |
|
|
export PYTORCH_ALLOC_CONF=expandable_segments:True,max_split_size_mb:512 |
|
|
|
|
|
# Prevent vLLM from using 100% CPU when idle (Very Recommended) |
|
|
export VLLM_SLEEP_WHEN_IDLE=1 |
|
|
|
|
|
vllm serve "${MODEL}" \ |
|
|
--served-model-name "${MODELNAME}" \ |
|
|
--trust-remote-code \ |
|
|
--gpu-memory-utilization ${GPU_UTIL} \ |
|
|
--tp 2 \ |
|
|
--override-generation-config "${SAMPLER_OVERRIDE}" \ |
|
|
--enable-auto-tool-choice \ |
|
|
--tool-call-parser minimax_m2 \ |
|
|
--reasoning-parser minimax_m2 |
|
|
# --reasoning-parser minimax_m2_append_think |
|
|
``` |
|
|
|
|
|
## Performance |
|
|
|
|
|
On dual RTX Pro 6000, I can reach over 5500 tok/s prefill (prompt/context processing) and over 100 tok/s generation for a single request.
|
|
|
|
|
 |
|
|
|
|
|
With PagedAttention in action, you can reach over 25,000 tok/s in prompt processing speed.
|
|
|
|
|
 |
|
|
|
|
|
When batching with the default config, you can reach 6000 to 8000 tok/s prefill and 1200 tok/s generation speed.\
Tune prefill vs decode prioritization with `--max-num-batched-tokens`; see [Performance & Tuning | vLLM](https://docs.vllm.ai/en/v0.4.2/models/performance.html).
|
|
|
|
|
 |
|
|
|
|
|
In a steady state with interleaved prefill and decode requests that interrupt each other, you can get ~2400 tok/s context processing and ~800 tok/s generation.
|
|
|
|
|
 |
|
|
|
|
|
Note: vLLM supports prefill/decode disaggregation for high-throughput serving if you have double the minimum hardware:
|
|
- https://pytorch.org/blog/disaggregated-inference-at-scale-with-pytorch-vllm/ |
|
|
- https://github.com/vllm-project/production-stack |
|
|
- Prefill/decode disaggregation |
|
|
- Multi-Tier KV-cache via [LMCache](https://github.com/LMCache/LMCache) (GPU > CPU > Local Disk) |
|
|
- Cache-aware router
|
|
- Multi-model dispatch via single interface |
|
|
|
|
|
## 🔬 Quantization method |
|
|
|
|
|
Quantization was quite complex for this model and was done in 3 steps: |
|
|
1. The original weights are in FP8; they were dequantized to FP16 because llm-compressor cannot process FP8.
|
|
2. llm-compressor was used to quantize the MLP expert projections with AWQ, using [PR #2171](https://github.com/vllm-project/llm-compressor/pull/2171) to ensure all experts were activated during calibration.
|
|
3. Stitching the FrankenQuant: I combined the original weights, including the 2D-block FP8 self-attention, with the experts-only AWQ weights.
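For illustration, here is a minimal sketch of step 3, assuming hypothetical local folders `original/` (official FP8 checkpoint) and `awq/` (experts-only INT4 AWQ output); a real script must also shard the output and rewrite `model.safetensors.index.json` and the quantization config:

```python
# Minimal stitching sketch (step 3). Folder names are placeholders; a
# real script also shards the output, merges model.safetensors.index.json
# and combines the quantization configs.
import re
from pathlib import Path
from safetensors.torch import load_file, save_file

# Matches w1/w2/w3 expert projections, plus their AWQ artifacts
# (weight_packed, weight_scale, ...) since match() only anchors the start.
EXPERT = re.compile(r".*block_sparse_moe\.experts\.\d+\.(w1|w2|w3)")

def gather(folder: Path) -> dict:
    """Load every tensor of a (possibly sharded) safetensors checkpoint."""
    tensors = {}
    for shard in sorted(folder.glob("*.safetensors")):
        tensors.update(load_file(shard))
    return tensors

original, awq = gather(Path("original")), gather(Path("awq"))

# Keep all official weights (incl. 2D-block FP8 attention) except the
# expert projections, which are taken from the AWQ checkpoint instead.
merged = {name: t for name, t in original.items() if not EXPERT.match(name)}
merged.update({name: t for name, t in awq.items() if EXPERT.match(name)})

save_file(merged, "merged/model.safetensors")
```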
|
|
|
|
|
The llm-compressor library was used with the following recipe:
|
|
|
|
|
```yaml |
|
|
default_stage: |
|
|
default_modifiers: |
|
|
AWQModifier: |
|
|
config_groups: |
|
|
mlp_experts_projections: |
|
|
# Include only MLP expert weights for 4-bit quantization |
|
|
targets: ["re:.*block_sparse_moe\\.experts\\.\\d+\\.(w1|w2|w3)$"] |
|
|
weights: |
|
|
num_bits: 4 |
|
|
type: int |
|
|
symmetric: true |
|
|
group_size: 32 |
|
|
strategy: group |
|
|
dynamic: false |
|
|
# actorder: group |
|
|
observer: memoryless_minmax |
|
|
|
|
|
mappings: |
|
|
- smooth_layer: re:.*post_attention_layernorm$ |
|
|
balance_layers: ["re:.*w1$", "re:.*w3$"] |
|
|
- smooth_layer: re:.*w3$ |
|
|
balance_layers: ["re:.*w2$"] |
|
|
duo_scaling: true |
|
|
``` |
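As a sketch of how such a recipe is applied (step 2), following the usual llm-compressor `oneshot` flow; the dataset below is a stand-in, the actual calibration set being the 590-sample mix from [calibrate_software_engineer.yaml](./calibrate_software_engineer.yaml):

```python
# Minimal sketch of applying the recipe above with llm-compressor.
# `open_platypus` is a stand-in dataset; the real calibration set is the
# 590-sample mix described in calibrate_software_engineer.yaml.
from llmcompressor import oneshot
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "MiniMaxAI/MiniMax-M2.1"  # already dequantized to FP16 (step 1)

model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype="auto", trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(MODEL, trust_remote_code=True)

oneshot(
    model=model,
    recipe="recipe.yaml",          # the AWQModifier recipe above
    dataset="open_platypus",       # placeholder calibration dataset
    max_seq_length=8192,
    num_calibration_samples=590,
)

model.save_pretrained("MiniMax-M2.1-experts-INT4-AWQ", save_compressed=True)
tokenizer.save_pretrained("MiniMax-M2.1-experts-INT4-AWQ")
```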
|
|
|
|
|
The calibration set had 590 examples at a sequence length of 8192 tokens, covering 60 programming languages and 12 spoken languages; it is detailed in [calibrate_software_engineer.yaml](./calibrate_software_engineer.yaml).
|
|
|
|
|
## Quantization theory and heuristics for manual tuning |
|
|
|
|
|
<details> |
|
|
<summary>In-depth overview of quantization theory and heuristics for manual tuning</summary> |
|
|
|
|
|
### Layers to quantize |
|
|
|
|
|
Quantization should focus on Linear layers (also called Dense or Fully-Connected layers, i.e. MatMul+Bias).
In particular, quantizing LayerNorm/RMSNorm layers is strongly discouraged, see [1]:
|
|
> LayerNorm in Quantization. Kovaleva et al. (2021); Wei et al. (2022) find that outliers in the |
|
|
> LayerNorm parameters of BERT (Devlin et al., 2019) cause difficulties in model compression. |
|
|
> Given the importance of LayerNorm, all the quantization methods we discuss above leave LayerNorm unquantized. |
|
|
|
|
|
This is also reported in the Intel and Nvidia repos:
|
|
- https://github.com/intel/neural-compressor/issues/1963#issuecomment-2274873441 |
|
|
- https://github.com/NVIDIA/TensorRT/issues/4084#issuecomment-2294513950 |
|
|
|
|
|
### Tensors to up-quantize |
|
|
|
|
|
If there are enough bits, down projections should be prioritized.
|
|
|
|
|
According to [4] |
|
|
> Fig. 3: Maximum absolute value over layers for a LLaMA3-8B. |
|
|
> Each color represent a different projection and we clearly see that down_proj has the biggest |
|
|
> spikes in input and output. We also observe that RMSNorm propagate spikes through the entire model |
|
|
|
|
|
According to [5] |
|
|
> Figure 5(a) illustrates the extremal ratio across layers and modules in LLaMA2-7B, highlighting
> that weight outliers are concentrated in the down-projection matrices W^down_ℓ of the second layer and
> the last two layers. Figures 5(b) and 5(c) provide detailed visualizations of these outliers in the last
> two layers.
|
|
|
|
|
### Mixture-of-Experts quantization (MoE) |
|
|
|
|
|
Mixture-of-Experts models require specific quantization techniques.
|
|
|
|
|
#### Mixed-precision quantization |
|
|
|
|
|
Some layers have a higher impact on LLM performance. |
|
|
According to [2], spending more bits on attention layers yields larger gains than spending them on FFN layers.
|
|
According to [3] on 2-bit quantization: |
|
|
- quantizing expert FFN layers does not seriously impact model quality
|
|
- quantizing cross-attention has some impact |
|
|
- quantizing self-attention has a large impact |
|
|
- quantizing dense FFN has a very significant impact |
|
|
|
|
|
Hence, to preserve model quality, we choose not to quantize the dense FFN and self-attention layers.
|
|
|
|
|
We notice that: |
|
|
- official MXFP4 weights of gpt-oss-120b from OpenAI keep self-attention in BF16: |
|
|
- https://huggingface.co/openai/gpt-oss-120b/blob/main/model.safetensors.index.json |
|
|
- NVFP4 weights of DeepSeek-R1 quantized by Nvidia also keep self-attention in BF16: |
|
|
- https://huggingface.co/nvidia/DeepSeek-R1-0528-FP4/blob/main/model.safetensors.index.json |
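One way to verify such claims is to read the per-tensor dtypes directly from a safetensors shard header; a minimal sketch (the shard filename is a placeholder):

```python
# Minimal sketch: list per-tensor dtypes from a safetensors shard header
# to check which modules stay in high precision. The filename is a
# placeholder for any downloaded shard.
import json
import struct
from collections import Counter

def tensor_dtypes(path: str) -> dict:
    with open(path, "rb") as f:
        header_len = struct.unpack("<Q", f.read(8))[0]  # 8-byte LE length
        header = json.loads(f.read(header_len))
    header.pop("__metadata__", None)
    return {name: info["dtype"] for name, info in header.items()}

dtypes = tensor_dtypes("model-00001-of-00046.safetensors")
print(Counter(dtypes.values()))  # e.g. how many BF16 vs FP8/packed tensors
for name, dtype in dtypes.items():
    if "self_attn" in name:
        print(name, dtype)  # self-attention should stay high-precision
```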
|
|
|
|
|
#### Layers with high impact
|
|
|
|
|
According to [2], giving more bits to the first `k` blocks has a significantly higher impact on model quality than giving them to the last `k` blocks.
|
|
|
|
|
#### Expert quantization |
|
|
|
|
|
When quantizing MoE models, calibration is tricky because only a subset of experts is activated per token; you have to make sure all experts are calibrated.
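Here is a minimal sketch of how expert coverage can be checked during calibration, assuming Mixtral-style router modules named `block_sparse_moe.gate` (consistent with the recipe regex above) and `num_local_experts` / `num_experts_per_tok` config fields; these names are assumptions to adapt to the actual model:

```python
# Minimal sketch: count per-expert routing hits during calibration to
# verify every expert receives tokens. Module and config names
# (`block_sparse_moe.gate`, `num_local_experts`, `num_experts_per_tok`)
# are assumptions to adapt to the actual model.
from collections import defaultdict
import torch

n_experts = model.config.num_local_experts
top_k = model.config.num_experts_per_tok
hits = defaultdict(lambda: torch.zeros(n_experts, dtype=torch.long))

def make_hook(name):
    def hook(module, args, output):
        logits = output[0] if isinstance(output, tuple) else output
        top = logits.topk(k=top_k, dim=-1).indices  # routed expert ids
        hits[name] += torch.bincount(top.flatten().cpu(), minlength=n_experts)
    return hook

handles = [
    m.register_forward_hook(make_hook(n))
    for n, m in model.named_modules()
    if n.endswith("block_sparse_moe.gate")
]

# ... run the calibration batches through `model` here ...

for h in handles:
    h.remove()
for name, counts in hits.items():
    missing = (counts == 0).nonzero().flatten().tolist()
    if missing:
        print(f"{name}: experts never activated: {missing}")
```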
|
|
|
|
|
<details> |
|
|
<summary>Visual showcase of why ensuring quantization of all MoE experts is important</summary> |
|
|
|
|
|
- Source: https://avtc.github.io/aquarium-side-by-side/ |
|
|
- Context: https://github.com/ModelCloud/GPTQModel/pull/2235 |
|
|
|
|
|
 |
|
|
|
|
|
</details> |
|
|
|
|
|
## References |
|
|
|
|
|
1. Why Do Some Inputs Break Low-Bit LLM Quantization? (2025)\ |
|
|
Ting-Yun Chang, Muru Zhang, Jesse Thomason, Robin Jia\ |
|
|
https://arxiv.org/pdf/2506.12044 |
|
|
|
|
|
2. Examining Post-Training Quantization for Mixture-of-Experts: A Benchmark (2024)\ |
|
|
Pingzhi Li, Xiaolong Jin, Yu Cheng, Tianlong Chen\ |
|
|
https://arxiv.org/pdf/2406.08155v1 |
|
|
|
|
|
3. Mixture of Quantized Experts (MoQE): Complementary Effect of Low-bit Quantization and Robustness (2023)\ |
|
|
Young Jin Kim, Raffy Fahim, Hany Hassan Awadalla\ |
|
|
https://arxiv.org/pdf/2310.02410 |
|
|
|
|
|
4. Precision Where It Matters: A Novel Spike-Aware Mixed-Precision Quantization Strategy for LLaMA-based Language Models (2025)\
|
|
Lucas Maisonnave, Cyril Moineau, Olivier Bichler, and Fabrice Rastello\ |
|
|
https://arxiv.org/pdf/2504.21553 |
|
|
|
|
|
5. Systematic Outliers in Large Language Models (2025)\ |
|
|
Yongqi An, Xu Zhao, Tao Yu, Ming Tang, Jinqiao Wang\ |
|
|
https://arxiv.org/pdf/2502.06415v2 |
|
|
|
|
|
</details> |