---
pipeline_tag: text-generation
license: other
license_name: modified-mit
license_link: https://github.com/MiniMax-AI/MiniMax-M2.1/blob/main/LICENSE
library_name: llm-compressor
tags:
- fp8
- awq
- conversational
- vllm
- code
- devops
- software engineering
- engineer
- developer
- architect
- stem
- agent
datasets:
- HuggingFaceH4/ultrachat_200k
- databricks/databricks-dolly-15k
- neuralmagic/calibration
- HuggingFaceH4/no_robots
- nvidia/HelpSteer
- garage-bAInd/Open-Platypus
- PJMixers/grimulkan_physical-reasoning-ShareGPT
- PJMixers/grimulkan_theory-of-mind-ShareGPT
- HuggingFaceH4/Multilingual-Thinking
- ServiceNow-AI/M2Lingual
- interstellarninja/hermes_reasoning_tool_use
- deepmind/code_contests
- dh02391735/stackoverflow-kubernetes-questions
- diversoailab/humaneval-rust
- ammarnasr/the-stack-rust-clean
- CSJianYang/CodeArena
- nvidia/OpenCodeInstruct
- nvidia/Llama-Nemotron-Post-Training-Dataset
- nvidia/Nemotron-Competitive-Programming-v1
- rombodawg/code_bagel_hermes-2.5
- MathArena/project_euler
- nvidia/Nemotron-Math-Proofs-v1
- nvidia/OpenMathInstruct-2
- nvidia/OpenScienceReasoning-2
- MegaScience/MegaScience
- OpenMed/Medical-Reasoning-SFT-GPT-OSS-120B
- ccdv/pubmed-summarization
- gbharti/finance-alpaca
- vladlen32230/summarization-yahoo-stock-finance-article-text
- fka/awesome-chatgpt-prompts
- theoldmandthesea/17k_business_book
- ruggsea/stanford-encyclopedia-of-philosophy_instruct
- mlfoundations-dev/stackexchange_philosophy
- FreedomIntelligence/SocraticChat
- Gryphe/Opus-WritingPrompts
- anthracite-org/nopm_claude_writing_fixed
- zerofata/Roleplay-Anime-Characters
- zerofata/Instruct-Anime
- zerofata/Instruct-Anime-CreativeWriting
- sam-paech/gutenberg3-generalfiction-scifi-fantasy-romance-adventure-dpo
- PocketDoc/Dans-Prosemaxx-Adventure
- anthracite-org/stheno-filtered-v1.1
- KaraKaraWitch/TvTroper-2025
- AquaV/US-Army-Survival-Sharegpt
- AquaV/Interrogation-Sharegpt
- AquaV/Multi-Environment-Operations-Sharegpt
- AquaV/Resistance-Sharegpt
- PocketDoc/Dans-Kinomaxx-VanillaBackrooms
base_model:
- MiniMaxAI/MiniMax-M2.1
---
# MiniMax M2.1 (Mixed-Precision FP8 + INT4 AWQ FrankenQuant)
This strives to be the highest-quality quant that can run in 192 GiB of VRAM.
> [!TIP]
> 💡 A non-FP8 version is available at [mratsim/MiniMax-M2.1-BF16-INT4-AWQ](https://huggingface.co/mratsim/MiniMax-M2.1-BF16-INT4-AWQ) \
> That version is compatible with 8x RTX 3090s and with SGLang (which doesn't support mixed quantization yet), at the cost of an extra 3 GiB of VRAM. \
> This FP8+INT4 AWQ model was built by merging the original FP8 self-attention weights with the [mratsim/MiniMax-M2.1-BF16-INT4-AWQ](https://huggingface.co/mratsim/MiniMax-M2.1-BF16-INT4-AWQ) experts.
It features:
- Calibration of all experts: skipping any expert during calibration is extremely detrimental to quality, see PR https://github.com/vllm-project/llm-compressor/pull/2171
<details>
<summary>Visual showcase of why ensuring quantization of all MoE experts is important</summary>
- Source: https://avtc.github.io/aquarium-side-by-side/
- Context: https://github.com/ModelCloud/GPTQModel/pull/2235
![image](https://cdn-uploads.huggingface.co/production/uploads/67f26fd2c7b14380431d1f5a/BDc3-0m3_WLl3ZmbBMhmd.png)
</details>
- Mixed precision with:
  - self-attention weights copied directly from the official release (FP8 with 2D block scales)
  - expert weights quantized with the AWQ W4A16G32 scheme (4-bit weights, 16-bit activations, one scaling factor per group of 32 weights); a toy sketch of the scheme follows this list
- A high-quality, large and diverse calibration dataset with a programming and devops focus,
  complemented by domain-specific knowledge (math, sciences, medical, finance, business, humanities, philosophy, creative writing), general knowledge, pop culture and behavioral situations, because we never code in a vacuum and all experts should be calibrated across the full range of their activations.
- Calibration explicitly tests multilingual capabilities:
  - Asia: Chinese, Hindi, Korean, Japanese
  - Europe: French, German, Portuguese, Russian, Spanish
  - Middle-East: Arabic, Hebrew, Turkish
- Calibration explicitly tests 60 programming languages, not just Python:
  - Imperative programming: C, C++, Go, Zig, ...
  - Functional programming: Haskell, F#, OCaml, Erlang, Lisp, Clojure, ...
  - Web-focused: HTML/CSS, TypeScript, PHP, ...
  - Mixed paradigm: D, Kotlin, Nim, Rust, Swift, ...
  - Theorem provers: Coq, Lean
  - Low-level: ARM64 assembly, x86-64 assembly, LLVM IR
  - GPU programming: CUDA, Vulkan, Apple Metal
  - Game programming: GDScript, GLSL
  - Domain-specific: MATLAB, Julia, Solidity, R
- Calibration tries to ensure coverage of a wide variety of experiences (from explaining concepts to your grandmother to debugging Kubernetes logs)
- Built by a dev, for devs (and it looks very good for STEM as well)
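For intuition, here is a minimal, self-contained sketch of what the W4A16G32 scheme does to a weight tensor. This is a toy reimplementation for illustration only, not llm-compressor's actual packed kernel:

```python
# Toy illustration of W4A16G32: symmetric 4-bit integer weights with one
# scale per group of 32 consecutive weights; activations stay in 16-bit.
import torch

def quantize_w4_g32(w: torch.Tensor, group_size: int = 32):
    """Symmetric 4-bit group quantization along the last dimension."""
    groups = w.reshape(-1, group_size)                           # one row per group of 32
    scale = groups.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 7.0
    q = torch.clamp(torch.round(groups / scale), -8, 7)         # int4 codes in [-8, 7]
    return q.reshape(w.shape), scale

def dequantize_w4_g32(q: torch.Tensor, scale: torch.Tensor, group_size: int = 32):
    """Expand 4-bit codes back to FP16, as done at matmul time."""
    groups = q.reshape(-1, group_size) * scale
    return groups.reshape(q.shape).to(torch.float16)

w = torch.randn(128, 256)
q, s = quantize_w4_g32(w)
err = (dequantize_w4_g32(q, s).float() - w).abs().mean()
print(f"mean reconstruction error: {err:.4f}")                   # small but non-zero
```

AWQ improves on this naive rounding by rescaling weight channels according to activation statistics before quantizing, which is why the calibration set matters so much.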
It was built with my new declarative quantization framework, https://github.com/mratsim/quantizers, which makes highly-tuned calibration sets straightforward to express: [calibrate_software_engineer.yaml](./calibrate_software_engineer.yaml)
<details>
<summary>This took several days of work, plus contributions and bug reports to the ecosystem; I hope you find it useful.</summary>
- https://github.com/vllm-project/llm-compressor/pull/2171
- https://github.com/vllm-project/llm-compressor/issues/2172
- https://github.com/vllm-project/vllm/issues/31623
- https://github.com/sgl-project/sglang/issues/16276
- https://github.com/sgl-project/sglang/issues/16295
</details>
## 📥 Usage & Running Instructions
The model was tested with vLLM on 2x RTX Pro 6000; below is a launch script suitable for that configuration with the maximum 196,608-token context length. It uses 92.5 GiB of VRAM with the FlashInfer backend.
> [!WARNING]
> ⚠️ Due to the rope_parameters change, this model is currently incompatible with Transformers v5.\
> This also makes it incompatible with GLM-4.6V, which requires Transformers v5. Use separate Docker images.
> [!WARNING]
> ⚠️ SGLang cannot run this model yet because it lacks mixed-precision quantization support. Feature request raised at https://github.com/sgl-project/sglang/issues/16276.\
> Please use [mratsim/MiniMax-M2.1-BF16-INT4-AWQ](https://huggingface.co/mratsim/MiniMax-M2.1-BF16-INT4-AWQ) in the meantime.
### Running script
`--trust-remote-code` is necessary until the Transformers team merges github.com/huggingface/transformers/pull/42028
There are two reasoning parsers:
- `minimax_m2` puts the reasoning in a dedicated `reasoning_content` field, like DeepSeek models; frontends usually render it in a specific manner.
- `minimax_m2_append_think` wraps the reasoning as `<think>reasoning_content</think>` inside the normal text. Few frontends render that properly; I'm aware of [Cherry Studio](https://github.com/CherryHQ/cherry-studio) on desktop and [ChatterUI](https://github.com/Vali-98/ChatterUI) on Android.
`minimax_m2_append_think` was introduced for interleaved thinking, so the model can build upon its previous thinking (frontends usually discard the thinking trace). A client sketch showing the difference follows the launch script.
> [!TIP]
> 💡 With the recommended parameters the model tends to get stuck in repetition loops.\
> Setting `repetition_penalty: 1.10` and `frequency_penalty: 0.40` seems to avoid that.
```bash
# Model configuration (Mandatory)
MODEL="mratsim/MiniMax-M2.1-FP8-INT4-AWQ"
MODELNAME="MiniMax-M2.1"
GPU_UTIL=0.93
SAMPLER_OVERRIDE='{"temperature": 1, "top_p": 0.95, "top_k": 40, "repetition_penalty": 1.1, "frequency_penalty": 0.40}'
# Prevent memory fragmentation
export PYTORCH_ALLOC_CONF=expandable_segments:True,max_split_size_mb:512
# Prevent vLLM from using 100% CPU when idle (Very Recommended)
export VLLM_SLEEP_WHEN_IDLE=1
vllm serve "${MODEL}" \
--served-model-name "${MODELNAME}" \
--trust-remote-code \
--gpu-memory-utilization ${GPU_UTIL} \
--tp 2 \
--override-generation-config "${SAMPLER_OVERRIDE}" \
--enable-auto-tool-choice \
--tool-call-parser minimax_m2 \
--reasoning-parser minimax_m2
# --reasoning-parser minimax_m2_append_think
```
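Once the server is up, the two parsers can be compared from any OpenAI-compatible client. A minimal sketch (`reasoning_content` is how vLLM exposes the parsed trace; the prompt is just an example):

```python
# Minimal client sketch against the launch script above (localhost:8000).
# With --reasoning-parser minimax_m2, vLLM returns the trace in a separate
# `reasoning_content` field; with minimax_m2_append_think it stays inline
# in `content`, wrapped in <think>...</think> tags.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="MiniMax-M2.1",
    messages=[{"role": "user", "content": "Why is my pod stuck in CrashLoopBackOff?"}],
)

message = response.choices[0].message
print(getattr(message, "reasoning_content", None))  # populated by minimax_m2
print(message.content)                              # final answer (or <think>... inline)
```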
## Performance
On dual RTX Pro 6000, I can reach over 5500 tok/s prefill (prompt/context processing) and over 100 tok/s generation for a single request.
![image](https://cdn-uploads.huggingface.co/production/uploads/67f26fd2c7b14380431d1f5a/YbP1qw_YhcaM0aywJHSjG.png)
With PagedAttention in action, you can exceed 25,000 tok/s in prompt processing speed.
![image](https://cdn-uploads.huggingface.co/production/uploads/67f26fd2c7b14380431d1f5a/haCbdHZWScsgGiGCj768i.png)
When batching with the default config, you can reach 6000 and even 8000 tok/s prefill and 1200 tok/s generation.\
Tune prefill vs. decode prioritization with `--max-num-batched-tokens`; see [Performance & Tuning | vLLM](https://docs.vllm.ai/en/v0.4.2/models/performance.html)
![image](https://cdn-uploads.huggingface.co/production/uploads/67f26fd2c7b14380431d1f5a/ma7oVnEGbj15Rk4EG0h5B.png)
In steady state, with interleaved prefill and decode requests that interrupt each other, you can get ~2400 tok/s context processing and ~800 tok/s generation.
![image](https://cdn-uploads.huggingface.co/production/uploads/67f26fd2c7b14380431d1f5a/Gbc2pz5Tpm8gF-MV_UPDe.png)
Note: vLLM supports prefill-decode disaggregation for high-throughput serving if you have at least double the minimum hardware:
- https://pytorch.org/blog/disaggregated-inference-at-scale-with-pytorch-vllm/
- https://github.com/vllm-project/production-stack
- Prefill/decode disaggregation
- Multi-Tier KV-cache via [LMCache](https://github.com/LMCache/LMCache) (GPU > CPU > Local Disk)
- Cache aware router
- Multi-model dispatch via single interface
## 🔬 Quantization method
Quantization was quite complex for this model and was done in 3 steps:
1. The original weights are in FP8; they were dequantized to FP16 because llm-compressor cannot process FP8 inputs.
2. llm-compressor was used to quantize the MLP expert projections with AWQ, with [PR #2171](https://github.com/vllm-project/llm-compressor/pull/2171) ensuring all experts were activated during calibration.
3. Stitching the FrankenQuant: the original weights, including the 2D-block FP8 self-attention, were combined with the experts-only AWQ weights (see the sketch below).
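A simplified sketch of the stitching step. The shard paths are illustrative, and real code must also merge the `model.safetensors.index.json` weight map and the quantization sections of `config.json`:

```python
# Simplified stitching sketch: expert tensors come from the AWQ checkpoint,
# everything else (incl. FP8 self-attention weights and their block scales)
# comes from the original checkpoint. Multi-shard handling is elided.
import re
from safetensors.torch import load_file, save_file

EXPERT_RE = re.compile(r".*block_sparse_moe\.experts\.\d+\.(w1|w2|w3)")

original = load_file("MiniMax-M2.1/model.safetensors")      # FP8 source (path illustrative)
awq      = load_file("MiniMax-M2.1-AWQ/model.safetensors")  # INT4 AWQ experts

merged = {}
for name, tensor in original.items():
    if not EXPERT_RE.match(name):
        merged[name] = tensor        # keep FP8 attention, norms, router, embeddings
for name, tensor in awq.items():
    if EXPERT_RE.match(name):
        merged[name] = tensor        # AWQ packed weights, scales, zero points

save_file(merged, "MiniMax-M2.1-FP8-INT4-AWQ/model.safetensors")
```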
The llm-compressor library was used with the following recipe:
```yaml
default_stage:
default_modifiers:
AWQModifier:
config_groups:
mlp_experts_projections:
# Include only MLP expert weights for 4-bit quantization
targets: ["re:.*block_sparse_moe\\.experts\\.\\d+\\.(w1|w2|w3)$"]
weights:
num_bits: 4
type: int
symmetric: true
group_size: 32
strategy: group
dynamic: false
# actorder: group
observer: memoryless_minmax
mappings:
- smooth_layer: re:.*post_attention_layernorm$
balance_layers: ["re:.*w1$", "re:.*w3$"]
- smooth_layer: re:.*w3$
balance_layers: ["re:.*w2$"]
duo_scaling: true
```
The calibration set has 590 examples with an 8192-token sequence length, covering 60 programming languages and 12 spoken languages; it is detailed in [calibrate_software_engineer.yaml](./calibrate_software_engineer.yaml)
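For reference, this is roughly how such a recipe is applied through llm-compressor's standard `oneshot` entry point. The dataset below is a stand-in: the actual run used the curated 590-sample mix composed by the quantizers framework, not a single off-the-shelf dataset:

```python
# Sketch of applying the AWQ recipe above with llm-compressor's oneshot API.
# The calibration data here is a stand-in for the curated 590-sample mix.
from datasets import load_dataset
from llmcompressor import oneshot
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "MiniMaxAI/MiniMax-M2.1"  # after step 1: FP8 dequantized to 16-bit
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

# Render chat examples to text, then tokenize for calibration.
ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft[:590]")
ds = ds.map(lambda ex: {"text": tokenizer.apply_chat_template(ex["messages"], tokenize=False)})
ds = ds.map(
    lambda ex: tokenizer(ex["text"], max_length=8192, truncation=True, add_special_tokens=False),
    remove_columns=ds.column_names,
)

oneshot(
    model=model,
    dataset=ds,
    recipe="recipe.yaml",            # the AWQModifier recipe shown above
    max_seq_length=8192,
    num_calibration_samples=590,
)
```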
## Quantization theory and heuristics for manual tuning
<details>
<summary>In-depth overview of quantization theory and heuristics for manual tuning</summary>
### Layers to quantize
Quantization should focus on Linear layers (also called Dense or Fully-Connected layers, i.e. MatMul+Bias).
In particular, quantizing LayerNorm/RMSNorm layers is strongly discouraged, see [1]:
> LayerNorm in Quantization. Kovaleva et al. (2021); Wei et al. (2022) find that outliers in the
> LayerNorm parameters of BERT (Devlin et al., 2019) cause difficulties in model compression.
> Given the importance of LayerNorm, all the quantization methods we discuss above leave LayerNorm unquantized.
This is also reported in the Intel and Nvidia repos:
- https://github.com/intel/neural-compressor/issues/1963#issuecomment-2274873441
- https://github.com/NVIDIA/TensorRT/issues/4084#issuecomment-2294513950
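In practice this means restricting quantization targets to `torch.nn.Linear` modules and leaving every normalization layer in high precision. A quick, illustrative way to sanity-check which modules a recipe would touch:

```python
# Illustrative: list quantization candidates vs. layers to leave in high precision.
import torch.nn as nn
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("MiniMaxAI/MiniMax-M2.1", trust_remote_code=True)

for name, module in model.named_modules():
    if isinstance(module, nn.Linear):
        print("quantization candidate:", name)
    elif "norm" in type(module).__name__.lower():
        print("keep in high precision:", name)  # LayerNorm/RMSNorm stay unquantized
```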
### Tensors to up-quantize
If the bit budget allows, down projections should be prioritized for extra precision.
According to [4]
> Fig. 3: Maximum absolute value over layers for a LLaMA3-8B.
> Each color represent a different projection and we clearly see that down_proj has the biggest
> spikes in input and output. We also observe that RMSNorm propagate spikes through the entire model
According to [5]
> Figure 5(a) illustrates the extremal ratio across layers and modules in LLaMA2-7B, highlighting
> that weight outliers are concentrated in the down-projection matrices Wdown
> ℓ of the second layer and
> the last two layers. Figures 5(b) and 5(c) provide detailed visualizations of these outliers in the last
> two layers.
### Mixture-of-Experts (MoE) quantization
Mixture-of-Experts models require specific quantization techniques.
#### Mixed-precision quantization
Some layers have a higher impact on LLM performance.
According to [2], spending more bits on attention layers yields large gains compared to spending them on FFN layers.
According to [3] on 2-bit quantization:
- quantizing expert FFN layers does not seriously impact model quality
- quantizing cross-attention has some impact
- quantizing self-attention has a large impact
- quantizing dense FFN layers has a very significant impact
Hence, to preserve model quality, we should leave dense FFN layers and self-attention layers unquantized.
We notice that:
- official MXFP4 weights of gpt-oss-120b from OpenAI keep self-attention in BF16:
- https://huggingface.co/openai/gpt-oss-120b/blob/main/model.safetensors.index.json
- NVFP4 weights of DeepSeek-R1 quantized by Nvidia also keep self-attention in BF16:
- https://huggingface.co/nvidia/DeepSeek-R1-0528-FP4/blob/main/model.safetensors.index.json
#### Layers with high-impact
According to [2], giving more bits to the first `k` blocks has a significantly higher impact on model quality than giving them to the last `k` blocks.
#### Expert quantization
When quantizing MoE models, calibrating activations is tricky because only a subset of experts is activated per request. You have to make sure all experts are calibrated.
<details>
<summary>Visual showcase of why ensuring quantization of all MoE experts is important</summary>
- Source: https://avtc.github.io/aquarium-side-by-side/
- Context: https://github.com/ModelCloud/GPTQModel/pull/2235
![image](https://cdn-uploads.huggingface.co/production/uploads/67f26fd2c7b14380431d1f5a/BDc3-0m3_WLl3ZmbBMhmd.png)
</details>
## References
1. Why Do Some Inputs Break Low-Bit LLM Quantization? (2025)\
Ting-Yun Chang, Muru Zhang, Jesse Thomason, Robin Jia\
https://arxiv.org/pdf/2506.12044
2. Examining Post-Training Quantization for Mixture-of-Experts: A Benchmark (2024)\
Pingzhi Li, Xiaolong Jin, Yu Cheng, Tianlong Chen\
https://arxiv.org/pdf/2406.08155v1
3. Mixture of Quantized Experts (MoQE): Complementary Effect of Low-bit Quantization and Robustness (2023)\
Young Jin Kim, Raffy Fahim, Hany Hassan Awadalla\
https://arxiv.org/pdf/2310.02410
4. Precision Where It Matters: A Novel Spike Aware Mixed-Precision Quantization Strategy for LLaMA-based Language Models (2025)\
Lucas Maisonnave, Cyril Moineau, Olivier Bichler, Fabrice Rastello\
https://arxiv.org/pdf/2504.21553
5. Systematic Outliers in Large Language Models (2025)\
Yongqi An, Xu Zhao, Tao Yu, Ming Tang, Jinqiao Wang\
https://arxiv.org/pdf/2502.06415v2
</details>