---
pipeline_tag: text-generation
license: other
license_name: modified-mit
license_link: https://github.com/MiniMax-AI/MiniMax-M2.1/blob/main/LICENSE
library_name: llm-compressor
tags:
- fp8
- awq
- conversational
- vllm
- code
- devops
- software engineering
- engineer
- developer
- architect
- stem
- agent
datasets:
- HuggingFaceH4/ultrachat_200k
- databricks/databricks-dolly-15k
- neuralmagic/calibration
- HuggingFaceH4/no_robots
- nvidia/HelpSteer
- garage-bAInd/Open-Platypus
- PJMixers/grimulkan_physical-reasoning-ShareGPT
- PJMixers/grimulkan_theory-of-mind-ShareGPT
- HuggingFaceH4/Multilingual-Thinking
- ServiceNow-AI/M2Lingual
- interstellarninja/hermes_reasoning_tool_use
- deepmind/code_contests
- dh02391735/stackoverflow-kubernetes-questions
- diversoailab/humaneval-rust
- ammarnasr/the-stack-rust-clean
- CSJianYang/CodeArena
- nvidia/OpenCodeInstruct
- nvidia/Llama-Nemotron-Post-Training-Dataset
- nvidia/Nemotron-Competitive-Programming-v1
- rombodawg/code_bagel_hermes-2.5
- MathArena/project_euler
- nvidia/Nemotron-Math-Proofs-v1
- nvidia/OpenMathInstruct-2
- nvidia/OpenScienceReasoning-2
- MegaScience/MegaScience
- OpenMed/Medical-Reasoning-SFT-GPT-OSS-120B
- ccdv/pubmed-summarization
- gbharti/finance-alpaca
- vladlen32230/summarization-yahoo-stock-finance-article-text
- fka/awesome-chatgpt-prompts
- theoldmandthesea/17k_business_book
- ruggsea/stanford-encyclopedia-of-philosophy_instruct
- mlfoundations-dev/stackexchange_philosophy
- FreedomIntelligence/SocraticChat
- Gryphe/Opus-WritingPrompts
- anthracite-org/nopm_claude_writing_fixed
- zerofata/Roleplay-Anime-Characters
- zerofata/Instruct-Anime
- zerofata/Instruct-Anime-CreativeWriting
- sam-paech/gutenberg3-generalfiction-scifi-fantasy-romance-adventure-dpo
- PocketDoc/Dans-Prosemaxx-Adventure
- anthracite-org/stheno-filtered-v1.1
- KaraKaraWitch/TvTroper-2025
- AquaV/US-Army-Survival-Sharegpt
- AquaV/Interrogation-Sharegpt
- AquaV/Multi-Environment-Operations-Sharegpt
- AquaV/Resistance-Sharegpt
- PocketDoc/Dans-Kinomaxx-VanillaBackrooms
base_model:
- MiniMaxAI/MiniMax-M2.1
---

# MiniMax M2.1 (Mixed-Precision FP8 + INT4 AWQ FrankenQuant)

This strives to be the highest-quality quant that can run in 192 GiB of VRAM.

> [!TIP]
> 💡 A non-FP8 version is available at [mratsim/MiniMax-M2.1-BF16-INT4-AWQ](https://huggingface.co/mratsim/MiniMax-M2.1-BF16-INT4-AWQ) \
> That version is compatible with 8x RTX 3090s and with SGLang (which doesn't support mixed quantization yet), at the cost of an extra 3 GiB of VRAM. \
> This FP8+INT4 AWQ was built by merging the original FP8 self-attention weights with the experts from [mratsim/MiniMax-M2.1-BF16-INT4-AWQ](https://huggingface.co/mratsim/MiniMax-M2.1-BF16-INT4-AWQ).

It features:

- Calibration of all experts; not doing so is extremely detrimental (see the visual showcase in the Expert quantization section below), PR: https://github.com/vllm-project/llm-compressor/pull/2171
- Mixed precision:
  - self-attention weights copied directly from the official version (default FP8 with 2D blocks)
  - expert weights quantized with the AWQ W4A16G32 scheme (4-bit weights, 16-bit activations, one scaling factor per group of 32 weights; see the sketch after this list)
- A high-quality, large and diverse calibration dataset with a programming and DevOps focus, plus domain-specific knowledge (math, sciences, medical, finance, business, humanities, philosophy, creative writing), general knowledge, pop culture and behavioral situations, because we never code in a vacuum and we want all experts calibrated across the full range of their activations.
- Calibration explicitly covers multilingual capabilities:
  - Asia: Chinese, Hindi, Korean, Japanese
  - Europe: French, German, Portuguese, Russian, Spanish
  - Middle East: Arabic, Hebrew, Turkish
- Calibration explicitly covers 60 programming languages, not just Python:
  - Imperative programming: C, C++, Go, Zig, ...
  - Functional programming: Haskell, F#, OCaml, Erlang, Lisp, Clojure, ...
  - Web-focused: HTML/CSS, TypeScript, PHP, ...
  - Mixed paradigm: D, Kotlin, Nim, Rust, Swift, ...
  - Theorem provers: Coq, Lean
  - Low-level: ARM64 assembly, x86-64 assembly, LLVM IR
  - GPU programming: CUDA, Vulkan, Apple Metal
  - Game programming: GDScript, GLSL
  - Domain-specific: MATLAB, Julia, Solidity, R
- Calibration aims to cover a wide variety of experiences, from explaining concepts to your grandmother to debugging Kubernetes logs.
- Built by a dev, for devs (and it looks very good for STEM as well).

It uses my new declarative quantization framework https://github.com/mratsim/quantizers, which facilitates highly-tuned calibration sets: [calibrate_software_engineer.yaml](./calibrate_software_engineer.yaml)
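To make the W4A16G32 scheme concrete, here is a minimal PyTorch sketch of the storage format only: one symmetric INT4 value per weight, with one 16-bit scale shared by each group of 32 consecutive weights. The function name is mine, and real AWQ additionally rescales channels with activation-derived smoothing factors before rounding; this only illustrates what the format costs in precision.

```python
import torch

def quantize_w4a16g32(w: torch.Tensor, group_size: int = 32):
    """Reference symmetric 4-bit group quantization (storage format only).

    Each group of 32 consecutive weights shares one FP16 scale; the INT4
    range is [-8, 7] and activations stay in 16-bit at inference time.
    """
    out_features, in_features = w.shape
    groups = w.reshape(out_features, in_features // group_size, group_size)
    # One scale per group, chosen so the group's absmax maps to +/-7.
    scales = groups.abs().amax(dim=-1, keepdim=True).clamp(min=1e-6) / 7.0
    q = torch.clamp(torch.round(groups / scales), min=-8, max=7)
    w_hat = (q * scales).reshape(out_features, in_features)  # dequantized view
    return q.to(torch.int8), scales.to(torch.float16), w_hat

w = torch.randn(128, 256)
q, scales, w_hat = quantize_w4a16g32(w)
print(f"mean abs quantization error: {(w - w_hat).abs().mean():.5f}")
```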
This took several days of work, plus contributions and bug reports to the ecosystem; I hope you find it useful.

- https://github.com/vllm-project/llm-compressor/pull/2171
- https://github.com/vllm-project/llm-compressor/issues/2172
- https://github.com/vllm-project/vllm/issues/31623
- https://github.com/sgl-project/sglang/issues/16276
- https://github.com/sgl-project/sglang/issues/16295
## 📥 Usage & Running Instructions

The model was tested with vLLM on 2x RTX Pro 6000; the script below suits that configuration with the maximum 196,608 context length. It uses 92.5 GiB of VRAM with the FlashInfer backend.

> [!WARNING]
> ⚠️ Due to the rope_parameters change, this model is currently incompatible with transformers v5.\
> This makes it incompatible with GLM-4.6V, which requires transformers v5. Use different Docker images.

> [!WARNING]
> ⚠️ SGLang does not support this model due to missing mixed-precision support. A feature request was raised at https://github.com/sgl-project/sglang/issues/16276.\
> Please use [mratsim/MiniMax-M2.1-BF16-INT4-AWQ](https://huggingface.co/mratsim/MiniMax-M2.1-BF16-INT4-AWQ) in the meantime.

### Running script

`--trust-remote-code` is necessary until the transformers team merges https://github.com/huggingface/transformers/pull/42028

You have 2 reasoning parsers:

- `minimax_m2` puts the reasoning in a dedicated `reasoning_content` field, like DeepSeek models; frontends usually render that field in a specific manner.
- `minimax_m2_append_think` keeps the reasoning inline in the regular content, so it is sent as normal text. Few frontends render that properly; I'm aware of [Cherry Studio](https://github.com/CherryHQ/cherry-studio) on desktop and [ChatterUI](https://github.com/Vali-98/ChatterUI) on Android.

`minimax_m2_append_think` was introduced for interleaved thinking, i.e. letting the model build upon its previous thinking (frontends usually discard the thinking trace). A client example showing where each parser puts the reasoning follows the running script below.

> [!TIP]
> 💡 With the recommended parameters the model tends to get stuck in repetition loops.\
> `repetition_penalty: 1.10` and `frequency_penalty: 0.40` seem to avoid that.

```bash
# Model configuration (Mandatory)
MODEL="mratsim/MiniMax-M2.1-FP8-INT4-AWQ"
MODELNAME="MiniMax-M2.1"
GPU_UTIL=0.93
SAMPLER_OVERRIDE='{"temperature": 1, "top_p": 0.95, "top_k": 40, "repetition_penalty": 1.1, "frequency_penalty": 0.40}'

# Prevent memory fragmentation
export PYTORCH_ALLOC_CONF=expandable_segments:True,max_split_size_mb:512

# Prevent vLLM from using 100% CPU when idle (Very Recommended)
export VLLM_SLEEP_WHEN_IDLE=1

vllm serve "${MODEL}" \
  --served-model-name "${MODELNAME}" \
  --trust-remote-code \
  --gpu-memory-utilization ${GPU_UTIL} \
  --tp 2 \
  --override-generation-config "${SAMPLER_OVERRIDE}" \
  --enable-auto-tool-choice \
  --tool-call-parser minimax_m2 \
  --reasoning-parser minimax_m2
  # --reasoning-parser minimax_m2_append_think
```
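Once the server is running, a quick way to see where each parser puts the reasoning is a plain OpenAI-compatible request. A sketch, assuming the default port; the prompt is arbitrary, and `reasoning_content` is a vLLM extension rather than a typed field of the openai client, hence the `getattr`:

```python
from openai import OpenAI

# The vLLM server from the script above, on its default port.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

response = client.chat.completions.create(
    model="MiniMax-M2.1",
    messages=[{"role": "user", "content": "Explain PagedAttention in two sentences."}],
)

message = response.choices[0].message
# With --reasoning-parser minimax_m2 the trace is split into a separate field;
# with minimax_m2_append_think it stays inline in message.content instead.
print("reasoning:", getattr(message, "reasoning_content", None))
print("answer:", message.content)
```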
## Performance

On dual RTX Pro 6000, I can reach over 5500 tok/s prefill (prompt/context processing) and over 100 tok/s generation for a single request.

![image](https://cdn-uploads.huggingface.co/production/uploads/67f26fd2c7b14380431d1f5a/YbP1qw_YhcaM0aywJHSjG.png)

With PagedAttention in action you can reach over 25000 tok/s in prompt processing.

![image](https://cdn-uploads.huggingface.co/production/uploads/67f26fd2c7b14380431d1f5a/haCbdHZWScsgGiGCj768i.png)

When batching with the default config, you can reach over 6000 and even 8000 tok/s prompt processing, and 1200 tok/s generation.\
Tune prefill vs decode prioritization with `--max-num-batched-tokens`, see [Performance & Tuning | vLLM](https://docs.vllm.ai/en/v0.4.2/models/performance.html)

![image](https://cdn-uploads.huggingface.co/production/uploads/67f26fd2c7b14380431d1f5a/ma7oVnEGbj15Rk4EG0h5B.png)

In a steady state with interleaved prefill and decode requests that interrupt each other, you can get ~2400 tok/s context processing and 800 tok/s generation.

![image](https://cdn-uploads.huggingface.co/production/uploads/67f26fd2c7b14380431d1f5a/Gbc2pz5Tpm8gF-MV_UPDe.png)

Note: vLLM supports prefill/decode disaggregation for high-throughput serving if you have double the minimum hardware:
- https://pytorch.org/blog/disaggregated-inference-at-scale-with-pytorch-vllm/
- https://github.com/vllm-project/production-stack
  - Prefill/decode disaggregation
  - Multi-tier KV cache via [LMCache](https://github.com/LMCache/LMCache) (GPU > CPU > local disk)
  - Cache-aware router
  - Multi-model dispatch via a single interface

## 🔬 Quantization method

Quantization was quite complex for this model and was done in 3 steps:

1. The original weights are FP8; they were dequantized to FP16 because llm-compressor cannot process FP8.
2. llm-compressor was used to quantize the MLP expert projections with AWQ, using [PR #2171](https://github.com/vllm-project/llm-compressor/pull/2171) to ensure all experts were activated during calibration.
3. Stitching the FrankenQuant: I combined the original weights, including the 2D-block FP8, with the experts-only AWQ weights.

llm-compressor was used with the following recipe:

```yaml
default_stage:
  default_modifiers:
    AWQModifier:
      config_groups:
        mlp_experts_projections:
          # Include only MLP expert weights for 4-bit quantization
          targets: ["re:.*block_sparse_moe\\.experts\\.\\d+\\.(w1|w2|w3)$"]
          weights:
            num_bits: 4
            type: int
            symmetric: true
            group_size: 32
            strategy: group
            dynamic: false
            # actorder: group
            observer: memoryless_minmax
      mappings:
        - smooth_layer: re:.*post_attention_layernorm$
          balance_layers: ["re:.*w1$", "re:.*w3$"]
        - smooth_layer: re:.*w3$
          balance_layers: ["re:.*w2$"]
      duo_scaling: true
```

The calibration set had 590 examples at 8192 sequence length, covering 60 programming languages and 12 spoken languages; it is detailed in [calibrate_software_engineer.yaml](./calibrate_software_engineer.yaml).
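Step 3 boils down to a partition over tensor names: every expert tensor comes from the AWQ checkpoint, everything else (FP8 self-attention, embeddings, router, norms) from the official one. A minimal sketch of the idea, assuming hypothetical single-shard checkpoint paths and ignoring the `model.safetensors.index.json` and config rewrites a real merge also needs:

```python
import re
from safetensors.torch import load_file, save_file

# Matches expert projections and their AWQ artifacts (weight_packed, weight_scale, ...).
EXPERT_RE = re.compile(r".*block_sparse_moe\.experts\.\d+\.(w1|w2|w3)\.")

official = load_file("official-fp8/model.safetensors")  # hypothetical shard paths
awq = load_file("awq-int4/model.safetensors")

merged = {}
for name, tensor in official.items():
    if not EXPERT_RE.match(name):  # keep FP8 attention, dense layers, scales, ...
        merged[name] = tensor
for name, tensor in awq.items():
    if EXPERT_RE.match(name):      # take INT4 expert weights and their scales
        merged[name] = tensor

save_file(merged, "stitched/model.safetensors")
```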
## Quantization theory and heuristics for manual tuning

### Layers to quantize

Quantization should focus on Linear layers (also called Dense or Fully-Connected layers, i.e. MatMul+Bias).

In particular, quantizing LayerNorm/RMSNorm layers is strongly discouraged, see [1]:

> LayerNorm in Quantization. Kovaleva et al. (2021); Wei et al. (2022) find that outliers in the
> LayerNorm parameters of BERT (Devlin et al., 2019) cause difficulties in model compression.
> Given the importance of LayerNorm, all the quantization methods we discuss above leave LayerNorm unquantized.

This is also reported in Intel and Nvidia repos:
- https://github.com/intel/neural-compressor/issues/1963#issuecomment-2274873441
- https://github.com/NVIDIA/TensorRT/issues/4084#issuecomment-2294513950

### Tensors to up-quantize

If there are enough bits, down projections should be prioritized.

According to [4]:

> Fig. 3: Maximum absolute value over layers for a LLaMA3-8B.
> Each color represent a different projection and we clearly see that down_proj has the biggest
> spikes in input and output. We also observe that RMSNorm propagate spikes through the entire model

According to [5]:

> Figure 5(a) illustrates the extremal ratio across layers and modules in LLaMA2-7B, highlighting
> that weight outliers are concentrated in the down-projection matrices W^down_ℓ of the second layer and
> the last two layers. Figures 5(b) and 5(c) provide detailed visualizations of these outliers in the last
> two layers.

A sketch of this kind of per-projection outlier scan follows below.
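The scan can be approximated on any checkpoint in a few lines. A sketch assuming a hypothetical single-shard checkpoint with LLaMA-style projection names; it only looks at weight ranges, whereas the papers above also track activation spikes:

```python
from collections import defaultdict
from safetensors.torch import load_file

PROJECTIONS = ("q_proj", "k_proj", "v_proj", "o_proj",
               "gate_proj", "up_proj", "down_proj")

weights = load_file("model.safetensors")  # hypothetical single-shard checkpoint
absmax = defaultdict(float)
for name, w in weights.items():
    for proj in PROJECTIONS:
        if proj in name:
            # Track the worst weight spike seen for this projection type.
            absmax[proj] = max(absmax[proj], w.float().abs().max().item())

for proj, value in sorted(absmax.items(), key=lambda kv: -kv[1]):
    print(f"{proj:>10}: max |w| = {value:.2f}")
```

Projections that dominate this ranking (typically down_proj) are the first candidates for extra bits, or for being left unquantized.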
### Mixture-of-Experts quantization (MoE)

Mixture-of-Experts models require specific quantization techniques.

#### Mixed-precision quantization

Some layers have a higher impact on LLM performance than others. According to [2], spending more bits on attention layers results in large gains compared to spending them on FFN layers.

According to [3], on 2-bit quantization:
- quantizing expert FFN layers does not seriously impact model quality
- quantizing cross-attention has some impact
- quantizing self-attention has a large impact
- quantizing dense FFN has a very significant impact

Hence, to preserve model quality, we should choose not to quantize dense FFN layers and self-attention layers. We notice that:
- the official MXFP4 weights of gpt-oss-120b from OpenAI keep self-attention in BF16:
  - https://huggingface.co/openai/gpt-oss-120b/blob/main/model.safetensors.index.json
- the NVFP4 weights of DeepSeek-R1 quantized by Nvidia also keep self-attention in BF16:
  - https://huggingface.co/nvidia/DeepSeek-R1-0528-FP4/blob/main/model.safetensors.index.json

#### Layers with high impact

According to [2], giving more bits to the first `k` blocks has a significantly higher impact on model quality than giving them to the last `k` blocks.

#### Expert quantization

When quantizing MoE, calibrating activations is tricky as only a subset of experts is activated per token. You have to make sure all experts are calibrated; a sketch of such a check follows the showcase below.

Visual showcase of why ensuring quantization of all MoE experts is important
- Source: https://avtc.github.io/aquarium-side-by-side/
- Context: https://github.com/ModelCloud/GPTQModel/pull/2235

![image](https://cdn-uploads.huggingface.co/production/uploads/67f26fd2c7b14380431d1f5a/BDc3-0m3_WLl3ZmbBMhmd.png)
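Below is a minimal sketch of the kind of coverage check that PR #2171 automates: tally which experts the router would activate over the calibration set and flag any expert that was never hit. In a real run you would feed actual router logits captured with forward hooks on each gate module (module names vary per model); the random logits here are just a toy stand-in:

```python
from collections import Counter

import torch

def count_expert_hits(router_logits: torch.Tensor, top_k: int, counts: Counter) -> None:
    """Tally which experts a batch of router logits would activate.

    router_logits: [num_tokens, num_experts], as produced by a MoE gate.
    """
    chosen = router_logits.topk(top_k, dim=-1).indices
    counts.update(chosen.flatten().tolist())

# Toy demonstration: 8 experts, top-2 routing, 1000 calibration tokens.
counts: Counter = Counter()
count_expert_hits(torch.randn(1000, 8), top_k=2, counts=counts)
never_hit = [expert for expert in range(8) if counts[expert] == 0]
print("experts never activated during calibration:", never_hit)
```

Any expert left in `never_hit` would be quantized from garbage (or empty) statistics, which is exactly the failure mode the showcase above illustrates.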
## References

1. Why Do Some Inputs Break Low-Bit LLM Quantization? (2025)\
   Ting-Yun Chang, Muru Zhang, Jesse Thomason, Robin Jia\
   https://arxiv.org/pdf/2506.12044
2. Examining Post-Training Quantization for Mixture-of-Experts: A Benchmark (2024)\
   Pingzhi Li, Xiaolong Jin, Yu Cheng, Tianlong Chen\
   https://arxiv.org/pdf/2406.08155v1
3. Mixture of Quantized Experts (MoQE): Complementary Effect of Low-bit Quantization and Robustness (2023)\
   Young Jin Kim, Raffy Fahim, Hany Hassan Awadalla\
   https://arxiv.org/pdf/2310.02410
4. Precision Where It Matters: A Novel Spike Aware Mixed-Precision Quantization Strategy for LLaMA-based Language Models (2025)\
   Lucas Maisonnave, Cyril Moineau, Olivier Bichler, Fabrice Rastello\
   https://arxiv.org/pdf/2504.21553
5. Systematic Outliers in Large Language Models (2025)\
   Yongqi An, Xu Zhao, Tao Yu, Ming Tang, Jinqiao Wang\
   https://arxiv.org/pdf/2502.06415v2