---
pipeline_tag: text-generation
license: other
license_name: modified-mit
license_link: https://github.com/MiniMax-AI/MiniMax-M2.1/blob/main/LICENSE
library_name: llm-compressor
tags:
- fp8
- awq
- conversational
- vllm
- code
- devops
- software engineering
- engineer
- developer
- architect
- stem
- agent
datasets:
- HuggingFaceH4/ultrachat_200k
- databricks/databricks-dolly-15k
- neuralmagic/calibration
- HuggingFaceH4/no_robots
- nvidia/HelpSteer
- garage-bAInd/Open-Platypus
- PJMixers/grimulkan_physical-reasoning-ShareGPT
- PJMixers/grimulkan_theory-of-mind-ShareGPT
- HuggingFaceH4/Multilingual-Thinking
- ServiceNow-AI/M2Lingual
- interstellarninja/hermes_reasoning_tool_use
- deepmind/code_contests
- dh02391735/stackoverflow-kubernetes-questions
- diversoailab/humaneval-rust
- ammarnasr/the-stack-rust-clean
- CSJianYang/CodeArena
- nvidia/OpenCodeInstruct
- nvidia/Llama-Nemotron-Post-Training-Dataset
- nvidia/Nemotron-Competitive-Programming-v1
- rombodawg/code_bagel_hermes-2.5
- MathArena/project_euler
- nvidia/Nemotron-Math-Proofs-v1
- nvidia/OpenMathInstruct-2
- nvidia/OpenScienceReasoning-2
- MegaScience/MegaScience
- OpenMed/Medical-Reasoning-SFT-GPT-OSS-120B
- ccdv/pubmed-summarization
- gbharti/finance-alpaca
- vladlen32230/summarization-yahoo-stock-finance-article-text
- fka/awesome-chatgpt-prompts
- theoldmandthesea/17k_business_book
- ruggsea/stanford-encyclopedia-of-philosophy_instruct
- mlfoundations-dev/stackexchange_philosophy
- FreedomIntelligence/SocraticChat
- Gryphe/Opus-WritingPrompts
- anthracite-org/nopm_claude_writing_fixed
- zerofata/Roleplay-Anime-Characters
- zerofata/Instruct-Anime
- zerofata/Instruct-Anime-CreativeWriting
- sam-paech/gutenberg3-generalfiction-scifi-fantasy-romance-adventure-dpo
- PocketDoc/Dans-Prosemaxx-Adventure
- anthracite-org/stheno-filtered-v1.1
- KaraKaraWitch/TvTroper-2025
- AquaV/US-Army-Survival-Sharegpt
- AquaV/Interrogation-Sharegpt
- AquaV/Multi-Environment-Operations-Sharegpt
- AquaV/Resistance-Sharegpt
- PocketDoc/Dans-Kinomaxx-VanillaBackrooms
base_model:
- MiniMaxAI/MiniMax-M2.1
---
# MiniMax M2.1 (Mixed-Precision FP8 + INT4 AWQ FrankenQuant)
This strives to be the highest-quality quant that can run within 192 GiB of VRAM.
> [!TIP]
> 💡 A non-FP8 version is available at [mratsim/MiniMax-M2.1-BF16-INT4-AWQ](https://huggingface.co/mratsim/MiniMax-M2.1-BF16-INT4-AWQ) \
> That version is compatible with 8x RTX 3090s and with SGLang (which doesn't support mixed quantization yet), at the cost of an extra 3 GiB of VRAM. \
> This FP8+INT4 AWQ was built by merging the original FP8 self-attention weights with the [mratsim/MiniMax-M2.1-BF16-INT4-AWQ](https://huggingface.co/mratsim/MiniMax-M2.1-BF16-INT4-AWQ) experts.
It features:
- All experts are guaranteed to be calibrated; skipping this is extremely detrimental, see PR: https://github.com/vllm-project/llm-compressor/pull/2171
<details>
<summary>Visual showcase of why ensuring quantization of all MoE experts is important</summary>
- Source: https://avtc.github.io/aquarium-side-by-side/
- Context: https://github.com/ModelCloud/GPTQModel/pull/2235

</details>
- Mixed precision with:
- self-attention weights copied directly from the official version (default FP8 with 2D-blocks)
- expert weights quantized with the AWQ W4A16G32 scheme (4-bit weights, 16-bit activations, one scaling factor per group of 32 weights; see the sketch after this list)
- A large, diverse, high-quality calibration dataset with a programming and DevOps focus,
as well as domain-specific knowledge (math, sciences, medical, finance, business, humanities, philosophy, creative writing), general knowledge, pop culture and behavioral situations, because we never code in a vacuum. And we want to make sure all experts are calibrated across the full range of their activations.
- Calibration explicitly tests multilingual capabilities:
- Asia: Chinese, Hindi, Korean, Japanese
- Europe: French, German, Portuguese, Russian, Spanish
- Middle-East: Arabic, Hebrew, Turkish
- Calibration explicitly tests 60 programming languages and not just Python:
- Imperative programming: C, C++, Go, Zig, ...
- Functional programming: Haskell, F#, OCaml, Erlang, Lisp, Clojure ...
- Web-focused: HTML/CSS, TypeScript, PHP, ...
- Mixed paradigm: D, Kotlin, Nim, Rust, Swift, ...
- Theorem provers: Coq, Lean
- Low-level: ARM64 assembly, x86-64 assembly, LLVM IR
- GPU Programming: CUDA, Vulkan, Apple Metal
- Game Programming: GDScript, GLSL
- Domain-specific: MATLAB, Julia, Solidity, R
- Calibration tries to ensure coverage of a wide variety of scenarios (from explaining concepts to your grandmother to debugging Kubernetes logs)
- Built by a dev, for devs (and it looks very good for STEM as well)
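To make the W4A16G32 scheme concrete, here is a minimal sketch of symmetric 4-bit group quantization with one scale per group of 32 weights. It illustrates only the quantization grid; actual AWQ additionally rescales weight channels with activation-aware smoothing factors before quantizing.
```python
import torch

def fake_quant_w4a16_g32(w: torch.Tensor, group_size: int = 32) -> torch.Tensor:
    """Symmetric INT4 group quantization (quantize + dequantize) of a weight
    matrix. Activations stay in 16-bit, hence the "A16"."""
    out_features, in_features = w.shape
    assert in_features % group_size == 0
    groups = w.reshape(out_features, in_features // group_size, group_size)
    # One floating-point scale per group of 32 weights; symmetric INT4 spans [-8, 7]
    scales = groups.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 7.0
    q = torch.clamp(torch.round(groups / scales), min=-8, max=7)
    return (q * scales).reshape(out_features, in_features)

w = torch.randn(128, 256, dtype=torch.bfloat16)
err = (w - fake_quant_w4a16_g32(w)).abs().max()
print(f"max per-weight rounding error: {err.item():.4f}")
```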
It uses my new declarative quantization framework, https://github.com/mratsim/quantizers, which facilitates highly tuned calibration sets: [calibrate_software_engineer.yaml](./calibrate_software_engineer.yaml)
<details>
<summary>This took several days of work, contributions, and bug reports to the ecosystem; I hope you find it useful.</summary>
- https://github.com/vllm-project/llm-compressor/pull/2171
- https://github.com/vllm-project/llm-compressor/issues/2172
- https://github.com/vllm-project/vllm/issues/31623
- https://github.com/sgl-project/sglang/issues/16276
- https://github.com/sgl-project/sglang/issues/16295
</details>
## 📥 Usage & Running Instructions
The model was tested with vLLM on 2x RTX Pro 6000; here is a script suitable for such a configuration with the maximum 196,608-token context length. It uses 92.5 GiB of VRAM with the flashinfer backend.
> [!WARNING]
> ⚠️ Due to the rope_parameters change, this model is currently incompatible with transformers v5.\
> This makes it incompatible with GLM-4.6V, which requires transformers v5. Use different Docker images.
> [!WARNING]
> ⚠️ SGLang does not support this model due to missing mixed-precision support. Feature request raised at https://github.com/sgl-project/sglang/issues/16276.\
> Please use [mratsim/MiniMax-M2.1-BF16-INT4-AWQ](https://huggingface.co/mratsim/MiniMax-M2.1-BF16-INT4-AWQ) in the meantime.
### Running script
`--trust-remote-code` is necessary until the transformers team merges https://github.com/huggingface/transformers/pull/42028
You have two reasoning parsers:
- `minimax_m2` puts the reasoning content in a dedicated field, like DeepSeek models, which is usually rendered in a specific manner by frontends.
- `minimax_m2_append_think` puts the reasoning into `<think>reasoning_content</think>` and sends it as normal text. Few frontends render that properly; I'm aware of [Cherry Studio](https://github.com/CherryHQ/cherry-studio) on desktop and [ChatterUI](https://github.com/Vali-98/ChatterUI) on Android.
The reason `minimax_m2_append_think` was introduced is interleaved thinking: letting the model build upon its previous thinking (frontends usually discard the thinking trace).
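As an illustration, here is where the reasoning trace ends up for each parser. This is a minimal sketch assuming the vLLM server from the script below is already running on localhost:8000.
```python
from openai import OpenAI

# Assumes the vLLM server from the script below is running locally
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="MiniMax-M2.1",
    messages=[{"role": "user", "content": "Why is the sky blue?"}],
)
msg = resp.choices[0].message
# With --reasoning-parser minimax_m2, the trace lands in a separate field:
print(getattr(msg, "reasoning_content", None))
# With minimax_m2_append_think, it is inlined as <think>...</think> here:
print(msg.content)
```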
> [!TIP]
> 💡 With the recommended parameters the model tends to get stuck in repetition loops.\
> Setting `repetition_penalty: 1.10` and `frequency_penalty: 0.40` seems to avoid that.
```bash
# Model configuration (Mandatory)
MODEL="mratsim/MiniMax-M2.1-FP8-INT4-AWQ"
MODELNAME="MiniMax-M2.1"
GPU_UTIL=0.93
SAMPLER_OVERRIDE='{"temperature": 1, "top_p": 0.95, "top_k": 40, "repetition_penalty": 1.1, "frequency_penalty": 0.40}'
# Prevent memory fragmentation
export PYTORCH_ALLOC_CONF=expandable_segments:True,max_split_size_mb:512
# Prevent vLLM from using 100% CPU when idle (Very Recommended)
export VLLM_SLEEP_WHEN_IDLE=1
vllm serve "${MODEL}" \
--served-model-name "${MODELNAME}" \
--trust-remote-code \
--gpu-memory-utilization ${GPU_UTIL} \
--tp 2 \
--override-generation-config "${SAMPLER_OVERRIDE}" \
--enable-auto-tool-choice \
--tool-call-parser minimax_m2 \
--reasoning-parser minimax_m2
# --reasoning-parser minimax_m2_append_think
```
## Performance
On dual RTX Pro 6000, I can reach over 5500 tok/s in prefill (prompt/context processing) and over 100 tok/s in token generation for a single request.

With PagedAttention in action, you can reach over 25,000 tok/s in prompt-processing speed.

When batching with the default config, you can reach 6000 and even 8000 tok/s in prompt processing and 1200 tok/s in generation speed.\
Tune prefill vs decode prioritization with `--max_num_batched_tokens`; see [Performance & Tuning | vLLM](https://docs.vllm.ai/en/v0.4.2/models/performance.html)

In a steady state with interleaved prefill and decode requests that interrupt each other, you can get ~2400 tok/s context processing and ~800 tok/s generation.

Note: vLLM supports prefill/decode disaggregation for high-throughput serving if you have double the minimum hardware:
- https://pytorch.org/blog/disaggregated-inference-at-scale-with-pytorch-vllm/
- https://github.com/vllm-project/production-stack
  - Prefill/decode disaggregation
  - Multi-tier KV cache via [LMCache](https://github.com/LMCache/LMCache) (GPU > CPU > local disk)
  - Cache-aware router
  - Multi-model dispatch via a single interface
## 🔬 Quantization method
Quantization was quite complex for this model and was done in 3 steps:
1. The original weights are in FP8; they were dequantized to FP16 because llm-compressor cannot process FP8.
2. llm-compressor was used to quantize the MLP expert projections with AWQ, using [PR #2171](https://github.com/vllm-project/llm-compressor/pull/2171) to ensure all experts were activated.
3. Stitching the FrankenQuant: I combined the original weights, including the 2D-block FP8, with the experts-only AWQ weights.
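A minimal sketch of the stitching step, assuming local checkouts of both checkpoints. The paths are hypothetical, and a real script would stream shard by shard rather than hold the whole model in memory.
```python
import json, os
from safetensors import safe_open
from safetensors.torch import save_file

FP8_DIR = "MiniMax-M2.1"                 # original FP8 checkpoint (hypothetical path)
AWQ_DIR = "MiniMax-M2.1-BF16-INT4-AWQ"   # experts-only AWQ checkpoint (hypothetical path)

def load_all(model_dir: str) -> dict:
    """Load every tensor of a sharded safetensors checkpoint into one dict."""
    with open(os.path.join(model_dir, "model.safetensors.index.json")) as f:
        weight_map = json.load(f)["weight_map"]
    tensors = {}
    for shard in sorted(set(weight_map.values())):
        with safe_open(os.path.join(model_dir, shard), framework="pt") as st:
            for name in st.keys():
                tensors[name] = st.get_tensor(name)
    return tensors

fp8, awq = load_all(FP8_DIR), load_all(AWQ_DIR)
merged = {}
for name in sorted(set(fp8) | set(awq)):
    # Expert projections (AWQ-packed weights, scales, zero points) come from the
    # AWQ checkpoint; everything else (FP8 self-attention with its 2D-block
    # scales, router, embeddings, norms) comes from the original checkpoint.
    source = awq if (".experts." in name and name in awq) else fp8
    merged[name] = source[name]

# Single shard for brevity; a real script re-shards and rewrites the index
save_file(merged, "MiniMax-M2.1-FP8-INT4-AWQ/model.safetensors")
```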
The llm-compressor library was used with the following recipe:
```yaml
default_stage:
  default_modifiers:
    AWQModifier:
      config_groups:
        mlp_experts_projections:
          # Include only MLP expert weights for 4-bit quantization
          targets: ["re:.*block_sparse_moe\\.experts\\.\\d+\\.(w1|w2|w3)$"]
          weights:
            num_bits: 4
            type: int
            symmetric: true
            group_size: 32
            strategy: group
            dynamic: false
            # actorder: group
            observer: memoryless_minmax
      mappings:
        - smooth_layer: re:.*post_attention_layernorm$
          balance_layers: ["re:.*w1$", "re:.*w3$"]
        - smooth_layer: re:.*w3$
          balance_layers: ["re:.*w2$"]
      duo_scaling: true
```
The calibration set had 590 examples at 8192 sequence length, covering 60 programming languages and 12 spoken languages; it is detailed in [calibrate_software_engineer.yaml](./calibrate_software_engineer.yaml)
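For reference, here is a minimal sketch of running step 2 with llm-compressor's `oneshot` API. The dataset preparation below is a stand-in (the actual run used the multi-domain calibration set above), and the model path is hypothetical.
```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot

MODEL = "MiniMax-M2.1-BF16"  # hypothetical path to the dequantized checkpoint from step 1
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype="auto", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(MODEL, trust_remote_code=True)

# Stand-in calibration data: one of the listed datasets, chat-templated then tokenized
ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft").select(range(590))
ds = ds.map(lambda ex: {"text": tokenizer.apply_chat_template(ex["messages"], tokenize=False)})
ds = ds.map(
    lambda ex: tokenizer(ex["text"], max_length=8192, truncation=True, add_special_tokens=False),
    remove_columns=ds.column_names,
)

oneshot(
    model=model,
    dataset=ds,
    recipe="recipe.yaml",        # the AWQ recipe shown above
    max_seq_length=8192,
    num_calibration_samples=590,
)
model.save_pretrained("MiniMax-M2.1-BF16-INT4-AWQ", save_compressed=True)
tokenizer.save_pretrained("MiniMax-M2.1-BF16-INT4-AWQ")
```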
## Quantization theory and heuristics for manual tuning
<details>
<summary>In-depth overview of quantization theory and heuristics for manual tuning</summary>
### Layers to quantize
Quantization should focus on Linear layers (also called Dense or Fully-Connected layers, i.e. MatMul+Bias).
In particular, quantizing LayerNorm/RMSNorm layers is strongly discouraged; see [1]:
> LayerNorm in Quantization. Kovaleva et al. (2021); Wei et al. (2022) find that outliers in the
> LayerNorm parameters of BERT (Devlin et al., 2019) cause difficulties in model compression.
> Given the importance of LayerNorm, all the quantization methods we discuss above leave LayerNorm unquantized.
This is also reported in Intel and Nvidia repos:
- https://github.com/intel/neural-compressor/issues/1963#issuecomment-2274873441
- https://github.com/NVIDIA/TensorRT/issues/4084#issuecomment-2294513950
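In llm-compressor this falls out naturally when a recipe targets only `Linear` modules; a minimal sketch making the exclusion explicit (the `ignore` regex is illustrative):
```python
from llmcompressor.modifiers.awq import AWQModifier

# Norm layers are never quantized here: AWQ only targets Linear modules, and the
# ignore list keeps other sensitive tensors (e.g. the output head) untouched.
modifier = AWQModifier(
    targets=["Linear"],
    ignore=["lm_head", "re:.*norm$"],  # illustrative; norms are not Linear anyway
    scheme="W4A16",
)
```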
### Tensors to up-quantize
If there are enough bits, down projections should be prioritized.
According to [4]:
> Fig. 3: Maximum absolute value over layers for a LLaMA3-8B.
> Each color represent a different projection and we clearly see that down_proj has the biggest
> spikes in input and output. We also observe that RMSNorm propagate spikes through the entire model
According to [5]:
> Figure 5(a) illustrates the extremal ratio across layers and modules in LLaMA2-7B, highlighting
> that weight outliers are concentrated in the down-projection matrices $W^{down}_{\ell}$ of the second
> layer and the last two layers. Figures 5(b) and 5(c) provide detailed visualizations of these outliers
> in the last two layers.
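A hedged sketch of what "spending bits where it matters" could look like as a compressed-tensors config: 8 bits for the outlier-prone down projections, 4 bits for the rest. Group names and regexes are illustrative (LLaMA-style module names), not this model's recipe.
```python
from llmcompressor.modifiers.quantization import GPTQModifier

# Illustrative mixed-precision groups. Attention projections are simply not
# targeted by either group, so they stay in full precision.
modifier = GPTQModifier(
    config_groups={
        "down_proj_8bit": {  # outlier-prone down projections get 8 bits
            "targets": ["re:.*down_proj$"],
            "weights": {"num_bits": 8, "type": "int", "symmetric": True,
                        "strategy": "group", "group_size": 128},
        },
        "gate_up_4bit": {  # remaining MLP projections get 4 bits
            "targets": ["re:.*(gate_proj|up_proj)$"],
            "weights": {"num_bits": 4, "type": "int", "symmetric": True,
                        "strategy": "group", "group_size": 32},
        },
    },
    ignore=["lm_head"],
)
```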
### Mixture-of-Experts quantization (MoE)
Mixture-of-Experts models require specific quantization techniques.
#### Mixed-precision quantization
Some layers have a higher impact on LLM performance than others.
According to [2], spending more bits on attention layers results in large gains compared to spending them on FFN layers.
According to [3] on 2-bit quantization:
- quantizing expert FFN layers does not seriously impact model quality
- quantizing cross-attention has some impact
- quantizing self-attention has a large impact
- quantizing dense FFN has a very significant impact
Hence, to preserve model quality, we should choose not to quantize dense FFN layers or self-attention layers.
We notice that:
- official MXFP4 weights of gpt-oss-120b from OpenAI keep self-attention in BF16:
- https://huggingface.co/openai/gpt-oss-120b/blob/main/model.safetensors.index.json
- NVFP4 weights of DeepSeek-R1 quantized by Nvidia also keep self-attention in BF16:
- https://huggingface.co/nvidia/DeepSeek-R1-0528-FP4/blob/main/model.safetensors.index.json
#### Layers with high-impact
According to [2], giving more bits to the first `k` blocks has a significantly higher impact on model quality than giving them to the last `k` blocks.
#### Expert quantization
When quantizing MoE models, collecting activation statistics is tricky because only a subset of experts is activated per request. You have to make sure all experts see calibration data.
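A minimal sketch of a coverage check, assuming a loaded `model` with this model's `block_sparse_moe.experts` module layout (fused/grouped-GEMM expert implementations would need a different probe):
```python
from collections import Counter

hits = Counter()
all_experts = set()

def make_hook(key):
    def hook(module, args, output):
        hits[key] += 1  # this expert processed at least one calibration batch
    return hook

for layer_idx, layer in enumerate(model.model.layers):
    for expert_idx, expert in enumerate(layer.block_sparse_moe.experts):
        key = (layer_idx, expert_idx)
        all_experts.add(key)
        expert.register_forward_hook(make_hook(key))

# ... run the calibration forward passes here ...

never_hit = all_experts - set(hits)
print(f"{len(never_hit)} of {len(all_experts)} experts received no calibration tokens")
```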
<details>
<summary>Visual showcase of why ensuring quantization of all MoE experts is important</summary>
- Source: https://avtc.github.io/aquarium-side-by-side/
- Context: https://github.com/ModelCloud/GPTQModel/pull/2235

</details>
## References
1. Why Do Some Inputs Break Low-Bit LLM Quantization? (2025)\
Ting-Yun Chang, Muru Zhang, Jesse Thomason, Robin Jia\
https://arxiv.org/pdf/2506.12044
2. Examining Post-Training Quantization for Mixture-of-Experts: A Benchmark (2024)\
Pingzhi Li, Xiaolong Jin, Yu Cheng, Tianlong Chen\
https://arxiv.org/pdf/2406.08155v1
3. Mixture of Quantized Experts (MoQE): Complementary Effect of Low-bit Quantization and Robustness (2023)\
Young Jin Kim, Raffy Fahim, Hany Hassan Awadalla\
https://arxiv.org/pdf/2310.02410
4. Precision Where It Matters: A Novel Spike-Aware Mixed-Precision Quantization Strategy for LLaMA-based Language Models (2025)\
Lucas Maisonnave, Cyril Moineau, Olivier Bichler, and Fabrice Rastello\
https://arxiv.org/pdf/2504.21553
5. Systematic Outliers in Large Language Models (2025)\
Yongqi An, Xu Zhao, Tao Yu, Ming Tang, Jinqiao Wang\
https://arxiv.org/pdf/2502.06415v2
</details> |