---
pipeline_tag: text-generation
license: other
license_name: modified-mit
license_link: https://github.com/MiniMax-AI/MiniMax-M2.1/blob/main/LICENSE
library_name: llm-compressor
tags:
- fp8
- awq
- conversational
- vllm
- code
- devops
- software engineering
- engineer
- developer
- architect
- stem
- agent
datasets:
- HuggingFaceH4/ultrachat_200k
- databricks/databricks-dolly-15k
- neuralmagic/calibration
- HuggingFaceH4/no_robots
- nvidia/HelpSteer
- garage-bAInd/Open-Platypus
- PJMixers/grimulkan_physical-reasoning-ShareGPT
- PJMixers/grimulkan_theory-of-mind-ShareGPT
- HuggingFaceH4/Multilingual-Thinking
- ServiceNow-AI/M2Lingual
- interstellarninja/hermes_reasoning_tool_use
- deepmind/code_contests
- dh02391735/stackoverflow-kubernetes-questions
- diversoailab/humaneval-rust
- ammarnasr/the-stack-rust-clean
- CSJianYang/CodeArena
- nvidia/OpenCodeInstruct
- nvidia/Llama-Nemotron-Post-Training-Dataset
- nvidia/Nemotron-Competitive-Programming-v1
- rombodawg/code_bagel_hermes-2.5
- MathArena/project_euler
- nvidia/Nemotron-Math-Proofs-v1
- nvidia/OpenMathInstruct-2
- nvidia/OpenScienceReasoning-2
- MegaScience/MegaScience
- OpenMed/Medical-Reasoning-SFT-GPT-OSS-120B
- ccdv/pubmed-summarization
- gbharti/finance-alpaca
- vladlen32230/summarization-yahoo-stock-finance-article-text
- fka/awesome-chatgpt-prompts
- theoldmandthesea/17k_business_book
- ruggsea/stanford-encyclopedia-of-philosophy_instruct
- mlfoundations-dev/stackexchange_philosophy
- FreedomIntelligence/SocraticChat
- Gryphe/Opus-WritingPrompts
- anthracite-org/nopm_claude_writing_fixed
- zerofata/Roleplay-Anime-Characters
- zerofata/Instruct-Anime
- zerofata/Instruct-Anime-CreativeWriting
- sam-paech/gutenberg3-generalfiction-scifi-fantasy-romance-adventure-dpo
- PocketDoc/Dans-Prosemaxx-Adventure
- anthracite-org/stheno-filtered-v1.1
- KaraKaraWitch/TvTroper-2025
- AquaV/US-Army-Survival-Sharegpt
- AquaV/Interrogation-Sharegpt
- AquaV/Multi-Environment-Operations-Sharegpt
- AquaV/Resistance-Sharegpt
- PocketDoc/Dans-Kinomaxx-VanillaBackrooms
base_model:
- MiniMaxAI/MiniMax-M2.1
---
# MiniMax M2.1 (Mixed-Precision FP8 + INT4 AWQ FrankenQuant)
This strives to be the highest-quality quant that can run in 192 GiB of VRAM.
> [!TIP]
> 💡 A non-FP8 version is available at [mratsim/MiniMax-M2.1-BF16-INT4-AWQ](https://huggingface.co/mratsim/MiniMax-M2.1-BF16-INT4-AWQ) \
> That version is compatible with 8x RTX 3090s and with SGLang (which doesn't support mixed quantization yet), at the cost of an extra 3 GiB of VRAM. \
> This FP8+INT4 AWQ model was built by merging the original FP8 self-attention weights with the [mratsim/MiniMax-M2.1-BF16-INT4-AWQ](https://huggingface.co/mratsim/MiniMax-M2.1-BF16-INT4-AWQ) experts.
It features:
- Calibration of all experts: skipping any expert during calibration is extremely detrimental to quality, see PR https://github.com/vllm-project/llm-compressor/pull/2171
<details>
<summary>Visual showcase of why ensuring quantization of all MoE experts is important</summary>
- Source: https://avtc.github.io/aquarium-side-by-side/
- Context: https://github.com/ModelCloud/GPTQModel/pull/2235
![image](https://cdn-uploads.huggingface.co/production/uploads/67f26fd2c7b14380431d1f5a/BDc3-0m3_WLl3ZmbBMhmd.png)
</details>
- Mixed precision with:
  - self-attention weights copied directly from the official release (FP8 with 2D block scales)
  - expert weights quantized with the AWQ W4A16G32 scheme (4-bit weights, 16-bit activations, one scaling factor per group of 32 weights); a toy sketch of the scheme follows this list
- A high-quality, large and diverse calibration dataset with a programming and devops focus,
  complemented by domain-specific knowledge (math, sciences, medical, finance, business, humanities, philosophy, creative writing), general knowledge, pop culture and behavioral situations, because we never code in a vacuum and all experts should be calibrated across the full range of their activations.
- Calibration explicitly tests multilingual capabilities:
  - Asia: Chinese, Hindi, Korean, Japanese
  - Europe: French, German, Portuguese, Russian, Spanish
  - Middle-East: Arabic, Hebrew, Turkish
- Calibration explicitly tests 60 programming languages, not just Python:
  - Imperative programming: C, C++, Go, Zig, ...
  - Functional programming: Haskell, F#, OCaml, Erlang, Lisp, Clojure, ...
  - Web-focused: HTML/CSS, TypeScript, PHP, ...
  - Mixed paradigm: D, Kotlin, Nim, Rust, Swift, ...
  - Theorem provers: Coq, Lean
  - Low-level: ARM64 assembly, x86-64 assembly, LLVM IR
  - GPU programming: CUDA, Vulkan, Apple Metal
  - Game programming: GDScript, GLSL
  - Domain-specific: MATLAB, Julia, Solidity, R
- Calibration tries to ensure coverage of a wide variety of experiences (from explaining concepts to your grandmother to debugging Kubernetes logs)
- Built by a dev, for devs (and it looks very good for STEM as well)
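For intuition, here is a minimal, self-contained sketch of what the W4A16G32 scheme does to a weight tensor. This is a toy reimplementation for illustration only, not llm-compressor's actual packed kernel:

```python
# Toy illustration of W4A16G32: symmetric 4-bit integer weights with one
# scale per group of 32 consecutive weights; activations stay in 16-bit.
import torch

def quantize_w4_g32(w: torch.Tensor, group_size: int = 32):
    """Symmetric 4-bit group quantization along the last dimension."""
    groups = w.reshape(-1, group_size)                           # one row per group of 32
    scale = groups.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 7.0
    q = torch.clamp(torch.round(groups / scale), -8, 7)         # int4 codes in [-8, 7]
    return q.reshape(w.shape), scale

def dequantize_w4_g32(q: torch.Tensor, scale: torch.Tensor, group_size: int = 32):
    """Expand 4-bit codes back to FP16, as done at matmul time."""
    groups = q.reshape(-1, group_size) * scale
    return groups.reshape(q.shape).to(torch.float16)

w = torch.randn(128, 256)
q, s = quantize_w4_g32(w)
err = (dequantize_w4_g32(q, s).float() - w).abs().mean()
print(f"mean reconstruction error: {err:.4f}")                   # small but non-zero
```

AWQ improves on this naive rounding by rescaling weight channels according to activation statistics before quantizing, which is why the calibration set matters so much.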
It was built with my new declarative quantization framework, https://github.com/mratsim/quantizers, which makes highly-tuned calibration sets straightforward to express: [calibrate_software_engineer.yaml](./calibrate_software_engineer.yaml)
<details>
<summary>This took several days of work, plus contributions and bug reports to the ecosystem; I hope you find it useful.</summary>
- https://github.com/vllm-project/llm-compressor/pull/2171
- https://github.com/vllm-project/llm-compressor/issues/2172
- https://github.com/vllm-project/vllm/issues/31623
- https://github.com/sgl-project/sglang/issues/16276
- https://github.com/sgl-project/sglang/issues/16295
</details>
## 📥 Usage & Running Instructions
The model was tested with vLLM on 2x RTX Pro 6000; below is a launch script suitable for that configuration with the maximum 196,608-token context length. It uses 92.5 GiB of VRAM with the FlashInfer backend.
> [!WARNING]
> ⚠️ Due to the rope_parameters change, this model is currently incompatible with Transformers v5.\
> This also makes it incompatible with GLM-4.6V, which requires Transformers v5. Use separate Docker images.
> [!WARNING]
> ⚠️ SGLang cannot run this model yet because it lacks mixed-precision quantization support. Feature request raised at https://github.com/sgl-project/sglang/issues/16276.\
> Please use [mratsim/MiniMax-M2.1-BF16-INT4-AWQ](https://huggingface.co/mratsim/MiniMax-M2.1-BF16-INT4-AWQ) in the meantime.
### Running script
`--trust-remote-code` is necessary until the Transformers team merges github.com/huggingface/transformers/pull/42028
There are two reasoning parsers:
- `minimax_m2` puts the reasoning in a dedicated `reasoning_content` field, like DeepSeek models; frontends usually render it in a specific manner.
- `minimax_m2_append_think` wraps the reasoning as `<think>reasoning_content</think>` inside the normal text. Few frontends render that properly; I'm aware of [Cherry Studio](https://github.com/CherryHQ/cherry-studio) on desktop and [ChatterUI](https://github.com/Vali-98/ChatterUI) on Android.
`minimax_m2_append_think` was introduced for interleaved thinking, so the model can build upon its previous thinking (frontends usually discard the thinking trace). A client sketch showing the difference follows the launch script.
> [!TIP]
> 💡 With the recommended parameters the model tends to get stuck in repetition loops.\
> Setting `repetition_penalty: 1.10` and `frequency_penalty: 0.40` seems to avoid that.
```bash
# Model configuration (Mandatory)
MODEL="mratsim/MiniMax-M2.1-FP8-INT4-AWQ"
MODELNAME="MiniMax-M2.1"
GPU_UTIL=0.93
SAMPLER_OVERRIDE='{"temperature": 1, "top_p": 0.95, "top_k": 40, "repetition_penalty": 1.1, "frequency_penalty": 0.40}'
# Prevent memory fragmentation
export PYTORCH_ALLOC_CONF=expandable_segments:True,max_split_size_mb:512
# Prevent vLLM from using 100% CPU when idle (Very Recommended)
export VLLM_SLEEP_WHEN_IDLE=1
vllm serve "${MODEL}" \
--served-model-name "${MODELNAME}" \
--trust-remote-code \
--gpu-memory-utilization ${GPU_UTIL} \
--tp 2 \
--override-generation-config "${SAMPLER_OVERRIDE}" \
--enable-auto-tool-choice \
--tool-call-parser minimax_m2 \
--reasoning-parser minimax_m2
# --reasoning-parser minimax_m2_append_think
```
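Once the server is up, the two parsers can be compared from any OpenAI-compatible client. A minimal sketch (`reasoning_content` is how vLLM exposes the parsed trace; the prompt is just an example):

```python
# Minimal client sketch against the launch script above (localhost:8000).
# With --reasoning-parser minimax_m2, vLLM returns the trace in a separate
# `reasoning_content` field; with minimax_m2_append_think it stays inline
# in `content`, wrapped in <think>...</think> tags.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="MiniMax-M2.1",
    messages=[{"role": "user", "content": "Why is my pod stuck in CrashLoopBackOff?"}],
)

message = response.choices[0].message
print(getattr(message, "reasoning_content", None))  # populated by minimax_m2
print(message.content)                              # final answer (or <think>... inline)
```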
## Performance
On dual RTX Pro 6000, I can reach over 5500 tok/s prefill (prompt/context processing) and over 100 tok/s generation for a single request.
![image](https://cdn-uploads.huggingface.co/production/uploads/67f26fd2c7b14380431d1f5a/YbP1qw_YhcaM0aywJHSjG.png)
With PagedAttention in action, you can exceed 25,000 tok/s in prompt processing speed.
![image](https://cdn-uploads.huggingface.co/production/uploads/67f26fd2c7b14380431d1f5a/haCbdHZWScsgGiGCj768i.png)
When batching with the default config, you can reach 6000 and even 8000 tok/s prefill and 1200 tok/s generation.\
Tune prefill vs. decode prioritization with `--max-num-batched-tokens`; see [Performance & Tuning | vLLM](https://docs.vllm.ai/en/v0.4.2/models/performance.html)
![image](https://cdn-uploads.huggingface.co/production/uploads/67f26fd2c7b14380431d1f5a/ma7oVnEGbj15Rk4EG0h5B.png)
In steady state, with interleaved prefill and decode requests that interrupt each other, you can get ~2400 tok/s context processing and ~800 tok/s generation.
![image](https://cdn-uploads.huggingface.co/production/uploads/67f26fd2c7b14380431d1f5a/Gbc2pz5Tpm8gF-MV_UPDe.png)
Note: vLLM supports prefill-decode disaggregation for high-throughput serving if you have at least double the minimum hardware:
- https://pytorch.org/blog/disaggregated-inference-at-scale-with-pytorch-vllm/
- https://github.com/vllm-project/production-stack
- Prefill/decode disaggregation
- Multi-Tier KV-cache via [LMCache](https://github.com/LMCache/LMCache) (GPU > CPU > Local Disk)
- Cache aware router
- Multi-model dispatch via single interface
## 🔬 Quantization method
Quantization was quite complex for this model and was done in 3 steps:
1. The original weights are in FP8; they were dequantized to FP16 because llm-compressor cannot process FP8 inputs.
2. llm-compressor was used to quantize the MLP expert projections with AWQ, with [PR #2171](https://github.com/vllm-project/llm-compressor/pull/2171) ensuring all experts were activated during calibration.
3. Stitching the FrankenQuant: the original weights, including the 2D-block FP8 self-attention, were combined with the experts-only AWQ weights (see the sketch below).
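A simplified sketch of the stitching step. The shard paths are illustrative, and real code must also merge the `model.safetensors.index.json` weight map and the quantization sections of `config.json`:

```python
# Simplified stitching sketch: expert tensors come from the AWQ checkpoint,
# everything else (incl. FP8 self-attention weights and their block scales)
# comes from the original checkpoint. Multi-shard handling is elided.
import re
from safetensors.torch import load_file, save_file

EXPERT_RE = re.compile(r".*block_sparse_moe\.experts\.\d+\.(w1|w2|w3)")

original = load_file("MiniMax-M2.1/model.safetensors")      # FP8 source (path illustrative)
awq      = load_file("MiniMax-M2.1-AWQ/model.safetensors")  # INT4 AWQ experts

merged = {}
for name, tensor in original.items():
    if not EXPERT_RE.match(name):
        merged[name] = tensor        # keep FP8 attention, norms, router, embeddings
for name, tensor in awq.items():
    if EXPERT_RE.match(name):
        merged[name] = tensor        # AWQ packed weights, scales, zero points

save_file(merged, "MiniMax-M2.1-FP8-INT4-AWQ/model.safetensors")
```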
The llm-compressor library was used with the following recipe:
```yaml
default_stage:
default_modifiers:
AWQModifier:
config_groups:
mlp_experts_projections:
# Include only MLP expert weights for 4-bit quantization
targets: ["re:.*block_sparse_moe\\.experts\\.\\d+\\.(w1|w2|w3)$"]
weights:
num_bits: 4
type: int
symmetric: true
group_size: 32
strategy: group
dynamic: false
# actorder: group
observer: memoryless_minmax
mappings:
- smooth_layer: re:.*post_attention_layernorm$
balance_layers: ["re:.*w1$", "re:.*w3$"]
- smooth_layer: re:.*w3$
balance_layers: ["re:.*w2$"]
duo_scaling: true
```
The calibration set has 590 examples with an 8192-token sequence length, covering 60 programming languages and 12 spoken languages; it is detailed in [calibrate_software_engineer.yaml](./calibrate_software_engineer.yaml)
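For reference, this is roughly how such a recipe is applied through llm-compressor's standard `oneshot` entry point. The dataset below is a stand-in: the actual run used the curated 590-sample mix composed by the quantizers framework, not a single off-the-shelf dataset:

```python
# Sketch of applying the AWQ recipe above with llm-compressor's oneshot API.
# The calibration data here is a stand-in for the curated 590-sample mix.
from datasets import load_dataset
from llmcompressor import oneshot
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "MiniMaxAI/MiniMax-M2.1"  # after step 1: FP8 dequantized to 16-bit
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

# Render chat examples to text, then tokenize for calibration.
ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft[:590]")
ds = ds.map(lambda ex: {"text": tokenizer.apply_chat_template(ex["messages"], tokenize=False)})
ds = ds.map(
    lambda ex: tokenizer(ex["text"], max_length=8192, truncation=True, add_special_tokens=False),
    remove_columns=ds.column_names,
)

oneshot(
    model=model,
    dataset=ds,
    recipe="recipe.yaml",            # the AWQModifier recipe shown above
    max_seq_length=8192,
    num_calibration_samples=590,
)
```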
## Quantization theory and heuristics for manual tuning
<details>
<summary>In-depth overview of quantization theory and heuristics for manual tuning</summary>
### Layers to quantize
Quantization should focus on Linear layers (also called Dense or Fully-Connected layers, i.e. MatMul+Bias).
In particular, quantizing LayerNorm/RMSNorm layers is strongly discouraged, see [1]:
> LayerNorm in Quantization. Kovaleva et al. (2021); Wei et al. (2022) find that outliers in the
> LayerNorm parameters of BERT (Devlin et al., 2019) cause difficulties in model compression.
> Given the importance of LayerNorm, all the quantization methods we discuss above leave LayerNorm unquantized.
This is also reported in the Intel and Nvidia repos:
- https://github.com/intel/neural-compressor/issues/1963#issuecomment-2274873441
- https://github.com/NVIDIA/TensorRT/issues/4084#issuecomment-2294513950
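In practice this means restricting quantization targets to `torch.nn.Linear` modules and leaving every normalization layer in high precision. A quick, illustrative way to sanity-check which modules a recipe would touch:

```python
# Illustrative: list quantization candidates vs. layers to leave in high precision.
import torch.nn as nn
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("MiniMaxAI/MiniMax-M2.1", trust_remote_code=True)

for name, module in model.named_modules():
    if isinstance(module, nn.Linear):
        print("quantization candidate:", name)
    elif "norm" in type(module).__name__.lower():
        print("keep in high precision:", name)  # LayerNorm/RMSNorm stay unquantized
```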
### Tensors to up-quantize
If the bit budget allows, down projections should be prioritized for extra precision.
According to [4]
> Fig. 3: Maximum absolute value over layers for a LLaMA3-8B.
> Each color represent a different projection and we clearly see that down_proj has the biggest
> spikes in input and output. We also observe that RMSNorm propagate spikes through the entire model
According to [5]
> Figure 5(a) illustrates the extremal ratio across layers and modules in LLaMA2-7B, highlighting
> that weight outliers are concentrated in the down-projection matrices Wdown
> ℓ of the second layer and
> the last two layers. Figures 5(b) and 5(c) provide detailed visualizations of these outliers in the last
> two layers.
### Mixture-of-Experts (MoE) quantization
Mixture-of-Experts models require specific quantization techniques.
#### Mixed-precision quantization
Some layers have a higher impact on LLM performance.
According to [2], spending more bits on attention layers yields large gains compared to spending them on FFN layers.
According to [3] on 2-bit quantization:
- quantizing expert FFN layers does not seriously impact model quality
- quantizing cross-attention has some impact
- quantizing self-attention has a large impact
- quantizing dense FFN layers has a very significant impact
Hence, to preserve model quality, we should leave dense FFN layers and self-attention layers unquantized.
We notice that:
- official MXFP4 weights of gpt-oss-120b from OpenAI keep self-attention in BF16:
- https://huggingface.co/openai/gpt-oss-120b/blob/main/model.safetensors.index.json
- NVFP4 weights of DeepSeek-R1 quantized by Nvidia also keep self-attention in BF16:
- https://huggingface.co/nvidia/DeepSeek-R1-0528-FP4/blob/main/model.safetensors.index.json
#### Layers with high-impact
According to [2], giving more bits to the first `k` blocks has a significantly higher impact on model quality than giving them to the last `k` blocks.
#### Expert quantization
When quantizing MoE models, calibrating activations is tricky because only a subset of experts is activated per request. You have to make sure all experts are calibrated.
<details>
<summary>Visual showcase of why ensuring quantization of all MoE experts is important</summary>
- Source: https://avtc.github.io/aquarium-side-by-side/
- Context: https://github.com/ModelCloud/GPTQModel/pull/2235
![image](https://cdn-uploads.huggingface.co/production/uploads/67f26fd2c7b14380431d1f5a/BDc3-0m3_WLl3ZmbBMhmd.png)
</details>
## References
1. Why Do Some Inputs Break Low-Bit LLM Quantization? (2025)\
Ting-Yun Chang, Muru Zhang, Jesse Thomason, Robin Jia\
https://arxiv.org/pdf/2506.12044
2. Examining Post-Training Quantization for Mixture-of-Experts: A Benchmark (2024)\
Pingzhi Li, Xiaolong Jin, Yu Cheng, Tianlong Chen\
https://arxiv.org/pdf/2406.08155v1
3. Mixture of Quantized Experts (MoQE): Complementary Effect of Low-bit Quantization and Robustness (2023)\
Young Jin Kim, Raffy Fahim, Hany Hassan Awadalla\
https://arxiv.org/pdf/2310.02410
4. Precision Where It Matters: A Novel Spike Aware Mixed-Precision Quantization Strategy for LLaMA-based Language Models (2025)\
Lucas Maisonnave, Cyril Moineau, Olivier Bichler, Fabrice Rastello\
https://arxiv.org/pdf/2504.21553
5. Systematic Outliers in Large Language Models (2025)\
Yongqi An, Xu Zhao, Tao Yu, Ming Tang, Jinqiao Wang\
https://arxiv.org/pdf/2502.06415v2
</details>