MiniMax-M3 — AWQ W4A16 (int4), A100-ready
4-bit (AWQ W4A16, routed-experts-only) quantization of MiniMax-M3, packaged to serve on NVIDIA A100 (SM80) under vLLM — text + vision, 1M-context, at TP-8 (or 2× TP-4).
The base model is a large bf16 checkpoint built for 8× Hopper (H200/H20). This repo is the A100 path: the int4 weights plus three small vLLM model-code patches that make M3's int4 MoE serve correctly on the Marlin W4A16 kernel — without changing the math.
- bf16 → int4: ≈796 GB → 225 GB
- ~428B params · ~23B activated · MiniMax Sparse Attention (MSA), 1M context · text + vision · tool-calling + reasoning
Use this model (vLLM)
M3's int4 routed-expert MoE serves on the Marlin W4A16 path, but stock vLLM doesn't thread M3's clamped-SwiGLU activation through that kernel — so it loads but emits incoherent output. Bind-mount the three patch files over the image copies (details in vllm-patches/PATCHES.md):
V=/usr/local/lib/python3.12/dist-packages/vllm
docker run --rm --gpus all --ipc=host -p 8000:8000 \
-v /path/to/MiniMax-M3-AWQ-int4:/model:ro \
-v /path/to/vllm-patches/A_wna16_marlin.py:$V/model_executor/layers/quantization/compressed_tensors/compressed_tensors_moe/compressed_tensors_moe_wna16_marlin.py:ro \
-v /path/to/vllm-patches/B_int_wna16.py:$V/model_executor/layers/fused_moe/oracle/int_wna16.py:ro \
-v /path/to/vllm-patches/C_config.py:$V/model_executor/layers/fused_moe/config.py:ro \
vllm/vllm-openai:minimax-m3 \
--model /model \
--served-model-name m3 \
--tensor-parallel-size 8 \
--block-size 128 \
--enable-expert-parallel \
--enable-prefix-caching \
--reasoning-parser minimax_m3 \
--enable-auto-tool-choice --tool-call-parser minimax_m3 \
--trust-remote-code \
--max-model-len 262144 \
--gpu-memory-utilization 0.95
--block-size 128is mandatory — MSA's index cache requires page-128 alignment; other block sizes break M3.- TP-4 works too — set
--tensor-parallel-size 4(see the context note below). - Sampling:
temperature 1.0, top_p 0.95, top_k 40. - Reasoning: M3 supports thinking / non-thinking modes (via the
minimax_m3reasoning parser). --max-model-lencan be raised toward the native 1,048,576 as VRAM allows.
Files
| file | what |
|---|---|
model-000{01..16}-of-00016.safetensors |
int4 weights (W4A16) |
config.json, recipe.yaml |
quant config + the quantization recipe |
configuration_minimax_m3_vl.py, processing_minimax.py, … |
model code (--trust-remote-code) |
tokenizer*, chat_template.jinja, preprocessor_config.json |
tokenizer + chat / vision preprocessing |
vllm-patches/A_wna16_marlin.py, vllm-patches/B_int_wna16.py, vllm-patches/C_config.py |
the three vLLM serving patches — mount over the image copies |
vllm-patches/launch_m3_awq.sh, vllm-patches/PATCHES.md |
ready-to-run launch script + patch writeup |
Method
AWQ W4A16, routed-experts-only, group-size 128, with GPTQ-style error compensation and an MSE observer. Only the MoE routed-expert projections are quantized to 4-bit. Everything quality-sensitive stays high-precision: attention, the router/gate, shared/dense layers, embeddings & lm_head, and the vision tower are left untouched.
This is why MoE tolerates 4-bit far better than dense models — the sensitive machinery is intact and only the redundant expert bulk is compressed. M3's RMSNorm uses the Gemma (1+w) convention, so the AWQ smoothing fold is applied accordingly (getting this wrong is the classic way to make M3 incoherent). The full recipe is in recipe.yaml.
A100 serving patches
The three files make M3's int4 routed-expert MoE serve exactly on stock vLLM's Marlin W4A16 path:
- Clamped SwiGLU in the W4A16 Marlin MoE — M3's experts use an OpenAI-style clamped
(up+1)·gluactivation; the stock fused-MoE int4 kernel doesn't carry the clamp limit, so served experts compute the wrong activation (coherent-looking greedy output, corrupt distribution). The patches plumb the clamp through the Marlin path. - int4 (W4A16) dequant wiring for the routed-expert packing, plus the matching fused-MoE config.
Bind-mounting is the zero-rebuild path; baking the three files into a derived image is the clean end-state. Full writeup in vllm-patches/PATCHES.md.
Serving notes (A100) — learned the hard way
- Context vs. replicas is the real tradeoff, not raw VRAM. MSA keeps a sparse-attention index workspace that grows with
--max-model-len(roughly independent of batch size). On 8× A100 you can run one TP-8 replica at long context (128K–512K+), or two TP-4 replicas but each caps around ~64K before the index workspace + weights crowd out the KV cache. Choose by whether you want context or concurrency. - fp8 KV cache: use
fp8_e5m2, notfp8(e4m3), on A100. The default--kv-cache-dtype fp8(e4m3) routes to a FlashInfer page-128 kernel that needs trtllm-gen (Blackwell / SM100+) and fails at init on Ampere — and A100's Triton can't compile e4m3 regardless. The e5m2 path does work, with a small set of additional serving patches (patches/), giving ~1.5× KV capacity with CUDA graphs on, at a modest quality cost (≈ +0.017 held-out KL-divergence over bf16 KV):--kv-cache-dtype fp8_e5m2 --disable-custom-all-reduce. Plain bf16/auto stays the zero-patch default; reach for e5m2 when you want more context or concurrency per GPU. - Block-size must be 128 (see above).
Quality
Held-out wikitext mean KL-divergence vs the bf16 reference ≈ 0.16 (full-vocab, calibration-decoupled) — the solid end of the 4-bit MoE range, close to community IQ4-class int4. The high-precision attention/router/shared paths plus group-size-128 experts keep it faithful; a finer group-size-32 variant measured ≈0.15 (within noise), so group-size 128 was kept for the better size/quality tradeoff.
License & credits
- Original model: MiniMax-M3 © MiniMax, MiniMax Community License.
- int4 quantization + A100 vLLM patches by @spectator2026.
- Downloads last month
- 241
Model tree for spectator2026/MiniMax-M3-AWQ-int4
Base model
MiniMaxAI/MiniMax-M3