File size: 4,785 Bytes
6684358 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 | ---
license: other
license_name: minimax
license_link: https://huggingface.co/MiniMaxAI/MiniMax-M3/blob/main/LICENSE
base_model: MiniMaxAI/MiniMax-M3
base_model_relation: quantized
tags:
- nvfp4
- fp4
- modelopt
- minimax
- moe
- quantized
---
# MiniMax-M3-NVFP4
**The first NVFP4 quantization of [MiniMaxAI/MiniMax-M3](https://huggingface.co/MiniMaxAI/MiniMax-M3)** — 428B-total / 23B-active MoE with MiniMax Sparse Attention (MSA), quantized 2026-06-12.
- **~256 GB** on disk (vs 854 GB BF16, 444 GB MXFP8) → serves on **2× B300/GB300** with huge KV headroom, or fits tighter Blackwell pairs
- Routed + shared **experts in NVFP4** (group 16, two-level scaling); attention, MSA indexer, router, embeddings, lm_head and the vision tower kept in **original BF16** (no quantization round-trip — copied verbatim from the source checkpoint)
- Same recipe family as `nvidia/MiniMax-M2.7-NVFP4` (experts-only NVFP4), produced with **TensorRT Model Optimizer 0.44.0**
## Quantization recipe
| | |
|---|---|
| Method | PTQ, NVFP4 (FP4 weights+activations, FP8 per-16 block scales + FP32 global) |
| Tool | nvidia-modelopt 0.44.0, transformers main (native `minimax_m3_vl`) |
| Calibration | 512 samples × 2048 tokens: cnn_dailymail + nvidia/OpenCodeReasoning + nvidia/OpenMathReasoning |
| Quantized | routed experts (w1/w2/w3) + shared experts, all 57 MoE layers |
| Excluded | attention (incl. MSA indexer), router/gate, embeddings, lm_head, vision tower, projectors |
| KV cache | not quantized (v1; MSA is young in engines — don't stack experiments) |
Calibration deliberately uses longer sequences and reasoning traces than the modelopt defaults: NVFP4 activation scales are per-block *dynamic* at runtime, so calibration only pins the per-tensor global scales — reasoning-heavy data exposes the activation extremes a thinking model actually produces.
## Evals
Measured via lm-evaluation-harness against the official **MiniMax-M3-MXFP8** endpoint as baseline, same engine (SGLang), same sampling (temperature 1.0, top-p 0.95, model-card settings), thinking enabled. Generation caps: 16384 tokens (GPQA, MMLU), 8192 (GSM8K); at these caps truncation is negligible (<0.25% of samples).
| Task | MXFP8 (official) | **NVFP4 (this repo)** |
|---|---|---|
| GSM8K (5-shot, strict) | 93.93 | 92.57 |
| GPQA diamond (CoT zero-shot) | 76.26 | 69.70 |
| MMLU (flan CoT few-shot, 25% sample) | 77.36 | 74.16 |
## Serving (SGLang)
Requires the MiniMax-M3 bring-up image and **one patch**: SGLang's NVFP4 cutlass MoE path does not yet forward M3's clamped-swiglu activation parameters (`swiglu_alpha=1.702`, `swiglu_limit=7.0`, `+1` beta — same family as GPT-OSS). Without the patch the model loads but generates garbage. The patched files ship in this repo under `sglang_patch/` (upstream PR pending; the second file makes the unsupported flashinfer-trtllm MoE backend fail fast instead of producing garbage).
```bash
docker run --runtime=nvidia --gpus '"device=0,1"' --ipc=host --shm-size 32g \
-v $MODEL_DIR:/model \
-v $MODEL_DIR/sglang_patch/modelopt_quant.py:/sgl-workspace/sglang/python/sglang/srt/layers/quantization/modelopt_quant.py:ro \
-v $MODEL_DIR/sglang_patch/flashinfer_trtllm.py:/sgl-workspace/sglang/python/sglang/srt/layers/moe/moe_runner/flashinfer_trtllm.py:ro \
-p 30014:30014 lmsysorg/sglang:dev-cu13-minimax-m3 \
sglang serve --model-path /model --tp 2 \
--quantization modelopt_fp4 \
--attention-backend fa4 --page-size 128 \
--moe-runner-backend flashinfer_cutlass \
--context-length 131072 --mem-fraction-static 0.90 \
--reasoning-parser auto --tool-call-parser auto \
--trust-remote-code --host 0.0.0.0 --port 30014
```
`--page-size 128` is mandatory (MSA indexing). Sampling: temperature 1.0, top-p 0.95, top-k 40.
`--moe-runner-backend flashinfer_cutlass` is the only supported MoE backend: the flashinfer-trtllm FP4 kernels cannot run M3's parameterized clamped swiglu (they ignore `gemm1_alpha`/`gemm1_beta`; the patched code fails fast with a clear error instead of generating garbage).
## Known limitations
- Engine support is bleeding-edge: M3 itself has not shipped in stable SGLang/vLLM; this NVFP4 additionally needs the swiglu-parameter fix (upstream PR pending).
- Vision tower is BF16 and untested under this serving path beyond loading; the eval table is text-only.
- KV cache quantization intentionally omitted in v1.
## Provenance
Quantized from `MiniMaxAI/MiniMax-M3` (revision 3a41b31) on 8× NVIDIA B300. During bring-up, two upstream gaps were found and fixed: modelopt's fused-experts detector did not recognize M3's `_apply_gate` expert module (experts silently skipped — PR pending), and SGLang's NVFP4 cutlass MoE path dropped custom swiglu parameters (PR pending).
|