MiniMax-M3-NVFP4
The first NVFP4 quantization of MiniMaxAI/MiniMax-M3 — 428B-total / 23B-active MoE with MiniMax Sparse Attention (MSA), quantized 2026-06-12.
- ~256 GB on disk (vs 854 GB BF16, 444 GB MXFP8) → serves on 2× B300/GB300 with huge KV headroom, or fits tighter Blackwell pairs
- Routed + shared experts in NVFP4 (group 16, two-level scaling); attention, MSA indexer, router, embeddings, lm_head and the vision tower kept in original BF16 (no quantization round-trip — copied verbatim from the source checkpoint)
- Same recipe family as
nvidia/MiniMax-M2.7-NVFP4(experts-only NVFP4), produced with TensorRT Model Optimizer 0.44.0
Quantization recipe
| Method | PTQ, NVFP4 (FP4 weights+activations, FP8 per-16 block scales + FP32 global) |
| Tool | nvidia-modelopt 0.44.0, transformers main (native minimax_m3_vl) |
| Calibration | 512 samples × 2048 tokens: cnn_dailymail + nvidia/OpenCodeReasoning + nvidia/OpenMathReasoning |
| Quantized | routed experts (w1/w2/w3) + shared experts, all 57 MoE layers |
| Excluded | attention (incl. MSA indexer), router/gate, embeddings, lm_head, vision tower, projectors |
| KV cache | not quantized (v1; MSA is young in engines — don't stack experiments) |
Calibration deliberately uses longer sequences and reasoning traces than the modelopt defaults: NVFP4 activation scales are per-block dynamic at runtime, so calibration only pins the per-tensor global scales — reasoning-heavy data exposes the activation extremes a thinking model actually produces.
Evals
Measured via lm-evaluation-harness against the official MiniMax-M3-MXFP8 endpoint as baseline, same engine (SGLang), same sampling (temperature 1.0, top-p 0.95, model-card settings), thinking enabled. Generation caps: 16384 tokens (GPQA, MMLU), 8192 (GSM8K); at these caps truncation is negligible (<0.25% of samples).
| Task | MXFP8 (official) | NVFP4 (this repo) |
|---|---|---|
| GSM8K (5-shot, strict) | 93.93 | 92.57 |
| GPQA diamond (CoT zero-shot) | 76.26 | 69.70 |
| MMLU (flan CoT few-shot, 25% sample) | 77.36 | 74.16 |
Serving (SGLang)
Requires the MiniMax-M3 bring-up image and one patch: SGLang's NVFP4 cutlass MoE path does not yet forward M3's clamped-swiglu activation parameters (swiglu_alpha=1.702, swiglu_limit=7.0, +1 beta — same family as GPT-OSS). Without the patch the model loads but generates garbage. The patched files ship in this repo under sglang_patch/ (upstream PR pending; the second file makes the unsupported flashinfer-trtllm MoE backend fail fast instead of producing garbage).
docker run --runtime=nvidia --gpus '"device=0,1"' --ipc=host --shm-size 32g \
-v $MODEL_DIR:/model \
-v $MODEL_DIR/sglang_patch/modelopt_quant.py:/sgl-workspace/sglang/python/sglang/srt/layers/quantization/modelopt_quant.py:ro \
-v $MODEL_DIR/sglang_patch/flashinfer_trtllm.py:/sgl-workspace/sglang/python/sglang/srt/layers/moe/moe_runner/flashinfer_trtllm.py:ro \
-p 30014:30014 lmsysorg/sglang:dev-cu13-minimax-m3 \
sglang serve --model-path /model --tp 2 \
--quantization modelopt_fp4 \
--attention-backend fa4 --page-size 128 \
--moe-runner-backend flashinfer_cutlass \
--context-length 131072 --mem-fraction-static 0.90 \
--reasoning-parser auto --tool-call-parser auto \
--trust-remote-code --host 0.0.0.0 --port 30014
--page-size 128 is mandatory (MSA indexing). Sampling: temperature 1.0, top-p 0.95, top-k 40.
--moe-runner-backend flashinfer_cutlass is the only supported MoE backend: the flashinfer-trtllm FP4 kernels cannot run M3's parameterized clamped swiglu (they ignore gemm1_alpha/gemm1_beta; the patched code fails fast with a clear error instead of generating garbage).
Known limitations
- Engine support is bleeding-edge: M3 itself has not shipped in stable SGLang/vLLM; this NVFP4 additionally needs the swiglu-parameter fix (upstream PR pending).
- Vision tower is BF16 and untested under this serving path beyond loading; the eval table is text-only.
- KV cache quantization intentionally omitted in v1.
Provenance
Quantized from MiniMaxAI/MiniMax-M3 (revision 3a41b31) on 8× NVIDIA B300. During bring-up, two upstream gaps were found and fixed: modelopt's fused-experts detector did not recognize M3's _apply_gate expert module (experts silently skipped — PR pending), and SGLang's NVFP4 cutlass MoE path dropped custom swiglu parameters (PR pending).
- Downloads last month
- 704
Model tree for Mapika/MiniMax-M3-NVFP4
Base model
MiniMaxAI/MiniMax-M3