---
license: other
license_name: minimax
license_link: https://huggingface.co/MiniMaxAI/MiniMax-M3/blob/main/LICENSE
base_model: MiniMaxAI/MiniMax-M3
base_model_relation: quantized
tags:
- nvfp4
- fp4
- modelopt
- minimax
- moe
- quantized
---

# MiniMax-M3-NVFP4

**The first NVFP4 quantization of [MiniMaxAI/MiniMax-M3](https://huggingface.co/MiniMaxAI/MiniMax-M3)**: 428B-total / 23B-active MoE with MiniMax Sparse Attention (MSA), quantized 2026-06-12.

- **~256 GB** on disk (vs 854 GB BF16, 444 GB MXFP8) → serves on **2× B300/GB300** with huge KV headroom, or fits tighter Blackwell pairs
- Routed + shared **experts in NVFP4** (group 16, two-level scaling); attention, MSA indexer, router, embeddings, lm_head and the vision tower kept in **original BF16** (no quantization round-trip; copied verbatim from the source checkpoint)
- Same recipe family as `nvidia/MiniMax-M2.7-NVFP4` (experts-only NVFP4), produced with **TensorRT Model Optimizer 0.44.0**

## Quantization recipe

| | |
|---|---|
| Method | PTQ, NVFP4 (FP4 weights+activations, FP8 per-16 block scales + FP32 global) |
| Tool | nvidia-modelopt 0.44.0, transformers main (native `minimax_m3_vl`) |
| Calibration | 512 samples × 2048 tokens: cnn_dailymail + nvidia/OpenCodeReasoning + nvidia/OpenMathReasoning |
| Quantized | routed experts (w1/w2/w3) + shared experts, all 57 MoE layers |
| Excluded | attention (incl. MSA indexer), router/gate, embeddings, lm_head, vision tower, projectors |
| KV cache | not quantized (v1; MSA is young in engines, so don't stack experiments) |

Calibration deliberately uses longer sequences and reasoning traces than the modelopt defaults: NVFP4 activation scales are per-block *dynamic* at runtime, so calibration only pins the per-tensor global scales; reasoning-heavy data exposes the activation extremes a thinking model actually produces.
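
To make the two-level scaling concrete, here is a minimal pure-Python sketch of NVFP4-style fake quantization. It is an illustration under stated assumptions, not the modelopt implementation: the function names are hypothetical, the global scale is assumed to be `amax / (6 * 448)`, and the FP8 E4M3 rounding of the block scale is omitted (only its range limit of 448 is applied).

```python
import math

# Magnitudes representable by FP4 E2M1 (the sign is handled separately).
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
E2M1_MAX = 6.0     # largest FP4 magnitude
E4M3_MAX = 448.0   # largest finite FP8 E4M3 value
GROUP = 16         # elements sharing one block scale


def round_to_e2m1(x: float) -> float:
    """Round to the nearest FP4 value, saturating at +/-6.

    Ties resolve to the smaller magnitude here; hardware rounding may differ.
    """
    mag = min(abs(x), E2M1_MAX)
    nearest = min(E2M1_GRID, key=lambda g: abs(g - mag))
    return math.copysign(nearest, x)


def global_scale(values: list[float]) -> float:
    """Per-tensor scale (assumed convention: amax / (6 * 448))."""
    amax = max(abs(v) for v in values)
    return amax / (E2M1_MAX * E4M3_MAX) if amax > 0 else 1.0


def fake_quant_nvfp4(values: list[float], g_scale: float) -> list[float]:
    """Quantize and dequantize with a fixed global scale and dynamic block scales."""
    out = []
    for start in range(0, len(values), GROUP):
        block = values[start:start + GROUP]
        block_amax = max(abs(v) for v in block)
        # Block scale in units of the global scale. Real kernels store it in
        # FP8 E4M3; only the range limit is modelled here, not the rounding.
        block_scale = min(block_amax / E2M1_MAX / g_scale, E4M3_MAX)
        step = block_scale * g_scale
        if step == 0.0:
            out.extend(0.0 for _ in block)
            continue
        out.extend(round_to_e2m1(v / step) * step for v in block)
    return out
```

In this sketch, a global scale calibrated on data whose largest magnitude is 1.0 clips a runtime activation of 4.0 down to about 1.0, because the block scale saturates at 448. That is the failure mode that calibrating on reasoning-heavy data is meant to avoid.
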
## Evals

Measured via lm-evaluation-harness against the official **MiniMax-M3-MXFP8** endpoint as baseline, same engine (SGLang), same sampling (temperature 1.0, top-p 0.95, model-card settings), thinking enabled. Generation caps: 16384 tokens (GPQA, MMLU), 8192 (GSM8K); at these caps truncation is negligible (<0.25% of samples).

| Task | MXFP8 (official) | **NVFP4 (this repo)** |
|---|---|---|
| GSM8K (5-shot, strict) | 93.93 | 92.57 |
| GPQA diamond (CoT zero-shot) | 76.26 | 69.70 |
| MMLU (flan CoT few-shot, 25% sample) | 77.36 | 74.16 |

## Serving (SGLang)

Requires the MiniMax-M3 bring-up image and **one patch**: SGLang's NVFP4 cutlass MoE path does not yet forward M3's clamped-swiglu activation parameters (`swiglu_alpha=1.702`, `swiglu_limit=7.0`, `+1` beta; same family as GPT-OSS). Without the patch the model loads but generates garbage. The patched files ship in this repo under `sglang_patch/` (upstream PR pending; the second file makes the unsupported flashinfer-trtllm MoE backend fail fast instead of producing garbage).

```bash
docker run --runtime=nvidia --gpus '"device=0,1"' --ipc=host --shm-size 32g \
  -v $MODEL_DIR:/model \
  -v $MODEL_DIR/sglang_patch/modelopt_quant.py:/sgl-workspace/sglang/python/sglang/srt/layers/quantization/modelopt_quant.py:ro \
  -v $MODEL_DIR/sglang_patch/flashinfer_trtllm.py:/sgl-workspace/sglang/python/sglang/srt/layers/moe/moe_runner/flashinfer_trtllm.py:ro \
  -p 30014:30014 lmsysorg/sglang:dev-cu13-minimax-m3 \
  sglang serve --model-path /model --tp 2 \
    --quantization modelopt_fp4 \
    --attention-backend fa4 --page-size 128 \
    --moe-runner-backend flashinfer_cutlass \
    --context-length 131072 --mem-fraction-static 0.90 \
    --reasoning-parser auto --tool-call-parser auto \
    --trust-remote-code --host 0.0.0.0 --port 30014
```

`--page-size 128` is mandatory (MSA indexing). Sampling: temperature 1.0, top-p 0.95, top-k 40.
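
For readers unfamiliar with the clamped swiglu that the patch forwards, the sketch below shows why dropping its parameters changes the output. It assumes M3 uses the same form as the GPT-OSS reference implementation (the card only says "same family"), so the exact formula, the function names, and the `beta` argument are assumptions for illustration, not taken from the M3 source.

```python
import math

SWIGLU_ALPHA = 1.702  # value quoted in this card
SWIGLU_LIMIT = 7.0    # value quoted in this card
SWIGLU_BETA = 1.0     # the "+1" added to the linear branch


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def clamped_swiglu(gate: float, up: float,
                   alpha: float = SWIGLU_ALPHA,
                   limit: float = SWIGLU_LIMIT,
                   beta: float = SWIGLU_BETA) -> float:
    """Assumed GPT-OSS-style form: (up + beta) * gate * sigmoid(alpha * gate)."""
    gate = min(gate, limit)            # gate branch is clamped from above only
    up = max(-limit, min(up, limit))   # linear branch is clamped on both sides
    return (up + beta) * gate * sigmoid(alpha * gate)


def plain_swiglu(gate: float, up: float) -> float:
    """Standard swiglu, i.e. what a kernel computes if the parameters are dropped."""
    return up * gate * sigmoid(gate)
```

Under these assumptions the two functions disagree even for small inputs: `clamped_swiglu(1.0, 1.0)` is roughly 1.69 while `plain_swiglu(1.0, 1.0)` is roughly 0.73. This is consistent with the card's observation that an unpatched path loads but generates garbage.
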

`--moe-runner-backend flashinfer_cutlass` is the only supported MoE backend: the flashinfer-trtllm FP4 kernels cannot run M3's parameterized clamped swiglu (they ignore `gemm1_alpha`/`gemm1_beta`; the patched code fails fast with a clear error instead of generating garbage).

## Known limitations

- Engine support is bleeding-edge: M3 itself has not shipped in stable SGLang/vLLM; this NVFP4 additionally needs the swiglu-parameter fix (upstream PR pending).
- Vision tower is BF16 and untested under this serving path beyond loading; the eval table is text-only.
- KV cache quantization intentionally omitted in v1.

## Provenance

Quantized from `MiniMaxAI/MiniMax-M3` (revision 3a41b31) on 8× NVIDIA B300. During bring-up, two upstream gaps were found and fixed: modelopt's fused-experts detector did not recognize M3's `_apply_gate` expert module (experts silently skipped; PR pending), and SGLang's NVFP4 cutlass MoE path dropped custom swiglu parameters (PR pending).