MiniMax-M3-NVFP4

The first NVFP4 quantization of MiniMaxAI/MiniMax-M3 — 428B-total / 23B-active MoE with MiniMax Sparse Attention (MSA), quantized 2026-06-12.

~256 GB on disk (vs 854 GB BF16, 444 GB MXFP8) → serves on 2× B300/GB300 with huge KV headroom, or fits tighter Blackwell pairs
Routed + shared experts in NVFP4 (group 16, two-level scaling); attention, MSA indexer, router, embeddings, lm_head and the vision tower kept in original BF16 (no quantization round-trip — copied verbatim from the source checkpoint)
Same recipe family as nvidia/MiniMax-M2.7-NVFP4 (experts-only NVFP4), produced with TensorRT Model Optimizer 0.44.0

Quantization recipe


Method	PTQ, NVFP4 (FP4 weights+activations, FP8 per-16 block scales + FP32 global)
Tool	nvidia-modelopt 0.44.0, transformers main (native `minimax_m3_vl`)
Calibration	512 samples × 2048 tokens: cnn_dailymail + nvidia/OpenCodeReasoning + nvidia/OpenMathReasoning
Quantized	routed experts (w1/w2/w3) + shared experts, all 57 MoE layers
Excluded	attention (incl. MSA indexer), router/gate, embeddings, lm_head, vision tower, projectors
KV cache	not quantized (v1; MSA is young in engines — don't stack experiments)

Calibration deliberately uses longer sequences and reasoning traces than the modelopt defaults: NVFP4 activation scales are per-block dynamic at runtime, so calibration only pins the per-tensor global scales — reasoning-heavy data exposes the activation extremes a thinking model actually produces.

Evals

Measured via lm-evaluation-harness against the official MiniMax-M3-MXFP8 endpoint as baseline, same engine (SGLang), same sampling (temperature 1.0, top-p 0.95, model-card settings), thinking enabled. Generation caps: 16384 tokens (GPQA, MMLU), 8192 (GSM8K); at these caps truncation is negligible (<0.25% of samples).

Task	MXFP8 (official)	NVFP4 (this repo)
GSM8K (5-shot, strict)	93.93	92.57
GPQA diamond (CoT zero-shot)	76.26	69.70
MMLU (flan CoT few-shot, 25% sample)	77.36	74.16

Serving (SGLang)

Requires the MiniMax-M3 bring-up image and one patch: SGLang's NVFP4 cutlass MoE path does not yet forward M3's clamped-swiglu activation parameters (swiglu_alpha=1.702, swiglu_limit=7.0, +1 beta — same family as GPT-OSS). Without the patch the model loads but generates garbage. The patched files ship in this repo under sglang_patch/ (upstream PR pending; the second file makes the unsupported flashinfer-trtllm MoE backend fail fast instead of producing garbage).

docker run --runtime=nvidia --gpus '"device=0,1"' --ipc=host --shm-size 32g \
  -v $MODEL_DIR:/model \
  -v $MODEL_DIR/sglang_patch/modelopt_quant.py:/sgl-workspace/sglang/python/sglang/srt/layers/quantization/modelopt_quant.py:ro \
  -v $MODEL_DIR/sglang_patch/flashinfer_trtllm.py:/sgl-workspace/sglang/python/sglang/srt/layers/moe/moe_runner/flashinfer_trtllm.py:ro \
  -p 30014:30014 lmsysorg/sglang:dev-cu13-minimax-m3 \
  sglang serve --model-path /model --tp 2 \
    --quantization modelopt_fp4 \
    --attention-backend fa4 --page-size 128 \
    --moe-runner-backend flashinfer_cutlass \
    --context-length 131072 --mem-fraction-static 0.90 \
    --reasoning-parser auto --tool-call-parser auto \
    --trust-remote-code --host 0.0.0.0 --port 30014

--page-size 128 is mandatory (MSA indexing). Sampling: temperature 1.0, top-p 0.95, top-k 40.

--moe-runner-backend flashinfer_cutlass is the only supported MoE backend: the flashinfer-trtllm FP4 kernels cannot run M3's parameterized clamped swiglu (they ignore gemm1_alpha/gemm1_beta; the patched code fails fast with a clear error instead of generating garbage).

Known limitations

Engine support is bleeding-edge: M3 itself has not shipped in stable SGLang/vLLM; this NVFP4 additionally needs the swiglu-parameter fix (upstream PR pending).
Vision tower is BF16 and untested under this serving path beyond loading; the eval table is text-only.
KV cache quantization intentionally omitted in v1.

Provenance

Quantized from MiniMaxAI/MiniMax-M3 (revision 3a41b31) on 8× NVIDIA B300. During bring-up, two upstream gaps were found and fixed: modelopt's fused-experts detector did not recognize M3's _apply_gate expert module (experts silently skipped — PR pending), and SGLang's NVFP4 cutlass MoE path dropped custom swiglu parameters (PR pending).

Downloads last month: 7,129

Safetensors

Model size

245B params

Tensor type

F32

BF16

F8_E4M3

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for Mapika/MiniMax-M3-NVFP4

Base model

MiniMaxAI/MiniMax-M3

Quantized

(56)

this model