File size: 4,785 Bytes
6684358
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
---
license: other
license_name: minimax
license_link: https://huggingface.co/MiniMaxAI/MiniMax-M3/blob/main/LICENSE
base_model: MiniMaxAI/MiniMax-M3
base_model_relation: quantized
tags:
  - nvfp4
  - fp4
  - modelopt
  - minimax
  - moe
  - quantized
---

# MiniMax-M3-NVFP4

**The first NVFP4 quantization of [MiniMaxAI/MiniMax-M3](https://huggingface.co/MiniMaxAI/MiniMax-M3)** — 428B-total / 23B-active MoE with MiniMax Sparse Attention (MSA), quantized 2026-06-12.

- **~256 GB** on disk (vs 854 GB BF16, 444 GB MXFP8) → serves on **2× B300/GB300** with huge KV headroom, or fits tighter Blackwell pairs
- Routed + shared **experts in NVFP4** (group 16, two-level scaling); attention, MSA indexer, router, embeddings, lm_head and the vision tower kept in **original BF16** (no quantization round-trip — copied verbatim from the source checkpoint)
- Same recipe family as `nvidia/MiniMax-M2.7-NVFP4` (experts-only NVFP4), produced with **TensorRT Model Optimizer 0.44.0**

## Quantization recipe

| | |
|---|---|
| Method | PTQ, NVFP4 (FP4 weights+activations, FP8 per-16 block scales + FP32 global) |
| Tool | nvidia-modelopt 0.44.0, transformers main (native `minimax_m3_vl`) |
| Calibration | 512 samples × 2048 tokens: cnn_dailymail + nvidia/OpenCodeReasoning + nvidia/OpenMathReasoning |
| Quantized | routed experts (w1/w2/w3) + shared experts, all 57 MoE layers |
| Excluded | attention (incl. MSA indexer), router/gate, embeddings, lm_head, vision tower, projectors |
| KV cache | not quantized (v1; MSA is young in engines — don't stack experiments) |

Calibration deliberately uses longer sequences and reasoning traces than the modelopt defaults: NVFP4 activation scales are per-block *dynamic* at runtime, so calibration only pins the per-tensor global scales — reasoning-heavy data exposes the activation extremes a thinking model actually produces.

## Evals

Measured via lm-evaluation-harness against the official **MiniMax-M3-MXFP8** endpoint as baseline, same engine (SGLang), same sampling (temperature 1.0, top-p 0.95, model-card settings), thinking enabled. Generation caps: 16384 tokens (GPQA, MMLU), 8192 (GSM8K); at these caps truncation is negligible (<0.25% of samples).

| Task | MXFP8 (official) | **NVFP4 (this repo)** |
|---|---|---|
| GSM8K (5-shot, strict) | 93.93 | 92.57 |
| GPQA diamond (CoT zero-shot) | 76.26 | 69.70 |
| MMLU (flan CoT few-shot, 25% sample) | 77.36 | 74.16 |

## Serving (SGLang)

Requires the MiniMax-M3 bring-up image and **one patch**: SGLang's NVFP4 cutlass MoE path does not yet forward M3's clamped-swiglu activation parameters (`swiglu_alpha=1.702`, `swiglu_limit=7.0`, `+1` beta — same family as GPT-OSS). Without the patch the model loads but generates garbage. The patched files ship in this repo under `sglang_patch/` (upstream PR pending; the second file makes the unsupported flashinfer-trtllm MoE backend fail fast instead of producing garbage).

```bash
docker run --runtime=nvidia --gpus '"device=0,1"' --ipc=host --shm-size 32g \
  -v $MODEL_DIR:/model \
  -v $MODEL_DIR/sglang_patch/modelopt_quant.py:/sgl-workspace/sglang/python/sglang/srt/layers/quantization/modelopt_quant.py:ro \
  -v $MODEL_DIR/sglang_patch/flashinfer_trtllm.py:/sgl-workspace/sglang/python/sglang/srt/layers/moe/moe_runner/flashinfer_trtllm.py:ro \
  -p 30014:30014 lmsysorg/sglang:dev-cu13-minimax-m3 \
  sglang serve --model-path /model --tp 2 \
    --quantization modelopt_fp4 \
    --attention-backend fa4 --page-size 128 \
    --moe-runner-backend flashinfer_cutlass \
    --context-length 131072 --mem-fraction-static 0.90 \
    --reasoning-parser auto --tool-call-parser auto \
    --trust-remote-code --host 0.0.0.0 --port 30014
```

`--page-size 128` is mandatory (MSA indexing). Sampling: temperature 1.0, top-p 0.95, top-k 40.

`--moe-runner-backend flashinfer_cutlass` is the only supported MoE backend: the flashinfer-trtllm FP4 kernels cannot run M3's parameterized clamped swiglu (they ignore `gemm1_alpha`/`gemm1_beta`; the patched code fails fast with a clear error instead of generating garbage).

## Known limitations

- Engine support is bleeding-edge: M3 itself has not shipped in stable SGLang/vLLM; this NVFP4 additionally needs the swiglu-parameter fix (upstream PR pending).
- Vision tower is BF16 and untested under this serving path beyond loading; the eval table is text-only.
- KV cache quantization intentionally omitted in v1.

## Provenance

Quantized from `MiniMaxAI/MiniMax-M3` (revision 3a41b31) on 8× NVIDIA B300. During bring-up, two upstream gaps were found and fixed: modelopt's fused-experts detector did not recognize M3's `_apply_gate` expert module (experts silently skipped — PR pending), and SGLang's NVFP4 cutlass MoE path dropped custom swiglu parameters (PR pending).