MiniMax-M3 — MXFP4 (mixed precision)

A 4-bit MXFP4 quantization of MiniMax-M3, produced with qstream. The routed MoE experts (≈95% of the weights) are quantized to MXFP4; everything that is quality-sensitive is kept at higher precision.

4x RTX PRO 6000 launch recipe by 0xSero: https://github.com/0xSero/minimax-m3-sm120

Size 237 GB (down from 444 GB MXFP8 source, ~53%)
Format compressed-tensors mixed-precision (E2M1 4-bit + E8M0 group-32 scales)
Base MiniMax-M3 (256K-context vision-language sparse MoE, 128 experts top-4 + 1 shared, SwiGLU-OAI, lightning-indexer block-sparse attention)

What is quantized to what

Component Precision Why
Routed experts (block_sparse_moe.experts.*) MXFP4 (4-bit) 95% of the weights — the only place worth the size win
Shared expert, attention, dense MLP MXFP8 (8-bit, native passthrough) runs on every token / sensitive — kept lossless from the source
Embeddings, lm_head, router gate, vision tower, projector, norms BF16 / F32 unchanged

Quality (this checkpoint, served on vLLM)

Metric Result
Perplexity (clean English) 5.32
GSM8K (full 1319-problem test set, chain-of-thought) 92.9% (1225/1319)

Quantization is faithful: a degraded checkpoint would show PPL in the hundreds. Eval scripts: scripts/eval_ppl.py, scripts/eval_gsm8k.py in the qstream repo.

Fidelity, footprint & provenance

  • Quantization error: routed-expert reconstruction SQNR ≈ 18.4 dB (MXFP4 vs the MXFP8 source) — i.e. only the unavoidable 4-bit rounding; the 2D-linear and 3D-MoE GEMM paths were verified bit-faithful at 55 dB / 48 dB.
  • Vision is untouched: the CLIP vision tower + projector stay BF16, so image capability equals the base model — only the text MoE is quantized.
  • Footprint: ~221 GiB of weights; fits a single ≥256 GB GPU (e.g. B300). Measured ~460 tok/s aggregate generation at 16 concurrent requests on one B300.
  • Provenance: built with qstream @cb795c3 from the MiniMax-M3 MXFP8 release; mixed-precision recipe (experts→MXFP4, rest→MXFP8).

Serving with vLLM

This checkpoint targets a MiniMax-M3-capable vLLM build. MXFP4-on-M3 is currently an experimental path in that fork, so two things are required:

  1. The config in this repo (config.json) — its config_groups target vLLM's merged runtime modules (qkv_proj, gate_up_proj), which is necessary for the fused linears to load quantized.
  2. The MoE clamp patch in vllm_patch/ — forwards the SwiGLU-OAI swiglu_limit/alpha/beta into the MXFP4 MoE quant config (without it the SWIGLUOAI_UNINTERLEAVE requires clamp_limit assertion fires). See vllm_patch/README.md.
docker run --gpus all --privileged --ipc=host -p 8000:8000 \
  -e VLLM_MXFP4_USE_MARLIN=1 \
  -v $(FOLDER-WITH-MiniMax-M3-MXFP4)/vllm_patch/compressed_tensors_moe_w4a4_mxfp4.py:/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors_moe/compressed_tensors_moe_w4a4_mxfp4.py \
  vllm/vllm-openai:minimax-m3 olka-fi/MiniMax-M3-MXFP4 \
  --block-size 128 --tool-call-parser minimax_m3 --enable-auto-tool-choice \
  --reasoning-parser minimax_m3 --load-format fastsafetensors \
  --gpu-memory-utilization 0.97 --enforce-eager --max-model-len 200000 \
  --max-num-batched-tokens 2048 --linear-backend marlin

Fits on a single ~275 GB GPU (e.g. B300/SM100). On SM120 (DGX Spark) the same Marlin path applies, but also needs the MSA SM12x sparse-attention kernels, and the ~221 GiB of weights won't fit in 2×128 GB.

License

Inherits the MiniMax Community License from the base model (non-commercial). This is a derivative (quantized) work of MiniMax-M3.

Downloads last month
2,039
Safetensors
Model size
234B params
Tensor type
F32
·
BF16
·
F8_E4M3
·
U8
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for olka-fi/MiniMax-M3-MXFP4

Quantized
(20)
this model