MiniMax-M3, TurboQuant+ Config-I (MLX)

⚠️ UNTESTED MODEL, USE AT YOUR OWN RISK

I did not have enough disk/RAM to host or run this model, so it has NOT been validated. No perplexity, MMLU, needle-in-a-haystack, or generation testing was performed on this M3 quant. The size and bits-per-weight figures below are the measured output of the conversion; everything about output quality is unverified. It may produce broken or degraded output.

The Config-I policy itself is proven on other MoE models (see MiniMax-M2.7-ConfigI-MLX, 93.5% MMLU), and M3 uses the same policy, but M3 is a different, larger architecture (minimax_m3_vl, ~427B) that has not been independently confirmed to survive 2-bit expert compression. Validate before relying on it. If you run it, please report results.

🔧 PATCH REQUIRED: M3 is not in stock mlx_lm yet

MiniMax-M3 (minimax_m3_vl) has no model class in released mlx_lm. Support is in flight upstream: this quant was made against ml-explore/mlx-lm#1398 (see also #1401). Until one of those merges, you need to supply that model class yourself. There are two ways:

  • Bundled here: minimax_m3_vl.py ships in this repo; drop it into your mlx_lm/models/ directory.
  • From the PR: check out the PR branch, or pip install "git+https://github.com/ml-explore/mlx-lm.git@refs/pull/1398/head".

Once #1398 or #1401 lands in a release, stock mlx_lm will load the model and no patch is needed.
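Before loading, it can help to confirm which situation you are in. The sketch below checks whether a minimax_m3_vl module is importable from mlx_lm.models and, if not, offers a helper that copies the bundled file into place. The helper names (`has_model_class`, `install_model_file`) are hypothetical, and the sketch assumes mlx_lm resolves model classes by module name under mlx_lm/models/, as the drop-in instructions above imply.

```python
import importlib.util
import shutil
from pathlib import Path


def has_model_class(model_type: str, package: str = "mlx_lm.models") -> bool:
    """Return True if `<package>.<model_type>` can be found on this install."""
    try:
        return importlib.util.find_spec(f"{package}.{model_type}") is not None
    except ModuleNotFoundError:
        # Raised when the parent package itself is not installed.
        return False


def install_model_file(src: str, package: str = "mlx_lm.models") -> Path:
    """Copy a bundled model file into the package directory; return the new path."""
    try:
        spec = importlib.util.find_spec(package)
    except ModuleNotFoundError:
        spec = None
    if spec is None or not spec.submodule_search_locations:
        raise RuntimeError(f"{package} is not an installed package")
    dest = Path(list(spec.submodule_search_locations)[0]) / Path(src).name
    shutil.copy(src, dest)
    return dest


if __name__ == "__main__":
    if has_model_class("minimax_m3_vl"):
        print("minimax_m3_vl is already available; no patch needed")
    else:
        print("minimax_m3_vl is missing; run install_model_file('minimax_m3_vl.py')")
```

Copying into site-packages modifies your installed mlx_lm, so a dedicated virtual environment is a reasonable precaution.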

Config-I quantization of MiniMaxAI/MiniMax-M3 (~427B total MoE, 60 layers, 128 experts/layer top-4 + 1 shared expert). The MoE/attention weights are Config-I quantized; the vision tower and MiniMax Sparse Attention (MSA) indexer weights are retained at bf16 so a future VL/MSA-capable MLX can use them (current mlx_lm ignores them and runs the model text-only with dense attention). The policy applies aggressive 2-bit compression to expert MLPs (where MoE is most tolerant), protects attention at 4-bit, and shields boundary layers, routing, and embeddings at higher precision. See the Config-I paper for the policy derivation.

Compression

|  | Size |
| --- | --- |
| bf16 source | ~869 GB |
| MXFP8 source (used for this conversion) | ~444 GB |
| Config-I (quantized weights 3.097 bpw) + bf16 vision/MSA | ~167 GB |
| Reduction vs bf16 | ~81% |

The ~167 GB figure includes the bf16 vision tower and MSA indexer (+2.2 GB), retained for forward compatibility.
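As a rough sanity check, the size figures can be reproduced from the numbers quoted in this card. The sketch assumes all of the ~427B parameters are stored at the 3.097 bpw average, which is a simplification since the parameter count is only approximate.

```python
# Inputs quoted in this card; the parameter count is approximate ("~427B").
total_params = 427e9
bits_per_weight = 3.097
bf16_extras_gb = 2.2      # vision tower + MSA indexer kept at bf16
bf16_source_gb = 869

# Assumption: every one of the ~427B parameters sits at the 3.097 bpw average.
quantized_gb = total_params * bits_per_weight / 8 / 1e9
artifact_gb = quantized_gb + bf16_extras_gb
reduction = 1 - artifact_gb / bf16_source_gb

print(f"quantized weights: {quantized_gb:.1f} GB")
print(f"artifact total:    {artifact_gb:.1f} GB")
print(f"reduction vs bf16: {reduction:.1%}")
```

The result lands within about a gigabyte of the ~167 GB and ~81% reported above.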

Converted from the official MXFP8 checkpoint (FP8 weights dequantized at load). The sensitive layers (router gates, embeddings, lm_head) are full-precision in the MXFP8 source, so Config-I's FP8→low-bit step only touches the expert/attention weights it crushes anyway.

Quality

NOT MEASURED. See the warning at the top. The tables of MMLU / PPL / NIAH / throughput that accompany the validated M2.7 release are deliberately absent here because no such measurements exist for this M3 quant.

Config-I Policy (MiniMax-M3 adaptation)

| Component | Bits | Layers | Rationale |
| --- | --- | --- | --- |
| Expert MLP gate/up (w1/w3) | 2-bit | middle 56 | bulk of params, MoE-tolerant |
| Expert MLP down (w2) | 3-bit | middle 56 | write-back sensitivity (Config-I finding) |
| Attention Q/K/V/O | 4-bit | middle 56 | uniform per layer |
| Boundary (all tensors) | 8-bit | first 2 + last 2 | boundary-layer protection |
| MoE router | f16 | all | routing precision critical |
| Embeddings + lm_head | 8-bit | n/a | protected |

Uniform MLX quantization produces broken output on MiniMax-class MoE because it compresses attention and routing to the same bits as expert MLPs. Config-I protects the components that control coherence while compressing the ~97% of parameters (expert MLPs) that tolerate it.
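The policy table can be read as a lookup from tensor role and layer index to bit width. The sketch below is an illustration of that reading, not the code in convert_m3.py: the role names are hypothetical labels, and the precedence (router stays f16 even inside boundary layers) is an assumption based on the "all" entry in the router row.

```python
from typing import Optional

NUM_LAYERS = 60   # from the model description
BOUNDARY = 2      # first 2 and last 2 layers are protected


def config_i_bits(role: str, layer: Optional[int] = None) -> Optional[int]:
    """Return the bit width for a tensor role, or None to keep it in f16."""
    if role == "router":
        return None                    # routing stays f16 in every layer
    if role in ("embedding", "lm_head"):
        return 8
    if layer is None:
        raise ValueError(f"role {role!r} needs a layer index")
    if layer < BOUNDARY or layer >= NUM_LAYERS - BOUNDARY:
        return 8                       # boundary layers: all tensors 8-bit
    if role == "expert_gate_up":       # w1 / w3
        return 2
    if role == "expert_down":          # w2
        return 3
    if role == "attention":            # Q / K / V / O
        return 4
    raise ValueError(f"unknown role: {role!r}")


if __name__ == "__main__":
    for layer in (0, 2, 30, 57, 58):
        print(layer, config_i_bits("expert_gate_up", layer))
```

A function of this shape could in principle be adapted into a per-layer quantization predicate, but the mapping from real tensor names to roles would have to come from the actual checkpoint.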

Compatibility

| Field | Value |
| --- | --- |
| Format | MLX safetensors (standard) |
| Avg bits | 3.097 bpw (quantized weights; vision + MSA-index kept bf16) |
| Runtime | mlx_lm (Python), mlx-swift-lm (Swift) |
| Model type | minimax_m3_vl (text backbone) |
| Platform | Apple Silicon, needs ~200 GB unified memory (M3 Ultra 256 GB / M-series with 192 GB+) |
| Quantized on | 2026-06-14 |

This is standard MLX per-layer quantization. The only extra requirement is the minimax_m3_vl model class, which is not in released mlx_lm yet (see "🔧 Patch required" above): use the bundled minimax_m3_vl.py (drop it into mlx_lm/models/) or the in-flight PR #1398.
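A quick way to check a machine against the memory requirement before downloading ~167 GB is sketched below. The function names are hypothetical, and the comparison treats the ~200 GB figure as decimal gigabytes, which is an assumption; it also ignores any limit the OS places on how much unified memory the GPU may use.

```python
import os

REQUIRED_GB = 200   # "~200 GB unified memory" from the Compatibility table


def physical_memory_gb() -> float:
    """Total physical RAM in decimal GB, read via POSIX sysconf."""
    return os.sysconf("SC_PHYS_PAGES") * os.sysconf("SC_PAGE_SIZE") / 1e9


def can_host(memory_gb: float, required_gb: float = REQUIRED_GB) -> bool:
    """True if the machine meets the stated memory requirement."""
    return memory_gb >= required_gb


if __name__ == "__main__":
    mem = physical_memory_gb()
    verdict = "meets" if can_host(mem) else "is below"
    print(f"{mem:.0f} GB physical memory {verdict} the ~{REQUIRED_GB} GB requirement")
```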

How to Run

Python (mlx_lm)

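# Needs minimax_m3_vl support: use the bundled minimax_m3_vl.py or PR #1398
# (see "🔧 Patch required" above). Then, from the command line:
python -m mlx_lm.generate --model thetom-ai/MiniMax-M3-ConfigI-MLX --prompt "Hello" --temp 1.0 --top-p 0.95

# Or from Python. Recent mlx_lm releases take sampling settings through a sampler object.
from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler

model, tokenizer = load("thetom-ai/MiniMax-M3-ConfigI-MLX")
sampler = make_sampler(temp=1.0, top_p=0.95)
print(generate(model, tokenizer, prompt="Hello", max_tokens=256, sampler=sampler))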

Note: MiniMax models are always-reasoning, so use temperature 1.0. Greedy decoding (temp=0) can cause infinite thinking loops.

Limitations (current loader)

With today's minimax_m3_vl loader (PR #1398), this runs as a text-only, dense-attention model:

  • No image input. The vision tower weights ship in the repo, but the loader doesn't wire up VL inference yet. They are dead weight until MLX adds M3-VL support, at which point no re-quantization is needed.
  • Dense attention, not MSA. MiniMax Sparse Attention is run as full causal attention. This computes exact attention over every position rather than an indexer-selected subset, so quality should be equal or better, but long context is slower and more KV-hungry than native M3. The MSA indexer weights are retained (bf16) for a future MSA-capable loader.

Both are intentional: the weights are kept so the artifact is forward-compatible without re-quantizing from source.

What is Config-I?

Config-I is a tensor-role-aware weight compression policy from TurboQuant+. Systematic A/B isolation showed that attention tensors, FFN read projections (gate/up), FFN write-back projections (down), and boundary layers differ dramatically in compression sensitivity. The key insight is that compression policy matters more than compression math: what counts is which tensors to compress, which to protect, and how aggressively. For MoE models, expert MLPs dominate parameter count but tolerate aggressive compression because only a few of the 128 experts are active per token; Config-I compresses them to 2–3 bit while protecting attention and routing.


This quant was produced from the MXFP8 checkpoint with convert_m3.py. It is shared as-is, untested, for others with the hardware to evaluate it.
