MiniMax-M3, TurboQuant+ Config-I (MLX)
⚠️ UNTESTED MODEL, USE AT YOUR OWN RISK
I did not have enough disk/RAM to host or run this model, so it has NOT been validated. No perplexity, MMLU, needle-in-a-haystack, or generation testing was performed on this M3 quant. The size and bits-per-weight figures below are the measured output of the conversion; everything about output quality is unverified. It may produce broken or degraded output.
The Config-I policy itself is proven on other MoE models (see MiniMax-M2.7-ConfigI-MLX, 93.5% MMLU), and M3 uses the same policy. However, M3 is a different, larger architecture (minimax_m3_vl, ~427B) that has not been independently confirmed to survive 2-bit expert compression. Validate before relying on it. If you run it, please report results.
🔧 PATCH REQUIRED, M3 is not in stock mlx_lm yet
MiniMax-M3 (minimax_m3_vl) has no model class in released mlx_lm. Support is in flight upstream; this quant was made against ml-explore/mlx-lm#1398 (see also #1401). Until one of those merges, you need that model class present. Two ways:
- Bundled here: minimax_m3_vl.py ships in this repo; drop it into your mlx_lm/models/ directory.
- From the PR: check out the PR branch, or pip install "git+https://github.com/ml-explore/mlx-lm.git@refs/pull/1398/head".

Once #1398/#1401 lands in a release, stock mlx_lm will load it and no patch is needed.
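The "bundled here" route can be scripted. The following is a minimal sketch, not part of this repo: the helper names find_mlx_lm_models_dir and install_model_class are hypothetical, and it assumes you have already downloaded minimax_m3_vl.py to a local path.

```python
import importlib.util
import shutil
from pathlib import Path
from typing import Optional


def find_mlx_lm_models_dir() -> Optional[Path]:
    """Return the installed mlx_lm/models/ directory, or None if mlx_lm is not importable."""
    spec = importlib.util.find_spec("mlx_lm")
    if spec is None or spec.origin is None:
        return None
    # spec.origin points at mlx_lm/__init__.py; models/ sits next to it.
    return Path(spec.origin).parent / "models"


def install_model_class(src: Path, models_dir: Path, overwrite: bool = False) -> Path:
    """Copy the bundled model file into models_dir.

    An existing file is left untouched unless overwrite=True, so a later
    mlx_lm release that ships its own minimax_m3_vl.py is not clobbered.
    """
    dest = models_dir / src.name
    if dest.exists() and not overwrite:
        return dest
    shutil.copy2(src, dest)
    return dest
```

Typical use would be to call find_mlx_lm_models_dir(), check that the result is not None, and pass it to install_model_class together with the local path of minimax_m3_vl.py. Remove the copied file once a release of mlx_lm includes the model class.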
Config-I quantization of MiniMaxAI/MiniMax-M3 (~427B total MoE, 60 layers, 128 experts/layer top-4 + 1 shared expert).
The MoE/attention weights are Config-I quantized; the vision tower and MiniMax Sparse Attention (MSA) indexer weights are retained at bf16 so a future VL/MSA-capable MLX can use them (current mlx_lm ignores them and runs the model text-only with dense attention). The policy applies aggressive 2-bit compression to expert MLPs (where MoE is most tolerant), protects attention at 4-bit, and shields boundary layers, routing, and embeddings at higher precision. See the Config-I paper for the policy derivation.
Compression
| | Size |
|---|---|
| bf16 source | ~869 GB |
| MXFP8 source (used for this conversion) | ~444 GB |
| Config-I (quantized weights 3.097 bpw) + bf16 vision/MSA | ~167 GB |
| Reduction vs bf16 | ~81% |
Includes the bf16 vision tower + MSA indexer (+2.2 GB) retained for forward-compatibility.
Converted from the official MXFP8 checkpoint (FP8 weights dequantized at load). The sensitive layers (router gates, embeddings, lm_head) are full-precision in the MXFP8 source, so Config-I's FP8→low-bit step only touches the expert/attention weights it crushes anyway.
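The reduction figure in the table follows directly from the rounded sizes. As a quick check, a small sketch (the function name reduction is illustrative only):

```python
def reduction(original_gb: float, compressed_gb: float) -> float:
    """Fraction of the original size removed by compression."""
    return 1.0 - compressed_gb / original_gb


# Rounded sizes from the table above, in GB.
BF16_GB = 869.0
MXFP8_GB = 444.0
CONFIG_I_GB = 167.0

print(f"vs bf16:  {reduction(BF16_GB, CONFIG_I_GB):.1%}")   # vs bf16:  80.8%
print(f"vs MXFP8: {reduction(MXFP8_GB, CONFIG_I_GB):.1%}")  # vs MXFP8: 62.4%
```

The 80.8% result rounds to the ~81% quoted in the table; the MXFP8 comparison is not stated in the table and is shown only for reference.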
Quality
NOT MEASURED. See the warning at the top. The tables of MMLU / PPL / NIAH / throughput that accompany the validated M2.7 release are deliberately absent here because no such measurements exist for this M3 quant.
Config-I Policy (MiniMax-M3 adaptation)
| Component | Bits | Layers | Rationale |
|---|---|---|---|
| Expert MLP gate/up (w1/w3) | 2-bit | middle 56 | bulk of params, MoE-tolerant |
| Expert MLP down (w2) | 3-bit | middle 56 | write-back sensitivity (Config-I finding) |
| Attention Q/K/V/O | 4-bit | middle 56 | uniform per layer |
| Boundary (all tensors) | 8-bit | first 2 + last 2 | boundary-layer protection |
| MoE router | f16 | all | routing precision critical |
| Embeddings + lm_head | 8-bit | n/a | protected |
Uniform MLX quantization produces broken output on MiniMax-class MoE because it compresses attention and routing to the same bit width as expert MLPs. Config-I protects the components that control coherence while compressing the ~97% of parameters (expert MLPs) that tolerate it.
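The policy table can be read as a lookup from tensor role and layer index to bit width. The sketch below illustrates that reading only; it is not the logic in convert_m3.py. The tensor path patterns (".gate", ".w1", ".w2", ".w3", "embed_tokens", "lm_head") and the 8-bit fallback for tensors outside the layer stack are assumptions, and the real checkpoint may name things differently.

```python
import re
from typing import Optional

NUM_LAYERS = 60   # from the model description above
BOUNDARY = 2      # first 2 + last 2 layers are protected

_LAYER_RE = re.compile(r"layers\.(\d+)\.")


def config_i_bits(path: str, num_layers: int = NUM_LAYERS) -> Optional[int]:
    """Return the bit width for a tensor path, or None to keep it at f16.

    Path patterns are hypothetical and chosen only to mirror the table.
    """
    # MoE router: kept at f16 in every layer, including boundary layers.
    if path.endswith(".gate"):
        return None
    # Embeddings and output head are protected at 8-bit.
    if "embed_tokens" in path or "lm_head" in path:
        return 8
    match = _LAYER_RE.search(path)
    if match is None:
        return 8  # assumption: protect anything outside the layer stack
    layer = int(match.group(1))
    # Boundary layers: every quantized tensor gets 8-bit.
    if layer < BOUNDARY or layer >= num_layers - BOUNDARY:
        return 8
    # Middle layers: expert gate/up at 2-bit, expert down at 3-bit.
    if ".w1" in path or ".w3" in path:
        return 2
    if ".w2" in path:
        return 3
    # Remaining middle-layer tensors (attention Q/K/V/O) at 4-bit.
    return 4
```

For example, under these naming assumptions config_i_bits("model.layers.30.block_sparse_moe.experts.7.w2") gives 3, while the same tensor in layer 58 gives 8 because it falls in the protected boundary.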
Compatibility
| Field | Value |
|---|---|
| Format | MLX safetensors (standard) |
| Avg bits | 3.097 bpw (quantized weights; vision + MSA-index kept bf16) |
| Runtime | mlx_lm (Python), mlx-swift-lm (Swift) |
| Model type | minimax_m3_vl (text backbone) |
| Platform | Apple Silicon, needs ~200 GB unified memory (M3 Ultra 256 GB / M-series with 192 GB+) |
| Quantized on | 2026-06-14 |
Standard MLX per-layer quantization, but M3 support is new and needs the patch above (see "🔧 Patch required"): the minimax_m3_vl model class isn't in released mlx_lm yet. Use the bundled minimax_m3_vl.py (drop into mlx_lm/models/) or the in-flight PR #1398.
How to Run
Python (mlx_lm)
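# Needs minimax_m3_vl support: use the bundled minimax_m3_vl.py or PR #1398
# (see "🔧 Patch required" above). Then, from the shell:
python -m mlx_lm.generate --model thetom-ai/MiniMax-M3-ConfigI-MLX --prompt "Hello" --temp 1.0 --top-p 0.95
# Or from Python:
from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler
model, tokenizer = load("thetom-ai/MiniMax-M3-ConfigI-MLX")
# Recent mlx_lm versions take sampling settings via a sampler object rather than temp/top_p keyword arguments.
sampler = make_sampler(temp=1.0, top_p=0.95)
print(generate(model, tokenizer, prompt="Hello", max_tokens=256, sampler=sampler))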
Note: MiniMax models are always-reasoning. Use temperature=1.0; greedy decoding (temp=0) can cause infinite thinking loops.
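The same sampling advice applies when the model is served through the OpenAI-compatible mlx_lm.server. The sketch below builds a chat request with temperature 1.0 set explicitly. It assumes a server already started with mlx_lm.server --model "thetom-ai/MiniMax-M3-ConfigI-MLX" and listening on port 8080; the helper names build_chat_request and send are illustrative, and like the rest of this card it has not been run against this quant.

```python
import json
import urllib.request

MODEL_ID = "thetom-ai/MiniMax-M3-ConfigI-MLX"


def build_chat_request(
    prompt: str,
    base_url: str = "http://localhost:8080/v1",  # assumed server address
    temperature: float = 1.0,
    top_p: float = 0.95,
    max_tokens: int = 256,
) -> urllib.request.Request:
    """Build a POST request for the /chat/completions endpoint."""
    payload = {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,  # avoid greedy decoding, per the note above
        "top_p": top_p,
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request) -> str:
    """Send the request and return the assistant message text."""
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server running, print(send(build_chat_request("Hello"))) would return the model's reply. Adjust base_url if the server was started on a different host or port.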
Limitations (current loader)
With today's minimax_m3_vl loader (PR #1398), this runs as a text-only, dense-attention model:
- No image input. The vision tower weights ship in the repo but the loader doesn't wire up VL inference yet; they are dead weight until MLX adds M3-VL support, at which point no re-quantization is needed.
- Dense attention, not MSA. MiniMax Sparse Attention is run as full causal attention, which is numerically exact (equal-or-better quality), but long context is slower and more KV-hungry than native M3. The MSA indexer weights are retained (bf16) for a future MSA-capable loader.
Both are intentional: the weights are kept so the artifact is forward-compatible without re-quantizing from source.
What is Config-I?
Config-I is a tensor-role-aware weight compression policy from TurboQuant+. Systematic A/B isolation showed that attention tensors, FFN read projections (gate/up), FFN write-back projections (down), and boundary layers have dramatically different compression sensitivity. The key insight is that compression policy matters more than compression math: what counts is which tensors to compress, which to protect, and how aggressively. For MoE models, expert MLPs dominate parameter count but tolerate aggressive compression because only a few of the 128 experts are active per token; Config-I compresses them to 2–3 bit while protecting attention and routing.
This quant was produced from the MXFP8 checkpoint with convert_m3.py. It is shared as-is, untested, for others with the hardware to evaluate it.