Instructions to use thetom-ai/MiniMax-M3-ConfigI-MLX with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use thetom-ai/MiniMax-M3-ConfigI-MLX with MLX:

# Make sure mlx-lm is installed
# pip install --upgrade mlx-lm

# Generate text with mlx-lm
from mlx_lm import load, generate

model, tokenizer = load("thetom-ai/MiniMax-M3-ConfigI-MLX")

prompt = "Write a story about Einstein"
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True
)

text = generate(model, tokenizer, prompt=prompt, verbose=True)

Local Apps

Pi

How to use thetom-ai/MiniMax-M3-ConfigI-MLX with Pi:

Start the MLX server

# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "thetom-ai/MiniMax-M3-ConfigI-MLX"

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "mlx-lm": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "thetom-ai/MiniMax-M3-ConfigI-MLX"
        }
      ]
    }
  }
}
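
If ~/.pi/agent/models.json already lists other providers, pasting the block above over it would drop them. The sketch below merges the same entry into an existing config instead. The helper names (`add_mlx_provider`, `update_models_file`) are hypothetical and not part of Pi; only the JSON fields shown above are taken from this page.

```python
import json
from pathlib import Path

MODEL_ID = "thetom-ai/MiniMax-M3-ConfigI-MLX"


def add_mlx_provider(config: dict, model_id: str = MODEL_ID) -> dict:
    """Return a copy of `config` with the mlx-lm provider entry added."""
    merged = json.loads(json.dumps(config))  # deep copy of plain JSON data
    providers = merged.setdefault("providers", {})
    providers["mlx-lm"] = {
        "baseUrl": "http://localhost:8080/v1",
        "api": "openai-completions",
        "apiKey": "none",
        "models": [{"id": model_id}],
    }
    return merged


def update_models_file(path: Path) -> None:
    """Merge the provider into the JSON file at `path`, creating it if needed."""
    existing = json.loads(path.read_text()) if path.exists() else {}
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(add_mlx_provider(existing), indent=2) + "\n")


# Preview the resulting config without touching any file.
print(json.dumps(add_mlx_provider({}), indent=2))
```

To apply it, call `update_models_file(Path.home() / ".pi" / "agent" / "models.json")`.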

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use thetom-ai/MiniMax-M3-ConfigI-MLX with Hermes Agent:

Start the MLX server

# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "thetom-ai/MiniMax-M3-ConfigI-MLX"

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default thetom-ai/MiniMax-M3-ConfigI-MLX

Run Hermes

hermes

OpenClaw

How to use thetom-ai/MiniMax-M3-ConfigI-MLX with OpenClaw:

Start the MLX server

# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "thetom-ai/MiniMax-M3-ConfigI-MLX"

Configure OpenClaw

# Install OpenClaw:
npm install -g openclaw@latest
# Register the local server and set it as the default model:
openclaw onboard --non-interactive --mode local \
  --auth-choice custom-api-key \
  --custom-base-url http://127.0.0.1:8080/v1 \
  --custom-model-id "thetom-ai/MiniMax-M3-ConfigI-MLX" \
  --custom-provider-id mlx-lm \
  --custom-compatibility openai \
  --custom-text-input \
  --accept-risk \
  --skip-health

Run OpenClaw

openclaw agent --local --agent main --message "Hello from Hugging Face"

MLX LM

How to use thetom-ai/MiniMax-M3-ConfigI-MLX with MLX LM:

Generate or start a chat session

# Install MLX LM
uv tool install mlx-lm
# Interactive chat REPL
mlx_lm.chat --model "thetom-ai/MiniMax-M3-ConfigI-MLX"

Run an OpenAI-compatible server

# Install MLX LM
uv tool install mlx-lm
# Start the server
mlx_lm.server --model "thetom-ai/MiniMax-M3-ConfigI-MLX"
# In a second terminal, call the OpenAI-compatible server with curl
# (mlx_lm.server listens on port 8080 by default)
curl -X POST "http://localhost:8080/v1/chat/completions" \
   -H "Content-Type: application/json" \
   --data '{
     "model": "thetom-ai/MiniMax-M3-ConfigI-MLX",
     "messages": [
       {"role": "user", "content": "Hello"}
     ]
   }'
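
The same request can be sent from Python with only the standard library. This is a minimal sketch, assuming the server runs on its default port 8080 and returns the usual OpenAI chat-completions response shape. The function names are illustrative, and the `max_tokens` and `temperature` values are assumptions chosen to match the sampling note further down this card.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080/v1"  # assumed default mlx_lm.server address
MODEL_ID = "thetom-ai/MiniMax-M3-ConfigI-MLX"


def build_chat_request(prompt, base_url=BASE_URL, model=MODEL_ID, max_tokens=256):
    """Build (but do not send) a chat-completions POST request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 1.0,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, timeout=600.0, **kwargs):
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(chat("Hello"))` prints the reply. The generous timeout is a guess to allow for slow first-token latency on a model this large.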

MiniMax-M3, TurboQuant+ Config-I (MLX)

⚠️ UNTESTED MODEL: USE AT YOUR OWN RISK

I did not have enough disk/RAM to host or run this model, so it has NOT been validated. No perplexity, MMLU, needle-in-a-haystack, or generation testing was performed on this M3 quant. The size and bits-per-weight figures below are the measured output of the conversion; everything about output quality is unverified. It may produce broken or degraded output.

The Config-I policy itself is proven on other MoE models (see MiniMax-M2.7-ConfigI-MLX, 93.5% MMLU), and M3 uses the same policy, but M3 is a different, larger architecture (minimax_m3_vl, ~427B) that has not been independently confirmed to survive 2-bit expert compression. Validate before relying on it. If you run it, please report results.

🔧 PATCH REQUIRED: M3 is not in stock mlx_lm yet

MiniMax-M3 (minimax_m3_vl) has no model class in released mlx_lm. Support is in flight upstream; this quant was made against ml-explore/mlx-lm#1398 (see also #1401). Until one of those merges, you have to supply that model class yourself. There are two ways:

Bundled here: minimax_m3_vl.py ships in this repo; drop it into your mlx_lm/models/ directory.

From the PR: check out the PR branch, or pip install "git+https://github.com/ml-explore/mlx-lm.git@refs/pull/1398/head".

Once #1398 or #1401 lands in a release, stock mlx_lm will load the model and no patch is needed.
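
Before downloading ~167 GB, it may be worth checking whether your installed mlx_lm already has the model class. The sketch below looks for a module named after the model type under mlx_lm.models, which is where the bundled file is meant to go according to the instructions above. The helper names are illustrative, and the check reports "not found" if mlx_lm itself is missing.

```python
import importlib.util
from pathlib import Path


def has_model_class(model_type, package="mlx_lm.models"):
    """True if `package` contains a module named `model_type`."""
    try:
        return importlib.util.find_spec(f"{package}.{model_type}") is not None
    except ImportError:
        # Raised when the parent package is missing or fails to import.
        return False


def models_dir(package="mlx_lm"):
    """Directory to drop minimax_m3_vl.py into, or None if mlx_lm is absent."""
    spec = importlib.util.find_spec(package)
    if spec is None or not spec.submodule_search_locations:
        return None
    return Path(list(spec.submodule_search_locations)[0]) / "models"


if has_model_class("minimax_m3_vl"):
    print("minimax_m3_vl is available; no patch needed")
else:
    print("minimax_m3_vl not found; copy minimax_m3_vl.py into:", models_dir())
```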

Config-I quantization of MiniMaxAI/MiniMax-M3 (~427B total MoE, 60 layers, 128 experts/layer top-4 + 1 shared expert). The MoE/attention weights are Config-I quantized; the vision tower and MiniMax Sparse Attention (MSA) indexer weights are retained at bf16 so a future VL/MSA-capable MLX can use them (current mlx_lm ignores them and runs the model text-only with dense attention). The policy applies aggressive 2-bit compression to expert MLPs (where MoE is most tolerant), protects attention at 4-bit, and shields boundary layers, routing, and embeddings at higher precision. See the Config-I paper for the policy derivation.

Compression

Checkpoint	Size
bf16 source	~869 GB
MXFP8 source (used for this conversion)	~444 GB
Config-I (quantized weights 3.097 bpw) + bf16 vision/MSA	~167 GB
Reduction vs bf16	~81%

The Config-I size includes the bf16 vision tower + MSA indexer (+2.2 GB), retained for forward compatibility.
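
The figures in the table can be cross-checked with a little arithmetic. The sketch below recomputes the reduction and backs out the weight count implied by the quantized size and the 3.097 bpw average. It assumes 1 GB = 10^9 bytes and that the bpw figure already includes quantization scales and biases; neither is stated on this card.

```python
BF16_GB = 869.0        # bf16 source
MXFP8_GB = 444.0       # MXFP8 source
CONFIG_I_GB = 167.0    # Config-I output, including bf16 extras
BF16_EXTRAS_GB = 2.2   # vision tower + MSA indexer kept at bf16
QUANT_BPW = 3.097      # average bits per quantized weight


def reduction(original_gb, compressed_gb):
    """Fraction of the original size removed by compression."""
    return 1.0 - compressed_gb / original_gb


# Bytes of quantized weights only, converted to bits and divided by the
# average bits per weight (assumes 1 GB = 1e9 bytes).
quantized_gb = CONFIG_I_GB - BF16_EXTRAS_GB
implied_params_billion = quantized_gb * 8 / QUANT_BPW

print(f"Reduction vs bf16:  {reduction(BF16_GB, CONFIG_I_GB):.1%}")
print(f"Reduction vs MXFP8: {reduction(MXFP8_GB, CONFIG_I_GB):.1%}")
print(f"Implied quantized weights: ~{implied_params_billion:.1f}B")
```

Under those assumptions the implied count comes out at roughly 426B, in line with the ~427B total quoted for the model.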

Converted from the official MXFP8 checkpoint (FP8 weights dequantized at load). The sensitive layers (router gates, embeddings, lm_head) are full-precision in the MXFP8 source, so Config-I's FP8→low-bit step only touches the expert/attention weights it crushes anyway.

Quality

NOT MEASURED. See the warning at the top. The tables of MMLU / PPL / NIAH / throughput that accompany the validated M2.7 release are deliberately absent here because no such measurements exist for this M3 quant.

Config-I Policy (MiniMax-M3 adaptation)

Component	Bits	Layers	Rationale
Expert MLP gate/up (w1/w3)	2-bit	middle 56	bulk of params, MoE-tolerant
Expert MLP down (w2)	3-bit	middle 56	write-back sensitivity (Config-I finding)
Attention Q/K/V/O	4-bit	middle 56	uniform per layer
Boundary (all tensors)	8-bit	first 2 + last 2	boundary-layer protection
MoE router	f16	all	routing precision critical
Embeddings + lm_head	8-bit	n/a	protected

Uniform MLX quantization produces broken output on MiniMax-class MoE because it compresses attention and routing to the same bits as expert MLPs. Config-I protects the components that control coherence while compressing the ~97% of parameters (expert MLPs) that tolerate it.
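
The table can be read as a lookup from tensor role and layer index to bit width. The sketch below encodes it that way for the 60-layer model. It is an illustration of the policy as tabulated here, not the code in convert_m3.py; the role names and the function are hypothetical, and the router's f16 is represented as 16 bits.

```python
NUM_LAYERS = 60   # from the model description above
BOUNDARY = 2      # first 2 + last 2 layers are protected

# Bit widths for the middle layers, keyed by (hypothetical) role name.
MIDDLE_BITS = {
    "expert_gate_up": 2,  # w1/w3
    "expert_down": 3,     # w2
    "attention": 4,       # Q/K/V/O
}


def config_i_bits(role, layer=None, num_layers=NUM_LAYERS, boundary=BOUNDARY):
    """Bit width the table assigns to a tensor role at a given layer."""
    if role == "router":
        return 16  # kept at f16 in every layer
    if role in ("embedding", "lm_head"):
        return 8   # not tied to a layer index
    if role not in MIDDLE_BITS:
        raise ValueError(f"unknown role: {role!r}")
    if layer is None or not 0 <= layer < num_layers:
        raise ValueError(f"layer must be in [0, {num_layers})")
    if layer < boundary or layer >= num_layers - boundary:
        return 8   # boundary layers: all tensors at 8-bit
    return MIDDLE_BITS[role]


for layer in (0, 1, 2, 30, 57, 58, 59):
    print(layer, config_i_bits("expert_gate_up", layer))
```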

Compatibility

Field	Value
Format	MLX safetensors (standard)
Avg bits	3.097 bpw (quantized weights; vision + MSA-index kept bf16)
Runtime	`mlx_lm` (Python), `mlx-swift-lm` (Swift)
Model type	`minimax_m3_vl` (text backbone)
Platform	Apple Silicon, needs ~200 GB unified memory (M3 Ultra 256 GB / M-series with 192 GB+)
Quantized on	2026-06-14

The format is standard MLX per-layer quantization, but M3 support is new: the minimax_m3_vl model class isn't in released mlx_lm yet, so you need the patch described under "🔧 Patch required" above. Use the bundled minimax_m3_vl.py (drop it into mlx_lm/models/) or the in-flight PR #1398.

How to Run

Python (mlx_lm)

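# Needs minimax_m3_vl support: use the bundled minimax_m3_vl.py or PR #1398
# (see "🔧 Patch required" above). Then, from the shell:
python -m mlx_lm.generate --model thetom-ai/MiniMax-M3-ConfigI-MLX --prompt "Hello" --temp 1.0 --top-p 0.95

# Or from Python:
from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler

model, tokenizer = load("thetom-ai/MiniMax-M3-ConfigI-MLX")
# Recent mlx_lm releases take sampling settings through a sampler object
# rather than temp/top_p keyword arguments to generate().
sampler = make_sampler(temp=1.0, top_p=0.95)
print(generate(model, tokenizer, prompt="Hello", max_tokens=256, sampler=sampler))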

Note: MiniMax models always reason before answering, so use temperature=1.0; greedy decoding (temp=0) can cause infinite thinking loops.
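
For readers unfamiliar with these settings, the sketch below shows what temperature and top-p do to a next-token distribution, and why temp=0 is fully deterministic: the model always picks the same token in the same context, so it cannot sample its way out of a repeated pattern. This is a generic pure-Python illustration, not mlx_lm's sampler, and the logits are made up.

```python
import math


def sampling_distribution(logits, temp=1.0, top_p=1.0):
    """Probabilities a temperature + top-p sampler would draw from."""
    if temp == 0.0:
        # Greedy decoding: all probability mass on the single best token.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]

    # Temperature-scaled softmax (shifted by the max for numerical stability).
    scaled = [x / temp for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-p: keep the most likely tokens until their mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = set(), 0.0
    for i in order:
        kept.add(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return [probs[i] / mass if i in kept else 0.0 for i in range(len(probs))]


example_logits = [2.0, 1.0, 0.5]  # made-up values for illustration
print(sampling_distribution(example_logits, temp=0.0))
print(sampling_distribution(example_logits, temp=1.0, top_p=0.95))
```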

Limitations (current loader)

With today's minimax_m3_vl loader (PR #1398), this runs as a text-only, dense-attention model:

No image input. The vision tower weights ship in the repo but the loader doesn't wire up VL inference yet; they are dead weight until MLX adds M3-VL support, at which point no re-quantization is needed.
Dense attention, not MSA. MiniMax Sparse Attention is run as full causal attention. This is numerically exact (equal-or-better quality), but long contexts are slower and use more KV-cache memory than native M3. The MSA indexer weights are retained (bf16) for a future MSA-capable loader.

Both are intentional: the weights are kept so the artifact is forward-compatible without re-quantizing from source.

What is Config-I?

Config-I is a tensor-role-aware weight compression policy from TurboQuant+. Systematic A/B isolation showed that attention tensors, FFN read projections (gate/up), FFN write-back projections (down), and boundary layers have dramatically different compression sensitivity. The key insight is that compression policy matters more than compression math: which tensors to compress, which to protect, and how aggressively. For MoE models, expert MLPs dominate parameter count but tolerate aggressive compression because only a few of the 128 experts are active per token; Config-I compresses them to 2–3 bit while protecting attention and routing.
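
To put "only a few of the 128 experts" in numbers, the sketch below uses the routing figures from the model description above (128 routed experts per layer, top-4, plus 1 shared expert). It only counts experts; it makes no claim about how parameters are split between them.

```python
ROUTED_EXPERTS = 128  # routed experts per layer
TOP_K = 4             # routed experts selected per token
SHARED_EXPERTS = 1    # always-active shared expert

# Expert MLPs evaluated for each token in one layer.
active_per_token = TOP_K + SHARED_EXPERTS

# Share of the routed experts that a single token touches in one layer.
routed_fraction = TOP_K / ROUTED_EXPERTS

print(f"Experts evaluated per token per layer: {active_per_token}")
print(f"Fraction of routed experts used: {routed_fraction:.3%}")
```

Each token therefore passes through about 3% of a layer's routed experts, which is consistent with the card's claim that expert MLPs tolerate aggressive compression.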

This quant was produced from the MXFP8 checkpoint with convert_m3.py. It is shared as-is, untested, for others with the hardware to evaluate it.
