MiniMax-M2.7-GGUF

imatrix-calibrated GGUF quantizations of MiniMaxAI/MiniMax-M2.7 — the first model in the MiniMax lineup to participate in its own training evolution.

M2.7 matches or exceeds GPT-5.3-Codex on SWE-Pro (56.22%), rivals Opus 4.6 on VIBE-Pro (55.6%), and outperforms Claude Sonnet 4.6 on agentic benchmarks — with only 10B active parameters per token out of 229B total.

Benchmark Overview

Why these quants

  • imatrix calibrated — all sub-Q8 quants use an importance matrix for optimal quality retention. No blind quantization.
  • Q8_0 is effectively lossless — source weights are FP8 (float8_e4m3fn), so 8-bit quantization retains essentially all of the original precision.
  • XL variants — embedding and output weights kept at Q8_0 for meaningfully better coherence on MoE routing. Available for Q4_K_M, Q5_K_M, and Q3_K_M.
  • Correct license metadata — the NON-COMMERCIAL license is properly set in the GGUF headers via --override-kv. Other repos incorrectly list modified-MIT.
  • Validated — every quant smoke-tested for coherent output, reasoning accuracy, and code generation before upload.
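
The license-metadata claim can be spot-checked locally. Below is a hypothetical helper (not part of this repo) that reads only the fixed-size GGUF header; per the GGUF spec, that header is a 4-byte magic, a uint32 version, and two uint64 counts.

```python
import struct

# Hypothetical helper: peek at a GGUF file's fixed-size header without
# loading the model. Per the GGUF spec the header layout is:
# 4-byte magic "GGUF", uint32 version, uint64 tensor count, uint64 metadata KV count.
def read_gguf_header(path: str) -> dict:
    with open(path, "rb") as f:
        if f.read(4) != b"GGUF":
            raise ValueError("not a GGUF file")
        version, n_tensors, n_kv = struct.unpack("<IQQ", f.read(20))
    return {"version": version, "tensors": n_tensors, "kv_pairs": n_kv}
```

The metadata strings themselves (including general.license) follow the header; the gguf Python package's GGUFReader can decode them in full.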

License

NON-COMMERCIAL USE ONLY. Commercial use requires prior written authorization from MiniMax. See LICENSE.

Quick Start

# Recommended: Q4_K_M for most users
hf download Youssofal/MiniMax-M2.7-GGUF --include "MiniMax-M2.7-Q4_K_M/*" --local-dir ./

# Run with llama-server
llama-server -m MiniMax-M2.7-Q4_K_M.gguf \
  -ngl 999 -c 32768 --jinja --reasoning-format auto -fa \
  --temp 1.0 --top-p 0.95 --top-k 40

Which file should I use?

| Use Case | Recommended | Size | Notes |
|---|---|---|---|
| Maximum quality, unlimited RAM | Q8_0 | ~243 GB | Effectively lossless (FP8 source) |
| High quality, recommended | Q6_K | ~188 GB | Near-perfect |
| Best quality/size balance | Q4_K_M | ~138 GB | Default for most users |
| Better coherence at Q4 | Q4_K_M-XL | ~140 GB | Q8_0 embed/output weights |
| Low RAM / Apple Silicon 128 GB | IQ4_XS | ~122 GB | Best quality under 128 GB |
| Minimum viable | Q3_K_M | ~109 GB | Usable but noticeable quality loss |
| Absolute minimum | Q2_K | ~83 GB | Surprisingly usable |

Sizing rule: pick a quant whose file size is 1-2 GB smaller than your total VRAM (GPU) or combined VRAM+RAM (Apple Silicon), leaving headroom for the KV cache and context. A great deep-dive with charts is provided by Artefact2.

All Available Quantizations

| Filename | Quant | Size | Split | Description |
|---|---|---|---|---|
| MiniMax-M2.7-Q8_0 | Q8_0 | ~243 GB | Yes | Effectively lossless — source is FP8 |
| MiniMax-M2.7-Q6_K | Q6_K | ~188 GB | Yes | Very high quality, recommended |
| MiniMax-M2.7-Q5_K_M | Q5_K_M | ~162 GB | Yes | High quality |
| MiniMax-M2.7-Q5_K_S | Q5_K_S | ~157 GB | Yes | High quality |
| MiniMax-M2.7-Q4_K_M-XL | Q4_K_M | ~140 GB | Yes | Q8_0 embed/output, better coherence |
| MiniMax-M2.7-Q4_K_M | Q4_K_M | ~138 GB | Yes | Good quality, recommended default |
| MiniMax-M2.7-Q4_K_S | Q4_K_S | ~130 GB | Yes | Slightly lower quality |
| MiniMax-M2.7-IQ4_XS | IQ4_XS | ~122 GB | Yes | Best quality under 128 GB |
| MiniMax-M2.7-Q3_K_L | Q3_K_L | ~118 GB | Yes | Lower quality, usable |
| MiniMax-M2.7-Q3_K_M-XL | Q3_K_M | ~110 GB | Yes | Q8_0 embed/output, better than standard Q3 |
| MiniMax-M2.7-Q3_K_M | Q3_K_M | ~109 GB | Yes | Low quality |
| MiniMax-M2.7-Q3_K_S | Q3_K_S | ~99 GB | Yes | Not recommended |
| MiniMax-M2.7-Q2_K | Q2_K | ~83 GB | Yes | Very low quality, surprisingly usable |

Download

pip install -U "huggingface_hub[cli]"

# Most quants are split (>50 GB) -- download the folder
hf download Youssofal/MiniMax-M2.7-GGUF --include "MiniMax-M2.7-Q4_K_M/*" --local-dir ./

# Smaller quants (single file)
hf download Youssofal/MiniMax-M2.7-GGUF --include "MiniMax-M2.7-Q2_K*" --local-dir ./

Running

# llama-server (recommended)
llama-server -m MiniMax-M2.7-Q4_K_M.gguf \
  -ngl 999 -c 32768 --jinja --reasoning-format auto -fa \
  --temp 1.0 --top-p 0.95 --top-k 40

# llama-cli (conversation mode)
llama-cli -m MiniMax-M2.7-Q4_K_M.gguf \
  -ngl 999 --jinja --reasoning-format auto -cnv

Sampling parameters (official MiniMax defaults): temperature 1.0, top_p 0.95, top_k 40.
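
With llama-server running as above, the same sampling defaults can also be passed per-request through its OpenAI-compatible endpoint (this assumes the default port 8080):

```shell
# Query llama-server's OpenAI-compatible chat endpoint with the
# official MiniMax sampling defaults (port 8080 is llama-server's default).
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Write a binary search in Python."}],
    "temperature": 1.0,
    "top_p": 0.95,
    "top_k": 40
  }'
```

llama-server accepts llama.cpp-specific sampling fields such as top_k alongside the standard OpenAI parameters.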

Apple Silicon / MoE Notes

Despite only 10B active parameters per token, all 229B weights must reside in memory. On Apple Silicon with unified memory, --cpu-moe keeps the MoE expert tensors on the CPU backend, which can reduce Metal memory pressure.

  • Q4_K_M (~138 GB) fits on 192 GB Mac
  • IQ4_XS (~122 GB) fits on 128 GB with swap pressure
  • Q3_K_M (~109 GB) runs on 128 GB Mac

Architecture

| Spec | Value |
|---|---|
| Parameters | 229B total, 10B active (4.3% activation) |
| Architecture | Sparse MoE, minimax_m2 |
| Experts | 256 local, 8 per token (top-k routing) |
| Layers | 62 |
| Attention | 48 heads, 8 KV heads, hybrid Lightning + softmax |
| Context | 200K (rope_theta 5,000,000) |
| Source precision | FP8 (float8_e4m3fn) |
| Thinking | `<think>...</think>` interleaved reasoning |
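
The 8-of-256 top-k routing in the table can be illustrated with a minimal sketch (pure Python, not MiniMax's actual router): each token's router emits one logit per expert, the top 8 are selected, and their softmax weights are renormalized over the selected set.

```python
import math, random

# Illustrative sketch of sparse MoE top-k routing: pick the top-k experts
# by router logit and renormalize softmax weights over the selected set.
N_EXPERTS, TOP_K = 256, 8

def route(router_logits: list[float], k: int = TOP_K):
    """Return (expert_indices, weights) for one token."""
    top = sorted(range(len(router_logits)), key=router_logits.__getitem__, reverse=True)[:k]
    m = max(router_logits[i] for i in top)  # subtract max for numerical stability
    exps = [math.exp(router_logits[i] - m) for i in top]
    total = sum(exps)
    return top, [e / total for e in exps]

random.seed(0)
logits = [random.gauss(0, 1) for _ in range(N_EXPERTS)]
experts, weights = route(logits)
print(len(experts), round(sum(weights), 6))  # 8 1.0
```

Only the selected experts' FFN weights are used for a given token, which is why the active parameter count (10B) is so much smaller than the total (229B).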

XL Variants

XL variants (Q4_K_M-XL, Q5_K_M-XL, Q3_K_M-XL) keep embedding and output weight tensors at Q8_0 instead of the default lower precision. This preserves vocabulary quality at minimal size cost and meaningfully improves coherence — particularly important for MoE models where expert routing precision matters.
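
For reference, an XL-style quant can in principle be reproduced with llama-quantize's per-tensor type overrides; file names here are placeholders, not files in this repo:

```shell
# Keep token-embedding and output tensors at Q8_0 while quantizing the
# rest to Q4_K_M, using the importance matrix for calibration.
llama-quantize --imatrix imatrix.dat \
  --token-embedding-type q8_0 --output-tensor-type q8_0 \
  MiniMax-M2.7-F16.gguf MiniMax-M2.7-Q4_K_M-XL.gguf Q4_K_M
```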

Chat Template

Two templates are provided:

  • chat_template.official.jinja — exact copy from MiniMax, includes <think> in the generation prompt
  • chat_template.llama-server-patched.jinja — removes <think> from the generation prompt to prevent duplication in some llama-server versions
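
If your llama-server build duplicates <think>, the patched template can be supplied explicitly (the --chat-template-file flag requires a recent llama.cpp build and --jinja):

```shell
# Override the embedded chat template with the patched one from this repo.
llama-server -m MiniMax-M2.7-Q4_K_M.gguf --jinja \
  --chat-template-file chat_template.llama-server-patched.jinja
```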

Compatibility

| Engine | Status |
|---|---|
| llama.cpp (latest main) | Tested |
| LM Studio | Check for minimax_m2 support |
| Ollama | Requires recent version with minimax_m2 |
| KoboldCpp | Untested |

Quantization Details

  • Quantized with llama.cpp at commit ff5ef82
  • imatrix generated from bartowski calibration data
  • License metadata corrected via --override-kv (upstream model card incorrectly propagates modified-MIT)
  • All quants validated: coherent English output, correct reasoning, valid code generation
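
For reference, an importance matrix like the one used here is typically generated with llama.cpp's llama-imatrix tool; the file names below are placeholders:

```shell
# Generate an importance matrix from a calibration text file, offloading
# layers to the GPU where available.
llama-imatrix -m MiniMax-M2.7-F16.gguf \
  -f calibration_data.txt -o imatrix.dat -ngl 999
```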

Credits

  • MiniMaxAI for the original MiniMax-M2.7 model
  • The llama.cpp contributors for the quantization tooling and GGUF format
  • bartowski for the imatrix calibration data
