MiniMax-M2.5-GGUF (230B MoE)

High-precision GGUF quants of the MiniMax-M2.5 (230B parameters) Mixture of Experts model. These versions are specifically optimized for local inference on high-RAM setups, particularly Apple Silicon (M3 Max/Ultra).

πŸ”¬ Perplexity Validation (WikiText-2):

Final PPL: 8.2213 +/- 0.09

Context: 4096 / 32 chunks

Outcome: The Q3_K_L quantization maintains high logical coherence while boosting throughput to 28.7 t/s. Degradation is minimal for a roughly 28 GB size reduction versus Q4_K_M.
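The PPL figure above can be reproduced with llama.cpp's `llama-perplexity` tool; a sketch, assuming you are in a llama.cpp checkout (the WikiText-2 helper script and output path may differ between versions):

```shell
# Fetch the WikiText-2 raw test set (helper script ships with llama.cpp)
./scripts/get-wikitext-2.sh

# Evaluate at the same settings as above: 4096-token context, 32 chunks
./llama-perplexity -m minimax-m2.5-Q3_K_L.gguf \
  -f wikitext-2-raw/wiki.test.raw \
  -c 4096 --chunks 32 -ngl 99
```

Fewer chunks gives a faster but noisier estimate; drop `--chunks` to evaluate the full test set.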

πŸš€ Available Quants

| File Name | Method | Size | Use Case |
|---|---|---|---|
| minimax-m2.5-Q4_K_M.gguf | Q4_K_M | 138 GB | Highest logic preservation. Requires >128 GB RAM or SSD swap. |
| minimax-m2.5-Q3_K_L.gguf | Q3_K_L | ~110 GB | Sweet spot for 128 GB Macs. Runs natively in RAM at high throughput (~28 t/s on a Mac M3 Max). |
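Individual files can be fetched without cloning the whole repository; a sketch using `huggingface-cli` (repo id taken from this page):

```shell
pip install -U "huggingface_hub[cli]"

# Pull only the Q3_K_L file (~110 GB) into the current directory
huggingface-cli download ox-ox/MiniMax-M2.5-GGUF \
  minimax-m2.5-Q3_K_L.gguf --local-dir .
```

Downloads resume automatically if interrupted, which matters at these file sizes.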

πŸ›  Model Details

  • Architecture: MiniMax-M2 (Mixture of Experts) with 256 experts (8 active per token).
  • Parameters: ~230B total.
  • Quantization Process: Rather than relying on an automated one-shot pipeline, these quants were generated from a full F16 GGUF master (457 GB) to minimize error accumulation during the K-quant process.
  • Context Window: Up to 196k tokens (Native support).
  • Chat Template: Includes the official Jinja template for proper handling of interleaved <think> tags, separating reasoning from the final response.
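When driving the model through a raw completion API, the interleaved reasoning can be separated from the final response with a small parser; a minimal sketch (the `<think>…</think>` delimiters follow the template described above; the helper name is illustrative):

```python
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def split_reasoning(text: str) -> tuple[str, str]:
    """Return (reasoning, answer): the concatenated <think> blocks,
    and the remaining text with those blocks removed."""
    reasoning = "\n".join(m.strip() for m in THINK_RE.findall(text))
    answer = THINK_RE.sub("", text).strip()
    return reasoning, answer

out = "<think>User wants a sum.</think>2 + 2 = 4"
r, a = split_reasoning(out)
print(r)  # -> User wants a sum.
print(a)  # -> 2 + 2 = 4
```

Note that `--jinja` (below) is what makes llama.cpp emit these tags in the first place; without the template applied, the delimiters may not appear.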

πŸ’» Usage

Requires llama.cpp build 8022 or higher.

Command Line Example (note: `--port` applies to `llama-server` rather than `llama-cli`, and the `--draft*` speculative flags require a separate draft model via `-md`, so both are omitted here; `-c` is capped at the native 196k window):

```shell
./llama-cli -m minimax-m2.5-Q3_K_L.gguf -n -1 \
  -c 196608 \
  -ngl 99 -fa on -ctk q4_0 -ctv q4_0 \
  -b 2048 -ub 1024 \
  --jinja -sm none -ncmoe 0 --cache-reuse 1024
```
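The same flags also work for serving the model over an OpenAI-compatible endpoint; a sketch using `llama-server` (port and prompt are illustrative):

```shell
./llama-server -m minimax-m2.5-Q3_K_L.gguf \
  -c 196608 -ngl 99 -fa on -ctk q4_0 -ctv q4_0 \
  --jinja --port 8080 --cache-reuse 1024

# Query the OpenAI-style chat endpoint:
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello"}]}'
```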