## What This Is
A 4-bit quantized, Cerebras REAP-pruned build of MiniMax M2.5 for Apple Silicon via MLX.

- 256 → 192 experts (25% pruned via Cerebras REAP)
- 62 layers, 8 experts active per token, hidden size 3072
- 4-bit affine quantization, group size 64, gates at 8-bit
- Speed: ~50 tok/s on an M3 Ultra (256 GB)
- Thinking: always on
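The expert counts above imply how sparse this MoE is at inference time. A quick back-of-the-envelope check, using only the numbers from the spec list:

```python
# Spec arithmetic for the REAP-pruned MoE (numbers from the list above).
TOTAL_EXPERTS_ORIG = 256    # experts before REAP pruning
TOTAL_EXPERTS_PRUNED = 192  # experts after pruning
ACTIVE_PER_TOKEN = 8        # experts routed per token

# Fraction of experts removed by REAP.
pruned_fraction = 1 - TOTAL_EXPERTS_PRUNED / TOTAL_EXPERTS_ORIG

# Fraction of remaining experts actually used per token.
active_fraction = ACTIVE_PER_TOKEN / TOTAL_EXPERTS_PRUNED

print(f"pruned: {pruned_fraction:.0%}")            # 25%
print(f"active per token: {active_fraction:.1%}")  # ~4.2%
```

So only about 4% of the surviving experts fire on any given token, which is why throughput stays high despite the large total parameter count.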
## Inference Settings

MiniMax M2.5 requires specific sampling settings; greedy decoding (temperature 0) can cause infinite repetition loops:

```yaml
temperature: 1.0  # REQUIRED
top_p: 0.95
top_k: 40
```
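To make the effect of these settings concrete, here is an illustrative pure-NumPy sketch of top-k plus top-p (nucleus) sampling. It is not the mlx_lm implementation; the function name and structure are my own, for illustration only:

```python
import numpy as np

def sample_top_k_top_p(logits, temperature=1.0, top_k=40, top_p=0.95, rng=None):
    """Illustrative top-k + top-p sampling over a single logits vector."""
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64) / temperature

    # Top-k: keep only the k highest logits, mask the rest to -inf.
    if top_k and top_k < logits.size:
        kth = np.partition(logits, -top_k)[-top_k]
        logits = np.where(logits < kth, -np.inf, logits)

    # Softmax over the surviving logits.
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    # Top-p: keep the smallest set of tokens whose cumulative mass >= top_p.
    order = np.argsort(probs)[::-1]
    cum = np.cumsum(probs[order])
    cutoff = np.searchsorted(cum, top_p) + 1
    keep = order[:cutoff]
    masked = np.zeros_like(probs)
    masked[keep] = probs[keep]
    masked /= masked.sum()

    # Draw a token id from the renormalized distribution.
    return rng.choice(probs.size, p=masked)
```

Temperature 1.0 keeps the distribution unflattened, while top-k/top-p trim the unreliable tail; forcing the argmax instead (the greedy case) is what triggers the repetition loops mentioned above.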
## Usage

```python
from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler

model, tokenizer = load("dealignai/MiniMax-M2.5-REAP-172B-4bit-MLX")
sampler = make_sampler(temp=1.0, top_p=0.95, top_k=40)

messages = [{"role": "user", "content": "Your prompt here"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

response = generate(model, tokenizer, prompt=text, max_tokens=2048, sampler=sampler)
print(response)
```
## All Variants

| Model | Size on disk | Speed (M3 Ultra) |
|---|---|---|
| 4-bit REAP | ~91 GB | ~50 tok/s |
| 6-bit REAP | ~131 GB | ~40-45 tok/s |
| 8-bit REAP | ~171 GB | ~30 tok/s (est.) |
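The sizes in the table track bits-per-weight almost linearly. A rough estimate, assuming ~172B weights and ignoring quantization overhead (per-group scales and zero-points, the 8-bit gates), lands within a few GB of the listed files:

```python
PARAMS = 172e9  # ~172B weights in the REAP-pruned model

def estimated_size_gb(bits_per_weight):
    """Back-of-the-envelope footprint: params * bits / 8 bits-per-byte, in GB.

    Ignores per-group scales/zero-points and the 8-bit gates, so the
    real checkpoints differ from this by a few GB.
    """
    return PARAMS * bits_per_weight / 8 / 1e9

for bits in (4, 6, 8):
    print(f"{bits}-bit: ~{estimated_size_gb(bits):.0f} GB")
# 4-bit: ~86 GB, 6-bit: ~129 GB, 8-bit: ~172 GB
```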
## Credits

### Support dealignai

All models are built from original research and published for free. They are crafted to be strong coders and general-purpose assistants.

Support us on Ko-fi; the Ko-fi membership includes early access and extras.

Questions or need help with a specific model? DM us; we usually help for free.