
MiniMax-M2.5-REAP-172B 4-bit MLX

Cerebras REAP Expert-Pruned MiniMax M2.5 — 4-bit for Apple Silicon


What This Is

A 4-bit quantized build of the Cerebras REAP expert-pruned MiniMax M2.5, converted to MLX for Apple Silicon.

  • 256 → 192 experts (25% pruned via Cerebras REAP)
  • 62 layers, 8 experts active per token, hidden size 3072
  • 4-bit affine quantization, group size 64, gates at 8-bit
  • Speed: ~50 tok/s on M3 Ultra 256GB
  • Thinking: Always ON
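
As a quick sanity check on the figures above, the pruning and routing numbers can be reproduced with a few lines of arithmetic (numbers taken directly from this card; this is a back-of-envelope sketch, not part of the model code):

```python
# MoE arithmetic for this checkpoint, using the counts listed above.
TOTAL_EXPERTS_ORIG = 256   # MiniMax M2.5 before REAP pruning
TOTAL_EXPERTS = 192        # after Cerebras REAP
ACTIVE_PER_TOKEN = 8       # experts routed per token (unchanged by pruning)

pruned_fraction = 1 - TOTAL_EXPERTS / TOTAL_EXPERTS_ORIG   # 0.25
active_fraction = ACTIVE_PER_TOKEN / TOTAL_EXPERTS         # ~0.042

print(f"pruned: {pruned_fraction:.0%}")            # 25% of experts removed
print(f"active per token: {active_fraction:.1%}")  # ~4.2% of experts fire per token
```

Note that REAP removes whole experts but keeps the per-token routing width at 8, so per-token compute is essentially unchanged; only the resident weight footprint shrinks.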

Inference Settings

MiniMax M2.5 requires these specific sampling settings; greedy decoding causes infinite repetition loops:

temperature: 1.0    # REQUIRED
top_p: 0.95
top_k: 40

Usage

from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler

# Download and load the quantized weights and tokenizer.
model, tokenizer = load("dealignai/MiniMax-M2.5-REAP-172B-4bit-MLX")

# Sampled decoding is required (see Inference Settings above).
sampler = make_sampler(temp=1.0, top_p=0.95, top_k=40)

# Format the conversation with the model's chat template, then generate.
messages = [{"role": "user", "content": "Your prompt here"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
response = generate(model, tokenizer, prompt=text, max_tokens=2048, sampler=sampler)
print(response)

All Variants

Model        Size      Speed
4-bit REAP   ~91 GB    ~50 tok/s
6-bit REAP   ~131 GB   ~40-45 tok/s
8-bit REAP   ~171 GB   ~30 tok/s (est.)
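
The on-disk sizes imply an effective bits-per-weight slightly above the nominal bit width, because group-wise quantization (group size 64) stores extra scales/biases and the gates stay at 8-bit. A rough check against the ~173B total parameter count (sizes are the approximate figures from the table, treated as decimal GB):

```python
# Effective bits per weight = on-disk bits / total parameters.
PARAMS = 173e9  # approximate total parameter count

eff_bits = {
    name: size_gb * 1e9 * 8 / PARAMS
    for name, size_gb in [("4-bit", 91), ("6-bit", 131), ("8-bit", 171)]
}
for name, bits in eff_bits.items():
    print(f"{name}: ~{bits:.1f} bits/weight")  # ~4.2, ~6.1, ~7.9
```

The ~0.1-0.2 bit overhead per weight is consistent with per-group quantization metadata; the exact figure depends on whether the listed sizes are GB or GiB.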


Support dealignai

All models are built from original research and published for free. These models are specifically crafted to be excellent coders and general-purpose assistants.

Support us on Ko-fi — check out the Ko-fi membership for early access and extras.

Have questions or need help with a specific model? DM us — we help for free most of the time.

Ko-fi | X @dealignai | dealign.ai
