## What This Is
A 4-bit quantized, Cerebras REAP-pruned build of MiniMax M2.5 for Apple Silicon via MLX.

- 256 → 192 experts (25% pruned via Cerebras REAP)
- 62 layers, 8 experts active per token, hidden size 3072
- 4-bit affine quantization, group size 64, gates at 8-bit
- Speed: ~50 tok/s on an M3 Ultra (256 GB)
- Thinking: always on
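The expert counts above imply how sparse this MoE is at inference time. A quick back-of-the-envelope check, using only the numbers from the spec list:

```python
# Spec arithmetic for the REAP-pruned MoE (numbers from the list above).
TOTAL_EXPERTS_ORIG = 256    # experts before REAP pruning
TOTAL_EXPERTS_PRUNED = 192  # experts after pruning
ACTIVE_PER_TOKEN = 8        # experts routed per token

# Fraction of experts removed by REAP.
pruned_fraction = 1 - TOTAL_EXPERTS_PRUNED / TOTAL_EXPERTS_ORIG

# Fraction of remaining experts actually used per token.
active_fraction = ACTIVE_PER_TOKEN / TOTAL_EXPERTS_PRUNED

print(f"pruned: {pruned_fraction:.0%}")            # 25%
print(f"active per token: {active_fraction:.1%}")  # ~4.2%
```

So only about 4% of the surviving experts fire on any given token, which is why throughput stays high despite the large total parameter count.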
## Inference Settings

MiniMax M2.5 requires specific sampling settings; greedy decoding (temperature 0) can cause infinite repetition loops:

```yaml
temperature: 1.0  # REQUIRED
top_p: 0.95
top_k: 40
```
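To make the effect of these settings concrete, here is an illustrative pure-NumPy sketch of top-k plus top-p (nucleus) sampling. It is not the mlx_lm implementation; the function name and structure are my own, for illustration only:

```python
import numpy as np

def sample_top_k_top_p(logits, temperature=1.0, top_k=40, top_p=0.95, rng=None):
    """Illustrative top-k + top-p sampling over a single logits vector."""
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64) / temperature

    # Top-k: keep only the k highest logits, mask the rest to -inf.
    if top_k and top_k < logits.size:
        kth = np.partition(logits, -top_k)[-top_k]
        logits = np.where(logits < kth, -np.inf, logits)

    # Softmax over the surviving logits.
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    # Top-p: keep the smallest set of tokens whose cumulative mass >= top_p.
    order = np.argsort(probs)[::-1]
    cum = np.cumsum(probs[order])
    cutoff = np.searchsorted(cum, top_p) + 1
    keep = order[:cutoff]
    masked = np.zeros_like(probs)
    masked[keep] = probs[keep]
    masked /= masked.sum()

    # Draw a token id from the renormalized distribution.
    return rng.choice(probs.size, p=masked)
```

Temperature 1.0 keeps the distribution unflattened, while top-k/top-p trim the unreliable tail; forcing the argmax instead (the greedy case) is what triggers the repetition loops mentioned above.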
## Usage

```python
from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler

model, tokenizer = load("dealignai/MiniMax-M2.5-REAP-172B-4bit-MLX")
sampler = make_sampler(temp=1.0, top_p=0.95, top_k=40)

messages = [{"role": "user", "content": "Your prompt here"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

response = generate(model, tokenizer, prompt=text, max_tokens=2048, sampler=sampler)
print(response)
```
## All Variants

| Model | Size on disk | Speed (M3 Ultra) |
|---|---|---|
| 4-bit REAP | ~91 GB | ~50 tok/s |
| 6-bit REAP | ~131 GB | ~40-45 tok/s |
| 8-bit REAP | ~171 GB | ~30 tok/s (est.) |
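The sizes in the table track bits-per-weight almost linearly. A rough estimate, assuming ~172B weights and ignoring quantization overhead (per-group scales and zero-points, the 8-bit gates), lands within a few GB of the listed files:

```python
PARAMS = 172e9  # ~172B weights in the REAP-pruned model

def estimated_size_gb(bits_per_weight):
    """Back-of-the-envelope footprint: params * bits / 8 bits-per-byte, in GB.

    Ignores per-group scales/zero-points and the 8-bit gates, so the
    real checkpoints differ from this by a few GB.
    """
    return PARAMS * bits_per_weight / 8 / 1e9

for bits in (4, 6, 8):
    print(f"{bits}-bit: ~{estimated_size_gb(bits):.0f} GB")
# 4-bit: ~86 GB, 6-bit: ~129 GB, 8-bit: ~172 GB
```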
## Credits

### Support dealignai

All models are built from original research and published for free. They are crafted to be strong coders and general-purpose assistants.

Support us on Ko-fi; the Ko-fi membership includes early access and extras.

Questions or need help with a specific model? DM us; we usually help for free.