---
license: other
license_name: modified-mit
license_link: LICENSE
base_model:
- Ex0bit/MiniMax-SLURPY
base_model_relation: quantized
tags:
- mlx
- apple-silicon
- moe
- prism-dq
- dynamic-quantization
- minimax
- minimax_m2
- code
- reasoning
- agents
- quantized
model_type: minimax_m2
pipeline_tag: text-generation
library_name: mlx
quantized_by: Ex0bit
---

# MiniMax-SLURPY-DQ-MLX
**Per-tensor mixed-precision quantization of [MiniMax-SLURPY](https://huggingface.co/Ex0bit/MiniMax-SLURPY) for Apple Silicon — 2.54 BPW with 498 per-tensor-projection allocations (plus 16,122 per-expert PRISM decisions collapsed into MLX's SwitchGLU format).**
The full SLURPY model (228.7B params) is compressed from 215 GB to 68 GB (a 68% reduction) using **PRISM Dynamic Quantization**, a per-tensor-class mixed-precision allocation derived entirely from weight structure sensitivity analysis. Zero calibration data, zero training, zero datasets.
Created by [Ex0bit](https://hf.co/Ex0bit)
---
<div align="center">
### 💡 Support our research and development efforts. PRISM members receive access to the latest PRISM-PRO model drops on Day-0
[Support on Ko-fi](https://ko-fi.com/Ex0bit)
</div>
---
## Model Details
| Property | Value |
|----------|-------|
| Base Model | [Ex0bit/MiniMax-SLURPY](https://huggingface.co/Ex0bit/MiniMax-SLURPY) |
| Architecture | MiniMax M2 MoE (256 experts, top-8) |
| Parameters | 228.7B total / ~10B active |
| Quantization | PRISM-DYNAMIC-QUANT (MLX native) |
| Achieved BPW | 2.54 |
| File Size | 68 GB (vs 215 GB source = 68% reduction) |
| Per-tensor overrides | 498 (MoE: per-layer-projection modal of 16,122 per-expert decisions) |
| Default precision | 2-bit |
| Group size | 64 |
| Context Length | 196,608 tokens |
| Runtime | mlx-lm (Apple Silicon Metal) |
| Creator | [Ex0bit](https://hf.co/Ex0bit) |
## What SLURPY inherits
SLURPY is a mathematically unique "designer baby" of [MiniMax-M2.5](https://huggingface.co/MiniMaxAI/MiniMax-M2.5) and [MiniMax-M2.7](https://huggingface.co/MiniMaxAI/MiniMax-M2.7): neither parent, entirely its own model.
It inherits M2.5's architect-first coding style and MIT freedom and absorbs M2.7's RL-tuned precision on multi-agent collaboration and real-world engineering, all without a single training step.
| Benchmark | M2.5 | M2.7 | SLURPY |
|---|---|---|---|
| HumanEval pass@5 | 85.4% | — | **89.6%** |
| SWE-Bench Verified | 80.2% | — | inherited |
| SWE-Pro | — | 56.2% | inherited |
| MLE Bench Lite | — | 66.6% | inherited |
| GDPval-AA ELO | — | 1495 | inherited |
See [Ex0bit/MiniMax-SLURPY](https://huggingface.co/Ex0bit/MiniMax-SLURPY) for full benchmark details.
---
## PRISM Dynamic Quantization
This model uses **PRISM Dynamic Quantization** — a per-tensor mixed-precision allocation that assigns different quantization types to different tensor classes based on weight structure sensitivity analysis.
Unlike uniform quantization (Q3, Q4, Q5), PRISM-DQ analyzes each tensor's sensitivity to quantization error and allocates precision where it matters most. Critical tensors (attention projections, key MoE experts, lm_head) receive higher precision while less impactful tensors get aggressive compression.
PRISM produced 16,122 per-expert decisions (256 experts × 62 layers × 3 projections, plus attention and embeddings). MLX's `SwitchGLU` packs all 256 experts per layer-projection into a single 3D tensor sharing one bit width, so the per-expert decisions collapse to the modal bit width for each of the 186 MoE projections. The remaining 312 per-tensor decisions (attention, embeddings, lm_head, routers) retain full PRISM granularity, giving 498 effective overrides.
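The collapse described above can be sketched in a few lines. This is an illustration only: the per-expert bit widths are made up, `modal_bits` is a hypothetical helper, and the tie-breaking rule is an assumption, since the card does not say how PRISM resolves ties.

```python
from collections import Counter

def modal_bits(per_expert_bits):
    """Return the most common bit width among one layer-projection's experts.

    Ties go to the higher bit width; this tie-break is an assumption.
    """
    counts = Counter(per_expert_bits)
    best = max(counts.values())
    return max(bits for bits, n in counts.items() if n == best)

# Hypothetical allocation for one projection: 256 experts, mostly 2-bit.
example = [2] * 200 + [3] * 40 + [4] * 16
collapsed = modal_bits(example)

# Override bookkeeping using the figures quoted in this card.
moe_projections = 62 * 3                     # layers x projections per layer
effective_overrides = moe_projections + 312  # plus the non-MoE decisions
```

With these inputs the packed tensor would be stored at 2 bits, and the bookkeeping reproduces the 186 MoE projections and 498 effective overrides stated above.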
The model's `config.json` contains the per-tensor quantization overrides that mlx-lm loads natively — no custom runtime required. Apple Silicon's compiled Metal kernels automatically handle mixed-precision tensors in a single forward pass at full GPU speed.
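To see what such overrides look like, the sketch below tallies them by bit width. It assumes the overrides live under a `quantization` key that holds the default `group_size` and `bits` plus one dict entry per overridden module path; the sample config and its module paths are hand-written for illustration and are not copied from this repository.

```python
import json
from collections import Counter

def summarize_overrides(config):
    """Return (default_bits, {bits: count}) for a config dict.

    Assumed layout: config["quantization"] has scalar defaults and one
    nested dict per overridden module path.
    """
    quant = config.get("quantization", {})
    overrides = {k: v for k, v in quant.items() if isinstance(v, dict)}
    by_bits = Counter(v.get("bits") for v in overrides.values())
    return quant.get("bits"), dict(by_bits)

# Tiny hypothetical config; real files contain many more entries.
sample = json.loads("""
{"quantization": {"group_size": 64, "bits": 2,
  "model.layers.0.self_attn.q_proj": {"group_size": 64, "bits": 6},
  "model.layers.0.self_attn.k_proj": {"group_size": 64, "bits": 6},
  "lm_head": {"group_size": 64, "bits": 4}}}
""")
default_bits, histogram = summarize_overrides(sample)
```

To inspect the real file, pass the parsed contents of a local copy of `config.json` to `summarize_overrides` in place of `sample`.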
**No calibration data, no importance matrices, no training data required.**
---
## Architecture
Identical to MiniMax-M2.5 / M2.7 — quantization-only:
- **Model type**: `minimax_m2` / `MiniMaxM2ForCausalLM`
- **Parameters**: 228.7B total, ~10B active (MoE)
- **Layers**: 62
- **Hidden size**: 3072
- **MoE**: 256 experts, top-8, sigmoid routing + learned bias
- **Attention**: 48 query / 8 KV heads (GQA 6:1), head_dim=128
- **Quantization**: MLX affine, mixed 2-6 bit
- **Vocab**: 200,064 tokens
- **Context**: up to 196,608 tokens
- **Thinking**: Interleaved `<think>...</think>` (always-on)
- **`trust_remote_code=True` required**
---
## Usage on Apple Silicon
### mlx-lm (CLI)
```bash
pip install mlx-lm
# Interactive chat
mlx_lm.chat --model Ex0bit/MiniMax-SLURPY-DQ-MLX \
  --temp 1.0 --top-p 0.95 --max-tokens 4096
# Single prompt
python -m mlx_lm.generate \
  --model Ex0bit/MiniMax-SLURPY-DQ-MLX \
  --prompt "Write a Python function that reverses a linked list." \
  --max-tokens 2048 \
  --temp 1.0 --top-p 0.95
```
### Python API
```python
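from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler

model, tokenizer = load("Ex0bit/MiniMax-SLURPY-DQ-MLX")

# Wrap the request in the chat template so the model sees its expected turn markers.
messages = [{"role": "user", "content": "Write a Python function that reverses a linked list."}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

# Recent mlx-lm releases take sampling settings through a sampler object
# rather than temp/top_p keyword arguments on generate().
sampler = make_sampler(temp=1.0, top_p=0.95, top_k=40)

response = generate(model, tokenizer, prompt=prompt, max_tokens=2048, sampler=sampler)
print(response)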
```
### Recommended sampling parameters
| Parameter | Value |
|---|---|
| temperature | 1.0 |
| top_p | 0.95 |
| top_k | 40 |
### Important: preserve thinking in conversation history
MiniMax-M2 uses interleaved thinking. The model outputs `<think>...</think>` blocks during generation. **You must pass these back verbatim in conversation history.** Removing them degrades performance.
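A minimal sketch of what this means for a multi-turn loop: the assistant's raw output, thinking block included, is appended to the history unchanged. `append_assistant_turn` is a hypothetical helper and the sample output is invented for illustration.

```python
def append_assistant_turn(messages, raw_output):
    """Return a new history with the assistant's output appended verbatim,
    including any <think>...</think> blocks."""
    return messages + [{"role": "assistant", "content": raw_output}]

history = [{"role": "user", "content": "Reverse a linked list in Python."}]

# Invented model output; in practice this is the string returned by generation.
raw = "<think>Iterative pointer reversal is simplest.</think>Here is the function..."

history = append_assistant_turn(history, raw)
history.append({"role": "user", "content": "Now make it recursive."})
```

The point of the helper is what it does not do: there is no regex or post-processing step that strips the thinking block before the next turn is rendered through the chat template.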
---
## Tool calling
Same format as base SLURPY. Tool calls use `<minimax:tool_call>` / `</minimax:tool_call>` XML wrappers:
```xml
<minimax:tool_call>
<invoke name="get_weather">
<parameter name="city">San Francisco</parameter>
</invoke>
</minimax:tool_call>
```
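On the client side, output in this shape can be picked apart with a few regular expressions. The sketch below is an assumption-laden illustration, not the parser used by any particular runtime: `parse_tool_calls` is a hypothetical helper, it treats every parameter value as a plain string, and it does not handle nested or malformed markup.

```python
import re

TOOL_CALL_RE = re.compile(r"<minimax:tool_call>(.*?)</minimax:tool_call>", re.DOTALL)
INVOKE_RE = re.compile(r'<invoke name="([^"]+)">(.*?)</invoke>', re.DOTALL)
PARAM_RE = re.compile(r'<parameter name="([^"]+)">(.*?)</parameter>', re.DOTALL)

def parse_tool_calls(text):
    """Extract tool calls as a list of {"name": ..., "arguments": {...}} dicts."""
    calls = []
    for block in TOOL_CALL_RE.findall(text):
        for name, body in INVOKE_RE.findall(block):
            # Parameter values are kept as stripped strings.
            args = {key: value.strip() for key, value in PARAM_RE.findall(body)}
            calls.append({"name": name, "arguments": args})
    return calls

sample = """<minimax:tool_call>
<invoke name="get_weather">
<parameter name="city">San Francisco</parameter>
</invoke>
</minimax:tool_call>"""

calls = parse_tool_calls(sample)
```

Converting string values to numbers, booleans, or JSON objects would normally be driven by the tool's declared schema, which is left out here.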
---
## Hardware requirements
- **Apple Silicon Mac** with unified memory
- **80 GB RAM minimum** (model is 68 GB; needs headroom for KV cache)
- **128 GB RAM recommended** for full context length
- **M2 Ultra / M3 Max / M4 Max** for best throughput
For non-Apple platforms, use the FP8 [Ex0bit/MiniMax-SLURPY](https://huggingface.co/Ex0bit/MiniMax-SLURPY) variant with vLLM.
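The memory headroom figures above can be sanity-checked with a rough KV cache estimate. The layer, head, and context numbers come from the Architecture section; the 16-bit cache precision and the absence of cache quantization or eviction are assumptions, so treat the result as an upper-bound sketch rather than a measurement.

```python
LAYERS, KV_HEADS, HEAD_DIM = 62, 8, 128  # from the Architecture section
BYTES_PER_VALUE = 2                      # assumption: 16-bit KV cache
MAX_CONTEXT = 196_608

def kv_cache_bytes(tokens):
    """Bytes needed to cache keys and values for the given number of tokens."""
    # Factor of 2 covers one key and one value vector per KV head per layer.
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_PER_VALUE * tokens

per_token = kv_cache_bytes(1)
full_context = kv_cache_bytes(MAX_CONTEXT)

print(f"{per_token / 1024:.0f} KiB per token")            # 248 KiB per token
print(f"{full_context / 2**30:.1f} GiB at full context")  # 46.5 GiB at full context
```

Under these assumptions the cache alone reaches about 46.5 GiB at the full 196,608-token context, which, added to the 68 GB of weights, is consistent with the 128 GB recommendation.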
---
## Files
- 14 MLX safetensors shards (68 GB total)
- `config.json` with 498 per-tensor quantization overrides (collapsed from 16,122 PRISM decisions via SwitchGLU packing)
- `chat_template.jinja` — M2.7's chat template with tool calling support
- `modeling_minimax_m2.py` / `configuration_minimax_m2.py` — custom model code (inherited from base)
---
## License
Modified MIT — same as MiniMax-M2.5. See [LICENSE](LICENSE) for full text.
The only modification to the standard MIT license: if the Software (or any derivative works) is used for commercial products or services with more than 100 million monthly active users or more than $30M annual recurring revenue, you must prominently display "MiniMax M2" on the user interface.
---
## Credits
- Creator: [Ex0bit](https://hf.co/Ex0bit)
- Base model: [Ex0bit/MiniMax-SLURPY](https://huggingface.co/Ex0bit/MiniMax-SLURPY)
- Parents: [MiniMaxAI/MiniMax-M2.5](https://huggingface.co/MiniMaxAI/MiniMax-M2.5), [MiniMaxAI/MiniMax-M2.7](https://huggingface.co/MiniMaxAI/MiniMax-M2.7)
- Quantization engine: PRISM-DQ by [Ex0bit](https://hf.co/Ex0bit)
---
## Citation
```
@misc{minimax-slurpy-prism-mlx-2026,
title={MiniMax-SLURPY-DQ-MLX: Per-tensor mixed-precision quantization of MiniMax-SLURPY for Apple Silicon},
author={Ex0bit},
year={2026},
url={https://huggingface.co/Ex0bit/MiniMax-SLURPY-DQ-MLX}
}
```