File size: 7,966 Bytes
0e34c2c
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
b904c53
0e34c2c
 
 
 
 
 
 
 
 
 
 
b904c53
0e34c2c
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
b904c53
0e34c2c
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
---
license: other
license_name: modified-mit
license_link: LICENSE
base_model:
  - Ex0bit/MiniMax-SLURPY
base_model_relation: quantized
tags:
  - mlx
  - apple-silicon
  - moe
  - prism-dq
  - dynamic-quantization
  - minimax
  - minimax_m2
  - code
  - reasoning
  - agents
  - quantized
model_type: minimax_m2
pipeline_tag: text-generation
library_name: mlx
quantized_by: Ex0bit
---

![image](https://cdn-uploads.huggingface.co/production/uploads/63adf1fa42fd3b8dbaeb0c92/JuwTD-9eczmeBf5P8NLDP.png)

# MiniMax-SLURPY-DQ-MLX

**Per-tensor mixed-precision quantization of [MiniMax-SLURPY](https://huggingface.co/Ex0bit/MiniMax-SLURPY) for Apple Silicon — 2.54 BPW with 498 per-tensor-projection allocations (plus 16,122 per-expert PRISM decisions collapsed into MLX's SwitchGLU format).**

The full SLURPY model (228.7B params) compressed from 215 GB → 68 GB (68% reduction) using **PRISM Dynamic Quantization** — a per-tensor-class mixed-precision allocation derived entirely from weight structure sensitivity analysis. Zero calibration data, zero training, zero datasets.

Created by [Ex0bit](https://hf.co/Ex0bit)

---

<div align="center">

### 💡 Support our Research & Development efforts. PRISM Members Receive access to the latest PRISM-PRO Model drops on Day-0

[![Ko-fi](https://img.shields.io/badge/Ko--fi-Support%20PRISM-ff5e5b?logo=ko-fi&logoColor=white)](https://ko-fi.com/Ex0bit)

</div>

---

## Model Details

| Property | Value |
|----------|-------|
| Base Model | [Ex0bit/MiniMax-SLURPY](https://huggingface.co/Ex0bit/MiniMax-SLURPY) |
| Architecture | MiniMax M2 MoE (256 experts, top-8) |
| Parameters | 228.7B total / ~10B active |
| Quantization | PRISM-DYNAMIC-QUANT (MLX native) |
| Achieved BPW | 2.54 |
| File Size | 68 GB (vs 215 GB source = 68% reduction) |
| Per-tensor overrides | 498 (MoE: per-layer-projection modal of 16,122 per-expert decisions) |
| Default precision | 2-bit |
| Group size | 64 |
| Context Length | 196,608 tokens |
| Runtime | mlx-lm (Apple Silicon Metal) |
| Creator | [Ex0bit](https://hf.co/Ex0bit) |

## What SLURPY inherits

A mathematically unique Designer Baby of [MiniMax-M2.5](https://huggingface.co/MiniMaxAI/MiniMax-M2.5) and [MiniMax-M2.7](https://huggingface.co/MiniMaxAI/MiniMax-M2.7) — neither parent, entirely its own model.

SLURPY inherits M2.5's architect-first coding style and MIT freedom, absorbs M2.7's RL-tuned precision on multi-agent collaboration and real-world engineering — without a single training step.

| Benchmark | M2.5 | M2.7 | SLURPY |
|---|---|---|---|
| HumanEval pass@5 | 85.4% | — | **89.6%** |
| SWE-Bench Verified | 80.2% | — | inherited |
| SWE-Pro | — | 56.2% | inherited |
| MLE Bench Lite | — | 66.6% | inherited |
| GDPval-AA ELO | — | 1495 | inherited |

See [Ex0bit/MiniMax-SLURPY](https://huggingface.co/Ex0bit/MiniMax-SLURPY) for full benchmark details.

---

## PRISM Dynamic Quantization

This model uses **PRISM Dynamic Quantization** — a per-tensor mixed-precision allocation that assigns different quantization types to different tensor classes based on weight structure sensitivity analysis.

Unlike uniform quantization (Q3, Q4, Q5), PRISM-DQ analyzes each tensor's sensitivity to quantization error and allocates precision where it matters most. Critical tensors (attention projections, key MoE experts, lm_head) receive higher precision while less impactful tensors get aggressive compression.

PRISM produced 16,122 per-expert decisions (256 experts × 62 layers × 3 projections, plus attention and embeddings). MLX's `SwitchGLU` packs all 256 experts per layer-projection into a single 3D tensor sharing one bit width, so the per-expert decisions collapse to the modal bit width for each of the 186 MoE projections. The remaining 312 per-tensor decisions (attention, embeddings, lm_head, routers) retain full PRISM granularity, giving 498 effective overrides.

The model's `config.json` contains the per-tensor quantization overrides that mlx-lm loads natively — no custom runtime required. Apple Silicon's compiled Metal kernels automatically handle mixed-precision tensors in a single forward pass at full GPU speed.

**No calibration data, no importance matrices, no training data required.**

---

## Architecture

Identical to MiniMax-M2.5 / M2.7 — quantization-only:

- **Model type**: `minimax_m2` / `MiniMaxM2ForCausalLM`
- **Parameters**: 228.7B total, ~10B active (MoE)
- **Layers**: 62
- **Hidden size**: 3072
- **MoE**: 256 experts, top-8, sigmoid routing + learned bias
- **Attention**: 48 query / 8 KV heads (GQA 6:1), head_dim=128
- **Quantization**: MLX affine, mixed 2-6 bit
- **Vocab**: 200,064 tokens
- **Context**: up to 196,608 tokens
- **Thinking**: Interleaved `<think>...</think>` (always-on)
- **`trust_remote_code=True` required**

---

## Usage on Apple Silicon

### mlx-lm (CLI)

```bash
pip install mlx-lm

# Interactive chat
mlx_lm.chat --model Ex0bit/MiniMax-SLURPY-PRISM-3BPW-MLX \
  --temperature 1.0 --top-p 0.95 --max-tokens 4096

# Single prompt
python -m mlx_lm.generate \
  --model Ex0bit/MiniMax-SLURPY-PRISM-3BPW-MLX \
  --prompt "Write a Python function that reverses a linked list." \
  --max-tokens 2048 \
  --temp 1.0 --top-p 0.95
```

### Python API

```python
from mlx_lm import load, generate

model, tokenizer = load("Ex0bit/MiniMax-SLURPY-PRISM-3BPW-MLX")

response = generate(
    model, tokenizer,
    prompt="Write a Python function that reverses a linked list.",
    max_tokens=2048,
    temp=1.0,
    top_p=0.95,
)
print(response)
```

### Recommended sampling parameters

| Parameter | Value |
|---|---|
| temperature | 1.0 |
| top_p | 0.95 |
| top_k | 40 |

### Important: preserve thinking in conversation history

MiniMax-M2 uses interleaved thinking. The model outputs `<think>...</think>` blocks during generation. **You must pass these back verbatim in conversation history.** Removing them degrades performance.

---

## Tool calling

Same format as base SLURPY. Tool calls use `<minimax:tool_call>` / `</minimax:tool_call>` XML wrappers:

```xml
<minimax:tool_call>
<invoke name="get_weather">
<parameter name="city">San Francisco</parameter>
</invoke>
</minimax:tool_call>
```

---

## Hardware requirements

- **Apple Silicon Mac** with unified memory
- **80 GB RAM minimum** (model is 68 GB; needs headroom for KV cache)
- **128 GB RAM recommended** for full context length
- **M2 Ultra / M3 Max / M4 Max** for best throughput

For non-Apple platforms, use the FP8 [Ex0bit/MiniMax-SLURPY](https://huggingface.co/Ex0bit/MiniMax-SLURPY) variant with vLLM.

---

## Files

- 14 MLX safetensors shards (68 GB total)
- `config.json` with 498 per-tensor quantization overrides (collapsed from 16,122 PRISM decisions via SwitchGLU packing)
- `chat_template.jinja` — M2.7's chat template with tool calling support
- `modeling_minimax_m2.py` / `configuration_minimax_m2.py` — custom model code (inherited from base)

---

## License

Modified MIT — same as MiniMax-M2.5. See [LICENSE](LICENSE) for full text.

The only modification to the standard MIT license: if the Software (or any derivative works) is used for commercial products or services with more than 100 million monthly active users or more than $30M annual recurring revenue, you must prominently display "MiniMax M2" on the user interface.

---

## Credits

- Creator: [Ex0bit](https://hf.co/Ex0bit)
- Base model: [Ex0bit/MiniMax-SLURPY](https://huggingface.co/Ex0bit/MiniMax-SLURPY)
- Parents: [MiniMaxAI/MiniMax-M2.5](https://huggingface.co/MiniMaxAI/MiniMax-M2.5), [MiniMaxAI/MiniMax-M2.7](https://huggingface.co/MiniMaxAI/MiniMax-M2.7)
- Quantization engine: PRISM-DQ by [Ex0bit](https://hf.co/Ex0bit)

---

## Citation

```
@misc{minimax-slurpy-prism-mlx-2026,
  title={MiniMax-SLURPY-PRISM-3BPW-MLX: Per-tensor mixed-precision quantization of MiniMax-SLURPY for Apple Silicon},
  author={Ex0bit},
  year={2026},
  url={https://huggingface.co/Ex0bit/MiniMax-SLURPY-PRISM-3BPW-MLX}
}
```