---
license: other
license_name: modified-mit
license_link: LICENSE
base_model:
  - MiniMaxAI/MiniMax-M2.5
  - MiniMaxAI/MiniMax-M2.7
tags:
  - merge
  - slerp
  - moe
  - fp8
  - minimax
  - minimax_m2
  - code
  - reasoning
  - agents
model_type: minimax_m2
pipeline_tag: text-generation
library_name: transformers
---

![image](https://cdn-uploads.huggingface.co/production/uploads/63adf1fa42fd3b8dbaeb0c92/JuwTD-9eczmeBf5P8NLDP.png)

# MiniMax-SLURPY

**A mathematically unique blend of [MiniMax-M2.5](https://huggingface.co/MiniMaxAI/MiniMax-M2.5) and [MiniMax-M2.7](https://huggingface.co/MiniMaxAI/MiniMax-M2.7) — neither parent, entirely its own model.**

SLURPY inherits M2.5's architect-first coding style and Modified MIT license, and absorbs M2.7's RL-tuned precision on multi-agent collaboration and real-world engineering, all without a single training step. It also beats its parents on HumanEval pass@5 (89.6% vs. M2.5's 85.4%).

Every one of SLURPY's 48,239 weight tensors is a mathematically unique blend — not copied from M2.5, not copied from M2.7, belonging entirely to neither parent.

---

## What SLURPY inherits

SLURPY's weights are a forensically-driven interpolation of two complementary parents. The merge schedule is derived from a full-model scan of all 96,103 tensor pairs, and each tensor's interpolation ratio is set from the empirically measured delta between the parents on that tensor.

### From M2.5 — the architect

M2.5 is the foundation-builder: strong on greenfield engineering, deep reasoning, and research-grade benchmarks.

| Benchmark | M2.5 Published |
|---|---|
| SWE-Bench Verified | **80.2%** |
| BrowseComp (with context mgmt) | **76.3%** |
| Multi-SWE-Bench | 51.3% |
| AIME 2025 | 86.3 |
| GPQA Diamond | 85.2 |
| SciCode | 44.4 |
| IFBench | 70.0 |
| HLE (w/o tools) | 19.4 |
| GDPval-MM (office work) | 59.0% avg win rate |

### From M2.7 — the operator

M2.7 is the execution specialist: RL-tuned for multi-step tool use, terminal ops, agentic scaffolding, and production-grade software engineering.

| Benchmark | M2.7 Published |
|---|---|
| SWE-Pro | **56.2%** (matches GPT-5.3-Codex) |
| SWE Multilingual | **76.5%** |
| Multi-SWE-Bench | 52.7% |
| MLE Bench Lite | **66.6%** medal rate (22 ML competitions) |
| VIBE-Pro | **55.6%** (near Opus 4.6) |
| TerminalBench 2 | **57.0%** |
| NL2Repo | 39.8% |
| GDPval-AA ELO | **1495** (highest open-weight) |
| Toolathon | 46.3% accuracy |
| MM Claw (skill compliance) | **97%** across 40+ skills |
| MM Claw (end-to-end) | 62.7% (near Sonnet 4.6) |

### SLURPY — best of both

SLURPY's merge schedule preserves M2.5's deep reasoning character in the early-to-mid layers (where the two models barely differ) while absorbing M2.7's agentic improvements in the late layers (where M2.7's training signal concentrates). The result is a model that carries both parents' strengths without the training cost of either.

---

## Merge method

**Per-tensor empirical SLERP** — each of the 48,239 mergeable weight tensors gets its own interpolation ratio `t(k)` derived from the measured cosine similarity between M2.5 and M2.7 on that specific tensor:

```
delta(k)      = 1 - cos(M2.5_k, M2.7_k)
delta_norm(k) = clip(delta(k) / delta_p99, 0, 1)
t(k)          = 0.50 + 0.35 * delta_norm(k)
```

- **Tensors that barely changed** (cos ~ 1.0): `t ~ 0.50` — neutral midpoint, preserving both parents
- **Tensors that changed the most** (layer 61 MoE experts): `t = 0.85` — absorbing M2.7's concentrated training signal
- **FP8 weights**: dequantized to BF16 before SLERP, re-quantized with fresh block-wise scales
- **No scale_inv pass-through**: forensics confirmed 0% bit-identical scales between parents — all 47,864 FP8 scale tensors are recomputed, not copied
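
The schedule and the interpolation step above can be sketched in a few lines of NumPy. This is a minimal illustration, not the actual merge pipeline: the function names are hypothetical, `delta_p99` is assumed to be the 99th percentile of `delta(k)` over all tensors, and `t = 0` is assumed to mean "pure M2.5" while `t = 1` means "pure M2.7".

```python
import numpy as np


def interpolation_ratio(cos_sim: float, delta_p99: float) -> float:
    """Map a tensor's cosine similarity to its SLERP ratio t(k).

    delta_p99 must be positive; it is assumed here to be the 99th
    percentile of (1 - cos) across all tensor pairs.
    """
    delta = 1.0 - cos_sim
    delta_norm = float(np.clip(delta / delta_p99, 0.0, 1.0))
    return 0.50 + 0.35 * delta_norm


def slerp(a: np.ndarray, b: np.ndarray, t: float) -> np.ndarray:
    """Spherical interpolation between two tensors, treated as flat vectors.

    t = 0 returns a (assumed to be the M2.5 tensor), t = 1 returns b (M2.7).
    """
    a_flat = a.astype(np.float64).ravel()
    b_flat = b.astype(np.float64).ravel()
    lerp = (1.0 - t) * a_flat + t * b_flat

    norm_product = np.linalg.norm(a_flat) * np.linalg.norm(b_flat)
    if norm_product == 0.0:
        return lerp.reshape(a.shape)  # angle undefined, fall back to linear

    cos_omega = np.clip(np.dot(a_flat, b_flat) / norm_product, -1.0, 1.0)
    omega = np.arccos(cos_omega)
    sin_omega = np.sin(omega)
    if sin_omega < 1e-6:
        return lerp.reshape(a.shape)  # nearly (anti)parallel, SLERP is unstable

    merged = (
        np.sin((1.0 - t) * omega) * a_flat + np.sin(t * omega) * b_flat
    ) / sin_omega
    return merged.reshape(a.shape)
```

With this mapping, a tensor whose cosine similarity is 1.0 gets `t = 0.50`, and any tensor whose delta reaches or exceeds `delta_p99` is clipped to `t = 0.85`.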

### Forensic highlights

- **99.18%** of tensors sit in a tight cosine cluster around 0.9946 — most weights barely moved between M2.5 and M2.7
- **Layer 61 MoE experts** {76, 74, 61, 30, 43, 138, 226, 126, 58, 159} have deltas 2-5x baseline — this is where M2.7's RL training signal concentrates
- **lm_head.weight** (cos=0.9905, rel_l2=0.139) carries M2.7's vocabulary-level improvements
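
The two per-tensor statistics quoted above can be computed as follows. This is a sketch under assumptions: `rel_l2` is taken to mean `||b - a|| / ||a||`, which the card does not define explicitly, and the function names are hypothetical.

```python
import math

import numpy as np


def tensor_delta_stats(a: np.ndarray, b: np.ndarray) -> dict:
    """Cosine similarity and relative L2 distance between two tensors."""
    a_flat = a.astype(np.float64).ravel()
    b_flat = b.astype(np.float64).ravel()
    norm_a = np.linalg.norm(a_flat)
    norm_b = np.linalg.norm(b_flat)
    cos = float(np.dot(a_flat, b_flat) / (norm_a * norm_b))
    rel_l2 = float(np.linalg.norm(b_flat - a_flat) / norm_a)  # assumed definition
    return {"cos": cos, "rel_l2": rel_l2}


def rel_l2_if_equal_norms(cos: float) -> float:
    """For two vectors of equal norm, ||b - a|| / ||a|| = sqrt(2 * (1 - cos))."""
    return math.sqrt(2.0 * (1.0 - cos))
```

Under the equal-norm approximation, `rel_l2_if_equal_norms(0.9905)` gives about 0.138, which is close to the 0.139 reported for `lm_head.weight`, so the two figures are consistent with this definition.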

---

## Architecture

Identical to MiniMax-M2.5 / M2.7 — weight merge only, no architecture changes:

- **Model type**: `minimax_m2` / `MiniMaxM2ForCausalLM`
- **Parameters**: 228.7B total, ~10B active (MoE)
- **Layers**: 62
- **Hidden size**: 3072
- **MoE**: 256 experts, top-8, sigmoid routing + learned bias
- **Attention**: 48 query / 8 KV heads (GQA 6:1), head_dim=128
- **Quantization**: FP8 (`float8_e4m3fn`), block size [128, 128]
- **Vocab**: 200,064 tokens
- **Context**: up to 196,608 tokens
- **Thinking**: Interleaved `<think>...</think>` (always-on)
- **`trust_remote_code=True` required**
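
For capacity planning, the attention figures above allow a rough KV-cache estimate. The sketch below assumes every one of the 62 layers keeps a standard per-token key and value cache and that the cache is stored in BF16 (2 bytes per element). Both are assumptions, not statements about how a given serving stack allocates memory.

```python
NUM_LAYERS = 62
NUM_QUERY_HEADS = 48
NUM_KV_HEADS = 8
HEAD_DIM = 128
MAX_CONTEXT = 196_608


def kv_cache_bytes(num_tokens: int, bytes_per_element: int = 2) -> int:
    """Estimated KV-cache size for one sequence (keys + values, all layers)."""
    elements_per_token = NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * 2  # K and V
    return elements_per_token * bytes_per_element * num_tokens


gqa_ratio = NUM_QUERY_HEADS // NUM_KV_HEADS          # query heads per KV head
per_token_kib = kv_cache_bytes(1) / 1024             # KiB per token
full_context_gib = kv_cache_bytes(MAX_CONTEXT) / 2**30

print(f"GQA ratio: {gqa_ratio}:1")
print(f"KV cache per token: {per_token_kib:.0f} KiB")
print(f"KV cache at full context: {full_context_gib:.1f} GiB")
```

Under these assumptions a single full-length sequence needs roughly 46.5 GiB of cache on top of the weights, which is one reason to spread the model across several GPUs.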

---

## Serving with vLLM

Recommended command (8x H100 80GB):

```bash
SAFETENSORS_FAST_GPU=1 vllm serve \
    Ex0bit/MiniMax-SLURPY --trust-remote-code \
    --enable-expert-parallel --tensor-parallel-size 8 \
    --enable-auto-tool-choice --tool-call-parser minimax_m2 \
    --reasoning-parser minimax_m2_append_think \
    --enforce-eager
```

For 4x GPU (no expert parallel):

```bash
SAFETENSORS_FAST_GPU=1 vllm serve \
    Ex0bit/MiniMax-SLURPY --trust-remote-code \
    --tensor-parallel-size 4 \
    --enable-auto-tool-choice --tool-call-parser minimax_m2 \
    --reasoning-parser minimax_m2_append_think
```

If you encounter CUDA memory errors, add:
```bash
--compilation-config '{"cudagraph_mode": "PIECEWISE"}'
```

### Recommended sampling parameters

| Parameter | Value |
|---|---|
| temperature | 1.0 |
| top_p | 0.95 |
| top_k | 40 |
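
A request body carrying these values could be built as below. This is a sketch: `build_chat_request` is a hypothetical helper, and it assumes the server is vLLM's OpenAI-compatible endpoint, which is expected to accept `top_k` as an extra top-level field alongside the standard `temperature` and `top_p`.

```python
import json

# Values from the table above.
RECOMMENDED_SAMPLING = {"temperature": 1.0, "top_p": 0.95, "top_k": 40}


def build_chat_request(messages, model="Ex0bit/MiniMax-SLURPY", max_tokens=2048):
    """Return a JSON-serializable chat completion request body."""
    return {
        "model": model,
        "messages": messages,
        "max_tokens": max_tokens,
        **RECOMMENDED_SAMPLING,
    }


payload = build_chat_request(
    [{"role": "user", "content": "Summarize the SLERP merge method."}]
)
print(json.dumps(payload, indent=2))
```

The resulting dictionary can be sent with any HTTP client to the server's `/v1/chat/completions` route.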

### Important: preserve thinking in conversation history

MiniMax-M2 uses interleaved thinking. The model outputs `<think>...</think>` blocks during generation. **You must pass these back verbatim in conversation history.** Removing them degrades performance.
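
In client code this means storing the assistant's text exactly as received, rather than filtering out the reasoning before the next turn. A minimal sketch, with hypothetical helper names and assuming the assistant text you receive still contains its `<think>` block:

```python
def append_assistant_turn(history, assistant_text):
    """Return a new history with the assistant turn stored verbatim.

    The text is not trimmed or filtered, so any <think>...</think>
    block stays in the conversation for the next request.
    """
    return history + [{"role": "assistant", "content": assistant_text}]


def has_balanced_thinking(text):
    """Check that every <think> tag in the text has a matching </think>."""
    return text.count("<think>") == text.count("</think>")


history = [{"role": "user", "content": "What is 2 + 2?"}]
reply = "<think>Simple arithmetic: 2 + 2 = 4.</think>\n4"
history = append_assistant_turn(history, reply)
history.append({"role": "user", "content": "And times 3?"})
```

The second user turn is then sent together with the unmodified first reply, reasoning included.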

---

## Tool calling

Same format as MiniMax-M2.7. Tool calls use `<minimax:tool_call>` / `</minimax:tool_call>` XML wrappers:

```xml
<minimax:tool_call>
<invoke name="get_weather">
<parameter name="city">San Francisco</parameter>
</invoke>
</minimax:tool_call>
```

Enable with `--enable-auto-tool-choice --tool-call-parser minimax_m2` in vLLM.
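
When working with raw model output instead of vLLM's parsed response, the wrapper format shown above can be read with a few regular expressions. This is an illustrative sketch only: `parse_tool_calls` is a hypothetical helper, not the parser vLLM uses, and it returns every parameter value as a plain string.

```python
import re

TOOL_CALL_RE = re.compile(
    r"<minimax:tool_call>(.*?)</minimax:tool_call>", re.DOTALL
)
INVOKE_RE = re.compile(r'<invoke name="([^"]+)">(.*?)</invoke>', re.DOTALL)
PARAM_RE = re.compile(r'<parameter name="([^"]+)">(.*?)</parameter>', re.DOTALL)


def parse_tool_calls(text):
    """Extract tool calls as a list of {"name": ..., "arguments": {...}}."""
    calls = []
    for block in TOOL_CALL_RE.findall(text):
        for name, body in INVOKE_RE.findall(block):
            arguments = {key: value.strip() for key, value in PARAM_RE.findall(body)}
            calls.append({"name": name, "arguments": arguments})
    return calls


example = """<minimax:tool_call>
<invoke name="get_weather">
<parameter name="city">San Francisco</parameter>
</invoke>
</minimax:tool_call>"""

print(parse_tool_calls(example))
```

For the example above, the result is a single call to `get_weather` with `city` set to `San Francisco`.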

---

## Using with Transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Custom modeling code ships with the repo, so trust_remote_code is required.
model = AutoModelForCausalLM.from_pretrained(
    "Ex0bit/MiniMax-SLURPY",
    trust_remote_code=True,
    torch_dtype="auto",   # keep the checkpoint's native dtype
    device_map="auto",    # shard across all visible GPUs
)
tokenizer = AutoTokenizer.from_pretrained(
    "Ex0bit/MiniMax-SLURPY",
    trust_remote_code=True,
)

# Render the conversation with the bundled chat template.
messages = [{"role": "user", "content": "Write a Python function that reverses a linked list."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Sample with the recommended parameters.
with torch.no_grad():
    output = model.generate(
        input_ids,
        max_new_tokens=2048,
        do_sample=True,
        temperature=1.0,
        top_p=0.95,
        top_k=40,
    )

# Decode only the newly generated tokens, not the prompt.
print(tokenizer.decode(output[0, input_ids.shape[1]:], skip_special_tokens=True))
```

---

## Config notes

- `use_mtp` is set to `false` in `config.json` (MTP tensors don't exist in the checkpoint)
- `quantization_config` is preserved — native FP8
- Chat template and tokenizer are sourced from M2.7

## Files

- 43 safetensors shards (~5 GB each, 214.3 GB total)
- Native FP8 (`float8_e4m3fn`) with block-wise `[128, 128]` scale factors
- `chat_template.jinja` — M2.7's chat template with tool calling support
- `modeling_minimax_m2.py` / `configuration_minimax_m2.py` — custom model code
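
To make the block-wise layout concrete, the sketch below computes one scale per `[128, 128]` tile of a 2-D weight. It rests on assumptions: the stored scale is taken to be a dequantization factor (`weight ≈ fp8_value * scale_inv`), each scale is chosen so the tile's largest magnitude maps to 448, the largest finite `float8_e4m3fn` value, and the actual rounding to FP8 is omitted because NumPy has no FP8 dtype. The function names are hypothetical.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite float8_e4m3fn value
BLOCK = 128


def blockwise_scale_inv(weight: np.ndarray, block: int = BLOCK) -> np.ndarray:
    """One scale per [block, block] tile; shape is (ceil(R/block), ceil(C/block))."""
    rows, cols = weight.shape
    n_row_blocks = -(-rows // block)  # ceiling division
    n_col_blocks = -(-cols // block)
    scale_inv = np.ones((n_row_blocks, n_col_blocks), dtype=np.float32)
    for i in range(n_row_blocks):
        for j in range(n_col_blocks):
            tile = weight[i * block:(i + 1) * block, j * block:(j + 1) * block]
            amax = float(np.abs(tile).max())
            if amax > 0.0:
                scale_inv[i, j] = amax / FP8_E4M3_MAX
    return scale_inv


def expand_scales(scale_inv: np.ndarray, shape, block: int = BLOCK) -> np.ndarray:
    """Broadcast per-tile scales back to the full weight shape."""
    full = np.repeat(np.repeat(scale_inv, block, axis=0), block, axis=1)
    return full[: shape[0], : shape[1]]


rng = np.random.default_rng(0)
weight = rng.standard_normal((256, 384)).astype(np.float32)
scale_inv = blockwise_scale_inv(weight)
scaled = weight / expand_scales(scale_inv, weight.shape)  # values now fit the FP8 range
restored = scaled * expand_scales(scale_inv, weight.shape)
```

A `(256, 384)` weight yields a `(2, 3)` scale tensor. By the same rule, a `3072 x 3072` projection would carry a `24 x 24` grid of scales.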

---

## License

Modified MIT — same as MiniMax-M2.5. See [LICENSE](LICENSE) for full text.

The only modification to the standard MIT license: if the Software (or any derivative works) is used for commercial products or services with more than 100 million monthly active users or more than $30M annual recurring revenue, you must prominently display "MiniMax M2" on the user interface.

---

## Citation

```
@misc{minimax-slurpy-2026,
  title={MiniMax-SLURPY: Per-tensor empirical SLERP merge of MiniMax-M2.5 and M2.7},
  author={Ex0bit},
  year={2026},
  url={https://huggingface.co/Ex0bit/MiniMax-SLURPY}
}
```

## Acknowledgments

- [MiniMax](https://www.minimaxi.com/) for the M2.5 and M2.7 base models
- Merge infrastructure adapted from the PRISM abliteration pipeline