---
license: other
license_name: modified-mit
license_link: LICENSE
base_model:
- MiniMaxAI/MiniMax-M2.5
- MiniMaxAI/MiniMax-M2.7
tags:
- merge
- slerp
- moe
- fp8
- minimax
- minimax_m2
- code
- reasoning
- agents
model_type: minimax_m2
pipeline_tag: text-generation
library_name: transformers
---

# MiniMax-SLURPY
**A mathematically unique blend of [MiniMax-M2.5](https://huggingface.co/MiniMaxAI/MiniMax-M2.5) and [MiniMax-M2.7](https://huggingface.co/MiniMaxAI/MiniMax-M2.7) — neither parent, entirely its own model.**
SLURPY inherits M2.5's architect-first coding style and permissive modified-MIT license, and absorbs M2.7's RL-tuned precision on multi-agent collaboration and real-world engineering, all without a single training step. It beats its parents on HumanEval pass@5 (89.6% vs. M2.5's 85.4%).
Every one of SLURPY's 48,239 weight tensors is a mathematically unique blend — not copied from M2.5, not copied from M2.7, belonging entirely to neither parent.
---
## What SLURPY inherits
SLURPY's weights are a forensically driven interpolation of two complementary parents. The merge schedule is derived from a full-model scan of all 96,103 tensor pairs and sets each tensor's interpolation ratio from the empirically measured delta between the parents.
### From M2.5 — the architect
M2.5 is the foundation-builder: strong on greenfield engineering, deep reasoning, and research-grade benchmarks.
| Benchmark | M2.5 Published |
|---|---|
| SWE-Bench Verified | **80.2%** |
| BrowseComp (with context mgmt) | **76.3%** |
| Multi-SWE-Bench | 51.3% |
| AIME 2025 | 86.3 |
| GPQA Diamond | 85.2 |
| SciCode | 44.4 |
| IFBench | 70.0 |
| HLE (w/o tools) | 19.4 |
| GDPval-MM (office work) | 59.0% avg win rate |
### From M2.7 — the operator
M2.7 is the execution specialist: RL-tuned for multi-step tool use, terminal ops, agentic scaffolding, and production-grade software engineering.
| Benchmark | M2.7 Published |
|---|---|
| SWE-Pro | **56.2%** (matches GPT-5.3-Codex) |
| SWE Multilingual | **76.5%** |
| Multi-SWE-Bench | 52.7% |
| MLE Bench Lite | **66.6%** medal rate (22 ML competitions) |
| VIBE-Pro | **55.6%** (near Opus 4.6) |
| TerminalBench 2 | **57.0%** |
| NL2Repo | 39.8% |
| GDPval-AA ELO | **1495** (highest open-weight) |
| Toolathon | 46.3% accuracy |
| MM Claw (skill compliance) | **97%** across 40+ skills |
| MM Claw (end-to-end) | 62.7% (near Sonnet 4.6) |
### SLURPY — best of both
SLURPY's merge schedule preserves M2.5's deep reasoning character in the early-to-mid layers (where the two models barely differ) while absorbing M2.7's agentic improvements in the late layers (where M2.7's training signal concentrates). The result is a model that carries both parents' strengths without the training cost of either.
---
## Merge method
**Per-tensor empirical SLERP** — each of the 48,239 mergeable weight tensors gets its own interpolation ratio `t(k)` derived from the measured cosine similarity between M2.5 and M2.7 on that specific tensor:
```
delta(k) = 1 - cos(M2.5_k, M2.7_k)
delta_norm(k) = clip(delta(k) / delta_p99, 0, 1)
t(k) = 0.50 + 0.35 * delta_norm(k)
```
- **Tensors that barely changed** (cos ~ 1.0): `t ~ 0.50` — neutral midpoint, preserving both parents
- **Tensors that changed the most** (layer 61 MoE experts): `t = 0.85` — absorbing M2.7's concentrated training signal
- **FP8 weights**: dequantized to BF16 before SLERP, re-quantized with fresh block-wise scales
- **No scale_inv pass-through**: forensics confirmed 0% bit-identical scales between parents — all 47,864 FP8 scale tensors are recomputed, not copied
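
The schedule above can be sketched in a few lines of NumPy. This is a minimal illustration, not the actual merge pipeline: the function names, the `delta_p99=0.02` value, and the choice to apply the SLERP formula directly to unnormalized flattened tensors are assumptions made for the example. It takes `t = 0` to mean M2.5 and `t = 1` to mean M2.7, consistent with `t = 0.85` leaning toward M2.7.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two tensors, computed on flattened float64 copies."""
    a64, b64 = a.ravel().astype(np.float64), b.ravel().astype(np.float64)
    return float(a64 @ b64 / (np.linalg.norm(a64) * np.linalg.norm(b64) + 1e-12))

def merge_ratio(cos_sim: float, delta_p99: float) -> float:
    """t(k) = 0.50 + 0.35 * clip(delta(k) / delta_p99, 0, 1)."""
    delta = 1.0 - cos_sim
    delta_norm = float(np.clip(delta / delta_p99, 0.0, 1.0))
    return 0.50 + 0.35 * delta_norm

def slerp(a: np.ndarray, b: np.ndarray, t: float) -> np.ndarray:
    """Spherical interpolation from a (t=0) to b (t=1) on flattened tensors."""
    a64, b64 = a.ravel().astype(np.float64), b.ravel().astype(np.float64)
    omega = np.arccos(np.clip(cosine(a64, b64), -1.0, 1.0))
    if omega < 1e-6:
        # Nearly parallel tensors: linear interpolation is numerically safer.
        out = (1.0 - t) * a64 + t * b64
    else:
        s = np.sin(omega)
        out = (np.sin((1.0 - t) * omega) / s) * a64 + (np.sin(t * omega) / s) * b64
    return out.reshape(a.shape)

# Toy stand-ins for one M2.5 / M2.7 tensor pair (hypothetical data).
rng = np.random.default_rng(0)
w25 = rng.standard_normal((4, 8))
w27 = w25 + 0.1 * rng.standard_normal((4, 8))

t = merge_ratio(cosine(w25, w27), delta_p99=0.02)  # delta_p99 is a made-up value
merged = slerp(w25, w27, t)
```

For FP8 tensors, a step like this would run on the dequantized BF16 values, with re-quantization afterwards.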
### Forensic highlights
- **99.18%** of tensors sit in a tight cosine cluster around 0.9946 — most weights barely moved between M2.5 and M2.7
- **Layer 61 MoE experts** {76, 74, 61, 30, 43, 138, 226, 126, 58, 159} have deltas 2-5x baseline — this is where M2.7's RL training signal concentrates
- **lm_head.weight** (cos=0.9905, rel_l2=0.139) carries M2.7's vocabulary-level improvements
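
The two `lm_head.weight` figures can be cross-checked against each other. The sketch below assumes `rel_l2` means `||M2.7 - M2.5|| / ||M2.5||` (the card does not define it) and that both tensors have roughly equal norms; the helper name is hypothetical.

```python
import math

def rel_l2_from_cosine(cos_sim: float, norm_ratio: float = 1.0) -> float:
    """||b - a|| / ||a|| for vectors with the given cosine and ||b|| = norm_ratio * ||a||."""
    return math.sqrt(1.0 + norm_ratio**2 - 2.0 * norm_ratio * cos_sim)

# Equal-norm estimate for lm_head.weight (cos = 0.9905).
lm_head_estimate = rel_l2_from_cosine(0.9905)
print(round(lm_head_estimate, 3))  # 0.138
```

Under those assumptions the estimate lands close to the reported 0.139, so the cosine and relative-L2 figures are mutually consistent.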
---
## Architecture
Identical to MiniMax-M2.5 / M2.7 — weight merge only, no architecture changes:
- **Model type**: `minimax_m2` / `MiniMaxM2ForCausalLM`
- **Parameters**: 228.7B total, ~10B active (MoE)
- **Layers**: 62
- **Hidden size**: 3072
- **MoE**: 256 experts, top-8, sigmoid routing + learned bias
- **Attention**: 48 query / 8 KV heads (GQA 6:1), head_dim=128
- **Quantization**: FP8 (`float8_e4m3fn`), block size [128, 128]
- **Vocab**: 200,064 tokens
- **Context**: up to 196,608 tokens
- **Thinking**: Interleaved `<think>...</think>` (always-on)
- **`trust_remote_code=True` required**
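
For capacity planning, the attention figures above translate into projection widths and a rough KV-cache budget. The arithmetic below is a back-of-the-envelope sketch that assumes a BF16 (2-byte) cache and that all 62 layers keep a full K/V cache; neither assumption is stated in this card, and serving engines may differ.

```python
LAYERS = 62
Q_HEADS, KV_HEADS, HEAD_DIM = 48, 8, 128
MAX_CONTEXT = 196_608

def kv_cache_bytes_per_token(bytes_per_elem: int = 2) -> int:
    """Bytes for one token's K and V entries across all layers (assumed BF16)."""
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * bytes_per_elem

q_width = Q_HEADS * HEAD_DIM       # 6144 query features
kv_width = KV_HEADS * HEAD_DIM     # 1024 key (or value) features
gqa_group = Q_HEADS // KV_HEADS    # 6 query heads share each KV head

per_token = kv_cache_bytes_per_token()             # 253,952 bytes
full_context_gib = per_token * MAX_CONTEXT / 2**30  # 46.5 GiB for one sequence
```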
---
## Serving with vLLM
Recommended command (8x H100 80GB):
```bash
SAFETENSORS_FAST_GPU=1 vllm serve \
Ex0bit/MiniMax-SLURPY --trust-remote-code \
--enable-expert-parallel --tensor-parallel-size 8 \
--enable-auto-tool-choice --tool-call-parser minimax_m2 \
--reasoning-parser minimax_m2_append_think \
--enforce-eager
```
For 4x GPU (no expert parallel):
```bash
SAFETENSORS_FAST_GPU=1 vllm serve \
Ex0bit/MiniMax-SLURPY --trust-remote-code \
--tensor-parallel-size 4 \
--enable-auto-tool-choice --tool-call-parser minimax_m2 \
--reasoning-parser minimax_m2_append_think
```
If you encounter CUDA memory errors, add:
```bash
--compilation-config '{"cudagraph_mode": "PIECEWISE"}'
```
### Recommended sampling parameters
| Parameter | Value |
|---|---|
| temperature | 1.0 |
| top_p | 0.95 |
| top_k | 40 |
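
A request carrying these sampling parameters might look like the sketch below. It assumes a vLLM server on its default `http://localhost:8000` and uses only the standard library; `build_chat_request` and `send` are hypothetical helper names. `top_k` is not part of the OpenAI schema, but vLLM accepts it as an additional field in the request body.

```python
import json
import urllib.request

def build_chat_request(messages, model="Ex0bit/MiniMax-SLURPY"):
    """Build a chat-completions payload with the recommended sampling settings."""
    return {
        "model": model,
        "messages": messages,
        "temperature": 1.0,
        "top_p": 0.95,
        "top_k": 40,  # vLLM-specific extra sampling parameter
    }

def send(payload, base_url="http://localhost:8000"):
    """POST the payload to an OpenAI-compatible endpoint (base_url is an assumption)."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

payload = build_chat_request([{"role": "user", "content": "Reverse a linked list in Python."}])
# response = send(payload)  # uncomment once the server is running
```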
### Important: preserve thinking in conversation history
MiniMax-M2 uses interleaved thinking. The model outputs `<think>...</think>` blocks during generation. **You must pass these back verbatim in conversation history.** Removing them degrades performance.
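
In client code, this means storing the assistant's raw output unchanged before sending the next turn. A minimal sketch, assuming the server returns the `<think>` block inline in the message content; the helper name and the sample output are hypothetical:

```python
def append_assistant_turn(history, raw_output):
    """Return a new history with the reply stored verbatim, <think> blocks included."""
    return history + [{"role": "assistant", "content": raw_output}]

history = [{"role": "user", "content": "What is 17 * 3?"}]

# Hypothetical model output; note that the thinking block is NOT stripped.
raw = "<think>17 * 3 = 51.</think>\nThe answer is 51."
history = append_assistant_turn(history, raw)
history.append({"role": "user", "content": "Now add 9."})
```

If a UI hides the thinking from end users, keep a separate display copy rather than editing the stored history.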
---
## Tool calling
Same format as MiniMax-M2.7. Tool calls use `<minimax:tool_call>` / `</minimax:tool_call>` XML wrappers:
```xml
<minimax:tool_call>
<invoke name="get_weather">
<parameter name="city">San Francisco</parameter>
</invoke>
</minimax:tool_call>
```
Enable with `--enable-auto-tool-choice --tool-call-parser minimax_m2` in vLLM.
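
When consuming raw completions without a server-side parser, the wrapper format can be extracted with a few regular expressions. This is an illustrative sketch, not the parser vLLM uses: `parse_tool_calls` is a hypothetical helper, and it returns every parameter value as a plain string without type coercion.

```python
import re

TOOL_CALL_RE = re.compile(r"<minimax:tool_call>(.*?)</minimax:tool_call>", re.DOTALL)
INVOKE_RE = re.compile(r'<invoke name="([^"]+)">(.*?)</invoke>', re.DOTALL)
PARAM_RE = re.compile(r'<parameter name="([^"]+)">(.*?)</parameter>', re.DOTALL)

def parse_tool_calls(text):
    """Extract tool calls as [{"name": ..., "arguments": {...}}] from model output."""
    calls = []
    for block in TOOL_CALL_RE.findall(text):
        for name, body in INVOKE_RE.findall(block):
            args = {key: value.strip() for key, value in PARAM_RE.findall(body)}
            calls.append({"name": name, "arguments": args})
    return calls

sample = """<minimax:tool_call>
<invoke name="get_weather">
<parameter name="city">San Francisco</parameter>
</invoke>
</minimax:tool_call>"""

calls = parse_tool_calls(sample)
```

Regular expressions are used here because the `minimax:` prefix is not a declared XML namespace, so a strict XML parser would reject the block as written.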
---
## Using with Transformers
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# trust_remote_code=True loads the repo's custom minimax_m2 model code.
model = AutoModelForCausalLM.from_pretrained(
    "Ex0bit/MiniMax-SLURPY",
    trust_remote_code=True,
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(
    "Ex0bit/MiniMax-SLURPY",
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "Write a Python function that reverses a linked list."}]

# return_dict=True yields input_ids plus attention_mask for generate().
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=2048,
        do_sample=True,
        temperature=1.0,
        top_p=0.95,
        top_k=40,
    )

# Decode only the newly generated tokens, skipping the prompt.
prompt_len = inputs["input_ids"].shape[1]
print(tokenizer.decode(output[0, prompt_len:], skip_special_tokens=True))
```
---
## Config notes
- `use_mtp` is set to `False` in config.json (MTP tensors don't exist in the checkpoint)
- `quantization_config` is preserved — native FP8
- Chat template and tokenizer are sourced from M2.7
## Files
- 43 safetensors shards (~5 GB each, 214.3 GB total)
- Native FP8 (`float8_e4m3fn`) with block-wise `[128, 128]` scale factors
- `chat_template.jinja` — M2.7's chat template with tool calling support
- `modeling_minimax_m2.py` / `configuration_minimax_m2.py` — custom model code
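
The block-wise `[128, 128]` scale layout listed above implies one scale per 128 x 128 tile of each FP8 weight. The sketch below shows how such scales could be computed from a higher-precision matrix. It assumes the common convention `scale = amax / 448` (448 being the largest finite `float8_e4m3fn` value) with the scale stored as the dequantization multiplier; the exact convention used in this checkpoint is not stated here, and the function name is hypothetical.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite float8_e4m3fn value
BLOCK = 128

def blockwise_scales(weight: np.ndarray, block: int = BLOCK) -> np.ndarray:
    """Return one float32 dequantization scale per (block x block) tile of a 2-D weight."""
    rows, cols = weight.shape
    n_r, n_c = -(-rows // block), -(-cols // block)  # ceiling division
    scales = np.empty((n_r, n_c), dtype=np.float32)
    for i in range(n_r):
        for j in range(n_c):
            tile = weight[i * block:(i + 1) * block, j * block:(j + 1) * block]
            amax = float(np.abs(tile).max())
            # Guard against all-zero tiles to avoid a zero scale.
            scales[i, j] = max(amax, 1e-12) / FP8_E4M3_MAX
    return scales

# Toy weight (hypothetical shape): 256 x 384 gives a 2 x 3 grid of scales.
w = np.random.default_rng(0).standard_normal((256, 384))
scales = blockwise_scales(w)
```

Dividing each tile by its scale maps the values into the representable FP8 range before casting; multiplying by the same scale recovers an approximation of the original.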
---
## License
Modified MIT — same as MiniMax-M2.5. See [LICENSE](LICENSE) for full text.
The only modification to the standard MIT license: if the Software (or any derivative works) is used for commercial products or services with more than 100 million monthly active users or more than $30M annual recurring revenue, you must prominently display "MiniMax M2" on the user interface.
---
## Citation
```
@misc{minimax-slurpy-2026,
title={MiniMax-SLURPY: Per-tensor empirical SLERP merge of MiniMax-M2.5 and M2.7},
author={Ex0bit},
year={2026},
url={https://huggingface.co/Ex0bit/MiniMax-SLURPY}
}
```
## Acknowledgments
- [MiniMax](https://www.minimaxi.com/) for the M2.5 and M2.7 base models
- Merge infrastructure adapted from the PRISM abliteration pipeline