	---
	license: other
	license_name: modified-mit
	license_link: LICENSE
	base_model:
	- MiniMaxAI/MiniMax-M2.5
	- MiniMaxAI/MiniMax-M2.7
	tags:
	- merge
	- slerp
	- moe
	- fp8
	- minimax
	- minimax_m2
	- code
	- reasoning
	- agents
	model_type: minimax_m2
	pipeline_tag: text-generation
	library_name: transformers
	---

	![image](https://cdn-uploads.huggingface.co/production/uploads/63adf1fa42fd3b8dbaeb0c92/JuwTD-9eczmeBf5P8NLDP.png)

	# MiniMax-SLURPY

	A mathematically unique blend of [MiniMax-M2.5](https://huggingface.co/MiniMaxAI/MiniMax-M2.5) and [MiniMax-M2.7](https://huggingface.co/MiniMaxAI/MiniMax-M2.7) — neither parent, entirely its own model.

	SLURPY inherits M2.5's architect-first coding style and Modified MIT license, and absorbs M2.7's RL-tuned precision on multi-agent collaboration and real-world engineering, all without a single training step. With zero retraining, it beats M2.5 on HumanEval pass@5 (89.6% vs 85.4%).

	None of SLURPY's 48,239 weight tensors is copied from M2.5 or from M2.7: each one is an interpolated blend of the two.

	---

	## What SLURPY inherits

	SLURPY's weights are a forensically-driven interpolation of two complementary parents. The merge schedule is derived from a full-model scan of all 96,103 tensor pairs, with each tensor's interpolation ratio set from the empirically measured delta between the parents on that tensor.

	### From M2.5 — the architect

	M2.5 is the foundation-builder: strong on greenfield engineering, deep reasoning, and research-grade benchmarks.

	| Benchmark | M2.5 Published |
	|---|---|
	| SWE-Bench Verified | 80.2% |
	| BrowseComp (with context mgmt) | 76.3% |
	| Multi-SWE-Bench | 51.3% |
	| AIME 2025 | 86.3 |
	| GPQA Diamond | 85.2 |
	| SciCode | 44.4 |
	| IFBench | 70.0 |
	| HLE (w/o tools) | 19.4 |
	| GDPval-MM (office work) | 59.0% avg win rate |

	### From M2.7 — the operator

	M2.7 is the execution specialist: RL-tuned for multi-step tool use, terminal ops, agentic scaffolding, and production-grade software engineering.

	| Benchmark | M2.7 Published |
	|---|---|
	| SWE-Pro | 56.2% (matches GPT-5.3-Codex) |
	| SWE Multilingual | 76.5% |
	| Multi-SWE-Bench | 52.7% |
	| MLE Bench Lite | 66.6% medal rate (22 ML competitions) |
	| VIBE-Pro | 55.6% (near Opus 4.6) |
	| TerminalBench 2 | 57.0% |
	| NL2Repo | 39.8% |
	| GDPval-AA ELO | 1495 (highest open-weight) |
	| Toolathon | 46.3% accuracy |
	| MM Claw (skill compliance) | 97% across 40+ skills |
	| MM Claw (end-to-end) | 62.7% (near Sonnet 4.6) |

	### SLURPY — best of both

	SLURPY's merge schedule preserves M2.5's deep reasoning character in the early-to-mid layers (where the two models barely differ) while absorbing M2.7's agentic improvements in the late layers (where M2.7's training signal concentrates). The result is a model that carries both parents' strengths without the training cost of either.

	---

	## Merge method

	Per-tensor empirical SLERP — each of the 48,239 mergeable weight tensors gets its own interpolation ratio `t(k)` derived from the measured cosine similarity between M2.5 and M2.7 on that specific tensor:

	```
	delta(k) = 1 - cos(M2.5_k, M2.7_k)
	delta_norm(k) = clip(delta(k) / delta_p99, 0, 1)
	t(k) = 0.50 + 0.35 * delta_norm(k)
	```

	- Tensors that barely changed (cos ~ 1.0): `t ~ 0.50` — neutral midpoint, preserving both parents
	- Tensors that changed the most (layer 61 MoE experts): `t = 0.85` — absorbing M2.7's concentrated training signal
	- FP8 weights: dequantized to BF16 before SLERP, re-quantized with fresh block-wise scales
	- No scale_inv pass-through: forensics confirmed 0% bit-identical scales between parents — all 47,864 FP8 scale tensors are recomputed, not copied
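
The schedule above can be sketched on toy vectors. This is a minimal illustration, not the author's merge code: it uses a generic SLERP with a linear fallback for near-parallel tensors, treats `t = 0` as M2.5 and `t = 1` as M2.7 (as the bullets imply), and skips the FP8 dequantize/re-quantize step entirely. The function names and the `delta_p99` value are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two flattened tensors (plain lists here)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def interpolation_ratio(delta, delta_p99):
    """t(k) = 0.50 + 0.35 * clip(delta / delta_p99, 0, 1)."""
    delta_norm = min(max(delta / delta_p99, 0.0), 1.0)
    return 0.50 + 0.35 * delta_norm

def slerp(a, b, t, eps=1e-6):
    """Generic spherical interpolation from a (t=0) to b (t=1)."""
    cos = max(-1.0, min(1.0, cosine_similarity(a, b)))
    omega = math.acos(cos)
    sin_omega = math.sin(omega)
    if sin_omega < eps:
        # Near-parallel tensors: SLERP is ill-conditioned, fall back to LERP.
        return [(1.0 - t) * x + t * y for x, y in zip(a, b)]
    w_a = math.sin((1.0 - t) * omega) / sin_omega
    w_b = math.sin(t * omega) / sin_omega
    return [w_a * x + w_b * y for x, y in zip(a, b)]

# Toy stand-ins for one tensor from each parent (hypothetical values).
m25_k = [1.0, 0.0, 0.0]
m27_k = [0.9, 0.1, 0.0]

delta_k = 1.0 - cosine_similarity(m25_k, m27_k)
t_k = interpolation_ratio(delta_k, delta_p99=0.02)  # delta_p99 is made up
merged_k = slerp(m25_k, m27_k, t_k)
print(f"delta={delta_k:.5f} t={t_k:.3f} merged={merged_k}")
```

Because `delta_norm` is clipped to `[0, 1]`, every ratio lands in `[0.50, 0.85]`: the merge never leans toward M2.5 past the midpoint, and never takes more than 85% of the way to M2.7.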

	### Forensic highlights

	- 99.18% of tensors sit in a tight cosine cluster around 0.9946 — most weights barely moved between M2.5 and M2.7
	- Layer 61 MoE experts {76, 74, 61, 30, 43, 138, 226, 126, 58, 159} have deltas 2-5x baseline — this is where M2.7's RL training signal concentrates
	- lm_head.weight (cos=0.9905, rel_l2=0.139) carries M2.7's vocabulary-level improvements

	---

	## Architecture

	Identical to MiniMax-M2.5 / M2.7 — weight merge only, no architecture changes:

	- Model type: `minimax_m2` / `MiniMaxM2ForCausalLM`
	- Parameters: 228.7B total, ~10B active (MoE)
	- Layers: 62
	- Hidden size: 3072
	- MoE: 256 experts, top-8, sigmoid routing + learned bias
	- Attention: 48 query / 8 KV heads (GQA 6:1), head_dim=128
	- Quantization: FP8 (`float8_e4m3fn`), block size [128, 128]
	- Vocab: 200,064 tokens
	- Context: up to 196,608 tokens
	- Thinking: Interleaved `<think>...</think>` (always-on)
	- `trust_remote_code=True` required
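
As a rough sizing aid, the attention figures above translate into a KV-cache estimate. The sketch below is back-of-the-envelope arithmetic only. It assumes a standard per-layer KV cache on all 62 layers and BF16 cache entries (2 bytes each); neither assumption is stated in this card, and a serving engine may use a different cache dtype or layout.

```python
# Values copied from the architecture list above.
NUM_LAYERS = 62
NUM_QUERY_HEADS = 48
NUM_KV_HEADS = 8
HEAD_DIM = 128
MAX_CONTEXT = 196_608

# Assumption: KV cache stored in BF16, i.e. 2 bytes per element.
BYTES_PER_ELEMENT = 2

def kv_cache_bytes(num_tokens):
    """Bytes of KV cache for one sequence; the leading 2 covers keys and values."""
    return 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_ELEMENT * num_tokens

gqa_ratio = NUM_QUERY_HEADS // NUM_KV_HEADS
per_token_bytes = kv_cache_bytes(1)
full_context_gib = kv_cache_bytes(MAX_CONTEXT) / 2**30

print(gqa_ratio, per_token_bytes, full_context_gib)
```

Under these assumptions a single sequence costs 253,952 bytes of cache per token, or 46.5 GiB at the full 196,608-token context, on top of the model weights.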

	---

	## Serving with vLLM

	Recommended command (8x H100 80GB):

	```bash
	SAFETENSORS_FAST_GPU=1 vllm serve \
	Ex0bit/MiniMax-SLURPY --trust-remote-code \
	--enable-expert-parallel --tensor-parallel-size 8 \
	--enable-auto-tool-choice --tool-call-parser minimax_m2 \
	--reasoning-parser minimax_m2_append_think \
	--enforce-eager
	```

	For 4x GPU (no expert parallel):

	```bash
	SAFETENSORS_FAST_GPU=1 vllm serve \
	Ex0bit/MiniMax-SLURPY --trust-remote-code \
	--tensor-parallel-size 4 \
	--enable-auto-tool-choice --tool-call-parser minimax_m2 \
	--reasoning-parser minimax_m2_append_think
	```

	If you encounter CUDA memory errors, add:
	```bash
	--compilation-config '{"cudagraph_mode": "PIECEWISE"}'
	```

	### Recommended sampling parameters

	| Parameter | Value |
	|---|---|
	| temperature | 1.0 |
	| top_p | 0.95 |
	| top_k | 40 |
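
A request body carrying these values might look like the sketch below. It only builds and prints the JSON; the helper name is hypothetical, and `top_k` is not part of the standard OpenAI chat schema, so this assumes the server accepts it as an extra sampling field.

```python
import json

def build_chat_request(prompt, model="Ex0bit/MiniMax-SLURPY"):
    """Build a chat-completions payload with the recommended sampling values."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 1.0,
        "top_p": 0.95,
        # Assumption: the serving endpoint accepts top_k as an extra field.
        "top_k": 40,
    }

payload = build_chat_request("Write a Python function that reverses a linked list.")
print(json.dumps(payload, indent=2))
```

The printed JSON can be sent as the body of a POST to the server's `/v1/chat/completions` endpoint.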

	### Important: preserve thinking in conversation history

	MiniMax-M2 uses interleaved thinking. The model outputs `<think>...</think>` blocks during generation. You must pass these back verbatim in conversation history. Removing them degrades performance.
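
One way to follow this rule on the client side is to store the assistant's text exactly as it was generated. The sketch below assumes `raw_output` is the full generated text with its `<think>...</think>` block still in place; the helper name and the sample strings are hypothetical.

```python
def append_assistant_turn(history, raw_output):
    """Return a new history with the assistant reply stored verbatim.

    The reply is not stripped or post-processed, so any <think>...</think>
    block stays in the conversation that is sent back on the next turn.
    """
    return history + [{"role": "assistant", "content": raw_output}]

history = [{"role": "user", "content": "What is 17 * 3?"}]

# Hypothetical model output, including its thinking block.
raw_output = "<think>17 * 3 = 51.</think>\n17 * 3 = 51."

history = append_assistant_turn(history, raw_output)
history.append({"role": "user", "content": "Now add 9."})

for message in history:
    print(message["role"], "->", repr(message["content"]))
```

The thing to avoid is any client-side cleanup, such as a regex that removes `<think>` blocks for display and then reuses the cleaned text as history.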

	---

	## Tool calling

	Same format as MiniMax-M2.7. Tool calls use `<minimax:tool_call>` / `</minimax:tool_call>` XML wrappers:

	```xml
	<minimax:tool_call>
	<invoke name="get_weather">
	<parameter name="city">San Francisco</parameter>
	</invoke>
	</minimax:tool_call>
	```

	Enable with `--enable-auto-tool-choice --tool-call-parser minimax_m2` in vLLM.
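
When working with raw model text rather than the server-side parser, the wrapper format shown above can be pulled apart with a few regular expressions. This is an illustrative sketch based only on the example in this section: the function name is hypothetical, parameter values are returned as plain strings, and nesting or escaping rules of the real format are not handled.

```python
import re

TOOL_CALL_RE = re.compile(r"<minimax:tool_call>(.*?)</minimax:tool_call>", re.DOTALL)
INVOKE_RE = re.compile(r'<invoke name="([^"]+)">(.*?)</invoke>', re.DOTALL)
PARAM_RE = re.compile(r'<parameter name="([^"]+)">(.*?)</parameter>', re.DOTALL)

def parse_tool_calls(text):
    """Extract tool calls as [{'name': ..., 'arguments': {...}}, ...]."""
    calls = []
    for block in TOOL_CALL_RE.findall(text):
        for name, body in INVOKE_RE.findall(block):
            arguments = {key: value.strip() for key, value in PARAM_RE.findall(body)}
            calls.append({"name": name, "arguments": arguments})
    return calls

sample = """<minimax:tool_call>
<invoke name="get_weather">
<parameter name="city">San Francisco</parameter>
</invoke>
</minimax:tool_call>"""

calls = parse_tool_calls(sample)
print(calls)
```

For the sample above, the result is a single call named `get_weather` with `city` set to `San Francisco`.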

	---

	## Using with Transformers

	```python
	from transformers import AutoModelForCausalLM, AutoTokenizer
	import torch

	model = AutoModelForCausalLM.from_pretrained(
	    "Ex0bit/MiniMax-SLURPY",
	    trust_remote_code=True,
	    torch_dtype="auto",
	    device_map="auto",
	)
	tokenizer = AutoTokenizer.from_pretrained(
	    "Ex0bit/MiniMax-SLURPY",
	    trust_remote_code=True,
	)

	messages = [{"role": "user", "content": "Write a Python function that reverses a linked list."}]
	input_ids = tokenizer.apply_chat_template(
	    messages, add_generation_prompt=True, return_tensors="pt"
	).to(model.device)

	with torch.no_grad():
	    output = model.generate(
	        input_ids,
	        max_new_tokens=2048,
	        do_sample=True,
	        temperature=1.0,
	        top_p=0.95,
	        top_k=40,
	    )

	print(tokenizer.decode(output[0, input_ids.shape[1]:], skip_special_tokens=True))
	```

	---

	## Config notes

	- `use_mtp` is set to `False` in config.json (MTP tensors don't exist in the checkpoint)
	- `quantization_config` is preserved — native FP8
	- Chat template and tokenizer are sourced from M2.7

	## Files

	- 43 safetensors shards (~5 GB each, 214.3 GB total)
	- Native FP8 (`float8_e4m3fn`) with block-wise `[128, 128]` scale factors
	- `chat_template.jinja` — M2.7's chat template with tool calling support
	- `modeling_minimax_m2.py` / `configuration_minimax_m2.py` — custom model code

	---

	## License

	Modified MIT — same as MiniMax-M2.5. See [LICENSE](LICENSE) for full text.

	The only modification to the standard MIT license: if the Software (or any derivative works) is used for commercial products or services with more than 100 million monthly active users or more than $30M annual recurring revenue, you must prominently display "MiniMax M2" on the user interface.

	---

	## Citation

	```
	@misc{minimax-slurpy-2026,
	title={MiniMax-SLURPY: Per-tensor empirical SLERP merge of MiniMax-M2.5 and M2.7},
	author={Ex0bit},
	year={2026},
	url={https://huggingface.co/Ex0bit/MiniMax-SLURPY}
	}
	```

	## Acknowledgments

	- [MiniMax](https://www.minimaxi.com/) for the M2.5 and M2.7 base models
	- Merge infrastructure adapted from the PRISM abliteration pipeline