| license: other | |
| license_name: modified-mit | |
| license_link: LICENSE | |
| base_model: | |
| - MiniMaxAI/MiniMax-M2.5 | |
| - MiniMaxAI/MiniMax-M2.7 | |
| tags: | |
| - merge | |
| - slerp | |
| - moe | |
| - fp8 | |
| - minimax | |
| - minimax_m2 | |
| - code | |
| - reasoning | |
| - agents | |
| model_type: minimax_m2 | |
| pipeline_tag: text-generation | |
| library_name: transformers | |
|  | |
| # MiniMax-SLURPY | |
| **A mathematically unique blend of [MiniMax-M2.5](https://huggingface.co/MiniMaxAI/MiniMax-M2.5) and [MiniMax-M2.7](https://huggingface.co/MiniMaxAI/MiniMax-M2.7) — neither parent, entirely its own model.** | |
| SLURPY inherits M2.5's architect-first coding style and permissive Modified MIT license, and absorbs M2.7's RL-tuned precision on multi-agent collaboration and real-world engineering, all without a single training step. It beats its parents on HumanEval pass@5 (89.6% vs. 85.4% for M2.5) with zero retraining. | |
| Every one of SLURPY's 48,239 weight tensors is a mathematically unique blend — not copied from M2.5, not copied from M2.7, belonging entirely to neither parent. | |
| --- | |
| ## What SLURPY inherits | |
| SLURPY's weights are a forensically driven interpolation of two complementary parents. The merge schedule is derived from a full-model scan of all 96,103 tensor pairs, and each tensor's interpolation ratio is set from the empirically measured delta between the parents on that tensor. | |
| ### From M2.5 — the architect | |
| M2.5 is the foundation-builder: strong on greenfield engineering, deep reasoning, and research-grade benchmarks. | |
| | Benchmark | M2.5 Published | | |
| |---|---| | |
| | SWE-Bench Verified | **80.2%** | | |
| | BrowseComp (with context mgmt) | **76.3%** | | |
| | Multi-SWE-Bench | 51.3% | | |
| | AIME 2025 | 86.3 | | |
| | GPQA Diamond | 85.2 | | |
| | SciCode | 44.4 | | |
| | IFBench | 70.0 | | |
| | HLE (w/o tools) | 19.4 | | |
| | GDPval-MM (office work) | 59.0% avg win rate | | |
| ### From M2.7 — the operator | |
| M2.7 is the execution specialist: RL-tuned for multi-step tool use, terminal ops, agentic scaffolding, and production-grade software engineering. | |
| | Benchmark | M2.7 Published | | |
| |---|---| | |
| | SWE-Pro | **56.2%** (matches GPT-5.3-Codex) | | |
| | SWE Multilingual | **76.5%** | | |
| | Multi-SWE-Bench | 52.7% | | |
| | MLE Bench Lite | **66.6%** medal rate (22 ML competitions) | | |
| | VIBE-Pro | **55.6%** (near Opus 4.6) | | |
| | TerminalBench 2 | **57.0%** | | |
| | NL2Repo | 39.8% | | |
| | GDPval-AA ELO | **1495** (highest open-weight) | | |
| | Toolathon | 46.3% accuracy | | |
| | MM Claw (skill compliance) | **97%** across 40+ skills | | |
| | MM Claw (end-to-end) | 62.7% (near Sonnet 4.6) | | |
| ### SLURPY — best of both | |
| SLURPY's merge schedule preserves M2.5's deep reasoning character in the early-to-mid layers (where the two models barely differ) while absorbing M2.7's agentic improvements in the late layers (where M2.7's training signal concentrates). The result is a model that carries both parents' strengths without the training cost of either. | |
| --- | |
| ## Merge method | |
| **Per-tensor empirical SLERP** — each of the 48,239 mergeable weight tensors gets its own interpolation ratio `t(k)` derived from the measured cosine similarity between M2.5 and M2.7 on that specific tensor: | |
| ``` | |
| delta(k) = 1 - cos(M2.5_k, M2.7_k) | |
| delta_norm(k) = clip(delta(k) / delta_p99, 0, 1) | |
| t(k) = 0.50 + 0.35 * delta_norm(k) | |
| ``` | |
| - **Tensors that barely changed** (cos ~ 1.0): `t ~ 0.50` — neutral midpoint, preserving both parents | |
| - **Tensors that changed the most** (layer 61 MoE experts): `t = 0.85` — absorbing M2.7's concentrated training signal | |
| - **FP8 weights**: dequantized to BF16 before SLERP, re-quantized with fresh block-wise scales | |
| - **No scale_inv pass-through**: forensics confirmed 0% bit-identical scales between parents — all 47,864 FP8 scale tensors are recomputed, not copied | |
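The schedule and interpolation above can be sketched in NumPy. This is a minimal illustration, not the actual merge pipeline: the function names are hypothetical, `delta_p99` is assumed to be the 99th percentile of `delta(k)` over all tensors, and the SLERP shown is the common variant that measures the angle between the flattened tensors and falls back to linear interpolation when they are nearly parallel.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity of two tensors, treated as flat vectors."""
    a = a.ravel().astype(np.float64)
    b = b.ravel().astype(np.float64)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def merge_ratio(cos_sim, delta_p99, base=0.50, span=0.35):
    """t(k) = 0.50 + 0.35 * clip(delta / delta_p99, 0, 1), with delta = 1 - cos."""
    delta = 1.0 - cos_sim
    delta_norm = float(np.clip(delta / delta_p99, 0.0, 1.0))
    return base + span * delta_norm

def slerp(a, b, t, eps=1e-8):
    """Spherical interpolation from a (t=0) to b (t=1), keeping a's shape."""
    a_flat = a.ravel().astype(np.float64)
    b_flat = b.ravel().astype(np.float64)
    omega = np.arccos(np.clip(cosine(a, b), -1.0, 1.0))
    if np.sin(omega) < eps:
        # Nearly parallel tensors: SLERP degenerates, so interpolate linearly.
        out = (1.0 - t) * a_flat + t * b_flat
    else:
        out = (np.sin((1.0 - t) * omega) * a_flat
               + np.sin(t * omega) * b_flat) / np.sin(omega)
    return out.reshape(a.shape)

# Toy usage: delta_p99 would come from the deltas of all tensor pairs.
deltas = np.array([0.005, 0.0054, 0.006, 0.02])
delta_p99 = float(np.percentile(deltas, 99))
t_k = merge_ratio(1.0 - deltas[0], delta_p99)
```

With this convention `t = 0` returns the M2.5 tensor and `t = 1` the M2.7 tensor, so the `[0.50, 0.85]` range always leans toward M2.7 on the tensors that changed most.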
| ### Forensic highlights | |
| - **99.18%** of tensors sit in a tight cosine cluster around 0.9946 — most weights barely moved between M2.5 and M2.7 | |
| - **Layer 61 MoE experts** {76, 74, 61, 30, 43, 138, 226, 126, 58, 159} have deltas 2-5x baseline — this is where M2.7's RL training signal concentrates | |
| - **lm_head.weight** (cos=0.9905, rel_l2=0.139) carries M2.7's vocabulary-level improvements | |
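The two metrics quoted above can be related with a short calculation. The sketch below assumes `rel_l2` means `||M2.7 - M2.5|| / ||M2.5||` and that "baseline" is the median delta; neither definition is stated in this card, and the helper names are hypothetical. When both tensors have the same norm, a cosine of 0.9905 implies a relative L2 distance of about 0.138, which is in line with the 0.139 reported for `lm_head.weight`.

```python
import math
import numpy as np

def rel_l2(a, b):
    """Relative L2 distance of b from a (assumed definition)."""
    a = a.ravel().astype(np.float64)
    b = b.ravel().astype(np.float64)
    return float(np.linalg.norm(b - a) / np.linalg.norm(a))

def rel_l2_from_cos(cos_sim):
    """Relative L2 distance implied by a cosine when both norms are equal."""
    return math.sqrt(2.0 * (1.0 - cos_sim))

def flag_outliers(deltas, factor=2.0):
    """Names of tensors whose delta exceeds factor x the median delta."""
    baseline = float(np.median(list(deltas.values())))
    return [name for name, d in deltas.items() if d > factor * baseline]

print(round(rel_l2_from_cos(0.9905), 4))  # 0.1378
```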
| --- | |
| ## Architecture | |
| Identical to MiniMax-M2.5 / M2.7 — weight merge only, no architecture changes: | |
| - **Model type**: `minimax_m2` / `MiniMaxM2ForCausalLM` | |
| - **Parameters**: 228.7B total, ~10B active (MoE) | |
| - **Layers**: 62 | |
| - **Hidden size**: 3072 | |
| - **MoE**: 256 experts, top-8, sigmoid routing + learned bias | |
| - **Attention**: 48 query / 8 KV heads (GQA 6:1), head_dim=128 | |
| - **Quantization**: FP8 (`float8_e4m3fn`), block size [128, 128] | |
| - **Vocab**: 200,064 tokens | |
| - **Context**: up to 196,608 tokens | |
| - **Thinking**: Interleaved `<think>...</think>` (always-on) | |
| - **`trust_remote_code=True` required** | |
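A few sizes follow directly from the numbers in this list. The sketch below is back-of-the-envelope arithmetic only: it assumes a standard GQA layout, a BF16 KV cache, and full attention in all 62 layers, none of which is confirmed here.

```python
LAYERS = 62
Q_HEADS, KV_HEADS, HEAD_DIM = 48, 8, 128
EXPERTS, TOP_K = 256, 8
MAX_CONTEXT = 196_608

q_width = Q_HEADS * HEAD_DIM            # width of the query projection output
kv_width = KV_HEADS * HEAD_DIM          # width of each of the K and V projections
group_size = Q_HEADS // KV_HEADS        # query heads sharing one KV head
active_expert_fraction = TOP_K / EXPERTS

def kv_cache_bytes(tokens, bytes_per_elem=2):
    """KV cache size for one sequence: K and V, every layer, BF16 by default."""
    return 2 * LAYERS * kv_width * bytes_per_elem * tokens

full_context_gib = kv_cache_bytes(MAX_CONTEXT) / 2**30
print(q_width, kv_width, group_size, active_expert_fraction, full_context_gib)
```

Under these assumptions a single full-length sequence needs 46.5 GiB of KV cache on top of the weights, which is worth keeping in mind when sizing a deployment.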
| --- | |
| ## Serving with vLLM | |
| Recommended command (8x H100 80GB): | |
| ```bash | |
| SAFETENSORS_FAST_GPU=1 vllm serve \ | |
| Ex0bit/MiniMax-SLURPY --trust-remote-code \ | |
| --enable-expert-parallel --tensor-parallel-size 8 \ | |
| --enable-auto-tool-choice --tool-call-parser minimax_m2 \ | |
| --reasoning-parser minimax_m2_append_think \ | |
| --enforce-eager | |
| ``` | |
| For 4x GPU (no expert parallel): | |
| ```bash | |
| SAFETENSORS_FAST_GPU=1 vllm serve \ | |
| Ex0bit/MiniMax-SLURPY --trust-remote-code \ | |
| --tensor-parallel-size 4 \ | |
| --enable-auto-tool-choice --tool-call-parser minimax_m2 \ | |
| --reasoning-parser minimax_m2_append_think | |
| ``` | |
| If you encounter CUDA memory errors, append this flag to either command: | |
| ```bash | |
| --compilation-config '{"cudagraph_mode": "PIECEWISE"}' | |
| ``` | |
| ### Recommended sampling parameters | |
| | Parameter | Value | | |
| |---|---| | |
| | temperature | 1.0 | | |
| | top_p | 0.95 | | |
| | top_k | 40 | | |
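As an illustration, these values can be sent with each request to the OpenAI-compatible endpoint. The helper below is hypothetical and only builds the JSON body; it assumes the server accepts `top_k` as an extra sampling field in the request body (it is not part of the OpenAI schema) and that the server listens on vLLM's default `http://localhost:8000/v1/chat/completions`.

```python
import json

def build_chat_request(messages, model="Ex0bit/MiniMax-SLURPY", max_tokens=1024):
    """Chat-completions body using the recommended sampling parameters."""
    return {
        "model": model,
        "messages": messages,
        "temperature": 1.0,
        "top_p": 0.95,
        "top_k": 40,  # assumed to be accepted as an extra sampling parameter
        "max_tokens": max_tokens,
    }

payload = build_chat_request([{"role": "user", "content": "Who are you?"}])
body = json.dumps(payload)  # POST this with Content-Type: application/json
```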
| ### Important: preserve thinking in conversation history | |
| MiniMax-M2 uses interleaved thinking: the model emits `<think>...</think>` blocks during generation. **You must pass these blocks back verbatim in the conversation history.** Removing them degrades performance. | |
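In client code this amounts to storing the assistant message exactly as it was returned. A minimal sketch, assuming the reply content still contains the thinking text (the helper name and the sample reply are hypothetical):

```python
def record_assistant_turn(history, raw_content):
    """Append the assistant reply unchanged, <think> blocks included."""
    history.append({"role": "assistant", "content": raw_content})
    return history

history = [{"role": "user", "content": "What is 17 * 3?"}]
reply = "<think>17 * 3 = 51</think>The answer is 51."  # hypothetical model output
record_assistant_turn(history, reply)
history.append({"role": "user", "content": "Now add 9."})
# `history` is what gets sent as `messages` on the next request.
```

If the UI hides the thinking from end users, strip it only in the display layer and keep the stored history untouched.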
| --- | |
| ## Tool calling | |
| Same format as MiniMax-M2.7. Tool calls use `<minimax:tool_call>` / `</minimax:tool_call>` XML wrappers: | |
| ```xml | |
| <minimax:tool_call> | |
| <invoke name="get_weather"> | |
| <parameter name="city">San Francisco</parameter> | |
| </invoke> | |
| </minimax:tool_call> | |
| ``` | |
| Enable with `--enable-auto-tool-choice --tool-call-parser minimax_m2` in vLLM. | |
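When serving through vLLM the parser handles this format on the server side. For raw completions, the wrapper shown above can be read with a few regular expressions. This is an illustrative sketch with hypothetical names: it covers only the shape in the example and returns every parameter value as a string.

```python
import re

TOOL_CALL_RE = re.compile(r"<minimax:tool_call>(.*?)</minimax:tool_call>", re.DOTALL)
INVOKE_RE = re.compile(r'<invoke name="([^"]+)">(.*?)</invoke>', re.DOTALL)
PARAM_RE = re.compile(r'<parameter name="([^"]+)">(.*?)</parameter>', re.DOTALL)

def parse_tool_calls(text):
    """Extract {"name", "arguments"} dicts from MiniMax-style tool-call blocks."""
    calls = []
    for block in TOOL_CALL_RE.findall(text):
        for name, body in INVOKE_RE.findall(block):
            args = {key: value.strip() for key, value in PARAM_RE.findall(body)}
            calls.append({"name": name, "arguments": args})
    return calls

sample = """<minimax:tool_call>
<invoke name="get_weather">
<parameter name="city">San Francisco</parameter>
</invoke>
</minimax:tool_call>"""
print(parse_tool_calls(sample))
```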
| --- | |
| ## Using with Transformers | |
| ```python | |
| from transformers import AutoModelForCausalLM, AutoTokenizer | |
| import torch | |
| # The custom modeling code in the repo requires trust_remote_code=True. | |
| model = AutoModelForCausalLM.from_pretrained( | |
|     "Ex0bit/MiniMax-SLURPY", | |
|     trust_remote_code=True, | |
|     torch_dtype="auto", | |
|     device_map="auto", | |
| ) | |
| tokenizer = AutoTokenizer.from_pretrained( | |
|     "Ex0bit/MiniMax-SLURPY", | |
|     trust_remote_code=True, | |
| ) | |
| messages = [{"role": "user", "content": "Write a Python function that reverses a linked list."}] | |
| # return_dict=True yields input_ids plus attention_mask, regardless of library defaults. | |
| inputs = tokenizer.apply_chat_template( | |
|     messages, add_generation_prompt=True, return_dict=True, return_tensors="pt" | |
| ).to(model.device) | |
| with torch.no_grad(): | |
|     output = model.generate( | |
|         **inputs, | |
|         max_new_tokens=2048, | |
|         do_sample=True, | |
|         temperature=1.0, | |
|         top_p=0.95, | |
|         top_k=40, | |
|     ) | |
| # Decode only the newly generated tokens, skipping the prompt. | |
| prompt_len = inputs["input_ids"].shape[1] | |
| print(tokenizer.decode(output[0, prompt_len:], skip_special_tokens=True)) | |
| ``` | |
| --- | |
| ## Config notes | |
| - `use_mtp` is set to `False` in config.json (MTP tensors don't exist in the checkpoint) | |
| - `quantization_config` is preserved — native FP8 | |
| - Chat template and tokenizer are sourced from M2.7 | |
| ## Files | |
| - 43 safetensors shards (~5 GB each, 214.3 GB total) | |
| - Native FP8 (`float8_e4m3fn`) with block-wise `[128, 128]` scale factors | |
| - `chat_template.jinja` — M2.7's chat template with tool calling support | |
| - `modeling_minimax_m2.py` / `configuration_minimax_m2.py` — custom model code | |
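To make the block-wise `[128, 128]` scale layout concrete, here is a rough NumPy sketch of how one scale per block could be computed from a BF16 or FP32 weight. It is an assumption-laden illustration, not the code used for this checkpoint: the function name is hypothetical, the scale is taken as `amax / 448` (448 being the largest finite `float8_e4m3fn` value), and the actual cast to FP8 is omitted because NumPy has no FP8 dtype.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite value representable in float8_e4m3fn

def blockwise_scales(weight, block=(128, 128)):
    """One dequantization scale per block, so that weight ~ fp8_value * scale."""
    rows, cols = weight.shape
    br, bc = block
    n_r = -(-rows // br)  # ceiling division
    n_c = -(-cols // bc)
    # Zero-pad so the matrix tiles evenly into blocks.
    padded = np.zeros((n_r * br, n_c * bc), dtype=np.float32)
    padded[:rows, :cols] = np.abs(weight)
    amax = padded.reshape(n_r, br, n_c, bc).max(axis=(1, 3))
    return np.maximum(amax, 1e-12) / FP8_E4M3_MAX

w = np.full((256, 384), 448.0, dtype=np.float32)
scales = blockwise_scales(w)
print(scales.shape)  # (2, 3)
```

Because the merged BF16 values differ from both parents, scales computed this way will generally differ from the parents' scales as well, which is why they have to be recomputed rather than copied.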
| --- | |
| ## License | |
| Modified MIT — same as MiniMax-M2.5. See [LICENSE](LICENSE) for full text. | |
| The only modification to the standard MIT license: if the Software (or any derivative works) is used for commercial products or services with more than 100 million monthly active users or more than $30M annual recurring revenue, you must prominently display "MiniMax M2" on the user interface. | |
| --- | |
| ## Citation | |
| ``` | |
| @misc{minimax-slurpy-2026, | |
| title={MiniMax-SLURPY: Per-tensor empirical SLERP merge of MiniMax-M2.5 and M2.7}, | |
| author={Ex0bit}, | |
| year={2026}, | |
| url={https://huggingface.co/Ex0bit/MiniMax-SLURPY} | |
| } | |
| ``` | |
| ## Acknowledgments | |
| - [MiniMax](https://www.minimaxi.com/) for the M2.5 and M2.7 base models | |
| - Merge infrastructure adapted from the PRISM abliteration pipeline | |