---
license: other
license_name: modified-mit
license_link: LICENSE
base_model:
- Ex0bit/MiniMax-SLURPY
base_model_relation: quantized
tags:
- mlx
- apple-silicon
- moe
- prism-dq
- dynamic-quantization
- minimax
- minimax_m2
- code
- reasoning
- agents
- quantized
model_type: minimax_m2
pipeline_tag: text-generation
library_name: mlx
quantized_by: Ex0bit
---

# MiniMax-SLURPY-DQ-MLX

**Per-tensor mixed-precision quantization of [MiniMax-SLURPY](https://huggingface.co/Ex0bit/MiniMax-SLURPY) for Apple Silicon: 2.54 BPW with 498 per-tensor overrides, collapsed from 16,122 PRISM decisions to fit MLX's SwitchGLU format.**

The full SLURPY model (228.7B params) is compressed from 215 GB to 68 GB (a 68% reduction) using **PRISM Dynamic Quantization**, a per-tensor-class mixed-precision allocation derived entirely from weight-structure sensitivity analysis. Zero calibration data, zero training, zero datasets.

Created by [Ex0bit](https://hf.co/Ex0bit)

---

<div align="center">

### 💡 Support our research and development efforts. PRISM members receive access to the latest PRISM-PRO model drops on Day 0.

[Ko-fi](https://ko-fi.com/Ex0bit)

</div>

---
## Model Details

| Property | Value |
|----------|-------|
| Base Model | [Ex0bit/MiniMax-SLURPY](https://huggingface.co/Ex0bit/MiniMax-SLURPY) |
| Architecture | MiniMax M2 MoE (256 experts, top-8) |
| Parameters | 228.7B total / ~10B active |
| Quantization | PRISM-DYNAMIC-QUANT (MLX native) |
| Achieved BPW | 2.54 |
| File Size | 68 GB (vs 215 GB source = 68% reduction) |
| Per-tensor overrides | 498 (MoE: per-layer-projection modal of 16,122 per-expert decisions) |
| Default precision | 2-bit |
| Group size | 64 |
| Context Length | 196,608 tokens |
| Runtime | mlx-lm (Apple Silicon Metal) |
| Creator | [Ex0bit](https://hf.co/Ex0bit) |

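
As a rough sanity check on the figures in the table, the sketch below computes the effective bits per weight of group-wise affine quantization. It assumes each group of 64 weights stores one 16-bit scale and one 16-bit bias; that storage layout is an assumption, not something this card states, and the function names are illustrative. Under that assumption a nominal 2-bit tensor costs 2.5 bits per weight, which is consistent with a 2.54 BPW average when a minority of tensors are kept at higher precision.

```python
def effective_bpw(bits: int, group_size: int = 64,
                  scale_bits: int = 16, bias_bits: int = 16) -> float:
    """Bits per weight including per-group scale and bias overhead (assumed 16-bit each)."""
    return bits + (scale_bits + bias_bits) / group_size


def size_reduction(source_gb: float, quantized_gb: float) -> float:
    """Fractional size reduction relative to the source checkpoint."""
    return 1.0 - quantized_gb / source_gb


for bits in (2, 3, 4, 6):
    print(f"{bits}-bit, group 64 -> {effective_bpw(bits):.2f} BPW")

print(f"215 GB -> 68 GB: {size_reduction(215, 68):.1%} smaller")
```

The per-group overhead is why the achieved BPW sits above the nominal default precision even before any higher-precision overrides are counted.
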
## What SLURPY inherits

A mathematically unique "designer baby" of [MiniMax-M2.5](https://huggingface.co/MiniMaxAI/MiniMax-M2.5) and [MiniMax-M2.7](https://huggingface.co/MiniMaxAI/MiniMax-M2.7): neither parent, entirely its own model.

SLURPY inherits M2.5's architect-first coding style and MIT freedom, and absorbs M2.7's RL-tuned precision on multi-agent collaboration and real-world engineering, all without a single training step.

| Benchmark | M2.5 | M2.7 | SLURPY |
|---|---|---|---|
| HumanEval pass@5 | 85.4% | n/a | **89.6%** |
| SWE-Bench Verified | 80.2% | n/a | inherited |
| SWE-Pro | n/a | 56.2% | inherited |
| MLE Bench Lite | n/a | 66.6% | inherited |
| GDPval-AA ELO | n/a | 1495 | inherited |

See [Ex0bit/MiniMax-SLURPY](https://huggingface.co/Ex0bit/MiniMax-SLURPY) for full benchmark details.

---
## PRISM Dynamic Quantization

This model uses **PRISM Dynamic Quantization**, a per-tensor mixed-precision allocation that assigns different quantization precisions to different tensor classes based on weight-structure sensitivity analysis.

Unlike uniform quantization (Q3, Q4, Q5), PRISM-DQ analyzes each tensor's sensitivity to quantization error and allocates precision where it matters most. Critical tensors (attention projections, key MoE experts, lm_head) receive higher precision, while less impactful tensors are compressed more aggressively.

PRISM produced 16,122 decisions, spanning the 256 experts across 62 layers and 3 projections, plus attention and embeddings. MLX's `SwitchGLU` packs all 256 experts of each layer-projection into a single 3D tensor that shares one bit width, so the per-expert decisions collapse to the modal bit width for each of the 186 MoE projections (62 layers × 3 projections). The remaining 312 per-tensor decisions (attention, embeddings, lm_head, routers) retain full PRISM granularity, giving 186 + 312 = 498 effective overrides.

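
The collapse described above can be sketched as follows. This is a minimal illustration, not the PRISM implementation: the per-expert bit widths are made up, and the tie-breaking rule (prefer the higher precision) is an assumption. Only the layer, projection, and override counts come from this card.

```python
from collections import Counter

NUM_LAYERS = 62        # from this card
NUM_PROJECTIONS = 3    # expert projections per layer, from this card
NUM_EXPERTS = 256      # from this card


def collapse_to_modal(bits_per_expert: list[int]) -> int:
    """Return the most common bit width; ties go to the higher precision (assumed rule)."""
    counts = Counter(bits_per_expert)
    top = max(counts.values())
    return max(bits for bits, count in counts.items() if count == top)


# Hypothetical per-expert decisions for one layer-projection (not real PRISM output).
example = [2] * 200 + [3] * 40 + [4] * 16
assert len(example) == NUM_EXPERTS

moe_overrides = NUM_LAYERS * NUM_PROJECTIONS
total_overrides = moe_overrides + 312  # 312 non-expert decisions, from this card

print(collapse_to_modal(example))       # bit width shared by the packed expert tensor
print(moe_overrides, total_overrides)   # MoE overrides and total overrides
```

Taking the mode keeps the packed tensor at the precision most experts were assigned, at the cost of losing the per-expert outliers in either direction.
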

The model's `config.json` contains the per-tensor quantization overrides, which mlx-lm loads natively; no custom runtime is required. MLX's compiled Metal kernels handle the mixed-precision tensors in a single forward pass on the GPU.

**No calibration data, no importance matrices, and no training data are required.**

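
To show how such overrides might be inspected, the sketch below tallies bit widths from a `quantization` section. The layout (a global `group_size`/`bits` default plus per-module entries keyed by module path) and the module paths themselves are assumptions for illustration; check the actual `config.json` in this repository for the real keys and values.

```python
import json
from collections import Counter

# Hypothetical excerpt of config.json; key names and values are illustrative only.
SAMPLE_CONFIG = json.loads("""
{
  "quantization": {
    "group_size": 64,
    "bits": 2,
    "model.embed_tokens": {"group_size": 64, "bits": 6},
    "model.layers.0.self_attn.q_proj": {"group_size": 64, "bits": 4},
    "model.layers.0.self_attn.k_proj": {"group_size": 64, "bits": 4},
    "lm_head": {"group_size": 64, "bits": 6}
  }
}
""")


def summarize_overrides(config: dict) -> tuple[int, Counter]:
    """Return the default bit width and a count of per-module overrides by bit width."""
    quant = config["quantization"]
    default_bits = quant["bits"]
    overrides = Counter(
        entry["bits"] for entry in quant.values() if isinstance(entry, dict)
    )
    return default_bits, overrides


default_bits, overrides = summarize_overrides(SAMPLE_CONFIG)
print(f"default: {default_bits}-bit, overrides: {sum(overrides.values())}")
for bits, count in sorted(overrides.items()):
    print(f"  {bits}-bit: {count} tensors")
```

To run this against the downloaded model, replace `SAMPLE_CONFIG` with the parsed contents of the local `config.json`.
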

---

## Architecture

Identical to MiniMax-M2.5 / M2.7; only the weights are quantized:

- **Model type**: `minimax_m2` / `MiniMaxM2ForCausalLM`
- **Parameters**: 228.7B total, ~10B active (MoE)
- **Layers**: 62
- **Hidden size**: 3072
- **MoE**: 256 experts, top-8, sigmoid routing + learned bias
- **Attention**: 48 query / 8 KV heads (GQA 6:1), head_dim=128
- **Quantization**: MLX affine, mixed 2-6 bit
- **Vocab**: 200,064 tokens
- **Context**: up to 196,608 tokens
- **Thinking**: Interleaved `<think>...</think>` (always-on)
- **`trust_remote_code=True` required**

---
## Usage on Apple Silicon

### mlx-lm (CLI)

```bash
pip install mlx-lm

# Interactive chat
mlx_lm.chat --model Ex0bit/MiniMax-SLURPY-DQ-MLX \
  --temp 1.0 --top-p 0.95 --max-tokens 4096

# Single prompt
mlx_lm.generate \
  --model Ex0bit/MiniMax-SLURPY-DQ-MLX \
  --prompt "Write a Python function that reverses a linked list." \
  --max-tokens 2048 \
  --temp 1.0 --top-p 0.95
```
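### Python API

```python
from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler

model, tokenizer = load("Ex0bit/MiniMax-SLURPY-DQ-MLX")

# Wrap the request in the model's chat template before generating.
messages = [{"role": "user", "content": "Write a Python function that reverses a linked list."}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

# Recent mlx-lm releases take sampling settings through a sampler object
# rather than temp/top_p keyword arguments to generate().
sampler = make_sampler(temp=1.0, top_p=0.95, top_k=40)

response = generate(
    model, tokenizer,
    prompt=prompt,
    max_tokens=2048,
    sampler=sampler,
)
print(response)
```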
### Recommended sampling parameters

| Parameter | Value |
|---|---|
| temperature | 1.0 |
| top_p | 0.95 |
| top_k | 40 |

### Important: preserve thinking in conversation history

MiniMax-M2 uses interleaved thinking: the model emits `<think>...</think>` blocks during generation. **You must pass these blocks back verbatim in the conversation history.** Removing them degrades performance.

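
A minimal sketch of what "verbatim" means in practice: the assistant turn is stored exactly as generated, with no stripping of the `<think>` block. The helper name and the sample model output are made up for illustration.

```python
def append_assistant_turn(history: list[dict], raw_response: str) -> list[dict]:
    """Append the model's raw output, including any <think> blocks, to the history."""
    # Deliberately no stripping or regex cleanup of <think>...</think> here.
    return history + [{"role": "assistant", "content": raw_response}]


history = [{"role": "user", "content": "What is 2 + 2?"}]
raw = "<think>Simple arithmetic.</think>2 + 2 = 4."  # made-up model output
history = append_assistant_turn(history, raw)
history.append({"role": "user", "content": "Now multiply that by 3."})

# With mlx-lm, the next prompt would then be built from the full history, e.g.:
# prompt = tokenizer.apply_chat_template(history, add_generation_prompt=True)
```

If the thinking needs to be hidden from end users, it is safer to strip it only in the display layer and keep the stored history untouched.
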

---

## Tool calling

Same format as base SLURPY. Tool calls are wrapped in `<minimax:tool_call>` / `</minimax:tool_call>` XML tags:

```xml
<minimax:tool_call>
<invoke name="get_weather">
<parameter name="city">San Francisco</parameter>
</invoke>
</minimax:tool_call>
```
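
For client code that needs to act on such output, the following is a minimal regex-based sketch of a parser for the format shown above. It is not the official parser: it assumes well-formed, non-nested tags, returns every parameter value as a string, and the function name is hypothetical.

```python
import re

TOOL_CALL_RE = re.compile(r"<minimax:tool_call>(.*?)</minimax:tool_call>", re.DOTALL)
INVOKE_RE = re.compile(r'<invoke name="([^"]+)">(.*?)</invoke>', re.DOTALL)
PARAM_RE = re.compile(r'<parameter name="([^"]+)">(.*?)</parameter>', re.DOTALL)


def parse_tool_calls(text: str) -> list[dict]:
    """Extract tool calls as {"name": ..., "arguments": {...}} with string values."""
    calls = []
    for block in TOOL_CALL_RE.findall(text):
        for name, body in INVOKE_RE.findall(block):
            args = {key: value.strip() for key, value in PARAM_RE.findall(body)}
            calls.append({"name": name, "arguments": args})
    return calls


sample = """<minimax:tool_call>
<invoke name="get_weather">
<parameter name="city">San Francisco</parameter>
</invoke>
</minimax:tool_call>"""

print(parse_tool_calls(sample))
```

Converting argument strings to numbers, booleans, or JSON objects would need the tool's schema and is left out of this sketch.
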

---

## Hardware requirements

- **Apple Silicon Mac** with unified memory
- **80 GB RAM minimum** (the model is 68 GB and needs headroom for the KV cache)
- **128 GB RAM recommended** for full context length
- **M2 Ultra / M3 Max / M4 Max** for best throughput

For non-Apple platforms, use the FP8 [Ex0bit/MiniMax-SLURPY](https://huggingface.co/Ex0bit/MiniMax-SLURPY) variant with vLLM.

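
The memory guidance in the list above can be sanity-checked with a back-of-the-envelope KV cache estimate built from the architecture figures in this card (62 layers, 8 KV heads, head_dim 128). The sketch assumes an unquantized 16-bit cache in which every layer stores full-length keys and values, and it ignores activations and OS overhead; those are assumptions, and the function name is illustrative.

```python
def kv_cache_bytes(tokens: int, layers: int = 62, kv_heads: int = 8,
                   head_dim: int = 128, bytes_per_value: int = 2) -> int:
    """Keys plus values for every layer, KV head, and cached token."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


GIB = 1024 ** 3
per_token = kv_cache_bytes(1)
full_context = kv_cache_bytes(196_608)

print(f"{per_token / 1024:.0f} KiB per token")
print(f"{full_context / GIB:.1f} GiB at 196,608 tokens")
```

Under these assumptions the cache costs 248 KiB per token and about 46.5 GiB at the full 196,608-token context. Added to the 68 GB of weights, that is in line with the 128 GB recommendation for full-context use.
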

---

## Files

- 14 MLX safetensors shards (68 GB total)
- `config.json` with 498 per-tensor quantization overrides (collapsed from 16,122 PRISM decisions via SwitchGLU packing)
- `chat_template.jinja`: M2.7's chat template with tool calling support
- `modeling_minimax_m2.py` / `configuration_minimax_m2.py`: custom model code (inherited from base)

---

## License

Modified MIT, same as MiniMax-M2.5. See [LICENSE](LICENSE) for the full text.

The only modification to the standard MIT license: if the Software (or any derivative works) is used for commercial products or services with more than 100 million monthly active users or more than $30M annual recurring revenue, you must prominently display "MiniMax M2" on the user interface.

---
## Credits

- Creator: [Ex0bit](https://hf.co/Ex0bit)
- Base model: [Ex0bit/MiniMax-SLURPY](https://huggingface.co/Ex0bit/MiniMax-SLURPY)
- Parents: [MiniMaxAI/MiniMax-M2.5](https://huggingface.co/MiniMaxAI/MiniMax-M2.5), [MiniMaxAI/MiniMax-M2.7](https://huggingface.co/MiniMaxAI/MiniMax-M2.7)
- Quantization engine: PRISM-DQ by [Ex0bit](https://hf.co/Ex0bit)

---
## Citation

```
@misc{minimax-slurpy-prism-mlx-2026,
  title={MiniMax-SLURPY-DQ-MLX: Per-tensor mixed-precision quantization of MiniMax-SLURPY for Apple Silicon},
  author={Ex0bit},
  year={2026},
  url={https://huggingface.co/Ex0bit/MiniMax-SLURPY-DQ-MLX}
}
```