# MiniMax-M2.1-REAP-30

A 30% expert-pruned variant of MiniMax-M2.1, produced with REAP (Router-weighted Expert Activation Pruning).
| Property | Value |
|---|---|
| Base Model | MiniMaxAI/MiniMax-M2.1 |
| Parameters | ~162B |
| Experts | 180/256 (70% retained) |
| Architecture | MoE (Mixture of Experts) |
| Precision | BF16 |
| VRAM Required | ~324GB |
| Stability | 0 loops in stress tests |
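The ~324GB VRAM figure follows directly from the parameter count at BF16 precision. A quick sanity check (this ignores activations, KV cache, and framework overhead, which add more on top):

```python
# Rough BF16 weight-memory estimate for the pruned model.
# Ignores activations, KV cache, and framework overhead (assumption).
params = 162e9          # ~162B parameters
bytes_per_param = 2     # BF16 = 2 bytes per parameter
weight_gb = params * bytes_per_param / 1e9
print(f"~{weight_gb:.0f} GB")  # ~324 GB
```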
## Stress Test Results
Tested at 4 temperatures (0.0, 0.2, 0.7, 1.0) across 6 prompt types (24 total tests):
| Temperature | math_word | reasoning | code | json | instruction | creative |
|---|---|---|---|---|---|---|
| 0.0 | OK | OK | OK | OK | OK | OK |
| 0.2 | OK | OK | OK | OK | OK | OK |
| 0.7 | OK | OK | OK | OK | OK | OK |
| 1.0 | OK | OK | OK | OK | OK | OK |
**Result:** 24/24 tests passed, 0 loops detected.
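The exact loop detector used in the stress harness is not published here; as a hedged illustration, a minimal n-gram repetition check of the kind commonly used for detecting degenerate generation (function name and thresholds are hypothetical) might look like:

```python
def has_loop(text: str, ngram: int = 4, max_repeats: int = 3) -> bool:
    """Flag text that repeats the same ngram-word window more than max_repeats times in a row."""
    words = text.split()
    for i in range(len(words) - ngram * max_repeats):
        window = words[i:i + ngram]
        # Count consecutive back-to-back repetitions of this window.
        repeats = 1
        j = i + ngram
        while words[j:j + ngram] == window:
            repeats += 1
            j += ngram
        if repeats > max_repeats:
            return True
    return False

print(has_loop("spam ham eggs toast " * 10))                          # True: degenerate repetition
print(has_loop("this output terminates normally without repeating"))  # False
```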
## Extended High-Temperature Testing
Additional tests at temperatures 0.5, 0.8, 0.9, 1.2 (results in stress_test_results.json).
## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "0xSero/MiniMax-M2.1-REAP-30",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "0xSero/MiniMax-M2.1-REAP-30",
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "Explain quantum computing in simple terms."}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512, temperature=0.7, do_sample=True)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```
## DynamicCache Compatibility Fix (transformers 4.55+)

If you encounter `TypeError: CacheLayerMixin.__init__() got an unexpected keyword argument`, add this before loading the model:

```python
from transformers import cache_utils

_orig = cache_utils.DynamicCache.__init__

def _patched(self, *args, **kwargs):
    # Strip the keyword arguments MiniMax configs pass that newer
    # DynamicCache versions no longer accept.
    cfg = kwargs.get("config")
    if cfg and "minimax" in str(getattr(cfg, "model_type", "")):
        kwargs.pop("config", None)
        kwargs.pop("max_cache_len", None)
        kwargs.pop("max_batch_size", None)
        return _orig(self, None)
    return _orig(self, *args, **kwargs)

cache_utils.DynamicCache.__init__ = _patched
```
## Model Comparison
| Model | Experts | Loops | Size | Status |
|---|---|---|---|---|
| MiniMax-M2.1-REAP-20 | 204 | 1 | 185B | Deprecated |
| MiniMax-M2.1-REAP-30 | 180 | 0 | 162B | Recommended |
| MiniMax-M2.1-REAP-40 | 154 | 0 | 139B | Recommended |
| MiniMax-M2.1-REAP-50 | 128 | 2 | 116B | Deprecated |
## Quantized Versions

- MiniMax-M2.1-REAP-40-W4A16 (Coming Soon): 4-bit weights, ~58GB
## REAP Methodology
REAP (Router-weighted Expert Activation Pruning) uses calibration data to identify which experts are most important based on router activation patterns. Unlike random or magnitude-based pruning, REAP preserves the experts that are actually used during inference.
**Calibration dataset:** 2098 samples
- pile-10k: 498 samples (general text)
- evol-codealpaca: 800 samples (code generation)
- xlam-function-calling: 800 samples (function calling)
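As a simplified sketch of the idea (not the Cerebras implementation), router-weighted scoring can be viewed as accumulating each expert's gate probability over calibration tokens and keeping the top-scoring experts; all array shapes and the random stand-in data below are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
num_experts, num_tokens = 256, 10_000

# Stand-in for router gate probabilities collected over a calibration set
# (in practice these come from forward passes on the datasets listed above).
gates = rng.dirichlet(np.ones(num_experts), size=num_tokens)  # (tokens, experts)

# Router-weighted importance: total gate mass each expert receives.
importance = gates.sum(axis=0)  # shape (num_experts,)

# Keep the 180 highest-scoring experts (30% pruned), as in this checkpoint.
keep = np.argsort(importance)[-180:]
print(f"retained {keep.size}/{num_experts} experts")  # retained 180/256 experts
```

The contrast with magnitude-based pruning is that the score depends on how often the router actually selects an expert, not on the expert's weight values.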
## Acknowledgments
- Sponsored by Prime Intellect
- REAP implementation by Cerebras
- Base model by MiniMax