𓌳 REAP𓌳 the Experts: Why Pruning Prevails for One-Shot MoE Compression
📄 Paper • 💻 Code • 📝 Blog
MiniMax-M2.1-REAP-50-W4A16
⚠️ Note: This is a re-upload of 0xSero's quantized and pruned MiniMax-M2.1-REAP-50-W4A16 model. The original creator (0xSero) has explicitly authorized this re-upload. All credit for the quantization and pruning work goes to 0xSero.
✨ Highlights
50% Expert-Pruned + INT4 Quantized — Double compression for efficient deployment.
- REAP + AutoRound: Expert pruning + weight quantization
- Optimized for Code & Tools: Calibrated on code generation and function calling
- Lower VRAM: Fits in 96 GB of VRAM
50% expert-pruned MiniMax-M2.1 using REAP (Router-weighted Expert Activation Pruning)
| Property | Value |
|---|---|
| Base Model | MiniMaxAI/MiniMax-M2.1 |
| Parameters after 50% REAP | ~116B |
| Experts | 128/256 (50% retained) |
| Architecture | MoE (Mixture of Experts) |
| Quantization | INT4 weights, FP16 activations |
| Format | GPTQ (AutoRound) |
| Disk Size | 62.6GB |
| Stability | 2 looping failures in stress tests |
Stress Test Results
Tested at 4 temperatures (0.0, 0.2, 0.7, 1.0) across 6 prompt types, 24 tests in total. Full notes: MiniMax-M2.1 REAP Stress Test Observations.
| Temperature | math_word | reasoning | code | json | instruction | creative |
|---|---|---|---|---|---|---|
| 0.0 | Loop | OK | OK | OK | OK | OK |
| 0.2 | Loop | OK | OK | OK | OK | OK |
| 0.7 | OK | OK | OK | OK | OK | OK |
| 1.0 | OK | OK | OK | OK | OK | OK |
Result: 22/24 tests passed; 2 looping failures (math_word at temperatures 0.0 and 0.2)
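For reference, here is a simple sketch of the kind of repetition check that can flag such loops. The card does not specify the exact detector used, so this is only an illustrative heuristic: treat an output as a loop if its tail is the same n-gram repeated back to back.

```python
# Illustrative loop heuristic (not the exact detector used for these tests):
# flag an output when the last `repeats` windows of `size` tokens are identical.
def looks_like_loop(text: str, max_ngram: int = 20, repeats: int = 3) -> bool:
    tokens = text.split()
    for size in range(max_ngram, 4, -1):
        tail = tokens[-size * repeats:]
        if len(tail) < size * repeats:
            continue
        chunks = [tuple(tail[i:i + size]) for i in range(0, size * repeats, size)]
        if len(set(chunks)) == 1:
            return True
    return False

assert looks_like_loop("the answer is 42 " * 30)           # degenerate repetition
assert not looks_like_loop("The derivative of x^2 is 2x.")  # normal short answer
```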
🚀 Deployment
vLLM (Recommended)
```bash
vllm serve plezan/MiniMax-M2.1-REAP-50-W4A16 \
    --tensor-parallel-size 4 \
    --trust-remote-code \
    --quantization gptq
```
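Once the server is running, it exposes vLLM's OpenAI-compatible API (port 8000 by default). A minimal client sketch, assuming that default port and the openai Python package; the prompt is just an example:

```python
# Query the vLLM server started above via its OpenAI-compatible endpoint.
# Assumes the default local port 8000.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="plezan/MiniMax-M2.1-REAP-50-W4A16",
    messages=[{"role": "user", "content": "Write a Python function that parses an ISO 8601 date."}],
    temperature=0.7,
    max_tokens=512,
)
print(response.choices[0].message.content)
```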
Transformers
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "plezan/MiniMax-M2.1-REAP-50-W4A16",
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "plezan/MiniMax-M2.1-REAP-50-W4A16",
    trust_remote_code=True,
)
```
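A short generation sketch using the standard chat-template flow; the prompt and sampling settings are illustrative:

```python
# Example generation pass with the model and tokenizer loaded above.
messages = [{"role": "user", "content": "Explain what a mixture-of-experts router does."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```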
Why 50% Pruning?
The 50% pruning ratio offers:
- Size reduction: ~116B parameters vs. ~230B for the original model (roughly half the size)
- Performance: minimal quality degradation thanks to calibration-guided expert selection
- At the cost of stability: 2 looping failures in the stress tests above

A 40% pruning ratio would offer a better overall balance.
Model Comparison
| Model | Experts | Loops | Size | Status |
|---|---|---|---|---|
| MiniMax-M2.1-REAP-20 | 204 | 1 | 185B | Deprecated |
| MiniMax-M2.1-REAP-30 | 180 | 0 | 162B | Recommended |
| MiniMax-M2.1-REAP-40 | 154 | 0 | 139B | Recommended |
| MiniMax-M2.1-REAP-50 | 128 | 2 | 116B | Deprecated |
Note: The models listed above were originally published on 0xSero's account; some have since been removed by the creator. This re-upload preserves the 50% pruned + quantized version with authorization.
REAP Methodology
REAP (Router-weighted Expert Activation Pruning) uses calibration data to identify which experts are most important based on router activation patterns. Unlike random or magnitude-based pruning, REAP preserves the experts that are actually used during inference.
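To make the idea concrete, here is a simplified, illustrative sketch of a REAP-style saliency score (not the exact Cerebras implementation): for each expert, average the router gate weight times the magnitude of that expert's output over the calibration tokens routed to it, then keep the top-k experts per MoE layer.

```python
# Simplified REAP-style expert saliency (illustrative, per MoE layer).
import torch

def reap_saliency(router_probs: torch.Tensor,     # [tokens, experts] gate weights (0 if not routed)
                  expert_out_norms: torch.Tensor  # [tokens, experts] ||f_j(x)|| per routed token
                  ) -> torch.Tensor:
    # Router-weighted activation: gate weight * expert output magnitude,
    # averaged over the calibration tokens each expert actually saw.
    weighted = router_probs * expert_out_norms
    routed = (router_probs > 0).sum(dim=0).clamp(min=1)
    return weighted.sum(dim=0) / routed

def select_experts(saliency: torch.Tensor, keep_ratio: float = 0.5) -> torch.Tensor:
    # Keep the highest-saliency experts (128 of 256 at a 50% ratio).
    k = int(saliency.numel() * keep_ratio)
    return torch.topk(saliency, k).indices.sort().values
```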
Calibration Dataset: 2098 samples
- pile-10k: 498 samples (general text)
- evol-codealpaca: 800 samples (code generation)
- xlam-function-calling: 800 samples (function calling)
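A sketch of how a similar calibration mix could be assembled with the 🤗 Datasets library. The repository IDs and column names below are assumptions; the card only lists dataset names and sample counts:

```python
# Assemble a calibration text mix (repo IDs and column names are assumptions).
from datasets import load_dataset

def sample_texts(ds, n, to_text):
    return [to_text(row) for row in ds.shuffle(seed=0).select(range(n))]

pile  = load_dataset("NeelNanda/pile-10k", split="train")                    # general text
code  = load_dataset("theblackcat102/evol-codealpaca-v1", split="train")     # code generation
tools = load_dataset("Salesforce/xlam-function-calling-60k", split="train")  # function calling

calibration = (
    sample_texts(pile, 498, lambda r: r["text"])
    + sample_texts(code, 800, lambda r: r["instruction"] + "\n" + r["output"])
    + sample_texts(tools, 800, lambda r: r["query"])
)
print(len(calibration))  # 2098
```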
🙏 Acknowledgments
This model is a derivative work based on extensive research and development by:
- 0xSero — Original quantization (GPTQ/AutoRound) and REAP pruning of MiniMax-M2.1. This re-upload is posted with explicit authorization from 0xSero.
- Prime Intellect — Compute sponsorship for the original work
- Cerebras — REAP methodology and implementation
- Intel — AutoRound quantization framework
- MiniMax — Base model (MiniMax-M2.1)