𓌳 REAP𓌳 the Experts: Why Pruning Prevails for One-Shot MoE Compression
📄 Paper · 💻 Code · 📝 Blog

MiniMax-M2.1-REAP-50-W4A16

⚠️ Note: This is a re-upload of 0xSero's quantized and pruned MiniMax-M2.1-REAP-50-W4A16 model. The original creator (0xSero) has explicitly authorized this re-upload. All credit for the quantization and pruning work goes to 0xSero.

✨ Highlights

50% Expert-Pruned + INT4 Quantized — Double compression for efficient deployment.

  • REAP + AutoRound: Expert pruning + weight quantization
  • Optimized for Code & Tools: Calibrated on code generation and function calling
  • Lower VRAM: Fits in 96 GB of VRAM

50% expert-pruned MiniMax-M2.1 using REAP (Router-weighted Expert Activation Pruning)

| Property | Value |
|---|---|
| Base Model | MiniMaxAI/MiniMax-M2.1 |
| Parameters after 50% REAP | ~116B |
| Experts | 128/256 (50% retained) |
| Architecture | MoE (Mixture of Experts) |
| Quantization | INT4 weights, FP16 activations (W4A16) |
| Format | GPTQ (AutoRound) |
| Disk Size | 62.6 GB |
| Stability | 2 loops in stress tests |
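W4A16 stores weights as 4-bit integers while keeping activations in 16-bit floats, which is where most of the disk savings come from. As a rough illustration, the sketch below packs eight 4-bit weight values into one 32-bit word; this is a generic packing scheme for illustration only, not the exact AutoRound/GPTQ layout.

```python
import numpy as np

def pack_int4(weights: np.ndarray) -> np.ndarray:
    """Pack an array of 4-bit values (0..15) into uint32 words, 8 per word."""
    assert weights.size % 8 == 0
    w = weights.astype(np.uint32).reshape(-1, 8)
    packed = np.zeros(w.shape[0], dtype=np.uint32)
    for i in range(8):
        packed |= w[:, i] << (4 * i)
    return packed

def unpack_int4(packed: np.ndarray) -> np.ndarray:
    """Recover the original 4-bit values from packed uint32 words."""
    out = np.empty((packed.size, 8), dtype=np.uint32)
    for i in range(8):
        out[:, i] = (packed >> (4 * i)) & 0xF
    return out.reshape(-1)

rng = np.random.default_rng(0)
w = rng.integers(0, 16, size=64, dtype=np.uint32)
assert np.array_equal(unpack_int4(pack_int4(w)), w)
# 64 weights: 128 bytes as FP16 vs 32 bytes packed as INT4 (4x smaller)
```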

Stress Test Results

Tested at four temperatures (0.0, 0.2, 0.7, 1.0) across six prompt types (24 tests in total):

| Temperature | math_word | reasoning | code | json | instruction | creative |
|---|---|---|---|---|---|---|
| 0.0 | Loop | OK | OK | OK | OK | OK |
| 0.2 | Loop | OK | OK | OK | OK | OK |
| 0.7 | OK | OK | OK | OK | OK | OK |
| 1.0 | OK | OK | OK | OK | OK | OK |

Result: 24/24 tests completed; 2 repetition loops detected (math_word at temperatures 0.0 and 0.2)
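The "loops" above are degenerate repetitions where the model re-emits the same span indefinitely. A simple detector along these lines (a sketch, not the harness used for these results) flags output whose tail n-gram repeats more than a threshold number of times:

```python
def has_repetition_loop(text: str, ngram: int = 12, min_repeats: int = 3) -> bool:
    """Flag text whose final `ngram`-word span occurs at least `min_repeats` times."""
    words = text.split()
    if len(words) < ngram * min_repeats:
        return False
    tail = tuple(words[-ngram:])
    count = sum(
        1 for i in range(len(words) - ngram + 1)
        if tuple(words[i:i + ngram]) == tail
    )
    return count >= min_repeats

looping = "the answer is 42 and " * 10
assert has_repetition_loop(looping, ngram=5)
assert not has_repetition_loop("a normal, non-repetitive reply", ngram=5)
```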

🚀 Deployment

vLLM (Recommended)

vllm serve plezan/MiniMax-M2.1-REAP-50-W4A16 \
    --tensor-parallel-size 4 \
    --trust-remote-code \
    --quantization gptq

Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "plezan/MiniMax-M2.1-REAP-50-W4A16",
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "plezan/MiniMax-M2.1-REAP-50-W4A16", trust_remote_code=True
)

inputs = tokenizer("Write a Python function that reverses a string.", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Why 50% Pruning?

The 50% pruning ratio balances:

  • Size reduction: ~116B vs. the 456B original (roughly 75% smaller)
  • Performance: minimal quality degradation thanks to calibration-guided expert selection
  • Stability: at the cost of 2 repetition loops in stress testing

In practice, a 40% pruning ratio offers an overall better balance (see the comparison below).

Model Comparison

| Model | Experts | Loops | Size | Status |
|---|---|---|---|---|
| MiniMax-M2.1-REAP-20 | 204 | 1 | 185B | Deprecated |
| MiniMax-M2.1-REAP-30 | 180 | 0 | 162B | Recommended |
| MiniMax-M2.1-REAP-40 | 154 | 0 | 139B | Recommended |
| MiniMax-M2.1-REAP-50 | 128 | 2 | 116B | Deprecated |

Note: Links in the table above originally pointed to the models on 0xSero's account; some have since been removed by the creator. This re-upload preserves the 50% pruned + quantized version with authorization.

REAP Methodology

REAP (Router-weighted Expert Activation Pruning) uses calibration data to identify which experts are most important based on router activation patterns. Unlike random or magnitude-based pruning, REAP preserves the experts that are actually used during inference.
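A toy sketch of the idea (not Cerebras's implementation): score each expert by its router-weighted activation magnitude over calibration tokens, then retain the top-k. The function names and the saliency formula below are illustrative assumptions.

```python
import numpy as np

def reap_expert_scores(gate_weights: np.ndarray, expert_norms: np.ndarray) -> np.ndarray:
    """
    gate_weights: (tokens, experts) router probabilities (zero if an expert is not routed)
    expert_norms: (tokens, experts) L2 norm of each expert's output per token
    Returns a per-expert saliency: mean router-weighted activation magnitude.
    """
    return (gate_weights * expert_norms).mean(axis=0)

rng = np.random.default_rng(0)
tokens, experts, keep = 1024, 8, 4
gates = rng.dirichlet(np.ones(experts), size=tokens)   # toy router outputs
norms = rng.uniform(0.5, 2.0, size=(tokens, experts))  # toy expert activations
scores = reap_expert_scores(gates, norms)
kept = np.argsort(scores)[-keep:]  # retain the top-k experts, prune the rest
print(sorted(kept.tolist()))
```

Random or magnitude-based pruning ignores the router entirely; weighting by the gate is what lets REAP keep the experts that calibration traffic actually routes to.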

Calibration Dataset: 2098 samples

  • pile-10k: 498 samples (general text)
  • evol-codealpaca: 800 samples (code generation)
  • xlam-function-calling: 800 samples (function calling)
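The mix above can be expressed as a simple sampling plan; the sketch below just encodes the per-source counts from the card and shuffles them into an interleaved calibration order (the plan-building helper is hypothetical, shown only to make the proportions concrete).

```python
import random

CALIBRATION_MIX = {  # samples per source, matching the card above
    "pile-10k": 498,
    "evol-codealpaca": 800,
    "xlam-function-calling": 800,
}

def build_calibration_plan(mix: dict, seed: int = 0) -> list:
    """Return a shuffled list of source names, one entry per calibration sample."""
    plan = [name for name, n in mix.items() for _ in range(n)]
    random.Random(seed).shuffle(plan)
    return plan

plan = build_calibration_plan(CALIBRATION_MIX)
assert len(plan) == 2098  # 498 + 800 + 800
```

Weighting the mix toward code and function-calling data is what makes this checkpoint "optimized for code & tools": the router statistics that drive pruning are dominated by those workloads.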

🙏 Acknowledgments

This model is derivative work based on extensive research and development by:

  • 0xSero — Original quantization (GPTQ/AutoRound) and REAP pruning of MiniMax-M2.1. This re-upload is posted with explicit authorization from 0xSero.
  • Prime Intellect — Compute sponsorship for the original work
  • Cerebras — REAP methodology and implementation
  • Intel — AutoRound quantization framework
  • MiniMax — Base model (MiniMax-M2.1)