𓌳 REAP𓌳 the Experts: Why Pruning Prevails for One-Shot MoE Compression
📄 Paper • 💻 Code • 📝 Blog
MiniMax-M2.1-REAP-50-W4A16
⚠️ Note: This is a re-upload of 0xSero's quantized and pruned MiniMax-M2.1-REAP-50-W4A16 model. The original creator (0xSero) has explicitly authorized this re-upload. All credit for the quantization and pruning work goes to 0xSero.
✨ Highlights
50% Expert-Pruned + INT4 Quantized — Double compression for efficient deployment.
- REAP + AutoRound: Expert pruning + weight quantization
- Optimized for Code & Tools: Calibrated on code generation and function calling
- Lower VRAM: Fits in 96 GB of VRAM
50% expert-pruned MiniMax-M2.1 using REAP (Router-weighted Expert Activation Pruning)
| Property | Value |
|---|---|
| Base Model | MiniMaxAI/MiniMax-M2.1 |
| Parameters after 50% REAP | ~116B |
| Experts | 128/256 (50% retained) |
| Architecture | MoE (Mixture of Experts) |
| Quantization | INT4 weights, FP16 activations |
| Format | GPTQ (AutoRound) |
| Disk Size | 62.6GB |
| Stability | 2 looping failures in stress tests |
Stress Test Results
Tested at 4 temperatures (0.0, 0.2, 0.7, 1.0) across 6 prompt types, 24 tests in total. Full notes: MiniMax-M2.1 REAP Stress Test Observations.
| Temperature | math_word | reasoning | code | json | instruction | creative |
|---|---|---|---|---|---|---|
| 0.0 | Loop | OK | OK | OK | OK | OK |
| 0.2 | Loop | OK | OK | OK | OK | OK |
| 0.7 | OK | OK | OK | OK | OK | OK |
| 1.0 | OK | OK | OK | OK | OK | OK |
Result: 22/24 tests passed; 2 looping failures (math_word at temperatures 0.0 and 0.2)
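For reference, here is a simple sketch of the kind of repetition check that can flag such loops. The card does not specify the exact detector used, so this is only an illustrative heuristic: treat an output as a loop if its tail is the same n-gram repeated back to back.

```python
# Illustrative loop heuristic (not the exact detector used for these tests):
# flag an output when the last `repeats` windows of `size` tokens are identical.
def looks_like_loop(text: str, max_ngram: int = 20, repeats: int = 3) -> bool:
    tokens = text.split()
    for size in range(max_ngram, 4, -1):
        tail = tokens[-size * repeats:]
        if len(tail) < size * repeats:
            continue
        chunks = [tuple(tail[i:i + size]) for i in range(0, size * repeats, size)]
        if len(set(chunks)) == 1:
            return True
    return False

assert looks_like_loop("the answer is 42 " * 30)           # degenerate repetition
assert not looks_like_loop("The derivative of x^2 is 2x.")  # normal short answer
```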
🚀 Deployment
vLLM (Recommended)
```bash
vllm serve plezan/MiniMax-M2.1-REAP-50-W4A16 \
    --tensor-parallel-size 4 \
    --trust-remote-code \
    --quantization gptq
```
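Once the server is running, it exposes vLLM's OpenAI-compatible API (port 8000 by default). A minimal client sketch, assuming that default port and the openai Python package; the prompt is just an example:

```python
# Query the vLLM server started above via its OpenAI-compatible endpoint.
# Assumes the default local port 8000.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="plezan/MiniMax-M2.1-REAP-50-W4A16",
    messages=[{"role": "user", "content": "Write a Python function that parses an ISO 8601 date."}],
    temperature=0.7,
    max_tokens=512,
)
print(response.choices[0].message.content)
```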
Transformers
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "plezan/MiniMax-M2.1-REAP-50-W4A16",
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "plezan/MiniMax-M2.1-REAP-50-W4A16",
    trust_remote_code=True,
)
```
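A short generation sketch using the standard chat-template flow; the prompt and sampling settings are illustrative:

```python
# Example generation pass with the model and tokenizer loaded above.
messages = [{"role": "user", "content": "Explain what a mixture-of-experts router does."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```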
Why 50% Pruning?
The 50% pruning ratio offers:
- Size reduction: ~116B parameters vs. ~230B for the original model (roughly half the size)
- Performance: minimal quality degradation thanks to calibration-guided expert selection
- At the cost of stability: 2 looping failures in the stress tests above

A 40% pruning ratio would offer a better overall balance.
Model Comparison
| Model | Experts | Loops | Size | Status |
|---|---|---|---|---|
| MiniMax-M2.1-REAP-20 | 204 | 1 | 185B | Deprecated |
| MiniMax-M2.1-REAP-30 | 180 | 0 | 162B | Recommended |
| MiniMax-M2.1-REAP-40 | 154 | 0 | 139B | Recommended |
| MiniMax-M2.1-REAP-50 | 128 | 2 | 116B | Deprecated |
Note: The models listed above were originally published on 0xSero's account; some have since been removed by the creator. This re-upload preserves the 50% pruned + quantized version with authorization.
REAP Methodology
REAP (Router-weighted Expert Activation Pruning) uses calibration data to identify which experts are most important based on router activation patterns. Unlike random or magnitude-based pruning, REAP preserves the experts that are actually used during inference.
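To make the idea concrete, here is a simplified, illustrative sketch of a REAP-style saliency score (not the exact Cerebras implementation): for each expert, average the router gate weight times the magnitude of that expert's output over the calibration tokens routed to it, then keep the top-k experts per MoE layer.

```python
# Simplified REAP-style expert saliency (illustrative, per MoE layer).
import torch

def reap_saliency(router_probs: torch.Tensor,     # [tokens, experts] gate weights (0 if not routed)
                  expert_out_norms: torch.Tensor  # [tokens, experts] ||f_j(x)|| per routed token
                  ) -> torch.Tensor:
    # Router-weighted activation: gate weight * expert output magnitude,
    # averaged over the calibration tokens each expert actually saw.
    weighted = router_probs * expert_out_norms
    routed = (router_probs > 0).sum(dim=0).clamp(min=1)
    return weighted.sum(dim=0) / routed

def select_experts(saliency: torch.Tensor, keep_ratio: float = 0.5) -> torch.Tensor:
    # Keep the highest-saliency experts (128 of 256 at a 50% ratio).
    k = int(saliency.numel() * keep_ratio)
    return torch.topk(saliency, k).indices.sort().values
```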
Calibration Dataset: 2098 samples
- pile-10k: 498 samples (general text)
- evol-codealpaca: 800 samples (code generation)
- xlam-function-calling: 800 samples (function calling)
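A sketch of how a similar calibration mix could be assembled with the 🤗 Datasets library. The repository IDs and column names below are assumptions; the card only lists dataset names and sample counts:

```python
# Assemble a calibration text mix (repo IDs and column names are assumptions).
from datasets import load_dataset

def sample_texts(ds, n, to_text):
    return [to_text(row) for row in ds.shuffle(seed=0).select(range(n))]

pile  = load_dataset("NeelNanda/pile-10k", split="train")                    # general text
code  = load_dataset("theblackcat102/evol-codealpaca-v1", split="train")     # code generation
tools = load_dataset("Salesforce/xlam-function-calling-60k", split="train")  # function calling

calibration = (
    sample_texts(pile, 498, lambda r: r["text"])
    + sample_texts(code, 800, lambda r: r["instruction"] + "\n" + r["output"])
    + sample_texts(tools, 800, lambda r: r["query"])
)
print(len(calibration))  # 2098
```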
🙏 Acknowledgments
This model is a derivative work based on extensive research and development by:
- 0xSero — Original quantization (GPTQ/AutoRound) and REAP pruning of MiniMax-M2.1. This re-upload is posted with explicit authorization from 0xSero.
- Prime Intellect — Compute sponsorship for the original work
- Cerebras — REAP methodology and implementation
- Intel — AutoRound quantization framework
- MiniMax — Base model (MiniMax-M2.1)