---
license: apache-2.0
base_model: MiniMaxAI/MiniMax-M2.1
tags:
- minimax
- moe
- reap
- pruned
- text-generation
library_name: transformers
pipeline_tag: text-generation
---

> [!TIP]
> Support this work: **[donate.sybilsolutions.ai](https://donate.sybilsolutions.ai)**
>
> REAP surfaces: [GLM](https://huggingface.co/spaces/0xSero/reap-glm-family) | [MiniMax](https://huggingface.co/spaces/0xSero/reap-minimax-family) | [Qwen](https://huggingface.co/spaces/0xSero/reap-qwen-family) | [Gemma](https://huggingface.co/spaces/0xSero/reap-gemma-family) | [Paper](https://arxiv.org/abs/2510.13999) | [Code](https://github.com/CerebrasResearch/reap) | [PR17](https://github.com/CerebrasResearch/reap/pull/17) | [Cerebras Collection](https://huggingface.co/collections/cerebras/cerebras-reap)

# MiniMax-M2.1-REAP-40

**40% expert-pruned MiniMax-M2.1 using REAP (Router-weighted Expert Activation Pruning)**

| Property | Value |
|----------|-------|
| Base Model | [MiniMaxAI/MiniMax-M2.1](https://huggingface.co/MiniMaxAI/MiniMax-M2.1) |
| Parameters | ~139B |
| Experts | 154/256 (60% retained) |
| Architecture | MoE (Mixture of Experts) |
| Precision | BF16 |
| VRAM Required | ~278 GB |
| Stability | **0 loops** in stress tests |

## Stress Test Results

Tested at 4 temperatures (0.0, 0.2, 0.7, 1.0) across 6 prompt types (24 tests total):

| Temperature | math_word | reasoning | code | json | instruction | creative |
|-------------|-----------|-----------|------|------|-------------|----------|
| 0.0 | OK | OK | OK | OK | OK | OK |
| 0.2 | OK | OK | OK | OK | OK | OK |
| 0.7 | OK | OK | OK | OK | OK | OK |
| 1.0 | OK | OK | OK | OK | OK | OK |

**Result: 24/24 tests passed, 0 loops detected**

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "0xSero/MiniMax-M2.1-REAP-40",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
"0xSero/MiniMax-M2.1-REAP-40", trust_remote_code=True, ) messages = [{"role": "user", "content": "Write a Python function to calculate fibonacci numbers."}] text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True) inputs = tokenizer(text, return_tensors="pt").to(model.device) outputs = model.generate(**inputs, max_new_tokens=512, temperature=0.7, do_sample=True) response = tokenizer.decode(outputs[0], skip_special_tokens=True) print(response) ``` ## DynamicCache Compatibility Fix (transformers 4.55+) If you encounter `TypeError: CacheLayerMixin.__init__() got an unexpected keyword argument`, add this before importing the model: ```python from transformers import cache_utils _orig = cache_utils.DynamicCache.__init__ def _patched(self, *args, **kwargs): cfg = kwargs.get("config") if cfg and hasattr(cfg, "model_type") and "minimax" in str(getattr(cfg, "model_type", "")): kwargs.pop("config", None) kwargs.pop("max_cache_len", None) kwargs.pop("max_batch_size", None) return _orig(self, None) return _orig(self, *args, **kwargs) cache_utils.DynamicCache.__init__ = _patched ``` ## Model Comparison | Model | Experts | Loops | Size | Status | |-------|---------|-------|------|--------| | [MiniMax-M2.1-REAP-20](https://huggingface.co/0xSero/MiniMax-M2.1-REAP-20-REPAIR-IN-PROGRESS) | 204 | 1 | 185B | Deprecated | | [MiniMax-M2.1-REAP-30](https://huggingface.co/0xSero/MiniMax-M2.1-REAP-30) | 180 | 0 | 162B | Recommended | | **MiniMax-M2.1-REAP-40** | **154** | **0** | **139B** | **Recommended** | | [MiniMax-M2.1-REAP-50](https://huggingface.co/0xSero/MiniMax-M2.1-REAP-50-REPAIR-IN-PROGRESS) | 128 | 2 | 116B | Deprecated | ## Quantized Versions - **MiniMax-M2.1-REAP-40-W4A16** (Coming Soon) - 4-bit weights, ~58GB VRAM ## Why 40% Pruning? 
The 40% pruning ratio offers the best balance of:

- **Size reduction**: 139B parameters vs. the 456B original (~70% smaller)
- **VRAM savings**: ~278 GB vs. ~912 GB (fits on 4x H100 80GB)
- **Stability**: 0 loops in comprehensive stress testing
- **Performance**: minimal quality degradation thanks to strategic expert selection

## REAP Methodology

REAP (Router-weighted Expert Activation Pruning) uses calibration data to identify which experts matter most based on router activation patterns. Unlike random or magnitude-based pruning, REAP preserves the experts that are actually used during inference.

**Calibration Dataset**: 2,098 samples

- pile-10k: 498 samples (general text)
- evol-codealpaca: 800 samples (code generation)
- xlam-function-calling: 800 samples (function calling)

## Acknowledgments

- Sponsored by [Prime Intellect](https://www.primeintellect.ai/)
- REAP implementation by [Cerebras](https://github.com/CerebrasResearch/reap)
- Base model by [MiniMax](https://huggingface.co/MiniMaxAI)

## Support and Links

If this work is useful, support Sybil Solutions:

- Donate: https://donate.sybilsolutions.ai
- X: https://x.com/0xsero
- GitHub: https://github.com/0xsero

## Sponsors

Thanks to our kind sponsors; this work would not be possible without them:

- Nvidia
- TNG Technology
- Lambda
- Prime Intellect
- HotAisle
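## Appendix: REAP Scoring Sketch (Illustrative)

The router-weighted expert scoring idea described in the REAP Methodology section can be sketched in a few lines. This is an illustrative toy, not the Cerebras implementation: all weights are random stand-ins, the experts are plain linear maps for brevity, and the saliency criterion (gate value times expert output norm, averaged over calibration tokens) is an assumption about the general approach, hedged accordingly.

```python
import numpy as np

rng = np.random.default_rng(0)

num_experts, top_k, d = 8, 2, 16
tokens = rng.normal(size=(1000, d))             # stand-in calibration activations
router = rng.normal(size=(d, num_experts))      # stand-in router weight matrix
experts = rng.normal(size=(num_experts, d, d))  # stand-in expert FFNs (linear, for brevity)

# Softmax router probabilities per token.
logits = tokens @ router
probs = np.exp(logits - logits.max(axis=1, keepdims=True))
probs /= probs.sum(axis=1, keepdims=True)

# Top-k routing: keep gates only for the selected experts.
topk_idx = np.argsort(probs, axis=1)[:, -top_k:]
gates = np.zeros_like(probs)
np.put_along_axis(gates, topk_idx, np.take_along_axis(probs, topk_idx, axis=1), axis=1)

# Router-weighted activation saliency: gate value times the norm of the
# expert's output, averaged over calibration tokens (illustrative criterion).
saliency = np.zeros(num_experts)
for e in range(num_experts):
    out_norm = np.linalg.norm(tokens @ experts[e], axis=1)
    saliency[e] = (gates[:, e] * out_norm).mean()

# Keep the top 60% of experts by saliency (as in this 40%-pruned model).
keep = int(round(num_experts * 0.6))
kept = np.argsort(saliency)[-keep:]
print("kept experts:", sorted(kept.tolist()))
```

Pruning then amounts to deleting the unkept experts' weights and renormalizing the router over the survivors at inference time.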