---
license: apache-2.0
base_model: MiniMaxAI/MiniMax-M2.1
tags:
- minimax
- moe
- reap
- pruned
- text-generation
library_name: transformers
pipeline_tag: text-generation
---
> [!TIP]
> Support this work: **[donate.sybilsolutions.ai](https://donate.sybilsolutions.ai)**
>
> REAP surfaces: [GLM](https://huggingface.co/spaces/0xSero/reap-glm-family) | [MiniMax](https://huggingface.co/spaces/0xSero/reap-minimax-family) | [Qwen](https://huggingface.co/spaces/0xSero/reap-qwen-family) | [Gemma](https://huggingface.co/spaces/0xSero/reap-gemma-family) | [Paper](https://arxiv.org/abs/2510.13999) | [Code](https://github.com/CerebrasResearch/reap) | [PR17](https://github.com/CerebrasResearch/reap/pull/17) | [Cerebras Collection](https://huggingface.co/collections/cerebras/cerebras-reap)
# MiniMax-M2.1-REAP-40
**40% expert-pruned MiniMax-M2.1 using REAP (Router-weighted Expert Activation Pruning)**
| Property | Value |
|----------|-------|
| Base Model | [MiniMaxAI/MiniMax-M2.1](https://huggingface.co/MiniMaxAI/MiniMax-M2.1) |
| Parameters | ~139B |
| Experts | 154/256 (60% retained) |
| Architecture | MoE (Mixture of Experts) |
| Precision | BF16 |
| VRAM Required | ~278GB |
| Stability | **0 loops** in stress tests |
## Stress Test Results
Tested at 4 temperatures (0.0, 0.2, 0.7, 1.0) across 6 prompt types (24 total tests):
| Temperature | math_word | reasoning | code | json | instruction | creative |
|-------------|-----------|-----------|------|------|-------------|----------|
| 0.0 | OK | OK | OK | OK | OK | OK |
| 0.2 | OK | OK | OK | OK | OK | OK |
| 0.7 | OK | OK | OK | OK | OK | OK |
| 1.0 | OK | OK | OK | OK | OK | OK |
**Result: 24/24 tests passed, 0 loops detected**
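For readers who want to reproduce the loop check on their own outputs, here is a minimal sketch of the kind of detector the stress tests imply (the exact heuristic used is not published; the n-gram window and repeat threshold here are illustrative assumptions):

```python
def has_loop(text: str, n: int = 8, min_repeats: int = 4) -> bool:
    """Flag an output as looping if any n-gram of words repeats
    back-to-back at least min_repeats times."""
    tokens = text.split()
    for i in range(len(tokens) - n * min_repeats + 1):
        gram = tokens[i:i + n]
        if all(tokens[i + k * n:i + (k + 1) * n] == gram for k in range(min_repeats)):
            return True
    return False

print(has_loop("the cat sat " * 20, n=3, min_repeats=4))  # repeated trigram -> True
print(has_loop("a b c d e f g h i j"))                    # no repetition -> False
```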
## Usage
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# ~278GB of BF16 weights; device_map="auto" shards them across available GPUs.
model = AutoModelForCausalLM.from_pretrained(
    "0xSero/MiniMax-M2.1-REAP-40",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "0xSero/MiniMax-M2.1-REAP-40",
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "Write a Python function to calculate fibonacci numbers."}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=512, temperature=0.7, do_sample=True)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```
## DynamicCache Compatibility Fix (transformers 4.55+)
If you encounter `TypeError: CacheLayerMixin.__init__() got an unexpected keyword argument`, apply this patch before loading the model:
```python
from transformers import cache_utils

_orig = cache_utils.DynamicCache.__init__

def _patched(self, *args, **kwargs):
    # transformers 4.55+ passes config/max_cache_len/max_batch_size kwargs
    # that the MiniMax cache layers do not accept; strip them for MiniMax.
    cfg = kwargs.get("config")
    if cfg is not None and "minimax" in str(getattr(cfg, "model_type", "")):
        kwargs.pop("config", None)
        kwargs.pop("max_cache_len", None)
        kwargs.pop("max_batch_size", None)
        return _orig(self, None)  # fall back to a plain, unconfigured cache
    return _orig(self, *args, **kwargs)

cache_utils.DynamicCache.__init__ = _patched
```
## Model Comparison
| Model | Experts | Loops | Size | Status |
|-------|---------|-------|------|--------|
| [MiniMax-M2.1-REAP-20](https://huggingface.co/0xSero/MiniMax-M2.1-REAP-20-REPAIR-IN-PROGRESS) | 204 | 1 | 185B | Deprecated |
| [MiniMax-M2.1-REAP-30](https://huggingface.co/0xSero/MiniMax-M2.1-REAP-30) | 180 | 0 | 162B | Recommended |
| **MiniMax-M2.1-REAP-40** | **154** | **0** | **139B** | **Recommended** |
| [MiniMax-M2.1-REAP-50](https://huggingface.co/0xSero/MiniMax-M2.1-REAP-50-REPAIR-IN-PROGRESS) | 128 | 2 | 116B | Deprecated |
## Quantized Versions
- **MiniMax-M2.1-REAP-40-W4A16** (Coming Soon) - 4-bit weights, ~58GB VRAM
## Why 40% Pruning?
The 40% pruning ratio offers the best balance of:
- **Size reduction**: ~139B vs ~230B in the unpruned base (~40% smaller)
- **VRAM savings**: ~278GB vs ~460GB in BF16 (fits on 4x H100 80GB)
- **Stability**: 0 loops in comprehensive stress testing
- **Performance**: Minimal quality degradation from strategic expert selection
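The headline memory figure follows directly from the parameter count. This quick check assumes BF16 (2 bytes per parameter) and covers weights only, excluding KV cache and activations:

```python
# Sanity-check the model-card arithmetic: BF16 stores 2 bytes per parameter.
params = 139e9                      # ~139B parameters after 40% expert pruning
vram_gb = params * 2 / 1e9          # weight bytes -> GB
experts_kept, experts_total = 154, 256

print(f"~{vram_gb:.0f} GB BF16 weights")                       # ~278 GB
print(f"{experts_kept / experts_total:.0%} experts retained")  # 60% retained
```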
## REAP Methodology
REAP (Router-weighted Expert Activation Pruning) uses calibration data to identify which experts are most important based on router activation patterns. Unlike random or magnitude-based pruning, REAP preserves the experts that are actually used during inference.
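In rough terms, the scoring step can be sketched as follows. This is a minimal illustration of router-weighted saliency, not the Cerebras implementation; the array shapes, the use of output norms, and the top-k retention are assumptions for exposition:

```python
import numpy as np

def reap_style_scores(gate_probs, expert_out_norms):
    """Illustrative REAP-style saliency per expert: average of
    (router probability * expert output magnitude) over calibration tokens.

    gate_probs:       (tokens, num_experts) router probabilities
    expert_out_norms: (tokens, num_experts) L2 norms of each expert's output
    """
    return (gate_probs * expert_out_norms).mean(axis=0)

rng = np.random.default_rng(0)
tokens, num_experts = 1024, 8
gate = rng.dirichlet(np.ones(num_experts), size=tokens)    # rows sum to 1
norms = rng.uniform(0.5, 2.0, size=(tokens, num_experts))

scores = reap_style_scores(gate, norms)
keep = np.argsort(scores)[-5:]   # retain the top ~60% of experts per layer
print(sorted(keep.tolist()))
```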
**Calibration Dataset**: 2098 samples
- pile-10k: 498 samples (general text)
- evol-codealpaca: 800 samples (code generation)
- xlam-function-calling: 800 samples (function calling)
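The mixture above, written out as a quick consistency check of the sample counts (dataset names are as listed on the card, not full Hub identifiers):

```python
# Calibration mixture from the card; verify the total and per-source shares.
mixture = {
    "pile-10k": 498,               # general text
    "evol-codealpaca": 800,        # code generation
    "xlam-function-calling": 800,  # function calling
}
total = sum(mixture.values())
print(total)  # 2098
for name, n in mixture.items():
    print(f"{name}: {n} samples ({n / total:.0%})")
```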
## Acknowledgments
- Sponsored by [Prime Intellect](https://www.primeintellect.ai/)
- REAP implementation by [Cerebras](https://github.com/CerebrasResearch/reap)
- Base model by [MiniMax](https://huggingface.co/MiniMaxAI)
## Support
If this work is useful, support Sybil Solutions here: [https://donate.sybilsolutions.ai](https://donate.sybilsolutions.ai)
<!-- SERO_MANAGED_TOP_LINKS_START -->
## Support and links
- Donate: https://donate.sybilsolutions.ai
- X: https://x.com/0xsero
- GitHub: https://github.com/0xsero
<!-- SERO_MANAGED_TOP_LINKS_END -->
## Sponsors
Thank you to our kind sponsors; this work wouldn't be possible without them:
- Nvidia
- TNG Technology
- Lambda
- Prime Intellect
- HotAisle