+---
+license: apache-2.0
+base_model: MiniMaxAI/MiniMax-M2.1
+tags:
+- minimax
+- moe
+- reap
+- pruned
+- text-generation
+library_name: transformers
+pipeline_tag: text-generation
+---
+# MiniMax-M2.1-REAP-40
+**40% expert-pruned MiniMax-M2.1 using REAP (Router-weighted Expert Activation Pruning)**
+| Property | Value |
+|----------|-------|
+| Base Model | [MiniMaxAI/MiniMax-M2.1](https://huggingface.co/MiniMaxAI/MiniMax-M2.1) |
+| Parameters | ~139B |
+| Experts | 154/256 (60% retained) |
+| Architecture | MoE (Mixture of Experts) |
+| Precision | BF16 |
+| VRAM Required | ~278GB (BF16 weights only; KV cache and activations are extra) |
+| Stability | **0 loops** in stress tests |
+## Stress Test Results
+Tested at 4 temperatures (0.0, 0.2, 0.7, 1.0) across 6 prompt types (24 total tests):
+| Temperature | math_word | reasoning | code | json | instruction | creative |
+|-------------|-----------|-----------|------|------|-------------|----------|
+| 0.0 | OK | OK | OK | OK | OK | OK |
+| 0.2 | OK | OK | OK | OK | OK | OK |
+| 0.7 | OK | OK | OK | OK | OK | OK |
+| 1.0 | OK | OK | OK | OK | OK | OK |
+**Result: 24/24 tests passed, 0 loops detected**
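The card does not state how a "loop" is detected. As a minimal sketch, assuming a loop means the same block of whitespace-separated tokens repeating back-to-back, a check could look like the following. The function name `has_loop` and the parameters `max_period` and `min_repeats` are hypothetical and are not taken from the actual test harness.

```python
def has_loop(text: str, max_period: int = 20, min_repeats: int = 4) -> bool:
    """Return True if some block of up to `max_period` tokens repeats
    back-to-back at least `min_repeats` times (a crude repetition heuristic)."""
    tokens = text.split()
    for period in range(1, max_period + 1):
        run = 0  # consecutive positions where tokens[i] == tokens[i - period]
        for i in range(period, len(tokens)):
            if tokens[i] == tokens[i - period]:
                run += 1
                # A block of length `period` repeated k times yields
                # a run of period * (k - 1) matching positions.
                if run >= period * (min_repeats - 1):
                    return True
            else:
                run = 0
    return False


# Hypothetical grid mirroring the table above: 4 temperatures x 6 prompt types.
temperatures = [0.0, 0.2, 0.7, 1.0]
prompt_types = ["math_word", "reasoning", "code", "json", "instruction", "creative"]
grid = [(t, p) for t in temperatures for p in prompt_types]
print(len(grid))  # 24
```

Each generated output would be passed through such a check, and a cell is marked OK only when no repetition is flagged.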
+## Usage
+```python
+import torch
+from transformers import AutoModelForCausalLM, AutoTokenizer
+
+model_id = "0xSero/MiniMax-M2.1-REAP-40"
+
+# trust_remote_code=True lets transformers run the modeling code shipped in the repo.
+model = AutoModelForCausalLM.from_pretrained(
+    model_id,
+    torch_dtype=torch.bfloat16,
+    device_map="auto",  # shard the weights across all visible GPUs
+    trust_remote_code=True,
+)
+tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
+
+messages = [{"role": "user", "content": "Write a Python function to calculate fibonacci numbers."}]
+text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+inputs = tokenizer(text, return_tensors="pt").to(model.device)
+
+outputs = model.generate(**inputs, max_new_tokens=512, temperature=0.7, do_sample=True)
+
+# Decode only the newly generated tokens, not the echoed prompt.
+new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
+response = tokenizer.decode(new_tokens, skip_special_tokens=True)
+print(response)
+```
+## DynamicCache Compatibility Fix (transformers 4.55+)
+If you encounter `TypeError: CacheLayerMixin.__init__() got an unexpected keyword argument`, apply this patch before loading the model:
+```python
+from transformers import cache_utils
+
+# Keep a reference to the original constructor so other models are unaffected.
+_orig = cache_utils.DynamicCache.__init__
+
+def _patched(self, *args, **kwargs):
+    cfg = kwargs.get("config")
+    # For MiniMax configs, ignore the constructor arguments and build a
+    # default (empty) cache instead.
+    if cfg and "minimax" in str(getattr(cfg, "model_type", "")):
+        return _orig(self, None)
+    return _orig(self, *args, **kwargs)
+
+cache_utils.DynamicCache.__init__ = _patched
+```
+## Model Comparison
+| Model | Experts | Loops | Size | Status |
+|-------|---------|-------|------|--------|
+| [MiniMax-M2.1-REAP-20](https://huggingface.co/0xSero/MiniMax-M2.1-REAP-20-REPAIR-IN-PROGRESS) | 204 | 1 | 185B | Deprecated |
+| [MiniMax-M2.1-REAP-30](https://huggingface.co/0xSero/MiniMax-M2.1-REAP-30) | 180 | 0 | 162B | Recommended |
+| **MiniMax-M2.1-REAP-40** | **154** | **0** | **139B** | **Recommended** |
+| [MiniMax-M2.1-REAP-50](https://huggingface.co/0xSero/MiniMax-M2.1-REAP-50-REPAIR-IN-PROGRESS) | 128 | 2 | 116B | Deprecated |
+## Quantized Versions
+- **MiniMax-M2.1-REAP-40-W4A16** (Coming Soon) - 4-bit weights, ~58GB VRAM
+## Why 40% Pruning?
+The 40% pruning ratio offers the best balance of:
+- **Size reduction**: 139B vs ~230B original (about 40% smaller)
+- **VRAM savings**: ~278GB vs ~460GB in BF16 (fits on 4x H100 80GB)
+- **Stability**: 0 loops in comprehensive stress testing
+- **Performance**: Minimal quality degradation from strategic expert selection
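The VRAM figure follows from simple arithmetic: BF16 stores each parameter in 2 bytes, so 139B parameters need about 278GB for weights alone. The sketch below applies the same arithmetic to the sizes listed in the Model Comparison table and checks them against 4x 80GB GPUs. The helper names are illustrative, and the estimate ignores KV cache, activations, and framework overhead.

```python
BYTES_PER_PARAM_BF16 = 2


def weight_memory_gb(params_billion: float) -> float:
    """Weights-only footprint in GB (1 GB = 1e9 bytes) at BF16 precision."""
    return params_billion * BYTES_PER_PARAM_BF16


def fits(params_billion: float, num_gpus: int = 4, gpu_gb: int = 80) -> bool:
    """True if the BF16 weights alone fit in the combined GPU memory."""
    return weight_memory_gb(params_billion) <= num_gpus * gpu_gb


# Parameter counts (in billions) from the Model Comparison table.
sizes = {"REAP-20": 185, "REAP-30": 162, "REAP-40": 139, "REAP-50": 116}
for name, params in sizes.items():
    print(name, weight_memory_gb(params), "GB", "fits" if fits(params) else "does not fit")
```

By this estimate, REAP-40 is the largest of the four variants whose weights fit in 320GB, leaving roughly 42GB of headroom for everything else.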
+## REAP Methodology
+REAP (Router-weighted Expert Activation Pruning) runs calibration data through the model and scores each expert by combining the router's gate weight with the magnitude of the expert's output, then removes the lowest-scoring experts. Unlike random or magnitude-based pruning, REAP preserves the experts that are actually used during inference.
+**Calibration Dataset**: 2098 samples
+- pile-10k: 498 samples (general text)
+- evol-codealpaca: 800 samples (code generation)
+- xlam-function-calling: 800 samples (function calling)
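To make the scoring idea concrete, here is a toy sketch in the spirit of REAP, not the Cerebras implementation. It assumes each expert's score is the average, over the calibration tokens routed to it, of the gate weight times the L2 norm of the expert's output. The function names and the `records` format are hypothetical.

```python
import math
from collections import defaultdict


def reap_style_scores(records):
    """records: iterable of (expert_id, gate_weight, expert_output_vector),
    one entry per (token, selected expert) pair from a calibration pass."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for expert_id, gate, output in records:
        norm = math.sqrt(sum(v * v for v in output))
        totals[expert_id] += gate * norm
        counts[expert_id] += 1
    return {e: totals[e] / counts[e] for e in totals}


def experts_to_keep(scores, num_experts, keep):
    """Keep the `keep` highest-scoring experts; experts never routed to score 0."""
    ranked = sorted(range(num_experts), key=lambda e: scores.get(e, 0.0), reverse=True)
    return sorted(ranked[:keep])


# Toy calibration records for a layer with 4 experts.
records = [
    (0, 0.9, [3.0, 4.0]),  # 0.9 * 5.0
    (0, 0.5, [0.0, 2.0]),  # 0.5 * 2.0
    (1, 0.1, [1.0, 0.0]),  # 0.1 * 1.0
    (2, 0.6, [0.0, 5.0]),  # 0.6 * 5.0
]
scores = reap_style_scores(records)
keep = experts_to_keep(scores, num_experts=4, keep=2)
print(keep)  # [0, 2]
```

In this toy layer, expert 1 is routed to but contributes little, and expert 3 is never selected, so both are pruned. The real pipeline applies its criterion per layer across the full calibration set.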
+## Acknowledgments
+- Sponsored by [Prime Intellect](https://www.primeintellect.ai/)
+- REAP implementation by [Cerebras](https://github.com/Cerebras/reap)
+- Base model by [MiniMax](https://huggingface.co/MiniMaxAI)