---
license: apache-2.0
base_model: MiniMaxAI/MiniMax-M2.1
tags:
- minimax
- moe
- reap
- pruned
- text-generation
library_name: transformers
pipeline_tag: text-generation
---
> [!TIP]
> Support this work: **[donate.sybilsolutions.ai](https://donate.sybilsolutions.ai)**
>
> REAP surfaces: [GLM](https://huggingface.co/spaces/0xSero/reap-glm-family) | [MiniMax](https://huggingface.co/spaces/0xSero/reap-minimax-family) | [Qwen](https://huggingface.co/spaces/0xSero/reap-qwen-family) | [Gemma](https://huggingface.co/spaces/0xSero/reap-gemma-family) | [Paper](https://arxiv.org/abs/2510.13999) | [Code](https://github.com/CerebrasResearch/reap) | [PR17](https://github.com/CerebrasResearch/reap/pull/17) | [Cerebras Collection](https://huggingface.co/collections/cerebras/cerebras-reap)
# MiniMax-M2.1-REAP-40
**40% expert-pruned MiniMax-M2.1 using REAP (Router-weighted Expert Activation Pruning)**
| Property | Value |
|----------|-------|
| Base Model | [MiniMaxAI/MiniMax-M2.1](https://huggingface.co/MiniMaxAI/MiniMax-M2.1) |
| Parameters | ~139B |
| Experts | 154/256 (60% retained) |
| Architecture | MoE (Mixture of Experts) |
| Precision | BF16 |
| VRAM Required | ~278GB |
| Stability | **0 loops** in stress tests |
## Stress Test Results
Tested at 4 temperatures (0.0, 0.2, 0.7, 1.0) across 6 prompt types (24 total tests):
| Temperature | math_word | reasoning | code | json | instruction | creative |
|-------------|-----------|-----------|------|------|-------------|----------|
| 0.0 | OK | OK | OK | OK | OK | OK |
| 0.2 | OK | OK | OK | OK | OK | OK |
| 0.7 | OK | OK | OK | OK | OK | OK |
| 1.0 | OK | OK | OK | OK | OK | OK |
**Result: 24/24 tests passed, 0 loops detected**
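For readers who want to reproduce the loop check on their own outputs, here is a minimal sketch of the kind of detector the stress tests imply (the exact heuristic used is not published; the n-gram window and repeat threshold here are illustrative assumptions):

```python
def has_loop(text: str, n: int = 8, min_repeats: int = 4) -> bool:
    """Flag an output as looping if any n-gram of words repeats
    back-to-back at least min_repeats times."""
    tokens = text.split()
    for i in range(len(tokens) - n * min_repeats + 1):
        gram = tokens[i:i + n]
        if all(tokens[i + k * n:i + (k + 1) * n] == gram for k in range(min_repeats)):
            return True
    return False

print(has_loop("the cat sat " * 20, n=3, min_repeats=4))  # repeated trigram -> True
print(has_loop("a b c d e f g h i j"))                    # no repetition -> False
```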
## Usage
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# ~278GB of BF16 weights; device_map="auto" shards them across available GPUs.
model = AutoModelForCausalLM.from_pretrained(
    "0xSero/MiniMax-M2.1-REAP-40",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "0xSero/MiniMax-M2.1-REAP-40",
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "Write a Python function to calculate fibonacci numbers."}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=512, temperature=0.7, do_sample=True)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```
## DynamicCache Compatibility Fix (transformers 4.55+)
If you encounter `TypeError: CacheLayerMixin.__init__() got an unexpected keyword argument`, apply this patch before loading the model:
```python
from transformers import cache_utils

_orig = cache_utils.DynamicCache.__init__

def _patched(self, *args, **kwargs):
    # transformers 4.55+ passes config/max_cache_len/max_batch_size kwargs
    # that the MiniMax cache layers do not accept; strip them for MiniMax.
    cfg = kwargs.get("config")
    if cfg is not None and "minimax" in str(getattr(cfg, "model_type", "")):
        kwargs.pop("config", None)
        kwargs.pop("max_cache_len", None)
        kwargs.pop("max_batch_size", None)
        return _orig(self, None)  # fall back to a plain, unconfigured cache
    return _orig(self, *args, **kwargs)

cache_utils.DynamicCache.__init__ = _patched
```
## Model Comparison
| Model | Experts | Loops | Size | Status |
|-------|---------|-------|------|--------|
| [MiniMax-M2.1-REAP-20](https://huggingface.co/0xSero/MiniMax-M2.1-REAP-20-REPAIR-IN-PROGRESS) | 204 | 1 | 185B | Deprecated |
| [MiniMax-M2.1-REAP-30](https://huggingface.co/0xSero/MiniMax-M2.1-REAP-30) | 180 | 0 | 162B | Recommended |
| **MiniMax-M2.1-REAP-40** | **154** | **0** | **139B** | **Recommended** |
| [MiniMax-M2.1-REAP-50](https://huggingface.co/0xSero/MiniMax-M2.1-REAP-50-REPAIR-IN-PROGRESS) | 128 | 2 | 116B | Deprecated |
## Quantized Versions
- **MiniMax-M2.1-REAP-40-W4A16** (Coming Soon) - 4-bit weights, ~58GB VRAM
## Why 40% Pruning?
The 40% pruning ratio offers the best balance of:
- **Size reduction**: ~139B vs ~230B in the unpruned base (~40% smaller)
- **VRAM savings**: ~278GB vs ~460GB in BF16 (fits on 4x H100 80GB)
- **Stability**: 0 loops in comprehensive stress testing
- **Performance**: Minimal quality degradation from strategic expert selection
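The headline memory figure follows directly from the parameter count. This quick check assumes BF16 (2 bytes per parameter) and covers weights only, excluding KV cache and activations:

```python
# Sanity-check the model-card arithmetic: BF16 stores 2 bytes per parameter.
params = 139e9                      # ~139B parameters after 40% expert pruning
vram_gb = params * 2 / 1e9          # weight bytes -> GB
experts_kept, experts_total = 154, 256

print(f"~{vram_gb:.0f} GB BF16 weights")                       # ~278 GB
print(f"{experts_kept / experts_total:.0%} experts retained")  # 60% retained
```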
## REAP Methodology
REAP (Router-weighted Expert Activation Pruning) uses calibration data to identify which experts are most important based on router activation patterns. Unlike random or magnitude-based pruning, REAP preserves the experts that are actually used during inference.
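In rough terms, the scoring step can be sketched as follows. This is a minimal illustration of router-weighted saliency, not the Cerebras implementation; the array shapes, the use of output norms, and the top-k retention are assumptions for exposition:

```python
import numpy as np

def reap_style_scores(gate_probs, expert_out_norms):
    """Illustrative REAP-style saliency per expert: average of
    (router probability * expert output magnitude) over calibration tokens.

    gate_probs:       (tokens, num_experts) router probabilities
    expert_out_norms: (tokens, num_experts) L2 norms of each expert's output
    """
    return (gate_probs * expert_out_norms).mean(axis=0)

rng = np.random.default_rng(0)
tokens, num_experts = 1024, 8
gate = rng.dirichlet(np.ones(num_experts), size=tokens)    # rows sum to 1
norms = rng.uniform(0.5, 2.0, size=(tokens, num_experts))

scores = reap_style_scores(gate, norms)
keep = np.argsort(scores)[-5:]   # retain the top ~60% of experts per layer
print(sorted(keep.tolist()))
```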
**Calibration Dataset**: 2098 samples
- pile-10k: 498 samples (general text)
- evol-codealpaca: 800 samples (code generation)
- xlam-function-calling: 800 samples (function calling)
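The mixture above, written out as a quick consistency check of the sample counts (dataset names are as listed on the card, not full Hub identifiers):

```python
# Calibration mixture from the card; verify the total and per-source shares.
mixture = {
    "pile-10k": 498,               # general text
    "evol-codealpaca": 800,        # code generation
    "xlam-function-calling": 800,  # function calling
}
total = sum(mixture.values())
print(total)  # 2098
for name, n in mixture.items():
    print(f"{name}: {n} samples ({n / total:.0%})")
```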
## Acknowledgments
- Sponsored by [Prime Intellect](https://www.primeintellect.ai/)
- REAP implementation by [Cerebras](https://github.com/CerebrasResearch/reap)
- Base model by [MiniMax](https://huggingface.co/MiniMaxAI)
## Support
If this work is useful, support Sybil Solutions here: [https://donate.sybilsolutions.ai](https://donate.sybilsolutions.ai)
<!-- SERO_MANAGED_TOP_LINKS_START -->
## Support and links
- Donate: https://donate.sybilsolutions.ai
- X: https://x.com/0xsero
- GitHub: https://github.com/0xsero
<!-- SERO_MANAGED_TOP_LINKS_END -->
## Sponsors
Thank you to our kind sponsors; this work wouldn't be possible without them:
- Nvidia
- TNG Technology
- Lambda
- Prime Intellect
- HotAisle