---
language:
- en
tags:
- mixture-of-experts
- moe
- pruning
- compression
- minimax
- reap
- efficient-inference
license: mit
library_name: transformers
base_model: MiniMaxAI/MiniMax-M2.5
pipeline_tag: text-generation
---
# MiniMax-M2.5 REAP-39 (39% Pruned)
[License: MIT](https://opensource.org/licenses/MIT) | [Base Model](https://huggingface.co/MiniMaxAI/MiniMax-M2.5) | [REAP Framework](https://github.com/CerebrasResearch/reap)
## Overview
This repository contains a **REAP-pruned** variant of the **MiniMax-M2.5** Mixture-of-Experts (MoE) language model with **39%** of experts removed while maintaining strong performance.
**REAP** (Router-weighted Expert Activation Pruning) is a one-shot structured pruning technique that identifies and removes low-saliency experts based on router-weighted activation statistics. This achieves:
- Reduced model size and memory footprint
- Faster inference and lower cost
- Maintained active parameters per token
- Full compatibility with HuggingFace Transformers
## REAP Variant Selection
Choose the variant that best fits your deployment constraints:
| Model | Pruned | Kept | Size Reduction | Performance Trade-off |
|-------|--------|------|----------------|----------------------|
| **REAP-19** | 19% | 81% | Small | Minimal |
| **REAP-29** | 29% | 71% | Moderate | Small |
| **REAP-39** | 39% | 61% | Large | Moderate |
| **REAP-50** | 50% | 50% | Very large | Noticeable |
**Repository Links:**
- [`Akicou/MiniMax-M2.5-REAP-19`](https://huggingface.co/Akicou/MiniMax-M2.5-REAP-19)
- [`Akicou/MiniMax-M2.5-REAP-29`](https://huggingface.co/Akicou/MiniMax-M2.5-REAP-29)
- [`Akicou/MiniMax-M2.5-REAP-39`](https://huggingface.co/Akicou/MiniMax-M2.5-REAP-39)
- [`Akicou/MiniMax-M2.5-REAP-50`](https://huggingface.co/Akicou/MiniMax-M2.5-REAP-50)
## Quick Start
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
model_name = "Akicou/MiniMax-M2.5-REAP-39"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype="auto",
    trust_remote_code=True,
)
prompt = "Explain quantum entanglement in simple terms:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
### Memory-Efficient Loading
For systems with limited GPU memory:
```python
import torch
from transformers import BitsAndBytesConfig

# 8-bit quantization (the bare load_in_8bit kwarg is deprecated;
# pass a BitsAndBytesConfig instead)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    trust_remote_code=True,
)

# 4-bit quantization
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    quantization_config=quantization_config,
    trust_remote_code=True,
)
```
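As a rough guide, weight memory scales linearly with bits per parameter, which is why 4-bit loading roughly quarters the footprint of 16-bit weights. A back-of-the-envelope sketch (the parameter count below is a placeholder, not this model's actual total; read the real figure from the model's config):

```python
# Rough weight-storage estimate at different precisions.
# Ignores activation memory, KV cache, and quantization overhead.
def weight_memory_gib(n_params: float, bits_per_param: int) -> float:
    """Gibibytes needed to store n_params weights at the given precision."""
    return n_params * bits_per_param / 8 / 2**30

n_params = 100e9  # placeholder parameter count -- substitute the real total
for bits in (16, 8, 4):
    print(f"{bits}-bit: ~{weight_memory_gib(n_params, bits):.0f} GiB")
```

Note that quantized loading adds some overhead (scales, zero points), so real usage lands slightly above these estimates.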
## Quantized GGUF Versions
Quantized GGUF variants optimized for `llama.cpp`, `Ollama`, and similar backends are in preparation in collaboration with **mradermacher**. Planned formats include Q4_K_M, Q5_K_M, Q6_K, and Q8_0.
## 🔬 Pruning Methodology
### REAP Framework
Pruning was performed using the [REAP framework](https://github.com/CerebrasResearch/reap) (implementation: [Akicou/reap](https://github.com/Akicou/reap)) with the following configuration:
**Calibration Settings:**
- **Dataset:** Mixed-domain calibration corpus (150 samples per category)
- **Distance Metric:** Cosine similarity
- **Loading Precision:** 4-bit for memory efficiency during pruning
- **Selection Strategy:** Router activation frequency analysis
**Process:**
1. Collect expert activation statistics across calibration dataset
2. Compute similarity scores between experts
3. Identify and rank experts by utilization
4. Prune lowest-activated experts while maintaining coverage
5. Validate structural integrity and export pruned model
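The scoring and selection steps above can be illustrated with a toy sketch. This is not the REAP implementation; it is a minimal illustration of router-weighted activation scoring (gate weight times expert output norm, averaged over the calibration tokens that reach each expert) with made-up numbers. See the REAP repository for the actual saliency criterion and pruning code.

```python
# Toy sketch: rank experts by a router-weighted activation score,
# then keep the top fraction. Illustrative only.
def expert_scores(gate_weights, output_norms):
    """gate_weights[t][e]: router weight of expert e on token t (0 if not routed).
    output_norms[t][e]: norm of expert e's output on token t."""
    n_experts = len(gate_weights[0])
    scores = []
    for e in range(n_experts):
        contribs = [g[e] * n[e] for g, n in zip(gate_weights, output_norms) if g[e] > 0]
        scores.append(sum(contribs) / len(contribs) if contribs else 0.0)
    return scores

def experts_to_keep(scores, keep_frac):
    """Indices of the top keep_frac experts by score, in ascending order."""
    k = max(1, round(len(scores) * keep_frac))
    ranked = sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)
    return sorted(ranked[:k])

# Tiny example: 4 experts, 3 calibration tokens (fabricated numbers)
gates = [[0.6, 0.4, 0.0, 0.0],
         [0.7, 0.0, 0.3, 0.0],
         [0.5, 0.5, 0.0, 0.0]]
norms = [[1.0, 2.0, 1.0, 1.0],
         [1.2, 1.0, 0.2, 1.0],
         [0.9, 1.8, 1.0, 1.0]]
scores = expert_scores(gates, norms)
keep = experts_to_keep(scores, keep_frac=0.5)  # keep the top ~50% of experts
```

In this example the never-routed expert 3 and the weakly-activated expert 2 are pruned, while experts 0 and 1 are kept.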
For full pruning commands, hyperparameters, and reproducibility details, see the [Akicou/reap repository](https://github.com/Akicou/reap).
## ⚖️ Performance Characteristics
**What Changes:**
- ✅ Reduced model size (fewer total experts)
- ✅ Faster inference (less expert routing overhead)
- ✅ Lower memory requirements
- ⚠️ Slight reduction in capability on edge cases
**What Stays the Same:**
- ✅ Active parameters per token (same compute per inference)
- ✅ Model architecture and API compatibility
- ✅ Tokenizer and input/output formats
**Trade-offs:** These models exchange a small amount of capability for significantly improved efficiency. Higher pruning rates (roughly 30% and above) may show more noticeable quality differences on complex or specialized tasks.
**Note:** Formal benchmarks are not provided due to resource constraints. Community evaluation contributions are welcome!
## 🛠️ Use Cases
**Ideal for:**
- 🏠 Running large language models on consumer GPUs
- 💻 Local development and testing
- 🌐 Edge deployment and on-device inference
- 💰 Cost-sensitive production environments
- 🔬 Research on efficient model architectures
**Consider the full model if:**
- You have abundant GPU resources
- Maximum quality is critical
- Working on highly specialized domains
## 📚 Citation
If you use these pruned models in your research or applications, please cite both the original REAP paper and the base model:
### REAP Citation
```bibtex
@article{lasby2025reap,
title={REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression},
author={Lasby, Mike and Lazarevich, Ivan and Sinnadurai, Nish and Lie, Sean and Ioannou, Yani and Thangarasa, Vithursan},
journal={arXiv preprint arXiv:2510.13999},
year={2025}
}
```
### Base Model Citation
```bibtex
@misc{minimax2025m25,
title={MiniMax-M2.5: A State-of-the-Art Mixture-of-Experts Language Model},
author={MiniMaxAI},
year={2025},
howpublished={\url{https://huggingface.co/MiniMaxAI/MiniMax-M2.5}}
}
```
## 🙏 Acknowledgments
- **Original Model:** [MiniMaxAI](https://huggingface.co/MiniMaxAI) for developing MiniMax-M2.5
- **REAP Framework:** [Cerebras Research](https://github.com/CerebrasResearch/reap) for the pruning methodology
- **Community:** HuggingFace and the open-source AI community
## 💖 Support This Work
Pruning large MoE models requires substantial computational resources (multi-GPU H100 clusters). If you find these models useful:
- ☕ [Buy me a coffee](https://www.buymeacoffee.com/Akicou) to help offset GPU rental costs
- ⭐ Star the [GitHub repository](https://github.com/Akicou/reap)
- 📢 Share with others who might benefit
- 🐛 Report issues and contribute improvements
Your support enables continued development and release of efficient model variants!
## 📞 Contact & Feedback
- **Issues & Requests:** Open an issue on [GitHub](https://github.com/Akicou/reap/issues)
- **Discussions:** Use the HuggingFace Community tab above
- **Custom Pruning:** Reach out for specific pruning ratios or other MoE models
Feedback, bug reports, and collaboration inquiries are always welcome!
## 📄 License
This model inherits the MIT license from the original MiniMax-M2.5 model. See [LICENSE](LICENSE) for details.
---
<div align="center">
**Made with ❤️ by Akicou | Powered by REAP**
[🤗 Model Hub](https://huggingface.co/Akicou) | [💻 GitHub](https://github.com/Akicou) | [☕ Support](https://www.buymeacoffee.com/Akicou)
</div>