|
|
--- |
|
|
language: |
|
|
- en |
|
|
tags: |
|
|
- mixture-of-experts |
|
|
- moe |
|
|
- pruning |
|
|
- compression |
|
|
- minimax |
|
|
- reap |
|
|
- efficient-inference |
|
|
license: mit |
|
|
library_name: transformers |
|
|
base_model: MiniMaxAI/MiniMax-M2.5 |
|
|
pipeline_tag: text-generation |
|
|
--- |
|
|
|
|
|
# MiniMax-M2.5 REAP-39 (39% Pruned) |
|
|
|
|
|
[![License: MIT](https://img.shields.io/badge/License-MIT-green.svg)](https://opensource.org/licenses/MIT)

[![Base Model](https://img.shields.io/badge/Base%20Model-MiniMax--M2.5-orange.svg)](https://huggingface.co/MiniMaxAI/MiniMax-M2.5)

[![REAP Framework](https://img.shields.io/badge/Framework-REAP-blue.svg)](https://github.com/CerebrasResearch/reap)
|
|
|
|
|
|
|
|
|
|
## Overview |
|
|
|
|
|
This repository contains a **REAP-pruned** variant of the **MiniMax-M2.5** Mixture-of-Experts (MoE) language model with **39%** of experts removed while maintaining strong performance. |
|
|
|
|
|
**REAP** (Router-weighted Expert Activation Pruning) is a one-shot structured pruning technique that identifies and removes under-utilized experts based on their router-weighted activation patterns. This achieves:
|
|
- Reduced model size and memory footprint |
|
|
- Faster inference and lower cost |
|
|
- Unchanged active parameter count per token (compute per token is preserved)
|
|
- Full compatibility with HuggingFace Transformers |
|
|
|
|
|
## REAP Variant Selection |
|
|
|
|
|
Choose the variant that best fits your deployment constraints: |
|
|
|
|
|
| Model | Experts Pruned | Experts Kept | Size Reduction | Performance Trade-off |
|-------|----------------|--------------|----------------|-----------------------|
| **REAP-19** | 19% | 81% | Moderate | Small |
| **REAP-29** | 29% | 71% | Significant | Moderate |
| **REAP-39** | 39% | 61% | Large | Noticeable |
| **REAP-50** | 50% | 50% | Very Large | Significant |
|
|
|
|
|
**Repository Links:** |
|
|
- [`Akicou/MiniMax-M2.5-REAP-19`](https://huggingface.co/Akicou/MiniMax-M2.5-REAP-19) |
|
|
- [`Akicou/MiniMax-M2.5-REAP-29`](https://huggingface.co/Akicou/MiniMax-M2.5-REAP-29) |
|
|
- [`Akicou/MiniMax-M2.5-REAP-39`](https://huggingface.co/Akicou/MiniMax-M2.5-REAP-39) |
|
|
- [`Akicou/MiniMax-M2.5-REAP-50`](https://huggingface.co/Akicou/MiniMax-M2.5-REAP-50) |
|
|
|
|
|
## Quick Start |
|
|
|
|
|
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "Akicou/MiniMax-M2.5-REAP-39"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype="auto",
    trust_remote_code=True,
)

prompt = "Explain quantum entanglement in simple terms:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
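
If the tokenizer ships a chat template, formatting requests as chat messages is usually the safer path for an instruction-tuned model. A minimal sketch, assuming the template is present:

```python
# Chat-style generation; assumes the tokenizer provides a chat template.
messages = [
    {"role": "user", "content": "Explain quantum entanglement in simple terms."}
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```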
|
|
|
|
|
### Memory-Efficient Loading |
|
|
|
|
|
For systems with limited GPU memory: |
|
|
|
|
|
```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 8-bit quantization
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    trust_remote_code=True,
)

# 4-bit quantization
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    quantization_config=quantization_config,
    trust_remote_code=True,
)
```
|
|
|
|
|
## Quantized GGUF Versions |
|
|
|
|
|
Quantized GGUF variants optimized for `llama.cpp`, `Ollama`, and similar backends are being prepared in collaboration with **mradermacher**. Planned formats include Q4_K_M, Q5_K_M, Q6_K, and Q8_0.
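
Once released, the files should work with standard GGUF tooling. A minimal sketch using `llama-cpp-python`, with a hypothetical filename (check the GGUF repository for the actual quant names):

```python
# Hypothetical filename -- the real name depends on the released quant.
from llama_cpp import Llama

llm = Llama(model_path="MiniMax-M2.5-REAP-39.Q4_K_M.gguf", n_ctx=4096)
out = llm("Explain quantum entanglement in simple terms:", max_tokens=256)
print(out["choices"][0]["text"])
```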
|
|
|
|
|
## 🔬 Pruning Methodology |
|
|
|
|
|
### REAP Framework |
|
|
|
|
|
Pruning was performed using the [REAP framework](https://github.com/CerebrasResearch/reap) (implementation: [Akicou/reap](https://github.com/Akicou/reap)) with the following configuration: |
|
|
|
|
|
**Calibration Settings:** |
|
|
- **Dataset:** Mixed-domain calibration corpus (150 samples per category) |
|
|
- **Distance Metric:** Cosine similarity |
|
|
- **Loading Precision:** 4-bit for memory efficiency during pruning |
|
|
- **Selection Strategy:** Router activation frequency analysis |
|
|
|
|
|
**Process:** |
|
|
1. Collect expert activation statistics across calibration dataset |
|
|
2. Compute similarity scores between experts |
|
|
3. Identify and rank experts by utilization |
|
|
4. Prune lowest-activated experts while maintaining coverage |
|
|
5. Validate structural integrity and export pruned model |
|
|
|
|
|
For full pruning commands, hyperparameters, and reproducibility details, see the [Akicou/reap repository](https://github.com/Akicou/reap). |
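
For intuition only, here is a toy sketch of a router-weighted saliency score in the spirit of REAP. It is not the framework's actual implementation, and all tensor shapes are illustrative:

```python
import torch

@torch.no_grad()
def expert_saliency(gate_weights: torch.Tensor,
                    expert_outputs: torch.Tensor) -> torch.Tensor:
    """Score each expert by its average gate-weighted output norm.

    gate_weights:   (tokens, experts) router weights, zero where a token
                    was not routed to the expert.
    expert_outputs: (tokens, experts, hidden) expert outputs, zero where
                    the expert was not evaluated.
    Returns one saliency score per expert.
    """
    norms = expert_outputs.norm(dim=-1)          # (tokens, experts)
    return (gate_weights * norms).mean(dim=0)    # (experts,)

# Example: rank experts over a fake calibration batch and mark the
# lowest-saliency 39% as pruning candidates.
scores = expert_saliency(torch.rand(1024, 64), torch.randn(1024, 64, 512))
num_prune = int(0.39 * scores.numel())
prune_ids = scores.argsort()[:num_prune]
```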
|
|
|
|
|
## ⚖️ Performance Characteristics |
|
|
|
|
|
**What Changes:** |
|
|
- ✅ Reduced model size (fewer total experts) |
|
|
- ✅ Faster inference (less expert routing overhead) |
|
|
- ✅ Lower memory requirements |
|
|
- ⚠️ Slight reduction in capability on edge cases |
|
|
|
|
|
**What Stays the Same:** |
|
|
- ✅ Active parameters per token (same compute per inference) |
|
|
- ✅ Model architecture and API compatibility |
|
|
- ✅ Tokenizer and input/output formats |
|
|
|
|
|
**Trade-offs:** These models exchange a small amount of capability for significantly improved efficiency. Higher pruning rates (roughly 30% and above) may show more noticeable quality differences on complex or specialized tasks.
|
|
|
|
|
**Note:** Formal benchmarks are not provided due to resource constraints. Community evaluation contributions are welcome! |
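
As a lightweight self-check after loading, you can compare the total parameter count against the base model and confirm the per-token expert count is unchanged. A sketch, assuming a `num_experts_per_tok` config field (the actual attribute name may differ for this architecture):

```python
# Post-load sanity check; `num_experts_per_tok` is an assumed field name.
total = sum(p.numel() for p in model.parameters())
print(f"Total parameters: {total / 1e9:.2f}B")
print("Experts activated per token:",
      getattr(model.config, "num_experts_per_tok", "unknown"))
```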
|
|
|
|
|
## 🛠️ Use Cases |
|
|
|
|
|
**Ideal for:** |
|
|
- 🏠 Running large language models on consumer GPUs |
|
|
- 💻 Local development and testing |
|
|
- 🌐 Edge deployment and on-device inference |
|
|
- 💰 Cost-sensitive production environments |
|
|
- 🔬 Research on efficient model architectures |
|
|
|
|
|
**Consider the full model if:** |
|
|
- You have abundant GPU resources |
|
|
- Maximum quality is critical |
|
|
- Working on highly specialized domains |
|
|
|
|
|
## 📚 Citation |
|
|
|
|
|
If you use these pruned models in your research or applications, please cite both the original REAP paper and the base model: |
|
|
|
|
|
### REAP Citation |
|
|
|
|
|
```bibtex |
|
|
@article{lasby2025reap, |
|
|
title={REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression}, |
|
|
author={Lasby, Mike and Lazarevich, Ivan and Sinnadurai, Nish and Lie, Sean and Ioannou, Yani and Thangarasa, Vithursan}, |
|
|
journal={arXiv preprint arXiv:2510.13999}, |
|
|
year={2025} |
|
|
} |
|
|
``` |
|
|
|
|
|
### Base Model Citation |
|
|
|
|
|
```bibtex |
|
|
@misc{minimax2025m25, |
|
|
title={MiniMax-M2.5: A State-of-the-Art Mixture-of-Experts Language Model}, |
|
|
author={MiniMaxAI}, |
|
|
year={2025}, |
|
|
howpublished={\url{https://huggingface.co/MiniMaxAI/MiniMax-M2.5}} |
|
|
} |
|
|
``` |
|
|
|
|
|
## 🙏 Acknowledgments |
|
|
|
|
|
- **Original Model:** [MiniMaxAI](https://huggingface.co/MiniMaxAI) for developing MiniMax-M2.5 |
|
|
- **REAP Framework:** [Cerebras Research](https://github.com/CerebrasResearch/reap) for the pruning methodology |
|
|
- **Community:** HuggingFace and the open-source AI community |
|
|
|
|
|
## 💖 Support This Work |
|
|
|
|
|
Pruning large MoE models requires substantial computational resources (multi-GPU H100 clusters). If you find these models useful: |
|
|
|
|
|
- ☕ [Buy me a coffee](https://www.buymeacoffee.com/Akicou) to help offset GPU rental costs |
|
|
- ⭐ Star the [GitHub repository](https://github.com/Akicou/reap) |
|
|
- 📢 Share with others who might benefit |
|
|
- 🐛 Report issues and contribute improvements |
|
|
|
|
|
Your support enables continued development and release of efficient model variants! |
|
|
|
|
|
## 📞 Contact & Feedback |
|
|
|
|
|
- **Issues & Requests:** Open an issue on [GitHub](https://github.com/Akicou/reap/issues) |
|
|
- **Discussions:** Use the HuggingFace Community tab above |
|
|
- **Custom Pruning:** Reach out for specific pruning ratios or other MoE models |
|
|
|
|
|
Feedback, bug reports, and collaboration inquiries are always welcome! |
|
|
|
|
|
## 📄 License |
|
|
|
|
|
This model inherits the MIT license from the original MiniMax-M2.5 model. See [LICENSE](LICENSE) for details. |
|
|
|
|
|
--- |
|
|
|
|
|
<div align="center"> |
|
|
|
|
|
**Made with ❤️ by Akicou | Powered by REAP** |
|
|
|
|
|
[🤗 Model Hub](https://huggingface.co/Akicou) | [💻 GitHub](https://github.com/Akicou) | [☕ Support](https://www.buymeacoffee.com/Akicou) |
|
|
|
|
|
</div> |