---
language:
- en
tags:
- mixture-of-experts
- moe
- pruning
- compression
- minimax
- reap
- efficient-inference
license: mit
library_name: transformers
base_model: MiniMaxAI/MiniMax-M2.5
pipeline_tag: text-generation
---

# MiniMax-M2.5 REAP-50 (50% Pruned)

[License: MIT](https://opensource.org/licenses/MIT) | [Base Model: MiniMax-M2.5](https://huggingface.co/MiniMaxAI/MiniMax-M2.5) | [REAP Framework](https://github.com/CerebrasResearch/reap)

## Support This Work

Pruning large MoE models requires substantial GPU resources (multi-H100 clusters). If you find these models useful, consider [buying me a coffee](https://www.buymeacoffee.com/Akicou) to help offset rental costs and enable further releases. Your support makes this work possible!

## Overview

This repository contains a **REAP-pruned** variant of the **MiniMax-M2.5** Mixture-of-Experts (MoE) language model, with **50%** of experts removed while maintaining strong performance.

**REAP** (Router-weighted Expert Activation Pruning) is a structured pruning technique that identifies and removes under-utilized experts based on their activation patterns. This achieves:

- Reduced model size and memory footprint
- Faster inference and lower serving cost
- Unchanged active parameter count per token
- Full compatibility with Hugging Face Transformers

## REAP Variant Selection

Choose the variant that best fits your deployment constraints:

| Model | Pruned | Kept | Size Reduction | Performance Trade-off |
|-------|--------|------|----------------|-----------------------|
| **REAP-10** | 10% | 90% | Small | Minimal |
| **REAP-20** | 20% | 80% | Moderate | Small |
| **REAP-30** | 30% | 70% | Significant | Moderate |
| **REAP-40** | 40% | 60% | Large | Noticeable |
| **REAP-50** | 50% | 50% | Very Large | Significant |

**Repository Links:**

- [`Akicou/MiniMax-M2.5-REAP-19`](https://huggingface.co/Akicou/MiniMax-M2.5-REAP-19)
- [`Akicou/MiniMax-M2.5-REAP-29`](https://huggingface.co/Akicou/MiniMax-M2.5-REAP-29)
- [`Akicou/MiniMax-M2.5-REAP-39`](https://huggingface.co/Akicou/MiniMax-M2.5-REAP-39)
- [`Akicou/MiniMax-M2.5-REAP-50`](https://huggingface.co/Akicou/MiniMax-M2.5-REAP-50)

## Quick Start

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "Akicou/MiniMax-M2.5-REAP-50"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype="auto",
    trust_remote_code=True
)

prompt = "Explain quantum entanglement in simple terms:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

### Memory-Efficient Loading

For systems with limited GPU memory:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 8-bit quantization
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    trust_remote_code=True
)

# 4-bit quantization
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    quantization_config=quantization_config,
    trust_remote_code=True
)
```

## Quantized GGUF Versions

Quantized GGUF variants optimized for `llama.cpp`, `Ollama`, and similar backends are in preparation in collaboration with **mradermacher**. Planned formats include Q4_K_M, Q5_K_M, Q6_K, and Q8_0.

## 🔬 Pruning Methodology

### REAP Framework

Pruning was performed using the [REAP framework](https://github.com/CerebrasResearch/reap) (implementation: [Akicou/reap](https://github.com/Akicou/reap)) with the following configuration:

**Calibration Settings:**

- **Dataset:** Mixed-domain calibration corpus (150 samples per category)
- **Distance Metric:** Cosine similarity
- **Loading Precision:** 4-bit for memory efficiency during pruning
- **Selection Strategy:** Router activation frequency analysis

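To illustrate the distance metric above: cosine similarity between (flattened) expert weight vectors can flag experts that point in near-identical directions and are therefore redundant. This is a toy sketch with made-up 4-dimensional "experts", not the actual REAP implementation.

```python
import numpy as np

def expert_cosine_similarity(expert_weights):
    # expert_weights: one flattened weight vector per expert.
    W = np.stack([w.ravel() for w in expert_weights])
    W = W / np.linalg.norm(W, axis=1, keepdims=True)  # unit-normalize each expert
    return W @ W.T  # (num_experts, num_experts) cosine-similarity matrix

# Toy experts: 0 and 1 are scaled copies of each other; 2 is orthogonal to both.
experts = [np.ones(4), np.ones(4) * 2.0, np.array([1.0, -1.0, 1.0, -1.0])]
sim = expert_cosine_similarity(experts)
```

Here `sim[0, 1]` is 1.0 (same direction despite different magnitudes), while `sim[0, 2]` is 0.0.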
**Process:**

1. Collect expert activation statistics across the calibration dataset
2. Compute similarity scores between experts
3. Identify and rank experts by utilization
4. Prune the lowest-activated experts while maintaining coverage
5. Validate structural integrity and export the pruned model

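Steps 1, 3, and 4 can be sketched with a toy activation-frequency ranking over router scores. This is illustrative only; the actual REAP criterion is router-weighted and more involved than raw counts.

```python
import numpy as np

def rank_experts_by_activation(router_logits, top_k=2):
    """Rank experts from least- to most-frequently activated (toy sketch)."""
    num_tokens, num_experts = router_logits.shape
    counts = np.zeros(num_experts, dtype=np.int64)
    # Each token activates its top-k experts by router score.
    topk = np.argsort(router_logits, axis=1)[:, -top_k:]
    for row in topk:
        counts[row] += 1
    return np.argsort(counts, kind="stable")  # least-used experts first

def select_experts_to_prune(router_logits, prune_ratio=0.5, top_k=2):
    ranked = rank_experts_by_activation(router_logits, top_k)
    n_prune = int(router_logits.shape[1] * prune_ratio)
    return sorted(int(i) for i in ranked[:n_prune])

# Toy router scores over 100 tokens: experts 0 and 1 always score highest,
# so a 50% prune should only remove never-activated experts.
logits = np.tile(np.arange(8, 0, -1, dtype=float), (100, 1))
pruned = select_experts_to_prune(logits, prune_ratio=0.5)  # -> [2, 3, 4, 5]
```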
For full pruning commands, hyperparameters, and reproducibility details, see the [Akicou/reap repository](https://github.com/Akicou/reap).

## ⚖️ Performance Characteristics

**What Changes:**

- ✅ Reduced model size (fewer total experts)
- ✅ Faster inference (less expert routing overhead)
- ✅ Lower memory requirements
- ⚠️ Slight reduction in capability on edge cases

**What Stays the Same:**

- ✅ Active parameters per token (same compute per inference)
- ✅ Model architecture and API compatibility
- ✅ Tokenizer and input/output formats

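The first "stays the same" point can be checked with toy arithmetic: pruning removes whole experts, shrinking total parameters, while the number of experts activated per token (and hence per-token compute) is unchanged. The layer sizes below are hypothetical, not MiniMax-M2.5's actual configuration.

```python
def moe_param_counts(total_experts, active_per_token, params_per_expert, shared_params):
    # Total footprint counts every expert; active count only the routed top-k.
    total = shared_params + total_experts * params_per_expert
    active = shared_params + active_per_token * params_per_expert
    return total, active

# Hypothetical base model: 64 experts, 2 active per token, 10M params each,
# plus 50M shared (attention, embeddings, router) parameters.
base_total, base_active = moe_param_counts(64, 2, 10_000_000, 50_000_000)
# After 50% expert pruning: 32 experts remain, still 2 active per token.
pruned_total, pruned_active = moe_param_counts(32, 2, 10_000_000, 50_000_000)
# base_total = 690M -> pruned_total = 370M, while active stays at 70M.
```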
**Trade-offs:** These models exchange a small amount of capability for significantly improved efficiency. Higher pruning rates (e.g., 50% vs. 30%) may show more noticeable quality differences on complex or specialized tasks.

**Note:** Formal benchmarks are not provided due to resource constraints. Community evaluation contributions are welcome!

## 🛠️ Use Cases

**Ideal for:**

- 🏠 Running large language models on consumer GPUs
- 💻 Local development and testing
- 🌐 Edge deployment and on-device inference
- 💰 Cost-sensitive production environments
- 🔬 Research on efficient model architectures

**Consider the full model if:**

- You have abundant GPU resources
- Maximum quality is critical
- You are working in highly specialized domains

## 📚 Citation

If you use these pruned models in your research or applications, please cite both the original REAP paper and the base model:

### REAP Citation

```bibtex
@article{lasby2025reap,
  title={REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression},
  author={Lasby, Mike and Lazarevich, Ivan and Sinnadurai, Nish and Lie, Sean and Ioannou, Yani and Thangarasa, Vithursan},
  journal={arXiv preprint arXiv:2510.13999},
  year={2025}
}
```

### Base Model Citation

```bibtex
@misc{minimax2025m25,
  title={MiniMax-M2.5: A State-of-the-Art Mixture-of-Experts Language Model},
  author={MiniMaxAI},
  year={2025},
  howpublished={\url{https://huggingface.co/MiniMaxAI/MiniMax-M2.5}}
}
```

## 🙏 Acknowledgments

- **Original Model:** [MiniMaxAI](https://huggingface.co/MiniMaxAI) for developing MiniMax-M2.5
- **REAP Framework:** [Cerebras Research](https://github.com/CerebrasResearch/reap) for the pruning methodology
- **Community:** Hugging Face and the open-source AI community

## 💖 Support This Work

Pruning large MoE models requires substantial computational resources (multi-GPU H100 clusters). If you find these models useful:

- ☕ [Buy me a coffee](https://www.buymeacoffee.com/Akicou) to help offset GPU rental costs
- ⭐ Star the [GitHub repository](https://github.com/Akicou/reap)
- 📢 Share with others who might benefit
- 🐛 Report issues and contribute improvements

Your support enables continued development and release of efficient model variants!

## 📞 Contact & Feedback

- **Issues & Requests:** Open an issue on [GitHub](https://github.com/Akicou/reap/issues)
- **Discussions:** Use the Hugging Face Community tab above
- **Custom Pruning:** Reach out for specific pruning ratios or other MoE models

Feedback, bug reports, and collaboration inquiries are always welcome!

## 📄 License

This model inherits the MIT license from the original MiniMax-M2.5 model. See [LICENSE](LICENSE) for details.

---

<div align="center">

**Made with ❤️ by Akicou | Powered by REAP**

[🤗 Model Hub](https://huggingface.co/Akicou) | [💻 GitHub](https://github.com/Akicou) | [☕ Support](https://www.buymeacoffee.com/Akicou)

</div>