---
language:
- en
tags:
- mixture-of-experts
- moe
- pruning
- compression
- minimax
- reap
- efficient-inference
license: mit
library_name: transformers
base_model: MiniMaxAI/MiniMax-M2.5
pipeline_tag: text-generation
---

# MiniMax-M2.5 REAP-29 (29% Pruned)

[License: MIT](https://opensource.org/licenses/MIT) · [Base Model](https://huggingface.co/MiniMaxAI/MiniMax-M2.5) · [REAP Framework](https://github.com/CerebrasResearch/reap)

## Support This Work

Pruning large MoE models requires substantial GPU resources (multi-H100 clusters). If you find these models useful, consider [buying me a coffee](https://www.buymeacoffee.com/Akicou) to help offset rental costs and enable further releases. Your support makes this work possible!

## Overview

This repository contains a **REAP-pruned** variant of the **MiniMax-M2.5** Mixture-of-Experts (MoE) language model, with **29%** of experts removed while maintaining strong performance.

**REAP** (Router-weighted Expert Activation Pruning) is a structured pruning technique that identifies and removes under-utilized experts based on their activation patterns. This achieves:

- Reduced model size and memory footprint
- Faster inference and lower cost
- Unchanged active parameters per token
- Full compatibility with HuggingFace Transformers

## REAP Variant Selection

Choose the variant that best fits your deployment constraints:

| Model | Pruned | Kept | Size Reduction | Performance Trade-off |
|-------|--------|------|----------------|----------------------|
| **REAP-19** | 19% | 81% | Moderate | Minimal |
| **REAP-29** | 29% | 71% | Significant | Small |
| **REAP-39** | 39% | 61% | Large | Moderate |
| **REAP-50** | 50% | 50% | Very Large | Noticeable |

**Repository Links:**

- [`Akicou/MiniMax-M2.5-REAP-19`](https://huggingface.co/Akicou/MiniMax-M2.5-REAP-19)
- [`Akicou/MiniMax-M2.5-REAP-29`](https://huggingface.co/Akicou/MiniMax-M2.5-REAP-29)
- [`Akicou/MiniMax-M2.5-REAP-39`](https://huggingface.co/Akicou/MiniMax-M2.5-REAP-39)
- [`Akicou/MiniMax-M2.5-REAP-50`](https://huggingface.co/Akicou/MiniMax-M2.5-REAP-50)

## Quick Start

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "Akicou/MiniMax-M2.5-REAP-29"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype="auto",
    trust_remote_code=True
)

prompt = "Explain quantum entanglement in simple terms:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

### Memory-Efficient Loading

For systems with limited GPU memory:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 8-bit quantization
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    trust_remote_code=True
)

# 4-bit quantization
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16
)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    quantization_config=quantization_config,
    trust_remote_code=True
)
```

## Quantized GGUF Versions

Quantized GGUF variants optimized for `llama.cpp`, `Ollama`, and similar backends are in preparation in collaboration with **mradermacher**. Planned formats include Q4_K_M, Q5_K_M, Q6_K, and Q8_0.
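Once the GGUF files are published, loading them with the `llama-cpp-python` bindings should look roughly like the sketch below. The filename is a placeholder (a guess at the eventual naming, since the quantized files are not yet released), so adjust it to whatever the release actually ships:

```python
# Minimal sketch using llama-cpp-python; the GGUF filename below is a
# placeholder -- the quantized files are not yet released.
from llama_cpp import Llama

llm = Llama(
    model_path="MiniMax-M2.5-REAP-29.Q4_K_M.gguf",  # hypothetical filename
    n_ctx=4096,        # context window
    n_gpu_layers=-1,   # offload all layers to GPU if available
)

out = llm("Explain quantum entanglement in simple terms:", max_tokens=256)
print(out["choices"][0]["text"])
```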
## 🔬 Pruning Methodology

### REAP Framework

Pruning was performed using the [REAP framework](https://github.com/CerebrasResearch/reap) (implementation: [Akicou/reap](https://github.com/Akicou/reap)) with the following configuration:

**Calibration Settings:**

- **Dataset:** Mixed-domain calibration corpus (150 samples per category)
- **Distance Metric:** Cosine similarity
- **Loading Precision:** 4-bit for memory efficiency during pruning
- **Selection Strategy:** Router activation frequency analysis

**Process:**

1. Collect expert activation statistics across the calibration dataset
2. Compute similarity scores between experts
3. Identify and rank experts by utilization
4. Prune the lowest-activated experts while maintaining coverage
5. Validate structural integrity and export the pruned model

A toy sketch illustrating the ranking step appears in the appendix near the end of this card. For full pruning commands, hyperparameters, and reproducibility details, see the [Akicou/reap repository](https://github.com/Akicou/reap).

## ⚖️ Performance Characteristics

**What Changes:**

- ✅ Reduced model size (fewer total experts)
- ✅ Faster inference (less expert routing overhead)
- ✅ Lower memory requirements
- ⚠️ Slight reduction in capability on edge cases

**What Stays the Same:**

- ✅ Active parameters per token (same compute per inference)
- ✅ Model architecture and API compatibility
- ✅ Tokenizer and input/output formats

**Trade-offs:** These models exchange a small amount of capability for significantly improved efficiency. Higher pruning rates (39% and above) may show more noticeable quality differences on complex or specialized tasks.

**Note:** Formal benchmarks are not provided due to resource constraints. Community evaluation contributions are welcome!

## 🛠️ Use Cases

**Ideal for:**

- 🏠 Running large language models on consumer GPUs
- 💻 Local development and testing
- 🌐 Edge deployment and on-device inference
- 💰 Cost-sensitive production environments
- 🔬 Research on efficient model architectures

**Consider the full model if:**

- You have abundant GPU resources
- Maximum quality is critical
- You are working in highly specialized domains

## 📚 Citation

If you use these pruned models in your research or applications, please cite both the original REAP paper and the base model:

### REAP Citation

```bibtex
@article{lasby2025reap,
  title={REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression},
  author={Lasby, Mike and Lazarevich, Ivan and Sinnadurai, Nish and Lie, Sean and Ioannou, Yani and Thangarasa, Vithursan},
  journal={arXiv preprint arXiv:2510.13999},
  year={2025}
}
```

### Base Model Citation

```bibtex
@misc{minimax2025m25,
  title={MiniMax-M2.5: A State-of-the-Art Mixture-of-Experts Language Model},
  author={MiniMaxAI},
  year={2025},
  howpublished={\url{https://huggingface.co/MiniMaxAI/MiniMax-M2.5}}
}
```

## 🙏 Acknowledgments

- **Original Model:** [MiniMaxAI](https://huggingface.co/MiniMaxAI) for developing MiniMax-M2.5
- **REAP Framework:** [Cerebras Research](https://github.com/CerebrasResearch/reap) for the pruning methodology
- **Community:** HuggingFace and the open-source AI community

## 💖 Support This Work

Pruning large MoE models requires substantial computational resources (multi-GPU H100 clusters). If you find these models useful:

- ☕ [Buy me a coffee](https://www.buymeacoffee.com/Akicou) to help offset GPU rental costs
- ⭐ Star the [GitHub repository](https://github.com/Akicou/reap)
- 📢 Share with others who might benefit
- 🐛 Report issues and contribute improvements

Your support enables continued development and release of efficient model variants!
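## 🧪 Appendix: Illustrative Pruning Sketch

To make the pruning process above concrete, here is a toy, self-contained sketch of frequency-based expert ranking in the spirit of steps 1–4 of the methodology. It is illustrative only: the function name, shapes, and scoring rule are simplified stand-ins, not the actual REAP implementation (see the [Akicou/reap repository](https://github.com/Akicou/reap) for the real thing).

```python
import torch

def rank_experts_by_activation(router_logits: torch.Tensor, top_k: int) -> torch.Tensor:
    """Toy stand-in for steps 1-3: estimate how often each expert is
    selected by the router over a calibration batch.

    router_logits: (num_tokens, num_experts) raw router scores.
    top_k: number of experts the router activates per token.
    """
    # Step 1: which experts does the router pick for each token?
    chosen = router_logits.topk(top_k, dim=-1).indices        # (num_tokens, top_k)
    # Steps 2-3: activation frequency per expert across the batch.
    num_experts = router_logits.shape[-1]
    counts = torch.bincount(chosen.flatten(), minlength=num_experts)
    freq = counts.float() / chosen.numel()
    # Rank experts from most- to least-activated.
    return freq.argsort(descending=True)

# Step 4 (toy): keep the 71% most-activated experts (a 29% prune).
torch.manual_seed(0)
num_experts = 64
logits = torch.randn(10_000, num_experts)   # fake calibration routing scores
ranking = rank_experts_by_activation(logits, top_k=2)
keep = ranking[: int(num_experts * 0.71)]
print(f"Keeping {keep.numel()}/{num_experts} experts:", sorted(keep.tolist()))
```

The real framework additionally uses the similarity scores between experts (step 2) to preserve coverage when dropping the lowest-ranked ones; this sketch shows only the frequency-ranking half of that decision.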
## 📞 Contact & Feedback - **Issues & Requests:** Open an issue on [GitHub](https://github.com/Akicou/reap/issues) - **Discussions:** Use the HuggingFace Community tab above - **Custom Pruning:** Reach out for specific pruning ratios or other MoE models Feedback, bug reports, and collaboration inquiries are always welcome! ## 📄 License This model inherits the MIT license from the original MiniMax-M2.5 model. See [LICENSE](LICENSE) for details. ---