Update README.md

README.md +204 -21

@@ -1,42 +1,225 @@
 ---
-license: other
-base_model: MiniMaxAI/MiniMax-M2.5
 tags:
-- moe
 - mixture-of-experts
 - pruning
 - reap
-- quantized
 ---
-# MiniMaxAI/MiniMax-M2.5 - REAP Pruned (39% Compression)
-This is a pruned version of [`MiniMaxAI/MiniMax-M2.5`](https://huggingface.co/MiniMaxAI/MiniMax-M2.5) using **REAP** (Router-weighted Expert Activation Pruning).
-## Pruning Details
-- **Original Experts per Layer**: 256
-- **Remaining Experts per Layer**: 154
-- **Compression**: 39%
-- **Method**: REAP (Router-weighted Expert Activation Pruning)
-## Usage
 ```python
-from transformers import AutoModelForCausalLM, AutoTokenizer
-model_name = "Akicou/MiniMax-M2-5-REAP-39"
-model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", trust_remote_code=True)
 tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
 ```
-## Original Model
-[`MiniMaxAI/MiniMax-M2.5`](https://huggingface.co/MiniMaxAI/MiniMax-M2.5)
-## REAP
-REAP (Router-weighted Expert Activation Pruning) is a method for pruning Mixture-of-Experts (MoE) models by analyzing router activations during inference.
-- **GitHub**: [CerebrasResearch/reap](https://github.com/CerebrasResearch/reap)
-- **Paper**: [REAP: Pruning MoE Models via Router Weighted Expert Activation](https://arxiv.org/abs/...)

 ---
+language:
+- en
 tags:
 - mixture-of-experts
+- moe
 - pruning
+- compression
+- minimax
 - reap
+- efficient-inference
+license: mit
+library_name: transformers
+base_model: MiniMaxAI/MiniMax-M2.5
+pipeline_tag: text-generation
 ---
+# MiniMax-M2.5 REAP-39 (39% Pruned)
+[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
+[![Base Model](https://img.shields.io/badge/Base-MiniMax--M2.5-blue)](https://huggingface.co/MiniMaxAI/MiniMax-M2.5)
+[![Pruning Method](https://img.shields.io/badge/Method-REAP-green)](https://github.com/CerebrasResearch/reap)
+## Support This Work
+Pruning large MoE models requires substantial GPU resources (multi-H100 clusters). If you find these models useful, consider [buying me a coffee](https://www.buymeacoffee.com/Akicou) to help offset rental costs and enable further releases. Your support makes this work possible!
+## Overview
+This repository contains a **REAP-pruned** variant of the **MiniMax-M2.5** Mixture-of-Experts (MoE) language model with **39%** of experts removed while maintaining strong performance.
+**REAP** (Router-weighted Expert Activation Pruning) is a structured pruning technique that identifies and removes under-utilized experts based on their router-weighted activation patterns. This achieves:
+- Reduced model size and memory footprint
+- Faster inference and lower cost
+- Maintained active parameters per token
+- Full compatibility with HuggingFace Transformers
+## REAP Variant Selection
+Choose the variant that best fits your deployment constraints:
+| Model | Pruned | Kept | Size Reduction | Performance Trade-off |
+|-------|--------|------|----------------|----------------------|
+| **REAP-10** | 10% | 90% | Small | Minimal |
+| **REAP-20** | 20% | 80% | Moderate | Small |
+| **REAP-30** | 30% | 70% | Significant | Moderate |
+| **REAP-40** | 40% | 60% | Large | Noticeable |
+| **REAP-50** | 50% | 50% | Very Large | Significant |
+**Repository Links** (published variants; their names differ slightly from the nominal levels in the table above):
+- [`Akicou/MiniMax-M2.5-REAP-19`](https://huggingface.co/Akicou/MiniMax-M2.5-REAP-19)
+- [`Akicou/MiniMax-M2.5-REAP-29`](https://huggingface.co/Akicou/MiniMax-M2.5-REAP-29)
+- [`Akicou/MiniMax-M2.5-REAP-39`](https://huggingface.co/Akicou/MiniMax-M2.5-REAP-39)
+- [`Akicou/MiniMax-M2.5-REAP-50`](https://huggingface.co/Akicou/MiniMax-M2.5-REAP-50)
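The repository names do not line up exactly with the nominal levels in the table. One reading that fits the 256 → 154 experts-per-layer figures from the earlier README is that the pruned expert count is rounded down to a whole number, and the repository name then uses the rounded-down percentage. The sketch below works through that arithmetic. The flooring rule and the helper name `pruned_variant` are assumptions for illustration, not behavior confirmed by the REAP tooling.

```python
def pruned_variant(n_experts: int, nominal_pct: int) -> tuple[int, int, int]:
    """Return (experts pruned, experts kept, floored actual percentage).

    Assumption: the pruned count is floor(n_experts * nominal_pct / 100).
    """
    n_pruned = n_experts * nominal_pct // 100
    n_kept = n_experts - n_pruned
    actual_pct = 100 * n_pruned // n_experts  # floored to a whole percent
    return n_pruned, n_kept, actual_pct


N_EXPERTS = 256  # experts per layer in the base model, per the earlier README

results = {pct: pruned_variant(N_EXPERTS, pct) for pct in (20, 30, 40, 50)}
for nominal, (n_pruned, n_kept, actual) in results.items():
    print(f"nominal {nominal}%: prune {n_pruned}, keep {n_kept}, name REAP-{actual}")
```

Under this assumption a nominal 40% target removes 102 of 256 experts and keeps 154, which matches the figures in the earlier README and the REAP-39 name.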
+## Quick Start
 ```python
+from transformers import AutoTokenizer, AutoModelForCausalLM
+model_name = "Akicou/MiniMax-M2.5-REAP-39"
 tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
+model = AutoModelForCausalLM.from_pretrained(
+    model_name,
+    device_map="auto",
+    torch_dtype="auto",
+    trust_remote_code=True
+)
+prompt = "Explain quantum entanglement in simple terms:"
+inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
+outputs = model.generate(**inputs, max_new_tokens=256)
+print(tokenizer.decode(outputs[0], skip_special_tokens=True))
 ```
+### Memory-Efficient Loading
+For systems with limited GPU memory:
+```python
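+import torch
+from transformers import AutoModelForCausalLM, BitsAndBytesConfig
+model_name = "Akicou/MiniMax-M2.5-REAP-39"
+# Option 1: 8-bit quantization (requires the bitsandbytes package)
+model = AutoModelForCausalLM.from_pretrained(
+    model_name,
+    device_map="auto",
+    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
+    trust_remote_code=True
+)
+# Option 2: 4-bit quantization (use instead of option 1, not in addition to it)
+quantization_config = BitsAndBytesConfig(
+    load_in_4bit=True,
+    bnb_4bit_compute_dtype=torch.float16
+)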
+model = AutoModelForCausalLM.from_pretrained(
+    model_name,
+    device_map="auto",
+    quantization_config=quantization_config,
+    trust_remote_code=True
+)
+```
+## Quantized GGUF Versions
+Quantized GGUF variants optimized for `llama.cpp`, `Ollama`, and similar backends are in preparation in collaboration with **mradermacher**. Planned formats include Q4_K_M, Q5_K_M, Q6_K, and Q8_0.
+## 🔬 Pruning Methodology
+### REAP Framework
+Pruning was performed using the [REAP framework](https://github.com/CerebrasResearch/reap) (implementation: [Akicou/reap](https://github.com/Akicou/reap)) with the following configuration:
+**Calibration Settings:**
+- **Dataset:** Mixed-domain calibration corpus (150 samples per category)
+- **Distance Metric:** Cosine similarity
+- **Loading Precision:** 4-bit for memory efficiency during pruning
+- **Selection Strategy:** Router activation frequency analysis
+**Process:**
+1. Collect expert activation statistics across calibration dataset
+2. Compute similarity scores between experts
+3. Identify and rank experts by utilization
+4. Prune lowest-activated experts while maintaining coverage
+5. Validate structural integrity and export pruned model
+For full pruning commands, hyperparameters, and reproducibility details, see the [Akicou/reap repository](https://github.com/Akicou/reap).
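The core idea behind the process above can be sketched with NumPy on synthetic data. This is a minimal illustration of a router-weighted saliency score (mean of gate weight times expert output norm over the tokens routed to each expert), assuming top-k routing; it is not the Akicou/reap pipeline, it omits the similarity step, and the function names `reap_saliency` and `experts_to_keep` are hypothetical.

```python
import numpy as np


def reap_saliency(gates: np.ndarray, out_norms: np.ndarray, top_k: int) -> np.ndarray:
    """Score each expert by its average router-weighted output norm.

    gates:     (tokens, experts) router probabilities
    out_norms: (tokens, experts) L2 norm of each expert's output per token
    """
    # Mark the top_k experts selected for every token.
    top_idx = np.argsort(-gates, axis=1)[:, :top_k]
    active = np.zeros_like(gates, dtype=bool)
    np.put_along_axis(active, top_idx, True, axis=1)

    # Average gate * norm over the tokens actually routed to each expert.
    weighted = np.where(active, gates * out_norms, 0.0)
    counts = active.sum(axis=0)
    return weighted.sum(axis=0) / np.maximum(counts, 1)  # never-routed experts score 0


def experts_to_keep(saliency: np.ndarray, n_prune: int) -> np.ndarray:
    """Drop the n_prune lowest-scoring experts and return the kept indices, sorted."""
    order = np.argsort(saliency)  # ascending: least salient first
    return np.sort(order[n_prune:])


# Synthetic calibration statistics (random stand-ins for real activations).
rng = np.random.default_rng(0)
n_tokens, n_experts, top_k = 512, 16, 2
logits = rng.normal(size=(n_tokens, n_experts))
gates = np.exp(logits - logits.max(axis=1, keepdims=True))
gates /= gates.sum(axis=1, keepdims=True)
out_norms = rng.uniform(0.5, 2.0, size=(n_tokens, n_experts))

saliency = reap_saliency(gates, out_norms, top_k)
kept = experts_to_keep(saliency, n_prune=6)
print(f"kept {len(kept)} of {n_experts} experts: {kept.tolist()}")
```

In a real run the gate values and output norms would come from forward passes over the calibration corpus, collected per MoE layer, and the router weights for pruned experts would be removed along with the expert parameters.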
+## ⚖️ Performance Characteristics
+**What Changes:**
+- ✅ Reduced model size (fewer total experts)
+- ✅ Potentially faster inference (fewer experts to store and route among)
+- ✅ Lower memory requirements
+- ⚠️ Slight reduction in capability on edge cases
+**What Stays the Same:**
+- ✅ Active parameters per token (same compute per inference)
+- ✅ Model architecture and API compatibility
+- ✅ Tokenizer and input/output formats
+**Trade-offs:** These models exchange a small amount of capability for significantly improved efficiency. Higher pruning rates (above 30%, which includes this 39% variant) may show more noticeable quality differences on complex or specialized tasks.
+**Note:** Formal benchmarks are not provided due to resource constraints. Community evaluation contributions are welcome!
+## 🛠️ Use Cases
+**Ideal for:**
+- 🏠 Running large language models on consumer GPUs
+- 💻 Local development and testing
+- 🌐 Edge deployment and on-device inference
+- 💰 Cost-sensitive production environments
+- 🔬 Research on efficient model architectures
+**Consider the full model if:**
+- You have abundant GPU resources
+- Maximum quality is critical
+- Working on highly specialized domains
+## 📚 Citation
+If you use these pruned models in your research or applications, please cite both the original REAP paper and the base model:
+### REAP Citation
+```bibtex
+@article{lasby2025reap,
+  title={REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression},
+  author={Lasby, Mike and Lazarevich, Ivan and Sinnadurai, Nish and Lie, Sean and Ioannou, Yani and Thangarasa, Vithursan},
+  journal={arXiv preprint arXiv:2510.13999},
+  year={2025}
+}
+```
+### Base Model Citation
+```bibtex
+@misc{minimax2025m25,
+  title={MiniMax-M2.5: A State-of-the-Art Mixture-of-Experts Language Model},
+  author={MiniMaxAI},
+  year={2025},
+  howpublished={\url{https://huggingface.co/MiniMaxAI/MiniMax-M2.5}}
+}
+```
+## 🙏 Acknowledgments
+- **Original Model:** [MiniMaxAI](https://huggingface.co/MiniMaxAI) for developing MiniMax-M2.5
+- **REAP Framework:** [Cerebras Research](https://github.com/CerebrasResearch/reap) for the pruning methodology
+- **Community:** HuggingFace and the open-source AI community
+## 💖 Support This Work
+Pruning large MoE models requires substantial computational resources (multi-GPU H100 clusters). If you find these models useful:
+- ☕ [Buy me a coffee](https://www.buymeacoffee.com/Akicou) to help offset GPU rental costs
+- ⭐ Star the [GitHub repository](https://github.com/Akicou/reap)
+- 📢 Share with others who might benefit
+- 🐛 Report issues and contribute improvements
+Your support enables continued development and release of efficient model variants!
+## 📞 Contact & Feedback
+- **Issues & Requests:** Open an issue on [GitHub](https://github.com/Akicou/reap/issues)
+- **Discussions:** Use the HuggingFace Community tab above
+- **Custom Pruning:** Reach out for specific pruning ratios or other MoE models
+Feedback, bug reports, and collaboration inquiries are always welcome!
+## 📄 License
+This model inherits the MIT license from the original MiniMax-M2.5 model. See [LICENSE](LICENSE) for details.
+---
+<div align="center">
+**Made with ❤️ by Akicou | Powered by REAP**
+[🤗 Model Hub](https://huggingface.co/Akicou) | [💻 GitHub](https://github.com/Akicou) | [☕ Support](https://www.buymeacoffee.com/Akicou)
+</div>