Akicou committed on
Commit
175115d
·
verified ·
1 Parent(s): 80c823c

Update README.md

  1. README.md +204 -21
README.md CHANGED
@@ -1,42 +1,225 @@
1
  ---
2
- license: other
3
- base_model: MiniMaxAI/MiniMax-M2.5
4
  tags:
5
- - moe
6
  - mixture-of-experts
 
7
  - pruning
 
 
8
  - reap
9
- - quantized
 
 
 
 
10
  ---
11
 
12
- # MiniMaxAI/MiniMax-M2.5 - REAP Pruned (39% Compression)
13
 
14
- This is a pruned version of [`MiniMaxAI/MiniMax-M2.5`](https://huggingface.co/MiniMaxAI/MiniMax-M2.5) using **REAP** (Router-weighted Expert Activation Pruning).
15
 
16
- ## Pruning Details
 
 
 
 
17
 
18
- - **Original Experts per Layer**: 256
19
- - **Remaining Experts per Layer**: 154
20
- - **Compression**: 39%
21
- - **Method**: REAP (Router-weighted Expert Activation Pruning)
22
 
23
- ## Usage
24
 
25
  ```python
26
- from transformers import AutoModelForCausalLM, AutoTokenizer
 
 
27
 
28
- model_name = "Akicou/MiniMax-M2-5-REAP-39"
29
- model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", trust_remote_code=True)
30
  tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
31
  ```
32
 
33
- ## Original Model
34
 
35
- [`MiniMaxAI/MiniMax-M2.5`](https://huggingface.co/MiniMaxAI/MiniMax-M2.5)
36
 
37
- ## REAP
38
 
39
- REAP (Router-weighted Expert Activation Pruning) is a method for pruning Mixture-of-Experts (MoE) models by analyzing router activations during inference.
40
 
41
- - **GitHub**: [CerebrasResearch/reap](https://github.com/CerebrasResearch/reap)
42
- - **Paper**: [REAP: Pruning MoE Models via Router Weighted Expert Activation](https://arxiv.org/abs/...)
 
1
  ---
2
+ language:
3
+ - en
4
  tags:
 
5
  - mixture-of-experts
6
+ - moe
7
  - pruning
8
+ - compression
9
+ - minimax
10
  - reap
11
+ - efficient-inference
12
+ license: mit
13
+ library_name: transformers
14
+ base_model: MiniMaxAI/MiniMax-M2.5
15
+ pipeline_tag: text-generation
16
  ---
17
 
18
+ # MiniMax-M2.5 REAP-39 (39% Pruned)
19
+
20
+ [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
21
+ [![Base Model](https://img.shields.io/badge/Base-MiniMax--M2.5-blue)](https://huggingface.co/MiniMaxAI/MiniMax-M2.5)
22
+ [![Pruning Method](https://img.shields.io/badge/Method-REAP-green)](https://github.com/CerebrasResearch/reap)
23
+
24
+ ## Support This Work
25
+
26
+ Pruning large MoE models requires substantial GPU resources (multi-H100 clusters). If you find these models useful, consider [buying me a coffee](https://www.buymeacoffee.com/Akicou) to help offset rental costs and enable further releases. Your support makes this work possible!
27
+
28
+ ## Overview
29
 
30
+ This repository contains a **REAP-pruned** variant of the **MiniMax-M2.5** Mixture-of-Experts (MoE) language model. About **39%** of the experts in each MoE layer have been removed, with the goal of keeping quality close to the base model.
31
 
32
+ **REAP** (Router-weighted Expert Activation Pruning) is a structured pruning technique that identifies and removes under-utilized experts based on router-weighted activation patterns collected on calibration data. This achieves:
33
+ - Reduced model size and memory footprint
34
+ - Faster inference and lower cost
35
+ - Maintained active parameters per token
36
+ - Full compatibility with HuggingFace Transformers
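To make "router-weighted activation" concrete, here is a minimal sketch of one way to score an expert. It assumes the score is the mean, over the calibration tokens routed to that expert, of the router gate weight times the L2 norm of the expert's output. The function name `expert_saliency` and the toy numbers are illustrative only and are not taken from the REAP codebase.

```python
import math


def expert_saliency(routed_tokens):
    """Mean of gate_weight * ||expert_output||_2 over the routed tokens.

    routed_tokens: list of (gate_weight, expert_output_vector) pairs, one per
    calibration token that the router sent to this expert (assumed layout).
    """
    if not routed_tokens:
        return 0.0  # an expert that was never selected gets the lowest score
    total = 0.0
    for gate_weight, output in routed_tokens:
        norm = math.sqrt(sum(v * v for v in output))
        total += gate_weight * norm
    return total / len(routed_tokens)


# Toy data (hypothetical): expert "a" is selected with a high gate weight and
# produces large outputs, "b" is selected weakly, "c" is never selected.
toy_stats = {
    "a": [(0.5, [3.0, 4.0]), (0.5, [0.0, 5.0])],
    "b": [(0.1, [0.3, 0.4])],
    "c": [],
}
scores = {name: expert_saliency(tokens) for name, tokens in toy_stats.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

Experts at the end of `ranking` would be the first candidates for removal under this assumed criterion.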
37
 
38
+ ## REAP Variant Selection
 
 
 
39
 
40
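+ Choose the variant that best fits your deployment constraints. The table shows nominal pruning levels; the repositories currently published are listed under Repository Links below: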
41
+
42
+ | Model | Pruned | Kept | Size Reduction | Performance Trade-off |
43
+ |-------|--------|------|----------------|----------------------|
44
+ | **REAP-10** | 10% | 90% | Small | Minimal |
45
+ | **REAP-20** | 20% | 80% | Moderate | Small |
46
+ | **REAP-30** | 30% | 70% | Significant | Moderate |
47
+ | **REAP-40** | 40% | 60% | Large | Noticeable |
48
+ | **REAP-50** | 50% | 50% | Very Large | Significant |
49
+
50
+ **Repository Links:**
51
+ - [`Akicou/MiniMax-M2.5-REAP-19`](https://huggingface.co/Akicou/MiniMax-M2.5-REAP-19)
52
+ - [`Akicou/MiniMax-M2.5-REAP-29`](https://huggingface.co/Akicou/MiniMax-M2.5-REAP-29)
53
+ - [`Akicou/MiniMax-M2.5-REAP-39`](https://huggingface.co/Akicou/MiniMax-M2.5-REAP-39)
54
+ - [`Akicou/MiniMax-M2.5-REAP-50`](https://huggingface.co/Akicou/MiniMax-M2.5-REAP-50)
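The percentage in a variant name can be checked against expert counts. The previous revision of this card listed 256 experts per layer in the base model and 154 remaining in this variant; the small helper below (its name is illustrative) turns such counts into a pruned fraction.

```python
def pruned_fraction(original_experts, remaining_experts):
    """Fraction of experts removed, given per-layer counts before and after."""
    if original_experts <= 0 or not 0 <= remaining_experts <= original_experts:
        raise ValueError("expected 0 <= remaining_experts <= original_experts")
    return (original_experts - remaining_experts) / original_experts


# Counts taken from the earlier revision of this README.
fraction = pruned_fraction(256, 154)
print(f"{fraction:.1%}")  # prints 39.8%
```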
55
+
56
+ ## Quick Start
57
 
58
  ```python
+ from transformers import AutoTokenizer, AutoModelForCausalLM
+
+ model_name = "Akicou/MiniMax-M2.5-REAP-39"
 
  tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
+ model = AutoModelForCausalLM.from_pretrained(
+     model_name,
+     device_map="auto",
+     torch_dtype="auto",
+     trust_remote_code=True,
+ )
+
+ # Tokenize a prompt, generate up to 256 new tokens, and print the decoded text
+ prompt = "Explain quantum entanglement in simple terms:"
+ inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
+ outputs = model.generate(**inputs, max_new_tokens=256)
+ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
  ```
76
 
77
+ ### Memory-Efficient Loading
78
+
79
+ For systems with limited GPU memory:
80
+
81
+ ```python
+ import torch
+ from transformers import AutoModelForCausalLM, BitsAndBytesConfig
+
+ model_name = "Akicou/MiniMax-M2.5-REAP-39"
+
+ # Option 1: 8-bit quantization (requires the bitsandbytes package)
+ model = AutoModelForCausalLM.from_pretrained(
+     model_name,
+     device_map="auto",
+     quantization_config=BitsAndBytesConfig(load_in_8bit=True),
+     trust_remote_code=True,
+ )
+
+ # Option 2: 4-bit quantization with float16 compute (use instead of option 1)
+ quantization_config = BitsAndBytesConfig(
+     load_in_4bit=True,
+     bnb_4bit_compute_dtype=torch.float16,
+ )
+
+ model = AutoModelForCausalLM.from_pretrained(
+     model_name,
+     device_map="auto",
+     quantization_config=quantization_config,
+     trust_remote_code=True,
+ )
+ ```
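For a rough idea of which precision fits a given GPU budget, weight memory can be estimated from the parameter count and the bits stored per parameter. The sketch below is a back-of-the-envelope estimate only: it ignores activations, the KV cache, and quantization metadata, and `hypothetical_params` is a placeholder, not this model's actual parameter count.

```python
def weight_memory_gib(num_params, bits_per_param):
    """Approximate memory for model weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 2**30


# Placeholder value for illustration; substitute the real parameter count.
hypothetical_params = 100e9

for label, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
    gib = weight_memory_gib(hypothetical_params, bits)
    print(f"{label}: ~{gib:.0f} GiB of weights")
```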
105
+
106
+ ## Quantized GGUF Versions
107
+
108
+ Quantized GGUF variants optimized for `llama.cpp`, `Ollama`, and similar backends are in preparation in collaboration with **mradermacher**. Planned formats include Q4_K_M, Q5_K_M, Q6_K, and Q8_0.
109
+
110
+ ## 🔬 Pruning Methodology
111
+
112
+ ### REAP Framework
113
+
114
+ Pruning was performed using the [REAP framework](https://github.com/CerebrasResearch/reap) (implementation: [Akicou/reap](https://github.com/Akicou/reap)) with the following configuration:
115
+
116
+ **Calibration Settings:**
117
+ - **Dataset:** Mixed-domain calibration corpus (150 samples per category)
118
+ - **Distance Metric:** Cosine similarity
119
+ - **Loading Precision:** 4-bit for memory efficiency during pruning
120
+ - **Selection Strategy:** Router activation frequency analysis
121
+
122
+ **Process:**
123
+ 1. Collect expert activation statistics across the calibration dataset
124
+ 2. Compute similarity scores between experts
125
+ 3. Identify and rank experts by utilization
126
+ 4. Prune lowest-activated experts while maintaining coverage
127
+ 5. Validate structural integrity and export pruned model
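The ranking and pruning steps above can be sketched for a single layer as follows. This is a simplified illustration, not the code used by the REAP framework: it assumes one score per expert is already available, and the function names and toy scores are hypothetical. A real implementation would also need to slice the router weights and expert modules to match the kept indices.

```python
def select_experts_to_keep(scores, prune_fraction):
    """Return the sorted indices of the experts kept in one layer."""
    num_experts = len(scores)
    num_pruned = int(num_experts * prune_fraction)
    # Rank experts from highest to lowest score and drop the tail.
    ranked = sorted(range(num_experts), key=lambda i: scores[i], reverse=True)
    kept = ranked[: num_experts - num_pruned]
    return sorted(kept)


def build_index_map(kept):
    """Map each kept expert's old index to its new contiguous index."""
    return {old: new for new, old in enumerate(kept)}


# Toy per-expert scores for a layer with five experts (hypothetical values).
layer_scores = [0.9, 0.1, 0.4, 0.05, 0.7]
kept = select_experts_to_keep(layer_scores, prune_fraction=0.4)
index_map = build_index_map(kept)
print(kept, index_map)
```

Keeping the surviving indices in their original order makes it straightforward to re-index the router's output dimension with the same map.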
128
+
129
+ For full pruning commands, hyperparameters, and reproducibility details, see the [Akicou/reap repository](https://github.com/Akicou/reap).
130
+
131
+ ## ⚖️ Performance Characteristics
132
+
133
+ **What Changes:**
134
+ - ✅ Reduced model size (fewer total experts)
135
+ - ✅ Potentially faster inference in practice (fewer expert weights to load, shard, or offload)
136
+ - ✅ Lower memory requirements
137
+ - ⚠️ Slight reduction in capability on edge cases
138
+
139
+ **What Stays the Same:**
140
+ - ✅ Active parameters per token (same compute per token)
141
+ - ✅ Model architecture and API compatibility
142
+ - ✅ Tokenizer and input/output formats
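The split between what changes and what stays the same follows from how MoE layers are sized: total expert parameters scale with the number of experts stored, while active parameters scale with the number of experts the router selects per token. In the sketch below, the expert counts 256 and 154 come from the earlier revision of this README; `top_k=8` and `params_per_expert` are made-up values for illustration.

```python
def moe_layer_params(num_experts, top_k, params_per_expert):
    """Return (total, active-per-token) expert parameters for one MoE layer."""
    total = num_experts * params_per_expert
    active = min(top_k, num_experts) * params_per_expert
    return total, active


# top_k and params_per_expert are hypothetical placeholders.
before = moe_layer_params(num_experts=256, top_k=8, params_per_expert=1_000_000)
after = moe_layer_params(num_experts=154, top_k=8, params_per_expert=1_000_000)

print("total expert params:", before[0], "->", after[0])
print("active per token:   ", before[1], "->", after[1])
```

Total expert parameters shrink with the expert count, while the active count per token is unchanged as long as the router still selects the same number of experts.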
143
+
144
+ **Trade-offs:** These models exchange a small amount of capability for significantly improved efficiency. Higher pruning rates (above 30%, which includes this 39% variant) may show more noticeable quality differences on complex or specialized tasks.
145
+
146
+ **Note:** Formal benchmarks are not provided due to resource constraints. Community evaluation contributions are welcome!
147
+
148
+ ## 🛠️ Use Cases
149
+
150
+ **Ideal for:**
151
+ - 🏠 Running large language models on consumer GPUs
152
+ - 💻 Local development and testing
153
+ - 🌐 Edge deployment and on-device inference
154
+ - 💰 Cost-sensitive production environments
155
+ - 🔬 Research on efficient model architectures
156
+
157
+ **Consider the full model if:**
158
+ - You have abundant GPU resources
159
+ - Maximum quality is critical
160
+ - Working on highly specialized domains
161
+
162
+ ## 📚 Citation
163
+
164
+ If you use these pruned models in your research or applications, please cite both the original REAP paper and the base model:
165
+
166
+ ### REAP Citation
167
+
168
+ ```bibtex
169
+ @article{lasby2025reap,
170
+ title={REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression},
171
+ author={Lasby, Mike and Lazarevich, Ivan and Sinnadurai, Nish and Lie, Sean and Ioannou, Yani and Thangarasa, Vithursan},
172
+ journal={arXiv preprint arXiv:2510.13999},
173
+ year={2025}
174
+ }
175
+ ```
176
+
177
+ ### Base Model Citation
178
+
179
+ ```bibtex
180
+ @misc{minimax2025m25,
181
+ title={MiniMax-M2.5: A State-of-the-Art Mixture-of-Experts Language Model},
182
+ author={MiniMaxAI},
183
+ year={2025},
184
+ howpublished={\url{https://huggingface.co/MiniMaxAI/MiniMax-M2.5}}
185
+ }
186
+ ```
187
+
188
+ ## 🙏 Acknowledgments
189
+
190
+ - **Original Model:** [MiniMaxAI](https://huggingface.co/MiniMaxAI) for developing MiniMax-M2.5
191
+ - **REAP Framework:** [Cerebras Research](https://github.com/CerebrasResearch/reap) for the pruning methodology
192
+ - **Community:** HuggingFace and the open-source AI community
193
+
194
+ ## 💖 Support This Work
195
+
196
+ Pruning large MoE models requires substantial computational resources (multi-GPU H100 clusters). If you find these models useful:
197
+
198
+ - ☕ [Buy me a coffee](https://www.buymeacoffee.com/Akicou) to help offset GPU rental costs
199
+ - ⭐ Star the [GitHub repository](https://github.com/Akicou/reap)
200
+ - 📢 Share with others who might benefit
201
+ - 🐛 Report issues and contribute improvements
202
+
203
+ Your support enables continued development and release of efficient model variants!
204
+
205
+ ## 📞 Contact & Feedback
206
+
207
+ - **Issues & Requests:** Open an issue on [GitHub](https://github.com/Akicou/reap/issues)
208
+ - **Discussions:** Use the HuggingFace Community tab above
209
+ - **Custom Pruning:** Reach out for specific pruning ratios or other MoE models
210
+
211
+ Feedback, bug reports, and collaboration inquiries are always welcome!
212
+
213
+ ## 📄 License
214
+
215
+ This model inherits the MIT license from the original MiniMax-M2.5 model. See [LICENSE](LICENSE) for details.
216
+
217
+ ---
218
 
219
+ <div align="center">
220
 
221
+ **Made with ❤️ by Akicou | Powered by REAP**
222
 
223
+ [🤗 Model Hub](https://huggingface.co/Akicou) | [💻 GitHub](https://github.com/Akicou) | [☕ Support](https://www.buymeacoffee.com/Akicou)
224
 
225
+ </div>