---
language:
- en
tags:
- mixture-of-experts
- moe
- pruning
- compression
- minimax
- reap
- efficient-inference
license: mit
library_name: transformers
base_model: MiniMaxAI/MiniMax-M2.5
pipeline_tag: text-generation
---

# MiniMax-M2.5 REAP-39 (39% Pruned)

[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Base Model](https://img.shields.io/badge/Base-MiniMax--M2.5-blue)](https://huggingface.co/MiniMaxAI/MiniMax-M2.5)
[![Pruning Method](https://img.shields.io/badge/Method-REAP-green)](https://github.com/CerebrasResearch/reap)

## Overview

This repository contains a **REAP-pruned** variant of the **MiniMax-M2.5** Mixture-of-Experts (MoE) language model with **39%** of experts removed, aiming to retain most of the base model's capability at a substantially smaller footprint.

**REAP** (Router-weighted Expert Activation Pruning) is a structured pruning technique that identifies and removes under-utilized experts based on their routed activation statistics. Pruning at this level yields:
- Reduced model size and memory footprint
- Faster inference and lower serving cost
- Unchanged active parameters per token (the same compute per forward pass)
- Full compatibility with HuggingFace Transformers

## REAP Variant Selection

Choose the variant that best fits your deployment constraints:

| Model | Pruned | Kept | Size Reduction | Performance Trade-off |
|-------|--------|------|----------------|----------------------|
| **REAP-19** | 19% | 81% | Moderate | Small |
| **REAP-29** | 29% | 71% | Significant | Moderate |
| **REAP-39** | 39% | 61% | Large | Noticeable |
| **REAP-50** | 50% | 50% | Very Large | Significant |

**Repository Links:**
- [`Akicou/MiniMax-M2.5-REAP-19`](https://huggingface.co/Akicou/MiniMax-M2.5-REAP-19)
- [`Akicou/MiniMax-M2.5-REAP-29`](https://huggingface.co/Akicou/MiniMax-M2.5-REAP-29)
- [`Akicou/MiniMax-M2.5-REAP-39`](https://huggingface.co/Akicou/MiniMax-M2.5-REAP-39)
- [`Akicou/MiniMax-M2.5-REAP-50`](https://huggingface.co/Akicou/MiniMax-M2.5-REAP-50)

## Quick Start

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "Akicou/MiniMax-M2.5-REAP-39"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",    # shard the model across available GPUs
    torch_dtype="auto",   # keep the checkpoint's native dtype
    trust_remote_code=True
)

prompt = "Explain quantum entanglement in simple terms:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
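Instruction-tuned checkpoints generally respond better to chat-formatted input. If the tokenizer ships a chat template, you can route the prompt through it; a minimal sketch (assuming the template is defined, and reusing `tokenizer` and `model` from above):

```python
messages = [{"role": "user", "content": "Explain quantum entanglement in simple terms."}]
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,  # append the assistant-turn header
    return_tensors="pt",
).to(model.device)
outputs = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```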

### Memory-Efficient Loading

For systems with limited GPU memory:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 8-bit quantization
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    trust_remote_code=True
)

# 4-bit quantization
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    quantization_config=quantization_config,
    trust_remote_code=True
)
```

## Quantized GGUF Versions

Quantized GGUF variants optimized for `llama.cpp`, `Ollama`, and similar backends are in preparation in collaboration with **mradermacher**. Planned formats include Q4_K_M, Q5_K_M, Q6_K, and Q8_0.
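Once published, the GGUF files should load in any llama.cpp-based stack. A hypothetical example via `llama-cpp-python` (the filename below is a placeholder, not a released artifact):

```python
from llama_cpp import Llama  # pip install llama-cpp-python

llm = Llama(
    model_path="MiniMax-M2.5-REAP-39.Q4_K_M.gguf",  # placeholder filename
    n_ctx=4096,        # context window
    n_gpu_layers=-1,   # offload every layer to GPU when possible
)
out = llm("Explain quantum entanglement in simple terms:", max_tokens=256)
print(out["choices"][0]["text"])
```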

## 🔬 Pruning Methodology

### REAP Framework

Pruning was performed using the [REAP framework](https://github.com/CerebrasResearch/reap) (implementation: [Akicou/reap](https://github.com/Akicou/reap)) with the following configuration:

**Calibration Settings:**
- **Dataset:** Mixed-domain calibration corpus (150 samples per category)
- **Distance Metric:** Cosine similarity
- **Loading Precision:** 4-bit for memory efficiency during pruning
- **Selection Strategy:** Router activation frequency analysis

**Process:**
1. Collect expert activation statistics across calibration dataset
2. Compute similarity scores between experts
3. Identify and rank experts by utilization
4. Prune lowest-activated experts while maintaining coverage
5. Validate structural integrity and export pruned model

For full pruning commands, hyperparameters, and reproducibility details, see the [Akicou/reap repository](https://github.com/Akicou/reap).
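To make steps 1–4 above concrete, here is a minimal, illustrative sketch of frequency-based expert scoring. It is not the actual REAP implementation (see the repository above for that); the function names and the top-k routing assumption are ours:

```python
import torch

def score_experts(router_logits: torch.Tensor, top_k: int) -> torch.Tensor:
    """Selection frequency per expert over a calibration pass.

    router_logits: (num_tokens, num_experts) raw router scores.
    """
    top_experts = router_logits.topk(top_k, dim=-1).indices  # (num_tokens, top_k)
    counts = torch.zeros(router_logits.shape[-1])
    counts.scatter_add_(0, top_experts.flatten(),
                        torch.ones(top_experts.numel()))
    return counts / counts.sum()

def experts_to_prune(scores: torch.Tensor, prune_ratio: float) -> list[int]:
    """Indices of the lowest-utilization experts to remove."""
    n_prune = int(prune_ratio * scores.numel())
    return scores.argsort()[:n_prune].tolist()

# Example: score 10k synthetic tokens over 64 experts, prune the bottom 39%.
scores = score_experts(torch.randn(10_000, 64), top_k=8)
print(experts_to_prune(scores, 0.39))
```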

## ⚖️ Performance Characteristics

**What Changes:**
- ✅ Reduced model size (fewer total experts)
- ✅ Faster inference (less expert routing overhead)
- ✅ Lower memory requirements
- ⚠️ Slight reduction in capability on edge cases

**What Stays the Same:**
- ✅ Active parameters per token (same compute per inference)
- ✅ Model architecture and API compatibility
- ✅ Tokenizer and input/output formats

**Trade-offs:** These models exchange a small amount of capability for significantly improved efficiency. Higher pruning rates (≥30%, including this 39% variant) may show more noticeable quality differences on complex or specialized tasks.
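As a rough sketch of the size arithmetic: only expert weights are removed, so total size shrinks in proportion to the expert share of the model, while the per-token active path is untouched. The parameter counts below are made-up placeholders, not official MiniMax-M2.5 figures:

```python
def pruned_total_params(total_params: float, expert_fraction: float,
                        prune_ratio: float) -> float:
    """Approximate parameter count after pruning `prune_ratio` of experts.

    Only expert weights shrink; attention, embeddings, and router
    weights (the remaining 1 - expert_fraction) stay intact.
    """
    return total_params * (1.0 - expert_fraction * prune_ratio)

# Hypothetical 200B-parameter MoE with 90% of weights in experts, 39% pruned:
print(pruned_total_params(200e9, 0.90, 0.39) / 1e9)  # ~129.8B parameters remain
```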

**Note:** Formal benchmarks are not provided due to resource constraints. Community evaluation contributions are welcome!

## 🛠️ Use Cases

**Ideal for:**
- 🏠 Running large language models on consumer GPUs
- 💻 Local development and testing
- 🌐 Edge deployment and on-device inference
- 💰 Cost-sensitive production environments
- 🔬 Research on efficient model architectures

**Consider the full model if:**
- You have abundant GPU resources
- Maximum quality is critical
- Working on highly specialized domains

## 📚 Citation

If you use these pruned models in your research or applications, please cite both the original REAP paper and the base model:

### REAP Citation

```bibtex
@article{lasby2025reap,
  title={REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression},
  author={Lasby, Mike and Lazarevich, Ivan and Sinnadurai, Nish and Lie, Sean and Ioannou, Yani and Thangarasa, Vithursan},
  journal={arXiv preprint arXiv:2510.13999},
  year={2025}
}
```

### Base Model Citation

```bibtex
@misc{minimax2025m25,
  title={MiniMax-M2.5: A State-of-the-Art Mixture-of-Experts Language Model},
  author={MiniMaxAI},
  year={2025},
  howpublished={\url{https://huggingface.co/MiniMaxAI/MiniMax-M2.5}}
}
```

## 🙏 Acknowledgments

- **Original Model:** [MiniMaxAI](https://huggingface.co/MiniMaxAI) for developing MiniMax-M2.5
- **REAP Framework:** [Cerebras Research](https://github.com/CerebrasResearch/reap) for the pruning methodology
- **Community:** HuggingFace and the open-source AI community

## 💖 Support This Work

Pruning large MoE models requires substantial computational resources (multi-GPU H100 clusters). If you find these models useful:

- ☕ [Buy me a coffee](https://www.buymeacoffee.com/Akicou) to help offset GPU rental costs
- ⭐ Star the [GitHub repository](https://github.com/Akicou/reap)
- 📢 Share with others who might benefit
- 🐛 Report issues and contribute improvements

Your support enables continued development and release of efficient model variants!

## 📞 Contact & Feedback

- **Issues & Requests:** Open an issue on [GitHub](https://github.com/Akicou/reap/issues)
- **Discussions:** Use the HuggingFace Community tab above
- **Custom Pruning:** Reach out for specific pruning ratios or other MoE models

Feedback, bug reports, and collaboration inquiries are always welcome!

## 📄 License

This model inherits the MIT license from the original MiniMax-M2.5 model. See [LICENSE](LICENSE) for details.

---

<div align="center">

**Made with ❤️ by Akicou | Powered by REAP**

[🤗 Model Hub](https://huggingface.co/Akicou) | [💻 GitHub](https://github.com/Akicou) | [☕ Support](https://www.buymeacoffee.com/Akicou)

</div>