	---
	language:
	- en
	tags:
	- mixture-of-experts
	- moe
	- pruning
	- compression
	- minimax
	- reap
	- efficient-inference
	license: mit
	library_name: transformers
	base_model: MiniMaxAI/MiniMax-M2.5
	pipeline_tag: text-generation
	---

	# MiniMax-M2.5 REAP-39 (39% Pruned)

	[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
	[![Base Model](https://img.shields.io/badge/Base-MiniMax--M2.5-blue)](https://huggingface.co/MiniMaxAI/MiniMax-M2.5)
	[![Pruning Method](https://img.shields.io/badge/Method-REAP-green)](https://github.com/CerebrasResearch/reap)

	## Support This Work

	Pruning large MoE models requires substantial GPU resources (multi-H100 clusters). If you find these models useful, consider [buying me a coffee](https://www.buymeacoffee.com/Akicou) to help offset rental costs and enable further releases. Your support makes this work possible!

	## Overview

	This repository contains a REAP-pruned variant of the MiniMax-M2.5 Mixture-of-Experts (MoE) language model with 39% of experts removed while maintaining strong performance.

	REAP (Router-weighted Expert Activation Pruning) is a structured pruning technique that identifies and removes under-utilized experts based on router and activation statistics gathered on a calibration set. This achieves:
	- Reduced model size and memory footprint
	- Faster inference and lower cost
	- Maintained active parameters per token
	- Full compatibility with HuggingFace Transformers
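
The arithmetic behind "smaller model, same active parameters" can be sketched as follows. This is an illustration only: the layer count, expert count, top-k value, and parameter sizes are hypothetical placeholders, not the actual MiniMax-M2.5 configuration.

```python
from dataclasses import dataclass


@dataclass
class MoEShape:
    """Hypothetical MoE shape; every number here is an illustrative placeholder."""
    num_layers: int
    experts_per_layer: int
    top_k: int               # experts the router activates per token
    params_per_expert: int
    shared_params: int       # attention, embeddings, routers, etc.

    def total_params(self) -> int:
        # Everything stored in the checkpoint.
        return self.shared_params + self.num_layers * self.experts_per_layer * self.params_per_expert

    def active_params(self) -> int:
        # Only the top-k experts per layer run for a given token.
        return self.shared_params + self.num_layers * self.top_k * self.params_per_expert


def prune(shape: MoEShape, ratio: float) -> MoEShape:
    """Drop `ratio` of the experts in every layer, never going below top_k."""
    kept = max(shape.top_k, round(shape.experts_per_layer * (1.0 - ratio)))
    return MoEShape(shape.num_layers, kept, shape.top_k,
                    shape.params_per_expert, shape.shared_params)


base = MoEShape(num_layers=4, experts_per_layer=100, top_k=8,
                params_per_expert=1_000_000, shared_params=50_000_000)
pruned = prune(base, 0.39)

print("experts kept per layer:", pruned.experts_per_layer)
print("total params:", base.total_params(), "->", pruned.total_params())
print("active params unchanged:", base.active_params() == pruned.active_params())
```

Because the router still selects the same number of experts per token, only the stored (total) parameter count changes in this model of the problem.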

	## REAP Variant Selection

	Choose the variant that best fits your deployment constraints:

	| Model | Pruned | Kept | Size Reduction | Performance Trade-off |
	|-------|--------|------|----------------|----------------------|
	| REAP-10 | 10% | 90% | Small | Minimal |
	| REAP-20 | 20% | 80% | Moderate | Small |
	| REAP-30 | 30% | 70% | Significant | Moderate |
	| REAP-40 | 40% | 60% | Large | Noticeable |
	| REAP-50 | 50% | 50% | Very Large | Significant |

	Repository Links:
	- [`Akicou/MiniMax-M2.5-REAP-19`](https://huggingface.co/Akicou/MiniMax-M2.5-REAP-19)
	- [`Akicou/MiniMax-M2.5-REAP-29`](https://huggingface.co/Akicou/MiniMax-M2.5-REAP-29)
	- [`Akicou/MiniMax-M2.5-REAP-39`](https://huggingface.co/Akicou/MiniMax-M2.5-REAP-39)
	- [`Akicou/MiniMax-M2.5-REAP-50`](https://huggingface.co/Akicou/MiniMax-M2.5-REAP-50)

	## Quick Start

	```python
	from transformers import AutoTokenizer, AutoModelForCausalLM

	model_name = "Akicou/MiniMax-M2.5-REAP-39"

	# trust_remote_code=True lets Transformers run the model's custom code from the repo
	tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
	model = AutoModelForCausalLM.from_pretrained(
	    model_name,
	    device_map="auto",    # spread layers across the available devices
	    torch_dtype="auto",   # use the dtype stored in the checkpoint
	    trust_remote_code=True,
	)

	prompt = "Explain quantum entanglement in simple terms:"
	inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
	outputs = model.generate(**inputs, max_new_tokens=256)
	print(tokenizer.decode(outputs[0], skip_special_tokens=True))
	```

	### Memory-Efficient Loading

	For systems with limited GPU memory:

	```python
	import torch
	from transformers import AutoModelForCausalLM, BitsAndBytesConfig

	model_name = "Akicou/MiniMax-M2.5-REAP-39"

	# Option A: 8-bit quantization (requires the bitsandbytes package)
	model = AutoModelForCausalLM.from_pretrained(
	    model_name,
	    device_map="auto",
	    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
	    trust_remote_code=True,
	)

	# Option B: 4-bit quantization (use instead of Option A, not in addition to it)
	quantization_config = BitsAndBytesConfig(
	    load_in_4bit=True,
	    bnb_4bit_compute_dtype=torch.float16,
	)

	model = AutoModelForCausalLM.from_pretrained(
	    model_name,
	    device_map="auto",
	    quantization_config=quantization_config,
	    trust_remote_code=True,
	)
	```
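
As a rough guide when choosing between these options, weight memory scales with the number of bits stored per parameter. The sketch below is a back-of-the-envelope estimate only: the parameter count is a hypothetical placeholder rather than the size of this checkpoint, and real usage adds quantization metadata, the KV cache, and activations on top.

```python
def weight_memory_gib(num_params: int, bits_per_param: float) -> float:
    """Approximate memory needed for the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


# Hypothetical parameter count, chosen only to make the comparison concrete.
HYPOTHETICAL_PARAMS = 100 * 10**9

for label, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label}: {weight_memory_gib(HYPOTHETICAL_PARAMS, bits):.1f} GiB")
```

Halving the bits per parameter halves the weight footprint, which is why 4-bit loading is the usual fallback when 8-bit still does not fit.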

	## Quantized GGUF Versions

	Quantized GGUF variants optimized for `llama.cpp`, `Ollama`, and similar backends are in preparation in collaboration with mradermacher. Planned formats include Q4_K_M, Q5_K_M, Q6_K, and Q8_0.

	## 🔬 Pruning Methodology

	### REAP Framework

	Pruning was performed using the [REAP framework](https://github.com/CerebrasResearch/reap) (implementation: [Akicou/reap](https://github.com/Akicou/reap)) with the following configuration:

	Calibration Settings:
	- Dataset: Mixed-domain calibration corpus (150 samples per category)
	- Distance Metric: Cosine similarity
	- Loading Precision: 4-bit for memory efficiency during pruning
	- Selection Strategy: Router activation frequency analysis
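
Cosine similarity, the metric named in the settings above, compares the direction of two vectors and ignores their magnitude. A minimal sketch, assuming each expert is summarized by a vector of mean output activations; that summary is a hypothetical choice for illustration, and the vectors actually used by the Akicou/reap implementation are not described here.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)


# Hypothetical per-expert summary vectors.
expert_a = [0.2, 0.9, 0.1]
expert_b = [0.4, 1.8, 0.2]   # same direction as expert_a, twice the magnitude
expert_c = [0.9, 0.0, 0.3]

print(cosine_similarity(expert_a, expert_b))
print(cosine_similarity(expert_a, expert_c))
```

Two experts with similarity close to 1 respond in nearly the same direction, which is one signal that they may be redundant.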

	Process:
	1. Collect expert activation statistics across the calibration dataset
	2. Compute similarity scores between experts
	3. Rank experts by utilization
	4. Prune the lowest-ranked experts while maintaining coverage
	5. Validate structural integrity and export the pruned model
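
The ranking and pruning steps above can be sketched in a few lines. This is a simplified illustration and not the code in Akicou/reap: the per-expert score (the mean of router gate weight times activation norm over the tokens routed to that expert) follows the general idea of the REAP paper, and the function names and sample records are hypothetical.

```python
def expert_saliency(routing_records, num_experts):
    """Mean of gate_weight * activation_norm over the tokens routed to each expert.

    routing_records: iterable of (expert_id, gate_weight, activation_norm),
    one entry per (token, selected expert) pair from the calibration run.
    """
    totals = [0.0] * num_experts
    counts = [0] * num_experts
    for expert_id, gate_weight, activation_norm in routing_records:
        totals[expert_id] += gate_weight * activation_norm
        counts[expert_id] += 1
    # Experts that were never selected get a score of 0 and are pruned first.
    return [t / c if c else 0.0 for t, c in zip(totals, counts)]


def select_experts_to_keep(saliency, prune_ratio, min_keep=1):
    """Return the sorted indices of the experts that survive pruning."""
    num_experts = len(saliency)
    num_keep = max(min_keep, num_experts - int(num_experts * prune_ratio))
    ranked = sorted(range(num_experts), key=lambda i: saliency[i], reverse=True)
    return sorted(ranked[:num_keep])


# Hypothetical calibration records for a single layer with 5 experts.
records = [
    (0, 0.6, 2.0), (1, 0.4, 1.0),
    (0, 0.7, 2.5), (2, 0.3, 0.5),
    (3, 0.5, 3.0), (1, 0.5, 1.2),
]
scores = expert_saliency(records, num_experts=5)
keep = select_experts_to_keep(scores, prune_ratio=0.39)
print("saliency:", scores)
print("experts kept:", keep)
```

In a real model this selection runs per MoE layer, after which the router's output dimension and the expert weight tensors are sliced down to the kept indices.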

	For full pruning commands, hyperparameters, and reproducibility details, see the [Akicou/reap repository](https://github.com/Akicou/reap).

	## ⚖️ Performance Characteristics

	What Changes:
	- ✅ Reduced model size (fewer total experts)
	- ✅ Faster inference (less expert routing overhead)
	- ✅ Lower memory requirements
	- ⚠️ Slight reduction in capability on edge cases

	What Stays the Same:
	- ✅ Active parameters per token (same compute per inference)
	- ✅ Model architecture and API compatibility
	- ✅ Tokenizer and input/output formats

	Trade-offs: These models exchange a small amount of capability for significantly improved efficiency. Higher pruning rates (above 30%, which includes this 39% variant) may show more noticeable quality differences on complex or specialized tasks.

	Note: Formal benchmarks are not provided due to resource constraints. Community evaluation contributions are welcome!

	## 🛠️ Use Cases

	Ideal for:
	- 🏠 Running large language models on consumer GPUs
	- 💻 Local development and testing
	- 🌐 Edge deployment and on-device inference
	- 💰 Cost-sensitive production environments
	- 🔬 Research on efficient model architectures

	Consider the full model if:
	- You have abundant GPU resources
	- Maximum quality is critical
	- Working on highly specialized domains

	## 📚 Citation

	If you use these pruned models in your research or applications, please cite both the original REAP paper and the base model:

	### REAP Citation

	```bibtex
	@article{lasby2025reap,
	  title={REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression},
	  author={Lasby, Mike and Lazarevich, Ivan and Sinnadurai, Nish and Lie, Sean and Ioannou, Yani and Thangarasa, Vithursan},
	  journal={arXiv preprint arXiv:2510.13999},
	  year={2025}
	}
	```

	### Base Model Citation

	```bibtex
	@misc{minimax2025m25,
	  title={MiniMax-M2.5: A State-of-the-Art Mixture-of-Experts Language Model},
	  author={MiniMaxAI},
	  year={2025},
	  howpublished={\url{https://huggingface.co/MiniMaxAI/MiniMax-M2.5}}
	}
	```

	## 🙏 Acknowledgments

	- Original Model: [MiniMaxAI](https://huggingface.co/MiniMaxAI) for developing MiniMax-M2.5
	- REAP Framework: [Cerebras Research](https://github.com/CerebrasResearch/reap) for the pruning methodology
	- Community: HuggingFace and the open-source AI community

	## 💖 Support This Work

	Pruning large MoE models requires substantial computational resources (multi-GPU H100 clusters). If you find these models useful:

	- ☕ [Buy me a coffee](https://www.buymeacoffee.com/Akicou) to help offset GPU rental costs
	- ⭐ Star the [GitHub repository](https://github.com/Akicou/reap)
	- 📢 Share with others who might benefit
	- 🐛 Report issues and contribute improvements

	Your support enables continued development and release of efficient model variants!

	## 📞 Contact & Feedback

	- Issues & Requests: Open an issue on [GitHub](https://github.com/Akicou/reap/issues)
	- Discussions: Use the HuggingFace Community tab above
	- Custom Pruning: Reach out for specific pruning ratios or other MoE models

	Feedback, bug reports, and collaboration inquiries are always welcome!

	## 📄 License

	This model inherits the MIT license from the original MiniMax-M2.5 model. See [LICENSE](LICENSE) for details.

	---

	<div align="center">

	Made with ❤️ by Akicou | Powered by REAP

	[🤗 Model Hub](https://huggingface.co/Akicou) | [💻 GitHub](https://github.com/Akicou) | [☕ Support](https://www.buymeacoffee.com/Akicou)

	</div>