fix readme

Files changed (2)

README.MD +0 -129
README.md +129 -3

README.MD DELETED

@@ -1,129 +0,0 @@
----
-license: apache-2.0
-base_model: MiniMaxAI/MiniMax-M2.1
-tags:
-- minimax
-- moe
-- reap
-- pruned
-- cerebras
-- quantized
-- gptq
-- autoround
-- 4bit
-- text-generation
-library_name: transformers
-pipeline_tag: text-generation
----
-<p align="center">
-  <em>𓌳 <strong>REAP</strong>𓌳  the Experts: Why Pruning Prevails for One-Shot MoE Compression</em><br>
-  <a href="https://arxiv.org/abs/2510.13999">📄 Paper</a> • <a href="https://github.com/CerebrasResearch/reap">💻 Code</a> • <a href="https://www.cerebras.ai/blog/reap">📝 Blog</a>
-</p>
-# MiniMax-M2.1-REAP-50-W4A16
-> ⚠️ **Note**: This is a **re-upload of 0xSero's quantized and pruned MiniMax-M2.1-REAP-50-W4A16 model**.
-> The original creator ([0xSero](https://huggingface.co/0xSero)) has explicitly authorized this re-upload.
-> All credit for the quantization and pruning work goes to 0xSero.
-## ✨ Highlights
-**50% Expert-Pruned + INT4 Quantized** — Double compression for efficient deployment.
-- **REAP + AutoRound**: Expert pruning + weight quantization
-- **Optimized for Code & Tools**: Calibrated on code generation and function calling
-- **Lower VRAM**: Fits on 96GB of VRAM
-**50% expert-pruned MiniMax-M2.1 using REAP (Router-weighted Expert Activation Pruning)**
-| Property | Value |
-|----------|-------|
-| Base Model | [MiniMaxAI/MiniMax-M2.1](https://huggingface.co/MiniMaxAI/MiniMax-M2.1) |
-| **After REAP 50%** | ~116B |
-| Experts | 128/256 (50% retained) |
-| Architecture | MoE (Mixture of Experts) |
-| **Quantization** | INT4 weights, FP16 activations |
-| **Format** | GPTQ (AutoRound) |
-| Disk Size | 62.6GB |
-| (Un)Stability | **2 loops** in stress tests |
-## Stress Test Results
-Tested at 4 temperatures (0.0, 0.2, 0.7, 1.0) across 6 prompt types (24 tests in total). Full observations: [MiniMax-M2.1 REAP Stress Test Observations](https://huggingface.co/datasets/0xSero/minimax-m2.1-reap-observations)
-| Temperature | math_word | reasoning | code | json | instruction | creative |
-|-------------|-----------|-----------|------|------|-------------|----------|
-| 0.0 | **Loop** | OK | OK | OK | OK | OK |
-| 0.2 | **Loop** | OK | OK | OK | OK | OK |
-| 0.7 | OK | OK | OK | OK | OK | OK |
-| 1.0 | OK | OK | OK | OK | OK | OK |
-**Result: 24/24 tests passed, 2 loops detected**
-## 🚀 Deployment
-### vLLM (Recommended)
-```bash
-vllm serve plezan/MiniMax-M2.1-REAP-50-W4A16 \
-    --tensor-parallel-size 4 \
-    --trust-remote-code \
-    --quantization gptq
-```
-### Transformers
-```python
-from transformers import AutoModelForCausalLM, AutoTokenizer
-model = AutoModelForCausalLM.from_pretrained(
-    "plezan/MiniMax-M2.1-REAP-50-W4A16",
-    device_map="auto",
-    trust_remote_code=True
-)
-tokenizer = AutoTokenizer.from_pretrained("plezan/MiniMax-M2.1-REAP-50-W4A16", trust_remote_code=True)
-```
-## Why 50% Pruning?
-The 50% pruning ratio offers a balance of:
-- **Size reduction**: 116B vs ~230B original (about 50% smaller)
-- **Performance**: Minimal quality degradation from strategic expert selection
-- **At the cost of Stability**: 2 loops in comprehensive stress testing
-Using a 40% pruning ratio would offer a better overall balance.
-## Model Comparison
-| Model | Experts | Loops | Size | Status |
-|-------|---------|-------|------|--------|
-| [MiniMax-M2.1-REAP-20](https://huggingface.co/0xSero/MiniMax-M2.1-REAP-20-REPAIR-IN-PROGRESS) | 204 | 1 | 185B | Deprecated |
-| [MiniMax-M2.1-REAP-30](https://huggingface.co/0xSero/MiniMax-M2.1-REAP-30) | 180 | 0 | 162B | Recommended |
-| [MiniMax-M2.1-REAP-40](https://huggingface.co/0xSero/MiniMax-M2.1-REAP-40) | 154 | 0 | 139B | Recommended |
-| [MiniMax-M2.1-REAP-50](https://huggingface.co/0xSero/MiniMax-M2.1-REAP-50-REPAIR-IN-PROGRESS) | 128 | 2 | 116B | Deprecated |
-> **Note**: Links in the table above point to the original models on 0xSero's account; some of them have since been removed by the creator. This re-upload preserves the 50% pruned + **quantized** version with authorization.
-## REAP Methodology
-REAP (Router-weighted Expert Activation Pruning) uses calibration data to identify which experts are most important based on router activation patterns. Unlike random or magnitude-based pruning, REAP preserves the experts that are actually used during inference.
-**Calibration Dataset**: 2098 samples
-- pile-10k: 498 samples (general text)
-- evol-codealpaca: 800 samples (code generation)
-- xlam-function-calling: 800 samples (function calling)
-## 🙏 Acknowledgments
-This model is a derivative work based on extensive research and development by:
-- **[0xSero](https://huggingface.co/0xSero)** — Original quantization (GPTQ/AutoRound) and REAP pruning of MiniMax-M2.1. This re-upload is posted with explicit authorization from 0xSero.
-- **[Prime Intellect](https://www.primeintellect.ai/)** — Compute sponsorship for the original work
-- **[Cerebras](https://www.cerebras.net/)** — [REAP methodology](https://arxiv.org/abs/2510.13999) and implementation
-- **[Intel](https://github.com/intel/auto-round)** — AutoRound quantization framework
-- **[MiniMax](https://huggingface.co/MiniMaxAI)** — Base model (MiniMax-M2.1)

README.md CHANGED

@@ -1,3 +1,129 @@
----
-license: apache-2.0
----

+---
+license: apache-2.0
+base_model: MiniMaxAI/MiniMax-M2.1
+tags:
+- minimax
+- moe
+- reap
+- pruned
+- cerebras
+- quantized
+- gptq
+- autoround
+- 4bit
+- text-generation
+library_name: transformers
+pipeline_tag: text-generation
+---
+<p align="center">
+  <em>𓌳 <strong>REAP</strong>𓌳  the Experts: Why Pruning Prevails for One-Shot MoE Compression</em><br>
+  <a href="https://arxiv.org/abs/2510.13999">📄 Paper</a> • <a href="https://github.com/CerebrasResearch/reap">💻 Code</a> • <a href="https://www.cerebras.ai/blog/reap">📝 Blog</a>
+</p>
+# MiniMax-M2.1-REAP-50-W4A16
+> ⚠️ **Note**: This is a **re-upload of 0xSero's quantized and pruned MiniMax-M2.1-REAP-50-W4A16 model**.
+> The original creator ([0xSero](https://huggingface.co/0xSero)) has explicitly authorized this re-upload.
+> All credit for the quantization and pruning work goes to 0xSero.
+## ✨ Highlights
+**50% Expert-Pruned + INT4 Quantized** — Double compression for efficient deployment.
+- **REAP + AutoRound**: Expert pruning + weight quantization
+- **Optimized for Code & Tools**: Calibrated on code generation and function calling
+- **Lower VRAM**: Fits on 96GB of VRAM
+**50% expert-pruned MiniMax-M2.1 using REAP (Router-weighted Expert Activation Pruning)**
+| Property | Value |
+|----------|-------|
+| Base Model | [MiniMaxAI/MiniMax-M2.1](https://huggingface.co/MiniMaxAI/MiniMax-M2.1) |
+| **After REAP 50%** | ~116B |
+| Experts | 128/256 (50% retained) |
+| Architecture | MoE (Mixture of Experts) |
+| **Quantization** | INT4 weights, FP16 activations |
+| **Format** | GPTQ (AutoRound) |
+| Disk Size | 62.6GB |
+| (Un)Stability | **2 loops** in stress tests |
+## Stress Test Results
+Tested at 4 temperatures (0.0, 0.2, 0.7, 1.0) across 6 prompt types (24 tests in total). Full observations: [MiniMax-M2.1 REAP Stress Test Observations](https://huggingface.co/datasets/0xSero/minimax-m2.1-reap-observations)
+| Temperature | math_word | reasoning | code | json | instruction | creative |
+|-------------|-----------|-----------|------|------|-------------|----------|
+| 0.0 | **Loop** | OK | OK | OK | OK | OK |
+| 0.2 | **Loop** | OK | OK | OK | OK | OK |
+| 0.7 | OK | OK | OK | OK | OK | OK |
+| 1.0 | OK | OK | OK | OK | OK | OK |
+**Result: 24/24 tests passed, 2 loops detected**
+## 🚀 Deployment
+### vLLM (Recommended)
+```bash
+vllm serve plezan/MiniMax-M2.1-REAP-50-W4A16 \
+    --tensor-parallel-size 4 \
+    --trust-remote-code \
+    --quantization gptq
+```
+### Transformers
+```python
+from transformers import AutoModelForCausalLM, AutoTokenizer
+model = AutoModelForCausalLM.from_pretrained(
+    "plezan/MiniMax-M2.1-REAP-50-W4A16",
+    device_map="auto",
+    trust_remote_code=True
+)
+tokenizer = AutoTokenizer.from_pretrained("plezan/MiniMax-M2.1-REAP-50-W4A16", trust_remote_code=True)
+```
+## Why 50% Pruning?
+The 50% pruning ratio offers a balance of:
+- **Size reduction**: 116B vs ~230B original (about 50% smaller)
+- **Performance**: Minimal quality degradation from strategic expert selection
+- **At the cost of Stability**: 2 loops in comprehensive stress testing
+Using a 40% pruning ratio would offer a better overall balance.
+## Model Comparison
+| Model | Experts | Loops | Size | Status |
+|-------|---------|-------|------|--------|
+| [MiniMax-M2.1-REAP-20](https://huggingface.co/0xSero/MiniMax-M2.1-REAP-20-REPAIR-IN-PROGRESS) | 204 | 1 | 185B | Deprecated |
+| [MiniMax-M2.1-REAP-30](https://huggingface.co/0xSero/MiniMax-M2.1-REAP-30) | 180 | 0 | 162B | Recommended |
+| [MiniMax-M2.1-REAP-40](https://huggingface.co/0xSero/MiniMax-M2.1-REAP-40) | 154 | 0 | 139B | Recommended |
+| [MiniMax-M2.1-REAP-50](https://huggingface.co/0xSero/MiniMax-M2.1-REAP-50-REPAIR-IN-PROGRESS) | 128 | 2 | 116B | Deprecated |
+> **Note**: Links in the table above point to the original models on 0xSero's account; some of them have since been removed by the creator. This re-upload preserves the 50% pruned + **quantized** version with authorization.
+## REAP Methodology
+REAP (Router-weighted Expert Activation Pruning) uses calibration data to identify which experts are most important based on router activation patterns. Unlike random or magnitude-based pruning, REAP preserves the experts that are actually used during inference.
+**Calibration Dataset**: 2098 samples
+- pile-10k: 498 samples (general text)
+- evol-codealpaca: 800 samples (code generation)
+- xlam-function-calling: 800 samples (function calling)
+## 🙏 Acknowledgments
+This model is a derivative work based on extensive research and development by:
+- **[0xSero](https://huggingface.co/0xSero)** — Original quantization (GPTQ/AutoRound) and REAP pruning of MiniMax-M2.1. This re-upload is posted with explicit authorization from 0xSero.
+- **[Prime Intellect](https://www.primeintellect.ai/)** — Compute sponsorship for the original work
+- **[Cerebras](https://www.cerebras.net/)** — [REAP methodology](https://arxiv.org/abs/2510.13999) and implementation
+- **[Intel](https://github.com/intel/auto-round)** — AutoRound quantization framework
+- **[MiniMax](https://huggingface.co/MiniMaxAI)** — Base model (MiniMax-M2.1)
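
The REAP Methodology section of the README says experts are ranked by router activation patterns on calibration data and the least important ones are dropped. A minimal sketch of that idea follows. It assumes an expert's saliency is the mean, over calibration tokens routed to it, of the router gate weight times the L2 norm of the expert's output. This is one reading of the router-weighted criterion, not code from the Cerebras repository, and the record format and the function names `reap_saliency` and `select_experts` are hypothetical.

```python
import math
from collections import defaultdict


def reap_saliency(records):
    """Average router-weighted activation per expert.

    `records` is a hypothetical log collected on calibration data:
    one (expert_id, gate_weight, expert_output_vector) tuple for each
    (token, active expert) pair.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for expert_id, gate_weight, output in records:
        norm = math.sqrt(sum(v * v for v in output))  # L2 norm of the expert output
        totals[expert_id] += gate_weight * norm
        counts[expert_id] += 1
    return {e: totals[e] / counts[e] for e in totals}


def select_experts(saliency, num_experts, keep_ratio):
    """Return the sorted ids of the experts to keep (highest saliency first)."""
    n_keep = max(1, round(num_experts * keep_ratio))
    # Experts that were never routed to get a saliency of 0.0 (an assumption).
    scored = [(saliency.get(e, 0.0), e) for e in range(num_experts)]
    ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return sorted(e for _, e in ranked[:n_keep])


# Toy calibration log for a layer with 4 experts; expert 3 is never activated.
records = [
    (0, 0.9, [3.0, 4.0]),
    (0, 0.5, [0.0, 2.0]),
    (1, 0.1, [1.0, 0.0]),
    (2, 0.6, [0.0, 5.0]),
]
saliency = reap_saliency(records)
kept = select_experts(saliency, num_experts=4, keep_ratio=0.5)
print(kept)
```

With `keep_ratio=0.5` the toy layer keeps two of its four experts, mirroring the 128/256 retention listed in the property table. A real run would apply this per MoE layer over the 2098 calibration samples.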