plezan commited on
Commit
da5d699
·
1 Parent(s): 8cffdc1

fix readme

Browse files
Files changed (2) hide show
  1. README.MD +0 -129
  2. README.md +129 -3
README.MD DELETED
@@ -1,129 +0,0 @@
1
- ---
2
- license: apache-2.0
3
- base_model: MiniMaxAI/MiniMax-M2.1
4
- tags:
5
- - minimax
6
- - moe
7
- - reap
8
- - pruned
9
- - cerebras
10
- - quantized
11
- - gptq
12
- - autoround
13
- - 4bit
14
- - text-generation
15
- library_name: transformers
16
- pipeline_tag: text-generation
17
- ---
18
-
19
- <p align="center">
20
- <em>𓌳 <strong>REAP</strong>𓌳 the Experts: Why Pruning Prevails for One-Shot MoE Compression</em><br>
21
- <a href="https://arxiv.org/abs/2510.13999">📄 Paper</a> • <a href="https://github.com/CerebrasResearch/reap">💻 Code</a> • <a href="https://www.cerebras.ai/blog/reap">📝 Blog</a>
22
- </p>
23
-
24
- # MiniMax-M2.1-REAP-50-W4A16
25
-
26
- > ⚠️ **Note**: This is a **re-upload of 0xSero's quantized and pruned MiniMax-M2.1-REAP-50-W4A16 model**.
27
- > The original creator ([0xSero](https://huggingface.co/0xSero)) has explicitly authorized this re-upload.
28
- > All credit for the quantization and pruning work goes to 0xSero.
29
-
30
- ## ✨ Highlights
31
-
32
- **50% Expert-Pruned + INT4 Quantized** — Double compression for efficient deployment.
33
-
34
- - **REAP + AutoRound**: Expert pruning + weight quantization
35
- - **Optimized for Code & Tools**: Calibrated on code generation and function calling
36
- - **Lower VRAM**: Fits on 96GB of VRAM
37
-
38
-
39
- **50% expert-pruned MiniMax-M2.1 using REAP (Router-weighted Expert Activation Pruning)**
40
-
41
- | Property | Value |
42
- |----------|-------|
43
- | Base Model | [MiniMaxAI/MiniMax-M2.1](https://huggingface.co/MiniMaxAI/MiniMax-M2.1) |
44
- | **After REAP 50%** | ~116B |
45
- | Experts | 128/256 (50% retained) |
46
- | Architecture | MoE (Mixture of Experts) |
47
- | **Quantization** | INT4 weights, FP16 activations |
48
- | **Format** | GPTQ (AutoRound) |
49
- | Disk Size | 62.6GB |
50
- | (Un)Stability | **2 loops** in stress tests |
51
-
52
- ## Stress Test Results
53
-
54
- Tested at 4 temperatures (0.0, 0.2, 0.7, 1.0) across 6 prompt types (24 total tests): [MiniMax-M2.1 REAP Stress Test Observations ](https://huggingface.co/datasets/0xSero/minimax-m2.1-reap-observations)
55
-
56
- | Temperature | math_word | reasoning | code | json | instruction | creative |
57
- |-------------|-----------|-----------|------|------|-------------|----------|
58
- | 0.0 | **Loop** | OK | OK | OK | OK | OK |
59
- | 0.2 | **Loop** | OK | OK | OK | OK | OK |
60
- | 0.7 | OK | OK | OK | OK | OK | OK |
61
- | 1.0 | OK | OK | OK | OK | OK | OK |
62
-
63
- **Result: 24/24 tests passed, 2 loops detected**
64
-
65
-
66
- ## 🚀 Deployment
67
-
68
- ### vLLM (Recommended)
69
-
70
- ```bash
71
- vllm serve plezan/MiniMax-M2.1-REAP-50-W4A16 \
72
- --tensor-parallel-size 4 \
73
- --trust-remote-code \
74
- --quantization gptq
75
- ```
76
-
77
- ### Transformers
78
-
79
- ```python
80
- from transformers import AutoModelForCausalLM, AutoTokenizer
81
-
82
- model = AutoModelForCausalLM.from_pretrained(
83
- "plezan/MiniMax-M2.1-REAP-50-W4A16",
84
- device_map="auto",
85
- trust_remote_code=True
86
- )
87
- tokenizer = AutoTokenizer.from_pretrained("plezan/MiniMax-M2.1-REAP-50-W4A16", trust_remote_code=True)
88
- ```
89
-
90
-
91
- ## Why 50% Pruning?
92
-
93
- The 50% pruning ratio offers a balance of:
94
- - **Size reduction**: 116B vs 456B original (75% smaller)
95
- - **Performance**: Minimal quality degradation from strategic expert selection
96
- - **At the cost of Stability**: 2 loops in comprehensive stress testing
97
-
98
- Using a 40% runing ratio would offers an overal better balance.
99
-
100
- ## Model Comparison
101
-
102
- | Model | Experts | Loops | Size | Status |
103
- |-------|---------|-------|------|--------|
104
- | [MiniMax-M2.1-REAP-20](https://huggingface.co/0xSero/MiniMax-M2.1-REAP-20-REPAIR-IN-PROGRESS) | 204 | 1 | 185B | Deprecated |
105
- | [MiniMax-M2.1-REAP-30](https://huggingface.co/0xSero/MiniMax-M2.1-REAP-30) | 180 | 0 | 162B | Recommended |
106
- | [MiniMax-M2.1-REAP-40](https://huggingface.co/0xSero/MiniMax-M2.1-REAP-40) | 154 | 0 | 139B | Recommended |
107
- | [MiniMax-M2.1-REAP-50](https://huggingface.co/0xSero/MiniMax-M2.1-REAP-50-REPAIR-IN-PROGRESS) | 128 | 2 | 116B | Deprecated |
108
-
109
-
110
- > **Note**: Links in the table above point to the original models on 0xSero's account, some of them were removed by the creator. This re-upload preserves the 50% pruned + **quantized** version with authorization.
111
-
112
- ## REAP Methodology
113
-
114
- REAP (Router-weighted Expert Activation Pruning) uses calibration data to identify which experts are most important based on router activation patterns. Unlike random or magnitude-based pruning, REAP preserves the experts that are actually used during inference.
115
-
116
- **Calibration Dataset**: 2098 samples
117
- - pile-10k: 498 samples (general text)
118
- - evol-codealpaca: 800 samples (code generation)
119
- - xlam-function-calling: 800 samples (function calling)
120
-
121
- ## 🙏 Acknowledgments
122
-
123
- This model is derivative work based on extensive research and development by:
124
-
125
- - **[0xSero](https://huggingface.co/0xSero)** — Original quantization (GPTQ/AutoRound) and REAP pruning of MiniMax-M2.1. This re-upload is posted with explicit authorization from 0xSero.
126
- - **[Prime Intellect](https://www.primeintellect.ai/)** — Compute sponsorship for the original work
127
- - **[Cerebras](https://www.cerebras.net/)** — [REAP methodology](https://arxiv.org/abs/2510.13999) and implementation
128
- - **[Intel](https://github.com/intel/auto-round)** — AutoRound quantization framework
129
- - **[MiniMax](https://huggingface.co/MiniMaxAI)** — Base model (MiniMax-M2.1)
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
README.md CHANGED
@@ -1,3 +1,129 @@
1
- ---
2
- license: apache-2.0
3
- ---
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ ---
2
+ license: apache-2.0
3
+ base_model: MiniMaxAI/MiniMax-M2.1
4
+ tags:
5
+ - minimax
6
+ - moe
7
+ - reap
8
+ - pruned
9
+ - cerebras
10
+ - quantized
11
+ - gptq
12
+ - autoround
13
+ - 4bit
14
+ - text-generation
15
+ library_name: transformers
16
+ pipeline_tag: text-generation
17
+ ---
18
+
19
+ <p align="center">
20
+ <em>𓌳 <strong>REAP</strong>𓌳 the Experts: Why Pruning Prevails for One-Shot MoE Compression</em><br>
21
+ <a href="https://arxiv.org/abs/2510.13999">📄 Paper</a> • <a href="https://github.com/CerebrasResearch/reap">💻 Code</a> • <a href="https://www.cerebras.ai/blog/reap">📝 Blog</a>
22
+ </p>
23
+
24
+ # MiniMax-M2.1-REAP-50-W4A16
25
+
26
+ > ⚠️ **Note**: This is a **re-upload of 0xSero's quantized and pruned MiniMax-M2.1-REAP-50-W4A16 model**.
27
+ > The original creator ([0xSero](https://huggingface.co/0xSero)) has explicitly authorized this re-upload.
28
+ > All credit for the quantization and pruning work goes to 0xSero.
29
+
30
+ ## ✨ Highlights
31
+
32
+ **50% Expert-Pruned + INT4 Quantized** — Double compression for efficient deployment.
33
+
34
+ - **REAP + AutoRound**: Expert pruning + weight quantization
35
+ - **Optimized for Code & Tools**: Calibrated on code generation and function calling
36
+ - **Lower VRAM**: Fits on 96GB of VRAM
37
+
38
+
39
+ **50% expert-pruned MiniMax-M2.1 using REAP (Router-weighted Expert Activation Pruning)**
40
+
41
+ | Property | Value |
42
+ |----------|-------|
43
+ | Base Model | [MiniMaxAI/MiniMax-M2.1](https://huggingface.co/MiniMaxAI/MiniMax-M2.1) |
44
+ | **After REAP 50%** | ~116B |
45
+ | Experts | 128/256 (50% retained) |
46
+ | Architecture | MoE (Mixture of Experts) |
47
+ | **Quantization** | INT4 weights, FP16 activations |
48
+ | **Format** | GPTQ (AutoRound) |
49
+ | Disk Size | 62.6GB |
50
+ | (Un)Stability | **2 loops** in stress tests |
51
+
52
+ ## Stress Test Results
53
+
54
+ Tested at 4 temperatures (0.0, 0.2, 0.7, 1.0) across 6 prompt types (24 total tests): [MiniMax-M2.1 REAP Stress Test Observations ](https://huggingface.co/datasets/0xSero/minimax-m2.1-reap-observations)
55
+
56
+ | Temperature | math_word | reasoning | code | json | instruction | creative |
57
+ |-------------|-----------|-----------|------|------|-------------|----------|
58
+ | 0.0 | **Loop** | OK | OK | OK | OK | OK |
59
+ | 0.2 | **Loop** | OK | OK | OK | OK | OK |
60
+ | 0.7 | OK | OK | OK | OK | OK | OK |
61
+ | 1.0 | OK | OK | OK | OK | OK | OK |
62
+
63
+ **Result: 24/24 tests passed, 2 loops detected**
64
+
65
+
66
+ ## 🚀 Deployment
67
+
68
+ ### vLLM (Recommended)
69
+
70
+ ```bash
71
+ vllm serve plezan/MiniMax-M2.1-REAP-50-W4A16 \
72
+ --tensor-parallel-size 4 \
73
+ --trust-remote-code \
74
+ --quantization gptq
75
+ ```
76
+
77
+ ### Transformers
78
+
79
+ ```python
80
+ from transformers import AutoModelForCausalLM, AutoTokenizer
81
+
82
+ model = AutoModelForCausalLM.from_pretrained(
83
+ "plezan/MiniMax-M2.1-REAP-50-W4A16",
84
+ device_map="auto",
85
+ trust_remote_code=True
86
+ )
87
+ tokenizer = AutoTokenizer.from_pretrained("plezan/MiniMax-M2.1-REAP-50-W4A16", trust_remote_code=True)
88
+ ```
89
+
90
+
91
+ ## Why 50% Pruning?
92
+
93
+ The 50% pruning ratio offers a balance of:
94
+ - **Size reduction**: 116B vs 456B original (75% smaller)
95
+ - **Performance**: Minimal quality degradation from strategic expert selection
96
+ - **At the cost of Stability**: 2 loops in comprehensive stress testing
97
+
98
+ Using a 40% runing ratio would offers an overal better balance.
99
+
100
+ ## Model Comparison
101
+
102
+ | Model | Experts | Loops | Size | Status |
103
+ |-------|---------|-------|------|--------|
104
+ | [MiniMax-M2.1-REAP-20](https://huggingface.co/0xSero/MiniMax-M2.1-REAP-20-REPAIR-IN-PROGRESS) | 204 | 1 | 185B | Deprecated |
105
+ | [MiniMax-M2.1-REAP-30](https://huggingface.co/0xSero/MiniMax-M2.1-REAP-30) | 180 | 0 | 162B | Recommended |
106
+ | [MiniMax-M2.1-REAP-40](https://huggingface.co/0xSero/MiniMax-M2.1-REAP-40) | 154 | 0 | 139B | Recommended |
107
+ | [MiniMax-M2.1-REAP-50](https://huggingface.co/0xSero/MiniMax-M2.1-REAP-50-REPAIR-IN-PROGRESS) | 128 | 2 | 116B | Deprecated |
108
+
109
+
110
+ > **Note**: Links in the table above point to the original models on 0xSero's account, some of them were removed by the creator. This re-upload preserves the 50% pruned + **quantized** version with authorization.
111
+
112
+ ## REAP Methodology
113
+
114
+ REAP (Router-weighted Expert Activation Pruning) uses calibration data to identify which experts are most important based on router activation patterns. Unlike random or magnitude-based pruning, REAP preserves the experts that are actually used during inference.
115
+
116
+ **Calibration Dataset**: 2098 samples
117
+ - pile-10k: 498 samples (general text)
118
+ - evol-codealpaca: 800 samples (code generation)
119
+ - xlam-function-calling: 800 samples (function calling)
120
+
121
+ ## 🙏 Acknowledgments
122
+
123
+ This model is derivative work based on extensive research and development by:
124
+
125
+ - **[0xSero](https://huggingface.co/0xSero)** — Original quantization (GPTQ/AutoRound) and REAP pruning of MiniMax-M2.1. This re-upload is posted with explicit authorization from 0xSero.
126
+ - **[Prime Intellect](https://www.primeintellect.ai/)** — Compute sponsorship for the original work
127
+ - **[Cerebras](https://www.cerebras.net/)** — [REAP methodology](https://arxiv.org/abs/2510.13999) and implementation
128
+ - **[Intel](https://github.com/intel/auto-round)** — AutoRound quantization framework
129
+ - **[MiniMax](https://huggingface.co/MiniMaxAI)** — Base model (MiniMax-M2.1)