Commit 190fb90 by 0xSero · verified · 1 parent: f530ed1

Initial: Professional model card

Files changed (1)
  1. README.md +119 -0
README.md ADDED
@@ -0,0 +1,119 @@
+ ---
+ license: apache-2.0
+ base_model: MiniMaxAI/MiniMax-M2.1
+ tags:
+ - minimax
+ - moe
+ - reap
+ - pruned
+ - text-generation
+ library_name: transformers
+ pipeline_tag: text-generation
+ ---
+
+ # MiniMax-M2.1-REAP-40
+
+ **40% expert-pruned MiniMax-M2.1 using REAP (Router-weighted Expert Activation Pruning)**
+
+ | Property | Value |
+ |----------|-------|
+ | Base Model | [MiniMaxAI/MiniMax-M2.1](https://huggingface.co/MiniMaxAI/MiniMax-M2.1) |
+ | Parameters | ~139B |
+ | Experts | 154/256 (60% retained) |
+ | Architecture | MoE (Mixture of Experts) |
+ | Precision | BF16 |
+ | VRAM Required | ~278GB (BF16 weights) |
+ | Stability | **0 loops** in stress tests |
+
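The VRAM figure in the table can be reproduced with back-of-the-envelope arithmetic: BF16 stores each parameter in 2 bytes, so weight memory is roughly the parameter count times two. The sketch below is illustrative only. The helper name `weight_memory_gb` is hypothetical, and it counts weights alone (no KV cache, activations, or framework overhead).

```python
def weight_memory_gb(params_billion: float, bytes_per_param: float = 2.0) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes); BF16 uses 2 bytes per parameter."""
    return params_billion * bytes_per_param


reap40_gb = weight_memory_gb(139)  # ~139B parameters stored in BF16
total_gpu_gb = 4 * 80              # four 80GB GPUs, as mentioned later in this card

print(f"weights: ~{reap40_gb:.0f}GB, headroom on 4x80GB: ~{total_gpu_gb - reap40_gb:.0f}GB")
```

Whatever headroom remains has to cover the KV cache and activations, so long contexts or large batches may still need more memory than the weights-only estimate suggests.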
+ ## Stress Test Results
+
+ Tested at 4 temperatures (0.0, 0.2, 0.7, 1.0) across 6 prompt types (24 total tests):
+
+ | Temperature | math_word | reasoning | code | json | instruction | creative |
+ |-------------|-----------|-----------|------|------|-------------|----------|
+ | 0.0 | OK | OK | OK | OK | OK | OK |
+ | 0.2 | OK | OK | OK | OK | OK | OK |
+ | 0.7 | OK | OK | OK | OK | OK | OK |
+ | 1.0 | OK | OK | OK | OK | OK | OK |
+
+ **Result: 24/24 tests passed, 0 loops detected**
+
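This card does not say how a loop is detected. One common heuristic, shown here purely as an assumption, is to flag an output whose tail is the same n-gram repeated several times in a row. The function name `has_loop` and its thresholds are hypothetical and are not taken from the actual test harness.

```python
def has_loop(tokens: list[str], max_ngram: int = 8, min_repeats: int = 4) -> bool:
    """Return True if the sequence ends with one n-gram repeated min_repeats times."""
    for n in range(1, max_ngram + 1):
        span = n * min_repeats
        if len(tokens) < span:
            break  # longer n-grams cannot fit either
        tail = tokens[-span:]
        unit = tail[:n]
        # The tail is a loop if every consecutive n-sized chunk equals the first chunk.
        if all(tail[i:i + n] == unit for i in range(0, span, n)):
            return True
    return False


healthy = "the answer is forty two".split()
looping = "the answer is is is is is".split()
print(has_loop(healthy), has_loop(looping))  # False True
```

A real harness would more likely operate on token IDs than on whitespace-split words, and the thresholds would need tuning so that legitimate repetition (for example in JSON or code output) is not flagged.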
+ ## Usage
+
+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+ import torch
+
+ # Load the BF16 weights and shard them across all visible GPUs.
+ model = AutoModelForCausalLM.from_pretrained(
+     "0xSero/MiniMax-M2.1-REAP-40",
+     torch_dtype=torch.bfloat16,
+     device_map="auto",
+     trust_remote_code=True,
+ )
+ tokenizer = AutoTokenizer.from_pretrained(
+     "0xSero/MiniMax-M2.1-REAP-40",
+     trust_remote_code=True,
+ )
+
+ # Build the prompt with the model's chat template.
+ messages = [{"role": "user", "content": "Write a Python function to calculate fibonacci numbers."}]
+ text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+ inputs = tokenizer(text, return_tensors="pt").to(model.device)
+
+ outputs = model.generate(**inputs, max_new_tokens=512, temperature=0.7, do_sample=True)
+ # Decode only the newly generated tokens, not the echoed prompt.
+ new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
+ response = tokenizer.decode(new_tokens, skip_special_tokens=True)
+ print(response)
+ ```
+
+ ## DynamicCache Compatibility Fix (transformers 4.55+)
+
+ If you encounter `TypeError: CacheLayerMixin.__init__() got an unexpected keyword argument`, apply this patch before loading the model:
+
+ ```python
+ from transformers import cache_utils
+
+ # Keep a reference to the original constructor so other models are unaffected.
+ _orig = cache_utils.DynamicCache.__init__
+
+ def _patched(self, *args, **kwargs):
+     cfg = kwargs.get("config")
+     # For MiniMax configs only, drop the keyword arguments that trigger the TypeError.
+     if cfg is not None and "minimax" in str(getattr(cfg, "model_type", "")):
+         kwargs.pop("config", None)
+         kwargs.pop("max_cache_len", None)
+         kwargs.pop("max_batch_size", None)
+         return _orig(self, None)
+     return _orig(self, *args, **kwargs)
+
+ cache_utils.DynamicCache.__init__ = _patched
+ ```
+
+ ## Model Comparison
+
+ | Model | Experts | Loops | Size | Status |
+ |-------|---------|-------|------|--------|
+ | [MiniMax-M2.1-REAP-20](https://huggingface.co/0xSero/MiniMax-M2.1-REAP-20-REPAIR-IN-PROGRESS) | 204 | 1 | 185B | Deprecated |
+ | [MiniMax-M2.1-REAP-30](https://huggingface.co/0xSero/MiniMax-M2.1-REAP-30) | 180 | 0 | 162B | Recommended |
+ | **MiniMax-M2.1-REAP-40** | **154** | **0** | **139B** | **Recommended** |
+ | [MiniMax-M2.1-REAP-50](https://huggingface.co/0xSero/MiniMax-M2.1-REAP-50-REPAIR-IN-PROGRESS) | 128 | 2 | 116B | Deprecated |
+
+ ## Quantized Versions
+
+ - **MiniMax-M2.1-REAP-40-W4A16** (Coming Soon) - 4-bit weights, ~58GB VRAM
+
+ ## Why 40% Pruning?
+
+ Among the pruning ratios tested (20% to 50%), 40% is the smallest variant that showed 0 loops. It balances:
+ - **Size reduction**: 139B vs ~230B original (~40% smaller)
+ - **VRAM savings**: ~278GB vs ~460GB in BF16 (weights fit on 4x H100 80GB)
+ - **Stability**: 0 loops across the 24-test stress suite
+ - **Performance**: REAP keeps the experts the router relies on most, which is intended to limit quality degradation
+
+ ## REAP Methodology
+
+ REAP (Router-weighted Expert Activation Pruning) scores each expert on calibration data using router activation patterns and removes the lowest-scoring ones. Unlike random or magnitude-based pruning, REAP preserves the experts that are actually used during inference.
+
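As a rough illustration of router-weighted scoring, the sketch below ranks experts by the average of (router gate weight × expert output norm) over the calibration tokens routed to them, then keeps the top-scoring ones. This is a simplified assumption about the scoring, not the Cerebras implementation. The names `expert_saliency` and `select_experts` and the toy records are hypothetical.

```python
from collections import defaultdict


def expert_saliency(records):
    """records: iterable of (expert_id, gate_weight, output_norm), one per routed token."""
    totals, counts = defaultdict(float), defaultdict(int)
    for expert_id, gate_weight, output_norm in records:
        totals[expert_id] += gate_weight * output_norm
        counts[expert_id] += 1
    # Mean router-weighted activation per expert, over tokens where it was active.
    return {e: totals[e] / counts[e] for e in totals}


def select_experts(saliency, num_experts, keep):
    """Keep the `keep` highest-saliency experts; experts never routed to score 0."""
    scores = [(saliency.get(e, 0.0), e) for e in range(num_experts)]
    kept = sorted(scores, reverse=True)[:keep]
    return sorted(e for _, e in kept)


# Toy calibration trace for a 4-expert layer (made-up numbers).
records = [(0, 0.9, 2.0), (0, 0.7, 2.0), (1, 0.1, 1.0), (2, 0.5, 3.0), (2, 0.3, 1.0)]
saliency = expert_saliency(records)
print(select_experts(saliency, num_experts=4, keep=2))  # [0, 2]
```

In this card's terms, the same selection would keep 154 of 256 experts, with the calibration trace coming from the dataset mix listed in this section.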
+ **Calibration Dataset**: 2098 samples
+ - pile-10k: 498 samples (general text)
+ - evol-codealpaca: 800 samples (code generation)
+ - xlam-function-calling: 800 samples (function calling)
+
+ ## Acknowledgments
+
+ - Sponsored by [Prime Intellect](https://www.primeintellect.ai/)
+ - REAP implementation by [Cerebras](https://github.com/Cerebras/reap)
+ - Base model by [MiniMax](https://huggingface.co/MiniMaxAI)