MiniMax-M2.1-REAP-30

30% expert-pruned MiniMax-M2.1 using REAP (Router-weighted Expert Activation Pruning)

| Property | Value |
|---|---|
| Base Model | MiniMaxAI/MiniMax-M2.1 |
| Parameters | ~162B |
| Experts | 180/256 (70% retained) |
| Architecture | MoE (Mixture of Experts) |
| Precision | BF16 |
| VRAM Required | ~324GB |
| Stability | 0 loops in stress tests |
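The VRAM figure above is a weights-only estimate: 162B parameters at 2 bytes each in BF16. A quick back-of-the-envelope check (KV cache and activations add more on top):

```python
# Weights-only memory estimate for a 162B-parameter BF16 model.
params = 162e9
bytes_per_param = 2  # BF16 is 16 bits
vram_gb = params * bytes_per_param / 1e9
print(round(vram_gb))  # 324
```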

Stress Test Results

Tested at 4 temperatures (0.0, 0.2, 0.7, 1.0) across 6 prompt types (24 total tests):

| Temperature | math_word | reasoning | code | json | instruction | creative |
|---|---|---|---|---|---|---|
| 0.0 | OK | OK | OK | OK | OK | OK |
| 0.2 | OK | OK | OK | OK | OK | OK |
| 0.7 | OK | OK | OK | OK | OK | OK |
| 1.0 | OK | OK | OK | OK | OK | OK |

Result: 24/24 tests passed, 0 loops detected

Extended High-Temperature Testing

Additional tests at temperatures 0.5, 0.8, 0.9, 1.2 (results in stress_test_results.json).
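A "loop" in these tests means the model falls into verbatim repetition at the end of a generation. As a minimal sketch (not the harness that produced the results above), a detector can check whether the final n-gram repeats back-to-back at the tail of the output:

```python
def has_tail_loop(tokens, n=4, min_repeats=3):
    """Return True if the last n tokens repeat consecutively at least
    min_repeats times at the end of the sequence."""
    if len(tokens) < n * min_repeats:
        return False
    tail = tokens[-n:]
    reps = 1
    i = len(tokens) - 2 * n
    while i >= 0 and tokens[i:i + n] == tail:
        reps += 1
        i -= n
    return reps >= min_repeats

print(has_tail_loop(("I am very sorry " * 6).split()))                    # True
print(has_tail_loop("quantum computers use qubits to represent state".split()))  # False
```

In practice the check would run on tokenizer IDs rather than whitespace-split words, and over several n-gram sizes, but the idea is the same.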

Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "0xSero/MiniMax-M2.1-REAP-30",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "0xSero/MiniMax-M2.1-REAP-30",
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "Explain quantum computing in simple terms."}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=512, temperature=0.7, do_sample=True)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```

DynamicCache Compatibility Fix (transformers 4.55+)

If you encounter `TypeError: CacheLayerMixin.__init__() got an unexpected keyword argument`, apply this monkey-patch before loading the model:

```python
from transformers import cache_utils

# Patch DynamicCache so MiniMax configs don't pass keyword arguments
# that the newer cache API no longer accepts.
_orig = cache_utils.DynamicCache.__init__

def _patched(self, *args, **kwargs):
    cfg = kwargs.get("config")
    if cfg and "minimax" in str(getattr(cfg, "model_type", "")):
        # Drop the incompatible kwargs and fall back to a plain cache.
        kwargs.pop("config", None)
        kwargs.pop("max_cache_len", None)
        kwargs.pop("max_batch_size", None)
        return _orig(self, None)
    return _orig(self, *args, **kwargs)

cache_utils.DynamicCache.__init__ = _patched
```

Model Comparison

| Model | Experts | Loops | Size | Status |
|---|---|---|---|---|
| MiniMax-M2.1-REAP-20 | 204 | 1 | 185B | Deprecated |
| MiniMax-M2.1-REAP-30 | 180 | 0 | 162B | Recommended |
| MiniMax-M2.1-REAP-40 | 154 | 0 | 139B | Recommended |
| MiniMax-M2.1-REAP-50 | 128 | 2 | 116B | Deprecated |

Quantized Versions

  • MiniMax-M2.1-REAP-40-W4A16 (Coming Soon) - 4-bit weights, ~58GB

REAP Methodology

REAP (Router-weighted Expert Activation Pruning) uses calibration data to identify which experts are most important based on router activation patterns. Unlike random or magnitude-based pruning, REAP preserves the experts that are actually used during inference.

Calibration Dataset: 2098 samples

  • pile-10k: 498 samples (general text)
  • evol-codealpaca: 800 samples (code generation)
  • xlam-function-calling: 800 samples (function calling)
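In spirit, the scoring step can be sketched as follows. This is a toy NumPy illustration, not the actual REAP implementation: here each expert's saliency is approximated by summing its router gate values over the calibration tokens, and the lowest-scoring 30% of experts are pruned.

```python
import numpy as np

def reap_saliency(gate_weights):
    # gate_weights: (num_tokens, num_experts) router gate values collected
    # during a calibration forward pass; an expert's saliency is its
    # accumulated routing weight.
    return gate_weights.sum(axis=0)

def select_experts(saliency, keep_frac=0.70):
    # Keep the top keep_frac of experts by saliency; the rest are pruned.
    k = int(round(len(saliency) * keep_frac))
    return set(np.argsort(saliency)[-k:].tolist())

rng = np.random.default_rng(0)
gates = rng.random((1000, 10))  # toy: 1000 calibration tokens, 10 experts
gates[:, 0] = 0.0               # expert 0 is never routed to
kept = select_experts(reap_saliency(gates))
print(len(kept), 0 in kept)  # 7 False
```

An expert the router never selects accumulates no weight and is the first to be pruned, which is the intuition behind preferring router-weighted scores over random or magnitude-based criteria.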

