Instructions for using 0xSero/MiniMax-M2.1-139B with libraries and local apps.

Libraries

How to use 0xSero/MiniMax-M2.1-139B with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="0xSero/MiniMax-M2.1-139B", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("0xSero/MiniMax-M2.1-139B", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("0xSero/MiniMax-M2.1-139B", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
	messages,
	add_generation_prompt=True,
	tokenize=True,
	return_dict=True,
	return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use 0xSero/MiniMax-M2.1-139B with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "0xSero/MiniMax-M2.1-139B"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "0xSero/MiniMax-M2.1-139B",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
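The same request can also be sent from Python. The following is a minimal sketch using only the standard library, assuming the vLLM server started above is listening on `localhost:8000`. The helper names `build_chat_payload` and `chat` are illustrative and not part of vLLM.

```python
import json
import urllib.request

MODEL_ID = "0xSero/MiniMax-M2.1-139B"


def build_chat_payload(prompt: str, model: str = MODEL_ID) -> dict:
    """Build the JSON body for an OpenAI-compatible chat completion request."""
    return {"model": model, "messages": [{"role": "user", "content": prompt}]}


def chat(prompt: str, base_url: str = "http://localhost:8000") -> str:
    """POST the prompt to /v1/chat/completions and return the reply text."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_chat_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    # OpenAI-compatible servers return the reply under choices[0].message.content.
    return body["choices"][0]["message"]["content"]


# Example call (requires a running server):
# print(chat("What is the capital of France?"))
```

Pass a different `base_url` to target a server on another host or port.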

Use Docker

docker model run hf.co/0xSero/MiniMax-M2.1-139B

SGLang

How to use 0xSero/MiniMax-M2.1-139B with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "0xSero/MiniMax-M2.1-139B" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "0xSero/MiniMax-M2.1-139B",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "0xSero/MiniMax-M2.1-139B" \
        --host 0.0.0.0 \
        --port 30000

Docker Model Runner
How to use 0xSero/MiniMax-M2.1-139B with Docker Model Runner:
```
docker model run hf.co/0xSero/MiniMax-M2.1-139B
```


MiniMax-M2.1-139B

REAP-pruned MiniMaxAI/MiniMax-M2.1.

At a glance


Base model	MiniMaxAI/MiniMax-M2.1
Format	BF16
Total params	139B
Active / token	—
Experts / layer	154
Layers	62
Hidden size	3072
Context	196,608
On-disk size	140 GB

Which variant should I pick?

Variant	Format	Link
`MiniMax-M2.1-139B` (this)	BF16	link
`MiniMax-M2.1-162B`	BF16	link

MiniMax-M2.1 with 40% of its experts pruned using REAP (Router-weighted Expert Activation Pruning).

Property	Value
Base Model	MiniMaxAI/MiniMax-M2.1
Parameters	~139B
Experts	154/256 (60% retained)
Architecture	MoE (Mixture of Experts)
Precision	BF16
VRAM Required	~278GB
Stability	0 loops in stress tests

Stress Test Results

Tested at 4 temperatures (0.0, 0.2, 0.7, 1.0) across 6 prompt types (24 total tests):

Temperature	math_word	reasoning	code	json	instruction	creative
0.0	OK	OK	OK	OK	OK	OK
0.2	OK	OK	OK	OK	OK	OK
0.7	OK	OK	OK	OK	OK	OK
1.0	OK	OK	OK	OK	OK	OK

Result: 24/24 tests passed, 0 loops detected
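The card does not describe how loops were detected. As an illustration only, the sketch below shows one plausible check: it flags output in which a short word sequence repeats back-to-back several times, and it builds the 4 x 6 test grid from the table above. The function name and the `max_period` and `min_repeats` thresholds are assumptions, not the author's test harness.

```python
TEMPERATURES = [0.0, 0.2, 0.7, 1.0]
PROMPT_TYPES = ["math_word", "reasoning", "code", "json", "instruction", "creative"]

# One test per (temperature, prompt type) pair: 4 x 6 = 24 tests.
TEST_GRID = [(t, p) for t in TEMPERATURES for p in PROMPT_TYPES]


def has_repetition_loop(text: str, max_period: int = 20, min_repeats: int = 4) -> bool:
    """Return True if some sequence of up to max_period words repeats
    consecutively at least min_repeats times anywhere in the text."""
    words = text.split()
    for period in range(1, max_period + 1):
        needed = period * min_repeats
        for start in range(len(words) - needed + 1):
            window = words[start:start + needed]
            # The window is periodic if every word equals the word one period earlier.
            if all(window[i] == window[i % period] for i in range(needed)):
                return True
    return False
```

A real harness would generate a completion for each entry in `TEST_GRID` and count the outputs for which `has_repetition_loop` returns `True`.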

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "0xSero/MiniMax-M2.1-139B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "0xSero/MiniMax-M2.1-139B",
    trust_remote_code=True,
)

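# Format the conversation with the model's chat template, then tokenize it.
messages = [{"role": "user", "content": "Write a Python function to calculate fibonacci numbers."}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

# Sample up to 512 new tokens.
outputs = model.generate(**inputs, max_new_tokens=512, temperature=0.7, do_sample=True)
# Decode only the newly generated tokens so the prompt is not echoed back.
prompt_length = inputs["input_ids"].shape[-1]
response = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)
print(response)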

DynamicCache Compatibility Fix (transformers 4.55+)

If you encounter TypeError: CacheLayerMixin.__init__() got an unexpected keyword argument, apply this patch before loading the model:

from transformers import cache_utils

# Keep a reference to the original constructor so the patch can delegate to it.
_orig = cache_utils.DynamicCache.__init__

def _patched(self, *args, **kwargs):
    cfg = kwargs.get("config")
    # Only intercept caches created for MiniMax models; other models are unaffected.
    if cfg is not None and "minimax" in str(getattr(cfg, "model_type", "")):
        # Build a plain empty cache, discarding config, max_cache_len,
        # max_batch_size, and any other arguments.
        return _orig(self, None)
    return _orig(self, *args, **kwargs)

cache_utils.DynamicCache.__init__ = _patched

Model Comparison

Model	Experts	Loops	Size	Status
MiniMax-M2.1-REAP-20	204	1	185B	Deprecated
MiniMax-M2.1-REAP-30	180	0	162B	Recommended
MiniMax-M2.1-REAP-40	154	0	139B	Recommended
MiniMax-M2.1-REAP-50	128	2	116B	Deprecated
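The expert counts in the table can be related back to the 256 experts per layer of the base model (the figure given in the property table). A small sketch of that arithmetic follows; the variable and function names are illustrative.

```python
TOTAL_EXPERTS = 256  # experts per layer in the unpruned base model

# Experts kept per layer for each variant, copied from the comparison table.
VARIANTS = {"REAP-20": 204, "REAP-30": 180, "REAP-40": 154, "REAP-50": 128}


def retained_fraction(experts_kept: int, total: int = TOTAL_EXPERTS) -> float:
    """Fraction of the base model's experts that a variant keeps."""
    return experts_kept / total


for name, kept in VARIANTS.items():
    frac = retained_fraction(kept)
    print(f"{name}: {kept}/{TOTAL_EXPERTS} kept ({frac:.1%} retained, {1 - frac:.1%} pruned)")
```

The retained fractions land close to, but not exactly on, the nominal 80/70/60/50% targets because the expert count per layer has to be a whole number.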

Quantized Versions

MiniMax-M2.1-REAP-40-W4A16 (Coming Soon) - 4-bit weights, ~58GB VRAM

Why 40% Pruning?

The 40% pruning ratio offers the best balance of:

Size reduction: 139B vs ~230B for the unpruned base model (about 40% smaller)
VRAM savings: ~278GB vs ~460GB in BF16 (weights fit on 4x H100 80GB)
Stability: 0 loops in comprehensive stress testing
Performance: Minimal quality degradation from strategic expert selection
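The ~278GB figure follows from the parameter count: BF16 stores each parameter in 2 bytes. The sketch below reproduces that estimate and checks it against a 4x 80GB setup. It counts weights only, so KV cache and activation memory are extra. The function names are illustrative.

```python
def weight_memory_gb(params_billion: float, bits_per_param: int = 16) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes).

    Ignores KV cache, activations, and framework overhead.
    """
    return params_billion * bits_per_param / 8


def weights_fit(params_billion: float, num_gpus: int, gpu_memory_gb: float,
                bits_per_param: int = 16) -> bool:
    """True if the weights alone fit in the combined GPU memory."""
    return weight_memory_gb(params_billion, bits_per_param) <= num_gpus * gpu_memory_gb


bf16_gb = weight_memory_gb(139)      # 139B parameters x 2 bytes each
headroom_gb = 4 * 80 - bf16_gb       # memory left on 4x H100 80GB for KV cache etc.
print(f"BF16 weights: {bf16_gb:.0f} GB, headroom on 4x80GB: {headroom_gb:.0f} GB")
```

The headroom is what remains for the KV cache, so long-context workloads may need more GPUs than the weight size alone suggests.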

REAP Methodology

REAP (Router-weighted Expert Activation Pruning) runs calibration data through the model and scores each expert by combining its router gate values with the magnitude of its outputs, then prunes the lowest-scoring experts in each layer. Unlike random or magnitude-based pruning, REAP preserves the experts that contribute most to the outputs actually produced during inference.

Calibration Dataset: 2098 samples

pile-10k: 498 samples (general text)
evol-codealpaca: 800 samples (code generation)
xlam-function-calling: 800 samples (function calling)
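To make the scoring idea concrete, here is a simplified, pure-Python sketch of a router-weighted saliency score in the spirit of the REAP paper: for each expert, average the gate weight times the L2 norm of the expert's output over the tokens routed to it, then keep the highest-scoring experts. The record format, function names, and toy numbers are assumptions for illustration. The actual pipeline operates per MoE layer on real hidden states from the calibration set.

```python
import math
from collections import defaultdict


def reap_saliency(routing_records):
    """Mean of (gate weight * L2 norm of expert output) per expert.

    routing_records: iterable of (expert_id, gate_weight, expert_output) tuples,
    one per (token, selected expert) pair seen on the calibration data.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for expert_id, gate_weight, output in routing_records:
        norm = math.sqrt(sum(x * x for x in output))
        totals[expert_id] += gate_weight * norm
        counts[expert_id] += 1
    return {e: totals[e] / counts[e] for e in totals}


def experts_to_keep(saliency, num_experts, keep):
    """Return the ids of the `keep` highest-saliency experts.

    Experts that were never routed to get a score of 0.
    """
    scored = sorted(((saliency.get(e, 0.0), e) for e in range(num_experts)), reverse=True)
    return sorted(e for _, e in scored[:keep])


# Toy calibration trace for a layer with 4 experts (expert 3 is never selected).
records = [
    (0, 0.9, [3.0, 4.0]),
    (1, 0.1, [0.3, 0.4]),
    (2, 0.5, [0.0, 2.0]),
    (0, 0.7, [6.0, 8.0]),
]
kept = experts_to_keep(reap_saliency(records), num_experts=4, keep=2)
print(kept)
```

In this toy trace, experts 0 and 2 have the largest gate-weighted output norms, so they are the two that survive pruning.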

License & citation

License inherited from the base model.

@misc{lasby2025reap,
  title  = {REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression},
  author = {Mike Lasby and Ivan Lazarevich and Nish Sinnadurai and Sean Lie and Yani Ioannou and Vithursan Thangarasa},
  year   = {2025}, eprint = {2510.13999}, archivePrefix = {arXiv}
}
