Libraries

How to use dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B-FP8 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

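# device_map="auto" (requires accelerate) spreads the ~140 GB of weights across the visible GPUs
pipe = pipeline(
    "text-generation",
    model="dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B-FP8",
    trust_remote_code=True,
    device_map="auto",
)
messages = [
    {"role": "user", "content": "Who are you?"},
]
print(pipe(messages))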

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B-FP8", trust_remote_code=True)
# device_map="auto" (requires accelerate) spreads the ~140 GB of weights across the visible GPUs
model = AutoModelForCausalLM.from_pretrained(
    "dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B-FP8",
    trust_remote_code=True,
    device_map="auto",
)
messages = [
    {"role": "user", "content": "Who are you?"},
]
# Render the chat template and tokenize in one step
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B-FP8 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
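# Start the vLLM server (2-way tensor parallelism, as in the Python example under Inference):
vllm serve "dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B-FP8" \
    --tensor-parallel-size 2 \
    --trust-remote-code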
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B-FP8",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B-FP8

SGLang

How to use dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B-FP8 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B-FP8" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B-FP8",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B-FP8" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B-FP8",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B-FP8 with Docker Model Runner:
```
docker model run hf.co/dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B-FP8
```

m51Lab-MiniMax-M2.7-REAP-139B-A10B-FP8

`v1.1.1` — router-gate quantization fix (2026-04-16)

What happened: The initial upload (2026-04-15) used `ignore=["lm_head"]` in the llm-compressor recipe, so the 62 MoE routers (`block_sparse_moe.gate`) were quantized along with the expert weights. vLLM's MiniMax-M2 loader expects an unquantized `ReplicatedLinear` router and fails at engine init with:

KeyError: 'layers.0.block_sparse_moe.gate.weight_scale'       # FP8
KeyError: 'layers.0.block_sparse_moe.gate.input_global_scale' # NVFP4

This is a hard load failure — the engine never initializes, so no tokens are generated. (The earlier "degraded output" framing understated the severity.)

Root cause: Missing MoE-aware entries in the llm-compressor ignore list. The correct pattern (per saricles/MiniMax-M2.5-REAP-139B-A10B-NVFP4-GB10):

ignore = [
    "lm_head",
    "model.embed_tokens",
    r"re:.*block_sparse_moe\.gate$",
]
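To illustrate why the trailing `$` in the third entry matters, the sketch below applies the same regular expression to a few module names using Python's `re` module. The module names are hypothetical, patterned on the `layers.0.block_sparse_moe.gate` key in the error above. The sketch also assumes that llm-compressor strips the `re:` prefix and matches the remainder against each module name, which is worth confirming in the library's documentation.

```python
import re

# Regex part of the ignore entry, with the "re:" prefix removed (assumed handling).
ROUTER_GATE_PATTERN = re.compile(r".*block_sparse_moe\.gate$")

# Hypothetical module names for illustration; real names come from the loaded model.
module_names = [
    "model.layers.0.block_sparse_moe.gate",
    "model.layers.61.block_sparse_moe.gate",
    "model.layers.0.block_sparse_moe.experts.0.w1",
    "model.layers.0.self_attn.q_proj",
]


def is_router_gate(name: str) -> bool:
    """Return True when the module name ends in block_sparse_moe.gate."""
    return ROUTER_GATE_PATTERN.match(name) is not None


# Only the two router gates should be selected for exclusion.
ignored = [name for name in module_names if is_router_gate(name)]
print(ignored)
```

Because the pattern is anchored at the end, a name that merely starts with `gate` after `block_sparse_moe.` (for example a hypothetical `gate_proj`) would not be excluded from quantization.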

Fix: This variant was re-rolled on 2026-04-16 with the corrected recipe. `quantization_config.ignore` now lists all 62 per-layer router gates alongside `lm_head`.

Verification: config.json on this repo now contains 62 model.layers.N.block_sparse_moe.gate entries in the ignore list. Loaders should open the model without the KeyError above.
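The check described above can be scripted. The following is a minimal sketch, assuming the ignore list lives at `quantization_config.ignore` and that each router entry is spelled exactly `model.layers.N.block_sparse_moe.gate`, as stated in this section. The helper names are illustrative, and `load_remote_config` needs the `huggingface_hub` package and network access.

```python
import json
import re

REPO_ID = "dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B-FP8"
# Entry format taken from the verification note above.
GATE_ENTRY = re.compile(r"^model\.layers\.(\d+)\.block_sparse_moe\.gate$")


def router_gate_layers(config: dict) -> list[int]:
    """Return the sorted layer indices whose router gate is in the ignore list."""
    ignore = config.get("quantization_config", {}).get("ignore", [])
    layers = set()
    for entry in ignore:
        match = GATE_ENTRY.match(entry)
        if match:
            layers.add(int(match.group(1)))
    return sorted(layers)


def load_remote_config(repo_id: str = REPO_ID) -> dict:
    """Download config.json from the Hub and parse it (needs network access)."""
    from huggingface_hub import hf_hub_download  # imported lazily on purpose

    path = hf_hub_download(repo_id=repo_id, filename="config.json")
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

Calling `router_gate_layers(load_remote_config())` should return 62 indices if the repository matches the description above. A shorter list would suggest that a stale copy of `config.json` is cached locally.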

Credit: Thanks to the community user who reported this first on the NVFP4-GB10 DGX Spark load. The saricles reference repo was invaluable for confirming the exact pattern.

Unaffected variants (no re-roll needed): BF16 safetensors, all GGUF quantizations.

FP8 dynamic quantization of dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B — the first publicly available REAP-40 % pruned variant of MiniMax-M2.7 — targeting H100 / H200 datacenter deployment via vLLM or TensorRT-LLM.

| Aspect | Value |
|---|---|
| Base model | `dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B` (BF16) |
| Quantization | Dynamic FP8 per-tensor (W8A8-FP8) |
| Format | `compressed-tensors` (vLLM-native) |
| Calibration | Data-free (weight-only scales) |
| Tool | `llmcompressor` |
| File size | ~140 GB across 29 safetensors shards |
| Ignored layers | `lm_head` and the 62 `block_sparse_moe.gate` routers (kept in BF16) |

Hardware & deployment

Native FP8 tensor-core acceleration on NVIDIA Hopper (H100 / H200) and Blackwell (B100 / B200). On older Ampere (A100) hardware FP8 is emulated — works but slower than INT8/AWQ alternatives.
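A quick way to tell which case applies is to look at the GPU's CUDA compute capability. The sketch below assumes the usual threshold of 8.9 for FP8 tensor cores (Ada Lovelace is 8.9 and Hopper is 9.0, while A100 is 8.0). Ada is not mentioned in this card and is included only because it sits at the threshold. The function names are illustrative, and the second helper needs PyTorch with a visible CUDA device.

```python
# Lowest compute capability with FP8 tensor cores (assumption stated above).
MIN_NATIVE_FP8 = (8, 9)


def has_native_fp8(capability: tuple[int, int]) -> bool:
    """Return True if a (major, minor) compute capability has FP8 tensor cores."""
    return tuple(capability) >= MIN_NATIVE_FP8


def current_gpu_has_native_fp8(device: int = 0) -> bool:
    """Query the local GPU through PyTorch; requires torch and a CUDA device."""
    import torch  # imported lazily so the pure helper works without torch

    return has_native_fp8(torch.cuda.get_device_capability(device))


print(has_native_fp8((9, 0)))  # H100 / H200 (Hopper)
print(has_native_fp8((8, 0)))  # A100 (Ampere)
```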

Memory footprint: ~140 GB weights + KV cache. Recommended:

2× H100 80 GB (tight at long context — use KV cache quantization)
2× H200 141 GB (comfortable headroom)
1× B100/B200 (native NVFP4+FP8)
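The "tight" and "comfortable" labels follow from simple arithmetic. The sketch below is a rough estimate only: it assumes the ~140 GB of weights are split evenly across GPUs and that the serving engine may use 90 % of each GPU's memory (vLLM's default `gpu_memory_utilization`). Real usage also includes activations, CUDA graphs, and replicated layers, so actual headroom will be smaller.

```python
WEIGHTS_GB = 140  # approximate checkpoint size quoted above


def per_gpu_headroom_gb(
    gpu_mem_gb: float,
    num_gpus: int,
    weights_gb: float = WEIGHTS_GB,
    utilization: float = 0.90,  # assumed fraction of GPU memory the engine may use
) -> float:
    """Estimate GB left per GPU for KV cache after sharding the weights evenly."""
    usable = gpu_mem_gb * utilization
    return usable - weights_gb / num_gpus


for name, mem_gb, count in [("2x H100 80 GB", 80, 2), ("2x H200 141 GB", 141, 2)]:
    print(f"{name}: {per_gpu_headroom_gb(mem_gb, count):.1f} GB per GPU for KV cache")
```

Under these assumptions the H100 pair is left with about 2 GB per GPU, which is why KV cache quantization or a shorter `max_model_len` is advisable there, while the H200 pair keeps roughly 57 GB per GPU.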

Inference

vLLM

from vllm import LLM, SamplingParams

llm = LLM(
    model="dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B-FP8",
    tensor_parallel_size=2,       # 2× H100 or similar
    trust_remote_code=True,
    max_model_len=32768,
)

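# Sampling settings from "Recommended generation parameters"; vLLM calls repeat_penalty "repetition_penalty"
params = SamplingParams(temperature=1.0, top_p=0.95, top_k=40, repetition_penalty=1.05, max_tokens=2048)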
out = llm.generate(["Explain REAP pruning briefly."], params)
print(out[0].outputs[0].text)

TensorRT-LLM

Convert via trtllm-quantize or load directly with the compressed-tensors loader (TRT-LLM 0.13+).

Quality

Inference quality was validated on the BF16 parent via a 5 / 5 pre-publish smoke test and a full HumanEval evaluation (see the parent safetensors card). In practice, FP8 weight-only quantization with dynamic per-tensor scaling is near-lossless relative to BF16; activations stay BF16.

Base model summary

| Property | Value |
|---|---|
| Architecture | MoE, 62 layers, 154 experts (pruned from 256), top-8 routing |
| Active parameters / token | ~10 B |
| Total parameters | ~139 B |
| Max position embeddings | 196,608 |
| Vocabulary size | 200,064 |
| Pruning | REAP 40 %, seed 42, calibration on 3 × 2,048 samples (code / math / tool) |

See the parent safetensors card for full architecture, pruning details, evaluation numbers, and the known minor layer-0 bias imperfection.
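The expert counts in the table are consistent with the nominal 40 % pruning ratio, as the short check below shows. Rounding to the nearest whole expert is an assumption made for illustration; the parent card is the authority on how the count was actually chosen.

```python
ORIGINAL_EXPERTS = 256      # experts per layer before pruning (from the table)
KEPT_EXPERTS = 154          # experts per layer after pruning (from the table)
NOMINAL_PRUNE_RATIO = 0.40  # "REAP 40 %"


def kept_after_pruning(total: int, ratio: float) -> int:
    """Experts remaining after pruning, rounded to the nearest integer (assumed)."""
    return round(total * (1 - ratio))


# Effective ratio implied by the two expert counts.
actual_ratio = 1 - KEPT_EXPERTS / ORIGINAL_EXPERTS

print(kept_after_pruning(ORIGINAL_EXPERTS, NOMINAL_PRUNE_RATIO))
print(f"{actual_ratio:.2%}")
```

Keeping 154 of 256 experts removes 102 per layer, an effective ratio just under 40 %.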

Recommended generation parameters

temperature: 1.0
top_p: 0.95
top_k: 40
repeat_penalty: 1.05

These match the base MiniMax-M2.7 recommendations.
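When calling a vLLM server through its OpenAI-compatible endpoint, these settings can be sent in the request body. The sketch below uses only the Python standard library. It assumes the server from the vLLM section is listening on `localhost:8000`, and that vLLM's `repetition_penalty` is an acceptable stand-in for `repeat_penalty` (the name used by llama.cpp-style runtimes). `top_k` and `repetition_penalty` are vLLM extensions to the OpenAI schema, so other servers may reject them. The helper names are illustrative.

```python
import json
import urllib.request

MODEL_ID = "dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B-FP8"


def build_chat_request(prompt: str, max_tokens: int = 512) -> dict:
    """Build a chat-completions payload carrying the recommended sampling settings."""
    return {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 1.0,
        "top_p": 0.95,
        "top_k": 40,                 # vLLM extension to the OpenAI schema
        "repetition_penalty": 1.05,  # assumed equivalent of repeat_penalty
    }


def send(payload: dict, url: str = "http://localhost:8000/v1/chat/completions") -> dict:
    """POST the payload to a running server and return the parsed JSON reply."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Print the payload only; call send(...) once a server is actually running.
print(json.dumps(build_chat_request("Who are you?"), indent=2))
```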

Companion repos

Parent safetensors (BF16): dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B
GGUF (Mac / llama.cpp / Ollama / LM Studio): dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B-GGUF
NVFP4 (Blackwell-native): coming soon
AWQ-4bit (vLLM / HF Transformers INT4): coming soon

Citation

See the safetensors repo for full citations. Core references:

Lasby et al., REAP the Experts (arXiv:2510.13999)
MiniMax AI, MiniMax-M2.7

License

Inherits the Modified MIT License from MiniMaxAI/MiniMax-M2.7.

Published by m51Lab — open-source LLM contributions from the M51 AI OS group.

Safetensors
Model size: 139B params
Tensor type: BF16, F8_E4M3

Model tree for dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B-FP8
Base model: MiniMaxAI/MiniMax-M2.7
Finetuned: dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B
Quantized (6): this model

Paper for dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B-FP8
REAP the Experts: Why Pruning Prevails for One-Shot MoE compression
Paper 2510.13999, published Oct 15, 2025
