MiniMax-M2.7 — Abliterated + NVFP4

MiniMax-M2.7 with refusal directions removed, requantized to NVIDIA NVFP4 for Blackwell GPUs.

What's changed

The source BF16 model has refusal behaviours removed at the weight level via abliteration. This NVFP4 version is a direct requantization of those weights — the architecture, tokenizer, and all other parameters are identical to the original MiniMax-M2.7.

The quantization layout matches NVIDIA's original NVFP4 release exactly:

  • MoE expert weights: NVFP4 (E2M1 FP4, group_size=16, FP8 per-block scales, FP32 per-tensor scale)
  • KV cache: FP8
  • Kept in BF16: embeddings (embed_tokens), attention projections (self_attn.*), MoE router (block_sparse_moe.gate), lm_head
  • input_scale scalars borrowed from nvidia/MiniMax-M2.7-NVFP4 calibration data
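
To make the layout above concrete, here is a minimal NumPy sketch of how one NVFP4 block could be dequantized: a 4-bit E2M1 code is looked up, then multiplied by its per-block scale (one per 16 values) and the per-tensor scale. The function names `unpack_fp4` and `dequantize_nvfp4` are hypothetical, the low-nibble-first packing order is an assumption, and the FP8 block scales are represented as float32 because NumPy has no FP8 dtype. This is an illustration of the arithmetic, not the modelopt or vLLM implementation.

```python
import numpy as np

# The eight non-negative magnitudes representable in E2M1
# (1 sign bit, 2 exponent bits, 1 mantissa bit).
E2M1_MAGNITUDES = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32)


def unpack_fp4(packed: np.ndarray) -> np.ndarray:
    """Split each uint8 into two 4-bit codes (assumption: low nibble first)."""
    low = packed & 0x0F
    high = packed >> 4
    return np.stack([low, high], axis=-1).reshape(*packed.shape[:-1], -1)


def dequantize_nvfp4(packed, block_scales, tensor_scale, group_size=16):
    """Return float32 weights: e2m1_value * per-block scale * per-tensor scale."""
    codes = unpack_fp4(packed)
    sign = np.where(codes & 0x8, -1.0, 1.0).astype(np.float32)  # bit 3 is the sign
    values = sign * E2M1_MAGNITUDES[codes & 0x7]                 # bits 0-2 index the magnitude
    # One scale per group of `group_size` consecutive values along the last axis.
    scales = np.repeat(block_scales.astype(np.float32), group_size, axis=-1)
    return values * scales * np.float32(tensor_scale)
```

Two FP4 codes share one byte, which is consistent with the expert weights being stored as U8 tensors with separate scale tensors alongside them.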

Model architecture

Parameter                    Value
Total parameters             ~230B
Active parameters per token  ~10B
Hidden size                  3072
Layers                       62
Experts (total / active)     256 / 8
Attention heads              48 (GQA, 8 KV heads)
Max context                  204,800 tokens
Vocabulary                   200,064 tokens
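
A few ratios follow directly from the figures above. The short sketch below derives them; the dictionary keys and the `derived_ratios` helper are illustrative names, and the parameter counts are the approximate values quoted in the table.

```python
# Figures copied from the architecture table (parameter counts are approximate, in billions).
ARCH = {
    "total_params_b": 230,
    "active_params_b": 10,
    "experts_total": 256,
    "experts_active": 8,
    "attn_heads": 48,
    "kv_heads": 8,
}


def derived_ratios(arch: dict) -> dict:
    """Compute sparsity and GQA grouping ratios from the table values."""
    return {
        "active_param_fraction": arch["active_params_b"] / arch["total_params_b"],
        "active_expert_fraction": arch["experts_active"] / arch["experts_total"],
        "query_heads_per_kv_head": arch["attn_heads"] // arch["kv_heads"],
    }


ratios = derived_ratios(ARCH)
print(ratios)
```

Roughly 4% of the parameters and about 3% of the experts are active per token, and each KV head is shared by 6 query heads.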

Hardware requirements

NVIDIA Blackwell only. NVFP4 (E2M1) is a Blackwell-native format and requires:

  • B200, GB200, or RTX 5090 (or later Blackwell GPUs)
  • At least 2× B200 (2× 180GB = 360GB HBM) recommended for tensor-parallel serving
  • vLLM 0.22+ with modelopt quantization backend

This model will not run on Ampere or Hopper GPUs. If you need a version compatible with older hardware, use a GGUF or GPTQ quantization of the source BF16 model instead.
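
As a rough sanity check on the 2× B200 recommendation, the sketch below estimates how much memory per GPU is left for the KV cache and activations. It uses figures from this card (about 126GB of weight shards, 180GB per GPU, and the 0.92 memory utilization used in the serving command) and assumes the weights split evenly across GPUs, which is a simplification. The helper name `per_gpu_headroom_gb` is hypothetical, and runtime overheads such as CUDA context and communication buffers are ignored.

```python
def per_gpu_headroom_gb(weights_gb: float, num_gpus: int,
                        gpu_mem_gb: float, utilization: float) -> float:
    """Memory per GPU left after weights, assuming an even tensor-parallel split."""
    budget = gpu_mem_gb * utilization      # memory the server is allowed to use
    weights_per_gpu = weights_gb / num_gpus
    return budget - weights_per_gpu


headroom = per_gpu_headroom_gb(weights_gb=126, num_gpus=2, gpu_mem_gb=180, utilization=0.92)
print(f"approx. {headroom:.1f} GB per GPU for KV cache and activations")
```

Under these assumptions each GPU keeps on the order of 100GB free after loading weights, which is what makes long-context serving with an FP8 KV cache plausible on two cards.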

Usage

vLLM (recommended)

vllm serve random-robbie/MiniMax-M2.7-NVFP4-abliterated \
  --served-model-name minimax-m2.7 \
  --tensor-parallel-size 2 \
  --quantization modelopt \
  --kv-cache-dtype fp8 \
  --max-model-len 196608 \
  --gpu-memory-utilization 0.92 \
  --trust-remote-code
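
Once the server is running, a quick way to confirm it is reachable under the name passed to `--served-model-name` is to list the models on the OpenAI-compatible endpoint. The sketch below assumes the default `http://localhost:8000/v1` base URL; `served_model_ids` and `fetch_models` are illustrative helper names, not part of vLLM.

```python
import json
from urllib.request import urlopen


def served_model_ids(payload: dict) -> list:
    """Extract model ids from an OpenAI-style /v1/models response body."""
    return [entry["id"] for entry in payload.get("data", [])]


def fetch_models(base_url: str = "http://localhost:8000/v1") -> list:
    """Query the running server and return the ids it serves."""
    with urlopen(f"{base_url}/models", timeout=10) as resp:
        return served_model_ids(json.load(resp))
```

With the command above, `fetch_models()` would be expected to return a list containing `minimax-m2.7`; requests must use that name as `model`.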

OpenAI-compatible API

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

response = client.chat.completions.create(
    model="minimax-m2.7",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=512,
)
print(response.choices[0].message.content)
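
For long generations it can be more convenient to stream tokens as they are produced. The following is a minimal sketch using the same chat completions call with `stream=True`; the `stream_reply` helper is a hypothetical name, and it takes the client as an argument so it works with the `client` created above.

```python
def stream_reply(client, model: str, prompt: str, max_tokens: int = 512) -> str:
    """Print tokens as they arrive and return the full reply text."""
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
        stream=True,  # ask the server for incremental chunks
    )
    parts = []
    for chunk in stream:
        if not chunk.choices:  # some chunks may carry no choices
            continue
        delta = chunk.choices[0].delta.content
        if delta:  # skip empty or None deltas (e.g. role-only chunks)
            parts.append(delta)
            print(delta, end="", flush=True)
    print()
    return "".join(parts)
```

For example, `stream_reply(client, "minimax-m2.7", "Hello!")` prints the reply incrementally and returns it as a single string.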

File structure

File                                    Description
model-000XX-of-00014.safetensors        14 NVFP4 weight shards (~126GB total)
model.safetensors.index.json            Tensor → shard index
hf_quant_config.json                    modelopt quantization metadata
config.json                             Model architecture config
modeling_minimax_m2.py                  Custom model code (trust_remote_code)
tokenizer.json / tokenizer_config.json  Tokenizer
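
To check a download, the shard index can be inspected without loading any weights. The sketch below assumes the usual Hugging Face sharded-checkpoint layout, where `model.safetensors.index.json` holds a `weight_map` from tensor name to shard file; `load_index` and `tensors_per_shard` are illustrative helper names.

```python
import json
from collections import Counter
from pathlib import Path


def load_index(path: str) -> dict:
    """Read model.safetensors.index.json from a local checkout."""
    return json.loads(Path(path).read_text())


def tensors_per_shard(index: dict) -> Counter:
    """Count how many tensors the index maps to each shard file."""
    return Counter(index["weight_map"].values())
```

Calling `tensors_per_shard(load_index("model.safetensors.index.json"))` on a complete download should list 14 distinct shard files, matching the table above.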

License

This model is derived from MiniMaxAI/MiniMax-M2.7 and is subject to the MiniMax Model License. Please review that license before use, particularly regarding commercial applications.

The abliteration technique (weight-level removal of refusal directions) was applied to the intermediate BF16 weights by llmfan46. This NVFP4 requantization was produced by random-robbie.

Credits
