MiniMax-M2.7 — Abliterated + NVFP4

MiniMax-M2.7 with refusal directions removed, requantized to NVIDIA NVFP4 for Blackwell GPUs.

Base weights: llmfan46/MiniMax-M2.7-BF16-ultra-uncensored-heretic (itself derived from MiniMaxAI/MiniMax-M2.7)
Quantization format: NVFP4 — drop-in replacement for nvidia/MiniMax-M2.7-NVFP4
Quantizer: custom streaming quantizer (processes one BF16 shard at a time; no full model load required)
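
The one-shard-at-a-time approach can be outlined roughly as follows. This is an illustrative sketch, not the actual quantizer: `load_shard`, `save_shard`, and `quantize` are hypothetical callables supplied by the caller, and the `.experts.` name filter is an assumption about how expert tensors are named in the checkpoint.

```python
from typing import Callable, Dict, Iterable

Tensors = Dict[str, object]


def should_quantize(name: str) -> bool:
    # Assumed naming: expert weights contain ".experts." and end in ".weight".
    return ".experts." in name and name.endswith(".weight")


def stream_quantize(
    shard_paths: Iterable[str],
    load_shard: Callable[[str], Tensors],
    save_shard: Callable[[str, Tensors], None],
    quantize: Callable[[object], Tensors],
) -> int:
    """Quantize shard by shard; returns the number of tensors quantized."""
    count = 0
    for path in shard_paths:
        tensors = load_shard(path)  # only this shard is resident in memory
        out: Tensors = {}
        for name, tensor in tensors.items():
            if should_quantize(name):
                prefix = name[: -len("weight")]
                # One BF16 weight becomes several output tensors (data + scales).
                for suffix, value in quantize(tensor).items():
                    out[prefix + suffix] = value
                count += 1
            else:
                out[name] = tensor  # passed through unchanged
        save_shard(path, out)
        del tensors, out  # drop local references before loading the next shard
    return count
```

Because peak memory is bounded by the largest single shard rather than the whole checkpoint, a loop of this shape can run on a machine that could never hold the full BF16 model.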

What's changed

The source BF16 model has refusal behaviours removed at the weight level via abliteration. This NVFP4 version is a direct requantization of those weights: the architecture, tokenizer, and configuration are identical to the original MiniMax-M2.7, and only the weight values differ.

Quantization details match NVIDIA's original NVFP4 format exactly:

MoE expert weights: NVFP4 (E2M1 FP4, group_size=16, FP8 per-block scales, FP32 per-tensor scale)
KV cache: FP8
Kept in BF16: embeddings (embed_tokens), attention projections (self_attn.*), MoE router (block_sparse_moe.gate), lm_head
input_scale scalars borrowed from nvidia/MiniMax-M2.7-NVFP4 calibration data
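
To make the two-level scaling concrete, here is a minimal pure-Python sketch of NVFP4-style quantization of a flat list of weights. It is a simplification under stated assumptions, not the code used to produce this checkpoint: the per-tensor scale is chosen as amax / (6 × 448) so that block scales fit the FP8 E4M3 range, block scales are kept as Python floats instead of being cast to FP8, and values are not packed two per byte.

```python
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # magnitudes representable in FP4 E2M1
GROUP_SIZE = 16    # weights sharing one block scale
E2M1_MAX = 6.0     # largest E2M1 magnitude
E4M3_MAX = 448.0   # largest finite FP8 E4M3 magnitude


def round_to_e2m1(x):
    # Snap to the nearest representable magnitude, keeping the sign.
    mag = min(E2M1_VALUES, key=lambda v: abs(abs(x) - v))
    return -mag if x < 0 else mag


def quantize_nvfp4(weights):
    assert len(weights) % GROUP_SIZE == 0
    amax = max(abs(w) for w in weights) or 1.0
    # Assumed convention: the FP32 per-tensor scale maps the largest block scale to E4M3_MAX.
    tensor_scale = amax / (E2M1_MAX * E4M3_MAX)
    codes, block_scales = [], []
    for start in range(0, len(weights), GROUP_SIZE):
        block = weights[start:start + GROUP_SIZE]
        block_amax = max(abs(w) for w in block)
        # The real format stores this value in FP8 E4M3; it stays a float here.
        block_scale = block_amax / E2M1_MAX / tensor_scale
        block_scales.append(block_scale)
        step = block_scale * tensor_scale
        codes.extend(round_to_e2m1(w / step) if step else 0.0 for w in block)
    return codes, block_scales, tensor_scale


def dequantize_nvfp4(codes, block_scales, tensor_scale):
    return [c * block_scales[i // GROUP_SIZE] * tensor_scale for i, c in enumerate(codes)]


weights = [i / 16 for i in range(-16, 16)]  # 32 values, i.e. two blocks
codes, block_scales, tensor_scale = quantize_nvfp4(weights)
restored = dequantize_nvfp4(codes, block_scales, tensor_scale)
```

Each block's largest magnitude maps to the E2M1 code 6, so the worst-case rounding error within a block is one sixth of that block's maximum.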

Model architecture

Parameter	Value
Total parameters	~230B
Active parameters per token	~10B
Hidden size	3072
Layers	62
Experts (total / active)	256 / 8
Attention heads	48 (GQA, 8 KV heads)
Max context	204,800 tokens
Vocabulary	200,064 tokens

Hardware requirements

NVIDIA Blackwell only. NVFP4 (E2M1) is a Blackwell-native format and requires:

B200, GB200, or RTX 5090 (or later Blackwell GPUs)
At least 2× B200 (2 × 180GB = 360GB HBM) recommended for tensor-parallel serving
vLLM 0.22+ with modelopt quantization backend

This model will not run on Ampere or Hopper GPUs. If you need a version compatible with older hardware, use a GGUF or GPTQ quantization of the source BF16 model instead.
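
As a rough sanity check on the memory figures, the weight footprint can be estimated from the format itself. The sketch below makes simplifying assumptions: it treats all ~230B parameters as NVFP4 (ignoring the layers kept in BF16 and the per-tensor scales), and the function names are illustrative.

```python
def nvfp4_bits_per_weight(group_size: int = 16, block_scale_bits: int = 8) -> float:
    # 4 bits of E2M1 data plus one FP8 scale shared by each group of weights.
    return 4 + block_scale_bits / group_size


def weight_footprint_gb(num_params: float, bits_per_weight: float) -> float:
    # Decimal gigabytes: bits -> bytes -> GB.
    return num_params * bits_per_weight / 8 / 1e9


bits = nvfp4_bits_per_weight()                  # 4.5 bits per weight
weights_gb = weight_footprint_gb(230e9, bits)   # approximate, see assumptions above
headroom_gb = 2 * 180 - weights_gb              # left on 2x B200 for KV cache and activations
print(f"{bits} bits/weight, ~{weights_gb:.0f} GB weights, ~{headroom_gb:.0f} GB headroom")
```

The estimate lands close to the ~126GB listed under File structure, which suggests that most of the 360GB on a 2× B200 setup remains available for the KV cache at long context lengths.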

Usage

vLLM (recommended)

vllm serve random-robbie/MiniMax-M2.7-NVFP4-abliterated \
  --served-model-name minimax-m2.7 \
  --tensor-parallel-size 2 \
  --quantization modelopt \
  --kv-cache-dtype fp8 \
  --max-model-len 196608 \
  --gpu-memory-utilization 0.92 \
  --trust-remote-code

OpenAI-compatible API

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

response = client.chat.completions.create(
    model="minimax-m2.7",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=512,
)
print(response.choices[0].message.content)
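
If the `openai` package is not available, the same endpoint can be called with the standard library alone. This is a minimal sketch assuming the server started by the `vllm serve` command above is listening on localhost:8000; `build_chat_request` and `send_chat` are illustrative helper names, not part of any library.

```python
import json
import urllib.request


def build_chat_request(prompt: str, model: str = "minimax-m2.7", max_tokens: int = 512) -> dict:
    # Same request body the OpenAI client sends to /v1/chat/completions.
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }


def send_chat(payload: dict, base_url: str = "http://localhost:8000/v1") -> str:
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    # Generous timeout: long generations on a large model can take a while.
    with urllib.request.urlopen(request, timeout=600) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(send_chat(build_chat_request("Hello!")))` should behave like the client example above.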

File structure

File	Description
`model-000XX-of-00014.safetensors`	14 NVFP4 weight shards (~126GB total)
`model.safetensors.index.json`	Tensor → shard index
`hf_quant_config.json`	modelopt quantization metadata
`config.json`	Model architecture config
`modeling_minimax_m2.py`	Custom model code (trust_remote_code)
`tokenizer.json` / `tokenizer_config.json`	Tokenizer
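
The index file maps each tensor name to the shard that stores it under a `weight_map` key, so it can be inspected without opening any weight shard. The sketch below uses a toy index; the tensor names in it are hypothetical and only the shard filename pattern comes from the table above.

```python
import json
from collections import Counter
from typing import Dict


def tensors_per_shard(index: Dict) -> Counter:
    # Count how many tensors each shard file holds.
    return Counter(index["weight_map"].values())


def shard_of(index: Dict, tensor_name: str) -> str:
    # Look up which shard contains a given tensor.
    return index["weight_map"][tensor_name]


# Toy stand-in for json.load(open("model.safetensors.index.json")).
# Tensor names are hypothetical.
example_index = {
    "weight_map": {
        "model.embed_tokens.weight": "model-00001-of-00014.safetensors",
        "model.layers.0.self_attn.q_proj.weight": "model-00001-of-00014.safetensors",
        "lm_head.weight": "model-00014-of-00014.safetensors",
    }
}

counts = tensors_per_shard(example_index)
print(json.dumps(counts, indent=2))
print(shard_of(example_index, "lm_head.weight"))
```

Running the same two functions on the real index gives a quick way to confirm that all 14 shards are referenced before starting a long download or load.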

License

This model is derived from MiniMaxAI/MiniMax-M2.7 and is subject to the MiniMax Model License. Please review that license before use, particularly regarding commercial applications.

The abliteration technique (weight-level removal of refusal directions) was applied to the intermediate BF16 weights by llmfan46. This NVFP4 requantization was produced by random-robbie.

Credits

Original model: MiniMaxAI/MiniMax-M2.7
BF16 abliterated source: llmfan46
NVFP4 format reference: nvidia/MiniMax-M2.7-NVFP4
NVFP4 requantization: random-robbie

Model tree for random-robbie/MiniMax-M2.7-NVFP4-abliterated

Base model: MiniMaxAI/MiniMax-M2.7
Finetuned: amd/MiniMax-M2.7-BF16
Finetuned: llmfan46/MiniMax-M2.7-BF16-ultra-uncensored-heretic
Finetuned: this model