MiniMax-M2.7 — Abliterated + NVFP4
MiniMax-M2.7 with refusal directions removed, requantized to NVIDIA NVFP4 for Blackwell GPUs.
- Base weights: llmfan46/MiniMax-M2.7-BF16-ultra-uncensored-heretic (itself derived from MiniMaxAI/MiniMax-M2)
- Quantization format: NVFP4, a drop-in replacement for nvidia/MiniMax-M2.7-NVFP4
- Quantizer: custom streaming quantizer (processes one BF16 shard at a time; no full model load required)
What's changed
The source BF16 model has refusal behaviours removed at the weight level via abliteration. This NVFP4 version is a direct requantization of those weights — the architecture, tokenizer, and all other parameters are identical to the original MiniMax-M2.7.
Quantization details match NVIDIA's original NVFP4 format exactly:
- MoE expert weights: NVFP4 (E2M1 FP4, group_size=16, FP8 per-block scales, FP32 per-tensor scale)
- KV cache: FP8
- Kept in BF16: embeddings (embed_tokens), attention projections (self_attn.*), MoE router (block_sparse_moe.gate), lm_head
- input_scale scalars borrowed from nvidia/MiniMax-M2.7-NVFP4 calibration data
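To make the expert-weight format above concrete, here is a minimal pure-Python sketch of block-wise FP4 quantization with group_size=16. It is an illustration only, not the quantizer used for this model: the per-block scale is kept as a Python float instead of being rounded to FP8, the FP32 per-tensor scale is omitted, and all function names are hypothetical.

```python
# Magnitudes representable in E2M1 (1 sign bit, 2 exponent bits, 1 mantissa bit).
E2M1_VALUES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
GROUP_SIZE = 16  # weights sharing one block scale

# 4 bits per weight plus one 8-bit scale shared by 16 weights.
BITS_PER_WEIGHT = 4 + 8 / GROUP_SIZE


def quantize_block(block):
    """Quantize one group of weights; returns (scale, signed E2M1 values)."""
    amax = max(abs(x) for x in block)
    # Map the largest magnitude onto 6.0, the largest E2M1 value.
    # Simplification: the real format stores this scale in FP8.
    scale = amax / 6.0 if amax > 0 else 1.0
    codes = []
    for x in block:
        # Pick the nearest representable magnitude, then restore the sign.
        mag = min(E2M1_VALUES, key=lambda v: abs(abs(x) / scale - v))
        codes.append(-mag if x < 0 else mag)
    return scale, codes


def dequantize_block(scale, codes):
    """Recover approximate weights from a quantized group."""
    return [c * scale for c in codes]


def quantize(weights):
    """Split a flat list of weights into groups and quantize each group."""
    if len(weights) % GROUP_SIZE != 0:
        raise ValueError("length must be a multiple of GROUP_SIZE")
    return [
        quantize_block(weights[i:i + GROUP_SIZE])
        for i in range(0, len(weights), GROUP_SIZE)
    ]
```

Because the E2M1 grid is coarsest between 4 and 6, the worst-case rounding error within a group is one block scale, which is why a small group size with its own scale matters for accuracy.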
Model architecture
| Parameter | Value |
|---|---|
| Total parameters | ~230B |
| Active parameters per token | ~10B |
| Hidden size | 3072 |
| Layers | 62 |
| Experts (total / active) | 256 / 8 |
| Attention heads | 48 (GQA, 8 KV heads) |
| Max context | 204,800 tokens |
| Vocabulary | 200,064 tokens |
Hardware requirements
NVIDIA Blackwell only. NVFP4 (E2M1) is a Blackwell-native format and requires:
- B200, GB200, or RTX 5090 (or later Blackwell GPUs)
- At least 2× B200 (2 × 180 GB = 360 GB HBM) recommended for tensor-parallel serving
- vLLM 0.22+ with modelopt quantization backend
This model will not run on Ampere or Hopper GPUs. If you need a version compatible with older hardware, use a GGUF or GPTQ quantization of the source BF16 model instead.
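A quick pre-flight check can catch an unsupported GPU before a long download. The sketch below assumes that Blackwell parts report a CUDA compute capability with major version 10 or higher (Hopper reports 9, Ampere 8); that threshold and the function name are assumptions for illustration, not something this model card specifies.

```python
def is_blackwell_or_newer(capability):
    """Return True if a (major, minor) compute capability looks like Blackwell+.

    Assumption: Blackwell GPUs report major >= 10; Hopper is 9.x, Ampere 8.x.
    """
    major, _minor = capability
    return major >= 10


# With PyTorch and a CUDA device available, the tuple can be obtained from
# torch.cuda.get_device_capability(0); hard-coded values are used here so the
# sketch runs anywhere.
for name, cap in [("Hopper-class", (9, 0)), ("Blackwell-class", (10, 0))]:
    status = "ok" if is_blackwell_or_newer(cap) else "unsupported"
    print(f"{name} {cap}: {status}")
```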
Usage
vLLM (recommended)
vllm serve random-robbie/MiniMax-M2.7-NVFP4-abliterated \
  --served-model-name minimax-m2.7 \
  --tensor-parallel-size 2 \
  --quantization modelopt \
  --kv-cache-dtype fp8 \
  --max-model-len 196608 \
  --gpu-memory-utilization 0.92 \
  --trust-remote-code
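As a rough sanity check on these settings, the arithmetic below estimates how much memory remains for the KV cache and activations once the weights are loaded. It uses only figures from this card (2 GPUs, 180 GB each, 0.92 utilization, ~126 GB of weights) and treats the result as an upper bound, since runtime overheads are ignored; the function name is hypothetical.

```python
def kv_cache_budget_gb(num_gpus, hbm_per_gpu_gb, utilization, weights_gb):
    """Upper bound on memory left after weights, summed across all GPUs."""
    # --gpu-memory-utilization caps the fraction of each GPU that vLLM may use.
    usable_gb = num_gpus * hbm_per_gpu_gb * utilization
    return usable_gb - weights_gb


budget = kv_cache_budget_gb(
    num_gpus=2, hbm_per_gpu_gb=180, utilization=0.92, weights_gb=126
)
print(f"{budget:.1f} GB available for KV cache and activations")
```

With these inputs the estimate comes to about 205 GB across both GPUs. If that budget proves too tight in practice, lowering --max-model-len is the usual first adjustment.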
OpenAI-compatible API
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")
response = client.chat.completions.create(
    model="minimax-m2.7",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=512,
)
print(response.choices[0].message.content)
File structure
| File | Description |
|---|---|
| model-000XX-of-00014.safetensors | 14 NVFP4 weight shards (~126GB total) |
| model.safetensors.index.json | Tensor → shard index |
| hf_quant_config.json | modelopt quantization metadata |
| config.json | Model architecture config |
| modeling_minimax_m2.py | Custom model code (trust_remote_code) |
| tokenizer.json / tokenizer_config.json | Tokenizer |
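The tensor → shard index is what makes shard-at-a-time processing possible without loading the full model. The sketch below groups tensor names by shard file, assuming the usual Hugging Face layout in which the index JSON has a top-level "weight_map" object mapping tensor names to shard filenames. The helper names are hypothetical, and this is not the quantizer used to produce this model.

```python
import json
from collections import defaultdict


def load_index(path="model.safetensors.index.json"):
    """Read the index file from a local copy of the repository."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def tensors_by_shard(index):
    """Group tensor names by the shard file that stores them."""
    groups = defaultdict(list)
    for tensor_name, shard_file in index["weight_map"].items():
        groups[shard_file].append(tensor_name)
    return dict(groups)
```

A caller would then iterate over `tensors_by_shard(load_index()).items()` and open one shard at a time, so peak memory stays near the size of a single shard rather than the whole checkpoint.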
License
This model is derived from MiniMaxAI/MiniMax-M2.7 and is subject to the MiniMax Model License. Please review that license before use, particularly regarding commercial applications.
The abliteration technique (weight-level removal of refusal directions) was applied to the intermediate BF16 weights by llmfan46. This NVFP4 requantization was produced by random-robbie.
Credits
- Original model: MiniMaxAI/MiniMax-M2.7
- BF16 abliterated source: llmfan46
- NVFP4 format reference: nvidia/MiniMax-M2.7-NVFP4
- NVFP4 requantization: random-robbie