|
|
--- |
|
|
base_model: |
|
|
- MiniMaxAI/MiniMax-M2.5 |
|
|
license: mit |
|
|
--- |
|
|
|
|
|
Updated 2/16/2026 - Improved calibration coverage. |
|
|
|
|
|
## Model Description |
|
|
|
|
|
**MiniMax-M2.5-NVFP4** is an NVFP4-quantized version of [MiniMaxAI/MiniMax-M2.5](https://huggingface.co/MiniMaxAI/MiniMax-M2.5), a 230B-parameter Mixture-of-Experts language model with 10B active parameters. |
|
|
|
|
|
The original model weights were converted from the official FP8 checkpoint to BF16, then quantized to NVFP4 (4-bit with blockwise FP8 scales per 16 elements) using [NVIDIA Model Optimizer](https://github.com/NVIDIA/Model-Optimizer). |
|
|
|
|
|
### What's quantized |
|
|
|
|
|
Only the MoE expert MLP layers (gate, up, and down projections) are quantized to NVFP4. Attention layers are left in BF16. Since the expert weights constitute the vast majority of model parameters in an MoE architecture, this still yields significant memory savings. |
|
|
|
|
|
Calibration uses natural top-k routing rather than forcing all experts to activate, so each expert's quantization scales reflect the token distributions it actually sees during inference. To compensate, calibration was run on a vastly larger number of samples than typical to ensure broad expert coverage through natural routing alone. |
|
|
|
|
|
### Calibration dataset |
|
|
|
|
|
Samples were drawn from a diverse mix of publicly available datasets spanning code generation, function/tool calling, multi-turn reasoning, math, and multilingual (English + Chinese) instruction following. System prompts were randomly varied across samples. The dataset was designed to broadly exercise the model's capabilities and activate diverse token distributions across expert modules. |
|
|
|
|
|
### Quality |
|
|
|
|
|
Initial testing has been very positive, but you should evaluate against your specific use case. |
|
|
|
|
|
### How to Run |
|
|
|
|
|
If you experience NCCL hangs with P2P, make sure you have `iommu=pt` (and `amd_iommu=pt` on AMD platforms) in your kernel command line. |
|
|
|
|
|
#### SGLang |
|
|
|
|
|
Tested on 2x and 4x RTX Pro 6000 Blackwell. |
|
|
|
|
|
``` |
|
|
export NCCL_IB_DISABLE=1 |
|
|
export NCCL_P2P_LEVEL=PHB |
|
|
export NCCL_ALLOC_P2P_NET_LL_BUFFERS=1 |
|
|
export NCCL_MIN_NCHANNELS=8 |
|
|
export OMP_NUM_THREADS=8 |
|
|
export SAFETENSORS_FAST_GPU=1 |
|
|
|
|
|
python3 -m sglang.launch_server \ |
|
|
--model lukealonso/MiniMax-M2.5-NVFP4 \ |
|
|
--served-model-name MiniMax-M2.5 \ |
|
|
--reasoning-parser minimax \ |
|
|
--tool-call-parser minimax-m2 \ |
|
|
--enable-torch-compile \ |
|
|
--trust-remote-code \ |
|
|
--tp 2 --ep 2 \ |
|
|
--mem-fraction-static 0.9 \ |
|
|
--max-running-requests 16 \ |
|
|
--kv-cache-dtype bf16 # or fp8_e4m3 \ |
|
|
--quantization modelopt_fp4 \ |
|
|
--attention-backend flashinfer \ |
|
|
--moe-runner-backend flashinfer_cutlass \ |
|
|
--disable-custom-all-reduce \ |
|
|
--enable-flashinfer-allreduce-fusion \ |
|
|
--host 0.0.0.0 \ |
|
|
--port 8000 |
|
|
|
|
|
``` |
|
|
|
|
|
#### vLLM |
|
|
|
|
|
(thanks to @zenmagnets) |
|
|
|
|
|
Set your Hugging Face cache and GPUs, then run (from project root with venv activated). |
|
|
``` |
|
|
export CUDA_DEVICE_ORDER=PCI_BUS_ID |
|
|
export CUDA_VISIBLE_DEVICES=0,1 |
|
|
export HF_HOME=/path/to/huggingface |
|
|
export HUGGINGFACE_HUB_CACHE=$HF_HOME/hub |
|
|
export VLLM_WORKER_MULTIPROC_METHOD=spawn |
|
|
export SAFETENSORS_FAST_GPU=1 |
|
|
export VLLM_NVFP4_GEMM_BACKEND=cutlass |
|
|
export VLLM_USE_FLASHINFER_MOE_FP4=0 |
|
|
export NCCL_IB_DISABLE=1 |
|
|
export OMP_NUM_THREADS=8 |
|
|
export VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 |
|
|
|
|
|
python -m vllm.entrypoints.openai.api_server \ |
|
|
--model lukealonso/MiniMax-M2.5-NVFP4 \ |
|
|
--download-dir $HUGGINGFACE_HUB_CACHE \ |
|
|
--host 0.0.0.0 \ |
|
|
--port 1235 \ |
|
|
--served-model-name MiniMax-M2.5-NVFP4 \ |
|
|
--trust-remote-code \ |
|
|
--tensor-parallel-size 2 \ |
|
|
--attention-backend FLASH_ATTN \ |
|
|
--gpu-memory-utilization 0.95 \ |
|
|
--max-model-len 190000 \ |
|
|
--max-num-batched-tokens 16384 \ |
|
|
--max-num-seqs 64 \ |
|
|
--disable-custom-all-reduce \ |
|
|
--enable-auto-tool-choice \ |
|
|
--tool-call-parser minimax_m2 \ |
|
|
--reasoning-parser minimax_m2_append_think |
|
|
``` |
|
|
|
|
|
Dependencies |
|
|
|
|
|
Install in a Python 3.12 venv; use CUDA 12.x on the host. |
|
|
|
|
|
``` |
|
|
Package Version Note |
|
|
vllm 0.15.1 OpenAI server + NVFP4 MoE |
|
|
torch 2.9.1+cu128 CUDA 12.8 build |
|
|
transformers 4.57.6 |
|
|
safetensors 0.7.0 |
|
|
nvidia-modelopt 0.41.0 NVFP4 / ModelOpt format |
|
|
flashinfer-python 0.6.1 Optional (we use FLASH_ATTN) |
|
|
nvidia-nccl-cu12 2.27.5 Multi-GPU |
|
|
nvidia-cutlass-dsl* 4.4.0.dev1 NVFP4 GEMM (script uses cutlass backend) |
|
|
System: CUDA 12.8, cuDNN 9.10.2 (or matching torch cuDNN). Driver must support your GPUs (e.g. Blackwell). |
|
|
``` |