---
license: other
license_name: modified-mit
license_link: https://github.com/MiniMax-AI/MiniMax-M2.5/blob/main/LICENSE
base_model: MiniMaxAI/MiniMax-M2.5
tags:
- moe
- nvfp4
- modelopt
- blackwell
- vllm
---
# MiniMax-M2.5-NVFP4
NVFP4-quantized version of [MiniMaxAI/MiniMax-M2.5](https://huggingface.co/MiniMaxAI/MiniMax-M2.5) for deployment on NVIDIA Blackwell GPUs.
## Model Details
| | |
|---|---|
| **Base model** | [MiniMaxAI/MiniMax-M2.5](https://huggingface.co/MiniMaxAI/MiniMax-M2.5) |
| **Architecture** | MiniMaxM2ForCausalLM (Mixture-of-Experts) |
| **Parameters** | ~230B total, ~10B active per token |
| **Layers** | 62 (all MoE) |
| **Experts** | 256 per layer, 8 active per token |
| **Hidden size** | 3072 |
| **Intermediate size** | 1536 per expert |
| **Attention** | 48 heads, 8 KV heads (GQA) |
| **Context length** | 196,608 tokens |
| **Vocabulary** | 200,064 tokens |
## Quantization
| | |
|---|---|
| **Method** | NVFP4 (4-bit floating point) |
| **Tool** | [NVIDIA ModelOpt](https://github.com/NVIDIA/TensorRT-Model-Optimizer) 0.41.0 |
| **Group size** | 16 |
| **Calibration** | 512 samples (Korean, Code, Creative Writing, English), max_seq_length=512 |
| **Quantized layers** | MoE expert weights only (`gate_up_proj`, `down_proj`) |
| **BF16 layers** | Attention (Q/K/V/O projections), embeddings, router gates, score correction biases, layer norms, lm_head |
| **Source precision** | FP8 (dequantized to BF16 for calibration) |
### Compression
| Format | Size |
|--------|------|
| BF16 (theoretical) | ~456 GB |
| FP8 (source) | 287 GB |
| **NVFP4 (this model)** | **126 GB** |
This is ~3.6x compression versus the BF16 equivalent and ~2.3x versus the FP8 source checkpoint.
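The NVFP4 figure can be sanity-checked against the architecture table above. A rough back-of-the-envelope sketch, assuming 4-bit weights plus one FP8 (1-byte) scale per 16-element group, and ignoring the BF16 attention/embedding weights, which add only a few GB:

```python
# Back-of-the-envelope NVFP4 size estimate from the config values above.
hidden, inter = 3072, 1536
experts, layers = 256, 62
group = 16

# Parameters per expert: fused gate_up (hidden -> 2*inter) + down (inter -> hidden)
per_expert = hidden * (2 * inter) + inter * hidden   # 14,155,776
expert_params = per_expert * experts * layers        # ~224.7B

# 4 bits per weight + one 1-byte scale per 16-element group
bytes_total = expert_params * (0.5 + 1 / group)
print(f"expert params: {expert_params / 1e9:.1f}B")
print(f"approx NVFP4 size: {bytes_total / 1e9:.0f} GB")
```

The result (~126 GB) matches the shipped shard total, and dividing by 62 layers gives the ~2 GB-per-layer figure used in the pipeline-parallelism tuning tips.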
## Running with vLLM
[vLLM](https://github.com/vllm-project/vllm) >= 0.15.1 supports this model natively with the `modelopt` quantization backend. Blackwell GPUs (SM100/SM120) are **required** for NVFP4 inference.
### Requirements
- **VRAM**: ~126 GB total model weight. Two GPUs with ≥64 GB VRAM each can run via tensor parallelism; heterogeneous setups can use pipeline parallelism with CPU offloading.
- **System RAM**: If using `cpu_offload_gb`, you need sufficient system RAM for pinned memory.
### Installation
```bash
pip install "vllm>=0.15.1"
```
### Environment Variables
```bash
export VLLM_USE_FLASHINFER_MOE_FP4=0 # Use VLLM_CUTLASS MoE backend (avoids OOM from flashinfer's weight reordering)
export CUDA_DEVICE_ORDER=PCI_BUS_ID # Consistent GPU ordering
```
### Two-GPU Tensor Parallelism (2x ≥64 GB VRAM)
```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="mconcat/MiniMax-M2.5-NVFP4",
    quantization="modelopt",
    trust_remote_code=True,
    tensor_parallel_size=2,
    max_model_len=4096,
    max_num_seqs=64,
    enforce_eager=True,
    gpu_memory_utilization=0.95,
)

sampling_params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["The meaning of life is"], sampling_params)
print(outputs[0].outputs[0].text)
```
### Multi-GPU Pipeline Parallelism (Heterogeneous GPUs)
For setups with unequal VRAM (e.g., one large GPU + smaller GPUs), use pipeline parallelism:
```python
import os

os.environ["VLLM_USE_FLASHINFER_MOE_FP4"] = "0"
os.environ["VLLM_PP_LAYER_PARTITION"] = "40,11,11"  # Adjust per your GPU VRAM ratios

from vllm import LLM, SamplingParams

llm = LLM(
    model="mconcat/MiniMax-M2.5-NVFP4",
    quantization="modelopt",
    trust_remote_code=True,
    pipeline_parallel_size=3,
    cpu_offload_gb=10,
    max_model_len=4096,
    max_num_seqs=64,
    enforce_eager=True,
    gpu_memory_utilization=0.95,
)

sampling_params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["The meaning of life is"], sampling_params)
print(outputs[0].outputs[0].text)
```
**Tuning tips:**
- `VLLM_PP_LAYER_PARTITION` controls how many of the 62 layers each GPU gets. Assign more layers to GPUs with more VRAM.
- Each MoE layer is ~2 GB in NVFP4. Distribute layers so that each GPU's share of weights, minus its `cpu_offload_gb`, fits in that GPU's VRAM.
- `cpu_offload_gb` is **per GPU**. Ensure total pinned memory fits in system RAM.
- `max_num_seqs` may need lowering for GPUs with ≤32 GB VRAM.
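A starting partition can be derived from the VRAM ratios. The helper below is a hypothetical sketch (the three-GPU sizes in the example are illustrative, not a recommendation), and the result usually still needs manual adjustment for KV cache and activation headroom:

```python
def suggest_partition(vram_gb: list[float], total_layers: int = 62) -> str:
    """Split layers across pipeline stages proportionally to VRAM,
    rounding while keeping the total equal to total_layers."""
    total_vram = sum(vram_gb)
    shares = [v / total_vram * total_layers for v in vram_gb]
    layers = [int(s) for s in shares]
    # Hand leftover layers to the stages with the largest remainders
    for i in sorted(range(len(shares)), key=lambda i: shares[i] - layers[i], reverse=True):
        if sum(layers) == total_layers:
            break
        layers[i] += 1
    return ",".join(map(str, layers))

# e.g. one 141 GB GPU plus two 48 GB GPUs (hypothetical sizes)
print(suggest_partition([141, 48, 48]))
```

Pass the result as `VLLM_PP_LAYER_PARTITION`, then verify each stage fits with a short test run.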
### OpenAI-Compatible API Server
```bash
VLLM_USE_FLASHINFER_MOE_FP4=0 python -m vllm.entrypoints.openai.api_server \
    --model mconcat/MiniMax-M2.5-NVFP4 \
    --quantization modelopt \
    --trust-remote-code \
    --tensor-parallel-size 2 \
    --max-model-len 4096 \
    --max-num-seqs 64 \
    --enforce-eager \
    --gpu-memory-utilization 0.95 \
    --enable-auto-tool-choice \
    --tool-call-parser minimax_m2 \
    --reasoning-parser minimax_m2_append_think \
    --port 8000
```
For pipeline parallelism, replace `--tensor-parallel-size` with `--pipeline-parallel-size N --cpu-offload-gb X` and set `VLLM_PP_LAYER_PARTITION`.
```bash
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{"model": "mconcat/MiniMax-M2.5-NVFP4", "prompt": "Hello", "max_tokens": 64}'
```
## Important Notes
- **Blackwell required**: NVFP4 uses Blackwell's 5th-generation Tensor Cores. This model will NOT run on Hopper (H100/H200), Ada (RTX 4090), or older GPUs.
- **trust-remote-code**: Required because MiniMax-M2.5 uses custom configuration code (`auto_map` in config.json). vLLM itself has native `MiniMaxM2ForCausalLM` support.
- **vLLM quantization flag**: Use `--quantization modelopt`. vLLM auto-detects the NVFP4 algorithm and resolves to `modelopt_fp4` internally.
- **MoE backend**: Set `VLLM_USE_FLASHINFER_MOE_FP4=0` to use the VLLM_CUTLASS MoE backend. The default flashinfer backend can cause OOM from temporary allocations during weight reordering.
- **Tool calling**: vLLM has a built-in `minimax_m2` tool call parser. Use `--enable-auto-tool-choice --tool-call-parser minimax_m2` for OpenAI-compatible function calling.
- **Reasoning**: Use `--reasoning-parser minimax_m2_append_think` to extract `<think>` reasoning tokens.
## Quantization Recipe
Following NVIDIA's MLP-only quantization strategy (similar to the [DeepSeek-R1 NVFP4 recipe](https://developer.nvidia.com/blog/nvidia-publishes-the-first-deepseek-r1-nvfp4-quantized-model/)):
- Only MoE expert weights (`gate_up_proj`, `down_proj`) are quantized to FP4
- All attention projections remain in BF16 to preserve quality
- Router gates (`mlp.gate`) and score correction biases remain in BF16
- Embeddings and lm_head remain in BF16
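The resulting quantization boundary can be expressed as a simple name filter. The predicate below is illustrative only (not ModelOpt's actual API), assuming Hugging Face-style parameter names:

```python
def is_nvfp4_quantized(param_name: str) -> bool:
    """True only for MoE expert projection weights; everything else stays BF16."""
    is_expert_proj = ".experts." in param_name and (
        "gate_up_proj" in param_name or "down_proj" in param_name
    )
    return is_expert_proj and param_name.endswith(".weight")

print(is_nvfp4_quantized("model.layers.0.mlp.experts.17.gate_up_proj.weight"))  # True
print(is_nvfp4_quantized("model.layers.0.self_attn.q_proj.weight"))             # False
print(is_nvfp4_quantized("model.layers.0.mlp.gate.weight"))                     # False (router stays BF16)
```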
### Calibration Data
| Domain | Samples | Dataset |
|--------|---------|---------|
| Korean | 128 | [heegyu/open-korean-instructions](https://huggingface.co/datasets/heegyu/open-korean-instructions) |
| Code | 128 | [m-a-p/CodeFeedback-Filtered-Instruction](https://huggingface.co/datasets/m-a-p/CodeFeedback-Filtered-Instruction) |
| Creative Writing | 128 | [Gryphe/ChatGPT-4o-Writing-Prompts](https://huggingface.co/datasets/Gryphe/ChatGPT-4o-Writing-Prompts) |
| General English | 128 | [teknium/OpenHermes-2.5](https://huggingface.co/datasets/teknium/OpenHermes-2.5) |
## Files
| File | Description |
|------|-------------|
| `model-00001-of-00032.safetensors` ... `model-00032-of-00032.safetensors` | Quantized model weights (32 shards, ~4 GB each) |
| `model.safetensors.index.json` | Weight shard index |
| `config.json` | Model configuration with `quantization_config` |
| `hf_quant_config.json` | ModelOpt quantization metadata |
| `configuration_minimax_m2.py` | Custom model configuration class |
| `modeling_minimax_m2.py` | Custom model implementation |
| `tokenizer.json` | Tokenizer |
| `tokenizer_config.json` | Tokenizer configuration |
| `chat_template.jinja` | Chat template |
## Hardware
Quantization was performed on 8x NVIDIA A100-SXM4-80GB with ~1.8 TiB of system RAM. Calibration does not require Blackwell hardware; only inference with native FP4 execution does.
## Limitations
- Requires NVIDIA Blackwell GPUs (SM100/SM120) for native NVFP4 inference
- Quality may differ from the original FP8/BF16 model, particularly on tasks sensitive to numerical precision
- Calibration was bilingual (Korean + English) with code; other languages may see slightly higher degradation
- This quantization targets the MLP/expert layers only; KV cache is not quantized
## License
Same license as the base model: [Modified MIT](https://github.com/MiniMax-AI/MiniMax-M2.5/blob/main/LICENSE).