---
license: other
license_name: modified-mit
license_link: https://github.com/MiniMax-AI/MiniMax-M2.5/blob/main/LICENSE
base_model: MiniMaxAI/MiniMax-M2.5
tags:
  - moe
  - nvfp4
  - modelopt
  - blackwell
  - vllm
---

# MiniMax-M2.5-NVFP4

NVFP4-quantized version of [MiniMaxAI/MiniMax-M2.5](https://huggingface.co/MiniMaxAI/MiniMax-M2.5) for deployment on NVIDIA Blackwell GPUs.

## Model Details

| | |
|---|---|
| **Base model** | [MiniMaxAI/MiniMax-M2.5](https://huggingface.co/MiniMaxAI/MiniMax-M2.5) |
| **Architecture** | MiniMaxM2ForCausalLM (Mixture-of-Experts) |
| **Parameters** | ~228B total |
| **Layers** | 62 (all MoE) |
| **Experts** | 256 per layer, 8 active per token |
| **Hidden size** | 3072 |
| **Intermediate size** | 1536 per expert |
| **Attention** | 48 heads, 8 KV heads (GQA) |
| **Context length** | 196,608 tokens |
| **Vocabulary** | 200,064 tokens |

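From the table, only 8 of the 256 experts run per token, so each token touches roughly 3% of the expert weights. A back-of-envelope check, assuming the fused `gate_up_proj` layout (hidden → 2×intermediate) per expert; treat the exact figures as approximate:

```python
# Active vs. total expert parameters, from the architecture table above.
hidden, inter, layers = 3072, 1536, 62

per_expert = hidden * (2 * inter) + inter * hidden   # fused gate_up + down
active = layers * 8 * per_expert                     # 8 experts fire per token
total = layers * 256 * per_expert
print(f"{active / 1e9:.1f}B active expert params of {total / 1e9:.1f}B total")
# -> 7.0B active expert params of 224.7B total
```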
## Quantization

| | |
|---|---|
| **Method** | NVFP4 (4-bit floating point) |
| **Tool** | [NVIDIA ModelOpt](https://github.com/NVIDIA/TensorRT-Model-Optimizer) 0.41.0 |
| **Group size** | 16 |
| **Calibration** | 512 samples (Korean, Code, Creative Writing, English), max_seq_length=512 |
| **Quantized layers** | MoE expert weights only (`gate_up_proj`, `down_proj`) |
| **BF16 layers** | Attention (Q/K/V/O projections), embeddings, router gates, score correction biases, layer norms, lm_head |
| **Source precision** | FP8 (dequantized to BF16 for calibration) |

### Compression

| Format | Size |
|--------|------|
| BF16 (theoretical) | ~456 GB |
| FP8 (source) | 287 GB |
| **NVFP4 (this model)** | **126 GB** |

3.6x compression vs BF16 equivalent.
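
The 126 GB figure can be roughly reproduced from the architecture table: NVFP4 stores 4 bits per weight plus one FP8 scale per group of 16, and only the expert weights are quantized. A sketch that ignores the BF16 attention/embedding layers and any global scale tensors:

```python
# Expert weights only: 62 MoE layers x 256 experts, each with a fused
# gate_up_proj (hidden -> 2*intermediate) and a down_proj.
hidden, inter, layers, experts = 3072, 1536, 62, 256
expert_params = (hidden * 2 * inter + inter * hidden) * experts * layers

# NVFP4 with group size 16: 4 bits per weight + one FP8 (1-byte) scale
# per 16 weights -> 4.5 bits, i.e. 0.5625 bytes per weight.
bytes_per_weight = (4 + 8 / 16) / 8
print(f"{expert_params * bytes_per_weight / 1e9:.0f} GB of NVFP4 expert weights")
# -> 126 GB of NVFP4 expert weights
```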

## Running with vLLM

[vLLM](https://github.com/vllm-project/vllm) >= 0.15.1 supports this model natively with the `modelopt` quantization backend. Blackwell GPUs (SM100/SM120) are **required** for NVFP4 inference.

### Requirements

- **VRAM**: ~126 GB of model weights in total. Two GPUs with ≥64 GB of VRAM each can serve the model via tensor parallelism; heterogeneous setups can combine pipeline parallelism with CPU offloading.
- **System RAM**: if you use `cpu_offload_gb`, ensure there is enough system RAM to hold the offloaded weights as pinned memory.

### Installation

```bash
pip install "vllm>=0.15.1"
```

### Environment Variables

```bash
export VLLM_USE_FLASHINFER_MOE_FP4=0   # Use VLLM_CUTLASS MoE backend (avoids OOM from flashinfer's weight reordering)
export CUDA_DEVICE_ORDER=PCI_BUS_ID     # Consistent GPU ordering
```

### Two-GPU Tensor Parallelism (2x ≥64 GB VRAM)

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="mconcat/MiniMax-M2.5-NVFP4",
    quantization="modelopt",
    trust_remote_code=True,
    tensor_parallel_size=2,
    max_model_len=4096,
    max_num_seqs=64,
    enforce_eager=True,
    gpu_memory_utilization=0.95,
)

sampling_params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["The meaning of life is"], sampling_params)
print(outputs[0].outputs[0].text)
```

### Multi-GPU Pipeline Parallelism (Heterogeneous GPUs)

For setups with unequal VRAM (e.g., one large GPU + smaller GPUs), use pipeline parallelism:

```python
import os
os.environ["VLLM_USE_FLASHINFER_MOE_FP4"] = "0"
os.environ["VLLM_PP_LAYER_PARTITION"] = "40,11,11"  # Adjust per your GPU VRAM ratios

from vllm import LLM, SamplingParams

llm = LLM(
    model="mconcat/MiniMax-M2.5-NVFP4",
    quantization="modelopt",
    trust_remote_code=True,
    pipeline_parallel_size=3,
    cpu_offload_gb=10,
    max_model_len=4096,
    max_num_seqs=64,
    enforce_eager=True,
    gpu_memory_utilization=0.95,
)

sampling_params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["The meaning of life is"], sampling_params)
print(outputs[0].outputs[0].text)
```

**Tuning tips:**
- `VLLM_PP_LAYER_PARTITION` controls how many of the 62 layers each GPU gets. Assign more layers to GPUs with more VRAM.
- Each MoE layer is ~2 GB (NVFP4). Distribute so that `(layer_weights - cpu_offload_gb)` fits on each GPU.
- `cpu_offload_gb` is **per GPU**. Ensure total pinned memory fits in system RAM.
- `max_num_seqs` may need lowering for GPUs with ≤32 GB VRAM.
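
Putting the tips together, a partition can be sketched by splitting the 62 layers in proportion to each stage's free VRAM. This helper is illustrative, not part of vLLM, and the GPU sizes are a hypothetical example:

```python
def pp_layer_partition(total_layers, vram_gb):
    """Largest-remainder split of total_layers proportional to vram_gb."""
    shares = [total_layers * v / sum(vram_gb) for v in vram_gb]
    layers = [int(s) for s in shares]
    # Hand leftover layers to the stages with the largest fractional share.
    spare = total_layers - sum(layers)
    order = sorted(range(len(shares)), key=lambda i: shares[i] - layers[i], reverse=True)
    for i in order[:spare]:
        layers[i] += 1
    return layers

# Hypothetical setup: one 96 GB GPU plus two 32 GB GPUs.
partition = pp_layer_partition(62, [96, 32, 32])
print(",".join(map(str, partition)))  # -> 37,13,12 (value for VLLM_PP_LAYER_PARTITION)
```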

### OpenAI-Compatible API Server

```bash
VLLM_USE_FLASHINFER_MOE_FP4=0 python -m vllm.entrypoints.openai.api_server \
    --model mconcat/MiniMax-M2.5-NVFP4 \
    --quantization modelopt \
    --trust-remote-code \
    --tensor-parallel-size 2 \
    --max-model-len 4096 \
    --max-num-seqs 64 \
    --enforce-eager \
    --gpu-memory-utilization 0.95 \
    --enable-auto-tool-choice \
    --tool-call-parser minimax_m2 \
    --reasoning-parser minimax_m2_append_think \
    --port 8000
```

For pipeline parallelism, replace `--tensor-parallel-size` with `--pipeline-parallel-size N --cpu-offload-gb X` and set `VLLM_PP_LAYER_PARTITION`.

```bash
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "mconcat/MiniMax-M2.5-NVFP4", "prompt": "Hello", "max_tokens": 64}'
```

## Important Notes

- **Blackwell required**: NVFP4 uses Blackwell's 5th-generation Tensor Cores. This model will NOT run on Hopper (H100/H200), Ada (RTX 4090), or older GPUs.
- **trust-remote-code**: Required because MiniMax-M2.5 uses custom configuration code (`auto_map` in config.json). vLLM itself has native `MiniMaxM2ForCausalLM` support.
- **vLLM quantization flag**: Use `--quantization modelopt`. vLLM auto-detects the NVFP4 algorithm and resolves to `modelopt_fp4` internally.
- **MoE backend**: Set `VLLM_USE_FLASHINFER_MOE_FP4=0` to use the VLLM_CUTLASS MoE backend. The default flashinfer backend can cause OOM from temporary allocations during weight reordering.
- **Tool calling**: vLLM has a built-in `minimax_m2` tool call parser. Use `--enable-auto-tool-choice --tool-call-parser minimax_m2` for OpenAI-compatible function calling.
- **Reasoning**: Use `--reasoning-parser minimax_m2_append_think` to extract `<think>` reasoning tokens.

## Quantization Recipe

Following NVIDIA's MLP-only quantization strategy (similar to the [DeepSeek-R1 NVFP4 recipe](https://developer.nvidia.com/blog/nvidia-publishes-the-first-deepseek-r1-nvfp4-quantized-model/)):

- Only MoE expert weights (`gate_up_proj`, `down_proj`) are quantized to FP4
- All attention projections remain in BF16 to preserve quality
- Router gates (`mlp.gate`) and score correction biases remain in BF16
- Embeddings and lm_head remain in BF16
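
The rule above amounts to a name filter over the checkpoint's tensors: quantize a tensor iff it is an expert projection. A minimal sketch; the parameter names are illustrative and may not match the checkpoint's exact naming:

```python
import re

# Expert projections get FP4; everything else stays BF16.
EXPERT_WEIGHT = re.compile(r"\.experts\.\d+\.(gate_up_proj|down_proj)\.weight$")

def quantize_to_fp4(name: str) -> bool:
    return EXPERT_WEIGHT.search(name) is not None

assert quantize_to_fp4("model.layers.0.mlp.experts.17.down_proj.weight")
assert not quantize_to_fp4("model.layers.0.self_attn.q_proj.weight")  # attention: BF16
assert not quantize_to_fp4("model.layers.0.mlp.gate.weight")          # router: BF16
assert not quantize_to_fp4("lm_head.weight")                          # head: BF16
```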

### Calibration Data

| Domain | Samples | Dataset |
|--------|---------|---------|
| Korean | 128 | [heegyu/open-korean-instructions](https://huggingface.co/datasets/heegyu/open-korean-instructions) |
| Code | 128 | [m-a-p/CodeFeedback-Filtered-Instruction](https://huggingface.co/datasets/m-a-p/CodeFeedback-Filtered-Instruction) |
| Creative Writing | 128 | [Gryphe/ChatGPT-4o-Writing-Prompts](https://huggingface.co/datasets/Gryphe/ChatGPT-4o-Writing-Prompts) |
| General English | 128 | [teknium/OpenHermes-2.5](https://huggingface.co/datasets/teknium/OpenHermes-2.5) |
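
For reference, a 512-sample mix like this can be assembled by round-robin interleaving 128 samples per domain, so every calibration batch sees all four domains. A pure-Python sketch with placeholder strings standing in for the datasets above:

```python
from itertools import chain

def build_calib_set(domains, per_domain=128):
    """Interleave per_domain samples from each domain list, round-robin."""
    picked = [d[:per_domain] for d in domains]
    return list(chain.from_iterable(zip(*picked)))

# Placeholders for the four datasets in the table above.
korean  = [f"ko-{i}" for i in range(150)]
code    = [f"code-{i}" for i in range(150)]
writing = [f"story-{i}" for i in range(150)]
english = [f"en-{i}" for i in range(150)]

calib = build_calib_set([korean, code, writing, english])
print(len(calib), calib[:4])  # -> 512 ['ko-0', 'code-0', 'story-0', 'en-0']
```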

## Files

| File | Description |
|------|-------------|
| `model-00001-of-00032.safetensors` ... `model-00032-of-00032.safetensors` | Quantized model weights (32 shards, ~4 GB each) |
| `model.safetensors.index.json` | Weight shard index |
| `config.json` | Model configuration with `quantization_config` |
| `hf_quant_config.json` | ModelOpt quantization metadata |
| `configuration_minimax_m2.py` | Custom model configuration class |
| `modeling_minimax_m2.py` | Custom model implementation |
| `tokenizer.json` | Tokenizer |
| `tokenizer_config.json` | Tokenizer configuration |
| `chat_template.jinja` | Chat template |

## Hardware

Quantization was performed on 8x NVIDIA A100-SXM4-80GB with ~1.8 TiB system RAM. Quantization (calibration on A100) does not require Blackwell hardware; only inference with native FP4 execution does.

## Limitations

- Requires NVIDIA Blackwell GPUs (SM100/SM120) for native NVFP4 inference
- Quality may differ from the original FP8/BF16 model, particularly on tasks sensitive to numerical precision
- Calibration was bilingual (Korean + English) with code; other languages may see slightly higher degradation
- This quantization targets the MLP/expert layers only; KV cache is not quantized

## License

Same license as the base model: [Modified MIT](https://github.com/MiniMax-AI/MiniMax-M2.5/blob/main/LICENSE).