---
license: other
license_name: modified-mit
license_link: https://github.com/MiniMax-AI/MiniMax-M2.5/blob/main/LICENSE
base_model: MiniMaxAI/MiniMax-M2.5
tags:
- moe
- nvfp4
- modelopt
- blackwell
- vllm
---
# MiniMax-M2.5-NVFP4
NVFP4-quantized version of [MiniMaxAI/MiniMax-M2.5](https://huggingface.co/MiniMaxAI/MiniMax-M2.5) for deployment on NVIDIA Blackwell GPUs.
## Model Details
| | |
|---|---|
| **Base model** | [MiniMaxAI/MiniMax-M2.5](https://huggingface.co/MiniMaxAI/MiniMax-M2.5) |
| **Architecture** | MiniMaxM2ForCausalLM (Mixture-of-Experts) |
| **Parameters** | ~230B total, ~10B active per token |
| **Layers** | 62 (all MoE) |
| **Experts** | 256 per layer, 8 active per token |
| **Hidden size** | 3072 |
| **Intermediate size** | 1536 per expert |
| **Attention** | 48 heads, 8 KV heads (GQA) |
| **Context length** | 196,608 tokens |
| **Vocabulary** | 200,064 tokens |
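With 8 of 256 experts routed per token, only a small fraction of the expert weights is exercised on any forward pass:

```python
# Routing sparsity implied by the table above: 8 of 256 experts fire per token
experts_total = 256
experts_active = 8

sparsity = experts_active / experts_total
print(f"{sparsity:.1%} of expert weights active per token")  # 3.1%
```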
## Quantization
| | |
|---|---|
| **Method** | NVFP4 (4-bit floating point) |
| **Tool** | [NVIDIA ModelOpt](https://github.com/NVIDIA/TensorRT-Model-Optimizer) 0.41.0 |
| **Group size** | 16 |
| **Calibration** | 512 samples (Korean, Code, Creative Writing, English), max_seq_length=512 |
| **Quantized layers** | MoE expert weights only (`gate_up_proj`, `down_proj`) |
| **BF16 layers** | Attention (Q/K/V/O projections), embeddings, router gates, score correction biases, layer norms, lm_head |
| **Source precision** | FP8 (dequantized to BF16 for calibration) |
### Compression
| Format | Size |
|--------|------|
| BF16 (theoretical) | ~456 GB |
| FP8 (source) | 287 GB |
| **NVFP4 (this model)** | **126 GB** |
~3.6× smaller than the theoretical BF16 size.
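The ratio follows directly from the sizes in the table:

```python
# Sizes from the compression table above (GB)
bf16_gb = 456
nvfp4_gb = 126

print(f"{bf16_gb / nvfp4_gb:.1f}x compression")  # 3.6x compression
```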
## Running with vLLM
[vLLM](https://github.com/vllm-project/vllm) >= 0.15.1 supports this model natively with the `modelopt` quantization backend. Blackwell GPUs (SM100/SM120) are **required** for NVFP4 inference.
### Requirements
- **VRAM**: ~126 GB of model weights in total. Two GPUs with ≥64 GB VRAM each can run it via tensor parallelism; heterogeneous setups can use pipeline parallelism with CPU offloading.
- **System RAM**: If using `cpu_offload_gb`, ensure there is enough free system RAM to hold the offloaded weights in pinned memory.
### Installation
```bash
pip install "vllm>=0.15.1"
```
### Environment Variables
```bash
export VLLM_USE_FLASHINFER_MOE_FP4=0 # Use VLLM_CUTLASS MoE backend (avoids OOM from flashinfer's weight reordering)
export CUDA_DEVICE_ORDER=PCI_BUS_ID # Consistent GPU ordering
```
### Two-GPU Tensor Parallelism (2x ≥64 GB VRAM)
```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="mconcat/MiniMax-M2.5-NVFP4",
    quantization="modelopt",
    trust_remote_code=True,
    tensor_parallel_size=2,
    max_model_len=4096,
    max_num_seqs=64,
    enforce_eager=True,
    gpu_memory_utilization=0.95,
)

sampling_params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["The meaning of life is"], sampling_params)
print(outputs[0].outputs[0].text)
```
### Multi-GPU Pipeline Parallelism (Heterogeneous GPUs)
For setups with unequal VRAM (e.g., one large GPU + smaller GPUs), use pipeline parallelism:
```python
import os

os.environ["VLLM_USE_FLASHINFER_MOE_FP4"] = "0"
os.environ["VLLM_PP_LAYER_PARTITION"] = "40,11,11"  # Adjust per your GPU VRAM ratios

from vllm import LLM, SamplingParams

llm = LLM(
    model="mconcat/MiniMax-M2.5-NVFP4",
    quantization="modelopt",
    trust_remote_code=True,
    pipeline_parallel_size=3,
    cpu_offload_gb=10,
    max_model_len=4096,
    max_num_seqs=64,
    enforce_eager=True,
    gpu_memory_utilization=0.95,
)

sampling_params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["The meaning of life is"], sampling_params)
print(outputs[0].outputs[0].text)
```
**Tuning tips:**
- `VLLM_PP_LAYER_PARTITION` controls how many of the 62 layers each GPU gets. Assign more layers to GPUs with more VRAM.
- Each MoE layer is ~2 GB in NVFP4. Distribute layers so that each GPU's share of layer weights, minus its `cpu_offload_gb`, fits in its free VRAM.
- `cpu_offload_gb` is **per GPU**. Ensure total pinned memory fits in system RAM.
- `max_num_seqs` may need lowering for GPUs with ≤32 GB VRAM.
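A proportional split is usually a good starting point. The helper below is a hypothetical sketch (not part of vLLM) that derives a `VLLM_PP_LAYER_PARTITION` string from per-GPU VRAM sizes; vLLM only consumes the resulting comma-separated string:

```python
def layer_partition(total_layers: int, vram_gb: list[int]) -> str:
    """Split total_layers across GPUs proportionally to their VRAM.

    Hypothetical helper; tune the result by hand if a stage OOMs.
    """
    total_vram = sum(vram_gb)
    # Floor-proportional split, then hand leftover layers to the largest GPUs.
    shares = [total_layers * v // total_vram for v in vram_gb]
    leftover = total_layers - sum(shares)
    for i in sorted(range(len(vram_gb)), key=lambda i: -vram_gb[i])[:leftover]:
        shares[i] += 1
    return ",".join(str(s) for s in shares)

# e.g. one 96 GB GPU plus two 32 GB GPUs, splitting the 62 MoE layers
print(layer_partition(62, [96, 32, 32]))  # 38,12,12
```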
### OpenAI-Compatible API Server
```bash
VLLM_USE_FLASHINFER_MOE_FP4=0 python -m vllm.entrypoints.openai.api_server \
    --model mconcat/MiniMax-M2.5-NVFP4 \
    --quantization modelopt \
    --trust-remote-code \
    --tensor-parallel-size 2 \
    --max-model-len 4096 \
    --max-num-seqs 64 \
    --enforce-eager \
    --gpu-memory-utilization 0.95 \
    --enable-auto-tool-choice \
    --tool-call-parser minimax_m2 \
    --reasoning-parser minimax_m2_append_think \
    --port 8000
```
For pipeline parallelism, replace `--tensor-parallel-size` with `--pipeline-parallel-size N --cpu-offload-gb X` and set `VLLM_PP_LAYER_PARTITION`.
```bash
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "mconcat/MiniMax-M2.5-NVFP4", "prompt": "Hello", "max_tokens": 64}'
```
## Important Notes
- **Blackwell required**: NVFP4 uses Blackwell's 5th-generation Tensor Cores. This model will NOT run on Hopper (H100/H200), Ada (RTX 4090), or older GPUs.
- **trust-remote-code**: Required because MiniMax-M2.5 uses custom configuration code (`auto_map` in config.json). vLLM itself has native `MiniMaxM2ForCausalLM` support.
- **vLLM quantization flag**: Use `--quantization modelopt`. vLLM auto-detects the NVFP4 algorithm and resolves to `modelopt_fp4` internally.
- **MoE backend**: Set `VLLM_USE_FLASHINFER_MOE_FP4=0` to use the VLLM_CUTLASS MoE backend. The default flashinfer backend can cause OOM from temporary allocations during weight reordering.
- **Tool calling**: vLLM has a built-in `minimax_m2` tool call parser. Use `--enable-auto-tool-choice --tool-call-parser minimax_m2` for OpenAI-compatible function calling.
- **Reasoning**: Use `--reasoning-parser minimax_m2_append_think` to extract `<think>` reasoning tokens.
## Quantization Recipe
Following NVIDIA's MLP-only quantization strategy (similar to the [DeepSeek-R1 NVFP4 recipe](https://developer.nvidia.com/blog/nvidia-publishes-the-first-deepseek-r1-nvfp4-quantized-model/)):
- Only MoE expert weights (`gate_up_proj`, `down_proj`) are quantized to FP4
- All attention projections remain in BF16 to preserve quality
- Router gates (`mlp.gate`) and score correction biases remain in BF16
- Embeddings and lm_head remain in BF16
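The layer selection above can be expressed as a simple name filter. This is a sketch of the selection logic only, not ModelOpt's actual API, and the exact tensor names are assumptions based on the layer list above:

```python
def quantize_to_nvfp4(name: str) -> bool:
    """Return True if a weight tensor would be quantized to NVFP4 under this recipe.

    Mirrors the recipe above: only MoE expert projections are quantized;
    attention, router gates, embeddings, and lm_head stay in BF16.
    Tensor naming ("...mlp.experts.N...") is an assumed convention.
    """
    if not name.endswith(".weight"):
        return False
    # Only expert MLP projections inside MoE blocks
    return ".experts." in name and (
        ".gate_up_proj." in name or ".down_proj." in name
    )

print(quantize_to_nvfp4("model.layers.0.mlp.experts.3.down_proj.weight"))  # True
print(quantize_to_nvfp4("model.layers.0.self_attn.q_proj.weight"))         # False
print(quantize_to_nvfp4("model.layers.0.mlp.gate.weight"))                 # False (router stays BF16)
```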
### Calibration Data
| Domain | Samples | Dataset |
|--------|---------|---------|
| Korean | 128 | [heegyu/open-korean-instructions](https://huggingface.co/datasets/heegyu/open-korean-instructions) |
| Code | 128 | [m-a-p/CodeFeedback-Filtered-Instruction](https://huggingface.co/datasets/m-a-p/CodeFeedback-Filtered-Instruction) |
| Creative Writing | 128 | [Gryphe/ChatGPT-4o-Writing-Prompts](https://huggingface.co/datasets/Gryphe/ChatGPT-4o-Writing-Prompts) |
| General English | 128 | [teknium/OpenHermes-2.5](https://huggingface.co/datasets/teknium/OpenHermes-2.5) |
## Files
| File | Description |
|------|-------------|
| `model-00001-of-00032.safetensors` ... `model-00032-of-00032.safetensors` | Quantized model weights (32 shards, ~4 GB each) |
| `model.safetensors.index.json` | Weight shard index |
| `config.json` | Model configuration with `quantization_config` |
| `hf_quant_config.json` | ModelOpt quantization metadata |
| `configuration_minimax_m2.py` | Custom model configuration class |
| `modeling_minimax_m2.py` | Custom model implementation |
| `tokenizer.json` | Tokenizer |
| `tokenizer_config.json` | Tokenizer configuration |
| `chat_template.jinja` | Chat template |
## Hardware
Quantization was performed on 8× NVIDIA A100-SXM4-80GB GPUs with ~1.8 TiB of system RAM. Calibration does not require Blackwell hardware; only inference with native NVFP4 execution does.
## Limitations
- Requires NVIDIA Blackwell GPUs (SM100/SM120) for native NVFP4 inference
- Quality may differ from the original FP8/BF16 model, particularly on tasks sensitive to numerical precision
- Calibration was bilingual (Korean + English) with code; other languages may see slightly higher degradation
- This quantization targets the MLP/expert layers only; KV cache is not quantized
## License
Same license as the base model: [Modified MIT](https://github.com/MiniMax-AI/MiniMax-M2.5/blob/main/LICENSE).