---
license: other
license_name: modified-mit
license_link: https://github.com/MiniMax-AI/MiniMax-M2.5/blob/main/LICENSE
base_model: MiniMaxAI/MiniMax-M2.5
tags:
- moe
- nvfp4
- modelopt
- blackwell
- vllm
---

# MiniMax-M2.5-NVFP4

NVFP4-quantized version of [MiniMaxAI/MiniMax-M2.5](https://huggingface.co/MiniMaxAI/MiniMax-M2.5) for deployment on NVIDIA Blackwell GPUs.

## Model Details

| | |
|---|---|
| **Base model** | [MiniMaxAI/MiniMax-M2.5](https://huggingface.co/MiniMaxAI/MiniMax-M2.5) |
| **Architecture** | MiniMaxM2ForCausalLM (Mixture-of-Experts) |
| **Parameters** | 456B total |
| **Layers** | 62 (all MoE) |
| **Experts** | 256 per layer, 8 active per token |
| **Hidden size** | 3072 |
| **Intermediate size** | 1536 per expert |
| **Attention** | 48 heads, 8 KV heads (GQA) |
| **Context length** | 196,608 tokens |
| **Vocabulary** | 200,064 tokens |

## Quantization

| | |
|---|---|
| **Method** | NVFP4 (4-bit floating point) |
| **Tool** | [NVIDIA ModelOpt](https://github.com/NVIDIA/TensorRT-Model-Optimizer) 0.41.0 |
| **Group size** | 16 |
| **Calibration** | 512 samples (Korean, Code, Creative Writing, English), max_seq_length=512 |
| **Quantized layers** | MoE expert weights only (`gate_up_proj`, `down_proj`) |
| **BF16 layers** | Attention (Q/K/V/O projections), embeddings, router gates, score correction biases, layer norms, lm_head |
| **Source precision** | FP8 (dequantized to BF16 for calibration) |

### Compression

| Format | Size |
|--------|------|
| BF16 (theoretical) | ~456 GB |
| FP8 (source) | 287 GB |
| **NVFP4 (this model)** | **126 GB** |

3.6x compression relative to the theoretical BF16 size.

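The ~3.6x figure can be sanity-checked with back-of-envelope arithmetic: NVFP4 stores each quantized weight in 4 bits plus one 8-bit (FP8) scale shared by each group of 16 weights (the additional per-tensor scale contributes negligibly).

```python
# Back-of-envelope NVFP4 storage cost for the quantized tensors:
# 4 bits per weight plus one 8-bit (FP8) scale per group of 16 weights.
group_size = 16
bits_per_weight = 4 + 8 / group_size      # 4.5 bits effective
compression_vs_bf16 = 16 / bits_per_weight  # vs 16-bit BF16

print(bits_per_weight)                 # 4.5
print(round(compression_vs_bf16, 1))   # 3.6
```

This lines up with the ~3.6x seen above, since the expert weights dominate the total parameter count.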
## Running with vLLM

[vLLM](https://github.com/vllm-project/vllm) >= 0.15.1 supports this model natively with the `modelopt` quantization backend. Blackwell GPUs (SM100/SM120) are **required** for NVFP4 inference.

### Requirements

- **VRAM**: ~126 GB of model weights in total. Two GPUs with ≥64 GB VRAM each can run it via tensor parallelism; heterogeneous setups can use pipeline parallelism with CPU offloading.
- **System RAM**: If using `cpu_offload_gb`, you need enough system RAM for the pinned host memory.

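Beyond the weights, budget VRAM for the KV cache. A rough per-token estimate, assuming `head_dim = hidden_size / num_heads = 3072 / 48 = 64` (an assumption — verify against `head_dim` in config.json):

```python
# Rough per-token KV-cache cost in BF16, assuming head_dim = 3072 / 48 = 64
# (an assumption -- check head_dim in config.json).
layers, kv_heads, head_dim, bytes_per_elem = 62, 8, 64, 2
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem  # K and V

print(kv_bytes_per_token // 1024, "KiB/token")
print(kv_bytes_per_token * 4096 / 2**30, "GiB")  # one full 4096-token sequence
```

Under this assumption, each 4096-token sequence costs roughly half a GiB of KV cache, which is why `max_num_seqs` matters on smaller GPUs.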
### Installation

```bash
pip install "vllm>=0.15.1"
```

### Environment Variables

```bash
export VLLM_USE_FLASHINFER_MOE_FP4=0  # Use VLLM_CUTLASS MoE backend (avoids OOM from flashinfer's weight reordering)
export CUDA_DEVICE_ORDER=PCI_BUS_ID   # Consistent GPU ordering
```

### Two-GPU Tensor Parallelism (2x ≥64 GB VRAM)

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="mconcat/MiniMax-M2.5-NVFP4",
    quantization="modelopt",
    trust_remote_code=True,
    tensor_parallel_size=2,
    max_model_len=4096,
    max_num_seqs=64,
    enforce_eager=True,
    gpu_memory_utilization=0.95,
)

sampling_params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["The meaning of life is"], sampling_params)
print(outputs[0].outputs[0].text)
```

### Multi-GPU Pipeline Parallelism (Heterogeneous GPUs)

For setups with unequal VRAM (e.g., one large GPU + smaller GPUs), use pipeline parallelism:

```python
import os
os.environ["VLLM_USE_FLASHINFER_MOE_FP4"] = "0"
os.environ["VLLM_PP_LAYER_PARTITION"] = "40,11,11"  # Adjust per your GPU VRAM ratios

from vllm import LLM, SamplingParams

llm = LLM(
    model="mconcat/MiniMax-M2.5-NVFP4",
    quantization="modelopt",
    trust_remote_code=True,
    pipeline_parallel_size=3,
    cpu_offload_gb=10,
    max_model_len=4096,
    max_num_seqs=64,
    enforce_eager=True,
    gpu_memory_utilization=0.95,
)

sampling_params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["The meaning of life is"], sampling_params)
print(outputs[0].outputs[0].text)
```

**Tuning tips:**
- `VLLM_PP_LAYER_PARTITION` controls how many of the 62 layers each GPU gets. Assign more layers to GPUs with more VRAM.
- Each MoE layer is ~2 GB (NVFP4). Distribute so that `(layer_weights - cpu_offload_gb)` fits on each GPU.
- `cpu_offload_gb` is **per GPU**. Ensure the total pinned memory fits in system RAM.
- `max_num_seqs` may need lowering for GPUs with ≤32 GB VRAM.

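To illustrate the first tip, a partition string can be derived from the VRAM ratios. The helper below is a hypothetical sketch (not part of vLLM — vLLM only reads the final comma-separated string):

```python
def partition_layers(vram_gb, total_layers=62):
    """Split total_layers across GPUs roughly proportionally to their VRAM."""
    total_vram = sum(vram_gb)
    parts = [total_layers * v // total_vram for v in vram_gb]
    # Give any leftover layers (lost to flooring) to the largest GPU.
    parts[max(range(len(vram_gb)), key=lambda i: vram_gb[i])] += total_layers - sum(parts)
    return parts

# e.g. one 80 GB GPU plus two 24 GB GPUs:
partition = ",".join(map(str, partition_layers([80, 24, 24])))
print(partition)  # 40,11,11
```

In practice you would then set `VLLM_PP_LAYER_PARTITION` to this string and lower the largest GPU's share if it also hosts the embeddings or runs out of activation memory.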
### OpenAI-Compatible API Server

```bash
VLLM_USE_FLASHINFER_MOE_FP4=0 python -m vllm.entrypoints.openai.api_server \
    --model mconcat/MiniMax-M2.5-NVFP4 \
    --quantization modelopt \
    --trust-remote-code \
    --tensor-parallel-size 2 \
    --max-model-len 4096 \
    --max-num-seqs 64 \
    --enforce-eager \
    --gpu-memory-utilization 0.95 \
    --enable-auto-tool-choice \
    --tool-call-parser minimax_m2 \
    --reasoning-parser minimax_m2_append_think \
    --port 8000
```

For pipeline parallelism, replace `--tensor-parallel-size` with `--pipeline-parallel-size N --cpu-offload-gb X` and set `VLLM_PP_LAYER_PARTITION`.

```bash
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "mconcat/MiniMax-M2.5-NVFP4", "prompt": "Hello", "max_tokens": 64}'
```

## Important Notes

- **Blackwell required**: NVFP4 uses Blackwell's 5th-generation Tensor Cores. This model will NOT run on Hopper (H100/H200), Ada (RTX 4090), or older GPUs.
- **trust-remote-code**: Required because MiniMax-M2.5 uses custom configuration code (`auto_map` in config.json). vLLM itself has native `MiniMaxM2ForCausalLM` support.
- **vLLM quantization flag**: Use `--quantization modelopt`. vLLM auto-detects the NVFP4 algorithm and resolves it to `modelopt_fp4` internally.
- **MoE backend**: Set `VLLM_USE_FLASHINFER_MOE_FP4=0` to use the VLLM_CUTLASS MoE backend. The default flashinfer backend can cause OOM from temporary allocations during weight reordering.
- **Tool calling**: vLLM has a built-in `minimax_m2` tool call parser. Use `--enable-auto-tool-choice --tool-call-parser minimax_m2` for OpenAI-compatible function calling.
- **Reasoning**: Use `--reasoning-parser minimax_m2_append_think` to extract `<think>` reasoning tokens.

## Quantization Recipe

Following NVIDIA's MLP-only quantization strategy (similar to the [DeepSeek-R1 NVFP4 recipe](https://developer.nvidia.com/blog/nvidia-publishes-the-first-deepseek-r1-nvfp4-quantized-model/)):

- Only MoE expert weights (`gate_up_proj`, `down_proj`) are quantized to FP4
- All attention projections remain in BF16 to preserve quality
- Router gates (`mlp.gate`) and score correction biases remain in BF16
- Embeddings and lm_head remain in BF16

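This layer selection can be expressed as a ModelOpt-style pattern config. The fragment below is an illustrative sketch, not the exact config used for this model: ModelOpt matches quantizer names with wildcard patterns, and `{"enable": False}` leaves a module in its original precision. The specific pattern strings and field values here are assumptions to verify against the ModelOpt documentation.

```python
# Illustrative sketch (not the exact config used): quantize only the MoE
# expert projections; leave attention, router, embeddings, lm_head in BF16.
nvfp4_expert_only_cfg = {
    "quant_cfg": {
        # NVFP4: E2M1 weights with a shared scale per block of 16 elements
        "*experts*weight_quantizer": {
            "num_bits": (2, 1),       # FP4 (E2M1)
            "block_sizes": {-1: 16},  # group size 16
            "enable": True,
        },
        # Everything else stays unquantized
        "*self_attn*": {"enable": False},
        "*mlp.gate*": {"enable": False},
        "*lm_head*": {"enable": False},
        "default": {"enable": False},
    },
    "algorithm": "max",
}
```

A config of this shape would then be passed to `modelopt.torch.quantization`'s `quantize()` together with a calibration forward loop.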
### Calibration Data

| Domain | Samples | Dataset |
|--------|---------|---------|
| Korean | 128 | [heegyu/open-korean-instructions](https://huggingface.co/datasets/heegyu/open-korean-instructions) |
| Code | 128 | [m-a-p/CodeFeedback-Filtered-Instruction](https://huggingface.co/datasets/m-a-p/CodeFeedback-Filtered-Instruction) |
| Creative Writing | 128 | [Gryphe/ChatGPT-4o-Writing-Prompts](https://huggingface.co/datasets/Gryphe/ChatGPT-4o-Writing-Prompts) |
| General English | 128 | [teknium/OpenHermes-2.5](https://huggingface.co/datasets/teknium/OpenHermes-2.5) |

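The sampling scheme above (128 examples per domain, mixed into one 512-sample set) can be sketched as follows, with placeholder strings standing in for the actual dataset rows:

```python
import random

# Placeholder pools standing in for the four datasets listed above.
pools = {
    "korean": [f"ko-{i}" for i in range(1000)],
    "code": [f"code-{i}" for i in range(1000)],
    "writing": [f"write-{i}" for i in range(1000)],
    "english": [f"en-{i}" for i in range(1000)],
}

rng = random.Random(0)  # fixed seed for reproducibility
calib = [s for pool in pools.values() for s in rng.sample(pool, 128)]
rng.shuffle(calib)  # interleave domains so calibration batches are mixed
print(len(calib))  # 512
```

In the real pipeline each sample would additionally be rendered through the chat template and truncated to `max_seq_length=512` tokens before calibration.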
## Files

| File | Description |
|------|-------------|
| `model-00001-of-00032.safetensors` ... `model-00032-of-00032.safetensors` | Quantized model weights (32 shards, ~4 GB each) |
| `model.safetensors.index.json` | Weight shard index |
| `config.json` | Model configuration with `quantization_config` |
| `hf_quant_config.json` | ModelOpt quantization metadata |
| `configuration_minimax_m2.py` | Custom model configuration class |
| `modeling_minimax_m2.py` | Custom model implementation |
| `tokenizer.json` | Tokenizer |
| `tokenizer_config.json` | Tokenizer configuration |
| `chat_template.jinja` | Chat template |

## Hardware

Quantization was performed on 8x NVIDIA A100-SXM4-80GB with ~1.8 TiB of system RAM. Calibration does not require Blackwell hardware; only inference with native FP4 execution does.

## Limitations

- Requires NVIDIA Blackwell GPUs (SM100/SM120) for native NVFP4 inference
- Quality may differ from the original FP8/BF16 model, particularly on tasks sensitive to numerical precision
- Calibration was bilingual (Korean + English) with code; other languages may see slightly higher degradation
- This quantization targets the MLP/expert layers only; the KV cache is not quantized

## License

Same license as the base model: [Modified MIT](https://github.com/MiniMax-AI/MiniMax-M2.5/blob/main/LICENSE).
|