---
license: other
license_name: modified-mit
license_link: https://github.com/MiniMax-AI/MiniMax-M2.5/blob/main/LICENSE
base_model: MiniMaxAI/MiniMax-M2.5
tags:
- moe
- nvfp4
- modelopt
- blackwell
- vllm
---

# MiniMax-M2.5-NVFP4

NVFP4-quantized version of [MiniMaxAI/MiniMax-M2.5](https://huggingface.co/MiniMaxAI/MiniMax-M2.5) for deployment on NVIDIA Blackwell GPUs.

## Model Details

| | |
|---|---|
| **Base model** | [MiniMaxAI/MiniMax-M2.5](https://huggingface.co/MiniMaxAI/MiniMax-M2.5) |
| **Architecture** | MiniMaxM2ForCausalLM (Mixture-of-Experts) |
| **Parameters** | 456B total |
| **Layers** | 62 (all MoE) |
| **Experts** | 256 per layer, 8 active per token |
| **Hidden size** | 3072 |
| **Intermediate size** | 1536 per expert |
| **Attention** | 48 heads, 8 KV heads (GQA) |
| **Context length** | 196,608 tokens |
| **Vocabulary** | 200,064 tokens |

## Quantization

| | |
|---|---|
| **Method** | NVFP4 (4-bit floating point) |
| **Tool** | [NVIDIA ModelOpt](https://github.com/NVIDIA/TensorRT-Model-Optimizer) 0.41.0 |
| **Group size** | 16 |
| **Calibration** | 512 samples (Korean, Code, Creative Writing, English), max_seq_length=512 |
| **Quantized layers** | MoE expert weights only (`gate_up_proj`, `down_proj`) |
| **BF16 layers** | Attention (Q/K/V/O projections), embeddings, router gates, score correction biases, layer norms, lm_head |
| **Source precision** | FP8 (dequantized to BF16 for calibration) |

### Compression

| Format | Size |
|--------|------|
| BF16 (theoretical) | ~456 GB |
| FP8 (source) | 287 GB |
| **NVFP4 (this model)** | **126 GB** |

3.6x compression vs the BF16 equivalent.

## Running with vLLM

[vLLM](https://github.com/vllm-project/vllm) >= 0.15.1 supports this model natively with the `modelopt` quantization backend. Blackwell GPUs (SM100/SM120) are **required** for NVFP4 inference.

### Requirements

- **VRAM**: ~126 GB of total model weights.
  Two GPUs with ≥64 GB VRAM each can run via tensor parallelism; heterogeneous setups can use pipeline parallelism with CPU offloading.
- **System RAM**: If using `cpu_offload_gb`, you need sufficient system RAM for pinned memory.

### Installation

```bash
pip install "vllm>=0.15.1"
```

### Environment Variables

```bash
export VLLM_USE_FLASHINFER_MOE_FP4=0  # Use the VLLM_CUTLASS MoE backend (avoids OOM from flashinfer's weight reordering)
export CUDA_DEVICE_ORDER=PCI_BUS_ID   # Consistent GPU ordering
```

### Two-GPU Tensor Parallelism (2x ≥64 GB VRAM)

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="mconcat/MiniMax-M2.5-NVFP4",
    quantization="modelopt",
    trust_remote_code=True,
    tensor_parallel_size=2,
    max_model_len=4096,
    max_num_seqs=64,
    enforce_eager=True,
    gpu_memory_utilization=0.95,
)

sampling_params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["The meaning of life is"], sampling_params)
print(outputs[0].outputs[0].text)
```

### Multi-GPU Pipeline Parallelism (Heterogeneous GPUs)

For setups with unequal VRAM (e.g., one large GPU plus smaller GPUs), use pipeline parallelism:

```python
import os

os.environ["VLLM_USE_FLASHINFER_MOE_FP4"] = "0"
os.environ["VLLM_PP_LAYER_PARTITION"] = "40,11,11"  # Adjust per your GPU VRAM ratios

from vllm import LLM, SamplingParams

llm = LLM(
    model="mconcat/MiniMax-M2.5-NVFP4",
    quantization="modelopt",
    trust_remote_code=True,
    pipeline_parallel_size=3,
    cpu_offload_gb=10,
    max_model_len=4096,
    max_num_seqs=64,
    enforce_eager=True,
    gpu_memory_utilization=0.95,
)

sampling_params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["The meaning of life is"], sampling_params)
print(outputs[0].outputs[0].text)
```

**Tuning tips:**

- `VLLM_PP_LAYER_PARTITION` controls how many of the 62 layers each GPU gets. Assign more layers to GPUs with more VRAM.
- Each MoE layer is ~2 GB (NVFP4). Distribute so that `(layer_weights - cpu_offload_gb)` fits on each GPU.
- `cpu_offload_gb` is **per GPU**.
  Ensure the total pinned memory fits in system RAM.
- `max_num_seqs` may need lowering for GPUs with ≤32 GB VRAM.

### OpenAI-Compatible API Server

```bash
VLLM_USE_FLASHINFER_MOE_FP4=0 python -m vllm.entrypoints.openai.api_server \
    --model mconcat/MiniMax-M2.5-NVFP4 \
    --quantization modelopt \
    --trust-remote-code \
    --tensor-parallel-size 2 \
    --max-model-len 4096 \
    --max-num-seqs 64 \
    --enforce-eager \
    --gpu-memory-utilization 0.95 \
    --enable-auto-tool-choice \
    --tool-call-parser minimax_m2 \
    --reasoning-parser minimax_m2_append_think \
    --port 8000
```

For pipeline parallelism, replace `--tensor-parallel-size` with `--pipeline-parallel-size N --cpu-offload-gb X` and set `VLLM_PP_LAYER_PARTITION`.

```bash
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "mconcat/MiniMax-M2.5-NVFP4", "prompt": "Hello", "max_tokens": 64}'
```

## Important Notes

- **Blackwell required**: NVFP4 uses Blackwell's 5th-generation Tensor Cores. This model will NOT run on Hopper (H100/H200), Ada (RTX 4090), or older GPUs.
- **trust-remote-code**: Required because MiniMax-M2.5 ships custom configuration code (`auto_map` in config.json). vLLM itself has native `MiniMaxM2ForCausalLM` support.
- **vLLM quantization flag**: Use `--quantization modelopt`. vLLM auto-detects the NVFP4 algorithm and resolves it to `modelopt_fp4` internally.
- **MoE backend**: Set `VLLM_USE_FLASHINFER_MOE_FP4=0` to use the VLLM_CUTLASS MoE backend. The default flashinfer backend can cause OOM from temporary allocations during weight reordering.
- **Tool calling**: vLLM has a built-in `minimax_m2` tool call parser. Use `--enable-auto-tool-choice --tool-call-parser minimax_m2` for OpenAI-compatible function calling.
- **Reasoning**: Use `--reasoning-parser minimax_m2_append_think` to extract `<think>` reasoning tokens.
## Quantization Recipe

Following NVIDIA's MLP-only quantization strategy (similar to the [DeepSeek-R1 NVFP4 recipe](https://developer.nvidia.com/blog/nvidia-publishes-the-first-deepseek-r1-nvfp4-quantized-model/)):

- Only MoE expert weights (`gate_up_proj`, `down_proj`) are quantized to FP4
- All attention projections remain in BF16 to preserve quality
- Router gates (`mlp.gate`) and score correction biases remain in BF16
- Embeddings and lm_head remain in BF16

### Calibration Data

| Domain | Samples | Dataset |
|--------|---------|---------|
| Korean | 128 | [heegyu/open-korean-instructions](https://huggingface.co/datasets/heegyu/open-korean-instructions) |
| Code | 128 | [m-a-p/CodeFeedback-Filtered-Instruction](https://huggingface.co/datasets/m-a-p/CodeFeedback-Filtered-Instruction) |
| Creative Writing | 128 | [Gryphe/ChatGPT-4o-Writing-Prompts](https://huggingface.co/datasets/Gryphe/ChatGPT-4o-Writing-Prompts) |
| General English | 128 | [teknium/OpenHermes-2.5](https://huggingface.co/datasets/teknium/OpenHermes-2.5) |

## Files

| File | Description |
|------|-------------|
| `model-00001-of-00032.safetensors` ... `model-00032-of-00032.safetensors` | Quantized model weights (32 shards, ~4 GB each) |
| `model.safetensors.index.json` | Weight shard index |
| `config.json` | Model configuration with `quantization_config` |
| `hf_quant_config.json` | ModelOpt quantization metadata |
| `configuration_minimax_m2.py` | Custom model configuration class |
| `modeling_minimax_m2.py` | Custom model implementation |
| `tokenizer.json` | Tokenizer |
| `tokenizer_config.json` | Tokenizer configuration |
| `chat_template.jinja` | Chat template |

## Hardware

Quantization was performed on 8x NVIDIA A100-SXM4-80GB with ~1.8 TiB of system RAM. Quantization (calibration on A100) does not require Blackwell hardware; only inference with native FP4 execution does.
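The layer selection in the Quantization Recipe above can be illustrated with a short sketch. The glob patterns are illustrative guesses at the checkpoint's parameter names (check `model.safetensors.index.json` for the real ones), not ModelOpt's actual config keys:

```python
import fnmatch

# Mirrors the recipe: only MoE expert projections go to NVFP4; everything
# else (attention, router gate, embeddings, norms, lm_head) stays BF16.
FP4_PATTERNS = ["*experts*gate_up_proj*", "*experts*down_proj*"]

def target_precision(param_name: str) -> str:
    if any(fnmatch.fnmatch(param_name, p) for p in FP4_PATTERNS):
        return "nvfp4"
    return "bf16"

print(target_precision("model.layers.0.mlp.experts.7.down_proj.weight"))  # nvfp4
print(target_precision("model.layers.0.self_attn.q_proj.weight"))         # bf16
print(target_precision("model.layers.0.mlp.gate.weight"))                 # bf16 (router stays high precision)
```

Because only expert weights match, sensitive components such as the router and attention path never see FP4 rounding, which is the core of the MLP-only strategy.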
## Limitations

- Requires NVIDIA Blackwell GPUs (SM100/SM120) for native NVFP4 inference
- Quality may differ from the original FP8/BF16 model, particularly on tasks sensitive to numerical precision
- Calibration was bilingual (Korean + English) plus code; other languages may see slightly higher degradation
- This quantization targets the MLP/expert layers only; the KV cache is not quantized

## License

Same license as the base model: [Modified MIT](https://github.com/MiniMax-AI/MiniMax-M2.5/blob/main/LICENSE).