---
license: other
license_name: modified-mit
license_link: https://github.com/MiniMax-AI/MiniMax-M2.5/blob/main/LICENSE
base_model: MiniMaxAI/MiniMax-M2.5
tags:
- moe
- nvfp4
- modelopt
- blackwell
- vllm
---
# MiniMax-M2.5-NVFP4
NVFP4-quantized version of [MiniMaxAI/MiniMax-M2.5](https://huggingface.co/MiniMaxAI/MiniMax-M2.5) for deployment on NVIDIA Blackwell GPUs.
## Model Details
| | |
|---|---|
| **Base model** | [MiniMaxAI/MiniMax-M2.5](https://huggingface.co/MiniMaxAI/MiniMax-M2.5) |
| **Architecture** | MiniMaxM2ForCausalLM (Mixture-of-Experts) |
| **Parameters** | ~230B total, ~10B active per token |
| **Layers** | 62 (all MoE) |
| **Experts** | 256 per layer, 8 active per token |
| **Hidden size** | 3072 |
| **Intermediate size** | 1536 per expert |
| **Attention** | 48 heads, 8 KV heads (GQA) |
| **Context length** | 196,608 tokens |
| **Vocabulary** | 200,064 tokens |
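With 8 of 256 experts routed per token, only a small fraction of the expert weights is exercised on any forward pass:

```python
# Routing sparsity implied by the table above: 8 of 256 experts fire per token
experts_total = 256
experts_active = 8

sparsity = experts_active / experts_total
print(f"{sparsity:.1%} of expert weights active per token")  # 3.1%
```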
## Quantization
| | |
|---|---|
| **Method** | NVFP4 (4-bit floating point) |
| **Tool** | [NVIDIA ModelOpt](https://github.com/NVIDIA/TensorRT-Model-Optimizer) 0.41.0 |
| **Group size** | 16 |
| **Calibration** | 512 samples (Korean, Code, Creative Writing, English), max_seq_length=512 |
| **Quantized layers** | MoE expert weights only (`gate_up_proj`, `down_proj`) |
| **BF16 layers** | Attention (Q/K/V/O projections), embeddings, router gates, score correction biases, layer norms, lm_head |
| **Source precision** | FP8 (dequantized to BF16 for calibration) |
### Compression
| Format | Size |
|--------|------|
| BF16 (theoretical) | ~456 GB |
| FP8 (source) | 287 GB |
| **NVFP4 (this model)** | **126 GB** |
~3.6× smaller than the theoretical BF16 size.
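The ratio follows directly from the sizes in the table:

```python
# Sizes from the compression table above (GB)
bf16_gb = 456
nvfp4_gb = 126

print(f"{bf16_gb / nvfp4_gb:.1f}x compression")  # 3.6x compression
```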
## Running with vLLM
[vLLM](https://github.com/vllm-project/vllm) >= 0.15.1 supports this model natively with the `modelopt` quantization backend. Blackwell GPUs (SM100/SM120) are **required** for NVFP4 inference.
### Requirements
- **VRAM**: ~126 GB of model weights in total. Two GPUs with ≥64 GB VRAM each can run it via tensor parallelism; heterogeneous setups can use pipeline parallelism with CPU offloading.
- **System RAM**: If using `cpu_offload_gb`, ensure there is enough free system RAM to hold the offloaded weights in pinned memory.
### Installation
```bash
pip install "vllm>=0.15.1"
```
### Environment Variables
```bash
export VLLM_USE_FLASHINFER_MOE_FP4=0 # Use VLLM_CUTLASS MoE backend (avoids OOM from flashinfer's weight reordering)
export CUDA_DEVICE_ORDER=PCI_BUS_ID # Consistent GPU ordering
```
### Two-GPU Tensor Parallelism (2x ≥64 GB VRAM)
```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="mconcat/MiniMax-M2.5-NVFP4",
    quantization="modelopt",
    trust_remote_code=True,
    tensor_parallel_size=2,
    max_model_len=4096,
    max_num_seqs=64,
    enforce_eager=True,
    gpu_memory_utilization=0.95,
)

sampling_params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["The meaning of life is"], sampling_params)
print(outputs[0].outputs[0].text)
```
### Multi-GPU Pipeline Parallelism (Heterogeneous GPUs)
For setups with unequal VRAM (e.g., one large GPU + smaller GPUs), use pipeline parallelism:
```python
import os

os.environ["VLLM_USE_FLASHINFER_MOE_FP4"] = "0"
os.environ["VLLM_PP_LAYER_PARTITION"] = "40,11,11"  # Adjust per your GPU VRAM ratios

from vllm import LLM, SamplingParams

llm = LLM(
    model="mconcat/MiniMax-M2.5-NVFP4",
    quantization="modelopt",
    trust_remote_code=True,
    pipeline_parallel_size=3,
    cpu_offload_gb=10,
    max_model_len=4096,
    max_num_seqs=64,
    enforce_eager=True,
    gpu_memory_utilization=0.95,
)

sampling_params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["The meaning of life is"], sampling_params)
print(outputs[0].outputs[0].text)
```
**Tuning tips:**
- `VLLM_PP_LAYER_PARTITION` controls how many of the 62 layers each GPU gets. Assign more layers to GPUs with more VRAM.
- Each MoE layer is ~2 GB in NVFP4. Distribute layers so that each GPU's share of layer weights, minus its `cpu_offload_gb`, fits in its free VRAM.
- `cpu_offload_gb` is **per GPU**. Ensure total pinned memory fits in system RAM.
- `max_num_seqs` may need lowering for GPUs with ≤32 GB VRAM.
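A proportional split is usually a good starting point. The helper below is a hypothetical sketch (not part of vLLM) that derives a `VLLM_PP_LAYER_PARTITION` string from per-GPU VRAM sizes; vLLM only consumes the resulting comma-separated string:

```python
def layer_partition(total_layers: int, vram_gb: list[int]) -> str:
    """Split total_layers across GPUs proportionally to their VRAM.

    Hypothetical helper; tune the result by hand if a stage OOMs.
    """
    total_vram = sum(vram_gb)
    # Floor-proportional split, then hand leftover layers to the largest GPUs.
    shares = [total_layers * v // total_vram for v in vram_gb]
    leftover = total_layers - sum(shares)
    for i in sorted(range(len(vram_gb)), key=lambda i: -vram_gb[i])[:leftover]:
        shares[i] += 1
    return ",".join(str(s) for s in shares)

# e.g. one 96 GB GPU plus two 32 GB GPUs, splitting the 62 MoE layers
print(layer_partition(62, [96, 32, 32]))  # 38,12,12
```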
### OpenAI-Compatible API Server
```bash
VLLM_USE_FLASHINFER_MOE_FP4=0 python -m vllm.entrypoints.openai.api_server \
    --model mconcat/MiniMax-M2.5-NVFP4 \
    --quantization modelopt \
    --trust-remote-code \
    --tensor-parallel-size 2 \
    --max-model-len 4096 \
    --max-num-seqs 64 \
    --enforce-eager \
    --gpu-memory-utilization 0.95 \
    --enable-auto-tool-choice \
    --tool-call-parser minimax_m2 \
    --reasoning-parser minimax_m2_append_think \
    --port 8000
```
For pipeline parallelism, replace `--tensor-parallel-size` with `--pipeline-parallel-size N --cpu-offload-gb X` and set `VLLM_PP_LAYER_PARTITION`.
```bash
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "mconcat/MiniMax-M2.5-NVFP4", "prompt": "Hello", "max_tokens": 64}'
```
## Important Notes
- **Blackwell required**: NVFP4 uses Blackwell's 5th-generation Tensor Cores. This model will NOT run on Hopper (H100/H200), Ada (RTX 4090), or older GPUs.
- **trust-remote-code**: Required because MiniMax-M2.5 uses custom configuration code (`auto_map` in config.json). vLLM itself has native `MiniMaxM2ForCausalLM` support.
- **vLLM quantization flag**: Use `--quantization modelopt`. vLLM auto-detects the NVFP4 algorithm and resolves to `modelopt_fp4` internally.
- **MoE backend**: Set `VLLM_USE_FLASHINFER_MOE_FP4=0` to use the VLLM_CUTLASS MoE backend. The default flashinfer backend can cause OOM from temporary allocations during weight reordering.
- **Tool calling**: vLLM has a built-in `minimax_m2` tool call parser. Use `--enable-auto-tool-choice --tool-call-parser minimax_m2` for OpenAI-compatible function calling.
- **Reasoning**: Use `--reasoning-parser minimax_m2_append_think` to extract `<think>` reasoning tokens.
## Quantization Recipe
Following NVIDIA's MLP-only quantization strategy (similar to the [DeepSeek-R1 NVFP4 recipe](https://developer.nvidia.com/blog/nvidia-publishes-the-first-deepseek-r1-nvfp4-quantized-model/)):
- Only MoE expert weights (`gate_up_proj`, `down_proj`) are quantized to FP4
- All attention projections remain in BF16 to preserve quality
- Router gates (`mlp.gate`) and score correction biases remain in BF16
- Embeddings and lm_head remain in BF16
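The layer selection above can be expressed as a simple name filter. This is a sketch of the selection logic only, not ModelOpt's actual API, and the exact tensor names are assumptions based on the layer list above:

```python
def quantize_to_nvfp4(name: str) -> bool:
    """Return True if a weight tensor would be quantized to NVFP4 under this recipe.

    Mirrors the recipe above: only MoE expert projections are quantized;
    attention, router gates, embeddings, and lm_head stay in BF16.
    Tensor naming ("...mlp.experts.N...") is an assumed convention.
    """
    if not name.endswith(".weight"):
        return False
    # Only expert MLP projections inside MoE blocks
    return ".experts." in name and (
        ".gate_up_proj." in name or ".down_proj." in name
    )

print(quantize_to_nvfp4("model.layers.0.mlp.experts.3.down_proj.weight"))  # True
print(quantize_to_nvfp4("model.layers.0.self_attn.q_proj.weight"))         # False
print(quantize_to_nvfp4("model.layers.0.mlp.gate.weight"))                 # False (router stays BF16)
```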
### Calibration Data
| Domain | Samples | Dataset |
|--------|---------|---------|
| Korean | 128 | [heegyu/open-korean-instructions](https://huggingface.co/datasets/heegyu/open-korean-instructions) |
| Code | 128 | [m-a-p/CodeFeedback-Filtered-Instruction](https://huggingface.co/datasets/m-a-p/CodeFeedback-Filtered-Instruction) |
| Creative Writing | 128 | [Gryphe/ChatGPT-4o-Writing-Prompts](https://huggingface.co/datasets/Gryphe/ChatGPT-4o-Writing-Prompts) |
| General English | 128 | [teknium/OpenHermes-2.5](https://huggingface.co/datasets/teknium/OpenHermes-2.5) |
## Files
| File | Description |
|------|-------------|
| `model-00001-of-00032.safetensors` ... `model-00032-of-00032.safetensors` | Quantized model weights (32 shards, ~4 GB each) |
| `model.safetensors.index.json` | Weight shard index |
| `config.json` | Model configuration with `quantization_config` |
| `hf_quant_config.json` | ModelOpt quantization metadata |
| `configuration_minimax_m2.py` | Custom model configuration class |
| `modeling_minimax_m2.py` | Custom model implementation |
| `tokenizer.json` | Tokenizer |
| `tokenizer_config.json` | Tokenizer configuration |
| `chat_template.jinja` | Chat template |
## Hardware
Quantization was performed on 8× NVIDIA A100-SXM4-80GB GPUs with ~1.8 TiB of system RAM. Calibration does not require Blackwell hardware; only inference with native NVFP4 execution does.
## Limitations
- Requires NVIDIA Blackwell GPUs (SM100/SM120) for native NVFP4 inference
- Quality may differ from the original FP8/BF16 model, particularly on tasks sensitive to numerical precision
- Calibration was bilingual (Korean + English) with code; other languages may see slightly higher degradation
- This quantization targets the MLP/expert layers only; KV cache is not quantized
## License
Same license as the base model: [Modified MIT](https://github.com/MiniMax-AI/MiniMax-M2.5/blob/main/LICENSE).