---
license: mit
base_model: MiniMaxAI/MiniMax-M2.1
tags:
- minimax
- moe
- nvfp4
- quantized
- vllm
- blackwell
library_name: transformers
---

# MiniMax-M2.1-NVFP4

NVFP4-quantized version of [MiniMaxAI/MiniMax-M2.1](https://huggingface.co/MiniMaxAI/MiniMax-M2.1) for efficient inference on NVIDIA Blackwell GPUs.

## Model Details

| Property | Value |
|----------|-------|
| Base Model | [MiniMaxAI/MiniMax-M2.1](https://huggingface.co/MiniMaxAI/MiniMax-M2.1) |
| Architecture | Mixture of Experts (MoE) |
| Total Parameters | 229B |
| Active Parameters | ~45B (8 of 256 experts) |
| Quantization | NVFP4 (E2M1 format) |
| Size | 131 GB |

## Quantization Details

- **Format**: NVFP4 with two-level scaling (block-wise FP8 scales plus a global FP32 scale)
- **Scheme**: `compressed-tensors` with the `nvfp4-pack-quantized` format
- **Targets**: all linear layers in the attention blocks and MoE experts
- **Group Size**: 16

A worked dequantization example is sketched in the appendix at the end of this card.

## Requirements

- NVIDIA Blackwell GPU (RTX 5090, RTX PRO 6000, etc.)
- vLLM with flashinfer-cutlass NVFP4 support
- ~130 GB of total VRAM (TP=2 recommended for dual-GPU setups)

## Usage with vLLM

```python
from vllm import LLM, SamplingParams

# Offline batch inference across two GPUs (tensor parallelism).
llm = LLM(
    model="GadflyII/MiniMax-M2.1-NVFP4",
    tensor_parallel_size=2,
    max_model_len=4096,
    gpu_memory_utilization=0.90,
    trust_remote_code=True,
)

sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=1024,
)

outputs = llm.generate(["Your prompt here"], sampling_params)
print(outputs[0].outputs[0].text)
```

To serve the model behind an OpenAI-compatible API instead of running offline inference, see the serving sketch in the appendix.

## Performance

Tested on 2x RTX PRO 6000 Blackwell (96 GB each):

| Prompt Tokens | Output Tokens | Throughput |
|---------------|---------------|------------|
| ~100 | 100 | ~73 tok/s |
| ~1260 | 1000 | ~72 tok/s |

A rough script for checking throughput on your own hardware is sketched in the appendix.

## License

Same as the base model; see [MiniMaxAI/MiniMax-M2.1](https://huggingface.co/MiniMaxAI/MiniMax-M2.1) for details.

## Acknowledgments

- [MiniMax](https://www.minimax.io/) for the original MiniMax-M2.1 model
- The [vLLM](https://github.com/vllm-project/vllm) team for NVFP4 quantization support
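
## Appendix: NVFP4 Dequantization Sketch

For intuition about the two-level scaling described in Quantization Details, here is a minimal NumPy sketch of the decode path: each 4-bit E2M1 code maps to a real value, which is scaled by its group's FP8 block scale and then by the global FP32 scale. This is an illustration under those assumptions, not the packed-tensor kernel that vLLM/flashinfer actually run; the function and variable names are hypothetical.

```python
import numpy as np

# The eight non-negative E2M1 magnitudes (1 sign, 2 exponent, 1 mantissa bit).
E2M1_MAGNITUDES = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32)

def dequantize_nvfp4(codes, block_scales, global_scale, group_size=16):
    """codes: uint8 array of unpacked 4-bit E2M1 codes (one per element).
    block_scales: one scale per group of `group_size` elements (FP8 E4M3 on
    device; modeled here as float32 after decoding).
    global_scale: single per-tensor FP32 scale."""
    signs = np.where(codes & 0b1000, -1.0, 1.0).astype(np.float32)
    magnitudes = E2M1_MAGNITUDES[codes & 0b0111]
    # Broadcast each block scale over its group of 16, then apply both levels.
    expanded = np.repeat(block_scales, group_size)[: codes.size]
    return signs * magnitudes * expanded * global_scale

# One group of 16 elements: codes 0x1, 0x9, 0x7, 0x0 decode to +0.5, -0.5, +6.0, 0.0.
codes = np.array([0x1, 0x9, 0x7, 0x0] * 4, dtype=np.uint8)
print(dequantize_nvfp4(codes, np.array([0.25], dtype=np.float32), 2.0))
```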
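
## Appendix: OpenAI-Compatible Serving

A minimal serving sketch, assuming a local vLLM OpenAI-compatible server on the default port (8000) and the `openai` Python client; the prompt text is just a placeholder.

```python
# Launch the server first (shell):
#   vllm serve GadflyII/MiniMax-M2.1-NVFP4 \
#     --tensor-parallel-size 2 --max-model-len 4096 \
#     --gpu-memory-utilization 0.90 --trust-remote-code
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="GadflyII/MiniMax-M2.1-NVFP4",
    messages=[{"role": "user", "content": "Summarize NVFP4 in two sentences."}],
    temperature=0.7,
    max_tokens=256,
)
print(response.choices[0].message.content)
```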
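
## Appendix: Rough Throughput Check

A single-request timing sketch for sanity-checking the numbers in the Performance table; this is not the exact methodology used for that table. `ignore_eos=True` forces generation to run the full `max_tokens` so the token count is deterministic.

```python
import time
from vllm import LLM, SamplingParams

llm = LLM(
    model="GadflyII/MiniMax-M2.1-NVFP4",
    tensor_parallel_size=2,
    max_model_len=4096,
    gpu_memory_utilization=0.90,
    trust_remote_code=True,
)
params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=1000, ignore_eos=True)

start = time.perf_counter()
outputs = llm.generate(["Describe mixture-of-experts routing."], params)
elapsed = time.perf_counter() - start
n_tokens = len(outputs[0].outputs[0].token_ids)
print(f"{n_tokens} output tokens in {elapsed:.1f}s -> {n_tokens / elapsed:.1f} tok/s")
```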