	---
	license: mit
	base_model: MiniMaxAI/MiniMax-M2.1
	tags:
	- minimax
	- moe
	- nvfp4
	- quantized
	- vllm
	- blackwell
	library_name: transformers
	---

	# MiniMax-M2.1-NVFP4

	NVFP4 quantized version of [MiniMaxAI/MiniMax-M2.1](https://huggingface.co/MiniMaxAI/MiniMax-M2.1) for efficient inference on NVIDIA Blackwell GPUs.

	## Model Details

	| Property | Value |
	|----------|-------|
	| Base Model | [MiniMaxAI/MiniMax-M2.1](https://huggingface.co/MiniMaxAI/MiniMax-M2.1) |
	| Architecture | Mixture of Experts (MoE) |
	| Total Parameters | 229B |
	| Active Parameters | ~45B (8 of 256 experts) |
	| Quantization | NVFP4 (e2m1 format) |
	| Size | 131 GB |

	## Quantization Details

	- Format: NVFP4 with two-level scaling (an FP8 scale per block plus a global FP32 scale per tensor)
	- Scheme: `compressed-tensors` with `nvfp4-pack-quantized` format
	- Target: All linear layers in attention and MoE experts
	- Group Size: 16
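
To make the two-level scaling above concrete, here is a minimal NumPy sketch of quantizing a weight vector to e2m1 values in groups of 16. It is an illustration under stated assumptions, not the code used to produce this checkpoint: the function names are hypothetical, the scale convention (multiplying on dequantization) is chosen for readability, and block scales are kept in float32 rather than being rounded to FP8.

```python
import numpy as np

# Magnitudes representable by a 4-bit e2m1 value (1 sign, 2 exponent, 1 mantissa bit).
E2M1_VALUES = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32)
E2M1_MAX = 6.0
FP8_E4M3_MAX = 448.0  # largest finite FP8 E4M3 value
GROUP_SIZE = 16


def quantize_nvfp4(weights):
    """Return (codes, block_scales, global_scale); len(weights) must be a multiple of 16."""
    w = np.asarray(weights, dtype=np.float32).reshape(-1, GROUP_SIZE)
    # Level 1: one FP32 scale per tensor, chosen so every block scale fits the FP8 range.
    amax = float(np.abs(w).max())
    global_scale = amax / (E2M1_MAX * FP8_E4M3_MAX) if amax > 0 else 1.0
    # Level 2: one scale per 16-value block (assumption: left in float32, not rounded to FP8).
    block_amax = np.abs(w).max(axis=1, keepdims=True)
    block_scales = block_amax / (E2M1_MAX * global_scale)
    block_scales = np.where(block_scales == 0, 1.0, block_scales)
    # Map each block into [-6, 6] and round to the nearest e2m1 magnitude.
    scaled = w / (block_scales * global_scale)
    idx = np.abs(np.abs(scaled)[..., None] - E2M1_VALUES).argmin(axis=-1)
    codes = np.sign(scaled) * E2M1_VALUES[idx]
    return codes, block_scales, global_scale


def dequantize_nvfp4(codes, block_scales, global_scale):
    """Invert quantize_nvfp4 up to rounding error."""
    return (codes * block_scales * global_scale).reshape(-1)


w = np.linspace(-1.0, 1.0, 32, dtype=np.float32)
codes, block_scales, global_scale = quantize_nvfp4(w)
w_hat = dequantize_nvfp4(codes, block_scales, global_scale)
print("max abs error:", float(np.abs(w_hat - w).max()))
```

Because the widest gap between e2m1 values is 2 (between 4 and 6), the reconstruction error within a block is at most one sixth of that block's largest magnitude in this sketch.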

	## Requirements

	- NVIDIA Blackwell GPU (RTX 5090, RTX PRO 6000, etc.)
	- vLLM with flashinfer-cutlass NVFP4 support
	- ~130 GB of total VRAM across GPUs (`tensor_parallel_size=2` recommended for dual-GPU setups)
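
As a rough sanity check on the VRAM figure, the weight footprint can be estimated from the parameter count and the group size. The sketch below assumes every parameter is stored as NVFP4 (4 bits plus one 8-bit FP8 scale per group of 16) and ignores unquantized layers, per-tensor scales, KV cache, and activations, so it is a lower bound rather than a measurement. The helper name is hypothetical.

```python
def nvfp4_weight_gb(num_params: float, group_size: int = 16) -> float:
    """Approximate weight size in GB (1e9 bytes) if all parameters were NVFP4."""
    # 4-bit value per parameter plus one 8-bit block scale shared by each group.
    bits_per_param = 4 + 8 / group_size
    return num_params * bits_per_param / 8 / 1e9


total_gb = nvfp4_weight_gb(229e9)
print(f"{total_gb:.1f} GB total")              # 128.8 GB total
print(f"{total_gb / 2:.1f} GB per GPU, TP=2")  # 64.4 GB per GPU, TP=2
```

The estimate lands close to the 131 GB listed under Model Details; the remainder is plausibly the layers left unquantized.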

	## Usage with vLLM

	```python
	from vllm import LLM, SamplingParams

	# Load the quantized checkpoint, sharded across two GPUs.
	llm = LLM(
	    model="GadflyII/MiniMax-M2.1-NVFP4",
	    tensor_parallel_size=2,
	    max_model_len=4096,
	    gpu_memory_utilization=0.90,
	    trust_remote_code=True,
	)

	# Sampling settings for generation.
	sampling_params = SamplingParams(
	    temperature=0.7,
	    top_p=0.9,
	    max_tokens=1024,
	)

	# generate() takes a list of prompts and returns one result per prompt.
	outputs = llm.generate(["Your prompt here"], sampling_params)
	print(outputs[0].outputs[0].text)
	```
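
If an OpenAI-compatible HTTP endpoint is preferred over in-process generation, the same settings can be passed to the `vllm serve` command. The sketch below only builds and prints the command; `build_serve_command` is a hypothetical helper, and the flags mirror the `LLM(...)` arguments above rather than a configuration tested by the author.

```python
import shlex


def build_serve_command(
    model: str,
    tensor_parallel_size: int = 2,
    max_model_len: int = 4096,
    gpu_memory_utilization: float = 0.90,
) -> list[str]:
    """Build a `vllm serve` argument list mirroring the LLM(...) settings."""
    return [
        "vllm", "serve", model,
        "--tensor-parallel-size", str(tensor_parallel_size),
        "--max-model-len", str(max_model_len),
        "--gpu-memory-utilization", str(gpu_memory_utilization),
        "--trust-remote-code",
    ]


cmd = build_serve_command("GadflyII/MiniMax-M2.1-NVFP4")
print(shlex.join(cmd))
# To launch the server: subprocess.run(cmd, check=True)
```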

	## Performance

	Tested on 2x RTX PRO 6000 Blackwell (98GB each):

	| Prompt Tokens | Output Tokens | Throughput |
	|---------------|---------------|------------|
	| ~100 | 100 | ~73 tok/s |
	| ~1260 | 1000 | ~72 tok/s |
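
The card does not say how these numbers were measured. One simple way to get comparable figures is to time a single generation call and divide the number of generated tokens by the elapsed wall-clock time, as in the sketch below. `measure_throughput` and `fake_generate` are hypothetical; the stand-in exists only so the snippet runs without a GPU.

```python
import time
from typing import Callable


def measure_throughput(generate: Callable[[], int]) -> float:
    """Call generate() once and return generated tokens per second."""
    start = time.perf_counter()
    num_tokens = generate()
    elapsed = time.perf_counter() - start
    return num_tokens / elapsed


def fake_generate() -> int:
    """Stand-in for a real model call; pretends to emit 10 tokens."""
    time.sleep(0.05)
    return 10


tok_per_s = measure_throughput(fake_generate)
print(f"{tok_per_s:.1f} tok/s")

# With vLLM, the callable could instead be something like:
# lambda: len(llm.generate([prompt], sampling_params)[0].outputs[0].token_ids)
```

This measures end-to-end time for one request, so prompt processing is included; batched or concurrent requests would give different numbers.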

	## License

	Same as base model - see [MiniMaxAI/MiniMax-M2.1](https://huggingface.co/MiniMaxAI/MiniMax-M2.1) for details.

	## Acknowledgments

	- [MiniMax](https://www.minimax.io/) for the original MiniMax-M2.1 model
	- [vLLM](https://github.com/vllm-project/vllm) team for NVFP4 quantization support