---
license: mit
base_model: MiniMaxAI/MiniMax-M2.1
tags:
- minimax
- moe
- nvfp4
- quantized
- vllm
- blackwell
library_name: transformers
---

# MiniMax-M2.1-NVFP4

An NVFP4-quantized version of [MiniMaxAI/MiniMax-M2.1](https://huggingface.co/MiniMaxAI/MiniMax-M2.1) for efficient inference on NVIDIA Blackwell GPUs.

## Model Details

| Property | Value |
|----------|-------|
| Base Model | [MiniMaxAI/MiniMax-M2.1](https://huggingface.co/MiniMaxAI/MiniMax-M2.1) |
| Architecture | Mixture of Experts (MoE) |
| Total Parameters | 229B |
| Active Parameters | ~45B (8 of 256 experts) |
| Quantization | NVFP4 (e2m1 format) |
| Checkpoint Size | 131 GB |
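
The size figure follows from the parameter count as a back-of-envelope check (a sketch, assuming the 4-bit payload and per-group FP8 scales dominate; the exact total depends on which layers stay in higher precision):

```python
# Rough accounting for the ~131 GB checkpoint (illustrative, not exact).
total_params = 229e9
payload = total_params * 4 / 8    # 4-bit e2m1 weights -> ~114.5 GB
block_scales = total_params / 16  # 1-byte FP8 scale per 16-weight group -> ~14.3 GB
print(f"~{(payload + block_scales) / 1e9:.0f} GB")  # ~129 GB; embeddings, norms,
# routers, and global FP32 scales plausibly account for the remaining ~2 GB.
```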

## Quantization Details

- **Format**: NVFP4 with two-level scaling (block-wise FP8 + global FP32); see the sketch below
- **Scheme**: `compressed-tensors` with `nvfp4-pack-quantized` format
- **Target**: All linear layers in attention and MoE experts
- **Group Size**: 16
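
A minimal sketch of the two-level scheme on one 16-element group (illustrative only; the real `nvfp4-pack-quantized` layout packs two e2m1 values per byte and stores block scales in FP8):

```python
import numpy as np

# The values representable by FP4 e2m1: +/-{0, 0.5, 1, 1.5, 2, 3, 4, 6}.
E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
E2M1 = np.concatenate([-E2M1[:0:-1], E2M1])

def fake_quant_group(w, global_scale):
    # Level 1: per-group scale so the largest weight maps to e2m1's max (6.0).
    # A real kernel would round this scale to FP8 before use.
    block_scale = np.abs(w).max() / 6.0 / global_scale
    # Round each scaled weight to the nearest e2m1 grid point.
    q = E2M1[np.abs(w[:, None] / (block_scale * global_scale) - E2M1).argmin(axis=1)]
    # Level 2: dequantize back through both scales.
    return q * block_scale * global_scale

w = np.random.randn(16).astype(np.float32)
print(np.abs(w - fake_quant_group(w, global_scale=1.0)).max())  # group-wise error
```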

## Requirements

- NVIDIA Blackwell GPU (RTX 5090, RTX PRO 6000, etc.); a quick check is sketched below
- vLLM with flashinfer-cutlass NVFP4 support
- ~130 GB VRAM (TP=2 recommended for dual-GPU setups)
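
One way to confirm the hardware requirement is met (a sketch; Blackwell parts report CUDA compute capability 10.x for B100/B200 or 12.x for RTX 5090 / RTX PRO 6000):

```python
import torch

# Earlier architectures lack the NVFP4 tensor-core paths this checkpoint needs.
for i in range(torch.cuda.device_count()):
    major, minor = torch.cuda.get_device_capability(i)
    name = torch.cuda.get_device_name(i)
    print(f"GPU {i}: {name} (sm_{major}{minor})",
          "OK" if major >= 10 else "no NVFP4 support")
```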

## Usage with vLLM

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="GadflyII/MiniMax-M2.1-NVFP4",
    tensor_parallel_size=2,
    max_model_len=4096,
    gpu_memory_utilization=0.90,
    trust_remote_code=True,
)

sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=1024,
)

outputs = llm.generate(["Your prompt here"], sampling_params)
print(outputs[0].outputs[0].text)
```
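
The checkpoint can also be served over vLLM's OpenAI-compatible API. A sketch, assuming a server started with `vllm serve GadflyII/MiniMax-M2.1-NVFP4 --tensor-parallel-size 2` on the default port:

```python
from openai import OpenAI

# Point the standard OpenAI client at the local vLLM server (port 8000 by default).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="GadflyII/MiniMax-M2.1-NVFP4",
    messages=[{"role": "user", "content": "Your prompt here"}],
    temperature=0.7,
    max_tokens=1024,
)
print(response.choices[0].message.content)
```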

## Performance

Tested on 2x RTX PRO 6000 Blackwell (98 GB each):

| Prompt Tokens | Output Tokens | Throughput |
|---------------|---------------|------------|
| ~100 | 100 | ~73 tok/s |
| ~1260 | 1000 | ~72 tok/s |
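
A minimal way to reproduce this kind of measurement (a sketch, not the exact script behind the table; it reuses `llm` from the usage example above):

```python
import time
from vllm import SamplingParams

# Force a fixed-length decode so tokens/second is easy to compute.
params = SamplingParams(max_tokens=1000, ignore_eos=True)
start = time.perf_counter()
outputs = llm.generate(["Your prompt here"], params)
elapsed = time.perf_counter() - start
generated = len(outputs[0].outputs[0].token_ids)
print(f"{generated / elapsed:.1f} tok/s")
```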

## License

Same as the base model; see [MiniMaxAI/MiniMax-M2.1](https://huggingface.co/MiniMaxAI/MiniMax-M2.1) for details.

## Acknowledgments

- [MiniMax](https://www.minimax.io/) for the original MiniMax-M2.1 model
- The [vLLM](https://github.com/vllm-project/vllm) team for NVFP4 quantization support