Commit fad316a (verified) by GadflyII · 1 Parent(s): a21d56e

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +81 -0
README.md ADDED
@@ -0,0 +1,81 @@
+ ---
+ license: mit
+ base_model: MiniMaxAI/MiniMax-M2.1
+ tags:
+ - minimax
+ - moe
+ - nvfp4
+ - quantized
+ - vllm
+ - blackwell
+ library_name: transformers
+ ---
+
+ # MiniMax-M2.1-NVFP4
+
+ NVFP4 quantized version of [MiniMaxAI/MiniMax-M2.1](https://huggingface.co/MiniMaxAI/MiniMax-M2.1) for efficient inference on NVIDIA Blackwell GPUs.
+
+ ## Model Details
+
+ | Property | Value |
+ |----------|-------|
+ | Base Model | [MiniMaxAI/MiniMax-M2.1](https://huggingface.co/MiniMaxAI/MiniMax-M2.1) |
+ | Architecture | Mixture of Experts (MoE) |
+ | Total Parameters | 229B |
+ | Active Parameters | ~45B (8 of 256 experts) |
+ | Quantization | NVFP4 (e2m1 format) |
+ | Size | 131 GB |
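As a rough sanity check on the size figure in the table above, the sketch below estimates the storage needed for NVFP4 weights from the parameter count. It assumes, purely for illustration, that all 229B parameters are stored as 4-bit values with one 8-bit scale per 16-element block (the group size listed under Quantization Details). In practice some tensors typically stay in higher precision, so the real checkpoint size will differ somewhat. The function name `nvfp4_size_gb` is hypothetical.

```python
def nvfp4_size_gb(num_params: float, group_size: int = 16) -> float:
    """Estimate NVFP4 weight storage in decimal gigabytes (illustrative only)."""
    weight_bytes = num_params * 4 / 8        # 4 bits per weight
    scale_bytes = num_params / group_size    # one 1-byte scale per block (assumed)
    return (weight_bytes + scale_bytes) / 1e9


estimate = nvfp4_size_gb(229e9)
print(f"{estimate:.1f} GB")  # prints "128.8 GB"
```

Under these assumptions the estimate lands within a few gigabytes of the listed 131 GB; the gap is plausibly explained by unquantized tensors and metadata, though the README does not break this down.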
+
+ ## Quantization Details
+
+ - **Format**: NVFP4 with two-level scaling (block-wise FP8 + global FP32)
+ - **Scheme**: `compressed-tensors` with `nvfp4-pack-quantized` format
+ - **Target**: All linear layers in attention and MoE experts
+ - **Group Size**: 16
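To make the two-level scaling above more concrete, here is a minimal pure-Python sketch of NVFP4-style fake quantization (quantize, then dequantize) over blocks of 16 values. It is an illustration under stated assumptions, not the code used to produce this checkpoint: the global-scale convention (`tensor_amax / (6 * 448)`) is one common choice, block scales are kept as ordinary floats rather than rounded to FP8, and no 4-bit packing is done. All names are hypothetical.

```python
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # magnitudes representable in FP4 e2m1
E2M1_MAX = 6.0
E4M3_MAX = 448.0   # largest finite FP8 e4m3 value
GROUP_SIZE = 16


def round_to_e2m1(x: float) -> float:
    """Round x to the nearest e2m1 magnitude, keeping the sign."""
    magnitude = min(E2M1_VALUES, key=lambda v: abs(abs(x) - v))
    return -magnitude if x < 0 else magnitude


def quantize_dequantize(weights: list[float]) -> list[float]:
    """Fake-quantize a flat list of weights with two-level scaling."""
    assert len(weights) % GROUP_SIZE == 0, "length must be a multiple of the group size"
    tensor_amax = max(abs(w) for w in weights)
    if tensor_amax == 0.0:
        return [0.0] * len(weights)
    # Global (per-tensor) scale, chosen so every block scale fits the e4m3 range.
    global_scale = tensor_amax / (E2M1_MAX * E4M3_MAX)
    out = []
    for start in range(0, len(weights), GROUP_SIZE):
        block = weights[start:start + GROUP_SIZE]
        block_amax = max(abs(w) for w in block)
        # Per-block scale; a real kernel would store this in FP8 (assumption: kept as float here).
        block_scale = block_amax / E2M1_MAX / global_scale
        scale = block_scale * global_scale
        for w in block:
            q = round_to_e2m1(w / scale) if scale > 0 else 0.0
            out.append(q * scale)
    return out


weights = [0.1 * i for i in range(-8, 8)]   # one block of 16 values
restored = quantize_dequantize(weights)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(f"max abs error: {max_error:.4f}")
```

The largest value in each block maps exactly onto the top of the e2m1 range, so the per-element error is bounded by the block scale times half the widest gap between representable values.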
+
+ ## Requirements
+
+ - NVIDIA Blackwell GPU (RTX 5090, RTX PRO 6000, etc.)
+ - vLLM with flashinfer-cutlass NVFP4 support
+ - ~130 GB of VRAM in total (tensor parallelism with TP=2 recommended for dual-GPU setups)
+
+ ## Usage with vLLM
+
+ ```python
+ from vllm import LLM, SamplingParams
+
+ llm = LLM(
+     model="GadflyII/MiniMax-M2.1-NVFP4",
+     tensor_parallel_size=2,  # shard the weights across two GPUs
+     max_model_len=4096,  # maximum context length (prompt + output tokens)
+     gpu_memory_utilization=0.90,  # fraction of each GPU's memory vLLM may use
+     trust_remote_code=True,  # allow custom model code from the repository
+ )
+
+ sampling_params = SamplingParams(
+     temperature=0.7,
+     top_p=0.9,
+     max_tokens=1024,  # upper bound on generated tokens per prompt
+ )
+
+ outputs = llm.generate(["Your prompt here"], sampling_params)
+ print(outputs[0].outputs[0].text)
+ ```
+
+ ## Performance
+
+ Tested on 2x RTX PRO 6000 Blackwell (98 GB each):
+
+ | Prompt Tokens | Output Tokens | Throughput |
+ |---------------|---------------|------------|
+ | ~100 | 100 | ~73 tok/s |
+ | ~1260 | 1000 | ~72 tok/s |
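The README does not say how the throughput figures above were measured. One plausible approach, sketched here as an assumption, is to divide the number of generated tokens by the wall-clock time of a single generation call. The helper `measure_throughput` is hypothetical and takes any callable that returns one list of token IDs per prompt.

```python
import time
from typing import Callable, Sequence


def measure_throughput(generate: Callable[[], Sequence[Sequence[int]]]) -> float:
    """Return generated tokens per second for one call to `generate`."""
    start = time.perf_counter()
    token_id_lists = generate()          # one sequence of token IDs per prompt
    elapsed = time.perf_counter() - start
    total_tokens = sum(len(ids) for ids in token_id_lists)
    return total_tokens / elapsed


def fake_generate() -> list[list[int]]:
    """Stand-in for a real model call: 'generates' 10 tokens in about 50 ms."""
    time.sleep(0.05)
    return [[0] * 10]


print(f"{measure_throughput(fake_generate):.0f} tok/s")
```

With vLLM, the callable could wrap `llm.generate(...)` and collect `output.outputs[0].token_ids` for each result. Timing measured this way includes prompt processing, so it will read slightly lower than pure decode speed.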
+
+ ## License
+
+ Same as the base model; see [MiniMaxAI/MiniMax-M2.1](https://huggingface.co/MiniMaxAI/MiniMax-M2.1) for details.
+
+ ## Acknowledgments
+
+ - [MiniMax](https://www.minimax.io/) for the original MiniMax-M2.1 model
+ - [vLLM](https://github.com/vllm-project/vllm) team for NVFP4 quantization support