ineso22 committed on
Commit
f25dabb
·
verified ·
1 Parent(s): f2d745e

Upload docs/vllm_deploy_guide.hf_temp_rename.md with huggingface_hub

docs/vllm_deploy_guide.hf_temp_rename.md ADDED
@@ -0,0 +1,115 @@
+ # MiniMax M2.1 Model vLLM Deployment Guide
+
+ [English Version](./vllm_deploy_guide.md) | [Chinese Version](./vllm_deploy_guide_cn.md)
+
+ We recommend using [vLLM](https://docs.vllm.ai/en/stable/) to deploy the [MiniMax-M2.1](https://huggingface.co/MiniMaxAI/MiniMax-M2.1) model. vLLM is a high-performance inference engine that offers high serving throughput, efficient memory management, and strong batch request processing. Before deploying, check hardware compatibility in vLLM's official documentation.
+
+ ## Applicable Models
+
+ This document applies to the following models. You only need to change the model name during deployment.
+
+ - [MiniMaxAI/MiniMax-M2.1](https://huggingface.co/MiniMaxAI/MiniMax-M2.1)
+
+ The deployment process is illustrated below using MiniMax-M2.1 as an example.
+
+ ## System Requirements
+
+ - OS: Linux
+
+ - Python: 3.9 - 3.12
+
+ - GPU:
+
+   - Compute capability 7.0 or higher
+
+   - Memory requirements: 220 GB for weights, 240 GB per 1M context tokens
+
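As a quick pre-flight check against the requirements listed above, the version limits can be sketched as two small shell helpers. The function names are illustrative and not part of vLLM; they only compare version strings and do not probe the machine themselves.

```shell
# Return 0 if a Python version string ("3.10" or "3.10.4") falls inside
# the 3.9 - 3.12 range listed above, 1 otherwise.
# Hypothetical helper name, not a vLLM command.
python_version_supported() {
    major=${1%%.*}      # text before the first dot
    minor=${1#*.}       # text after the first dot
    minor=${minor%%.*}  # drop a patch component if present
    [ "$major" -eq 3 ] && [ "$minor" -ge 9 ] && [ "$minor" -le 12 ]
}

# Return 0 if a GPU compute capability string ("8.0") is at least 7.0.
# Hypothetical helper name, not a vLLM command.
compute_capability_supported() {
    major=${1%%.*}
    [ "$major" -ge 7 ]
}
```

On a target host, the first helper could be fed the output of `python3 -c 'import sys; print("%d.%d" % sys.version_info[:2])'`. For the second, recent NVIDIA drivers are assumed to report the value through `nvidia-smi --query-gpu=compute_cap --format=csv,noheader`; older drivers may not support that query field.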
+ The following configurations are recommended; adjust them to your actual use case:
+
+ - **96 GB x4** GPUs: supports a total KV Cache capacity of 400K tokens.
+
+ - **144 GB x8** GPUs: supports a total KV Cache capacity of up to 3M tokens.
+
+ > **Note**: The values above are the aggregate KV Cache capacity of the hardware. The maximum context length of an individual sequence remains **196K** tokens.
+
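The two memory figures above (220 GB for weights, 240 GB per 1M context tokens) give a rough ceiling on KV Cache size for a given cluster. The sketch below is a back-of-the-envelope calculation under the assumption that all remaining GPU memory goes to the KV Cache; the function name is illustrative. Real deployments also need memory for activations, CUDA graphs, and runtime overhead, which is presumably why the recommended capacities above are lower than this bound.

```shell
# Rough upper bound on KV Cache tokens.
# Arguments: memory per GPU in GB, number of GPUs.
# Assumes 220 GB for weights and 240 GB per 1M tokens, as stated above,
# and ignores every other consumer of GPU memory.
kv_cache_token_upper_bound() {
    total_gb=$(( $1 * $2 ))
    free_gb=$(( total_gb - 220 ))
    if [ "$free_gb" -le 0 ]; then
        echo 0
        return
    fi
    echo $(( free_gb * 1000000 / 240 ))
}

kv_cache_token_upper_bound 96 4    # prints 683333
kv_cache_token_upper_bound 144 8   # prints 3883333
```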
+ ## Deployment with Python
+
+ Use a virtual environment (such as **venv**, **conda**, or **uv**) to avoid dependency conflicts.
+
+ The commands below create a fresh environment with **uv** and install vLLM:
+
+ ```bash
+ uv venv
+ source .venv/bin/activate
+ uv pip install -U vllm --extra-index-url https://wheels.vllm.ai/nightly
+ ```
+
+ Run the following command to start the vLLM server. vLLM will automatically download and cache the MiniMax-M2.1 model from Hugging Face.
+
+ 4-GPU deployment command:
+
+ ```bash
+ SAFETENSORS_FAST_GPU=1 vllm serve \
+     MiniMaxAI/MiniMax-M2.1 --trust-remote-code \
+     --tensor-parallel-size 4 \
+     --enable-auto-tool-choice --tool-call-parser minimax_m2 \
+     --reasoning-parser minimax_m2_append_think
+ ```
+
+ 8-GPU deployment command:
+
+ ```bash
+ SAFETENSORS_FAST_GPU=1 vllm serve \
+     MiniMaxAI/MiniMax-M2.1 --trust-remote-code \
+     --enable-expert-parallel --tensor-parallel-size 8 \
+     --enable-auto-tool-choice --tool-call-parser minimax_m2 \
+     --reasoning-parser minimax_m2_append_think
+ ```
+
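Downloading and loading the weights can take a while, so it may help to wait until the server responds before sending requests. The sketch below polls the server in a loop. It assumes the default `http://localhost:8000` address used elsewhere in this guide and the `/health` endpoint of vLLM's OpenAI-compatible server; the function name and the default retry values are illustrative.

```shell
# Poll the server until it answers or the attempt budget runs out.
# Arguments: base URL, max attempts, seconds between attempts.
# Hypothetical helper; the /health path is an assumption about the server.
wait_for_vllm() {
    url=${1:-http://localhost:8000}
    attempts=${2:-60}
    delay=${3:-10}
    i=0
    while [ "$i" -lt "$attempts" ]; do
        # -s: silent, -f: treat HTTP errors as failures
        if curl -sf "$url/health" > /dev/null; then
            echo "server is ready"
            return 0
        fi
        i=$(( i + 1 ))
        sleep "$delay"
    done
    echo "server did not become ready" >&2
    return 1
}
```

Calling `wait_for_vllm` with no arguments checks every 10 seconds for up to 10 minutes; a larger budget may be needed on a first run that has to download the model.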
+ ## Testing Deployment
+
+ After startup, you can test the vLLM OpenAI-compatible API with the following command:
+
+ ```bash
+ curl http://localhost:8000/v1/chat/completions \
+     -H "Content-Type: application/json" \
+     -d '{
+         "model": "MiniMaxAI/MiniMax-M2.1",
+         "messages": [
+             {"role": "system", "content": [{"type": "text", "text": "You are a helpful assistant."}]},
+             {"role": "user", "content": [{"type": "text", "text": "Who won the world series in 2020?"}]}
+         ]
+     }'
+ ```
+
+ ## Common Issues
+
+ ### MiniMax-M2 model is not currently supported
+
+ This error means the installed vLLM version is too old to recognize the model. Upgrade to the latest version.
+
+ ### torch.AcceleratorError: CUDA error: an illegal memory access was encountered
+ Add `--compilation-config "{\"cudagraph_mode\": \"PIECEWISE\"}"` to the startup parameters to resolve this issue. For example:
+
+ ```bash
+ SAFETENSORS_FAST_GPU=1 vllm serve \
+     MiniMaxAI/MiniMax-M2.1 --trust-remote-code \
+     --enable-expert-parallel --tensor-parallel-size 8 \
+     --enable-auto-tool-choice --tool-call-parser minimax_m2 \
+     --reasoning-parser minimax_m2_append_think \
+     --compilation-config "{\"cudagraph_mode\": \"PIECEWISE\"}"
+ ```
+
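The backslash-escaped JSON in the `--compilation-config` value above is easy to mistype. In a POSIX shell, a single-quoted string should produce the same argument without any escaping. The short sketch below only compares the two spellings; it does not start vLLM, and the variable names are illustrative.

```shell
# Both spellings yield the same string once the shell has processed the
# quotes; the single-quoted form avoids the backslashes.
escaped="{\"cudagraph_mode\": \"PIECEWISE\"}"
single='{"cudagraph_mode": "PIECEWISE"}'

if [ "$escaped" = "$single" ]; then
    echo "identical: $single"
fi
```

If the comparison holds in your shell, `--compilation-config '{"cudagraph_mode": "PIECEWISE"}'` can be used in place of the escaped form.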
+ ### Output is garbled
+
+ If the output is corrupted when serving these models with vLLM, upgrade to a nightly version built after commit [cf3eacfe58fa9e745c2854782ada884a9f992cf7](https://github.com/vllm-project/vllm/commit/cf3eacfe58fa9e745c2854782ada884a9f992cf7).
+
+ ## Getting Support
+
+ If you encounter any issues while deploying the MiniMax model:
+
+ - Contact our technical support team through official channels, such as email at [model@minimax.io](mailto:model@minimax.io)
+
+ - Submit an issue on our [GitHub](https://github.com/MiniMax-AI) repository
+
+ We continuously optimize the deployment experience for our models. Feedback is welcome!