# Inference and Deployment

Below are the inference engines supported by Swift along with their corresponding capabilities. The three acceleration engines speed up Swift's inference, deployment, and evaluation modules:

| Inference Acceleration Engine | OpenAI API | Multimodal | Quantized Model | Multiple LoRAs | QLoRA | Batch Inference | Parallel Techniques |
| --- | --- | --- | --- | --- | --- | --- | --- |
| pytorch | [✅](https://github.com/modelscope/ms-swift/blob/main/examples/deploy/client/llm/chat/openai_client.py) | [✅](https://github.com/modelscope/ms-swift/blob/main/examples/app/mllm.sh) | ✅ | [✅](https://github.com/modelscope/ms-swift/blob/main/examples/infer/demo_lora.py) | ✅ | [✅](https://github.com/modelscope/ms-swift/blob/main/examples/infer/pt/batch_ddp.sh) | DDP/device_map |
| [vllm](https://github.com/vllm-project/vllm) | ✅ | [✅](https://github.com/modelscope/ms-swift/blob/main/examples/infer/vllm/mllm_tp.sh) | ✅ | [✅](https://github.com/modelscope/ms-swift/blob/main/examples/deploy/lora/server.sh) | ❌ | ✅ | TP/PP/DP |
| [lmdeploy](https://github.com/InternLM/lmdeploy) | ✅ | [✅](https://github.com/modelscope/ms-swift/blob/main/examples/infer/lmdeploy/mllm_tp.sh) | ✅ | ❌ | ❌ | ✅ | TP/DP |

## Inference

ms-swift uses a layered design philosophy, allowing users to perform inference through the command-line interface, the web UI, or directly from Python.

For inference with a model fine-tuned using LoRA, refer to the [Pre-training and Fine-tuning documentation](./Pre-training-and-Fine-tuning.md#inference-fine-tuned-model).

### Using CLI

**Full Parameter Model:**

```shell
CUDA_VISIBLE_DEVICES=0 swift infer \
    --model Qwen/Qwen2.5-7B-Instruct \
    --stream true \
    --infer_backend pt \
    --max_new_tokens 2048
```

**LoRA Model:**

```shell
CUDA_VISIBLE_DEVICES=0 swift infer \
    --model Qwen/Qwen2.5-7B-Instruct \
    --adapters swift/test_lora \
    --stream true \
    --infer_backend pt \
    --temperature 0 \
    --max_new_tokens 2048
```

**Command-Line Inference Instructions**

The above commands launch an interactive command-line inference session. After running the script, simply enter your query in the terminal. You can also enter the following special commands:

- `multi-line`: Switch to multi-line mode, which allows line breaks in the input; end the input with `#`.
- `single-line`: Switch to single-line mode, where a line break marks the end of the input.
- `reset-system`: Reset the system prompt and clear the history.
- `clear`: Clear the history.
- `quit` or `exit`: Exit the conversation.

**Multimodal Model**

```shell
CUDA_VISIBLE_DEVICES=0 \
MAX_PIXELS=1003520 \
VIDEO_MAX_PIXELS=50176 \
FPS_MAX_FRAMES=12 \
swift infer \
    --model Qwen/Qwen2.5-VL-3B-Instruct \
    --stream true \
    --infer_backend pt \
    --max_new_tokens 2048
```

To run inference with a multimodal model, add tags such as `<image>`, `<video>`, or `<audio>` in your query (these mark where the media representations are placed in `inputs_embeds`). For example, enter `<image><image>What is the difference between these two images?` or `<video>Describe this video.`, then follow the prompts to supply the corresponding images/videos/audios.

Here is an example of inference:

```
<<< <image><image>What is the difference between these two images?
Input an image path or URL <<< http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/cat.png
Input an image path or URL <<< http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/animal.png
The first image depicts a cute, cartoon-style kitten with large, expressive eyes and a fluffy white and gray coat. The background is simple, featuring a gradient of colors that highlight the kitten's face.

The second image shows a group of four cartoon-style sheep standing on a grassy field with mountains in the background. The sheep have fluffy white wool, black legs, and black faces with white markings around their eyes and noses. The background includes green hills and a blue sky with clouds, giving it a pastoral and serene atmosphere.
--------------------------------------------------
<<< clear
<<< <video>Describe this video.
Input a video path or URL <<< https://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/baby.mp4
A baby wearing glasses is sitting on a bed and reading a book. The baby is holding the book with both hands and is looking down at it. The baby is wearing a light blue shirt and pink pants. The baby is sitting on a white pillow. The baby is looking at the book with interest. The baby is not moving much, just turning the pages of the book.
```

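Each tag corresponds to one media input that the CLI will prompt for. As an illustrative sketch (not part of Swift's API), you can count the tags in a query to see how many paths/URLs you will be asked to provide:

```python
import re

def count_media_tags(query: str) -> dict:
    """Count <image>/<video>/<audio> placeholder tags in a query string."""
    return {tag: len(re.findall(f'<{tag}>', query))
            for tag in ('image', 'video', 'audio')}

print(count_media_tags('<image><image>What is the difference between these two images?'))
# {'image': 2, 'video': 0, 'audio': 0}
```
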
**Dataset Inference:**

```shell
CUDA_VISIBLE_DEVICES=0 swift infer \
    --model Qwen/Qwen2.5-7B-Instruct \
    --stream true \
    --infer_backend pt \
    --val_dataset AI-ModelScope/alpaca-gpt4-data-zh \
    --max_new_tokens 2048
```

The examples above demonstrate streaming inference for both full-parameter and LoRA models. SWIFT offers more inference techniques:

- Interface Inference: change `swift infer` to `swift app`.
- Batch Inference: with `infer_backend=pt`, you can specify `--max_batch_size` for batch inference of large language models and multimodal models; refer to [here](https://github.com/modelscope/ms-swift/blob/main/examples/infer/pt/batch_ddp.sh). Note that `--stream true` cannot be set when performing batch inference.
- DDP/device_map Inference: `infer_backend=pt` supports parallel inference via DDP/device_map; refer to [here](https://github.com/modelscope/ms-swift/blob/main/examples/infer/pt/mllm_device_map.sh).
- Inference Acceleration: Swift supports vllm/lmdeploy acceleration across the inference, deployment, and evaluation modules; simply add `--infer_backend vllm/lmdeploy`. Refer to [here](https://github.com/modelscope/ms-swift/blob/main/examples/infer/vllm/ddp.sh).
- Multimodal Models: we provide shell scripts for multi-GPU inference of multimodal models using [pt](https://github.com/modelscope/ms-swift/blob/main/examples/infer/pt/mllm_device_map.sh), [vllm](https://github.com/modelscope/ms-swift/blob/main/examples/infer/vllm/mllm_tp.sh), and [lmdeploy](https://github.com/modelscope/ms-swift/blob/main/examples/infer/lmdeploy/mllm_tp.sh).
- Quantized Models: you can directly select models quantized with GPTQ, AWQ, or BNB, for example: `--model Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4`.
- More Model Types: we also provide inference scripts for [bert](https://github.com/modelscope/ms-swift/blob/main/examples/infer/pt/bert.sh), [reward_model](https://github.com/modelscope/ms-swift/blob/main/examples/infer/pt/reward_model.sh), and [prm](https://github.com/modelscope/ms-swift/blob/main/examples/infer/pt/prm.sh).

**Tips:**

- SWIFT saves inference results; you can specify the save path with `--result_path`.
- To output log probabilities, specify `--logprobs true` during inference and SWIFT will save them. Note that results are not saved when `--stream true` is set.
- `infer_backend=pt` supports inference for all models supported by SWIFT, while `infer_backend=vllm/lmdeploy` supports only a subset; please refer to the documentation for [vllm](https://docs.vllm.ai/en/latest/models/supported_models.html) and [lmdeploy](https://lmdeploy.readthedocs.io/en/latest/supported_models/supported_models.html).
- If you encounter OOM with `--infer_backend vllm`, lower `--max_model_len` or `--max_num_seqs`, choose an appropriate `--gpu_memory_utilization`, or set `--enforce_eager true`. Alternatively, use tensor parallelism via `--tensor_parallel_size`.
- When inferring multimodal models with `--infer_backend vllm` and passing multiple images per prompt, set `--limit_mm_per_prompt`, for example: `--limit_mm_per_prompt '{"image": 10, "video": 5}'`.
- If you encounter OOM while inferring qwen2-vl/qwen2.5-vl, set `MAX_PIXELS`, `VIDEO_MAX_PIXELS`, and `FPS_MAX_FRAMES`; for more information, refer to [here](https://github.com/modelscope/ms-swift/blob/main/examples/app/mllm.sh).
- SWIFT's built-in dialogue templates are aligned with the templates run via transformers; refer to [here](https://github.com/modelscope/ms-swift/blob/main/tests/test_align/test_template/test_vision.py) for the tests. If you find any misalignment, please feel free to submit an issue or PR.

### Using Web-UI

If you want to perform inference through a graphical interface, refer to the [Web-UI documentation](../GetStarted/Web-UI.md).

### Using Python

**Text Model:**

```python
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

from swift.llm import PtEngine, RequestConfig, InferRequest

model = 'Qwen/Qwen2.5-0.5B-Instruct'

# Load the inference engine
engine = PtEngine(model, max_batch_size=2)
request_config = RequestConfig(max_tokens=512, temperature=0)

# Use 2 infer_requests to demonstrate batch inference
infer_requests = [
    InferRequest(messages=[{'role': 'user', 'content': 'Who are you?'}]),
    InferRequest(messages=[{'role': 'user', 'content': 'Where is the capital of Zhejiang?'},
                           {'role': 'assistant', 'content': 'The capital of Zhejiang Province, China, is Hangzhou.'},
                           {'role': 'user', 'content': 'What are some fun places here?'}]),
]
resp_list = engine.infer(infer_requests, request_config)
query0 = infer_requests[0].messages[0]['content']
print(f'query0: {query0}')
print(f'response0: {resp_list[0].choices[0].message.content}')
print(f'response1: {resp_list[1].choices[0].message.content}')
```

**Multimodal Model:**

```python
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'
os.environ['MAX_PIXELS'] = '1003520'
os.environ['VIDEO_MAX_PIXELS'] = '50176'
os.environ['FPS_MAX_FRAMES'] = '12'

from swift.llm import PtEngine, RequestConfig, InferRequest

model = 'Qwen/Qwen2.5-VL-3B-Instruct'

# Load the inference engine
engine = PtEngine(model, max_batch_size=2)
request_config = RequestConfig(max_tokens=512, temperature=0)

# Use 3 infer_requests to demonstrate batch inference
infer_requests = [
    InferRequest(messages=[{'role': 'user', 'content': 'Who are you?'}]),
    InferRequest(messages=[{'role': 'user', 'content': '<image><image> What is the difference between these two images?'}],
                 images=['http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/cat.png',
                         'http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/animal.png']),
    InferRequest(messages=[{'role': 'user', 'content': '<video> Describe the video'}],
                 videos=['https://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/baby.mp4']),
]
resp_list = engine.infer(infer_requests, request_config)
query0 = infer_requests[0].messages[0]['content']
print(f'query0: {query0}')
print(f'response0: {resp_list[0].choices[0].message.content}')
print(f'response1: {resp_list[1].choices[0].message.content}')
print(f'response2: {resp_list[2].choices[0].message.content}')
```

We also provide more demos for Python-based inference:

- Streaming inference and acceleration with `VllmEngine` and `LmdeployEngine`: refer to [here](https://github.com/modelscope/ms-swift/blob/main/examples/infer/demo.py).
- Multimodal Inference: in addition to the input format shown above, Swift is also compatible with OpenAI's multimodal input format; refer to [here](https://github.com/modelscope/ms-swift/blob/main/examples/infer/demo_mllm.py).
- Grounding Tasks: for performing grounding tasks with multimodal models, refer to [here](https://github.com/modelscope/ms-swift/blob/main/examples/infer/demo_grounding.py).
- Multiple LoRA Inference: refer to [here](https://github.com/modelscope/ms-swift/blob/main/examples/infer/demo_lora.py).
- Agent Inference: refer to [here](https://github.com/modelscope/ms-swift/blob/main/examples/infer/demo_agent.py).
- Asynchronous Interface: for Python-based inference using `engine.infer_async`, refer to [here](https://github.com/modelscope/ms-swift/blob/main/examples/infer/demo.py).

## Deployment

For deploying a model fine-tuned with LoRA, refer to the [Pre-training and Fine-tuning documentation](./Pre-training-and-Fine-tuning.md#deployment-fine-tuned-model).

This section primarily focuses on the deployment and invocation of multimodal models. For text-only large models, we first provide a simple deployment and invocation example:

**Server Deployment:**

```shell
CUDA_VISIBLE_DEVICES=0 swift deploy \
    --model Qwen/Qwen2.5-7B-Instruct \
    --infer_backend vllm \
    --max_new_tokens 2048 \
    --served_model_name Qwen2.5-7B-Instruct
```

**Client Invocation Test:**

```shell
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Qwen2.5-7B-Instruct",
        "messages": [{"role": "user", "content": "What should I do if I can’t sleep at night?"}],
        "max_tokens": 256,
        "temperature": 0
    }'
```

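The same request body can also be built programmatically. Here is a minimal sketch that constructs the identical JSON payload; actually sending it (e.g. with `urllib.request`, shown commented out) assumes the deploy server above is running on `localhost:8000`:

```python
import json

# Build the same chat-completions request body as the curl example above.
payload = {
    'model': 'Qwen2.5-7B-Instruct',
    'messages': [{'role': 'user', 'content': "What should I do if I can't sleep at night?"}],
    'max_tokens': 256,
    'temperature': 0,
}
body = json.dumps(payload)
print(body)

# To send it (assumes the server above is running):
# import urllib.request
# req = urllib.request.Request('http://localhost:8000/v1/chat/completions',
#                              data=body.encode(), headers={'Content-Type': 'application/json'})
# print(urllib.request.urlopen(req).read().decode())
```
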
### Server Side

```shell
# Test environment: pip install transformers==4.49.* vllm==0.7.3
CUDA_VISIBLE_DEVICES=0 \
MAX_PIXELS=1003520 \
VIDEO_MAX_PIXELS=50176 \
FPS_MAX_FRAMES=12 \
swift deploy \
    --model Qwen/Qwen2.5-VL-3B-Instruct \
    --infer_backend vllm \
    --gpu_memory_utilization 0.9 \
    --max_model_len 8192 \
    --max_new_tokens 2048 \
    --limit_mm_per_prompt '{"image": 5, "video": 2}' \
    --served_model_name Qwen2.5-VL-3B-Instruct
```

### Client Side

We introduce three client-side invocation methods: curl, the OpenAI library, and the Swift client.

**Method 1: curl**

```shell
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Qwen2.5-VL-3B-Instruct",
        "messages": [{"role": "user", "content": [
            {"type": "image", "image": "http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/cat.png"},
            {"type": "image", "image": "http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/animal.png"},
            {"type": "text", "text": "What is the difference between these two images?"}
        ]}],
        "max_tokens": 256,
        "temperature": 0
    }'
```

**Method 2: OpenAI Library**

```python
from openai import OpenAI

client = OpenAI(
    api_key='EMPTY',
    base_url='http://127.0.0.1:8000/v1',
)
model = client.models.list().data[0].id
print(f'model: {model}')

messages = [{'role': 'user', 'content': [
    {'type': 'video', 'video': 'https://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/baby.mp4'},
    {'type': 'text', 'text': 'describe the video'}
]}]

resp = client.chat.completions.create(model=model, messages=messages, max_tokens=512, temperature=0)
query = messages[0]['content']
response = resp.choices[0].message.content
print(f'query: {query}')
print(f'response: {response}')

# Using base64
import base64
import requests

resp = requests.get('https://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/baby.mp4')
base64_encoded = base64.b64encode(resp.content).decode('utf-8')
messages = [{'role': 'user', 'content': [
    {'type': 'video', 'video': f'data:video/mp4;base64,{base64_encoded}'},
    {'type': 'text', 'text': 'describe the video'}
]}]

# Streaming inference
gen = client.chat.completions.create(model=model, messages=messages, stream=True, temperature=0)
print(f'query: {query}\nresponse: ', end='')
for chunk in gen:
    if chunk is None:
        continue
    print(chunk.choices[0].delta.content, end='', flush=True)
print()
```

**Method 3: Swift Client**

```python
from swift.llm import InferRequest, InferClient, RequestConfig
from swift.plugin import InferStats

engine = InferClient(host='127.0.0.1', port=8000)
print(f'models: {engine.models}')
metric = InferStats()
request_config = RequestConfig(max_tokens=512, temperature=0)

# Use 3 infer_requests to demonstrate batch inference
# Local paths, base64, and URLs are supported
infer_requests = [
    InferRequest(messages=[{'role': 'user', 'content': 'Who are you?'}]),
    InferRequest(messages=[{'role': 'user', 'content': '<image><image> What is the difference between these two images?'}],
                 images=['http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/cat.png',
                         'http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/animal.png']),
    InferRequest(messages=[{'role': 'user', 'content': '<video> Describe the video'}],
                 videos=['https://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/baby.mp4']),
]

resp_list = engine.infer(infer_requests, request_config, metrics=[metric])
print(f'response0: {resp_list[0].choices[0].message.content}')
print(f'response1: {resp_list[1].choices[0].message.content}')
print(f'response2: {resp_list[2].choices[0].message.content}')
print(metric.compute())
metric.reset()

# Using base64
import base64
import requests

resp = requests.get('https://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/baby.mp4')
base64_encoded = base64.b64encode(resp.content).decode('utf-8')
messages = [{'role': 'user', 'content': [
    {'type': 'video', 'video': f'data:video/mp4;base64,{base64_encoded}'},
    {'type': 'text', 'text': 'describe the video'}
]}]
infer_request = InferRequest(messages=messages)
request_config = RequestConfig(max_tokens=512, temperature=0, stream=True)
gen_list = engine.infer([infer_request], request_config, metrics=[metric])
print('response0: ', end='')
for chunk in gen_list[0]:
    if chunk is None:
        continue
    print(chunk.choices[0].delta.content, end='', flush=True)
print()
print(metric.compute())
```

We also provide more deployment demos:

- Multiple LoRA deployment and invocation: refer to [this link](https://github.com/modelscope/ms-swift/tree/main/examples/deploy/lora).
- Deployment and invocation of base models: refer to [this link](https://github.com/modelscope/ms-swift/tree/main/examples/deploy/client/llm/base).
- More model types: we provide deployment scripts for [bert](https://github.com/modelscope/ms-swift/tree/main/examples/deploy/bert) and [reward_model](https://github.com/modelscope/ms-swift/tree/main/examples/deploy/reward_model).