# Inference and Deployment

The table below lists the inference engines supported by Swift and their capabilities. All three acceleration engines can be used across Swift's inference, deployment, and evaluation modules:

| Inference Acceleration Engine                    | OpenAI API                                                   | Multimodal                                                   | Quantized Model | Multiple LoRAs                                               | QLoRA | Batch Inference                                              | Parallel Techniques |
| ------------------------------------------------ | ------------------------------------------------------------ | ------------------------------------------------------------ | --------------- | ------------------------------------------------------------ | ----- | ------------------------------------------------------------ | ------------------- |
| pytorch                                          | [βœ…](https://github.com/modelscope/ms-swift/blob/main/examples/deploy/client/llm/chat/openai_client.py) | [βœ…](https://github.com/modelscope/ms-swift/blob/main/examples/app/mllm.sh) | βœ…               | [βœ…](https://github.com/modelscope/ms-swift/blob/main/examples/infer/demo_lora.py) | βœ…     | [βœ…](https://github.com/modelscope/ms-swift/blob/main/examples/infer/pt/batch_ddp.sh) | DDP/device_map      |
| [vllm](https://github.com/vllm-project/vllm)     | βœ…                                                            | [βœ…](https://github.com/modelscope/ms-swift/blob/main/examples/infer/vllm/mllm_tp.sh) | βœ…               | [βœ…](https://github.com/modelscope/ms-swift/blob/main/examples/deploy/lora/server.sh) | ❌     | βœ…                                                            | TP/PP/DP               |
| [lmdeploy](https://github.com/InternLM/lmdeploy) | βœ…                                                            | [βœ…](https://github.com/modelscope/ms-swift/blob/main/examples/infer/lmdeploy/mllm_tp.sh) | βœ…               | ❌                                                            | ❌     | βœ…                                                            | TP/DP                  |

## Inference

ms-swift uses a layered design philosophy, allowing users to perform inference through the command-line interface, web UI, or directly using Python.

For inference with a model fine-tuned using LoRA, refer to the [Pre-training and Fine-tuning documentation](./Pre-training-and-Fine-tuning.md#inference-fine-tuned-model).

### Using CLI

**Full Parameter Model:**

```shell
CUDA_VISIBLE_DEVICES=0 swift infer \
    --model Qwen/Qwen2.5-7B-Instruct \
    --stream true \
    --infer_backend pt \
    --max_new_tokens 2048
```

**LoRA Model:**

```shell
CUDA_VISIBLE_DEVICES=0 swift infer \
    --model Qwen/Qwen2.5-7B-Instruct \
    --adapters swift/test_lora \
    --stream true \
    --infer_backend pt \
    --temperature 0 \
    --max_new_tokens 2048
```

**Command-Line Inference Instructions**

The above commands start interactive command-line inference. After running the script, simply enter your query in the terminal. You can also enter the following special commands:

- `multi-line`: Switch to multi-line mode, allowing line breaks in the input, ending with `#`.
- `single-line`: Switch to single-line mode, with line breaks indicating the end of input.
- `reset-system`: Reset the system prompt and clear the history.
- `clear`: Clear the history.
- `quit` or `exit`: Exit the conversation.

**Multimodal Model**

```shell
CUDA_VISIBLE_DEVICES=0 \
MAX_PIXELS=1003520 \
VIDEO_MAX_PIXELS=50176 \
FPS_MAX_FRAMES=12 \
swift infer \
    --model Qwen/Qwen2.5-VL-3B-Instruct \
    --stream true \
    --infer_backend pt \
    --max_new_tokens 2048
```

To perform inference with a multimodal model, add tags such as `<image>`, `<video>`, or `<audio>` to your query (each tag marks where that medium's embeddings are placed in `inputs_embeds`). For example, enter `<image><image>What is the difference between these two images?` or `<video>Describe this video.`, then follow the prompts to provide the corresponding images/videos/audio.


Here is an example of inference:
```
<<< <image><image>What is the difference between these two images?
Input an image path or URL <<< http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/cat.png
Input an image path or URL <<< http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/animal.png
The first image depicts a cute, cartoon-style kitten with large, expressive eyes and a fluffy white and gray coat. The background is simple, featuring a gradient of colors that highlight the kitten's face.

The second image shows a group of four cartoon-style sheep standing on a grassy field with mountains in the background. The sheep have fluffy white wool, black legs, and black faces with white markings around their eyes and noses. The background includes green hills and a blue sky with clouds, giving it a pastoral and serene atmosphere.
--------------------------------------------------
<<< clear
<<< <video>Describe this video.
Input a video path or URL <<< https://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/baby.mp4
A baby wearing glasses is sitting on a bed and reading a book. The baby is holding the book with both hands and is looking down at it. The baby is wearing a light blue shirt and pink pants. The baby is sitting on a white pillow. The baby is looking at the book with interest. The baby is not moving much, just turning the pages of the book.
```

**Dataset Inference:**

```shell
CUDA_VISIBLE_DEVICES=0 swift infer \
    --model Qwen/Qwen2.5-7B-Instruct \
    --stream true \
    --infer_backend pt \
    --val_dataset AI-ModelScope/alpaca-gpt4-data-zh \
    --max_new_tokens 2048
```

The examples above show streaming inference for both full-parameter and LoRA models. SWIFT also supports the following inference techniques:

- Interface Inference: Replace `swift infer` with `swift app` to use a graphical interface.
- Batch Inference: For large language models and multimodal models, use `infer_backend=pt` and specify `--max_batch_size` to enable batch inference. For details, refer to [here](https://github.com/modelscope/ms-swift/blob/main/examples/infer/pt/batch_ddp.sh). Note that `--stream true` cannot be set when performing batch inference.
- DDP/device_map Inference: `infer_backend=pt` supports parallel inference using DDP/device_map technology. For further details, refer to [here](https://github.com/modelscope/ms-swift/blob/main/examples/infer/pt/mllm_device_map.sh).
- Inference Acceleration: Swift supports using vllm/lmdeploy for inference acceleration across the inference, deployment, and evaluation modules by simply adding `--infer_backend vllm/lmdeploy`. You can refer to [here](https://github.com/modelscope/ms-swift/blob/main/examples/infer/vllm/ddp.sh).
- Multimodal Models: We provide shell scripts for multi-GPU inference for multimodal models using [pt](https://github.com/modelscope/ms-swift/blob/main/examples/infer/pt/mllm_device_map.sh), [vllm](https://github.com/modelscope/ms-swift/blob/main/examples/infer/vllm/mllm_tp.sh), and [lmdeploy](https://github.com/modelscope/ms-swift/blob/main/examples/infer/lmdeploy/mllm_tp.sh).
- Quantized Models: You can directly select models that are quantized with GPTQ, AWQ, or BNB, for example: `--model Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4`.
- More Model Types: We also provide inference scripts for [bert](https://github.com/modelscope/ms-swift/blob/main/examples/infer/pt/bert.sh), [reward_model](https://github.com/modelscope/ms-swift/blob/main/examples/infer/pt/reward_model.sh), and [prm](https://github.com/modelscope/ms-swift/blob/main/examples/infer/pt/prm.sh).

**Tips:**

- SWIFT saves inference results, and you can specify the save path using `--result_path`.
- To output log probabilities, specify `--logprobs true` during inference and SWIFT will save them. Note that results are not saved when `--stream true` is set.
- Using `infer_backend=pt` supports inference for all models supported by SWIFT, while `infer_backend=vllm/lmdeploy` supports only a subset of models. Please refer to the documentation for [vllm](https://docs.vllm.ai/en/latest/models/supported_models.html) and [lmdeploy](https://lmdeploy.readthedocs.io/en/latest/supported_models/supported_models.html).
- If you encounter OOM when using `--infer_backend vllm`, you can lower `--max_model_len`, `--max_num_seqs`, choose an appropriate `--gpu_memory_utilization`, or set `--enforce_eager true`. Alternatively, you can address this by using tensor parallelism with `--tensor_parallel_size`.
- When inferring multimodal models with `--infer_backend vllm`, if you need to pass multiple images per prompt, set `--limit_mm_per_prompt`, for example: `--limit_mm_per_prompt '{"image": 10, "video": 5}'`.
- If you encounter OOM issues while inferring qwen2-vl/qwen2.5-vl, you can address this by setting `MAX_PIXELS`, `VIDEO_MAX_PIXELS`, and `FPS_MAX_FRAMES`. For more information, refer to [here](https://github.com/modelscope/ms-swift/blob/main/examples/app/mllm.sh).
- SWIFT's built-in chat templates are aligned with the chat templates used by transformers; see [here](https://github.com/modelscope/ms-swift/blob/main/tests/test_align/test_template/test_vision.py) for the alignment tests. If you find any misalignment, feel free to submit an issue or PR.
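The value passed to `--limit_mm_per_prompt` is parsed as JSON, so it is worth validating the string before putting it on the command line. A minimal sketch using only the Python standard library (the per-modality caps shown are illustrative):

```python
import json
import shlex

# Per-modality caps to pass to --limit_mm_per_prompt (illustrative values).
limits = {"image": 10, "video": 5}
arg_value = json.dumps(limits)

# Confirm the string round-trips as JSON before using it in a command.
assert json.loads(arg_value) == limits

# Assemble the CLI fragment; shlex.quote keeps braces and spaces intact in a POSIX shell.
cli_fragment = f"--limit_mm_per_prompt {shlex.quote(arg_value)}"
print(cli_fragment)  # --limit_mm_per_prompt '{"image": 10, "video": 5}'
```

`shlex.quote` adds the single quotes that protect the JSON from the shell, matching the form used in the tip above.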


### Using Web-UI

If you want to perform inference through a graphical interface, you can refer to the [Web-UI documentation](../GetStarted/Web-UI.md).

### Using Python

**Text Model:**

```python
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

from swift.llm import PtEngine, RequestConfig, InferRequest
model = 'Qwen/Qwen2.5-0.5B-Instruct'

# Load the inference engine
engine = PtEngine(model, max_batch_size=2)
request_config = RequestConfig(max_tokens=512, temperature=0)

# Using 2 infer_requests to demonstrate batch inference
infer_requests = [
    InferRequest(messages=[{'role': 'user', 'content': 'Who are you?'}]),
    InferRequest(messages=[{'role': 'user', 'content': 'Where is the capital of Zhejiang?'},
                           {'role': 'assistant', 'content': 'The capital of Zhejiang Province, China, is Hangzhou.'},
                           {'role': 'user', 'content': 'What are some fun places here?'}]),
]
resp_list = engine.infer(infer_requests, request_config)
query0 = infer_requests[0].messages[0]['content']
print(f'query0: {query0}')
print(f'response0: {resp_list[0].choices[0].message.content}')
print(f'response1: {resp_list[1].choices[0].message.content}')
```

**Multimodal Model:**

```python
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'
os.environ['MAX_PIXELS'] = '1003520'
os.environ['VIDEO_MAX_PIXELS'] = '50176'
os.environ['FPS_MAX_FRAMES'] = '12'

from swift.llm import PtEngine, RequestConfig, InferRequest
model = 'Qwen/Qwen2.5-VL-3B-Instruct'

# Load the inference engine
engine = PtEngine(model, max_batch_size=2)
request_config = RequestConfig(max_tokens=512, temperature=0)

# Using 3 infer_requests to demonstrate batch inference
infer_requests = [
    InferRequest(messages=[{'role': 'user', 'content': 'Who are you?'}]),
    InferRequest(messages=[{'role': 'user', 'content': '<image><image> What is the difference between these two images?'}],
                 images=['http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/cat.png',
                         'http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/animal.png']),
    InferRequest(messages=[{'role': 'user', 'content': '<video> Describe the video'}],
                 videos=['https://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/baby.mp4']),
]
resp_list = engine.infer(infer_requests, request_config)
query0 = infer_requests[0].messages[0]['content']
print(f'query0: {query0}')
print(f'response0: {resp_list[0].choices[0].message.content}')
print(f'response1: {resp_list[1].choices[0].message.content}')
print(f'response2: {resp_list[2].choices[0].message.content}')
```

We also provide more demos for Python-based inference:

- Streaming inference and acceleration with `VllmEngine` and `LmdeployEngine`: refer to [here](https://github.com/modelscope/ms-swift/blob/main/examples/infer/demo.py).
- Multimodal Inference: In addition to the aforementioned multimodal input formats, Swift is compatible with OpenAI's multimodal input format; refer to [here](https://github.com/modelscope/ms-swift/blob/main/examples/infer/demo_mllm.py).
- Grounding Tasks: For performing grounding tasks with multimodal models, you can refer to [here](https://github.com/modelscope/ms-swift/blob/main/examples/infer/demo_grounding.py).
- Multiple LoRA Inference: Refer to [here](https://github.com/modelscope/ms-swift/blob/main/examples/infer/demo_lora.py).
- Agent Inference: Refer to [here](https://github.com/modelscope/ms-swift/blob/main/examples/infer/demo_agent.py).
- Asynchronous Interface: For Python-based inference using `engine.infer_async`, refer to [here](https://github.com/modelscope/ms-swift/blob/main/examples/infer/demo.py).


## Deployment

If you want to see the deployment of a model fine-tuned with LoRA, you can refer to the [Pre-training and Fine-tuning documentation](./Pre-training-and-Fine-tuning.md#deployment-fine-tuned-model).

This section focuses mainly on deploying and invoking multimodal models. For text-based large language models, we first provide a simple deployment and invocation example:

**Server Deployment:**

```shell
CUDA_VISIBLE_DEVICES=0 swift deploy \
    --model Qwen/Qwen2.5-7B-Instruct \
    --infer_backend vllm \
    --max_new_tokens 2048 \
    --served_model_name Qwen2.5-7B-Instruct
```

**Client Invocation Test:**

```shell
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen2.5-7B-Instruct",
"messages": [{"role": "user", "content": "What should I do if I can’t sleep at night?"}],
"max_tokens": 256,
"temperature": 0
}'
```


### Server Side

```shell
# test environment: pip install transformers==4.49.* vllm==0.7.3
CUDA_VISIBLE_DEVICES=0 \
MAX_PIXELS=1003520 \
VIDEO_MAX_PIXELS=50176 \
FPS_MAX_FRAMES=12 \
swift deploy \
    --model Qwen/Qwen2.5-VL-3B-Instruct \
    --infer_backend vllm \
    --gpu_memory_utilization 0.9 \
    --max_model_len 8192 \
    --max_new_tokens 2048 \
    --limit_mm_per_prompt '{"image": 5, "video": 2}' \
    --served_model_name Qwen2.5-VL-3B-Instruct
```

### Client Side

There are three ways to invoke the service from the client side: curl, the OpenAI library, and the Swift client.

**Method 1: curl**

```shell
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen2.5-VL-3B-Instruct",
"messages": [{"role": "user", "content": [
    {"type": "image", "image": "http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/cat.png"},
    {"type": "image", "image": "http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/animal.png"},
    {"type": "text", "text": "What is the difference between these two images?"}
]}],
"max_tokens": 256,
"temperature": 0
}'
```

**Method 2: OpenAI Library**

```python
from openai import OpenAI

client = OpenAI(
    api_key='EMPTY',
    base_url='http://127.0.0.1:8000/v1',
)
model = client.models.list().data[0].id
print(f'model: {model}')

messages = [{'role': 'user', 'content': [
    {'type': 'video', 'video': 'https://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/baby.mp4'},
    {'type': 'text', 'text': 'describe the video'}
]}]

resp = client.chat.completions.create(model=model, messages=messages, max_tokens=512, temperature=0)
query = messages[0]['content']
response = resp.choices[0].message.content
print(f'query: {query}')
print(f'response: {response}')

# Using base64
import base64
import requests
resp = requests.get('https://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/baby.mp4')
base64_encoded = base64.b64encode(resp.content).decode('utf-8')
messages = [{'role': 'user', 'content': [
    {'type': 'video', 'video': f'data:video/mp4;base64,{base64_encoded}'},
    {'type': 'text', 'text': 'describe the video'}
]}]

gen = client.chat.completions.create(model=model, messages=messages, stream=True, temperature=0)
print(f'query: {query}\nresponse: ', end='')
for chunk in gen:
    if chunk is None:
        continue
    print(chunk.choices[0].delta.content, end='', flush=True)
print()
```
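The base64 pattern above can be factored into a small helper that turns raw bytes into a data URL; a minimal sketch (`to_data_url` is a hypothetical helper, not part of the OpenAI library):

```python
import base64

def to_data_url(data: bytes, mime: str = 'video/mp4') -> str:
    """Encode raw bytes as a data URL, usable wherever a media URL is accepted."""
    encoded = base64.b64encode(data).decode('utf-8')  # hypothetical helper
    return f'data:{mime};base64,{encoded}'

# Example with in-memory bytes; a real call would read a local file or
# download the media first, as in the snippet above.
url = to_data_url(b'\x00\x01\x02', mime='video/mp4')
print(url)  # data:video/mp4;base64,AAEC
```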

**Method 3: Swift Client**

```python
from swift.llm import InferRequest, InferClient, RequestConfig
from swift.plugin import InferStats

engine = InferClient(host='127.0.0.1', port=8000)
print(f'models: {engine.models}')
metric = InferStats()
request_config = RequestConfig(max_tokens=512, temperature=0)

# Using 3 infer_requests to demonstrate batch inference
# Supports local paths, base64, and URLs
infer_requests = [
    InferRequest(messages=[{'role': 'user', 'content': 'Who are you?'}]),
    InferRequest(messages=[{'role': 'user', 'content': '<image><image> What is the difference between these two images?'}],
                 images=['http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/cat.png',
                         'http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/animal.png']),
    InferRequest(messages=[{'role': 'user', 'content': '<video> Describe the video'}],
                 videos=['https://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/baby.mp4']),
]

resp_list = engine.infer(infer_requests, request_config, metrics=[metric])
print(f'response0: {resp_list[0].choices[0].message.content}')
print(f'response1: {resp_list[1].choices[0].message.content}')
print(f'response2: {resp_list[2].choices[0].message.content}')
print(metric.compute())
metric.reset()

# Using base64
import base64
import requests
resp = requests.get('https://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/baby.mp4')
base64_encoded = base64.b64encode(resp.content).decode('utf-8')
messages = [{'role': 'user', 'content': [
    {'type': 'video', 'video': f'data:video/mp4;base64,{base64_encoded}'},
    {'type': 'text', 'text': 'describe the video'}
]}]
infer_request = InferRequest(messages=messages)
request_config = RequestConfig(max_tokens=512, temperature=0, stream=True)
gen_list = engine.infer([infer_request], request_config, metrics=[metric])
print(f'response0: ', end='')
for chunk in gen_list[0]:
    if chunk is None:
        continue
    print(chunk.choices[0].delta.content, end='', flush=True)
print()
print(metric.compute())
```

We also provide more deployment demos:

- Multiple LoRA deployment and invocation: Refer to [this link](https://github.com/modelscope/ms-swift/tree/main/examples/deploy/lora).
- Deployment and invocation of the Base model: Refer to [this link](https://github.com/modelscope/ms-swift/tree/main/examples/deploy/client/llm/base).
- More model types: We provide deployment scripts for [bert](https://github.com/modelscope/ms-swift/tree/main/examples/deploy/bert) and [reward_model](https://github.com/modelscope/ms-swift/tree/main/examples/deploy/reward_model).