File size: 19,098 Bytes
f677bb1 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 |
# Inference and Deployment
Below are the inference engines supported by Swift along with their corresponding capabilities. The three inference acceleration engines provide inference acceleration for Swift's inference, deployment, and evaluation modules:
| Inference Acceleration Engine | OpenAI API | Multimodal | Quantized Model | Multiple LoRAs | QLoRA | Batch Inference | Parallel Techniques |
| ------------------------------------------------ | ------------------------------------------------------------ | ------------------------------------------------------------ | --------------- | ------------------------------------------------------------ | ----- | ------------------------------------------------------------ | ------------------- |
| pytorch | [β
](https://github.com/modelscope/ms-swift/blob/main/examples/deploy/client/llm/chat/openai_client.py) | [β
](https://github.com/modelscope/ms-swift/blob/main/examples/app/mllm.sh) | β
| [β
](https://github.com/modelscope/ms-swift/blob/main/examples/infer/demo_lora.py) | β
| [β
](https://github.com/modelscope/ms-swift/blob/main/examples/infer/pt/batch_ddp.sh) | DDP/device_map |
| [vllm](https://github.com/vllm-project/vllm) | β
| [β
](https://github.com/modelscope/ms-swift/blob/main/examples/infer/vllm/mllm_tp.sh) | β
| [β
](https://github.com/modelscope/ms-swift/blob/main/examples/deploy/lora/server.sh) | β | β
| TP/PP/DP |
| [lmdeploy](https://github.com/InternLM/lmdeploy) | β
| [β
](https://github.com/modelscope/ms-swift/blob/main/examples/infer/lmdeploy/mllm_tp.sh) | β
| β | β | β
| TP/DP |
## Inference
ms-swift uses a layered design philosophy, allowing users to perform inference through the command-line interface, web UI, or directly using Python.
To view the inference of a model fine-tuned with LoRA, please refer to the [Pre-training and Fine-tuning documentation](./Pre-training-and-Fine-tuning.md#inference-fine-tuned-model).
### Using CLI
**Full Parameter Model:**
```shell
CUDA_VISIBLE_DEVICES=0 swift infer \
--model Qwen/Qwen2.5-7B-Instruct \
--stream true \
--infer_backend pt \
--max_new_tokens 2048
```
**LoRA Model:**
```shell
CUDA_VISIBLE_DEVICES=0 swift infer \
--model Qwen/Qwen2.5-7B-Instruct \
--adapters swift/test_lora \
--stream true \
--infer_backend pt \
--temperature 0 \
--max_new_tokens 2048
```
**Command-Line Inference Instructions**
The above commands are for interactive command-line interface inference. After running the script, you can simply enter your query in the terminal. You can also input the following special commands:
- `multi-line`: Switch to multi-line mode, allowing line breaks in the input, ending with `#`.
- `single-line`: Switch to single-line mode, with line breaks indicating the end of input.
- `reset-system`: Reset the system and clear history.
- `clear`: Clear the history.
- `quit` or `exit`: Exit the conversation.
**Multimodal Model**
```shell
CUDA_VISIBLE_DEVICES=0 \
MAX_PIXELS=1003520 \
VIDEO_MAX_PIXELS=50176 \
FPS_MAX_FRAMES=12 \
swift infer \
--model Qwen/Qwen2.5-VL-3B-Instruct \
--stream true \
--infer_backend pt \
--max_new_tokens 2048
```
To perform inference with a multimodal model, you can add tags like `<image>`, `<video>`, or `<audio>` in your query (representing the location of image representations in `inputs_embeds`). For example, you can input `<image><image>What is the difference between these two images?` or `<video>Describe this video.` Then, follow the prompts to input the corresponding image/video/audio.
Here is an example of inference:
```
<<< <image><image>What is the difference between these two images?
Input an image path or URL <<< http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/cat.png
Input an image path or URL <<< http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/animal.png
The first image depicts a cute, cartoon-style kitten with large, expressive eyes and a fluffy white and gray coat. The background is simple, featuring a gradient of colors that highlight the kitten's face.
The second image shows a group of four cartoon-style sheep standing on a grassy field with mountains in the background. The sheep have fluffy white wool, black legs, and black faces with white markings around their eyes and noses. The background includes green hills and a blue sky with clouds, giving it a pastoral and serene atmosphere.
--------------------------------------------------
<<< clear
<<< <video>Describe this video.
Input a video path or URL <<< https://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/baby.mp4
A baby wearing glasses is sitting on a bed and reading a book. The baby is holding the book with both hands and is looking down at it. The baby is wearing a light blue shirt and pink pants. The baby is sitting on a white pillow. The baby is looking at the book with interest. The baby is not moving much, just turning the pages of the book.
```
**Dataset Inference:**
```
CUDA_VISIBLE_DEVICES=0 swift infer \
--model Qwen/Qwen2.5-7B-Instruct \
--stream true \
--infer_backend pt \
--val_dataset AI-ModelScope/alpaca-gpt4-data-zh \
--max_new_tokens 2048
```
The above example provides streaming inference for both full parameters and LoRA, and below are more inference techniques available in SWIFT:
- Interface Inference: You can change `swift infer` to `swift app`.
- Batch Inference: For large models and multimodal models, you can specify `--max_batch_size` for batch inference by using `infer_backend=pt`. For specific details, refer to [here](https://github.com/modelscope/ms-swift/blob/main/examples/infer/pt/batch_ddp.sh). Note that you cannot set `--stream true` when performing batch inference.
- DDP/device_map Inference: `infer_backend=pt` supports parallel inference using DDP/device_map technology. For further details, refer to [here](https://github.com/modelscope/ms-swift/blob/main/examples/infer/pt/mllm_device_map.sh).
- Inference Acceleration: Swift supports using vllm/lmdeploy for inference acceleration across the inference, deployment, and evaluation modules by simply adding `--infer_backend vllm/lmdeploy`. You can refer to [here](https://github.com/modelscope/ms-swift/blob/main/examples/infer/vllm/ddp.sh).
- Multimodal Models: We provide shell scripts for multi-GPU inference for multimodal models using [pt](https://github.com/modelscope/ms-swift/blob/main/examples/infer/pt/mllm_device_map.sh), [vllm](https://github.com/modelscope/ms-swift/blob/main/examples/infer/vllm/mllm_tp.sh), and [lmdeploy](https://github.com/modelscope/ms-swift/blob/main/examples/infer/lmdeploy/mllm_tp.sh).
- Quantized Models: You can directly select models that are quantized with GPTQ, AWQ, or BNB, for example: `--model Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4`.
- More Model Types: We also provide inference scripts for [bert](https://github.com/modelscope/ms-swift/blob/main/examples/infer/pt/bert.sh), [reward_model](https://github.com/modelscope/ms-swift/blob/main/examples/infer/pt/reward_model.sh), and [prm](https://github.com/modelscope/ms-swift/blob/main/examples/infer/pt/prm.sh).
**Tips:**
- SWIFT saves inference results, and you can specify the save path using `--result_path`.
- To output log probabilities, simply specify `--logprobs true` during inference. SWIFT will save these results. Note that setting `--stream true` will prevent storage of results.
- Using `infer_backend=pt` supports inference for all models supported by SWIFT, while `infer_backend=vllm/lmdeploy` supports only a subset of models. Please refer to the documentation for [vllm](https://docs.vllm.ai/en/latest/models/supported_models.html) and [lmdeploy](https://lmdeploy.readthedocs.io/en/latest/supported_models/supported_models.html).
- If you encounter OOM when using `--infer_backend vllm`, you can lower `--max_model_len`, `--max_num_seqs`, choose an appropriate `--gpu_memory_utilization`, or set `--enforce_eager true`. Alternatively, you can address this by using tensor parallelism with `--tensor_parallel_size`.
- When inferring multimodal models using `--infer_backend vllm`, you need to input multiple images. You can set `--limit_mm_per_prompt` to resolve this, for example: `--limit_mm_per_prompt '{"image": 10, "video": 5}'`.
- If you encounter OOM issues while inferring qwen2-vl/qwen2.5-vl, you can address this by setting `MAX_PIXELS`, `VIDEO_MAX_PIXELS`, and `FPS_MAX_FRAMES`. For more information, refer to [here](https://github.com/modelscope/ms-swift/blob/main/examples/app/mllm.sh).
- SWIFT's built-in dialogue templates align with dialogue templates run using transformers. You can refer to [here](https://github.com/modelscope/ms-swift/blob/main/tests/test_align/test_template/test_vision.py) for testing. If there are any misalignments, please feel free to submit an issue or PR for correction.
### Using Web-UI
If you want to perform inference through a graphical interface, you can refer to the [Web-UI documentation](../GetStarted/Web-UI.md).
### Using Python
**Text Model:**
```python
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'
from swift.llm import PtEngine, RequestConfig, InferRequest
model = 'Qwen/Qwen2.5-0.5B-Instruct'
# Load the inference engine
engine = PtEngine(model, max_batch_size=2)
request_config = RequestConfig(max_tokens=512, temperature=0)
# Using 2 infer_requests to demonstrate batch inference
infer_requests = [
InferRequest(messages=[{'role': 'user', 'content': 'Who are you?'}]),
InferRequest(messages=[{'role': 'user', 'content': 'Where is the capital of Zhejiang?'},
{'role': 'assistant', 'content': 'The capital of Zhejiang Province, China, is Hangzhou.'},
{'role': 'user', 'content': 'What are some fun places here?'}]),
]
resp_list = engine.infer(infer_requests, request_config)
query0 = infer_requests[0].messages[0]['content']
print(f'response0: {resp_list[0].choices[0].message.content}')
print(f'response1: {resp_list[1].choices[0].message.content}')
```
**Multimodal Model:**
```python
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'
os.environ['MAX_PIXELS'] = '1003520'
os.environ['VIDEO_MAX_PIXELS'] = '50176'
os.environ['FPS_MAX_FRAMES'] = '12'
from swift.llm import PtEngine, RequestConfig, InferRequest
model = 'Qwen/Qwen2.5-VL-3B-Instruct'
# Load the inference engine
engine = PtEngine(model, max_batch_size=2)
request_config = RequestConfig(max_tokens=512, temperature=0)
# Using 3 infer_requests to demonstrate batch inference
infer_requests = [
InferRequest(messages=[{'role': 'user', 'content': 'Who are you?'}]),
InferRequest(messages=[{'role': 'user', 'content': '<image><image> What is the difference between these two images?'}],
images=['http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/cat.png',
'http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/animal.png']),
InferRequest(messages=[{'role': 'user', 'content': '<video> Describe the video'}],
videos=['https://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/baby.mp4']),
]
resp_list = engine.infer(infer_requests, request_config)
query0 = infer_requests[0].messages[0]['content']
print(f'response0: {resp_list[0].choices[0].message.content}')
print(f'response1: {resp_list[1].choices[0].message.content}')
print(f'response2: {resp_list[2].choices[0].message.content}')
```
We also provide more demos for Python-based inference:
- For streaming inference using `VllmEngine` and `LmdeployEngine` for inference acceleration, you can refer to [here](https://github.com/modelscope/ms-swift/blob/main/examples/infer/demo.py).
- Multimodal Inference: In addition to the aforementioned multimodal input formats, Swift is compatible with OpenAI's multimodal input format; refer to [here](https://github.com/modelscope/ms-swift/blob/main/examples/infer/demo_mllm.py).
- Grounding Tasks: For performing grounding tasks with multimodal models, you can refer to [here](https://github.com/modelscope/ms-swift/blob/main/examples/infer/demo_grounding.py).
- Multiple LoRA Inference: Refer to [here](https://github.com/modelscope/ms-swift/blob/main/examples/infer/demo_lora.py).
- Agent Inference: Refer to [here](https://github.com/modelscope/ms-swift/blob/main/examples/infer/demo_agent.py).
- Asynchronous Interface: For Python-based inference using `engine.infer_async`, refer to [here](https://github.com/modelscope/ms-swift/blob/main/examples/infer/demo.py).
## Deployment
If you want to see the deployment of a model fine-tuned with LoRA, you can refer to the [Pre-training and Fine-tuning documentation](./Pre-training-and-Fine-tuning.md#deployment-fine-tuned-model).
This section primarily focuses on the deployment and invocation of multimodal models. For text-based large models, we provide a simple deployment and invocation example:
**Server Deployment:**
```shell
CUDA_VISIBLE_DEVICES=0 swift deploy \
--model Qwen/Qwen2.5-7B-Instruct \
--infer_backend vllm \
--max_new_tokens 2048 \
--served_model_name Qwen2.5-7B-Instruct
```
**Client Invocation Test:**
```shell
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen2.5-7B-Instruct",
"messages": [{"role": "user", "content": "What should I do if I canβt sleep at night?"}],
"max_tokens": 256,
"temperature": 0
}'
```
### Server Side
```shell
# test environment: pip install transformers==4.49.* vllm==0.7.3
CUDA_VISIBLE_DEVICES=0 \
MAX_PIXELS=1003520 \
VIDEO_MAX_PIXELS=50176 \
FPS_MAX_FRAMES=12 \
swift deploy \
--model Qwen/Qwen2.5-VL-3B-Instruct \
--infer_backend vllm \
--gpu_memory_utilization 0.9 \
--max_model_len 8192 \
--max_new_tokens 2048 \
--limit_mm_per_prompt '{"image": 5, "video": 2}' \
--served_model_name Qwen2.5-VL-3B-Instruct
```
### Client Side
We introduce three methods for invoking the client: using curl, the OpenAI library, and the Swift client.
**Method 1: curl**
```shell
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen2.5-VL-3B-Instruct",
"messages": [{"role": "user", "content": [
{"type": "image", "image": "http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/cat.png"},
{"type": "image", "image": "http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/animal.png"},
{"type": "text", "text": "What is the difference between these two images?"}
]}],
"max_tokens": 256,
"temperature": 0
}'
```
**Method 2: OpenAI Library**
```python
from openai import OpenAI
client = OpenAI(
api_key='EMPTY',
base_url=f'http://127.0.0.1:8000/v1',
)
model = client.models.list().data[0].id
print(f'model: {model}')
messages = [{'role': 'user', 'content': [
{'type': 'video', 'video': 'https://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/baby.mp4'},
{'type': 'text', 'text': 'describe the video'}
]}]
resp = client.chat.completions.create(model=model, messages=messages, max_tokens=512, temperature=0)
query = messages[0]['content']
response = resp.choices[0].message.content
print(f'query: {query}')
print(f'response: {response}')
# Using base64
import base64
import requests
resp = requests.get('https://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/baby.mp4')
base64_encoded = base64.b64encode(resp.content).decode('utf-8')
messages = [{'role': 'user', 'content': [
{'type': 'video', 'video': f'data:video/mp4;base64,{base64_encoded}'},
{'type': 'text', 'text': 'describe the video'}
]}]
gen = client.chat.completions.create(model=model, messages=messages, stream=True, temperature=0)
print(f'query: {query}\nresponse: ', end='')
for chunk in gen:
if chunk is None:
continue
print(chunk.choices[0].delta.content, end='', flush=True)
print()
```
**Method 3: Swift Client**
```python
from swift.llm import InferRequest, InferClient, RequestConfig
from swift.plugin import InferStats
engine = InferClient(host='127.0.0.1', port=8000)
print(f'models: {engine.models}')
metric = InferStats()
request_config = RequestConfig(max_tokens=512, temperature=0)
# Using 3 infer_requests to demonstrate batch inference
# Supports local paths, base64, and URLs
infer_requests = [
InferRequest(messages=[{'role': 'user', 'content': 'Who are you?'}]),
InferRequest(messages=[{'role': 'user', 'content': '<image><image> What is the difference between these two images?'}],
images=['http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/cat.png',
'http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/animal.png']),
InferRequest(messages=[{'role': 'user', 'content': '<video> Describe the video'}],
videos=['https://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/baby.mp4']),
]
resp_list = engine.infer(infer_requests, request_config, metrics=[metric])
print(f'response0: {resp_list[0].choices[0].message.content}')
print(f'response1: {resp_list[1].choices[0].message.content}')
print(f'response2: {resp_list[2].choices[0].message.content}')
print(metric.compute())
metric.reset()
# Using base64
import base64
import requests
resp = requests.get('https://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/baby.mp4')
base64_encoded = base64.b64encode(resp.content).decode('utf-8')
messages = [{'role': 'user', 'content': [
{'type': 'video', 'video': f'data:video/mp4;base64,{base64_encoded}'},
{'type': 'text', 'text': 'describe the video'}
]}]
infer_request = InferRequest(messages=messages)
request_config = RequestConfig(max_tokens=512, temperature=0, stream=True)
gen_list = engine.infer([infer_request], request_config, metrics=[metric])
print(f'response0: ', end='')
for chunk in gen_list[0]:
if chunk is None:
continue
print(chunk.choices[0].delta.content, end='', flush=True)
print()
print(metric.compute())
```
We also provide more deployment demos:
- Multiple LoRA deployment and invocation: Refer to [this link](https://github.com/modelscope/ms-swift/tree/main/examples/deploy/lora).
- Deployment and invocation of the Base model: Refer to [this link](https://github.com/modelscope/ms-swift/tree/main/examples/deploy/client/llm/base).
- More model types: We provide deployment scripts for [bert](https://github.com/modelscope/ms-swift/tree/main/examples/deploy/bert) and [reward_model](https://github.com/modelscope/ms-swift/tree/main/examples/deploy/reward_model).
|