Update README.md
README.md
CHANGED
|
@@ -40,14 +40,14 @@ https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct
|
|
| 40 |
- [M.2 Accelerator card](https://axcl-docs.readthedocs.io/zh-cn/latest/doc_guide_hardware.html)
|
| 41 |
|
| 42 |
**Image Process**
|
| 43 |
-
|Chips| input size | image num | image encoder | ttft(
|
| 44 |
|--|--|--|--|--|--|--|--|
|
| 45 |
-
|AX650| 448*448 | 1 | 780 ms |
|
| 46 |
|
| 47 |
**Video Process**
|
| 48 |
|Chips| input size | image num | image encoder |ttft(512 tokens) | w8a16 | DDR | Flash |
|
| 49 |
|--|--|--|--|--|--|--|--|
|
| 50 |
-
|AX650| 308*308 | 8 | 1400 ms |
|
| 51 |
|
| 52 |
The DDR column is the amount of CMM memory the model consumes. Ensure that the CMM memory allocated on the development board is greater than this value.
|
| 53 |
|
|
@@ -141,65 +141,90 @@ python3 qwen2_tokenizer_images.py --port 12345
|
|
| 141 |

|
| 142 |
|
| 143 |
```
|
| 144 |
-
root@ax650:/
|
| 145 |
-
[I][ Init][
|
| 146 |
-
|
| 147 |
-
|
|
|
|
| 148 |
[I][ Init][ 26]: LLaMaEmbedSelector use mmap
|
| 149 |
-
|
| 150 |
-
[I][ Init][
|
| 151 |
-
[
|
| 152 |
-
[I][ Init][
|
| 153 |
-
|
| 154 |
-
[I][ Init][
|
| 155 |
Type "q" to exit, Ctrl+c to stop current running
|
| 156 |
-
|
| 157 |
-
prompt >> who are you?
|
| 158 |
-
image >>
|
| 159 |
-
[I][ Run][ 638]: ttft: 2854.47 ms
|
| 160 |
-
I am a large language model created by Alibaba Cloud. I am called Qwen.
|
| 161 |
-
|
| 162 |
-
[N][ Run][ 779]: hit eos,avg 6.05 token/s
|
| 163 |
-
|
| 164 |
-
prompt >> Describe the image
|
| 165 |
image >> image/ssd_car.jpg
|
| 166 |
-
[I][
|
| 167 |
-
[I][
|
| 168 |
-
|
| 169 |
-
|
| 170 |
-
|
| 171 |
-
|
| 172 |
-
[
|
| 173 |
```
|
| 174 |
|
| 175 |
#### Video understand demo
|
| 176 |
|
| 177 |
Please pre-process the frames of the video file into 308x308 images.
|
| 178 |
|
| 179 |
-
##### start tokenizer server for image understand demo
|
| 180 |
-
|
| 181 |
-
```
|
| 182 |
-
python qwen2_tokenizer_video_308.py --port 12345
|
| 183 |
-
```
|
| 184 |
-
|
| 185 |
##### run video understand demo
|
| 186 |
|
| 187 |
```
|
| 188 |
-
root@ax650:/
|
| 189 |
-
[I][ Init][
|
| 190 |
-
|
| 191 |
-
|
|
|
|
| 192 |
[I][ Init][ 26]: LLaMaEmbedSelector use mmap
|
| 193 |
-
|
| 194 |
-
[I][ Init][
|
| 195 |
-
[
|
| 196 |
-
[I][ Init][
|
| 197 |
-
|
| 198 |
-
[I][ Init][
|
| 199 |
Type "q" to exit, Ctrl+c to stop current running
|
| 200 |
-
|
| 201 |
-
|
| 202 |
-
image >> video
|
| 203 |
video/frame_0000.jpg
|
| 204 |
video/frame_0008.jpg
|
| 205 |
video/frame_0016.jpg
|
|
@@ -208,9 +233,29 @@ video/frame_0032.jpg
|
|
| 208 |
video/frame_0040.jpg
|
| 209 |
video/frame_0048.jpg
|
| 210 |
video/frame_0056.jpg
|
| 211 |
-
[I][
|
| 212 |
-
[I][
|
| 213 |
-
|
| 214 |
```
|
| 215 |
|
| 216 |
#### Inference with M.2 Accelerator card
|
|
@@ -269,7 +314,10 @@ image >> image/ssd_car.jpg
|
|
| 269 |
[I][ Run][ 659]: input_num_token:128
|
| 270 |
[I][ Run][ 659]: input_num_token:24
|
| 271 |
[I][ Run][ 796]: ttft: 2067.18 ms
|
| 272 |
-
|
| 273 |
|
| 274 |
[N][ Run][ 949]: hit eos,avg 4.12 token/s
|
| 275 |
```
|
|
@@ -328,7 +376,9 @@ video/frame_0056.jpg
|
|
| 328 |
[I][ Run][ 659]: input_num_token:128
|
| 329 |
[I][ Run][ 659]: input_num_token:125
|
| 330 |
[I][ Run][ 796]: ttft: 3049.59 ms
|
| 331 |
-
|
| 332 |
|
| 333 |
[N][ Run][ 949]: hit eos,avg 4.15 token/s
|
| 334 |
```
|
|
|
|
| 40 |
- [M.2 Accelerator card](https://axcl-docs.readthedocs.io/zh-cn/latest/doc_guide_hardware.html)
|
| 41 |
|
| 42 |
**Image Process**
|
| 43 |
+
|Chips| input size | image num | image encoder | ttft(384 tokens) | w8a16 | DDR | Flash |
|
| 44 |
|--|--|--|--|--|--|--|--|
|
| 45 |
+
|AX650| 448*448 | 1 | 780 ms | 1651 ms | 5.9 tokens/sec| 4.3 GiB | 4.6 GiB |
|
| 46 |
|
| 47 |
**Video Process**
|
| 48 |
|Chips| input size | image num | image encoder |ttft(512 tokens) | w8a16 | DDR | Flash |
|
| 49 |
|--|--|--|--|--|--|--|--|
|
| 50 |
+
|AX650| 308*308 | 8 | 1400 ms | 2455 ms | 5.9 tokens/sec| 4.4 GiB | 4.7 GiB |
|
| 51 |
|
| 52 |
The DDR column is the amount of CMM memory the model consumes. Ensure that the CMM memory allocated on the development board is greater than this value.
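
As a rough way to check this requirement, the sketch below parses the `Total CMM:<n> MB` line that the runtime prints at init and compares it with the DDR figure from the tables. The helper names `total_cmm_mib` and `has_enough_cmm` are hypothetical, and treating the logged `MB` as MiB is an assumption.

```python
import re

MIB_PER_GIB = 1024  # assumption: the log's "MB" is treated as MiB


def total_cmm_mib(log_text):
    """Return the number from a 'Total CMM:<n> MB' line, or None if absent."""
    match = re.search(r"Total CMM:\s*(\d+)\s*MB", log_text)
    return int(match.group(1)) if match else None


def has_enough_cmm(log_text, required_gib):
    """True if the logged total CMM exceeds the required DDR figure (GiB)."""
    total = total_cmm_mib(log_text)
    if total is None:
        raise ValueError("no 'Total CMM' line found in log")
    return total > required_gib * MIB_PER_GIB


# Init line as printed in the demo log; 4.4 GiB is the video-table DDR value.
log = "[I][ Init][ 136]: Total CMM:7478 MB"
print(has_enough_cmm(log, 4.4))  # True
```

This only compares the total allocation with the table value; it does not account for CMM already used by other processes on the board.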
|
| 53 |
|
|
|
|
| 141 |

|
| 142 |
|
| 143 |
```
|
| 144 |
+
(base) root@ax650:~/AXERA-TECH/Qwen2.5-VL-3B-Instruct# ./run_qwen2_5_vl_image.sh
|
| 145 |
+
[I][ Init][ 134]: LLM init start
|
| 146 |
+
[I][ Init][ 136]: Total CMM:7478 MB
|
| 147 |
+
tokenizer_type = 1
|
| 148 |
+
2% | █ | 1 / 39 [0.31s<12.21s, 3.19 count/s] tokenizer init ok
|
| 149 |
[I][ Init][ 26]: LLaMaEmbedSelector use mmap
|
| 150 |
+
5% | ██ | 2 / 39 [0.31s<6.10s, 6.39 count/s] embed_selector init ok
|
| 151 |
+
[I][ Init][ 181]: attr.axmodel_num:36
|
| 152 |
+
102% | █████████████████████████████████ | 40 / 39 [17.30s<16.86s, 2.31 count/s] init vpm axmodel ok,remain_cmm(2939 MB)
|
| 153 |
+
[I][ Init][ 287]: image encoder output float32
|
| 154 |
+
|
| 155 |
+
[I][ Init][ 317]: max_token_len : 1023
|
| 156 |
+
[I][ Init][ 322]: kv_cache_size : 256, kv_cache_num: 1023
|
| 157 |
+
[I][ Init][ 330]: prefill_token_num : 128
|
| 158 |
+
[I][ Init][ 334]: grp: 1, prefill_max_token_num : 1
|
| 159 |
+
[I][ Init][ 334]: grp: 2, prefill_max_token_num : 128
|
| 160 |
+
[I][ Init][ 334]: grp: 3, prefill_max_token_num : 256
|
| 161 |
+
[I][ Init][ 334]: grp: 4, prefill_max_token_num : 384
|
| 162 |
+
[I][ Init][ 334]: grp: 5, prefill_max_token_num : 512
|
| 163 |
+
[I][ Init][ 338]: prefill_max_token_num : 512
|
| 164 |
+
[E][ load_config][ 277]: config file(post_config.json) open failed
|
| 165 |
+
[W][ Init][ 351]: load postprocess config(post_config.json) failed
|
| 166 |
+
[I][ Init][ 355]: LLM init ok
|
| 167 |
+
[I][ Init][ 357]: Left CMM:2939 MB
|
| 168 |
Type "q" to exit, Ctrl+c to stop current running
|
| 169 |
+
prompt >> what in the images?
|
| 170 |
image >> image/ssd_car.jpg
|
| 171 |
+
[I][ EncodeImage][ 432]: pixel_values size 1
|
| 172 |
+
[I][ EncodeImage][ 433]: grid_h 32 grid_w 32
|
| 173 |
+
[I][ EncodeImage][ 460]: image encode time : 781.932983 ms, size : 1
|
| 174 |
+
[I][ Encode][ 513]: input_ids size:282
|
| 175 |
+
[I][ Encode][ 521]: offset 15
|
| 176 |
+
[I][ Encode][ 537]: img_embed.size:1, 524288
|
| 177 |
+
[I][ Encode][ 553]: out_embed size:577536
|
| 178 |
+
[I][ Encode][ 554]: input_ids size 282
|
| 179 |
+
[I][ Encode][ 556]: position_ids size:282
|
| 180 |
+
[I][ Run][ 575]: input token num : 282, prefill_split_num : 3
|
| 181 |
+
[I][ Run][ 609]: input_num_token:128
|
| 182 |
+
[I][ Run][ 609]: input_num_token:128
|
| 183 |
+
[I][ Run][ 609]: input_num_token:26
|
| 184 |
+
[I][ Run][ 798]: ttft: 1651.51 ms
|
| 185 |
+
|
| 186 |
+
The image shows a red double-decker bus on a city street. The bus has an advertisement on its side that reads,
|
| 187 |
+
"THINGS GET MORE EXITING WHEN YOU SAY 'YES' VirginMoney.co.uk." The bus is parked on the side of the road,
|
| 188 |
+
and there is a person standing next to it. The background features a building with large windows and a few pedestrians walking on the sidewalk.
|
| 189 |
+
The street appears to be in an urban area, possibly in a city like London.
|
| 190 |
+
|
| 191 |
+
[N][ Run][ 924]: hit eos,avg 5.83 token/s
|
| 192 |
```
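
The `Run` lines in the log above show the 282-token prompt being prefilled in three passes (128, 128, 26), which matches `prefill_token_num : 128`. A minimal sketch of that chunking follows, assuming plain fixed-size splitting. The runtime's actual implementation is not shown in this README, and `split_prefill` is a hypothetical helper.

```python
def split_prefill(input_tokens, chunk_size=128):
    """Split a prompt into prefill chunks of at most chunk_size tokens."""
    if input_tokens < 0 or chunk_size <= 0:
        raise ValueError("input_tokens must be >= 0 and chunk_size > 0")
    full_chunks, remainder = divmod(input_tokens, chunk_size)
    chunks = [chunk_size] * full_chunks
    if remainder:
        chunks.append(remainder)  # last, partially filled chunk
    return chunks


print(split_prefill(282))  # [128, 128, 26]
print(split_prefill(509))  # [128, 128, 128, 125]
```

The second call reproduces the four passes logged for the 509-token video prompt. Both prompts stay within the logged `prefill_max_token_num : 512`.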
|
| 193 |
|
| 194 |
#### Video understand demo
|
| 195 |
|
| 196 |
Please pre-process the frames of the video file into 308x308 images.
|
| 197 |
|
| 198 |
##### run video understand demo
|
| 199 |
|
| 200 |
```
|
| 201 |
+
(base) root@ax650:~/AXERA-TECH/Qwen2.5-VL-3B-Instruct# ./run_qwen2_5_vl_video.sh
|
| 202 |
+
[I][ Init][ 134]: LLM init start
|
| 203 |
+
[I][ Init][ 136]: Total CMM:7478 MB
|
| 204 |
+
tokenizer_type = 1
|
| 205 |
+
2% | █ | 1 / 39 [0.32s<12.36s, 3.15 count/s] tokenizer init ok
|
| 206 |
[I][ Init][ 26]: LLaMaEmbedSelector use mmap
|
| 207 |
+
5% | ██ | 2 / 39 [0.32s<6.20s, 6.29 count/s] embed_selector init ok
|
| 208 |
+
[I][ Init][ 181]: attr.axmodel_num:36
|
| 209 |
+
102% | █████████████████████████████████ | 40 / 39 [17.79s<17.35s, 2.25 count/s] init vpm axmodel ok,remain_cmm(3094 MB)
|
| 210 |
+
[I][ Init][ 287]: image encoder output float32
|
| 211 |
+
|
| 212 |
+
[I][ Init][ 317]: max_token_len : 1023
|
| 213 |
+
[I][ Init][ 322]: kv_cache_size : 256, kv_cache_num: 1023
|
| 214 |
+
[I][ Init][ 330]: prefill_token_num : 128
|
| 215 |
+
[I][ Init][ 334]: grp: 1, prefill_max_token_num : 1
|
| 216 |
+
[I][ Init][ 334]: grp: 2, prefill_max_token_num : 128
|
| 217 |
+
[I][ Init][ 334]: grp: 3, prefill_max_token_num : 256
|
| 218 |
+
[I][ Init][ 334]: grp: 4, prefill_max_token_num : 384
|
| 219 |
+
[I][ Init][ 334]: grp: 5, prefill_max_token_num : 512
|
| 220 |
+
[I][ Init][ 338]: prefill_max_token_num : 512
|
| 221 |
+
[E][ load_config][ 277]: config file(post_config.json) open failed
|
| 222 |
+
[W][ Init][ 351]: load postprocess config(post_config.json) failed
|
| 223 |
+
[I][ Init][ 355]: LLM init ok
|
| 224 |
+
[I][ Init][ 357]: Left CMM:3094 MB
|
| 225 |
Type "q" to exit, Ctrl+c to stop current running
|
| 226 |
+
prompt >> what is this?
|
| 227 |
+
video >> video
|
|
|
|
| 228 |
video/frame_0000.jpg
|
| 229 |
video/frame_0008.jpg
|
| 230 |
video/frame_0016.jpg
|
|
|
|
| 233 |
video/frame_0040.jpg
|
| 234 |
video/frame_0048.jpg
|
| 235 |
video/frame_0056.jpg
|
| 236 |
+
[I][ EncodeImage][ 432]: pixel_values size 4
|
| 237 |
+
[I][ EncodeImage][ 433]: grid_h 22 grid_w 22
|
| 238 |
+
[I][ EncodeImage][ 460]: image encode time : 1484.067993 ms, size : 4
|
| 239 |
+
[I][ Encode][ 513]: input_ids size:509
|
| 240 |
+
[I][ Encode][ 521]: offset 15
|
| 241 |
+
[I][ Encode][ 537]: img_embed.size:4, 247808
|
| 242 |
+
[I][ Encode][ 544]: offset:136
|
| 243 |
+
[I][ Encode][ 544]: offset:257
|
| 244 |
+
[I][ Encode][ 544]: offset:378
|
| 245 |
+
[I][ Encode][ 553]: out_embed size:1042432
|
| 246 |
+
[I][ Encode][ 554]: input_ids size 509
|
| 247 |
+
[I][ Encode][ 556]: position_ids size:509
|
| 248 |
+
[I][ Run][ 575]: input token num : 509, prefill_split_num : 4
|
| 249 |
+
[I][ Run][ 609]: input_num_token:128
|
| 250 |
+
[I][ Run][ 609]: input_num_token:128
|
| 251 |
+
[I][ Run][ 609]: input_num_token:128
|
| 252 |
+
[I][ Run][ 609]: input_num_token:125
|
| 253 |
+
[I][ Run][ 798]: ttft: 2455.20 ms
|
| 254 |
+
|
| 255 |
+
This image shows two ground squirrels, also known as marmots, engaging in a playful interaction.
|
| 256 |
+
They are standing on their hind legs and appear to be playfully biting or nipping at each other. The background features a scenic mountain landscape with a clear blue sky.
|
| 257 |
+
|
| 258 |
+
[N][ Run][ 924]: hit eos,avg 5.82 token/s
|
| 259 |
```
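
The sizes in the two logs can be cross-checked with a little arithmetic. The sketch below uses constants that are assumptions chosen to reproduce the logged numbers (patch size 14, 2x2 spatial merge, 2 frames per temporal patch, embedding width 2048). They are not read from the model files, and `visual_tokens` is a hypothetical helper.

```python
PATCH = 14      # assumed vision patch size (448 / 14 = 32, 308 / 14 = 22)
MERGE = 2       # assumed spatial merge factor (2x2 patches per token)
TEMPORAL = 2    # assumed frames grouped per temporal patch
HIDDEN = 2048   # assumed embedding width


def visual_tokens(height, width, frames=1):
    """Return (grid_h, grid_w, tokens_per_group, groups) for an input."""
    grid_h, grid_w = height // PATCH, width // PATCH
    tokens_per_group = (grid_h // MERGE) * (grid_w // MERGE)
    groups = -(-frames // TEMPORAL)  # ceiling division
    return grid_h, grid_w, tokens_per_group, groups


for name, args, input_ids in [("image", (448, 448, 1), 282),
                              ("video", (308, 308, 8), 509)]:
    grid_h, grid_w, per_group, groups = visual_tokens(*args)
    visual = per_group * groups
    print(name, grid_h, grid_w, per_group * HIDDEN, input_ids - visual,
          input_ids * HIDDEN)
# image 32 32 524288 26 577536
# video 22 22 247808 25 1042432
```

Under these assumptions the printed values line up with `grid_h`/`grid_w`, the `img_embed.size` element counts, and `out_embed size` in the logs, leaving 26 and 25 non-visual tokens in the image and video prompts respectively. The 121-token stride between the logged video offsets (15, 136, 257, 378) also equals the per-group token count.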
|
| 260 |
|
| 261 |
#### Inference with M.2 Accelerator card
|
|
|
|
| 314 |
[I][ Run][ 659]: input_num_token:128
|
| 315 |
[I][ Run][ 659]: input_num_token:24
|
| 316 |
[I][ Run][ 796]: ttft: 2067.18 ms
|
| 317 |
+
|
| 318 |
+
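This image shows a busy city street. In the foreground, a woman stands on the sidewalk, wearing a black coat and smiling. Next to her is a red double-decker bus,
|
| 319 |
+
with an advertisement on it that reads "THINGS GET MORE EXITING WHEN YOU SAY 'YES' VirginMoney.co.uk". The bus's license plate number is "L15".
|
| 320 |
+
A black van is parked next to the bus. Some shops and pedestrians can be seen in the background, and the street is lined with street lamps and shop signs. The overall setting looks very busy and modern.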
|
| 321 |
|
| 322 |
[N][ Run][ 949]: hit eos,avg 4.12 token/s
|
| 323 |
```
|
|
|
|
| 376 |
[I][ Run][ 659]: input_num_token:128
|
| 377 |
[I][ Run][ 659]: input_num_token:125
|
| 378 |
[I][ Run][ 796]: ttft: 3049.59 ms
|
| 379 |
+
|
| 380 |
+
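The video shows two squirrels in an outdoor scene. The background is a blurred mountain range and blue sky, and the squirrels are interacting in the foreground. Their fur is a mix of brown and gray, and their paws are orange. The squirrels appear to be playing or tussling with each other,
|
| 381 |
+
with their paws and mouths reaching toward each other. The whole scene looks very natural and lively.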
|
| 382 |
|
| 383 |
[N][ Run][ 949]: hit eos,avg 4.15 token/s
|
| 384 |
```
|